Paper Reading AI Learner

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

2024-04-25 07:08:00
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

Abstract

Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at this https URL.
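The 3-way semantic (in)equivalence task described above can be sketched as a triplet check: given two paraphrases with low lexical overlap and a negative caption that shares words with one of them but flips the meaning, an encoder passes if it ranks the paraphrase pair closest. Below is a minimal illustration; the bag-of-words `embed` function, the `triplet_pass` helper, and the example sentences are assumptions for demonstration, not the paper's actual encoders or data. A real evaluation would swap in a pretrained text encoder.

```python
from collections import Counter
import math

def embed(sentence):
    # Toy bag-of-words "encoder" (word counts); a real VISLA-style
    # evaluation would use a pretrained text tower (e.g., CLIP, BERT).
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def triplet_pass(p1, p2, neg, encoder=embed):
    """Hypothetical pass criterion: the paraphrase pair (p1, p2) must be
    more similar than either pairing with the lexically close negative."""
    e1, e2, en = encoder(p1), encoder(p2), encoder(neg)
    pos = cosine(e1, e2)
    return pos > cosine(e1, en) and pos > cosine(e2, en)

# p1/p2 are paraphrases with little word overlap; neg shares almost all
# words with p1 but flips the spatial relation -- the sensitivity the
# abstract highlights.
p1 = "a dog sits to the left of a cat"
p2 = "a cat with a dog seated on its left side"
neg = "a dog sits to the right of a cat"
print(triplet_pass(p1, p2, neg))  # -> False
```

The purely lexical toy encoder fails this triplet (it rates `p1` closer to `neg` than to its paraphrase), which mirrors the failure mode the benchmark probes: similarity driven by word overlap rather than meaning.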

URL

https://arxiv.org/abs/2404.16365

PDF

https://arxiv.org/pdf/2404.16365.pdf

