Multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects the specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs that are compatible with the MLM framework. Then, we propose an evaluation method that ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
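For intuition, here is a minimal Python sketch of one plausible way to serialize oriented detection outputs as text for an autoregressive model. The 1000-bin quantization and the output template are assumptions for illustration, not the exact LMMRotate format.

```python
# Illustrative sketch only: serialize oriented-box detections as text tokens.
# The bin count and template are assumptions, not the paper's exact scheme.

def boxes_to_text(detections, img_w, img_h, num_bins=1000):
    """detections: list of (category, cx, cy, w, h, angle_deg) in pixels."""
    lines = []
    for category, cx, cy, w, h, angle in detections:
        # Normalize coordinates to [0, num_bins) so every value becomes a
        # small integer token the language model can emit directly.
        qx = min(int(cx / img_w * num_bins), num_bins - 1)
        qy = min(int(cy / img_h * num_bins), num_bins - 1)
        qw = min(int(w / img_w * num_bins), num_bins - 1)
        qh = min(int(h / img_h * num_bins), num_bins - 1)
        qa = int(angle % 180)  # rotation angle folded into [0, 180)
        lines.append(f"{category} <{qx}><{qy}><{qw}><{qh}><{qa}>")
    return "\n".join(lines)

print(boxes_to_text([("plane", 512.3, 300.8, 90.0, 42.5, 35.0)], 1024, 1024))
# plane <500><293><87><41><35>
```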
https://arxiv.org/abs/2501.09720
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.
https://arxiv.org/abs/2501.09446
Processing low-resource languages, such as Kiswahili, using machine learning is difficult due to the lack of adequate training data. However, such low-resource languages are still important for human communication, are already in daily use, and users need practical machine processing tasks such as summarization, disambiguation, and even question answering (QA). One method of processing such languages, while bypassing the need for training data, is the use of semantic networks. Some low-resource languages, such as Kiswahili, have a subject-verb-object (SVO) structure, and semantic networks are likewise subject-predicate-object triples, so SVO part-of-speech tags can map onto a semantic network triple. An algorithm to process raw natural language text and map it into a semantic network is therefore necessary and desirable for structuring low-resource language texts. This algorithm was tested on the Kiswahili QA task, achieving up to 78.6% exact match.
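As a minimal sketch of the SVO-to-triple mapping described here, the function below extracts subject-predicate-object triples from POS-tagged sentences. The tag names and the one-triple-per-sentence heuristic are illustrative assumptions, not the paper's full algorithm.

```python
# Hedged sketch: mapping POS-tagged SVO sentences to semantic-network triples.

def svo_to_triples(tagged_sentences):
    """tagged_sentences: list of [(token, pos_tag), ...] per sentence."""
    triples = []
    for sentence in tagged_sentences:
        nouns = [tok for tok, tag in sentence if tag == "NOUN"]
        verbs = [tok for tok, tag in sentence if tag == "VERB"]
        # SVO heuristic: first noun -> subject, first verb -> predicate,
        # next noun -> object.
        if len(nouns) >= 2 and verbs:
            triples.append((nouns[0], verbs[0], nouns[1]))
    return triples

# "Juma anasoma kitabu" -> "Juma reads a book"
tagged = [[("Juma", "NOUN"), ("anasoma", "VERB"), ("kitabu", "NOUN")]]
print(svo_to_triples(tagged))  # [('Juma', 'anasoma', 'kitabu')]
```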
https://arxiv.org/abs/2501.09326
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at this https URL.
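To make the automatic QA-generation step concrete, here is a hedged sketch that produces spatial-relation question-answer pairs from top-down object annotations with Set-of-Mark-style numeric labels. The question template and the ego-centric coordinate convention are assumptions for this example, not MetaVQA's exact pipeline.

```python
# Illustrative sketch: auto-generating spatial-relation QA pairs from
# top-down annotations (x increases to the right of the ego vehicle).

from itertools import permutations

def generate_spatial_qa(objects):
    """objects: dict mark_id -> {'category': str, 'x': float, 'y': float}."""
    qa_pairs = []
    for a, b in permutations(objects, 2):
        oa, ob = objects[a], objects[b]
        question = (f"Is the {oa['category']} marked <{a}> to the left of "
                    f"the {ob['category']} marked <{b}>?")
        answer = "Yes" if oa["x"] < ob["x"] else "No"
        qa_pairs.append((question, answer))
    return qa_pairs

objs = {1: {"category": "pedestrian", "x": -2.0, "y": 8.0},
        2: {"category": "truck", "x": 3.5, "y": 12.0}}
for q, a in generate_spatial_qa(objs):
    print(q, "->", a)
```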
https://arxiv.org/abs/2501.09167
Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality traits of characters are overlooked. To address these challenges, we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations. ToMATO is generated via LLM-LLM conversations featuring information asymmetry. By employing a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, we capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions designed to assess the mental states of characters within conversations. Furthermore, the information asymmetry introduced by hiding thoughts from others induces the generation of false beliefs about various mental states. Assigning distinct personality traits to LLMs further diversifies both utterances and thoughts. ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns. Our analysis shows that this dataset construction approach frequently generates false beliefs due to the information asymmetry between role-playing LLMs, and effectively reflects diverse personalities. We evaluate nine LLMs on ToMATO and find that even GPT-4o mini lags behind human performance, especially in understanding false beliefs, and lacks robustness to various personality traits.
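The generation loop described here can be sketched as follows. `call_llm` is a hypothetical stand-in for an actual LLM API, and the prompt wording is illustrative rather than the paper's exact prompts; the key ingredients are the verbalized private thought before each utterance and the fact that thoughts are never shown to the other speaker.

```python
# Hedged sketch of a ToMATO-style LLM-LLM conversation with hidden thoughts.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical

def generate_conversation(personas, num_turns=6):
    visible_history, hidden_thoughts = [], []
    for turn in range(num_turns):
        speaker = turn % 2
        prompt = (
            f"You are {personas[speaker]}.\n"
            "Conversation so far:\n" + "\n".join(visible_history) + "\n"
            "First write THOUGHT: your private belief/intention/desire/"
            "emotion/knowledge about the situation. Then write SAY: your "
            "next utterance."
        )
        reply = call_llm(prompt)
        thought, _, utterance = reply.partition("SAY:")
        # Thoughts stay hidden from the other speaker -> information
        # asymmetry that later yields false-belief questions.
        hidden_thoughts.append((speaker, thought.strip()))
        visible_history.append(f"Speaker {speaker}: {utterance.strip()}")
    return visible_history, hidden_thoughts
```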
https://arxiv.org/abs/2501.08838
Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses solely based on superficial question-answer correlations. We introduce methodologies for intervening in questions, utilizing techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice VideoQA and open-ended settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results clearly demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.
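A minimal sketch of the displacement-style intervention is given below: each video is additionally paired with a question borrowed from a different video, and the model is supervised to admit ignorance on that pair instead of guessing from surface question-answer correlations. The sampling strategy and the label text are assumptions for illustration.

```python
# Illustrative sketch of question intervention via displacement.

import random

UNKNOWN = "I don't know"

def build_intervened_samples(dataset, seed=0):
    """dataset: list of dicts with 'video_id', 'question', 'answer'."""
    rng = random.Random(seed)
    samples = []
    for item in dataset:
        # Original, well-posed sample keeps its ground-truth answer.
        samples.append((item["video_id"], item["question"], item["answer"]))
        # Displacement: a question from another video is unanswerable here,
        # so the target becomes an explicit admission of ignorance.
        other = rng.choice([d for d in dataset
                            if d["video_id"] != item["video_id"]])
        samples.append((item["video_id"], other["question"], UNKNOWN))
    return samples
```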
https://arxiv.org/abs/2501.08771
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model's outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.
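The retrieval-plus-adaptor idea can be sketched in PyTorch as below: task-relevant knowledge entries are retrieved by cosine similarity and fused with the multimodal features through a gated projection. The gating form, pooling, and dimensions are assumptions, not the exact AKGP-LVLM architecture.

```python
# Hedged sketch (PyTorch): knowledge retrieval + gated adaptor fusion.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedKnowledgeAdaptor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, mm_feat, knowledge_bank, top_k=4):
        # mm_feat: (B, D); knowledge_bank: (N, D) encoded knowledge entries.
        sims = F.normalize(mm_feat, dim=-1) @ F.normalize(knowledge_bank, dim=-1).T
        topk = sims.topk(top_k, dim=-1).indices              # (B, k)
        retrieved = knowledge_bank[topk].mean(dim=1)         # (B, D)
        gate = torch.sigmoid(self.gate(torch.cat([mm_feat, retrieved], dim=-1)))
        return mm_feat + gate * self.proj(retrieved)         # knowledge-enhanced feature

feats = torch.randn(2, 256)
bank = torch.randn(100, 256)
print(GatedKnowledgeAdaptor(256)(feats, bank).shape)  # torch.Size([2, 256])
```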
https://arxiv.org/abs/2501.08597
Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.
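For illustration, a PGD-style sketch of a cross-modal attack in this spirit is shown below: frames are perturbed against a white-box IQA model while a CLIP-like image encoder anchors low-level features to aid transfer. `iqa_model` and `clip_encoder` are placeholder modules, and the loss weighting, step size, and budget are assumptions rather than the IC2VQA settings.

```python
# Hedged sketch of a cross-modal transfer attack against quality metrics.

import torch

def attack_frames(frames, iqa_model, clip_encoder, eps=4/255, alpha=1/255, steps=10):
    """frames: (T, 3, H, W) in [0, 1]; returns adversarial frames."""
    adv = frames.clone().detach()
    clean_feat = clip_encoder(frames).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Push the predicted quality score up while keeping CLIP features
        # close to the clean ones (assumed to help black-box transfer).
        quality = iqa_model(adv).mean()
        feat_sim = torch.cosine_similarity(clip_encoder(adv), clean_feat, dim=-1).mean()
        loss = quality + 0.5 * feat_sim
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            adv = frames + (adv - frames).clamp(-eps, eps)  # L-inf budget
            adv = adv.clamp(0, 1)
        adv = adv.detach()
    return adv
```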
https://arxiv.org/abs/2501.08415
Large Language Models (LLMs) have shown impressive performance in various NLP tasks. However, there are concerns about their reliability in different domains of linguistic variations. Many works have proposed robustness evaluation measures for local adversarial attacks, but we need globally robust models unbiased to different language styles. We take a broader approach to explore a wider range of variations across sociodemographic dimensions to perform structured reliability tests on the reasoning capacity of language models. We extend the SocialIQA dataset to create diverse paraphrased sets conditioned on sociodemographic styles. The assessment aims to provide a deeper understanding of LLMs in (a) their capability of generating demographic paraphrases with engineered prompts and (b) their reasoning capabilities in real-world, complex language scenarios. We also explore measures such as perplexity, explainability, and ATOMIC performance of paraphrases for fine-grained reliability analysis of LLMs on these sets. We find that demographic-specific paraphrasing significantly impacts the performance of language models, indicating that the subtleties of language variations remain a significant challenge. The code and dataset will be made available for reproducibility and future research.
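One of the fine-grained measures mentioned, perplexity of a paraphrase under a causal LM, can be computed as in the hedged sketch below. The model name is just an example; any Hugging Face causal LM is scored the same way.

```python
# Illustrative sketch: perplexity of a demographic paraphrase under a causal LM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

print(perplexity("She had not been invited to the party, so she felt left out."))
```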
https://arxiv.org/abs/2501.08276
Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model's response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.
https://arxiv.org/abs/2501.08208
Remote sensing visual question answering (RSVQA) is a task that automatically extracts information from satellite images and processes a question to predict the answer from the images in textual form, helping with the interpretation of the image. While different methods have been proposed to extract information from optical images with different spectral bands and resolutions, no method has been proposed to answer questions from Synthetic Aperture Radar (SAR) images. SAR images capture electromagnetic information from the scene, and are less affected by atmospheric conditions, such as clouds. In this work, our objective is to introduce SAR in the RSVQA task, finding the best way to use this modality. In our research, we carry out a study on different pipelines for the task of RSVQA taking into account information from both SAR and optical data. To this purpose, we also present a dataset that allows for the introduction of SAR images in the RSVQA framework. We propose two different models to include the SAR modality. The first one is an end-to-end method in which we add an additional encoder for the SAR modality. In the second approach, we build on a two-stage framework. First, relevant information is extracted from SAR and, optionally, optical data. This information is then translated into natural language to be used in the second step which only relies on a language model to provide the answer. We find that the second pipeline allows us to obtain good results with SAR images alone. We then try various types of fusion methods to use SAR and optical images together, finding that a fusion at the decision level achieves the best results on the proposed dataset. We show that SAR data offers additional information when fused with the optical modality, particularly for questions related to specific land cover classes, such as water areas.
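A minimal sketch of the decision-level fusion that worked best here: each modality-specific pipeline outputs a probability distribution over candidate answers, and the fused decision combines them with a weight on the optical branch. The equal weighting is an assumption, not the paper's tuned value.

```python
# Illustrative sketch: decision-level fusion of SAR and optical RSVQA outputs.

import numpy as np

def fuse_decisions(p_sar, p_optical, answers, w_optical=0.5):
    """p_sar, p_optical: per-answer probabilities; answers: candidate labels."""
    fused = (1 - w_optical) * np.asarray(p_sar) + w_optical * np.asarray(p_optical)
    return answers[int(np.argmax(fused))]

answers = ["yes", "no"]
print(fuse_decisions([0.7, 0.3], [0.4, 0.6], answers))  # -> yes
```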
https://arxiv.org/abs/2501.08131
Face image quality assessment (FIQA) algorithms are being integrated into online identity management applications. These applications allow users to upload a face image as part of their document issuance process, where the image is then run through a quality assessment process to make sure it meets the quality and compliance requirements. Concerns about demographic bias have been raised about biometric systems, given the societal implications this may cause. It is therefore important that demographic variability in FIQA algorithms is assessed such that mitigation measures can be created. In this work, we study the demographic variability of all face image quality measures included in the ISO/IEC 29794-5 international standard across three demographic variables: age, gender, and skin tone. The results are rather promising and show no clear bias toward any specific demographic group for most measures. Only two quality measures are found to have considerable variations in their outcomes for different groups on the skin tone variable.
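The kind of analysis run here can be sketched simply: group each quality measure's scores by a demographic variable and report the largest gap between group means. The field names and example values below are assumptions for illustration only.

```python
# Illustrative sketch: per-group means and max disparity for a quality measure.

from collections import defaultdict

def group_gap(records, measure, group_key):
    """records: list of dicts like {'sharpness': 87, 'skin_tone': 'dark'}."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[measure])
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    return means, max(means.values()) - min(means.values())

records = [{"sharpness": 87, "skin_tone": "light"},
           {"sharpness": 85, "skin_tone": "medium"},
           {"sharpness": 79, "skin_tone": "dark"}]
print(group_gap(records, "sharpness", "skin_tone"))
```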
https://arxiv.org/abs/2501.07898
Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of 3D scene data volumes, and ii) the lack of a straightforward and effective way to perceive 3D information, which prolongs training and complicates the streamlined framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability in navigating the complexities of the physical world. 3UR-LLM directly receives 3D point clouds as input and projects 3D features, fused with text instructions, into a manageable set of tokens. Considering the computational burden derived from these hybrid tokens, we design a 3D compressor module to cohesively compress the 3D spatial cues and textual narrative. 3UR-LLM achieves promising performance with respect to the previous SOTAs; for instance, 3UR-LLM exceeds its counterparts by 7.1% CIDEr on ScanQA while utilizing fewer training resources. The code and model weights for 3UR-LLM and the 3DS-160K benchmark are available at 3UR-LLM.
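A hedged PyTorch sketch of a token compressor like the one described: a small set of learnable queries cross-attends to the hybrid 3D-plus-text tokens, producing a fixed, manageable number of tokens for the LLM. The dimensions and the number of queries are assumptions, not 3UR-LLM's actual configuration.

```python
# Illustrative sketch: compressing hybrid 3D/text tokens with learnable queries.

import torch
import torch.nn as nn

class TokenCompressor3D(nn.Module):
    def __init__(self, dim=1024, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hybrid_tokens):
        # hybrid_tokens: (B, N, D) fused 3D spatial cues + text instructions.
        B = hybrid_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        compressed, _ = self.attn(q, hybrid_tokens, hybrid_tokens)
        return self.norm(compressed)              # (B, num_queries, D)

tokens = torch.randn(2, 4096, 1024)
print(TokenCompressor3D()(tokens).shape)          # torch.Size([2, 32, 1024])
```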
https://arxiv.org/abs/2501.07819
Among parameter-efficient fine-tuning methods, freezing has emerged as a popular strategy for speeding up training, reducing catastrophic forgetting, and improving downstream performance. We investigate the impact of freezing the decoder in a multi-task setup comprising diverse natural language tasks, aiming to reduce deployment overhead and enhance portability to novel tasks. Our experiments, conducted by fine-tuning both individual and multi-task setups on the AlexaTM model, reveal that freezing decoders is highly effective for tasks with natural language outputs and mitigates catastrophic forgetting in multilingual tasks. However, we find that pairing frozen decoders with a larger model can effectively maintain or even enhance performance in structured and QA tasks, making it a viable strategy for a broader range of task types.
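The frozen-decoder setup reduces, in code, to excluding decoder parameters from the optimizer, as in the minimal PyTorch sketch below. The `model.decoder` / `model.parameters()` attribute layout is a generic assumption about the model object, not AlexaTM-specific API.

```python
# Minimal sketch: freeze the decoder so only encoder/head parameters train.

import torch

def freeze_decoder(model, lr=1e-4):
    for p in model.decoder.parameters():
        p.requires_grad = False               # decoder stays fixed across tasks
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```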
https://arxiv.org/abs/2501.07818
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at this https URL.
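A hedged PyTorch sketch of the parameter-inverted idea: the largest branch sees the lowest resolution and the smallest branch sees the highest resolution, with outputs projected to a common width and summed. The branch modules, scales, and fusion rule are placeholders; the actual PIIP cross-branch interaction mechanism is richer than this.

```python
# Illustrative sketch: bigger model on lower resolution, smaller on higher.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParameterInvertedPyramid(nn.Module):
    def __init__(self, big, mid, small, dims, out_dim):
        super().__init__()
        # big/mid/small: pretrained backbones ordered by parameter count;
        # each is assumed to return a pooled (B, D_i) feature.
        self.branches = nn.ModuleList([big, mid, small])
        self.scales = [0.25, 0.5, 1.0]        # big -> low res, small -> high res
        self.projs = nn.ModuleList([nn.Linear(d, out_dim) for d in dims])

    def forward(self, image):
        feats = []
        for branch, scale, proj in zip(self.branches, self.scales, self.projs):
            x = F.interpolate(image, scale_factor=scale, mode="bilinear",
                              align_corners=False)
            feats.append(proj(branch(x)))
        return torch.stack(feats, dim=0).sum(dim=0)
```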
https://arxiv.org/abs/2501.07783
Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.
https://arxiv.org/abs/2501.07572
Temporal logical understanding, a core facet of human cognition, plays a pivotal role in capturing complex sequential events and their temporal relationships within videos. This capability is particularly crucial in tasks like Video Question Answering (VideoQA), where the goal is to process visual data over time together with textual data to provide coherent answers. However, current VideoQA benchmarks devote little focus to evaluating this critical skill due to the challenge of annotating temporal logic. Despite the advancement of vision-language models, assessing their temporal logical reasoning powers remains a challenge, primarily due to the lack of QA pairs that demand formal, complex temporal reasoning. To bridge this gap, we introduce the TimeLogic QA (TLQA) framework to automatically generate QA pairs specifically designed to evaluate temporal logical understanding. To this end, TLQA leverages temporal annotations from existing video datasets together with temporal operators derived from logic theory to construct questions that test understanding of event sequences and their temporal relationships. The TLQA framework is generic and scalable, capable of leveraging either existing video action datasets with temporal action segmentation annotations or video datasets with temporal scene graph annotations to automatically generate temporal logical questions. We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate two VideoQA dataset variants - small (TLQA-S) and large (TLQA-L) - containing 2k and 10k QA pairs per category, resulting in 32k and 160k total pairs per dataset. We undertake a comprehensive evaluation of leading-edge VideoQA models, employing TLQA to benchmark their temporal logical understanding capabilities. We assess the VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
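To illustrate the generation principle, the sketch below derives "before/after" questions from temporal action segments using a single temporal operator. The question template and the use of segment midpoints to order events are assumptions for this example; the full TLQA taxonomy covers 16 categories.

```python
# Hedged sketch: TLQA-style question generation from temporal action segments.

def before_after_questions(segments):
    """segments: list of (action_label, start_sec, end_sec)."""
    ordered = sorted(segments, key=lambda s: (s[1] + s[2]) / 2)
    qa_pairs = []
    for i in range(len(ordered) - 1):
        earlier, later = ordered[i][0], ordered[i + 1][0]
        qa_pairs.append((f"Does '{earlier}' happen before '{later}'?", "yes"))
        qa_pairs.append((f"Does '{later}' happen before '{earlier}'?", "no"))
    return qa_pairs

segs = [("crack egg", 3.0, 6.5), ("pour milk", 8.0, 12.0), ("stir", 13.0, 20.0)]
for q, a in before_after_questions(segs):
    print(q, "->", a)
```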
https://arxiv.org/abs/2501.07214
A face image is a mandatory part of ID and travel documents. Obtaining high-quality face images when issuing such documents is crucial for both human examiners and automated face recognition systems. In several international standards, face image quality requirements are intricate and defined in detail. Identifying and understanding non-compliance or defects in the submitted face images is crucial for both issuing authorities and applicants. In this work, we introduce FaceOracle, an LLM-powered AI assistant that helps its users analyze a face image in a natural conversational manner using standard compliant algorithms. Leveraging the power of LLMs, users can get explanations of various face image quality concepts as well as interpret the outcome of face image quality assessment (FIQA) algorithms. We implement a proof-of-concept that demonstrates how experts at an issuing authority could integrate FaceOracle into their workflow to analyze, understand, and communicate their decisions more efficiently, resulting in enhanced productivity.
https://arxiv.org/abs/2501.07202
Acquiring face images of sufficiently high quality is important for online ID and travel document issuance applications using face recognition systems (FRS). Low-quality, manipulated (intentionally or unintentionally), or distorted images degrade the FRS performance and facilitate documents' misuse. Securing quality for enrolment images, especially in the unsupervised self-enrolment scenario via a smartphone, becomes important to assure FRS performance. In this work, we focus on the less studied area of radial distortion (a.k.a., the fish-eye effect) in face images and its impact on FRS performance. We introduce an effective radial distortion detection model that can detect and flag radial distortion in the enrolment scenario. We formalize the detection model as a face image quality assessment (FIQA) algorithm and provide a careful inspection of the effect of radial distortion on FRS performance. Evaluation results show excellent detection results for the proposed models, and the study on the impact on FRS uncovers valuable insights into how to best use these models in operational systems.
https://arxiv.org/abs/2501.07179
Fair operational systems are crucial in gaining and maintaining society's trust in face recognition systems (FRS). FRS start by capturing an image and assessing its quality before using it further for enrollment or verification. Fair Face Image Quality Assessment (FIQA) schemes therefore become equally important in the context of fair FRS. This work examines the sclera as a quality assessment region for obtaining fair FIQA. The sclera region is agnostic to demographic variation and skin colour when assessing the quality of a face image. We analyze three skin-tone-related ISO/IEC face image quality assessment measures and assess the sclera region as an alternative area for assessing FIQ. Our analysis of a face dataset of individuals from demographic groups representing different skin tones indicates that the sclera region alone can serve as an alternative for measuring the dynamic range and the over- and under-exposure of a face. Because the sclera region is agnostic to skin tone, i.e., demographic factors, it provides equal utility as a fair FIQA measure, as shown by our Error-vs-Discard Characteristic (EDC) curve analysis.
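For concreteness, the sketch below computes exposure-related measures over the sclera region only, given a grayscale face image and a binary sclera mask. The 8-bit thresholds (250 / 5) are assumptions for illustration, not the ISO/IEC 29794-5 definitions.

```python
# Illustrative sketch: sclera-only dynamic range and exposure measures.

import numpy as np

def sclera_exposure_measures(gray_image, sclera_mask):
    """gray_image: uint8 HxW array; sclera_mask: boolean HxW array."""
    pixels = gray_image[sclera_mask].astype(np.float32)
    return {
        "dynamic_range": float(pixels.max() - pixels.min()),
        "over_exposed_frac": float((pixels >= 250).mean()),
        "under_exposed_frac": float((pixels <= 5).mean()),
    }
```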
https://arxiv.org/abs/2501.07158