Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
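The curation step above can be pictured as a simple answerability filter. The sketch below is a minimal illustration of that idea; the `llm_answer` helper and the exact-match criterion are hypothetical stand-ins, not the released Source2Synth code.

```python
# Minimal sketch of answerability-based curation: keep a synthetic example only if a
# model can recover its answer from the grounding source. `llm_answer` is a
# hypothetical placeholder, not part of the released method.
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    source: str      # real-world grounding (table, passage, ...)
    question: str    # generated question
    reasoning: str   # generated intermediate reasoning steps
    answer: str      # generated answer

def llm_answer(source: str, question: str) -> str:
    """Hypothetical call to an LLM that answers `question` using only `source`."""
    raise NotImplementedError

def is_answerable(ex: SyntheticExample, n_tries: int = 3) -> bool:
    """Treat an example as answerable if any sampled answer matches the target."""
    return any(
        llm_answer(ex.source, ex.question).strip().lower() == ex.answer.strip().lower()
        for _ in range(n_tries)
    )

def curate(examples: list[SyntheticExample]) -> list[SyntheticExample]:
    return [ex for ex in examples if is_answerable(ex)]
```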
https://arxiv.org/abs/2409.08239
This paper explores the efficacy of online versus offline evaluation methods in assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmarking dataset of user dialogs with empathetic chatbots to include offline third-party evaluations, we present a systematic comparison between the feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model offer a better approximation of first-party human judgments when given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for the integration of direct interaction feedback in conversational AI evaluation to enhance system development and user satisfaction.
https://arxiv.org/abs/2409.07823
Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model's temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions distributed non-continuously over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task and, in particular, a 78.4% accuracy score on the NExTQA task, exceeding the current state-of-the-art score by 2.8 points.
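One way to read "convert long-term video sequences into a spatial image domain" is to tile sampled frames into a single grid image that an image-text model such as LLaVA can consume. The sketch below is an illustrative assumption about that step; the grid size and sampling scheme are not taken from the paper.

```python
# Sample frames uniformly and tile them into one grid image for an image-text model.
from PIL import Image

def frames_to_grid(frame_paths: list[str], rows: int = 2, cols: int = 4,
                   tile_size: int = 224) -> Image.Image:
    n = rows * cols
    # Uniformly sample n frame indices across the whole sequence.
    idx = [round(i * (len(frame_paths) - 1) / max(n - 1, 1)) for i in range(n)]
    grid = Image.new("RGB", (cols * tile_size, rows * tile_size))
    for k, i in enumerate(idx):
        tile = Image.open(frame_paths[i]).convert("RGB").resize((tile_size, tile_size))
        grid.paste(tile, ((k % cols) * tile_size, (k // cols) * tile_size))
    return grid

# grid = frames_to_grid(sorted(glob.glob("frames/*.jpg")))  # then pass to a LLaVA-style model
```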
https://arxiv.org/abs/2409.07748
Video question answering (VideoQA) is a task to predict the correct answer to questions posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., "a boy is throwing a ball in a hoop"). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines in handling causal and temporal questions, highlighting its strength in reasoning multiple object-based events.
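The contrastive objective between question text and its multi-object event graph can be sketched as a standard symmetric InfoNCE loss over paired embeddings; the GNN-cluster module and the adversarial training that produce the graph embeddings are omitted here, so this is only an illustrative fragment.

```python
# Symmetric InfoNCE between question-text embeddings and event-graph embeddings.
import torch
import torch.nn.functional as F

def text_graph_contrastive(text_emb: torch.Tensor, graph_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """text_emb, graph_emb: (batch, dim); matched pairs share the same row index."""
    text_emb = F.normalize(text_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = text_emb @ graph_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```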
https://arxiv.org/abs/2409.07747
Generative AI models, such as the GPT and Llama series, have significant potential to assist laypeople in answering legal questions. However, little prior work focuses on the data sourcing, inference, and evaluation of these models in the context of laypersons. To this end, we propose a human-centric legal NLP pipeline, covering data sourcing, inference, and evaluation. We introduce and release a dataset, LegalQA, with real and specific legal questions spanning from employment law to criminal law, corresponding answers written by legal experts, and citations for each answer. We develop an automatic evaluation protocol for this dataset, then show that retrieval-augmented generation from only 850 citations in the train set can match or outperform internet-wide retrieval, despite containing 9 orders of magnitude less data. Finally, we propose future directions for open-source efforts, which currently fall behind closed-source models.
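Retrieval-augmented generation over such a small citation corpus can be sketched with an off-the-shelf sentence encoder; the model name and prompt format below are illustrative assumptions, not the paper's pipeline.

```python
# Retrieve the top-k citations for a question and build a grounded prompt.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question: str, citations: list[str], k: int = 5) -> list[str]:
    q = encoder.encode(question, convert_to_tensor=True)
    docs = encoder.encode(citations, convert_to_tensor=True)
    hits = util.semantic_search(q, docs, top_k=k)[0]
    return [citations[h["corpus_id"]] for h in hits]

def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (f"Answer the legal question using only the passages below and cite them.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")
```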
https://arxiv.org/abs/2409.07713
Ranking models play a crucial role in enhancing the overall accuracy of text retrieval systems. These multi-stage systems typically utilize either dense embedding models or sparse lexical indices to retrieve relevant passages based on a given query, followed by ranking models that refine the ordering of the candidate passages by their relevance to the query. This paper benchmarks various publicly available ranking models and examines their impact on ranking accuracy. We focus on text retrieval for question-answering tasks, a common use case for Retrieval-Augmented Generation systems. Our evaluation benchmarks include models, some of which are commercially viable for industrial applications. We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3, which achieves a significant accuracy increase of ~14% compared to pipelines with other rerankers. We also provide an ablation study comparing the fine-tuning of ranking models with different sizes, losses, and self-attention mechanisms. Finally, we discuss the challenges of text retrieval pipelines with ranking models in real-world industry applications, in particular the trade-offs among model size, ranking accuracy, and system requirements such as indexing and serving latency/throughput.
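The retrieve-then-rerank structure described above can be sketched with a public cross-encoder; this shows where a ranking model such as NV-RerankQA-Mistral-4B-v3 sits in the pipeline, not that model's own API.

```python
# Second-stage reranking of retrieved passages with a public cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```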
https://arxiv.org/abs/2409.07691
For full-reference image quality assessment (FR-IQA) using deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance measure between features extracted from a pretrained CNN or, more recently, a Transformer network. Often, these intermediate features require further fine-tuning or processing with additional neural network layers to align the final similarity scores with human judgments. So far, most IQA models based on foundation models have primarily relied on the final layer or the embedding for quality score estimation. In contrast, this work explores the potential of utilizing the intermediate features of these foundation models, which have so far been largely unexplored in the design of low-level perceptual similarity metrics. We demonstrate that the intermediate features are comparatively more effective. Moreover, without requiring any training, these metrics can outperform both traditional and state-of-the-art learned metrics by utilizing distance measures between the features.
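A training-free metric built from intermediate features can be sketched as follows, using the hidden states of a CLIP vision encoder and an average cosine distance; the layer choice and the distance measure are illustrative assumptions rather than the paper's exact configuration.

```python
# Training-free perceptual distance from intermediate features of a CLIP vision encoder.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def perceptual_distance(ref: Image.Image, dist: Image.Image, layers=(3, 6, 9)) -> float:
    batch = processor(images=[ref, dist], return_tensors="pt")
    hidden = model(**batch, output_hidden_states=True).hidden_states  # tuple of (2, tokens, dim)
    per_layer = []
    for l in layers:
        r, d = hidden[l][0], hidden[l][1]                              # (tokens, dim) each
        cos = torch.nn.functional.cosine_similarity(r, d, dim=-1)
        per_layer.append((1.0 - cos).mean())
    return torch.stack(per_layer).mean().item()
```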
https://arxiv.org/abs/2409.07650
Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters. This can hurt performance when using standard decoding techniques, which tend to ignore the context. Existing test-time contrastive methods seek to address this by comparing the LLM's output distribution with and without the context and adjust the model according to the contrast between them. However, we find that these methods frequently misjudge the degree of conflict and struggle to handle instances that vary in their amount of conflict, with static methods over-adjusting when conflict is absent. We propose a fine-grained, instance-level approach called AdaCAD, which dynamically infers the weight of adjustment based on the degree of conflict, as measured by the Jensen-Shannon divergence between distributions representing contextual and parametric knowledge. Our experiments across four models on six diverse question-answering (QA) datasets and three summarization tasks demonstrate that our training-free adaptive method consistently outperforms other decoding methods on QA, with average accuracy gains of 14.21% (absolute) over a static contrastive baseline, and improves the factuality of summaries by 5.59 (AlignScore). Furthermore, our analysis shows that while decoding with contrastive baselines hurts performance when conflict is absent, AdaCAD mitigates these losses, making it more applicable to real-world datasets in which some examples have conflict and others do not.
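The adaptive adjustment can be sketched in the usual contrastive-decoding form, with the per-step weight set by the Jensen-Shannon divergence between the with-context and without-context next-token distributions. This is a minimal illustration of that idea, not the paper's released implementation.

```python
# JSD-weighted contrastive adjustment of next-token logits.
import torch
import torch.nn.functional as F

def js_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    m = 0.5 * (p + q)
    def kl(a, b):
        return (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adaptive_contrastive_logits(logits_ctx: torch.Tensor,
                                logits_no_ctx: torch.Tensor) -> torch.Tensor:
    """logits_*: (vocab,) next-token logits with and without the retrieved context."""
    p_ctx = F.softmax(logits_ctx, dim=-1)
    p_no = F.softmax(logits_no_ctx, dim=-1)
    alpha = js_divergence(p_ctx, p_no)   # low when there is no conflict -> little adjustment
    return (1 + alpha) * logits_ctx - alpha * logits_no_ctx
```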
https://arxiv.org/abs/2409.07394
Large Vision-Language Models (LVLMs), trained on large multimodal datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems from both the inherent susceptibilities of LLMs and the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture. This approach maximizes the cosine similarity between perturbed and clean samples, facilitating resilience against adversarial manipulations. Sim-CLIP+ offers a plug-and-play solution, allowing seamless integration into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modifications to the LVLM and incurs minimal computational overhead. Sim-CLIP+ demonstrates effectiveness against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations using standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments demonstrate that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against both gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoders are available at this https URL.
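The Siamese objective can be pictured as in the sketch below: craft an L-infinity perturbation that pushes the perturbed embedding away from the clean one, then fine-tune the encoder to pull them back together via cosine similarity. The step sizes, budget, and the assumption that `encoder` maps image tensors to embedding vectors are illustrative, not the paper's settings.

```python
# Siamese adversarial fine-tuning: maximize cosine similarity between clean and
# perturbed embeddings of the same image.
import torch
import torch.nn.functional as F

def cosine_loss(encoder, clean: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    return (1 - F.cosine_similarity(encoder(clean), encoder(adv), dim=-1)).mean()

def pgd_perturb(encoder, images: torch.Tensor, eps=8 / 255, step=2 / 255, iters=10):
    adv = images + torch.empty_like(images).uniform_(-eps, eps)
    for _ in range(iters):
        adv = adv.detach().requires_grad_(True)
        loss = cosine_loss(encoder, images, adv)           # attacker pushes embeddings apart
        grad = torch.autograd.grad(loss, adv)[0]
        adv = torch.min(torch.max(adv + step * grad.sign(), images - eps), images + eps)
        adv = adv.clamp(0, 1)
    return adv.detach()

def train_step(encoder, optimizer, images: torch.Tensor) -> float:
    adv = pgd_perturb(encoder, images)
    loss = cosine_loss(encoder, images, adv)               # defender pulls them back together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```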
https://arxiv.org/abs/2409.07353
Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot performance on visual question answering (VQA). However, when it comes to knowledge-based VQA (KB-VQA), MLLMs may lack the human commonsense or specialized domain knowledge to answer such questions and must obtain the necessary information from external knowledge sources. Previous works like Retrieval-Augmented VQA-v2 (RAVQA-v2) focus on utilizing as much input information as possible, such as image-based textual descriptions and retrieved knowledge, to improve performance, but they all overlook the issue that inference efficiency decreases significantly as the number of input tokens increases, which contradicts the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLM with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved contexts, from which it generates a compact modulation in the form of a Key-Value (KV) cache. This modulation is then used to adapt the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA. Moreover, it significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Abundant experiments show RACC's broad applicability. It is compatible with various off-the-shelf MLLMs and can also handle different knowledge sources, including textual and multimodal documents.
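At a high level, the compression step can be imagined as a small module that cross-attends learned queries over the token embeddings of the retrieved contexts and emits per-layer key/value prefixes for the frozen MLLM. The sketch below is a rough, prefix-tuning-style approximation under assumed shapes (no attention-head split, arbitrary dimensions); it is not the RACC architecture itself.

```python
# Rough sketch: compress retrieved-context embeddings into per-layer KV prefixes.
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    def __init__(self, ctx_dim=768, model_dim=4096, n_layers=32, n_prefix=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prefix, ctx_dim) * 0.02)
        self.attn = nn.MultiheadAttention(ctx_dim, num_heads=8, batch_first=True)
        self.to_kv = nn.Linear(ctx_dim, n_layers * 2 * model_dim)
        self.n_layers, self.model_dim = n_layers, model_dim

    def forward(self, ctx_tokens: torch.Tensor) -> torch.Tensor:
        """ctx_tokens: (batch, n_ctx_tokens, ctx_dim) embeddings of retrieved contexts."""
        b = ctx_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, ctx_tokens, ctx_tokens)   # (batch, n_prefix, ctx_dim)
        kv = self.to_kv(pooled)                            # (batch, n_prefix, layers*2*dim)
        # -> (layers, 2, batch, n_prefix, model_dim), to be prepended as past key/values
        return kv.view(b, -1, self.n_layers, 2, self.model_dim).permute(2, 3, 0, 1, 4)
```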
https://arxiv.org/abs/2409.07331
Conventional radiography is a widely used imaging technology for diagnosing, monitoring, and prognosticating musculoskeletal (MSK) diseases because of its easy availability, versatility, and cost-effectiveness. In conventional radiographs, bone overlaps are prevalent and can impede the accurate assessment of bone characteristics by radiologists or algorithms, posing significant challenges to conventional and computer-aided diagnosis. This work initiates the study of a challenging scenario, bone layer separation in conventional radiographs, in which separating overlapped bone regions enables the independent assessment of the bone characteristics of each bone layer and lays the groundwork for MSK disease diagnosis and its automation. We propose a Bone Layer Separation GAN (BLS-GAN) framework that can produce high-quality bone layer images with reasonable bone characteristics and texture. The framework introduces a reconstructor based on conventional radiography imaging principles, which achieves efficient reconstruction and mitigates the recurrent calculations and training instability caused by soft tissue in the overlapped regions. Additionally, pre-training with synthetic images is employed to enhance the stability of both the training process and the results. The generated images passed the visual Turing test and improved performance in downstream tasks. This work affirms the feasibility of extracting bone layer images from conventional radiographs, which holds promise for leveraging bone layer separation technology to facilitate more comprehensive analytical research in MSK diagnosis, monitoring, and prognosis. Code and dataset will be made available.
https://arxiv.org/abs/2409.07304
Although 3D generated content (3DGC) offers advantages in reducing production costs and accelerating design timelines, its quality often falls short compared to professionally generated 3D content. Common quality issues frequently affect 3DGC, highlighting the importance of timely and effective quality assessment. Such evaluations not only ensure a higher standard of 3DGC for end-users but also provide critical insights for advancing generative technologies. To address existing gaps in this domain, this paper introduces a novel 3DGC quality assessment dataset, 3DGCQA, built using 7 representative Text-to-3D generation methods. During the dataset's construction, 50 fixed prompts are used to generate content with all methods, resulting in the 313 textured meshes that constitute the 3DGCQA dataset. The visualization intuitively reveals the presence of 6 common distortion categories in the generated 3DGC. To further explore the quality of the 3DGC, subjective quality assessment is conducted by evaluators, whose ratings reveal significant variation in quality across the different generation methods. Additionally, several objective quality assessment algorithms are tested on the 3DGCQA dataset. The results expose limitations in the performance of existing algorithms and underscore the need for developing more specialized quality assessment methods. To provide a valuable resource for future research and development in 3D content generation and quality assessment, the dataset has been open-sourced at this https URL.
https://arxiv.org/abs/2409.07236
Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four key characteristics: 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their LaTeX source codes. 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA. 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
https://arxiv.org/abs/2406.11633
No-reference image quality assessment is a challenging domain that addresses estimating image quality without the original reference. We introduce an improved mechanism to extract local and non-local information from images via different Transformer encoders and CNNs. The utilization of Transformer encoders aims to mitigate locality bias and generate a non-local representation by sequentially processing CNN features, which inherently capture local visual structures. A stronger connection between subjective and objective assessments is established through sorting within batches of images based on relative distance information. A self-consistency approach to self-supervision is presented, explicitly addressing the degradation of no-reference image quality assessment (NR-IQA) models under equivariant transformations. Our approach ensures model robustness by maintaining consistency between an image and its horizontally flipped equivalent. Through empirical evaluation on five popular image quality assessment datasets, the proposed model outperforms alternative algorithms in the context of no-reference image quality assessment, especially on smaller datasets. Code is available at \href{this https URL}{this https URL}.
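The flip-consistency term can be written as a simple self-supervised loss: predictions for an image and its horizontal flip should agree. A minimal sketch, assuming `model` maps image batches to scalar quality scores:

```python
# Self-consistency under horizontal flipping as an auxiliary loss term.
import torch
import torch.nn.functional as F

def flip_consistency_loss(model, images: torch.Tensor) -> torch.Tensor:
    """images: (batch, 3, H, W); model returns one quality score per image."""
    scores = model(images)
    scores_flipped = model(torch.flip(images, dims=[-1]))  # flip along the width axis
    return F.mse_loss(scores, scores_flipped)
```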
https://arxiv.org/abs/2409.07115
This chapter examines the Variational Quantum Harmonizer (VQH), a software tool and musical interface that focuses on the sonification of the minimization steps of Variational Quantum Algorithms (VQA), which are used for simulating properties of quantum systems and for optimization problems assisted by quantum hardware. In particular, it details the sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems solved with VQA. A flexible design enables future applications both as a sonification tool for auditory displays in scientific investigation, and as a hybrid quantum-digital musical instrument for artistic endeavours. In turn, sonification can help researchers understand complex systems better and can serve in the teaching of quantum physics and quantum computing. The VQH structure, including its software implementation, control mechanisms, and sonification mappings, is detailed. Moreover, the chapter guides the design of QUBO cost functions in VQH as a music compositional object. The discussion is extended to the implications of applying quantum-assisted simulation in quantum-computer-aided composition and live-coding performances. An artistic output is showcased by the piece \textit{Hexagonal Chambers} (Thomas and Itaboraí, 2023).
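As a toy illustration of such a sonification mapping, a trace of QUBO cost values from the VQA minimization could be mapped linearly to pitch, so that convergence is heard as a descending glissando. The range and the linear mapping below are arbitrary choices, not the VQH mappings themselves.

```python
# Toy mapping from a cost-minimization trace to audio frequencies (Hz).
import numpy as np

def costs_to_frequencies(costs, f_min=220.0, f_max=880.0):
    c = np.asarray(costs, dtype=float)
    norm = (c - c.min()) / (c.max() - c.min() + 1e-12)   # 0 = lowest cost, 1 = highest
    return f_min + norm * (f_max - f_min)

print(costs_to_frequencies([4.0, 2.5, 1.2, 0.3]))        # descending pitches as cost drops
```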
https://arxiv.org/abs/2409.07104
Blind Image Quality Assessment (BIQA) aims to develop methods that estimate the quality scores of images in the absence of a reference image. In this paper, we approach BIQA from a distortion identification perspective, where our primary goal is to predict distortion types and strengths using Vision-Language Models (VLMs), such as CLIP, due to their extensive knowledge and generalizability. Based on these predicted distortions, we then estimate the quality score of the image. To achieve this, we propose an explainable approach for distortion identification based on attribute learning. Instead of prompting VLMs with the names of distortions, we prompt them with the attributes or effects of distortions and aggregate this information to infer the distortion strength. Additionally, we consider multiple distortions per image, making our method more scalable. To support this, we generate a dataset consisting of 100,000 images for efficient training. Finally, attribute probabilities are retrieved and fed into a regressor to predict the image quality score. The results show that our approach, besides its explainability and transparency, achieves state-of-the-art (SOTA) performance across multiple datasets in both PLCC and SRCC metrics. Moreover, the zero-shot results demonstrate the generalizability of the proposed approach.
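The attribute-prompting step can be sketched zero-shot with CLIP: score the image against prompts describing distortion effects rather than distortion names, and keep the resulting probabilities as features for the quality regressor. The prompt texts below are illustrative, not the paper's prompt set.

```python
# Zero-shot attribute probabilities from CLIP, used as features for a quality regressor.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ATTRIBUTE_PROMPTS = [
    "a photo with smeared, indistinct edges",   # blur-like effects
    "a photo with grainy speckles",             # noise-like effects
    "a photo with blocky square artifacts",     # compression-like effects
    "a photo with washed-out colors",           # contrast/saturation effects
]

@torch.no_grad()
def attribute_probabilities(image: Image.Image) -> torch.Tensor:
    inputs = processor(text=ATTRIBUTE_PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image      # (1, n_prompts)
    return logits.softmax(dim=-1).squeeze(0)       # feed these into the regressor
```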
https://arxiv.org/abs/2409.06853
With the remarkable success achieved by Multimodal Large Language Models (MLLMs), numerous benchmarks have been designed to assess MLLMs' abilities and to guide their development in image perception tasks (e.g., image captioning and visual question answering). However, the existence of numerous benchmarks results in a substantial computational burden when evaluating model performance across all of them. Moreover, these benchmarks contain many overly simple problems or challenging samples, which do not effectively differentiate the capabilities of various MLLMs. To address these challenges, we propose a pipeline to process the existing benchmarks, which consists of two modules: (1) Semi-Automated Screening Process and (2) Eliminating Answer Leakage. The Semi-Automated Screening Process filters out samples that cannot distinguish model capabilities by synthesizing answers from various MLLMs and manually evaluating them. The Eliminating Answer Leakage module filters out samples whose answers can be inferred without the images. Finally, we curate LIME-M (Less Is More for Evaluation of Multimodal LLMs), a lightweight multimodal benchmark that can more effectively evaluate the performance of different models. Our experiments demonstrate that LIME-M can better distinguish the performance of different MLLMs with fewer samples (24% of the original) and in less time (23% of the original); that LIME-M eliminates answer leakage, focusing mainly on the information within images; and that the current automatic metric (i.e., CIDEr) is insufficient for evaluating MLLMs' captioning capabilities. Moreover, removing the caption-task score when calculating the overall score provides a more accurate reflection of model performance differences. All our code and data are released at this https URL.
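The answer-leakage check can be pictured as below: ask a text-only model the question without the image and drop the sample if the answer is still recovered. The `text_only_answer` helper is a hypothetical placeholder for the actual model call.

```python
# Drop samples whose answers can be inferred without looking at the image.
def text_only_answer(question: str, options: list[str]) -> str:
    """Hypothetical call to an LLM that sees only the question and the answer options."""
    raise NotImplementedError

def remove_leaky_samples(samples: list[dict]) -> list[dict]:
    kept = []
    for s in samples:  # each sample: {"question", "options", "answer", "image", ...}
        if text_only_answer(s["question"], s["options"]) != s["answer"]:
            kept.append(s)
    return kept
```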
https://arxiv.org/abs/2409.06851
Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.
https://arxiv.org/abs/2409.06644
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4's judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4's reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4's evaluations and calibration on reference situations.
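A unit test in this style pairs a grounded QA example exhibiting one known failure mode with the score range an ideal judge should return. The sketch below is illustrative: the failure mode, threshold, and `judge_score` helper are assumptions, not GroUSE's actual test cases.

```python
# One unit-test-style check: the judge must penalize a claim unsupported by the references.
def judge_score(question: str, references: list[str], answer: str) -> float:
    """Hypothetical call to an LLM judge rating answer groundedness on a 1-5 scale."""
    raise NotImplementedError

def test_judge_penalizes_unsupported_claim():
    question = "When was the contract signed?"
    references = ["The contract was signed in March 2021."]
    answer = "It was signed in March 2021 and renewed in 2023."  # the renewal is unsupported
    assert judge_score(question, references, answer) <= 2, \
        "judge should give a low groundedness score to unsupported claims"
```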
https://arxiv.org/abs/2409.06595
Although Visual-Language Models (VLMs) have shown impressive capabilities in tasks like visual question answering and image captioning, they still struggle with hallucinations. Analysis of attention distribution in these models shows that VLMs tend to process textual tokens rather than visual tokens. This imbalance in attention distribution causes VLMs to favor textual knowledge in cases of multimodal knowledge conflict, resulting in outputs that diverge from the image information. In this paper, we propose the Re-Balancing Contrastive Decoding (RBD) method, which employs textual and visual branches to recalibrate the attention distribution in VLMs. Specifically, the textual branch injects image noise to stimulate the model's dependency on text, thereby reducing textual bias. Concurrently, the visual branch focuses on selecting significant tokens, refining the attention mechanism to highlight the primary subject. This dual-branch strategy enables the RBD method to diminish textual bias while enhancing visual information. Experimental results demonstrate that our method, RBD, outperforms existing methods on the CHAIR and POPE metrics, mitigating hallucinations without reducing the model's general capabilities.
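One way to picture the textual branch is in a contrastive-decoding style: logits conditioned on a noise-injected image expose the model's text-only bias, which is then contrasted away. The combination rule and noise model below are assumptions for illustration; the visual branch's token selection is omitted.

```python
# Contrastive-decoding-style sketch of the textual branch.
import torch

def add_image_noise(pixel_values: torch.Tensor, sigma: float = 0.3) -> torch.Tensor:
    return pixel_values + sigma * torch.randn_like(pixel_values)

def rebalanced_logits(logits_clean: torch.Tensor, logits_noisy: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """logits_*: (vocab,) next-token logits with the clean vs. noise-injected image."""
    return (1 + beta) * logits_clean - beta * logits_noisy
```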
https://arxiv.org/abs/2409.06485