Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code that calls tools and modifies the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits and of why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus and show that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
https://arxiv.org/abs/2501.05452
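The kind of edit sequence the ReFocus abstract describes (masking out areas, then drawing boxes) can be illustrated with a minimal sketch. This is not the paper's tooling: the image is assumed to be a plain 2D list of grayscale pixels, and the helper names `mask_out` and `draw_box` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def mask_out(image: Image, box: Box, fill: int = 255) -> Image:
    """Return a copy of the image with the boxed region painted over."""
    left, top, right, bottom = box
    edited = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            edited[y][x] = fill
    return edited


def draw_box(image: Image, box: Box, color: int = 0) -> Image:
    """Return a copy of the image with a one-pixel outline just inside the box."""
    left, top, right, bottom = box
    edited = [row[:] for row in image]
    for x in range(left, right):
        edited[top][x] = color
        edited[bottom - 1][x] = color
    for y in range(top, bottom):
        edited[y][left] = color
        edited[y][right - 1] = color
    return edited


# Hypothetical 6x4 "table image" filled with mid-gray pixels.
table = [[128] * 6 for _ in range(4)]
step1 = mask_out(table, (3, 0, 6, 4))  # hide the right-hand columns
step2 = draw_box(step1, (0, 0, 3, 4))  # outline the columns that remain
```

Each helper returns a new image rather than editing in place, so every intermediate "visual thought" stays available for inspection.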
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: this https URL
https://arxiv.org/abs/2501.05414
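Because the LongProc tasks yield structured output, scoring can be rule-based. The sketch below shows one way to score a TSV output against a reference; the row-level F1 metric and the function names are assumptions for illustration, not necessarily the benchmark's own scoring code.

```python
from typing import List, Tuple


def parse_tsv(text: str) -> List[Tuple[str, ...]]:
    """Split TSV text into rows of whitespace-trimmed cells."""
    rows = []
    for line in text.strip().splitlines():
        rows.append(tuple(cell.strip() for cell in line.split("\t")))
    return rows


def row_f1(predicted: str, reference: str) -> float:
    """F1 over exact row matches between predicted and reference TSV (assumed metric)."""
    pred_rows = set(parse_tsv(predicted))
    ref_rows = set(parse_tsv(reference))
    correct = len(pred_rows & ref_rows)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_rows)
    recall = correct / len(ref_rows)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of the two predicted rows matches the reference.
reference = "Paris\tFrance\nRome\tItaly"
predicted = "Paris\tFrance\nRome\tSpain"
score = row_f1(predicted, reference)
```

No judge model is needed here: the score depends only on exact row matches, which is what makes this kind of evaluation reproducible.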
The rapid advancement of large language models (LLMs) has led to significant improvements in their capabilities, but also to increased concerns about their alignment with human values and intentions. Current alignment strategies, including adaptive training and inference-time methods, have demonstrated potential in this area. However, these approaches still struggle to balance deployment complexity and capability across various tasks and difficulties. In this work, we introduce the Streaming Distribution Induce Aligner (Stream Aligner), a novel alignment paradigm that combines efficiency with enhanced performance in various tasks throughout the generation process. Stream Aligner achieves dynamic sentence-level correction by using a small model to learn the preferences of the suffix sentence, iteratively correcting the suffix sentence output by the upstream model, and then using the corrected sentence to replace the suffix sentence in subsequent generations. Compared to Aligner, our experiments demonstrate that Stream Aligner reduces reliance on the capabilities of additional models, enhances the reasoning abilities of LLMs, and decreases latency during user interaction. Specifically, the Stream Aligner-2B model achieves an improvement of 76.1% in helpfulness and 36.0% in harmlessness on the tested Llama2-70B-chat model, and Stream Aligner-8B achieves an improvement of 3.5% in the math ability of the tested Llama3-70B-Instruct model.
https://arxiv.org/abs/2501.05336
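The sentence-level correction loop described in the Stream Aligner abstract can be sketched as follows. The two callables stand in for the upstream model and the small corrector model; their signatures, the toy script, and the name `stream_align` are assumptions for illustration, not the paper's interface.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces: the upstream model proposes the next sentence given the
# prompt and the accepted prefix; the corrector returns a (possibly revised) sentence.
Upstream = Callable[[str, List[str]], Optional[str]]
Corrector = Callable[[str, List[str], str], str]


def stream_align(prompt: str, upstream: Upstream, corrector: Corrector,
                 max_sentences: int = 8) -> List[str]:
    """Generate sentence by sentence, correcting each suffix before continuing."""
    accepted: List[str] = []
    for _ in range(max_sentences):
        draft = upstream(prompt, accepted)
        if draft is None:  # upstream model has finished
            break
        # The corrected sentence replaces the draft in all later generation steps.
        accepted.append(corrector(prompt, accepted, draft))
    return accepted


# Toy stand-ins for the two models.
SCRIPT = ["First, add the numbers.", "2 + 2 = 5.", "So the answer is 4."]


def toy_upstream(prompt: str, prefix: List[str]) -> Optional[str]:
    return SCRIPT[len(prefix)] if len(prefix) < len(SCRIPT) else None


def toy_corrector(prompt: str, prefix: List[str], draft: str) -> str:
    return draft.replace("2 + 2 = 5", "2 + 2 = 4")


result = stream_align("What is 2 + 2?", toy_upstream, toy_corrector)
```

The point of the loop is that the upstream model always continues from the corrected prefix, so an early fix carries into every later sentence.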
Interacting with a software system via a chatbot can be challenging, especially when the chatbot needs to generate API calls, in the right order and with the right parameters, to communicate with the system. The difficulty is greatest in complex, multi-step tasks that require accurate API selection and execution. We contribute to this domain in three ways: first, by introducing a novel dataset designed to assess models on API function selection, parameter generation, and nested API calls; second, by benchmarking state-of-the-art language models across varying levels of complexity to evaluate their performance in API function generation and parameter accuracy; and third, by proposing an enhanced API routing method that combines general-purpose large language models for API selection with fine-tuned models for parameter generation, together with prompt engineering. These approaches lead to substantial improvements in handling complex API tasks, offering practical advancements for real-world API-driven chatbot systems.
https://arxiv.org/abs/2501.05255
The development of Automatic Question Generation (QG) models has the potential to significantly improve educational practices by reducing the teacher workload associated with creating educational content. This paper introduces a novel approach to educational question generation that controls the topical focus of questions. The proposed Topic-Controlled Question Generation (T-CQG) method enhances the relevance and effectiveness of the generated content for educational purposes. Our approach fine-tunes a pre-trained T5-small model on specially created datasets tailored to educational needs. The research further explores the impacts of pre-training strategies, quantisation, and data augmentation on the model's performance. We specifically address the challenge of generating questions that are semantically aligned with paragraph-level contexts, thereby improving the topic specificity of the generated questions. In addition, we introduce and explore novel evaluation methods to assess the topical relatedness of the generated questions. Our results, validated through rigorous offline and human-backed evaluations, demonstrate that the proposed models effectively generate high-quality, topic-focused questions. These models have the potential to reduce teacher workload and support personalised tutoring systems by serving as bespoke question generators. With their relatively small number of parameters, the proposed models not only advance the capabilities of question generation models for handling specific educational topics but also offer a scalable solution that reduces infrastructure costs. This scalability makes them feasible for widespread use in education without reliance on proprietary large language models like ChatGPT.
https://arxiv.org/abs/2501.05220
Document-Level Biomedical Relation Extraction (Bio-RE) aims to identify relations between biomedical entities within extensive texts, serving as a crucial subfield of biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations spanning multiple sentences. Moreover, previous methods often overlook the incompleteness of documents and lack the integration of external knowledge, limiting contextual richness. In addition, the scarcity of annotated data further hampers model training. Recent advancements in large language models (LLMs) have inspired us to address all of the above issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework via LLM Adaptive Document-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, we introduce the Iteration-of-REsummary (IoRs) prompt to address the data scarcity issue. With this prompt, Bio-RE task-specific synthetic data can be generated by guiding ChatGPT to focus on entity relations and iteratively refining the synthetic data. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model's contextual understanding and cross-sentence inference capabilities. Finally, during inference, a biomedical-specific RAG approach, named CUI RAG, is designed to leverage CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document contexts. Experiments conducted on three Bio-RE datasets (GDA, CDR, and BioRED) demonstrate the state-of-the-art performance of our proposed method in comparison with other related works.
https://arxiv.org/abs/2501.05155
Large Language Models (LLMs) are prevalent in modern applications but often memorize training data, leading to privacy breaches and copyright issues. Existing research has mainly focused on post hoc analyses, such as extracting memorized content or developing memorization metrics, without exploring the underlying architectural factors that contribute to memorization. In this work, we investigate memorization from an architectural lens by analyzing how attention modules at different layers impact a model's memorization and generalization performance. Using attribution techniques, we systematically intervene in the LLM architecture by bypassing attention modules at specific blocks while keeping other components like layer normalization and MLP transformations intact. We provide theorems analyzing our intervention mechanism from a mathematical view, bounding the difference in layer outputs with and without our attributions. Our theoretical and empirical analyses reveal that attention modules in deeper transformer blocks are primarily responsible for memorization, whereas earlier blocks are crucial for the model's generalization and reasoning capabilities. We validate our findings through comprehensive experiments on different LLM families (Pythia and GPTNeo) and five benchmark datasets. Our insights offer a practical approach to mitigate memorization in LLMs while preserving their performance, contributing to safer and more ethical deployment in real-world applications.
https://arxiv.org/abs/2501.05078
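The intervention in the memorization abstract, bypassing attention at chosen blocks while leaving the MLP path intact, can be illustrated with a toy residual block. Everything here is an assumption for illustration: the scalar "tokens", the mean-mixing `attention`, the halving `mlp`, the omission of layer normalization, and the reading of "bypass" as dropping the attention term from the residual stream.

```python
from typing import List, Set


def attention(tokens: List[float]) -> List[float]:
    """Toy attention: every position receives the mean of all tokens."""
    mean = sum(tokens) / len(tokens)
    return [mean for _ in tokens]


def mlp(tokens: List[float]) -> List[float]:
    """Toy position-wise MLP."""
    return [0.5 * t for t in tokens]


def block(tokens: List[float], bypass_attention: bool = False) -> List[float]:
    """Residual block; when bypassed, the attention term is skipped entirely."""
    if bypass_attention:
        hidden = list(tokens)
    else:
        hidden = [t + a for t, a in zip(tokens, attention(tokens))]
    return [h + m for h, m in zip(hidden, mlp(hidden))]


def forward(tokens: List[float], num_blocks: int, bypassed: Set[int]) -> List[float]:
    """Run a stack of blocks, bypassing attention at the listed block indices."""
    for index in range(num_blocks):
        tokens = block(tokens, bypass_attention=index in bypassed)
    return tokens


full = forward([1.0, 3.0], num_blocks=2, bypassed=set())
deep_bypassed = forward([1.0, 3.0], num_blocks=2, bypassed={1})
```

Comparing `full` with `deep_bypassed` mirrors the paper's setup in miniature: the output difference when only a deeper block loses its attention module.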
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We developed a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long video understanding model, LongVU, and the commercial model, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge posed by our benchmark. Further supervised fine-tuning (SFT) of LongVU led to performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. These outcomes demonstrate LongViTU's high data quality and robust OOD generalizability.
https://arxiv.org/abs/2501.05037
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. The system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT provides the user with a path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectory and a mean error of 34.22 m, measured by Euclidean distance with the K-Nearest Neighbors (KNN) approach, in locating objects of interest on a map.
https://arxiv.org/abs/2501.05014
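A localization error of the kind the UAV-VLA abstract reports (mean Euclidean distance with nearest-neighbor matching) could be computed roughly as below. The matching rule, the coordinates, and the name `mean_nearest_error` are assumptions for illustration; the abstract does not spell out the exact procedure.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # map coordinates in meters (assumed)


def mean_nearest_error(predicted: List[Point], ground_truth: List[Point]) -> float:
    """Average distance from each predicted point to its nearest ground-truth point."""
    if not predicted or not ground_truth:
        raise ValueError("both point lists must be non-empty")
    distances = [
        min(math.dist(p, g) for g in ground_truth)  # 1-nearest-neighbor match
        for p in predicted
    ]
    return sum(distances) / len(distances)


# Hypothetical example: two predicted object locations, two annotated ones.
predicted = [(0.0, 0.0), (10.0, 0.0)]
ground_truth = [(3.0, 4.0), (10.0, 5.0)]
error_m = mean_nearest_error(predicted, ground_truth)
```

In this toy case each prediction lies exactly 5 m from its nearest annotation, so the mean error is 5 m.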
Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but their safety mechanisms still have potential vulnerabilities. Jailbreak attacks are red-teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing jailbreak methods for MLLMs often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for shuffled harmful instructions. That is, from the perspective of comprehension ability, MLLMs can understand shuffled harmful text-image instructions well. However, from the perspective of safety ability, their safety mechanisms can be easily bypassed by the shuffled harmful instructions, leading to harmful responses. Based on this finding, we propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack's performance on three benchmarks. In particular, SI-Attack markedly improves the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.
https://arxiv.org/abs/2501.04931
Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to over-reliance on the traditional cascaded design, which involves separate, sequential components, as well as the use of text as an intermediate representation. This paper proposes a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, fillers, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: this https URL
https://arxiv.org/abs/2501.04877
Malware analysis is a complex process of examining and evaluating malicious software's functionality, origin, and potential impact. This arduous process typically involves dissecting the software to understand its components, infection vector, propagation mechanism, and payload. Over the years, deep reverse engineering of malware has become increasingly tedious, mainly due to the fast evolution and sophistication of modern malicious codebases. Essentially, analysts are tasked with finding the needle in the haystack within the complexities of zero-day malware, all while under tight time constraints. Thus, in this paper, we explore leveraging Large Language Models (LLMs) for semantic malware analysis to expedite the analysis of known and novel samples. Built on the GPT-4o-mini model, \msp is designed to augment malware analysis for Android through a hierarchical-tiered summarization chain and strategic prompt engineering. Additionally, \msp performs malware categorization, distinguishing potential malware from benign applications, thereby saving time during the malware reverse engineering process. Although \msp is not fine-tuned for Android malware analysis, we demonstrate that through optimized and advanced prompt engineering it can achieve up to 77% classification accuracy while providing highly robust summaries at the function, class, and package levels. In addition, tracing the summaries backward from the package level to the function level allowed us to pinpoint the precise code snippets responsible for malicious behavior.
https://arxiv.org/abs/2501.04848
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge to generate a response within a context with improved accuracy and reduced hallucinations. However, multi-modal RAG systems face unique challenges: (i) the retrieval process may select entries (e.g., images, documents) that are irrelevant to the user query, and (ii) vision-language models or multi-modal language models like GPT-4o may hallucinate when processing these entries to generate RAG output. In this paper, we aim to address the first challenge, i.e., improving the selection of relevant context from the knowledge base in the retrieval phase of multi-modal RAG. Specifically, we leverage the relevancy score (RS) measure, designed in our previous work for evaluating RAG performance, to select more relevant entries in the retrieval process. Retrieval based on embeddings (e.g., CLIP-based embeddings) and cosine similarity usually performs poorly, particularly for multi-modal data. We show that by using a more advanced relevancy measure, one can enhance the retrieval process by selecting more relevant pieces from the knowledge base and eliminate the irrelevant pieces from the context by adaptively selecting up to $k$ entries instead of a fixed number of entries. Our evaluation using the COCO dataset demonstrates significant enhancement in selecting relevant context and in the accuracy of the generated response.
https://arxiv.org/abs/2501.04695
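The difference between fixed top-k retrieval and adaptively selecting up to k entries can be shown with a short sketch. The relevancy scores are assumed to be precomputed by some scorer (the paper's RS model is not reproduced here), and the cut-off `threshold` and the name `select_up_to_k` are assumptions for illustration.

```python
from typing import List, Tuple


def select_up_to_k(scored_entries: List[Tuple[str, float]], k: int,
                   threshold: float) -> List[str]:
    """Keep at most k entries, and only those scoring at or above the threshold."""
    relevant = [(entry, score) for entry, score in scored_entries if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)  # most relevant first
    return [entry for entry, _ in relevant[:k]]


# Hypothetical knowledge-base entries with precomputed relevancy scores.
scored = [("img_cat.jpg", 0.91), ("img_car.jpg", 0.12),
          ("doc_cats.txt", 0.78), ("img_dog.jpg", 0.35)]
context = select_up_to_k(scored, k=3, threshold=0.5)
```

A fixed top-3 retriever would also pass `img_dog.jpg` to the generator; the adaptive variant returns only the two entries that clear the threshold.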
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. RadGPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that RadGPT can produce accurate reports, with high sensitivity/specificity for small tumor (<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation. RadGPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports' accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 8,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and three pancreatic sub-segments annotated per-voxel; (2) determine pancreatic tumor stage (T1-T4) in 260 reports; and (3) present individual analyses of multiple tumors--rare in human-made reports. Importantly, 948 of the reports are for early-stage tumors.
https://arxiv.org/abs/2501.04678
Recent advancements in multimodal models have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, studies of visual matching ability are missing, even though finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even in strong current MLLMs such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching data with reasoning annotation. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning, and an instruction augmentation strategy. CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models are available at this https URL.
https://arxiv.org/abs/2501.04670
Large Language Models, despite their significant capabilities, are known to fail in surprising and unpredictable ways. Evaluating their true 'understanding' of language is particularly challenging due to the extensive web-scale data they are trained on. Therefore, we construct an evaluation to systematically assess natural language understanding (NLU) in LLMs by leveraging Construction Grammar (CxG), which provides insights into the meaning captured by linguistic elements known as constructions (Cxns). CxG is well-suited for this purpose because it provides a theoretical basis for constructing targeted evaluation sets. These datasets are carefully constructed to include examples that are unlikely to appear in pre-training data, yet are intuitive and easy for humans to understand, enabling a more targeted and reliable assessment. Our experiments focus on downstream natural language inference and reasoning tasks by comparing LLMs' understanding of the underlying meanings communicated through 8 unique Cxns with that of humans. The results show that while LLMs demonstrate some knowledge of constructional information, even the latest models including GPT-o1 struggle with abstract meanings conveyed by these Cxns, as demonstrated in cases where test sentences are dissimilar to their pre-training data. We argue that such cases provide a more accurate test of true language understanding, highlighting key limitations in LLMs' semantic capabilities. We make our novel dataset and associated experimental data including prompts and model responses publicly available.
https://arxiv.org/abs/2501.04661
Interior design involves the careful selection and arrangement of objects to create an aesthetically pleasing, functional, and harmonized space that aligns with the client's design brief. This task is particularly challenging, as a successful design must not only incorporate all the necessary objects in a cohesive style, but also ensure they are arranged in a way that maximizes accessibility, while adhering to a variety of affordability and usage considerations. Data-driven solutions have been proposed, but these are typically room- or domain-specific and lack explainability in the design considerations used in producing the final layout. In this paper, we investigate whether large language models (LLMs) can be directly utilized for interior design. While we find that LLMs are not yet capable of generating complete layouts, they can be effectively leveraged in a structured manner, inspired by the workflow of interior designers. By systematically probing LLMs, we can reliably generate a list of objects along with relevant constraints that guide their placement. We translate this information into a design layout graph, which is then solved using an off-the-shelf constrained optimization setup to generate the final layouts. We benchmark our algorithm in various design configurations against existing LLM-based methods and human designs, and evaluate the results using a variety of quantitative and qualitative metrics along with user studies. In summary, we demonstrate that LLMs, when used in a structured manner, can effectively generate diverse high-quality layouts, making them a viable solution for creating large-scale virtual scenes. Project webpage at this https URL
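Interior design involves carefully selecting and arranging objects to create an attractive, functional, and harmonious space that matches the client's design brief. The task is especially challenging because a successful design must not only incorporate all the necessary objects in a consistent style but also arrange them to maximize accessibility while accounting for a range of affordability and usage considerations. Data-driven approaches have been proposed, but they are usually limited to specific rooms or domains and offer little explanation of the design considerations behind the final layout. This paper examines whether large language models (LLMs) can be used directly for interior design. We find that although current LLMs cannot yet generate complete layouts, they can be used effectively in a structured way modeled on the workflow of professional designers. By probing LLMs systematically, we can reliably produce a list of objects together with the constraints that govern their placement. This information is then converted into a design layout graph, which is solved with an off-the-shelf constrained optimization setup to produce the final layouts. We benchmark our algorithm across various design configurations against existing LLM-based methods and human designs, and we evaluate the results with a range of quantitative and qualitative metrics as well as user studies. In short, we show that LLMs, when used in a structured way, can generate diverse, high-quality layouts, which makes them a viable option for creating large-scale virtual scenes. Project webpage: see the arXiv listing below.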
https://arxiv.org/abs/2501.04648
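The pipeline described in the interior design abstract above (an object list plus placement constraints, turned into a layout graph and handed to a constraint solver) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the room size, the object names, the two constraint types, and the brute-force grid search, which stands in for the off-the-shelf constrained optimization setup the abstract mentions. It is not the authors' implementation.

```python
from itertools import product

# Hypothetical room (in grid cells) and objects as (width, depth); illustrative only.
ROOM_W, ROOM_D = 6, 5
OBJECTS = {"bed": (3, 2), "nightstand": (1, 1), "desk": (2, 1)}

# Hypothetical constraints of the kind an LLM might be prompted to emit.
# Together with OBJECTS they form a small layout graph: nodes are objects,
# edges are pairwise relations such as "adjacent".
CONSTRAINTS = [
    ("against_wall", "bed"),
    ("against_wall", "desk"),
    ("adjacent", "nightstand", "bed"),
]

def rect(name, pos):
    """Return (x0, y0, x1, y1) for an object placed with its corner at pos."""
    x, y = pos
    w, d = OBJECTS[name]
    return x, y, x + w, y + d

def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def touches_wall(r):
    return r[0] == 0 or r[1] == 0 or r[2] == ROOM_W or r[3] == ROOM_D

def adjacent(a, b):
    """True if the rectangles share part of an edge."""
    side = (a[2] == b[0] or b[2] == a[0]) and a[1] < b[3] and b[1] < a[3]
    stacked = (a[3] == b[1] or b[3] == a[1]) and a[0] < b[2] and b[0] < a[2]
    return side or stacked

def satisfied(layout):
    """Check non-overlap plus every constraint for a {name: (x, y)} layout."""
    rects = {name: rect(name, pos) for name, pos in layout.items()}
    names = list(rects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlaps(rects[a], rects[b]):
                return False
    for kind, *args in CONSTRAINTS:
        if kind == "against_wall" and not touches_wall(rects[args[0]]):
            return False
        if kind == "adjacent" and not adjacent(rects[args[0]], rects[args[1]]):
            return False
    return True

def candidates(name):
    """All grid positions that keep the object inside the room."""
    w, d = OBJECTS[name]
    return [(x, y) for x in range(ROOM_W - w + 1) for y in range(ROOM_D - d + 1)]

def solve():
    """Exhaustive search; returns the first feasible layout or None."""
    names = list(OBJECTS)
    for combo in product(*(candidates(n) for n in names)):
        layout = dict(zip(names, combo))
        if satisfied(layout):
            return layout
    return None

if __name__ == "__main__":
    print(solve())
```

Exhaustive search is only workable for a handful of objects on a coarse grid; a real system would pass the same constraints to a proper solver and add soft objectives such as accessibility.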
Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions that closely resemble human dialogue. However, these models introduce safety and security concerns in human-robot interaction (HRI), as they are vulnerable to manipulation that can bypass built-in safety measures. Taking a social robot deployed in a home as the setting, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, for example by prompting the robot to act like a life partner. We conducted a pilot study with 21 university students who interacted with a Misty robot and attempted to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting the robot and appealing to pity with emotional language. We hope this work can inform future research on designing strong safeguards that ensure ethical and secure human-robot interactions.
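Robots driven by large language models have recently made notable progress in conversational ability, bringing their interactions closer to human dialogue. These models, however, raise safety and security concerns in human-robot interaction (HRI), because they can be manipulated into bypassing built-in safety measures. Considering a social robot deployed in a home, this study seeks to understand how everyday users try to exploit a language model to violate ethical principles, for example by prompting the robot to play the role of a life partner. We ran a pilot study in which 21 university students interacted with a Misty robot and tried to circumvent its safety mechanisms in three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. The results show that participants used five techniques, including insulting the robot and appealing to pity with emotional language. We hope this work will inform future research on designing strong safeguards that keep human-robot interaction ethical and secure.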
https://arxiv.org/abs/2501.04633
Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose openomni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that openomni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
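Omnimodal learning has recently advanced in understanding and generation across images, text, and speech, although mostly within proprietary models. Limited omnimodal datasets and the inherent difficulty of real-time emotional speech generation have held back open-source progress. To address these problems, we propose openomni, a two-stage training method that combines omnimodal alignment with speech generation to build a state-of-the-art omnimodal large language model. In the alignment stage, a pre-trained speech model is further trained on text-image tasks so that it generalizes from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation stage, a lightweight decoder enables real-time emotional speech through training on speech tasks and preference learning. Experiments show that openomni improves consistently across omnimodal, vision-language, and speech-language evaluations, supporting natural, emotion-rich dialogue and real-time emotional speech generation.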
https://arxiv.org/abs/2501.04561
Yield is one of the core goals of crop breeding. By predicting the potential yield of different breeding materials, breeders can screen these materials at various growth stages and select the best performers. Unmanned aerial vehicle remote sensing is used to collect high-throughput crop phenotyping data in breeding areas, providing data support for breeders' decisions. However, the accuracy of current yield predictions still requires improvement, and the usability and user-friendliness of yield forecasting tools remain suboptimal. To address these challenges, this study introduces a hybrid method and tool for crop yield prediction, designed to allow breeders to interactively and accurately predict wheat yield by chatting with a large language model (LLM). First, a newly designed data assimilation algorithm is used to assimilate the leaf area index into the WOFOST model. Then, selected outputs from the assimilation process, along with remote sensing inversion results, are used to drive a time-series temporal fusion transformer model for wheat yield prediction. Finally, based on this hybrid method and leveraging an LLM with retrieval-augmented generation, we developed an interactive yield prediction Web tool that is user-friendly and supports sustainable data updates. This tool integrates multi-source data to assist breeding decision-making. This study aims to accelerate the identification of high-yield materials in the breeding process, enhance breeding efficiency, and enable more scientific and smart breeding decisions.
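Yield is one of the core goals of crop breeding. By predicting the potential yield of different breeding materials, breeders can screen them at different growth stages and select the best performers. Unmanned aerial vehicle remote sensing is used to collect high-throughput crop phenotyping data in breeding areas and so support breeding decisions. However, the accuracy of current yield predictions still needs to improve, and current yield forecasting tools remain lacking in usability and user-friendliness. To meet these challenges, this study proposes a hybrid method and tool for crop yield prediction that lets breeders predict wheat yield interactively and accurately by talking with a large language model (LLM). First, a newly designed data assimilation algorithm assimilates the leaf area index into the WOFOST model. Next, selected outputs of the assimilation process, together with remote sensing inversion results, drive a time-series temporal fusion transformer model that predicts wheat yield. Finally, building on this hybrid method and an LLM with retrieval-augmented generation, we developed an interactive, user-friendly Web tool for yield prediction that supports sustainable data updates and integrates multi-source data to assist breeding decisions. The study aims to speed up the identification of high-yield materials during breeding, improve breeding efficiency, and make breeding decisions more scientific and intelligent.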
https://arxiv.org/abs/2501.04487
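The first step of the wheat yield entry above, assimilating observed leaf area index (LAI) into a crop model, can be illustrated with a minimal sketch. The abstract does not describe its newly designed assimilation algorithm, so this example substitutes a standard scalar Kalman-style update, and the crop model is replaced by a hypothetical constant-growth step rather than WOFOST. All function names, parameters, and numbers are assumptions for illustration.

```python
def assimilate(sim, sim_var, obs, obs_var):
    """Blend a simulated value with an observation, weighted by their variances.

    Standard scalar Kalman update, used here as a stand-in for the
    unspecified assimilation algorithm in the abstract.
    """
    gain = sim_var / (sim_var + obs_var)
    post = sim + gain * (obs - sim)
    post_var = (1.0 - gain) * sim_var
    return post, post_var


def run(days, observations, lai0=0.5, var0=0.2,
        growth=0.1, process_var=0.05, obs_var=0.1):
    """Step a toy LAI model forward, correcting it on days with observations.

    observations maps a day index to an observed LAI value (for example,
    from remote sensing). The constant-growth step is a placeholder for a
    real crop model.
    """
    lai, var = lai0, var0
    trajectory = []
    for day in range(days):
        # Toy forecast step: LAI grows by a fixed amount, uncertainty increases.
        lai, var = lai + growth, var + process_var
        if day in observations:
            lai, var = assimilate(lai, var, observations[day], obs_var)
        trajectory.append(lai)
    return trajectory


if __name__ == "__main__":
    print(assimilate(3.0, 0.5, 4.0, 0.5))  # -> (3.5, 0.25)
    print(run(5, {2: 1.2}))
```

With equal variances the update lands halfway between the simulation and the observation; a noisier observation would pull the estimate less. The corrected LAI trajectory is the kind of output that could then feed a downstream yield predictor.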