We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at this https URL.
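To make the dynamic-resolution idea concrete, here is a minimal sketch of how a visual-token budget can scale with input resolution; the 14-pixel patch size and 2x2 token-merge factor are assumptions for illustration, not necessarily the exact values used by Qwen2-VL.

    # Hedged sketch: compute a variable number of visual tokens from an image's resolution.
    # The 14-pixel patch size and 2x2 merge factor are illustrative assumptions.
    def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
        patches_h = -(-height // patch)   # ceil: number of patches along the height
        patches_w = -(-width // patch)    # ceil: number of patches along the width
        return (patches_h * patches_w) // (merge * merge)

    print(visual_token_count(224, 224))    # a small image maps to few visual tokens
    print(visual_token_count(1344, 896))   # a larger image maps to proportionally more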
https://arxiv.org/abs/2409.12191
Large Language Models' (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among generated samples. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces 3 key challenges: (1) Excessive refinement: Uniformly refining all instances can over-correct and reduce the overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and correct their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non-trivial, and stopping too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained and iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: Solver, Reviewer (which generates targeted feedback based on step-wise RM scores), and the Refiner (which incorporates feedback). To ensure sufficient refinement, we re-evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
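A minimal sketch of the control flow described above, assuming placeholder callables for the Solver, the external step-wise reward model, the Reviewer, and the Refiner (the real system drives LLMs and an RM with prompts):

    # Hedged control-flow sketch; solve, step_scores, review, and refine are placeholders
    # for the LLM Solver, external step-wise reward model, Reviewer, and Refiner.
    from collections import Counter

    def magicore_sketch(problem, solve, step_scores, review, refine,
                        k=8, threshold=0.8, max_rounds=3):
        samples = [solve(problem) for _ in range(k)]                  # candidate solutions (strings)
        scored = [(s, step_scores(problem, s)) for s in samples]      # per-step RM scores per sample
        best, best_scores = max(scored, key=lambda x: min(x[1]))

        if min(best_scores) >= threshold:                             # "easy": coarse-grained aggregation
            return Counter(samples).most_common(1)[0][0]              # e.g. majority vote

        solution = best                                               # "hard": iterative multi-agent refinement
        for _ in range(max_rounds):
            feedback = review(problem, solution, step_scores(problem, solution))
            solution = refine(problem, solution, feedback)
            if min(step_scores(problem, solution)) >= threshold:      # re-evaluate before stopping
                break
        return solution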
https://arxiv.org/abs/2409.12147
In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state-of-the-art.
https://arxiv.org/abs/2409.11920
This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders (such as personalized preferences, corrective guidance, and contextual assistance) into structured instruction-formatted cues that prompt GPT-4o in generating customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments, which are constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving an 86.8% success rate compared to the vanilla GPT-4o baseline at 21.6%, reflecting a 65-percentage-point improvement and over four times greater effectiveness. Supplementary materials are available at: this https URL
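A hedged sketch of how these pieces could fit together at prompt-construction time; adapter_format and retrieve_similar are hypothetical placeholders for the fine-tuned LLaVA-7B adapter and the dynamic retrieval mechanism, not AlignBot's actual API.

    # Hypothetical prompt-assembly sketch: reminders become structured cues, relevant past
    # successes are retrieved, and both are prepended to the planner prompt for GPT-4o.
    def build_plan_prompt(task, reminders, history, adapter_format, retrieve_similar, top_k=3):
        cues = [adapter_format(r) for r in reminders]          # reminders -> structured instruction cues
        examples = retrieve_similar(task, history, top_k)      # task-relevant historical successes
        return "\n".join(
            ["# User cues:"] + cues +
            ["# Relevant successful plans:"] + examples +
            [f"# Task: {task}", "Generate a step-by-step household plan."]
        )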
https://arxiv.org/abs/2409.11905
Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. LLM Unlearning, a post-hoc approach to remove this information from trained LLMs, offers a promising solution to mitigate these risks. However, previous practices face three key challenges: 1. Utility: successful unlearning often causes catastrophic collapse on unrelated tasks. 2. Efficiency: many methods either involve adding similarly sized models, which slows down unlearning or inference, or require retain data that are difficult to obtain. 3. Robustness: even effective methods may still leak data via extraction techniques. To address these challenges, we propose MEOW, a simple yet effective gradient descent-based unlearning method. Specifically, we use an offline LLM to generate a set of inverted facts. Then, we design a new metric, MEMO, to quantify memorization in LLMs. Finally, based on the signals provided by MEMO, we select the most appropriate set of inverted facts and finetune the model based on them. We evaluate MEOW on the commonly used unlearning benchmark ToFU with Llama2-7B-Chat and Phi-1.5B, and test it on both NLU and NLG tasks. Results demonstrate that MEOW significantly improves forget quality without substantial loss in model utility. Meanwhile, MEOW does not exhibit significant degradation in NLU or NLG capabilities, and there is even a slight improvement in NLU performance.
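As a rough illustration of the selection step, the sketch below uses a simple likelihood-based memorization proxy in place of the paper's MEMO metric (whose exact definition is not reproduced here); the function names and threshold are assumptions.

    # Hedged sketch: a likelihood-based stand-in for a memorization signal, plus a selection
    # step over candidate inverted facts. Illustrative only, not the MEOW implementation.
    import math

    def memorization_proxy(token_logprobs):
        # Higher average log-probability of the sensitive text suggests stronger memorization.
        return math.exp(sum(token_logprobs) / len(token_logprobs))

    def select_inverted_facts(inverted_facts, signal, n=5):
        # Keep the n candidates the signal ranks as most suitable for unlearning fine-tuning.
        return sorted(inverted_facts, key=signal, reverse=True)[:n]

    print(memorization_proxy([-0.1, -0.2, -0.05]))   # close to 1.0 -> likely memorized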
https://arxiv.org/abs/2409.11844
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman": they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
https://arxiv.org/abs/2409.11733
Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the ability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of queries and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
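A conceptual sketch of the parallel, channel-division style of decoding; generate_next stands in for one decoding step of the LLM, and the alternating listen/speak slots are an assumption used only to illustrate the idea, not the paper's implementation.

    # Conceptual sketch: decoding steps alternate between consuming incoming user tokens
    # and emitting response tokens, so the model can react while the user is still "speaking".
    def duplex_decode(incoming_tokens, generate_next, max_steps=64, eos="<eos>"):
        context, response = [], []
        stream = iter(incoming_tokens)
        for step in range(max_steps):
            if step % 2 == 0:                       # "listen" slot: take the next user token
                token = next(stream, None)
                if token is not None:
                    context.append(token)
            else:                                   # "speak" slot: extend the response
                out = generate_next(context + response)
                if out == eos:
                    break
                response.append(out)
        return response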
https://arxiv.org/abs/2409.11727
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-sourced LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at this https URL.
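A hedged sketch of how the three components could interact; llm is a placeholder for any chat-completion call, the prompts are illustrative rather than TART's actual ones, and generated code would need sandboxing in practice.

    # Hypothetical pipeline sketch of the table formatter, tool maker, and explanation generator.
    def tart_sketch(table_rows, question, llm):
        formatted = "\n".join(" | ".join(str(cell) for cell in row) for row in table_rows)  # table formatter
        tool_code = llm(
            f"Write a Python function answer(rows) that computes: {question}\n{formatted}"
        )                                                        # tool maker
        namespace = {}
        exec(tool_code, namespace)                               # run the generated tool (sandbox in practice)
        result = namespace["answer"](table_rows)
        explanation = llm(f"Explain how {result} answers '{question}' for this table.")     # explanation generator
        return result, explanation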
https://arxiv.org/abs/2409.11724
In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.
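A small illustration of why best-of-n sampling is easy to exploit: whichever candidate the reward model scores highest is returned, so a format-biased RM rewards formatting rather than quality. The toy reward function below is made up purely to mimic verbosity and list-format biases.

    # Illustrative sketch: best-of-n returns the RM's favorite, so format bias propagates
    # directly into the selected output. The scoring rules here are invented for illustration.
    def best_of_n(candidates, reward_model):
        return max(candidates, key=reward_model)

    def biased_rm(response):
        score = len(response) * 0.001                                  # verbosity bias
        score += 1.0 if response.lstrip().startswith("-") else 0.0     # list-format bias
        return score

    plain = "Paris."
    listy = "- The answer is Paris.\n- It has been the capital of France for centuries."
    print(best_of_n([plain, listy], biased_rm))    # the longer, list-formatted reply wins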
https://arxiv.org/abs/2409.11704
As Large Language Models (LLMs) advance in natural language processing, there is growing interest in leveraging their capabilities to simplify software interactions. In this paper, we propose a novel system that integrates LLMs for both classifying natural language inputs into corresponding API calls and automating the creation of sample datasets tailored to specific API functions. By classifying natural language commands, our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization. Our dataset generation approach also enables the efficient and systematic evaluation of different LLMs in classifying API calls, offering a practical tool for developers or business owners to assess the suitability of LLMs for customized API management. We conduct experiments on several prominent LLMs using generated sample datasets for various API functions. The results show that GPT-4 achieves a high classification accuracy of 0.996, while LLaMA-3-8B performs much worse at 0.759. These findings highlight the potential of LLMs to transform API management and validate the effectiveness of our system in guiding model testing and selection across diverse applications.
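A hedged sketch of the classification step, constraining the model to a fixed API catalogue; the API names, descriptions, and prompt wording are hypothetical, and llm stands in for any chat-completion call.

    # Hypothetical sketch: map a natural-language request to exactly one API from a catalogue.
    API_CATALOGUE = {
        "create_invoice": "Create a new invoice for a customer",
        "get_order_status": "Look up the status of an existing order",
        "cancel_order": "Cancel an order that has not shipped",
    }

    def classify_to_api(user_input, llm):
        menu = "\n".join(f"- {name}: {desc}" for name, desc in API_CATALOGUE.items())
        prompt = (f"Map the request to exactly one API name from the list.\n{menu}\n"
                  f"Request: {user_input}\nAPI:")
        choice = llm(prompt).strip()
        return choice if choice in API_CATALOGUE else None   # reject anything off-catalogue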
https://arxiv.org/abs/2409.11703
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation where readers assess the stories generated by the SLM compared to human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed human writers in most aspects, except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts -- a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives, with only 3% of its stories seen as novel. In contrast, 15% of BART's stories were considered novel, indicating a higher degree of creativity despite its smaller model size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.
https://arxiv.org/abs/2409.11547
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user-turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently out-perform those fine-tuned on existing human generated training data across four publicly available multi-turn document grounded benchmark test sets.
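A minimal sketch of one dialog being synthesized under the three ideas above; generate_query, retrieve, answer, and judge are placeholders for the CoT-prompted query generator, the retriever, the grounded response generator, and LLM-as-a-Judge.

    # Hedged sketch of the generation loop: taxonomy-driven queries, re-grounding after every
    # user turn, and judge-based filtering of turns with incorrect answers.
    def synthesize_dialog(topic, taxonomy, corpus, generate_query, retrieve, answer, judge, turns=4):
        dialog = []
        for _ in range(turns):
            query = generate_query(topic, taxonomy, dialog)        # taxonomy-driven, CoT-prompted
            grounding = retrieve(corpus, query)                    # update grounding documents
            reply = answer(query, grounding, dialog)
            if not judge(query, reply, grounding):                 # drop turns judged incorrect
                continue
            dialog.append({"user": query, "assistant": reply, "docs": grounding})
        return dialog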
https://arxiv.org/abs/2409.11500
Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.
https://arxiv.org/abs/2409.11404
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: this https URL.
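As an illustration of the 1-D tile-tagging idea, the sketch below prepends a text tag to each dynamic-resolution tile's visual tokens, plus a tag for the global thumbnail; the tag strings are assumptions rather than NVLM's exact tokens.

    # Hedged sketch: interleave 1-D tile-index tags with each tile's visual tokens.
    def tag_tiles(tile_token_lists, thumbnail_tokens):
        sequence = ["<tile_global_thumbnail>"] + list(thumbnail_tokens)
        for i, tokens in enumerate(tile_token_lists, start=1):
            sequence.append(f"<tile_{i}>")       # text tag marking which tile follows
            sequence.extend(tokens)
        return sequence

    print(tag_tiles([["v1", "v2"], ["v3", "v4"]], ["t1"]))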
https://arxiv.org/abs/2409.11402
AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
https://arxiv.org/abs/2409.11363
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
https://arxiv.org/abs/2409.11353
The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary Blackbox LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we delve into the realm of document understanding, leveraging distillation methods to harness the power of large LLMs while accommodating computational limitations. Specifically, we present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension domains.
https://arxiv.org/abs/2409.11282
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods (Magnitude Pruning, SparseGPT, and Wanda) on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
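For reference, a straightforward numpy implementation of the Jensen-Shannon divergence between two next-token distributions (one from the original model, one from the compressed model); the paper's exact evaluation protocol may differ from this minimal sketch.

    # JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log(a / b))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    print(js_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))   # small value -> similar behavior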
https://arxiv.org/abs/2409.11233
Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence and allows us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance the sentiment classification performance towards aspect terms. Specifically, we explore three data augmentation strategies based on ChatGPT: context-focused, aspect-focused, and context-aspect data augmentation techniques. Context-focused data augmentation focuses on changing the word expression of context words in the sentence while keeping aspect terms unchanged. In contrast, aspect-focused data augmentation aims to change aspect terms but keep context words unchanged. Context-Aspect data augmentation integrates the above two data augmentations to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA tasks to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect data augmentation strategy performing best and surpassing the performance of the baseline models.
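A hedged sketch of the three augmentation strategies expressed as prompt templates; the exact instructions sent to ChatGPT in the paper are not reproduced here.

    # Hypothetical prompt templates for context-focused, aspect-focused, and context-aspect augmentation.
    def augmentation_prompt(sentence, aspect, strategy):
        if strategy == "context":          # paraphrase context words, keep the aspect term fixed
            return (f"Rewrite the sentence with different wording but keep the term "
                    f"'{aspect}' and its sentiment unchanged: {sentence}")
        if strategy == "aspect":           # swap the aspect term, keep the context fixed
            return (f"Replace '{aspect}' with a comparable aspect term, keeping the rest "
                    f"of the sentence and its sentiment unchanged: {sentence}")
        if strategy == "context-aspect":   # combine both transformations
            return (f"Rewrite the sentence and replace '{aspect}' with a comparable aspect "
                    f"term, preserving the sentiment: {sentence}")
        raise ValueError(strategy)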
https://arxiv.org/abs/2409.11218
The workshop is affiliated with the 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2024), August 26-30, 2024, Pasadena, CA, USA. It is designed as a half-day event, running from 9:00 to 12:30 PST. It accommodates both in-person and virtual attendees (via Zoom), ensuring a flexible participation mode. The agenda is thoughtfully crafted to include a diverse range of sessions: two keynote speeches that promise to provide insightful perspectives, two dedicated paper presentation sessions, an interactive panel discussion to foster dialogue among experts and facilitate deeper dives into specific topics, and a 15-minute coffee break. The workshop website: this https URL.
https://arxiv.org/abs/2409.11150