Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by human cognitive processes, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching for and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. With the lightweight GPT-4o-mini model, our framework achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a large improvement on the Chinese benchmark, performing on par with GPT-4o on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the fields of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
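As a rough illustration of the three-stage design, the sketch below wires perception, iterative knowledge search, and explicit reasoning into one loop; the `llm` and `web_search` callables are hypothetical stand-ins, not the paper's actual interfaces.

```python
# Minimal sketch of a LAD-style perceive -> search -> reason loop.
# `llm(prompt, image=None)` and `web_search(query)` are hypothetical
# stand-ins for a vision-capable chat model and a retrieval backend.

def understand_implication(image, llm, web_search, max_rounds=3):
    # (1) Perception: turn the image into multi-level textual descriptions.
    description = llm("Describe this image at the surface, object, and "
                      "symbolic levels.", image=image)

    # (2) Search: iteratively pull in cross-domain knowledge until the
    # ambiguity is resolved (or the round budget runs out).
    notes = []
    for _ in range(max_rounds):
        gap = llm(f"Description:\n{description}\nKnowledge so far:\n{notes}\n"
                  "Name one missing fact needed to explain the implication, "
                  "or reply NONE.")
        if gap.strip() == "NONE":
            break
        notes.append(web_search(gap))

    # (3) Reasoning: produce the context-aligned implication explicitly.
    return llm(f"Description:\n{description}\nKnowledge:\n{notes}\n"
               "Reason step by step, then state the image's implied meaning.")
```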
https://arxiv.org/abs/2505.17019
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to an 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS achieves a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
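The shift from homogeneous to heterogeneous MAS boils down to a per-agent model lookup keyed on domain and function. A minimal sketch, with an invented selection table for illustration; the entries are not the paper's measured picks:

```python
# Hypothetical (domain, function) -> model table in the spirit of X-MAS-Bench.
# The entries are illustrative; the paper derives its own from 1.7M evaluations.
BEST_MODEL = {
    ("math",   "planning"):  "qwen2.5-72b-instruct",
    ("math",   "reasoning"): "o3-mini",
    ("coding", "revision"):  "deepseek-coder-v2",
}

def pick_backbone(domain: str, function: str, default: str = "gpt-4o-mini") -> str:
    """Choose the LLM that should drive an agent with this role."""
    return BEST_MODEL.get((domain, function), default)

# Heterogeneous MAS: same agent topology, but each agent gets its own backbone.
agents = {fn: pick_backbone("math", fn)
          for fn in ("planning", "reasoning", "revision")}
print(agents)
```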
https://arxiv.org/abs/2505.16997
Single-agent LLMs hit hard limits--finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase after a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators--no ever-larger monoliths required.
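One plausible reading of such a blueprint on the Knapsack task: a deterministic controller owns the recursion, while typed leaf subtasks go to small solver and checker agents. The `ask` callable and the decomposition below are illustrative assumptions, not the paper's actual blueprint.

```python
# Illustrative KtR-style blueprint for 0/1 knapsack: the controller owns the
# recursion; typed leaf subtasks go to agents. `ask(subtask)` is a
# hypothetical LLM call returning a value for "solve" and a bool for "check".
from dataclasses import dataclass

@dataclass
class Subtask:
    kind: str      # "solve" returns a best total value; "check" returns a bool
    payload: dict

def best_value(items, capacity, ask):
    """items: list of (weight, value) pairs; returns the optimal total value."""
    if len(items) <= 2:
        # Leaf: small enough for a zero-shot solver agent, audited by a checker.
        answer = ask(Subtask("solve", {"items": items, "capacity": capacity}))
        assert ask(Subtask("check", {"items": items, "capacity": capacity,
                                     "answer": answer}))
        return answer
    (w, v), rest = items[0], items[1:]
    skip = best_value(rest, capacity, ask)                 # branch: drop item 0
    take = v + best_value(rest, capacity - w, ask) if w <= capacity else 0
    return max(skip, take)
```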
https://arxiv.org/abs/2505.16979
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides every instance with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT) but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at this https URL.
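Since every instance ships a runnable environment with developer-authored tests, an RL reward can be read straight off test outcomes. A minimal sketch, assuming a pytest-style suite; SWE-Dev's actual harness may differ:

```python
import re
import subprocess

def unit_test_reward(repo_dir: str, timeout: int = 300) -> float:
    """Pass fraction of an instance's developer-authored tests, usable as an
    RL reward. A sketch assuming a pytest-style suite inside the instance's
    runnable environment; SWE-Dev's real harness may differ."""
    try:
        proc = subprocess.run(["python", "-m", "pytest", "-q", "--tb=no"],
                              cwd=repo_dir, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0  # a hung suite earns no reward
    # pytest ends with a summary such as "3 passed, 2 failed in 0.42s".
    counts = {word: int(num) for num, word in
              re.findall(r"(\d+) (passed|failed|error)", proc.stdout)}
    total = sum(counts.values())
    return counts.get("passed", 0) / total if total else 0.0
```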
https://arxiv.org/abs/2505.16975
We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. The dataset and benchmark are on HuggingFace (this https URL), with code on GitHub (this https URL).
https://arxiv.org/abs/2505.16968
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find that GPT-4o's judgments show much higher agreement with humans than GPT-4o-mini's.
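A cascade keeps judgment costs down by letting a cheap model settle clear cases and escalating only uncertain ones to a stronger model. A hedged sketch of the relabeling idea; the judge callables and verdict vocabulary are illustrative:

```python
def relabel_hard_negatives(query, mined_negatives, cheap_judge, strong_judge):
    """Cascading LLM judgment over mined hard negatives.

    `cheap_judge` and `strong_judge` are hypothetical callables returning
    "relevant", "irrelevant", or "unsure"; only ambiguous cases escalate to
    the stronger (costlier) model, e.g., GPT-4o. Passages the cascade deems
    relevant are false negatives and are relabeled as positives.
    """
    relabeled = []
    for passage in mined_negatives:
        verdict = cheap_judge(query, passage)
        if verdict == "unsure":
            verdict = strong_judge(query, passage)
        label = "positive" if verdict == "relevant" else "negative"
        relabeled.append((passage, label))
    return relabeled
```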
https://arxiv.org/abs/2505.16967
Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at this https URL.
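Schematically, the training loop interleaves an expensive discrete attack with cheap continuous perturbations in embedding space. A hedged PyTorch-style sketch; the attack subroutines and mixing ratio are placeholders, not the paper's implementation:

```python
import random
import torch

def mixat_style_step(model, batch, discrete_attack, continuous_attack,
                     optimizer, p_discrete=0.25):
    """One adversarial-training step mixing attack types.

    `discrete_attack` rewrites actual prompt tokens (e.g., an adversarial
    suffix); `continuous_attack` perturbs input embeddings. Both callables
    and the mixing ratio are placeholders for the paper's choices.
    """
    if random.random() < p_discrete:
        adv_batch = discrete_attack(model, batch)             # stronger, slower
        loss = model(**adv_batch).loss
    else:
        embeds = model.get_input_embeddings()(batch["input_ids"])
        adv_embeds = continuous_attack(model, embeds, batch)  # faster
        loss = model(inputs_embeds=adv_embeds, labels=batch["labels"]).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```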
https://arxiv.org/abs/2505.16947
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
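For context, the classical fixed-coefficient Newton-Schulz iteration that Polar Express improves on is X_{k+1} = (3 X_k - X_k X_k^T X_k) / 2, which uses only matmuls but starts slowly when singular values are small. The baseline below illustrates this; Polar Express instead re-solves for the polynomial coefficients at every step, which is not shown here.

```python
import torch

def newton_schulz_polar(A: torch.Tensor, steps: int = 30) -> torch.Tensor:
    """Fixed-coefficient Newton-Schulz baseline for the polar factor of A.

    Matmul-only, hence GPU-friendly, but small singular values grow only by
    about 1.5x per step at first; this slow start is what Polar Express
    removes by re-optimizing the coefficients at every iteration.
    """
    X = A / A.norm()                      # all singular values now in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # X_{k+1} = (3 X - X X^T X) / 2
    return X                              # approximately the polar factor U

U = newton_schulz_polar(torch.randn(64, 64))
print((U @ U.T - torch.eye(64)).norm())   # small once converged
```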
https://arxiv.org/abs/2505.16932
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
https://arxiv.org/abs/2505.16931
During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
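Concretely, the described weighting reduces to w(t) = 1 / (freq(t) + ε)^α applied inside the token-level cross-entropy. A minimal PyTorch sketch; α, ε, and the mean normalization are illustrative choices:

```python
import torch
import torch.nn.functional as F

def power_law_decay_loss(logits, targets, token_freqs, alpha=0.5, eps=1e-8):
    """Frequency-weighted cross-entropy in the spirit of PDL.

    logits:      (batch, seq, vocab) model outputs
    targets:     (batch, seq) gold token ids
    token_freqs: (vocab,) token frequencies in the training corpus
    Each token's CE term is scaled by w(t) = (freq(t) + eps) ** -alpha, so
    rare, information-dense tokens contribute more to the gradient.
    """
    flat_targets = targets.flatten()
    ce = F.cross_entropy(logits.flatten(0, 1), flat_targets,
                         reduction="none")                   # per-token CE
    w = (token_freqs[flat_targets] + eps) ** (-alpha)        # power-law decay
    w = w / w.mean()                                         # keep loss scale
    return (w * ce).mean()
```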
https://arxiv.org/abs/2505.16900
This paper introduces a method for detecting inappropriately targeting language in online conversations by integrating crowd and expert annotations with ChatGPT. We focus on English conversation threads from Reddit, examining comments that target individuals or groups. Our approach involves a comprehensive annotation framework that labels a diverse data set for various target categories and specific target words within the conversational context. We perform a comparative analysis of annotations from human experts, crowd annotators, and ChatGPT, revealing strengths and limitations of each method in recognizing both explicit hate speech and subtler discriminatory language. Our findings highlight the significant role of contextual factors in identifying hate speech and uncover new categories of targeting, such as social belief and body image. We also address the challenges and subjective judgments involved in annotation and the limitations of ChatGPT in grasping nuanced language. This study provides insights for improving automated content moderation strategies to enhance online safety and inclusivity.
https://arxiv.org/abs/2505.16847
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
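The Recursive Narrative Bank amounts to an accumulating dialogue history that conditions each new utterance. A compact sketch of that loop; `llm` and `encode_frame` are hypothetical stand-ins for the language model and the pretrained vision-language encoder:

```python
# Sketch of per-scene dialogue generation conditioned on a growing narrative
# bank. `llm` and `encode_frame` are hypothetical stand-ins for the language
# model and the pretrained vision-language encoder.

def generate_story_dialogue(scenes, llm, encode_frame):
    narrative_bank = []                  # accumulated utterances across scenes
    for setting_prompt, behavior_prompt, frame in scenes:
        visual_ctx = encode_frame(frame)      # high-level semantic feature
        utterance = llm(
            f"Setting: {setting_prompt}\n"
            f"Character action: {behavior_prompt}\n"
            f"Visual context: {visual_ctx}\n"
            f"Dialogue so far: {narrative_bank}\n"
            "Write the character's next line, consistent with all of the above.")
        narrative_bank.append(utterance)      # later scenes condition on this
    return narrative_bank
```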
https://arxiv.org/abs/2505.16819
This paper presents a comprehensive synthesis of major breakthroughs in artificial intelligence (AI) over the past fifteen years, integrating historical, theoretical, and technological perspectives. It identifies key inflection points in AI's evolution by tracing the convergence of computational resources, data access, and algorithmic innovation. The analysis highlights how researchers enabled GPU-based model training, triggered a data-centric shift with ImageNet, simplified architectures through the Transformer, and expanded modeling capabilities with the GPT series. Rather than treating these advances as isolated milestones, the paper frames them as indicators of deeper paradigm shifts. By applying concepts from statistical learning theory such as sample complexity and data efficiency, the paper explains how researchers translated breakthroughs into scalable solutions and why the field must now embrace data-centric approaches. In response to rising privacy concerns and tightening regulations, the paper evaluates emerging solutions like federated learning, privacy-enhancing technologies (PETs), and the data site paradigm, which reframe data access and security. In cases where real-world data remains inaccessible, the paper also assesses the utility and constraints of mock and synthetic data generation. By aligning technical insights with evolving data infrastructure, this study offers strategic guidance for future AI research and policy development.
https://arxiv.org/abs/2505.16771
The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at this https URL.
https://arxiv.org/abs/2505.16770
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., the temperature τ) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM when aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates the PoLM's prediction accuracy on disagreement examples, causing a larger τ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large τ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g., GPT-4o) by up to 15.08% on common benchmarks.
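Mechanically, this is post-hoc temperature scaling fit only on examples where the two models agree. A hedged sketch of that select-then-fit step; the confidence-matching objective below is an illustrative reading, and the paper's exact unsupervised alignment may differ:

```python
import torch

def fit_tau_on_agreement(polm_logits, polm_preds, plm_preds, plm_conf,
                         steps=200, lr=0.05):
    """Align PoLM confidence with PLM confidence on agreement examples only,
    in the spirit of DACA. All tensor names are illustrative assumptions.

    polm_logits: (N, C) post-trained model logits
    polm_preds, plm_preds: (N,) argmax predictions of the two models
    plm_conf: (N,) the PLM's max probability, used as the confidence target
    """
    agree = polm_preds == plm_preds            # decouple disagreement examples
    logits, target = polm_logits[agree], plm_conf[agree]
    log_tau = torch.zeros(1, requires_grad=True)   # optimize log-temperature
    opt = torch.optim.Adam([log_tau], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        conf = torch.softmax(logits / log_tau.exp(), dim=-1).amax(dim=-1)
        ((conf - target) ** 2).mean().backward()   # match confidence levels
        opt.step()
    return log_tau.exp().item()                    # tau applied at inference
```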
https://arxiv.org/abs/2505.16690
Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies, due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations and repeated trials, as well as their qualitative outputs. Despite the lack of finetuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained on battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.
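In practice, the zero-shot check is a prompted yes/no query against a pretrained VQA model, with the prior knowledge folded into the prompt. A sketch using BLIP-2 from Hugging Face transformers; the model id and prompt wording are illustrative, not the paper's exact setup:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def is_anomalous(image_path: str) -> bool:
    """Zero-shot anomaly check via a prior-laden yes/no prompt."""
    image = Image.open(image_path).convert("RGB")
    # The prompt encodes prior knowledge: healthy cells heat uniformly.
    prompt = ("Question: A healthy battery cell shows a uniform temperature "
              "across its surface. Does this thermal image show an abnormal "
              "hot spot? Answer yes or no. Answer:")
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=3)
    answer = processor.batch_decode(out, skip_special_tokens=True)[0]
    return "yes" in answer.lower()
```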
https://arxiv.org/abs/2505.16674
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at this https URL.
https://arxiv.org/abs/2505.16661
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English ↔ Chinese translation tasks from the WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English ↔ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.
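The self-judged reward can be as simple as the policy model scoring its own candidate translations, with no reference or external reward model. A schematic sketch; the prompt and scoring scheme are illustrative assumptions:

```python
def self_rewards(source: str, candidates: list[str], policy) -> list[float]:
    """Reference-free, self-judged rewards in the spirit of SSR.

    `policy` is a hypothetical text-generation callable: the same model
    being trained judges its own candidate translations. The RL update
    that consumes these rewards is omitted.
    """
    rewards = []
    for cand in candidates:
        verdict = policy(
            f"Source (English): {source}\n"
            f"Candidate translation (Chinese): {cand}\n"
            "Score the translation's adequacy and fluency from 1 to 10. "
            "Reply with only the number.")
        try:
            rewards.append(float(verdict.strip()) / 10.0)  # normalize to [0, 1]
        except ValueError:
            rewards.append(0.0)       # unparseable self-judgment: no reward
    return rewards
```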
https://arxiv.org/abs/2505.16637
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. this https URL
https://arxiv.org/abs/2505.16630
Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
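The mechanism reduces to scaling a fixed steering vector by a function of the model's predictive uncertainty at the current input. A hedged sketch; the entropy normalization, scaling schedule, and layer choice are illustrative:

```python
import math
import torch

def entropy_scaled_steer(hidden, logits, steering_vec, base_strength=4.0):
    """Add a topic-maintenance steering vector scaled by input uncertainty.

    hidden:       (batch, d_model) activations at the chosen layer
    logits:       (batch, vocab) next-token logits used to measure entropy
    steering_vec: (d_model,) e.g., an on-topic minus off-topic direction
    Higher predictive entropy (an off-topic distractor) yields a stronger
    steer; confident on-topic inputs are left nearly untouched.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)     # (batch,)
    scale = base_strength * entropy / math.log(logits.shape[-1])  # normalized
    return hidden + scale.unsqueeze(-1) * steering_vec
```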
https://arxiv.org/abs/2505.16526