This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.
https://arxiv.org/abs/2503.10634
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build their inference frameworks upon large language models (LLMs) for specific tasks, differ substantially in their overall pipelines, and fail to generalize across different types of goals. Toward universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object categories, instance images, and text descriptions. We also convert the agent's observations into an online-maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage an LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and the goal graph at each time step and propose different strategies to generate the long-term exploration goal according to the matching state. When there is no match, the agent first iteratively searches for a subgraph of the goal. With a partial match, the agent then utilizes coordinate projection and anchor-pair alignment to infer the goal location. Finally, for a perfect match, scene graph correction and goal verification are applied. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on the three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
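The staged dispatch on matching states can be sketched compactly. Below is a minimal, hypothetical Python illustration of that control flow; the plain-dict graphs, `match_graphs`, and the three strategy callbacks are stand-ins for the paper's components, not the UniGoal implementation.

```python
# Minimal sketch of the matching-state dispatch described above.
# Graphs are plain adjacency dicts; `match_graphs` and the three strategy
# functions are hypothetical stand-ins for the paper's components.

def match_graphs(scene_graph: dict, goal_graph: dict) -> set:
    """Return goal-graph nodes that are also present in the scene graph."""
    return set(goal_graph) & set(scene_graph)

def choose_long_term_goal(scene_graph, goal_graph,
                          search_subgraph, infer_from_anchors, verify_goal):
    matched = match_graphs(scene_graph, goal_graph)
    if not matched:
        # Zero match: iteratively search for a subgraph of the goal.
        return search_subgraph(scene_graph, goal_graph)
    if matched < set(goal_graph):
        # Partial match: coordinate projection + anchor-pair alignment.
        return infer_from_anchors(scene_graph, goal_graph, matched)
    # Perfect match: correct the scene graph and verify the goal.
    return verify_goal(scene_graph, goal_graph)
```

In the full system, the blacklist mechanism would wrap such a dispatcher, masking long-term goals that repeatedly fail so the agent can switch stages robustly.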
https://arxiv.org/abs/2503.10630
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that integrating discretized speech input as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
https://arxiv.org/abs/2503.10620
We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
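To make the breadth-first expansion concrete, here is a small hedged sketch of the search loop; `query_model`, `compliance_score`, and `mutate_prompt` are hypothetical interfaces, and the branching factor and turn limit are illustrative rather than the paper's settings.

```python
# Hypothetical sketch of a breadth-first multi-turn search in the spirit of
# the framework above. Each queue item is a full dialogue so far; a partial
# leak in one response is re-injected into several follow-up prompts.
from collections import deque

def breadth_first_attack(seed_prompt, query_model, compliance_score,
                         mutate_prompt, branching=3, max_turns=5):
    queue = deque([[("user", seed_prompt)]])
    while queue:
        dialogue = queue.popleft()
        response = query_model(dialogue)
        score = compliance_score(response)            # partial-compliance signal
        if score >= 1.0:
            return dialogue + [("assistant", response)]   # fully disallowed output
        if len(dialogue) // 2 >= max_turns:
            continue
        # Branch: feed the partial leak back into multiple follow-up prompts.
        for follow_up in mutate_prompt(response, n=branching):
            queue.append(dialogue + [("assistant", response), ("user", follow_up)])
    return None
```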
https://arxiv.org/abs/2503.10619
Text-to-image models like Stable Diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimates of the capabilities and costs of tools to determine which to apply in each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that leverages LLMs to create a subtask tree, which helps prune a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), where a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover from failures quickly and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models or agents in terms of both cost and quality, and performs versatile trade-offs upon user preference.
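The core of the third stage is an A* search whose edge costs blend tool cost and quality. The sketch below illustrates one way such a search could look on a per-subtask tool graph; `tools_for`, `cost`, `quality`, and `heuristic` are assumed interfaces, and the blending weight `alpha` is illustrative rather than the paper's formulation.

```python
# Illustrative A*-style search over a pruned tool graph, in the spirit of the
# three-stage approach above. This is a sketch, not the CoSTA* implementation.
import heapq, itertools

def a_star_tool_path(subtasks, tools_for, cost, quality, heuristic, alpha=0.5):
    counter = itertools.count()            # tie-breaker so the heap never compares paths
    frontier = [(heuristic(0), next(counter), 0, 0.0, [])]
    while frontier:
        _, _, i, g, path = heapq.heappop(frontier)
        if i == len(subtasks):
            return path                    # one tool chosen per subtask
        for tool in tools_for(subtasks[i]):
            # Blend monetary/time cost with (1 - quality) so both matter.
            step = alpha * cost(tool, subtasks[i]) + \
                   (1 - alpha) * (1.0 - quality(tool, subtasks[i]))
            heapq.heappush(frontier,
                           (g + step + heuristic(i + 1), next(counter),
                            i + 1, g + step, path + [tool]))
    return None
```

In the full system, a VLM check on each subtask's output would update `cost` and `quality` before the search is re-run, which is what lets it recover from failures.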
https://arxiv.org/abs/2503.10613
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at this https URL.
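As a rough illustration of the two ingredients (learning a truthful direction and intervening at decode time), the following sketch uses a difference-of-means direction and a simple additive shift; it is an assumption-laden stand-in, not the released TruthPrInt code.

```python
# Sketch under simplifying assumptions: the "truthful direction" is taken as
# the difference of mean hidden states between truthful and hallucinated
# tokens, and the intervention adds a scaled copy of that direction to the
# hidden state before decoding each token.
import torch

def truthful_direction(truthful_hidden: torch.Tensor,
                       hallucinated_hidden: torch.Tensor) -> torch.Tensor:
    """Both inputs: (num_tokens, hidden_dim) hidden states from the LVLM."""
    direction = truthful_hidden.mean(0) - hallucinated_hidden.mean(0)
    return direction / direction.norm()

def intervene(hidden_state: torch.Tensor, direction: torch.Tensor,
              strength: float = 1.0) -> torch.Tensor:
    """Shift a per-token hidden state toward the truthful direction."""
    return hidden_state + strength * direction
```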
https://arxiv.org/abs/2503.10602
Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such progress is heavily dependent on large-scale, high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions whose pseudo-answers for the unlabeled data are produced by a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving-scene question answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with the full dataset reach 60.68% on the DriveLM benchmark.
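A hedged sketch of the pseudo-labeling step with a self-consistency filter is shown below; `model.answer`, the question templates, and the voting threshold are illustrative assumptions rather than the paper's exact refinement procedure.

```python
# Hypothetical sketch: sample several answers per template-based question and
# keep only those that are self-consistent across samples.
from collections import Counter

def pseudo_annotate(model, frames, question_templates, n_samples=5, min_votes=3):
    pseudo_labels = []
    for question in question_templates:            # template-based scene questions
        answers = [model.answer(frames, question) for _ in range(n_samples)]
        top_answer, votes = Counter(answers).most_common(1)[0]
        if votes >= min_votes:                     # keep only self-consistent answers
            pseudo_labels.append({"question": question, "answer": top_answer})
    return pseudo_labels
```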
https://arxiv.org/abs/2503.10586
With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at this http URL.
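To illustrate how a keypoint-based target specification can become a cost function for model-based planning, here is a minimal random-shooting sketch; `dynamics` and `sample_actions` are placeholder interfaces, and KUDA's actual optimizer and cost terms may differ.

```python
# Sketch of turning keypoint targets into a planning cost and optimizing
# action sequences by random shooting with a learned dynamics model.
import numpy as np

def keypoint_cost(pred_keypoints: np.ndarray, target_keypoints: np.ndarray) -> float:
    """Mean squared distance between predicted and target keypoints (N, 2 or 3)."""
    return float(np.mean(np.sum((pred_keypoints - target_keypoints) ** 2, axis=-1)))

def plan(dynamics, keypoints, target_keypoints, sample_actions,
         horizon=10, num_candidates=256):
    best_cost, best_actions = float("inf"), None
    for _ in range(num_candidates):                  # random shooting
        actions = sample_actions(horizon)
        state = keypoints
        for a in actions:                            # roll out the learned dynamics
            state = dynamics(state, a)
        cost = keypoint_cost(state, target_keypoints)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions
```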
https://arxiv.org/abs/2503.10546
3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
https://arxiv.org/abs/2503.10529
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
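Probing of this kind is commonly implemented with a lightweight classifier per layer. The sketch below shows one such layer-wise probe using scikit-learn; the data preparation and the choice of logistic regression are assumptions for illustration, not necessarily the paper's setup.

```python
# Layer-wise probing sketch for discourse relation classification, assuming
# hidden states have already been extracted per layer.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(hidden_by_layer, labels):
    """hidden_by_layer: list of (num_examples, hidden_dim) arrays, one per layer."""
    scores = []
    for layer_repr in hidden_by_layer:
        X_tr, X_te, y_tr, y_te = train_test_split(
            layer_repr, labels, test_size=0.2, random_state=0, stratify=labels)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(probe.score(X_te, y_te))     # accuracy per layer
    return scores                                  # peaks expected in middle layers
```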
https://arxiv.org/abs/2503.10515
Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLMs closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at this https URL.
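The following is a simplified, hypothetical rendering of the two stages (information-guided pruning followed by merging); the attention-output-norm score and the averaging merge are illustrative proxies, not the actual IPGS strategy.

```python
# Illustrative two-stage compression of visual tokens: score tokens by the
# norm of their attention output (as an information proxy), prune the lowest
# scoring ones, then merge each pruned token into its most similar kept token.
import torch
import torch.nn.functional as F

def carve_tokens(tokens: torch.Tensor, attn_out: torch.Tensor, keep_ratio: float):
    """tokens, attn_out: (num_tokens, dim); returns compressed (k, dim) tokens."""
    scores = attn_out.norm(dim=-1)                       # information proxy
    k = max(1, int(keep_ratio * tokens.size(0)))
    keep_idx = scores.topk(k).indices
    kept_set = set(keep_idx.tolist())
    drop_idx = torch.tensor([i for i in range(tokens.size(0)) if i not in kept_set])
    kept = tokens[keep_idx].clone()
    if drop_idx.numel() > 0:
        sim = F.normalize(tokens[drop_idx], dim=-1) @ F.normalize(kept, dim=-1).T
        nearest = sim.argmax(dim=-1)                     # target kept token per dropped token
        for d, t in zip(drop_idx.tolist(), nearest.tolist()):
            kept[t] = (kept[t] + tokens[d]) / 2          # simple average merge
    return kept
```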
https://arxiv.org/abs/2503.10501
Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
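A minimal sketch of the two prompting strategies used in the evaluation is given below; the prompt wording, `demos`, and `ask_model` are illustrative assumptions rather than the benchmark's exact harness.

```python
# Sketch of 5-shot chain-of-thought vs. zero-shot prompting for multiple-choice
# evaluation. `demos` would hold five worked examples in the target language.

def build_prompt(question, options, demos=None):
    option_text = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    if demos:  # 5-shot CoT: prepend worked examples that end with a final answer
        shots = "\n\n".join(demos[:5])
        return f"{shots}\n\nQuestion: {question}\n{option_text}\nLet's think step by step."
    # Zero-shot: the question alone, still eliciting a single answer letter.
    return f"Question: {question}\n{option_text}\nAnswer with a single letter."

def evaluate(ask_model, dataset, demos=None):
    correct = sum(
        ask_model(build_prompt(ex["question"], ex["options"], demos)).strip()
        .startswith(ex["answer"]) for ex in dataset)
    return correct / len(dataset)
```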
https://arxiv.org/abs/2503.10497
LLMs have paved the way for truly simple document-level machine translation, but challenges such as omission errors remain. In this paper, we study a simple method for handling document-level machine translation, by leveraging previous contexts in a multi-turn conversational manner. Specifically, by decomposing documents into segments and iteratively translating them while maintaining previous turns, this method ensures coherent translations without additional training, and can fully re-use the KV cache of previous turns, thus minimizing computational overhead. We further propose a "source-primed" method that first provides the whole source document before multi-turn translation. We empirically show that this multi-turn method outperforms both translating entire documents in a single turn and translating each segment independently, according to multiple automatic metrics with representative LLMs, establishing a strong baseline for document-level translation using LLMs.
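The multi-turn, source-primed scheme can be sketched as ordinary chat-message bookkeeping; `translate` is a placeholder for any chat-style LLM call, and whether the KV cache is actually reused depends on the serving engine's prefix caching.

```python
# Minimal sketch of source-primed multi-turn translation: prime the chat with
# the whole source document, then translate segment by segment while keeping
# all earlier turns in the running message list.

def source_primed_translate(translate, segments, src_lang="English", tgt_lang="German"):
    messages = [{"role": "user",
                 "content": f"Here is the full {src_lang} document for context:\n"
                            + "\n".join(segments)}]
    messages.append({"role": "assistant", "content": "Understood."})
    outputs = []
    for seg in segments:                                   # translate one segment per turn
        messages.append({"role": "user",
                         "content": f"Translate this segment into {tgt_lang}:\n{seg}"})
        out = translate(messages)                          # previous turns stay in context
        messages.append({"role": "assistant", "content": out})
        outputs.append(out)
    return " ".join(outputs)
```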
https://arxiv.org/abs/2503.10494
Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.
https://arxiv.org/abs/2503.10486
Utilizing syntactic tools such as part-of-speech (POS) tagging has helped us understand sentence structures and their distribution across diverse corpora, but such tagging is quite complex and poses a challenge in natural language processing (NLP). This study focuses on understanding sentence-structure balance (the harmonious usage of nouns, verbs, determiners, etc.) without relying on such tools. It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent the text of 11 corpora from various sources, examines their lexical category alignment after compressing these representations with PCA, and analyzes the results through histograms and normality tests such as the Shapiro-Wilk and Anderson-Darling tests. By focusing on ASCII codes, this approach simplifies text processing; it does not replace syntactic tools but complements them as a resource-efficient tool for assessing text balance. The story generated by Grok shows near-normality, indicating balanced sentence structures in LLM outputs, whereas 4 of the remaining 10 corpora pass the normality tests. Further research could explore potential applications in text quality evaluation and style analysis, with syntactic integration for broader tasks.
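A small end-to-end sketch of the described pipeline is given below, using NumPy, scikit-learn, and SciPy; the chunking scheme and component count are illustrative choices, not the paper's exact settings.

```python
# Encode a (sufficiently long) text as ASCII codes, compress fixed-length
# chunks with PCA, and run Shapiro-Wilk and Anderson-Darling normality tests
# on the first principal component.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import shapiro, anderson

def ascii_balance_test(text: str, chunk: int = 64):
    codes = np.array([ord(c) for c in text if ord(c) < 128], dtype=float)
    n = (len(codes) // chunk) * chunk
    X = codes[:n].reshape(-1, chunk)               # one row per fixed-length chunk
    pc1 = PCA(n_components=1).fit_transform(X).ravel()
    sw_stat, sw_p = shapiro(pc1)
    ad = anderson(pc1, dist="norm")
    return {"shapiro_p": sw_p,                     # p > 0.05 suggests near-normality
            "anderson_stat": ad.statistic,
            "anderson_critical_5pct": ad.critical_values[2]}
```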
https://arxiv.org/abs/2503.10470
The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code.
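As a toy illustration of composing a nested problem from unit functions and a call graph, consider the following sketch; the unit pool and composition rule are invented for illustration and are far simpler than DynaCode's generators.

```python
# Toy composition of a nested code problem: each node is a simple "unit"
# function, and the call-graph edges determine the nesting order in `solve`.

UNITS = {
    "double":    "def double(x):\n    return 2 * x\n",
    "increment": "def increment(x):\n    return x + 1\n",
    "square":    "def square(x):\n    return x * x\n",
}

def compose_problem(call_graph):
    """call_graph: list of (caller, callee) edges forming a simple chain over UNITS."""
    code = "".join(UNITS[name] for name in UNITS)
    chain = [caller for caller, _ in call_graph] + [call_graph[-1][1]]
    body = "x"
    for name in reversed(chain):                   # innermost call first
        body = f"{name}({body})"
    code += f"\ndef solve(x):\n    return {body}\n"
    return code

print(compose_problem([("double", "increment"), ("increment", "square")]))
```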
https://arxiv.org/abs/2503.10452
Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
https://arxiv.org/abs/2503.10437
In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam prediction framework leveraging large language models (LLMs) to address the challenges of high training overhead and latency in mmWave communication systems. By combining computer vision (CV) with LLMs' cross-modal reasoning capabilities, the framework extracts user equipment (UE) positional features from RGB images and aligns visual-temporal features with LLMs' semantic space through reprogramming techniques. Evaluated on a realistic vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01% top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks, significantly outperforming traditional deep learning models. In few-shot prediction scenarios, the performance degradation is limited to 12.56% (top-1) and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction capability.
https://arxiv.org/abs/2503.10432
We study the capabilities of Large Language Models (LLMs) on binary relations, a ubiquitous mathematical concept employed in most reasoning, math, and logic benchmarks. This work focuses on equality, inequality, and inclusion, along with the properties they satisfy, such as ir/reflexivity, a/symmetry, transitivity, and logical complexity (e.g., the number of reasoning "hops"). We propose an alternative to in-context learning that trains only the representations of newly introduced tokens, namely out-of-context representation learning. This method mitigates linguistic biases already present in a model and, unlike in-context learning, does not rely on external information or illustrations. We argue that out-of-context representation learning is a better alternative to in-context learning and fine-tuning for evaluating the capabilities of LLMs on logic tasks that are the building blocks of more complex reasoning benchmarks.
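In practice, training only the representations of newly introduced tokens can be done by freezing all parameters and masking embedding gradients down to the new rows. The sketch below shows one way to do this with Hugging Face Transformers; the base model, the new token names, and the gradient-masking trick are illustrative assumptions, not the authors' code.

```python
# Sketch of out-of-context representation learning: add new tokens and let
# gradients flow only into their embedding rows, with everything else frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

new_tokens = ["<rel_R>", "<obj_a>", "<obj_b>"]          # freshly introduced symbols
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

for p in model.parameters():                            # freeze everything
    p.requires_grad = False
emb = model.get_input_embeddings()
emb.weight.requires_grad = True                         # re-enable the embedding matrix

new_ids = tokenizer.convert_tokens_to_ids(new_tokens)
mask = torch.zeros_like(emb.weight)
mask[new_ids] = 1.0                                     # gradients reach only the new rows
emb.weight.register_hook(lambda grad: grad * mask)
```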
https://arxiv.org/abs/2503.10408
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: this https URL
https://arxiv.org/abs/2503.10406