We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. The spatial conditioning scheme is adaptive and customizable: it allows different conditional inputs to be weighted differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research in the field, we open-source our models and code at this https URL.
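As a rough illustration of the adaptive spatial conditioning idea, the sketch below blends several control-modality feature maps with per-pixel weights; the function name, array shapes, and weighting rule are hypothetical, not Cosmos-Transfer's actual interface.

```python
import numpy as np

def blend_control_signals(controls: dict, weights: dict) -> np.ndarray:
    """Blend per-modality control features with per-pixel weights.

    controls: modality name -> feature map of shape (H, W, C)
    weights:  modality name -> weight map of shape (H, W); normalized per pixel.
    """
    names = list(controls)
    stacked = np.stack([controls[n] for n in names])            # (M, H, W, C)
    w = np.stack([weights[n] for n in names]).astype(float)      # (M, H, W)
    w = w / np.clip(w.sum(axis=0, keepdims=True), 1e-8, None)    # per-location normalization
    return (stacked * w[..., None]).sum(axis=0)                  # (H, W, C)

# Example: emphasize depth on the left half of the frame, edges on the right half.
H, W, C = 4, 6, 8
controls = {"depth": np.random.rand(H, W, C), "edge": np.random.rand(H, W, C)}
weights = {"depth": np.zeros((H, W)), "edge": np.zeros((H, W))}
weights["depth"][:, : W // 2] = 1.0
weights["edge"][:, W // 2 :] = 1.0
print(blend_control_signals(controls, weights).shape)  # (4, 6, 8)
```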
https://arxiv.org/abs/2503.14492
Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, which are: ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
https://arxiv.org/abs/2503.14484
LLMs often adopt an assertive language style even when making false claims. Such "overconfident hallucinations" mislead users and erode trust. Being able to express in language the actual degree of uncertainty around a claim is therefore of great importance. We find that "verbal uncertainty" is governed by a single linear feature in the representation space of LLMs, and show that this has only moderate correlation with the actual "semantic uncertainty" of the model. We apply this insight and show that (1) the mismatch between semantic and verbal uncertainty is a better predictor of hallucinations than semantic uncertainty alone, and (2) we can intervene on verbal uncertainty at inference time to reduce hallucinations on short-form answers, achieving an average relative reduction of 32%.
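A minimal sketch of what an inference-time intervention on a single linear "verbal uncertainty" feature could look like, assuming the direction has already been identified; names and shapes are illustrative, not the paper's code.

```python
import numpy as np

def steer_verbal_uncertainty(hidden: np.ndarray,
                             direction: np.ndarray,
                             target: float) -> np.ndarray:
    """Shift hidden states along a unit-norm 'verbal uncertainty' direction.

    hidden:    (seq_len, d_model) activations at some layer
    direction: (d_model,) linear feature assumed to encode verbal uncertainty
    target:    desired projection value onto that direction
    """
    direction = direction / np.linalg.norm(direction)
    current = hidden @ direction                                  # (seq_len,)
    return hidden + (target - current)[:, None] * direction[None, :]

# Toy usage: push activations toward a more uncertain verbal style.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))
d = rng.normal(size=16)
h_steered = steer_verbal_uncertainty(h, d, target=2.0)
print(np.round(h_steered @ (d / np.linalg.norm(d)), 3))  # all projections ~2.0
```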
https://arxiv.org/abs/2503.14477
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce the four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
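The two ingredients named in the acronym can be pictured in a few lines: a PPO-style surrogate with decoupled lower/upper clip ranges, and a dynamic-sampling filter that drops prompt groups whose sampled rewards are all identical and hence carry no gradient signal. The snippet is a schematic reading of the abstract with illustrative clip values, not the open-sourced implementation.

```python
import numpy as np

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with separate lower and upper clip ranges."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def dynamic_sampling_filter(groups):
    """Keep only prompt groups whose sampled rewards are not all identical."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

# Toy usage: two prompts, each with 4 sampled answers scored 0/1.
groups = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1]},   # filtered out: no learning signal
    {"prompt": "p2", "rewards": [0, 1, 0, 1]},   # kept
]
print([g["prompt"] for g in dynamic_sampling_filter(groups)])          # ['p2']
print(decoupled_clip_objective(np.array([0.5, 1.5]), np.array([1.0, 1.0])))  # [0.5 1.28]
```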
https://arxiv.org/abs/2503.14476
We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at this https URL, and our training and inference code at this https URL all under the Apache 2.0 License.
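For intuition, the sketch below runs a generic delta-rule recurrence with vector-valued (per-channel) gating and an in-context learning rate over a fast-weight state. It conveys the general shape of such an update, not RWKV-7's exact formulation.

```python
import numpy as np

def delta_rule_scan(keys, values, decay, lr):
    """Generic delta-rule recurrence with vector-valued gating.

    keys, values: (T, d) input sequences
    decay:        (T, d) per-channel state decay in (0, 1)  (vector-valued gate)
    lr:           (T,)   per-step in-context learning rate

    The state S is a (d, d) fast-weight matrix; each step partially replaces the
    value stored under the current key with the new value (a delta update).
    """
    T, d = keys.shape
    S = np.zeros((d, d))
    outputs = []
    for t in range(T):
        k, v = keys[t], values[t]
        S = S * decay[t][None, :]                  # per-channel forgetting
        recalled = S @ k                           # what the state currently returns for k
        S = S + lr[t] * np.outer(v - recalled, k)  # move the stored value toward v
        outputs.append(S @ k)
    return np.stack(outputs)

rng = np.random.default_rng(1)
T, d = 8, 4
out = delta_rule_scan(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                      decay=np.full((T, d), 0.95), lr=np.full(T, 0.5))
print(out.shape)  # (8, 4)
```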
https://arxiv.org/abs/2503.14456
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
https://arxiv.org/abs/2503.14445
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
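A toy version of the "play" loop: probe a tool with candidate inputs, keep successful calls as usage examples and failed ones as documentation of constraints. The callables and dictionary fields are hypothetical stand-ins for the framework's components.

```python
def play_with_tool(tool, candidate_inputs, max_trials=10):
    """Probe a tool with candidate inputs and collect working usage examples.

    `tool` is any callable; failures document what the tool rejects, and
    successes become few-shot usage examples for the LLM prompt.
    """
    examples, failures = [], []
    for args in candidate_inputs[:max_trials]:
        try:
            result = tool(**args)
            examples.append({"input": args, "output": result})
        except Exception as exc:                      # exploration loop, so catch broadly
            failures.append({"input": args, "error": str(exc)})
    return examples, failures

# Toy tool with an undocumented constraint (rate must be positive).
def currency_convert(amount, rate):
    if rate <= 0:
        raise ValueError("rate must be > 0")
    return round(amount * rate, 2)

examples, failures = play_with_tool(
    currency_convert,
    [{"amount": 10, "rate": 0.9}, {"amount": 5, "rate": -1}],
)
print(len(examples), len(failures))  # 1 1
```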
https://arxiv.org/abs/2503.14432
Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulated and real scenarios.
https://arxiv.org/abs/2503.14329
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient Video VAEs. Our models and code are available at this https URL.
https://arxiv.org/abs/2503.14325
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents, that is faster and more effective at recovering from sub-optimal decisions compared to baselines. While traditional agents either follow linear trajectories or rely on random sampling for scaling compute, our approach DARS works by branching out a trajectory at certain key decision points by taking an alternative action given the history of the trajectory and execution feedback of the previous attempt from that point. We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
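A schematic of the branching idea: at selected decision points, re-run the agent from the trajectory prefix with a hint carrying the earlier attempt's execution feedback. Function names and the hint format are invented for illustration, not DARS's actual interface.

```python
import random

def dars_expand(run_agent, trajectory, feedback, branch_points, width=2):
    """Branch an agent trajectory at key decision points.

    run_agent(prefix, hint) -> new trajectory; `hint` carries the execution
    feedback of the earlier attempt so the alternative action can avoid it.
    """
    branches = []
    for idx in branch_points:
        prefix = trajectory[:idx]
        for _ in range(width):
            hint = {"failed_action": trajectory[idx], "feedback": feedback}
            branches.append(run_agent(prefix, hint))
    return branches

# Toy agent: replaces the failed action with a random alternative, then re-tests.
ACTIONS = ["edit_file", "run_tests", "grep_repo", "revert_patch"]
def toy_agent(prefix, hint):
    alternative = random.choice([a for a in ACTIONS if a != hint["failed_action"]])
    return prefix + [alternative, "run_tests"]

traj = ["grep_repo", "edit_file", "run_tests"]
print(dars_expand(toy_agent, traj, feedback="tests failed", branch_points=[1]))
```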
https://arxiv.org/abs/2503.14269
Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate responses to QAs. However, most RAG methods fall short in addressing multi-step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs' ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The proposed approach is particularly suited for scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Additionally, three new datasets: weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW, are formed to evaluate KG-IRAG's performance, demonstrating its potential beyond traditional RAG applications.
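The iterative loop can be pictured as follows: retrieve facts from the KG, let the LLM either answer or request the next retrieval, and repeat. The callables, their return contract, and the toy weather/traffic KG are illustrative assumptions, not the paper's API.

```python
def kg_iterative_answer(question, query_kg, llm_step, max_rounds=4):
    """Iteratively retrieve from a knowledge graph until the LLM can answer.

    query_kg(request) -> list of facts
    llm_step(question, facts) -> {'answer': ...} or {'need': next retrieval request}
    """
    facts, request = [], question
    for _ in range(max_rounds):
        facts.extend(query_kg(request))
        step = llm_step(question, facts)
        if "answer" in step:
            return step["answer"], facts
        request = step["need"]
    return None, facts  # give up after max_rounds

# Toy run: a temporal question answered after two extra retrievals.
kg = {"rain hours": ["rain 08:00-10:00"], "traffic": ["jam 08:00-09:30"]}
def toy_query(req): return kg.get(req, [])
def toy_llm(q, facts):
    if not any("rain" in f for f in facts):
        return {"need": "rain hours"}
    if not any("jam" in f for f in facts):
        return {"need": "traffic"}
    return {"answer": "leave after 10:00"}

print(kg_iterative_answer("best departure time?", toy_query, toy_llm)[0])
```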
https://arxiv.org/abs/2503.14234
The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
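A greedy toy variant of the idea: a draft model proposes a segment that ends on a syntactically significant Verilog token (e.g., ";" or "endmodule"), and the verifier keeps the agreeing prefix. Real speculative decoding verifies probabilistically; this simplified sketch only shows how decoding stops can be aligned with syntax, and all callables are stand-ins.

```python
def syntax_aligned_speculative_decode(draft_step, verify_step, prompt,
                                      stop_tokens=(";", "end", "endmodule"),
                                      max_tokens=64):
    """Greedy speculative decoding whose draft segments end on Verilog-style
    syntactic boundaries. draft_step/verify_step(tokens) -> next token."""
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1) Draft until a syntactically significant token is emitted.
        segment = []
        while True:
            tok = draft_step(tokens + segment)
            segment.append(tok)
            if tok in stop_tokens or len(segment) >= 8:
                break
        # 2) Verify the segment with the large model; keep the agreeing prefix.
        for tok in segment:
            expected = verify_step(tokens)
            if expected != tok:
                tokens.append(expected)   # correct the first divergence, then redraft
                break
            tokens.append(tok)
        if tokens[-1] == "endmodule":
            break
    return tokens

# Toy usage: draft and verifier both emit from the same fixed token script.
script = ["assign", "y", "=", "a", "&", "b", ";", "endmodule"]
def toy_model(tokens):
    return script[len(tokens)] if len(tokens) < len(script) else "endmodule"

print(" ".join(syntax_aligned_speculative_decode(toy_model, toy_model, prompt=[])))
```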
https://arxiv.org/abs/2503.14153
Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at this https URL.
https://arxiv.org/abs/2503.14140
Steering the behavior of Large Language Models (LLMs) remains a challenge, particularly in engineering applications where precision and reliability are critical. While fine-tuning and prompting methods can modify model behavior, they lack the dynamic and exact control necessary for engineering applications. Inference-time intervention techniques provide a promising alternative, allowing targeted adjustments to LLM outputs. In this work, we demonstrate how interventions enable fine-grained control for automating the usually time-intensive requirement verification process in Model-Based Systems Engineering (MBSE). Using two early-stage Capella SysML models of space missions with associated requirements, we apply the intervened LLMs to reason over a graph representation of the model to determine whether a requirement is fulfilled. Our method achieves robust and reliable outputs, significantly improving over both a baseline model and a fine-tuning approach. By identifying and modifying as few as one to three specialised attention heads, we can significantly change the model's behavior. When combined with self-consistency, this allows us to achieve perfect precision on our holdout test set.
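Conceptually, the intervention amounts to editing the outputs of one to three specific attention heads while leaving the rest untouched, e.g. via a forward hook. The sketch below does this on a plain array of per-head activations, with hypothetical head indices and edit values.

```python
import numpy as np

def intervene_on_heads(head_outputs, edits):
    """Apply targeted edits to a few attention heads' outputs.

    head_outputs: (n_heads, seq_len, d_head) activations from one layer
    edits: {head_index: (scale, bias_vector)} for the specialised heads;
           all other heads pass through unchanged.
    """
    out = head_outputs.copy()
    for h, (scale, bias) in edits.items():
        out[h] = scale * out[h] + bias
    return out

# Toy usage: silence head 3 and push head 7 toward a fixed steering direction.
rng = np.random.default_rng(0)
acts = rng.normal(size=(12, 5, 8))
steering = rng.normal(size=8)
edited = intervene_on_heads(acts, {3: (0.0, np.zeros(8)), 7: (1.0, 0.5 * steering)})
print(np.allclose(edited[0], acts[0]), np.allclose(edited[3], 0.0))  # True True
```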
https://arxiv.org/abs/2503.14130
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at this https URL.
https://arxiv.org/abs/2503.14097
Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
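The decoding order itself is easy to picture: cells of the (frame, position) token grid that lie on the same diagonal can be generated in parallel. The helper below enumerates such wavefronts; the stride knob is an illustrative stand-in for the paper's speed/quality trade-off.

```python
def diagonal_schedule(num_frames, tokens_per_frame, frame_stride=1):
    """Group (frame, position) cells into diagonal wavefronts.

    Cells in the same wavefront can be decoded in parallel; a larger
    frame_stride delays later frames more, trading speed for quality.
    """
    waves = {}
    for f in range(num_frames):
        for p in range(tokens_per_frame):
            waves.setdefault(frame_stride * f + p, []).append((f, p))
    return [waves[d] for d in sorted(waves)]

schedule = diagonal_schedule(num_frames=3, tokens_per_frame=4)
for wave in schedule:
    print(wave)
# 6 parallel steps instead of 12 sequential ones:
# [(0, 0)], [(0, 1), (1, 0)], [(0, 2), (1, 1), (2, 0)], ...
```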
https://arxiv.org/abs/2503.14070
Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such 'Kolmogorov compression' is uncomputable, and code-generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code-generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly: both GPT-4o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.
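A minimal scorer in the spirit of the test: a candidate program is valid only if it reproduces the sequence exactly, and is then judged by its length relative to the raw data. Measuring both in bytes and expecting the program to define an `output` variable are simplifying assumptions, not the benchmark's exact protocol.

```python
def kt_score(sequence, program_src):
    """Score a candidate program for the compression-as-intelligence test.

    A program counts only if executing it reproduces the sequence exactly; its
    compression rate is program size over raw sequence size, both in bytes here.
    """
    namespace = {}
    exec(program_src, namespace)                 # the program must define `output`
    if namespace.get("output") != sequence:
        return {"valid": False, "rate": None}
    rate = len(program_src.encode()) / max(len(bytes(sequence)), 1)
    return {"valid": True, "rate": rate}

seq = list(range(0, 200, 2))                     # 100 even numbers, 100 raw bytes
prog = "output = list(range(0, 200, 2))"
print(kt_score(seq, prog))                       # valid, rate well below 1.0
```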
https://arxiv.org/abs/2503.13992
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of at most 2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
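As a rough picture of "more frames, fewer tokens per frame", the snippet pools visual tokens across small groups of frames inside a 16-frame clip; the actual model uses a learned module rather than simple average pooling, so this is only an assumption-laden illustration.

```python
import numpy as np

def compress_clip_tokens(frame_tokens, group=4):
    """Average-pool visual tokens across small groups of frames inside a
    1-second, 16-frame clip, shrinking the token count while keeping dynamics.

    frame_tokens: (16, n_tokens, d) features for one clip
    returns:      (16 // group, n_tokens, d)
    """
    f, n, d = frame_tokens.shape
    return frame_tokens.reshape(f // group, group, n, d).mean(axis=1)

clip = np.random.rand(16, 196, 64)         # 16 FPS clip of patch-token features
print(compress_clip_tokens(clip).shape)     # (4, 196, 64) -> 4x fewer frame slots
```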
https://arxiv.org/abs/2503.13956
Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.
https://arxiv.org/abs/2503.13940
We introduce KANITE, a framework leveraging Kolmogorov-Arnold Networks (KANs) for Individual Treatment Effect (ITE) estimation under multiple treatments setting in causal inference. By utilizing KAN's unique abilities to learn univariate activation functions as opposed to learning linear weights by Multi-Layer Perceptrons (MLPs), we improve the estimates of ITEs. The KANITE framework comprises two key architectures: 1. Integral Probability Metric (IPM) architecture: This employs an IPM loss in a specialized manner to effectively align towards ITE estimation across multiple treatments. 2. Entropy Balancing (EB) architecture: This uses weights for samples that are learned by optimizing entropy subject to balancing the covariates across treatment groups. Extensive evaluations on benchmark datasets demonstrate that KANITE outperforms state-of-the-art algorithms in both $\epsilon_{\text{PEHE}}$ and $\epsilon_{\text{ATE}}$ metrics. Our experiments highlight the advantages of KANITE in achieving improved causal estimates, emphasizing the potential of KANs to advance causal inference methodologies across diverse application areas.
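The entropy balancing idea can be sketched independently of KANs: find maximum-entropy weights on control units whose weighted covariate means match the treated group, here by gradient descent on the convex dual. This is a generic illustration of entropy balancing, not KANITE's training objective.

```python
import numpy as np

def entropy_balance_weights(X_control, target_means, lr=0.1, steps=2000):
    """Maximum-entropy weights on control units whose weighted covariate means
    match the treated-group means, via gradient descent on the convex dual."""
    lam = np.zeros(X_control.shape[1])
    for _ in range(steps):
        w = np.exp(X_control @ lam)
        w /= w.sum()
        lam -= lr * (X_control.T @ w - target_means)   # dual gradient step
    w = np.exp(X_control @ lam)
    return w / w.sum()

rng = np.random.default_rng(0)
X_control = rng.normal(size=(200, 3))                  # untreated units
X_treated = rng.normal(loc=0.3, size=(100, 3))         # treated units
w = entropy_balance_weights(X_control, X_treated.mean(axis=0))
print(np.round(X_control.T @ w, 3))                    # ~ treated covariate means
print(np.round(X_treated.mean(axis=0), 3))
```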
https://arxiv.org/abs/2503.13912