Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher model, driven by a preference-guided objective over aligned manipulation trajectories that transfers both linguistic and visual planning capabilities to embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
https://arxiv.org/abs/2601.09708
Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
https://arxiv.org/abs/2601.09706
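The value-conditioned prefix token described above can be illustrated with a small sketch. Everything here (the sign-plus-log-magnitude sinusoidal features, the embedding dimension, the helper names) is an illustrative assumption, not the paper's actual embedding scheme:

```python
import math

def value_embedding(value, dim=8):
    """Hypothetical stand-in for a value-aware prefix token: encode the
    number's sign plus sinusoidal features of its log10 magnitude, so
    numerically close values receive nearby embeddings."""
    sign = 0.0 if value == 0 else math.copysign(1.0, value)
    logmag = math.log10(abs(value)) if value != 0 else -10.0
    emb = [sign]
    k = 0
    while len(emb) < dim:
        freq = 1.0 / (10 ** (k // 2))           # multiple magnitude scales
        fn = math.sin if k % 2 == 0 else math.cos
        emb.append(fn(logmag * freq))
        k += 1
    return emb

def prepend_value_tokens(token_embs, numbers, dim=8):
    """Prefix one value-aware embedding per number onto a token sequence,
    leaving the standard tokenized embeddings untouched."""
    return [value_embedding(n, dim) for n in numbers] + token_embs
```

The point of the sinusoidal log-magnitude features is that embedding distance tracks numerical proximity, unlike purely symbolic digit tokens.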
Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development effort and enhancing software productivity. The emergence of large language models (LLMs) has greatly advanced code generation, though their efficiency is still limited by inherent architectural constraints: each token generation requires a complete inference pass, demanding persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, improving generation efficiency by 18.1%-37.8% over previous methods while preserving code generation quality.
https://arxiv.org/abs/2601.09703
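A syntax-level simplification of the kind described above can be sketched with Python's `ast` module. The specific rule below (rewriting `x = x + e` as `x += e`) is an illustrative stand-in, not necessarily one of ShortCoder's ten rules; it saves tokens and preserves behavior for immutable values, though a production rule set would need to handle mutable-type aliasing:

```python
import ast

class AugAssignRule(ast.NodeTransformer):
    """Illustrative token-reduction rule: `x = x + e` -> `x += e`.
    Behavior-preserving for numbers and strings; for mutable types
    (lists etc.) `+=` mutates in place, so a real rule would guard this."""

    def visit_Assign(self, node):
        self.generic_visit(node)
        if (len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.BinOp)
                and isinstance(node.value.left, ast.Name)
                and node.value.left.id == node.targets[0].id):
            return ast.AugAssign(
                target=ast.Name(id=node.targets[0].id, ctx=ast.Store()),
                op=node.value.op,
                value=node.value.right,
            )
        return node

def simplify(source):
    """Apply the rule and re-emit compact source (Python >= 3.9 for unparse)."""
    tree = AugAssignRule().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Running both the original and simplified snippets and comparing results is one way such rules can be validated for semantic consistency, as the ShorterCodeBench pipeline requires.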
Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, its original group-level collective memory selection is suboptimal for complex multi-object scenarios: it applies a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that performs fine-grained memory selection for each individual object. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced as target density increases, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
https://arxiv.org/abs/2601.09699
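The gap between group-synchronized and decoupled memory selection can be shown in a few lines. The per-frame score tables and top-k interface below are illustrative assumptions, not SAM3's actual memory API:

```python
def select_memory_group(scores_per_object, k):
    """Group-level (synchronized) selection, as in the behavior described
    above: rank frames by the AVERAGE confidence across all objects and
    keep the same top-k frames for every object."""
    n_frames = len(next(iter(scores_per_object.values())))
    avg = [sum(s[f] for s in scores_per_object.values()) / len(scores_per_object)
           for f in range(n_frames)]
    top = sorted(range(n_frames), key=lambda f: avg[f], reverse=True)[:k]
    return {obj: sorted(top) for obj in scores_per_object}

def select_memory_decoupled(scores_per_object, k):
    """Decoupled per-object selection (the SAM3-DMS idea, sketched):
    each object keeps the frames where IT is individually most reliable."""
    return {obj: sorted(sorted(range(len(s)), key=lambda f: s[f], reverse=True)[:k])
            for obj, s in scores_per_object.items()}
```

With anti-correlated per-object confidences, averaging washes out which frames are reliable for whom; the decoupled selector keeps each object's own best evidence.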
3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
https://arxiv.org/abs/2601.09698
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
https://arxiv.org/abs/2601.09697
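The adaptive keyframe allocation above can be sketched with a toy heuristic: total turning angle of the camera path as a stand-in for the paper's learned keyframe-count predictor. All names and constants here are assumptions for illustration:

```python
import math

def trajectory_complexity(positions):
    """Total turning angle (radians) along a 2D camera path, a crude
    proxy for how hard the trajectory is to cover with sparse keyframes."""
    total = 0.0
    for (x0, y0), (x1, y1), (x2, y2) in zip(positions, positions[1:], positions[2:]):
        a = math.atan2(y1 - y0, x1 - x0)
        b = math.atan2(y2 - y1, x2 - x1)
        d = abs(b - a)
        total += min(d, 2 * math.pi - d)   # wrap to [0, pi]
    return total

def num_keyframes(positions, base=2, per_radian=3.0, max_frames=32):
    """Very sparse keyframes for straight paths, denser for winding motion."""
    return min(max_frames, base + int(per_radian * trajectory_complexity(positions)))
```

The remaining frames between keyframes would then come from 3D reconstruction and rendering, amortizing the diffusion cost across the whole clip.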
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
https://arxiv.org/abs/2601.09692
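The label-free correctness estimation behind the consensus-voting component can be sketched as follows (a simplification: the actual CASCAL router also identifies skill niches via hierarchical clustering, omitted here):

```python
from collections import Counter

def consensus_accuracy(answers_by_model):
    """Estimate each model's correctness without ground-truth labels:
    take the majority answer per query as pseudo-ground-truth, then
    score every model by its agreement with that consensus."""
    models = list(answers_by_model)
    n_queries = len(answers_by_model[models[0]])
    consensus = []
    for q in range(n_queries):
        votes = Counter(answers_by_model[m][q] for m in models)
        consensus.append(votes.most_common(1)[0][0])
    return {m: sum(a == c for a, c in zip(answers_by_model[m], consensus)) / n_queries
            for m in models}
```

A router can then rank candidate models by these consensus scores, which is why this style of estimate degrades gracefully even when the generated training queries come from a weak generator.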
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation module that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking module that autonomously extracts and verifies report statements via web search, even when citations are missing.
https://arxiv.org/abs/2601.09688
Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape's capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
https://arxiv.org/abs/2601.09684
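The conflict-resolving projection can be sketched on flat gradient vectors. The paper applies the operator inside the intrinsic LoRA subspace on the adapter's low-rank factors; this plain-vector version is a simplified illustration of the same geometry:

```python
def project_conflicting(grad_a, grad_b):
    """If grad_a conflicts with grad_b (negative inner product), project
    grad_a onto the orthogonal complement of grad_b so the update for
    task A no longer opposes task B; otherwise leave it untouched."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    if dot >= 0:                         # no conflict: keep the gradient
        return list(grad_a)
    nb = sum(b * b for b in grad_b)      # ||grad_b||^2
    return [a - (dot / nb) * b for a, b in zip(grad_a, grad_b)]
```

After projection the two task gradients are orthogonal rather than opposed, which is the mechanism by which this family of methods mitigates negative transfer.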
Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
https://arxiv.org/abs/2601.09665
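The geometry-modulated attention in the first module can be sketched as ordinary dot-product attention whose logits are penalized by 3D distance, so spatially nearby historical observations contribute more scale information. The real module operates on learned scene coordinate embeddings; this distance penalty and its weight `lam` are illustrative assumptions:

```python
import math

def geometry_modulated_attention(query, keys, values, coords_q, coords_k, lam=1.0):
    """Attention over historical patches where each logit is the feature
    similarity minus lam times the squared 3D distance to the query."""
    logits = []
    for k, ck in zip(keys, coords_k):
        sim = sum(a * b for a, b in zip(query, k))
        dist2 = sum((a - b) ** 2 for a, b in zip(coords_q, ck))
        logits.append(sim - lam * dist2)
    mx = max(logits)                              # stable softmax
    ws = [math.exp(l - mx) for l in logits]
    z = sum(ws)
    ws = [w / z for w in ws]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(ws, values)) for i in range(dim)]
```

With equal feature similarity, a distant patch is effectively masked out, which is how 3D spatial proximity propagates scale from history to the current window.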
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
https://arxiv.org/abs/2601.09658
Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.
https://arxiv.org/abs/2601.09652
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
https://arxiv.org/abs/2601.09647
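The centroid-based attack described above is simple enough to sketch end to end: fit one centroid per model from known generations, then attribute an anonymous image to the nearest centroid in embedding space. The 2D embeddings below are synthetic stand-ins for real image embeddings:

```python
import math

def centroid(vectors):
    """Mean embedding of one model's known generations."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def deanonymize(embedding, centroids_by_model):
    """Assign an anonymous image to the model whose centroid is closest,
    exploiting the model-specific clustering of T2I generations."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids_by_model,
               key=lambda m: dist(embedding, centroids_by_model[m]))
```

No prompt control or training data is needed beyond a handful of reference generations per model, which is what makes the leaderboard threat practical.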
Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits: structured interaction pathways that govern how predictions are formed. We propose Circuit-guided Unlearning Difficulty (CUD), a pre-unlearning metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that CUD reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate parts of the original model, whereas hard samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, CUD takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty, and motivates the development of unlearning methods grounded in model mechanisms.
https://arxiv.org/abs/2601.09624
Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires reasoning about the spatial context and temporal dynamics of the object of interest (OOI), which is challenging for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art vision-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks (position perception, movement prediction, and threat analysis), facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at this https URL.
https://arxiv.org/abs/2601.09613
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework built around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
https://arxiv.org/abs/2601.09609
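The group-aware diversity reward can be sketched as follows: each sampled trajectory earns its own quality score plus a bonus for how far it sits from the other trajectories in its group. The squared-Euclidean distance on embeddings and the weight `lam` are illustrative assumptions, not the paper's exact formulation:

```python
def group_diversity_reward(quality, embeddings, lam=0.5):
    """Reward each trajectory in a sampled group by its quality plus
    lam times its mean squared distance to the group's other members,
    encouraging distinct planning branches over near-duplicates."""
    n = len(embeddings)
    rewards = []
    for i in range(n):
        mean_dist = sum(
            sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            for j in range(n) if j != i
        ) / (n - 1)
        rewards.append(quality[i] + lam * mean_dist)
    return rewards
```

A trajectory that duplicates its group-mates receives no bonus, so the policy gradient pushes probability mass toward distinct plans rather than a single mode.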
Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
https://arxiv.org/abs/2601.09606
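The Stage 1 objective can be sketched as a pairwise hinge loss whose margin grows with the label gap between sentiment groups (the "dynamic margin") and whose per-pair weight increases for pairs the model currently ranks badly (the "advantage" weighting). The specific margin schedule and weighting below are assumptions, not the paper's exact loss:

```python
def dynamic_margin_ranking_loss(scores, labels, base=0.1, scale=0.5):
    """For each ordered pair with labels[i] > labels[j], require the
    predicted score gap to exceed a margin proportional to the label
    gap; weight each violated pair by its violation, so hard-to-rank
    pairs dominate the objective."""
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                margin = base + scale * (labels[i] - labels[j])
                violation = max(0.0, margin - (scores[i] - scores[j]))
                weight = 1.0 + violation       # adaptive focus on hard pairs
                total += weight * violation
                n_pairs += 1
    return total / max(n_pairs, 1)
```

Stage 2's MAE-driven objective would then calibrate absolute magnitudes on top of the ordinal structure this loss enforces.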
Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO, an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO achieve success rates as high as 60% on views where the non-augmented policy fails completely.
https://arxiv.org/abs/2601.09605
Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well pre-aligned; differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since the nearest-neighbor distance between clouds is not commutative. When only one point cloud has issues, adjustments can be made, but in real scenarios both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, exhibits a clear minimum at the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.
https://arxiv.org/abs/2601.09601
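A commutative entropy-based alignment cost can be sketched by fitting a Gaussian to the two clouds merged together: a well-aligned pair forms a tight joint distribution (low differential entropy), while a misaligned pair spreads out (high entropy). This Gaussian proxy in 2D is an illustrative assumption; the paper's actual entropy estimator is not specified here:

```python
import math

def gaussian_entropy(points):
    """Differential entropy of a Gaussian fit to a 2D point set:
    H = 0.5 * log((2*pi*e)^2 * det(Sigma))."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    det = max(sxx * syy - sxy ** 2, 1e-12)   # guard degenerate clouds
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det)

def alignment_cost(cloud_a, cloud_b):
    """Symmetric by construction: the clouds are simply concatenated,
    so neither has to be chosen as the fixed reference."""
    return gaussian_entropy(cloud_a + cloud_b)
```

Minimizing this cost over rigid transforms of one cloud drives the union toward its most compact configuration, the property the abstract attributes to the best alignment.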
In complex environments, autonomous robot navigation and environmental perception pose higher requirements for SLAM technology. This paper presents a novel method for semantically enhancing 3D point cloud maps with thermal information. By first performing pixel-level fusion of visible and infrared images, the system projects real-time LiDAR point clouds onto this fused image stream. It then segments heat source features in the thermal channel to instantly identify high temperature targets and applies this temperature information as a semantic layer on the final 3D map. This approach generates maps that not only have accurate geometry but also possess a critical semantic understanding of the environment, making it highly valuable for specific applications like rapid disaster assessment and industrial preventive maintenance.
https://arxiv.org/abs/2601.09578
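The projection-and-tagging step described above can be sketched with a pinhole camera model: project each LiDAR point (already in the camera frame) onto the fused image, sample the thermal channel, and attach a semantic "hot" flag. The intrinsics and the temperature threshold are illustrative assumptions:

```python
def project_and_tag(points, thermal, fx=1.0, fy=1.0, cx=0.0, cy=0.0, hot=80.0):
    """Project 3D points (camera frame) into a thermal image via a
    pinhole model and tag points above a temperature threshold as the
    semantic layer for the 3D map. `thermal` is a row-major 2D grid
    of temperatures; points behind the camera or off-image are dropped."""
    h, w = len(thermal), len(thermal[0])
    tagged = []
    for x, y, z in points:
        if z <= 0:                       # behind the camera plane
            continue
        u = int(round(fx * x / z + cx))  # image column
        v = int(round(fy * y / z + cy))  # image row
        if 0 <= v < h and 0 <= u < w:
            temp = thermal[v][u]
            tagged.append(((x, y, z), temp, temp >= hot))
    return tagged
```

Each tagged point carries its sampled temperature into the map, so the final 3D reconstruction exposes high-temperature regions directly, as needed for disaster assessment or preventive maintenance.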