Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness for the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we show that this Feature Cosine Similarity Bound (FCSB) can be tightened by increasing a Gaussian robustness score defined on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks demonstrate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
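As a loose illustration of the smoothing operation (with a toy stand-in encoder and arbitrary `sigma`/`n` values that are assumptions, not the paper's construction), the sketch below Monte-Carlo-averages encoder features under Gaussian input noise and measures the clean-versus-adversarial feature cosine similarity that the FCSB lower-bounds:

```python
import numpy as np

def toy_encoder(x):
    # stand-in for a frozen feature encoder: fixed random projection + tanh
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, x.shape[-1]))
    return np.tanh(W @ x)

def smoothed_features(encoder, x, sigma=0.25, n=256, seed=1):
    # Monte-Carlo estimate of E[f(x + delta)], delta ~ N(0, sigma^2 I)
    rng = np.random.default_rng(seed)
    noise = sigma * rng.standard_normal((n, x.shape[-1]))
    return np.stack([encoder(x + d) for d in noise]).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = 0.1 * np.ones(16)
x_adv = x + 0.05          # small l2-bounded perturbation
f_clean = smoothed_features(toy_encoder, x)
f_adv = smoothed_features(toy_encoder, x_adv)
sim = cosine(f_clean, f_adv)   # stays high for small perturbations
```

In the paper's setting the encoder is an MLLM vision tower and the bound is derived analytically; the sketch only conveys the smoothing-then-compare mechanics.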
https://arxiv.org/abs/2601.16200
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on the GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
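The residual vector quantizer at the heart of the tokenizer can be sketched generically (random toy codebooks; SAMTok's actual codebook sizes and number of stages are not given here): each stage quantizes the residual left by the previous stage, so an embedding compresses into a short list of discrete indices.

```python
import numpy as np

def rvq_encode(x, codebooks):
    # residual vector quantization: stage i snaps the remaining residual
    # to its nearest codeword, emitting one discrete index per stage
    ids, residual = [], x.copy()
    for C in codebooks:
        dists = np.linalg.norm(C - residual, axis=1)
        i = int(np.argmin(dists))
        ids.append(i)
        residual = residual - C[i]
    return ids, residual  # sum of chosen codewords + residual == x

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((32, 8)) for _ in range(2)]
x = rng.standard_normal(8)
ids, res = rvq_encode(x, codebooks)
```

Decoding simply sums the codewords addressed by the indices, which is what lets a mask round-trip through a couple of special tokens.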
https://arxiv.org/abs/2601.16093
Accurate prediction of crop above-ground biomass (AGB) under water stress is critical for monitoring crop productivity, guiding irrigation, and supporting climate-resilient agriculture. Data-driven models scale well but often lack interpretability and degrade under distribution shift, whereas process-based crop models (e.g., DSSAT, APSIM, LINTUL5) require extensive calibration and are difficult to deploy over large spatial domains. To address these limitations, we propose AgriPINN, a process-informed neural network that integrates a biophysical crop-growth differential equation as a differentiable constraint within a deep learning backbone. This design encourages physiologically consistent biomass dynamics under water-stress conditions while preserving model scalability for spatially distributed AGB prediction. AgriPINN recovers latent physiological variables, including leaf area index (LAI), absorbed photosynthetically active radiation (PAR), radiation use efficiency (RUE), and water-stress factors, without requiring direct supervision. We pretrain AgriPINN on 60 years of historical data across 397 regions in Germany and fine-tune it on three years of field experiments under controlled water treatments. Results show that AgriPINN consistently outperforms state-of-the-art deep-learning baselines (ConvLSTM-ViT, SLTF, CNN-Transformer) and the process-based LINTUL5 model in terms of accuracy (RMSE reductions up to $43\%$) and computational efficiency. By combining the scalability of deep learning with the biophysical rigor of process-based modeling, AgriPINN provides a robust and interpretable framework for spatio-temporal AGB prediction, offering practical value for irrigation-infrastructure planning, yield forecasting, and climate adaptation.
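A minimal sketch of the kind of differentiable constraint involved, assuming a LINTUL-style growth law dW/dt = RUE · PAR · (1 − e^{−k·LAI}) (a hypothetical form and coefficient; the paper's exact ODE is not given in the abstract): the mismatch between observed biomass increments and the modeled growth rate becomes a soft penalty added to the data loss.

```python
import numpy as np

def physics_residual(biomass, lai, rue, par, dt=1.0, k=0.6):
    # hypothetical LINTUL-style law: dW/dt = RUE * PAR * (1 - exp(-k * LAI));
    # the residual penalizes biomass trajectories that violate the ODE
    growth = rue * par * (1.0 - np.exp(-k * lai))
    dW = np.diff(biomass) / dt
    return float(np.mean((dW - growth[:-1]) ** 2))

# a trajectory that follows the law exactly has (near-)zero residual;
# total training loss would be data_term + lambda * physics_residual
lai = np.full(5, 2.0)
g = 3.0 * 10.0 * (1.0 - np.exp(-0.6 * 2.0))
biomass = g * np.arange(5.0)
r = physics_residual(biomass, lai, rue=3.0, par=np.full(5, 10.0))
```

In a real PINN the residual would be computed on network outputs inside an autodiff framework, so its gradient shapes the learned dynamics.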
https://arxiv.org/abs/2601.16045
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this ability rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to measuring how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason about and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent MLLMs and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
https://arxiv.org/abs/2601.16007
The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show experimentally that this redundancy can impair DT's performance. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
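The redundancy argument can be made concrete with a token-layout sketch (hypothetical helper names; real implementations embed these tokens before the Transformer): DT carries one RTG token per timestep, while the decoupled variant passes only the most recent RTG as a side conditioning input.

```python
def dt_tokens(rtgs, states, actions):
    # Decision Transformer: interleave (RTG, state, action) at every step
    seq = []
    for t in range(len(states)):
        seq.append(("rtg", rtgs[t]))
        seq.append(("state", states[t]))
        if t < len(actions):
            seq.append(("action", actions[t]))
    return seq

def ddt_tokens(rtgs, states, actions):
    # Decoupled DT: only states/actions enter the Transformer; the most
    # recent RTG is kept aside as a separate conditioning signal
    seq = []
    for t in range(len(states)):
        seq.append(("state", states[t]))
        if t < len(actions):
            seq.append(("action", actions[t]))
    return seq, ("rtg", rtgs[-1])
```

The shorter sequence is where the claimed compute savings come from: a third of the tokens never enter the attention stack.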
https://arxiv.org/abs/2601.15953
Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
https://arxiv.org/abs/2601.15951
Deep neural network models degrade significantly under long-tailed data distributions, where the training data is dominated by a small set of head classes while the tail classes receive far fewer examples. To address this class imbalance, the related literature has focused mainly on adjustments in the decision space, such as logit-level corrections that compensate for class-prior bias, with far less attention paid to the optimization process as shaped by confidence differences among samples. In the current study, we present a class- and confidence-aware re-weighting scheme for long-tailed learning. The scheme operates purely at the loss level and is complementary to existing logit-adjustment methods. In its practical implementation, we use an $\Omega(p_t, f_c)$ function that modulates each sample's contribution to the training objective based on the confidence of the prediction and the relative frequency of the corresponding class. Our theoretical discussion is corroborated by significant experimental results on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors.
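One possible instantiation of the modulation function (purely illustrative; the abstract does not specify the paper's exact $\Omega$, exponents, or combination with the focal+dice objective): a focal-style confidence term scaled by inverse class frequency, applied as a per-sample weight on the cross-entropy.

```python
import numpy as np

def omega(p_t, f_c, gamma=2.0, alpha=1.0):
    # hypothetical form: focal-style confidence term (1 - p_t)^gamma scaled
    # by an inverse-frequency class term (f_c = relative class frequency)
    return (1.0 - p_t) ** gamma / f_c ** alpha

def reweighted_ce(probs, labels, class_freqs):
    # per-sample cross-entropy modulated by omega(p_t, f_c)
    p_t = probs[np.arange(len(labels)), labels]
    w = omega(p_t, class_freqs[labels])
    return float(np.mean(-w * np.log(p_t + 1e-12)))
```

By construction, confidently classified head-class samples contribute little, while uncertain tail-class samples dominate the gradient.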
https://arxiv.org/abs/2601.15924
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
https://arxiv.org/abs/2601.15918
Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
https://arxiv.org/abs/2601.15891
The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
https://arxiv.org/abs/2601.15820
In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present a physical spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (the residual streams in mHC) using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
https://arxiv.org/abs/2601.15757
Diffusion models have emerged as a powerful approach for multimodal motion planning in autonomous driving. However, their practical deployment is typically hindered by the inherent difficulty in enforcing vehicle dynamics and a critical reliance on accurate predictions of other agents, making them prone to safety issues under uncertain interactions. To address these limitations, we introduce DualShield, a planning and control framework that leverages Hamilton-Jacobi (HJ) reachability value functions in a dual capacity. First, the value functions act as proactive guidance, steering the diffusion denoising process towards safe and dynamically feasible regions. Second, they form a reactive safety shield using control barrier-value functions (CBVFs) to modify the executed actions and ensure safety. This dual mechanism preserves the rich exploration capabilities of diffusion models while providing principled safety assurance under uncertain and even adversarial interactions. Simulations in challenging unprotected U-turn scenarios demonstrate that DualShield significantly improves both safety and task efficiency compared to leading methods from different planning paradigms under uncertainty.
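The reactive-shield idea can be illustrated in one dimension (a deliberately simplified example, not the paper's CBVF formulation): with state x, dynamics ẋ = u, and barrier h(x) = d − x, the standard CBF condition ḣ + αh ≥ 0 reduces to u ≤ α(d − x), so the shield clamps any nominal action that would push x past the boundary d.

```python
def cbf_clamp(u_nominal, x, d=1.0, alpha=1.0):
    # barrier h(x) = d - x with dynamics x_dot = u:
    #   h_dot + alpha * h >= 0  <=>  u <= alpha * (d - x)
    # so the safety shield caps the nominal action at u_max
    u_max = alpha * (d - x)
    return min(u_nominal, u_max)
```

In the full framework the same value function also acts proactively, biasing the diffusion denoising steps toward the safe set rather than only filtering the final action.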
https://arxiv.org/abs/2601.15729
Understanding what users like is relatively straightforward; understanding what users dislike, however, remains a challenging and underexplored problem. Research into users' negative preferences has gained increasing importance in modern recommendation systems. Numerous platforms have introduced explicit negative feedback mechanisms and leverage such signals to refine their recommendation models. Beyond traditional business metrics, user experience-driven metrics, such as negative feedback rates, have become critical indicators for evaluating system performance. However, most existing approaches primarily use negative feedback as an auxiliary signal to enhance positive recommendations, paying little attention to directly modeling negative interests, which can be highly valuable in offline applications. Moreover, due to the inherent sparsity of negative feedback data, models often suffer from context understanding biases induced by positive feedback dominance. To address these challenges, we propose the first large language model framework for negative feedback modeling with specially designed context-discerning modules. We use semantic ID representations to replace text-based item descriptions and introduce an item-level alignment task that enhances the LLM's understanding of the semantic context behind negative feedback. Furthermore, we design a Progressive GRPO training paradigm that enables the model to dynamically balance positive and negative behavioral context utilization. Moreover, our investigation reveals a fundamental misalignment between the conventional next-negative-item prediction objective and users' true negative preferences, which is heavily influenced by the system's recommendation order. To mitigate this, we propose a novel reward function and evaluation metric grounded in multi-day future negative feedback and their collaborative signals.
https://arxiv.org/abs/2601.15721
Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
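The tier decomposition can be sketched with small metric helpers (illustrative code, not the paper's evaluation harness; Tier 3 is shown as accuracy over applicable samples for brevity, whereas the benchmark reports F1):

```python
def f1(tp, fp, fn):
    # standard F1 from binary counts, guarding against empty denominators
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def tier_scores(y_true, y_pred, na="NA"):
    # Tier 2: applicability detection, scored as binary NA-vs-rest F1
    tp = sum(t == na and p == na for t, p in zip(y_true, y_pred))
    fp = sum(t != na and p == na for t, p in zip(y_true, y_pred))
    fn = sum(t == na and p != na for t, p in zip(y_true, y_pred))
    na_f1 = f1(tp, fp, fn)
    # Tier 3: fine-grained quality, restricted to samples where the
    # attribute actually applies (ground truth is not NA)
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != na]
    tier3_acc = sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0
    return na_f1, tier3_acc
```

Separating the two scores is what lets the framework show that a model can classify well (Tier 3) while still failing to notice when an attribute is inapplicable (Tier 2).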
https://arxiv.org/abs/2601.15711
Accurate alignment of multi-degree-of-freedom rehabilitation robots is essential for safe and effective patient training. This paper proposes a two-stage calibration framework for a self-designed three-degree-of-freedom (3-DOF) ankle rehabilitation robot. First, a Kronecker-product-based open-loop calibration method is developed to cast the input-output alignment into a linear parameter identification problem, which in turn defines the associated experimental design objective through the resulting information matrix. Building on this formulation, calibration posture selection is posed as a combinatorial design-of-experiments problem guided by a D-optimality criterion, i.e., selecting a small subset of postures that maximises the determinant of the information matrix. To enable practical selection under constraints, a Proximal Policy Optimization (PPO) agent is trained in simulation to choose 4 informative postures from a candidate set of 50. Across simulation and real-robot evaluations, the learned policy consistently yields substantially more informative posture combinations than random selection: the mean determinant of the information matrix achieved by PPO is more than two orders of magnitude higher than that of random selection, with reduced variance. In addition, real-world results indicate that a parameter vector identified from only four D-optimality-guided postures provides stronger cross-episode prediction consistency than estimates obtained from a larger but unstructured set of 50 postures. The proposed framework therefore improves calibration efficiency while maintaining robust parameter estimation, offering practical guidance for high-precision alignment of multi-DOF rehabilitation robots.
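The D-optimality criterion itself is simple to state in code (brute-force enumeration over toy random candidates below; the paper trains a PPO agent precisely because exhaustive search over realistic candidate sets is impractical):

```python
import numpy as np
from itertools import combinations

def d_optimal_subset(X, k):
    # pick the k rows (candidate postures) whose information matrix
    # X_S^T X_S has the largest determinant: the D-optimality criterion
    best, best_det = None, -np.inf
    for S in combinations(range(len(X)), k):
        M = X[list(S)]
        det = np.linalg.det(M.T @ M)
        if det > best_det:
            best, best_det = S, det
    return best, best_det

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))   # 10 hypothetical postures, 3 parameters
S, d = d_optimal_subset(X, 4)
```

Maximising this determinant minimises the volume of the parameter-estimate confidence ellipsoid, which is why four well-chosen postures can outperform fifty unstructured ones.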
https://arxiv.org/abs/2601.15707
This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes: boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels, without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
https://arxiv.org/abs/2601.15705
While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this challenge: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers: in \textbf{advanced reasoning} to optimize computation and trigger self-correction; in \textbf{autonomous agents} to govern metacognitive decisions about tool use and information seeking; and in \textbf{reinforcement learning} to mitigate reward hacking and enable self-improvement via intrinsic rewards. By grounding these advancements in emerging theoretical frameworks like Bayesian methods and Conformal Prediction, we provide a unified perspective on this transformative trend. This survey provides a comprehensive overview, critical analysis, and practical design patterns, arguing that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
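As one concrete example of the theoretical grounding mentioned above, split conformal prediction turns per-answer confidence scores into calibrated prediction sets (a generic sketch, not tied to any particular LLM or to this survey's taxonomy):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    # split conformal: finite-sample-corrected quantile of nonconformity
    # scores (e.g. 1 - model confidence) on a held-out calibration set
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0), method="higher"))

def prediction_set(probs, tau):
    # keep every candidate answer whose nonconformity (1 - prob) is within
    # tau; the set covers the true answer with probability >= 1 - alpha
    return [i for i, p in enumerate(probs) if 1 - p <= tau]
```

Used as a control signal, an empty or very large prediction set is exactly the kind of trigger that can route a model toward abstention, tool use, or extra reasoning.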
https://arxiv.org/abs/2601.15690
Trigger-Action Programming (TAP) platforms such as IFTTT and Zapier enable Web of Things (WoT) automation by composing event-driven rules across heterogeneous services. A TAP applet links a trigger to an action and must bind trigger outputs (ingredients) to action inputs (fields) to be executable. Prior work largely treats TAP as service-level prediction from natural language, which often yields non-executable applets that still require manual configuration. We study the function-level configuration problem: generating complete applets with correct ingredient-to-field bindings. We propose FARM (Field-Aware Resolution Model), a two-stage architecture for automated applet generation with full configuration. Stage 1 trains contrastive dual encoders with selective layer freezing over schema-enriched representations, retrieving candidates from 1,724 trigger functions and 1,287 action functions (2.2M possible trigger-action pairs). Stage 2 performs selection and configuration using an LLM-based multi-agent pipeline comprising intent analysis, trigger selection, action selection via cross-schema scoring, and configuration verification; agents coordinate through shared state and agreement-based selection. FARM achieves 81% joint accuracy on Gold (62% on Noisy, 70% on One-shot) at the function level, where both trigger and action functions must match the ground truth. For comparison with service-level baselines, we map functions to their parent services and evaluate at the service level, where FARM reaches 81% joint accuracy and improves over TARGE by 23 percentage points. FARM also generates ingredient-to-field bindings, producing executable automation configurations.
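FARM's Stage 1 is, at its core, dense retrieval: embed the user request and every trigger/action function schema with dual encoders, then rank candidates by cosine similarity. The sketch below substitutes a deterministic hash-seeded stand-in for the trained encoders (a loud assumption; real embeddings come from the contrastive dual encoders), but the retrieval logic is the standard one.

```python
import numpy as np

def embed(text, dim=16):
    """Stand-in encoder: a deterministic unit vector per text, seeded from
    its hash. Replace with the trained contrastive dual encoder in a real
    system; this only makes the retrieval step runnable."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve(query, candidate_functions, k=3):
    """Rank candidate trigger/action functions by cosine similarity to the
    query embedding and return the top-k (Stage-1-style candidate set)."""
    q = embed(query)
    sims = {name: float(embed(name) @ q) for name in candidate_functions}
    return sorted(candidate_functions, key=lambda n: -sims[n])[:k]
```

In FARM the top-k candidates from each side (triggers and actions) are then handed to the Stage-2 multi-agent pipeline, which makes the final selection and fills in ingredient-to-field bindings.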
https://arxiv.org/abs/2601.15687
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), like conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step toward reformulating SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Unlike standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models in both emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: this https URL
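The GRPO-PTR reward has three moving parts: an outcome reward, a reasoning reward that is progressively ramped in, and a trustworthiness weight that discounts the reasoning term when reasoning and outcome disagree. A minimal sketch of one plausible composition follows; the schedule, the trust proxy, and all weights here are assumptions for illustration, not the paper's exact formula.

```python
def ptr_reward(outcome_correct, reasoning_score, step, warmup_steps=1000):
    """Progressive trust-aware reasoning reward (illustrative sketch).

    outcome_correct: 1.0 if the predicted emotion label matches ground
                     truth, else 0.0 (rule-based outcome reward)
    reasoning_score: score in [0, 1] from a reward model judging the
                     Chain-of-Thought on multi-dimensional criteria
    step:            current RL training step; the reasoning term is
                     ramped in linearly over `warmup_steps` (assumed
                     schedule)
    """
    progress = min(1.0, step / warmup_steps)  # progressive introduction
    # Trust proxy: high only when reasoning quality and outcome agree.
    trust = 1.0 - abs(outcome_correct - reasoning_score)
    return outcome_correct + progress * trust * reasoning_score
```

The trust weight is what blocks the obvious reward hack: a fluent but wrong chain of thought (high `reasoning_score`, zero `outcome_correct`) earns little, because its reasoning term is discounted toward zero.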
https://arxiv.org/abs/2601.15668
3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at this https URL.
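Superquadric-to-voxel splatting ultimately reduces to evaluating each superquadric's inside-outside function at voxel centers. The sketch below uses the standard superquadric implicit function for an axis-aligned primitive at the origin; pose handling and the paper's efficient splatting scheme are omitted, so treat this as a minimal reference, not SuperOcc's implementation.

```python
import numpy as np

def superquadric_occupancy(scales, exps, grid_pts):
    """Inside-outside test for an axis-aligned superquadric at the origin.

    scales:   (a1, a2, a3) half-extents along x, y, z
    exps:     (e1, e2) shape exponents; e1 = e2 = 1 gives an ellipsoid
    grid_pts: (N, 3) voxel-center coordinates
    Returns a boolean occupancy flag per point, using the standard
    implicit function F(p) <= 1 as the inside test.
    """
    a1, a2, a3 = scales
    e1, e2 = exps
    x, y, z = (np.abs(grid_pts) / np.array([a1, a2, a3])).T
    f = (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)
    return f <= 1.0
```

A scene-level occupancy grid is then the union of these per-primitive masks over all decoded superquadrics, which is why the sparsity of the superquadric set translates directly into compute savings over dense voxel decoding.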
https://arxiv.org/abs/2601.15644