Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
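The per-token freezing idea can be sketched concretely. In this minimal illustration, a position is frozen once its argmax prediction has been unchanged for `patience` consecutive refinement steps; the stability criterion and the function name are assumptions for illustration, since the abstract does not specify the paper's actual lightweight convergence signals.

```python
import numpy as np

def early_stop_decode(step_logits, patience=2):
    """Token-level early stopping sketch: freeze a position once its argmax
    has been stable for `patience` consecutive refinement steps (an assumed
    convergence signal, not necessarily the paper's)."""
    T, L, V = step_logits.shape          # steps, sequence length, vocab size
    frozen = np.full(L, -1)              # step at which each token froze (-1 = never)
    stable = np.zeros(L, dtype=int)      # consecutive-agreement counter
    prev = np.argmax(step_logits[0], axis=-1)
    final = prev.copy()
    for t in range(1, T):
        cur = np.argmax(step_logits[t], axis=-1)
        active = frozen < 0              # positions still being refined
        stable = np.where(active & (cur == prev), stable + 1,
                          np.where(active, 0, stable))
        newly = active & (stable >= patience)
        frozen[newly] = t                # frozen tokens skip later updates
        final[active] = cur[active]
        prev = cur
    return final, frozen
```

Frozen positions need no further denoising compute, which is where the step savings come from.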
https://arxiv.org/abs/2602.11133
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
https://arxiv.org/abs/2602.11124
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but their inherently sequential denoising process makes them slow. Existing acceleration methods such as distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow-matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without invoking the full neural network used for velocity prediction. The approximation uses finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, advancing along the denoising path at zero compute cost and skipping computation at intermediate steps. We model the decision of how many steps can safely be skipped before a full model computation is required as a multi-armed bandit problem; the bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at this https URL.
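The finite-difference extrapolation behind the skipped steps can be sketched as follows. This assumes a simple first-order (linear) extrapolation of the velocity from the two most recent full-model predictions; the exact extrapolation order FastFlow uses is not stated in the abstract.

```python
import numpy as np

def extrapolate_step(x, v_prev, v_curr, dt):
    """Advance the flow state without a network call: extrapolate the next
    velocity by a finite difference of the two most recent full-model
    predictions (first-order sketch; the actual order is an assumption)."""
    v_next = v_curr + (v_curr - v_prev)   # linear extrapolation in time
    return x + dt * v_next
```

On trajectories where the velocity varies linearly in time (constant acceleration), this extrapolated Euler step coincides with the step taken using the true next-step velocity.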
https://arxiv.org/abs/2602.11105
Neural PDE surrogates are often deployed in data-limited or partially observed regimes where downstream decisions depend on calibrated uncertainty in addition to low prediction error. Existing approaches obtain uncertainty through ensemble replication, fixed stochastic noise such as dropout, or post hoc calibration. We propose cross-regularized uncertainty, which learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on the training split for fit, while low-dimensional uncertainty controls are optimized on the regularization split to reduce train-test mismatch, yielding regime-adaptive uncertainty without per-regime noise tuning. The framework can learn continuous noise levels at the output head, within hidden features, or within operator-specific components such as spectral modes. We instantiate the approach in Fourier Neural Operators and evaluate on APEBench sweeps over observed fraction and training-set size. Across these sweeps, the learned predictive distributions are better calibrated on held-out splits and the resulting uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
https://arxiv.org/abs/2602.11090
Feature selection remains a major challenge in medical prediction, where existing approaches such as LASSO often lack robustness and interpretability. We introduce GRASP, a novel framework that couples Shapley value driven attribution with group $L_{21}$ regularization to extract compact and non-redundant feature sets. GRASP first distills group level importance scores from a pretrained tree model via SHAP, then enforces structured sparsity through group $L_{21}$ regularized logistic regression, yielding stable and interpretable selections. Extensive comparisons with LASSO, SHAP, and deep learning based methods show that GRASP consistently delivers comparable or superior predictive accuracy, while identifying fewer, less redundant, and more stable features.
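The structured-sparsity step can be illustrated with the standard proximal operator of the group-$L_{21}$ penalty, which shrinks each group's coefficient block by its Euclidean norm and zeroes weak groups wholesale. The abstract does not specify GRASP's optimizer, so this is one common way such penalties are handled, not necessarily the paper's.

```python
import numpy as np

def prox_group_l21(w, groups, lam):
    """Proximal operator of the group-L21 penalty: each group's coefficient
    block is shrunk toward zero by its Euclidean norm, so whole feature
    groups drop out at once (a standard solver step, assumed here)."""
    w = w.copy()
    for g in groups:                      # g: list of feature indices in one group
        norm = np.linalg.norm(w[g])
        w[g] = 0.0 if norm <= lam else w[g] * (1.0 - lam / norm)
    return w
```

Zeroing entire groups is what yields the compact, non-redundant feature sets the abstract describes.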
https://arxiv.org/abs/2602.11084
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
https://arxiv.org/abs/2602.11065
Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from a query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer that derives the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to geometrically coherent regions and a spatial dual-path SSM block that captures underlying dependencies within the query set by integrating the associated coordinate information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 of the FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on the ScanNet, ScanNet200, S3DIS, and ScanNet++ V1 benchmarks at lower computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at this https URL.
https://arxiv.org/abs/2602.11007
Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
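The decoupling SVDA describes can be sketched with unit-normalized queries and keys (directional alignment) and a learnable diagonal `s` rescaling each channel (spectral modulation). The exact placement of the diagonal inside the normalized query-key interaction is an assumption based on the abstract's description.

```python
import numpy as np

def svda_attention(Q, K, s):
    """SVD-inspired attention sketch: normalize queries/keys to unit norm,
    then modulate channels with a learnable diagonal `s`; placement of the
    diagonal is assumed from the abstract."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    logits = (Qn * s) @ Kn.T               # equivalent to Qn @ diag(s) @ Kn.T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Because `s` is explicit and diagonal, spectral indicators such as entropy, rank, or selectivity can be read off the learned values rather than approximated post hoc.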
https://arxiv.org/abs/2602.11005
Guava fruits often suffer from a range of diseases that harm fruit quality and crop yield, so early identification is important for minimizing damage and ensuring fruit health. This study classifies guava images into three categories: Anthracnose, fruit flies, and healthy fruit. The dataset, collected from Mendeley Data, contains 473 original images of guava that vary in size and format; all images were resized to 256x256 pixels in RGB color mode for consistency. Data augmentation was then applied to enlarge the dataset by generating variations of the original images, yielding 3,784 images after advanced preprocessing. Two deep learning models were implemented to classify the images. InceptionV3 is well known for its advanced architecture, applying multiple convolutional filters to extract different features effectively, while ResNet50 uses residual learning to make deeper networks trainable. InceptionV3 achieved an accuracy of 98.15%, and ResNet50 achieved 94.46%. Data-mixing methods such as CutMix and MixUp were applied to enhance model robustness. Confusion matrices were used to evaluate the overall performance of both InceptionV3 and ResNet50. Additionally, SHAP analysis was used to improve interpretability by identifying the image regions most significant to the models' predictions. This study aims to highlight how advanced models can improve disease classification accuracy and support more effective guava health management and yield optimization.
https://arxiv.org/abs/2602.10967
Research on misinformation has focused almost exclusively on supply, asking what falsehoods circulate, who produces them, and whether corrections work. A basic demand-side question remains unanswered. When ordinary people can fact-check anything they want, what do they actually ask about? We provide the first large-scale evidence on this question by analyzing close to 2,500 statements submitted by 457 participants to an open-ended AI fact-checking system. Each claim is classified along five semantic dimensions (domain, epistemic form, verifiability, target entity, and temporal reference), producing a behavioral map of public verification demand. Three findings stand out. First, users range widely across topics but default to a narrow epistemic repertoire, overwhelmingly submitting simple descriptive claims about present-day observables. Second, roughly one in four requests concerns statements that cannot be empirically resolved, including moral judgments, speculative predictions, and subjective evaluations, revealing a systematic mismatch between what users seek from fact-checking tools and what such tools can deliver. Third, comparison with the FEVER benchmark dataset exposes sharp structural divergences across all five dimensions, indicating that standard evaluation corpora encode a synthetic claim environment that does not resemble real-world verification needs. These results reframe fact-checking as a demand-driven problem and identify where current AI systems and benchmarks are misaligned with the uncertainty people actually experience.
https://arxiv.org/abs/2602.10935
The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose the Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose the Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at this https URL.
https://arxiv.org/abs/2602.10884
Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting--where accuracy, calibration, and robustness directly affect grid operations--remains an open question. We present a multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048h, day-ahead horizon), a 47% reduction over the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE > 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models--Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at 90% nominal level) while both Moirai-2 and Prophet exhibit overconfidence (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.
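The headline MASE numbers are easy to reproduce in form: MASE scales the forecast MAE by the in-sample MAE of an m-step seasonal-naive forecast, so values below 1 beat that baseline. A minimal sketch, with m = 24 for hourly data with a daily cycle:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=24):
    """Mean Absolute Scaled Error with a seasonal scale: forecast errors are
    divided by the in-sample MAE of the m-step seasonal-naive forecast, so
    MASE < 1 means the model beats the seasonal-naive baseline."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale
```

Under this definition, the reported 0.31 at long contexts means roughly a third of the seasonal-naive error on day-ahead load.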
https://arxiv.org/abs/2602.10848
Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model's internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
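One natural instantiation of $\delta_{\mathrm{TCB}}$, consistent with the abstract's output-embedding-geometry framing, is the L2 distance from the final hidden state to the nearest decision boundary of the linear readout; the paper's exact formulation may differ.

```python
import numpy as np

def token_constraint_bound(h, W):
    """Smallest L2 perturbation of hidden state h that flips the argmax
    next token under output-embedding matrix W: distance to the nearest
    decision boundary of the linear readout (assumed instantiation)."""
    logits = W @ h
    top = int(np.argmax(logits))
    margins = [
        (logits[top] - logits[j]) / np.linalg.norm(W[top] - W[j])
        for j in range(W.shape[0]) if j != top
    ]
    return min(margins)                 # perturbations below this keep the argmax
```

A large bound indicates a firmly committed prediction; a small one flags instability that normalized probabilities (and hence perplexity) can mask.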
https://arxiv.org/abs/2602.10816
Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and multi-class Macro-F1 and Micro-F1 by 30% and 18%, respectively, compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.
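The ORPO objective used for the student models can be sketched via its odds-ratio term, shown here in isolation on average per-token log-probabilities; in the full method this term is added to the standard NLL loss with a weighting coefficient.

```python
import math

def orpo_odds_ratio_loss(logp_chosen, logp_rejected):
    """ORPO's odds-ratio term: penalizes the model unless the chosen
    (here, CWE-consistent) reasoning is more likely than the rejected one.
    Inputs are average per-token log-probabilities in (-inf, 0)."""
    def log_odds(logp):
        p = math.exp(logp)
        return math.log(p / (1.0 - p))
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-ratio)))   # -log sigmoid(ratio)
```

The loss is log 2 when the model is indifferent and shrinks as the taxonomy-aligned reasoning becomes relatively more likely.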
https://arxiv.org/abs/2602.10787
Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Code and models will be released.
https://arxiv.org/abs/2602.10717
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called the action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also yields superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
https://arxiv.org/abs/2602.10698
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, fully validating its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at this https URL.
https://arxiv.org/abs/2602.10675
Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R2 of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
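The "negligible overhead" claim follows directly from the predictor's form: with binary masks, a logistic probe reduces to a weighted sum of the active channel bits plus a bias. A sketch with assumed, purely illustrative weights:

```python
import numpy as np

def mask_probe(masks, w, b):
    """Logistic probe on DynCP's binary channel-pruning masks: with 0/1
    masks the linear part collapses to a weighted sum of active channels
    plus a bias, so auxiliary estimates (e.g. VAD) cost almost nothing.
    Weights w and bias b are assumed to have been fit offline."""
    z = masks @ w + b                  # weighted sum over mask bits
    return 1.0 / (1.0 + np.exp(-z))   # per-frame probability (e.g. speech)
```

The masks are already computed by the SE model for pruning, so the probe adds only one dot product per frame per task.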
https://arxiv.org/abs/2602.10666
Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at this https URL.
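The scheduler that gradually opens future attention can be sketched as an additive log-space mask interpolating between causal and bidirectional attention. In training, `alpha` would follow the linear schedule, with the gradient-guided pre-warmup choosing its starting point; the details below are assumed from the abstract.

```python
import numpy as np

def soft_attention_mask(seq_len, alpha):
    """Additive log-space mask for the causal-to-bidirectional transition:
    past/self positions are always visible, strictly-future positions are
    admitted with weight alpha in [0, 1] (sketch based on the abstract)."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1 on strictly-future cells
    weights = (1.0 - future) + alpha * future            # 1 on past/self, alpha on future
    with np.errstate(divide="ignore"):
        return np.log(weights)                           # -inf where alpha == 0
```

At alpha = 0 this is the standard causal mask the pretrained decoder expects; at alpha = 1 it is fully bidirectional, which is why a gradual schedule stabilizes the transition.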
https://arxiv.org/abs/2602.10622
Molecular understanding is central to advancing areas such as scientific discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, a design originally intended for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient finetuning, and achieves state-of-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.
https://arxiv.org/abs/2602.02742