Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: this https URL
https://arxiv.org/abs/2601.16212
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
https://arxiv.org/abs/2601.16211
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolutions.
https://arxiv.org/abs/2601.16210
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: this http URL
https://arxiv.org/abs/2601.16207
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
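To make the idea concrete, the sketch below shows one way such a training objective could look in PyTorch: a standard task loss on factual inputs plus a weighted term that pushes the model's predictions on pre-computed counterfactuals toward their target labels. The architecture, the weight `lam`, and the assumption that plausible, actionable counterfactuals are supplied by an external generator are all illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small tabular classifier used only for illustration."""
    def __init__(self, d_in, d_hidden=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.head(z), z

def counterfactual_training_step(model, opt, x, y, x_cf, y_cf, lam=0.1):
    """One step: cross-entropy on factual inputs plus a term that pulls the
    model's decision on pre-computed counterfactuals toward their target
    labels (a simplified stand-in for the divergence term in the paper)."""
    logits, _ = model(x)
    logits_cf, _ = model(x_cf)
    task_loss = nn.functional.cross_entropy(logits, y)
    cf_loss = nn.functional.cross_entropy(logits_cf, y_cf)
    loss = task_loss + lam * cf_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```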
https://arxiv.org/abs/2601.16205
Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at this https URL
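A rough sketch of the value-guided test-time planning described above, assuming a `sample_trajectory` callable that returns an action sequence together with the value decoded from the model's value latent frame; the name and the candidate count are illustrative, not the released API.

```python
import numpy as np

def plan_with_values(sample_trajectory, n_candidates=8):
    """Sample several candidate action trajectories from the policy and
    execute the one with the highest predicted value (expected return)."""
    candidates = [sample_trajectory() for _ in range(n_candidates)]
    actions, values = zip(*candidates)
    return actions[int(np.argmax(values))]
```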
https://arxiv.org/abs/2601.16163
Keyword Spotting (KWS) systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible by the compact model architecture. A subset of input samples is selected at runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63% accuracy on clean data and maintaining robust performance (exceeding 94% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
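As an illustration of one stage of such a denoising pipeline, the sketch below implements plain magnitude spectral subtraction with NumPy. The frame sizes, noise-estimation window, and spectral floor are illustrative choices rather than the paper's settings, and the wavelet stage is omitted.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=10, n_fft=512, hop=256, floor=0.02):
    """Minimal magnitude spectral subtraction: estimate the noise spectrum from
    the first few frames, subtract it from each frame's magnitude, and
    resynthesize with the noisy phase."""
    frames = [noisy[i:i + n_fft] for i in range(0, len(noisy) - n_fft, hop)]
    spectra = [np.fft.rfft(f * np.hanning(n_fft)) for f in frames]
    mags, phases = np.abs(spectra), np.angle(spectra)
    noise_mag = mags[:noise_frames].mean(axis=0)            # noise estimate
    clean_mag = np.maximum(mags - noise_mag, floor * mags)  # subtract with a floor
    out = np.zeros(len(noisy))
    for k, (m, p) in enumerate(zip(clean_mag, phases)):     # overlap-add resynthesis
        out[k * hop:k * hop + n_fft] += np.fft.irfft(m * np.exp(1j * p))
    return out
```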
https://arxiv.org/abs/2601.16158
The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
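A minimal sketch of the coarse, frame-level selection step, assuming CLIP-style frame and text embeddings; the scoring rule and the number of kept frames are illustrative rather than the FFSM's actual design.

```python
import torch

def select_key_frames(frame_feats, text_feat, k=4):
    """Score each frame embedding against the query embedding and keep the
    top-k frames, removing temporal redundancy before fine-grained matching.
    frame_feats: (T, D) frame embeddings; text_feat: (D,) text embedding."""
    f = torch.nn.functional.normalize(frame_feats, dim=-1)
    t = torch.nn.functional.normalize(text_feat, dim=-1)
    scores = f @ t                               # (T,) cosine similarities
    idx = scores.topk(k).indices.sort().values   # keep temporal order
    return frame_feats[idx], idx
```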
https://arxiv.org/abs/2601.16155
Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
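The snippet below sketches one reading of the full-to-full schedule: harmony tokens stay fully masked for an initial phase and are then progressively unmasked. The schedule shape and step counts are assumptions for illustration, not the paper's hyperparameters.

```python
import torch

def ff_mask_ratio(step, full_steps=1000, anneal_steps=4000):
    """Full-to-full schedule: every harmony token stays masked for the first
    `full_steps` updates, then the masked fraction decays linearly to zero."""
    if step < full_steps:
        return 1.0
    return max(0.0, 1.0 - (step - full_steps) / anneal_steps)

def mask_harmony(harmony_tokens, ratio, mask_id=0):
    """Mask a random subset of harmony positions at the given ratio."""
    mask = torch.rand_like(harmony_tokens, dtype=torch.float) < ratio
    return torch.where(mask, torch.full_like(harmony_tokens, mask_id), harmony_tokens)
```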
https://arxiv.org/abs/2601.16150
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
https://arxiv.org/abs/2601.16148
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
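To illustrate the tournament mechanics, here is a simplified pairwise-rating loop. It uses an Elo-style update as a stand-in for Glicko2 (which additionally tracks rating deviation and volatility), and the `judge` callable abstracts the aggregated decision of the eight judges; all names and constants are illustrative.

```python
from itertools import combinations

def elo_update(r_a, r_b, score_a, k=32):
    """Elo-style update: score_a is 1.0 if template a's question won,
    0.0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def run_tournament(templates, judge, ratings=None):
    """templates: list of prompt-template ids; judge(a, b) returns the
    aggregated outcome for the pair (e.g., averaged over several judges)."""
    ratings = ratings or {t: 1500.0 for t in templates}
    for a, b in combinations(templates, 2):
        outcome = judge(a, b)
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
    return ratings
```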
https://arxiv.org/abs/2601.16134
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
https://arxiv.org/abs/2601.16108
We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA, where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scales quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
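The following is a deliberately naive O(N^2) sketch of SPH-style neighborhood perception (kernel-weighted aggregation of neighbor states); the paper's memory-efficient CUDA kernels avoid exactly this quadratic cost, and the kernel choice and bandwidth here are illustrative.

```python
import torch

def poly6_kernel(r, h):
    """Standard poly6 SPH smoothing kernel (3D normalization constant)."""
    return (315.0 / (64.0 * torch.pi * h ** 9)) * (h ** 2 - r ** 2).clamp(min=0.0) ** 3

def sph_perceive(positions, states, h=0.2):
    """Each particle aggregates its neighbours' states with kernel weights.
    positions: (N, 3) particle positions; states: (N, D) internal states."""
    diff = positions[:, None, :] - positions[None, :, :]   # (N, N, 3) offsets
    dist = diff.norm(dim=-1)                                # (N, N) distances
    w = poly6_kernel(dist, h)                               # (N, N) kernel weights
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)            # normalize per particle
    return w @ states                                       # (N, D) perceived state
```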
https://arxiv.org/abs/2601.16096
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
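A minimal sketch of the kind of first- and second-order update rules described above, with illustrative gains; the per-turn VAD estimator itself is not shown.

```python
import numpy as np

def first_order_update(state, obs, alpha=0.2):
    """Exponential smoothing of the VAD state toward the turn-level estimate."""
    return (1 - alpha) * state + alpha * obs

def second_order_update(state, velocity, obs, alpha=0.2, beta=0.8):
    """Momentum-style update: the velocity term adds affective inertia, so the
    state keeps drifting after the stimulus changes (hysteresis)."""
    velocity = beta * velocity + alpha * (obs - state)
    return state + velocity, velocity

# toy usage: a neutral start nudged by one turn's VAD estimate
state = np.zeros(3)              # (valence, arousal, dominance)
obs = np.array([0.6, 0.3, 0.1])  # memoryless per-turn estimate
state = first_order_update(state, obs)
```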
https://arxiv.org/abs/2601.16087
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in task-irrelevant regions, which we describe as 'distracting tokens'. This behavior can disturb the generation of the desired action tokens at each step, lowering task success rates. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: this https URL.
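A hypothetical sketch of what attention-based pruning of distracting image tokens could look like at inference time; the scoring rule (mean attention received from instruction tokens) and the keep ratio are assumptions, not DTP's exact criterion.

```python
import torch

def prune_distracting_tokens(image_tokens, attn_from_text, keep_ratio=0.75):
    """Score each image token by the attention it receives from instruction
    tokens and keep only the top fraction, dropping likely distracting tokens.
    image_tokens: (N_img, D); attn_from_text: (N_text, N_img) attention weights."""
    scores = attn_from_text.mean(dim=0)                  # (N_img,) relevance scores
    k = max(1, int(keep_ratio * image_tokens.shape[0]))
    keep = scores.topk(k).indices.sort().values          # preserve token order
    return image_tokens[keep], keep
```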
https://arxiv.org/abs/2601.16065
Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they lack support for multiple proposals, human interaction, and cross-modality adaptation. Recently, text-to-image diffusion models have shown potential to bridge this gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting for the target organ. Our experiments on organ segmentation from CT images demonstrate strong performance compared to previous methods, and the approach could greatly benefit from an expert-in-the-loop setting that leverages multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
https://arxiv.org/abs/2601.16060
Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
https://arxiv.org/abs/2601.16046
Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at this https URL.
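The sketch below shows embedding-based one-shot exemplar selection and one possible prompt layout; the prompt wording, schema handling, and variable names are illustrative, not the paper's templates.

```python
import numpy as np

def pick_exemplar(query_emb, exemplar_embs, exemplars):
    """Embedding-based one-shot exemplar selection: cosine similarity between
    the user question and a pool of (question, Cypher) pairs."""
    q = query_emb / np.linalg.norm(query_emb)
    e = exemplar_embs / np.linalg.norm(exemplar_embs, axis=1, keepdims=True)
    return exemplars[int(np.argmax(e @ q))]

def build_prompt(question, exemplar, schema):
    """Assemble a one-shot Text2Cypher prompt (wording is illustrative)."""
    ex_q, ex_cypher = exemplar
    return (
        f"Graph schema:\n{schema}\n\n"
        f"Example question: {ex_q}\nExample Cypher: {ex_cypher}\n\n"
        f"Question: {question}\nCypher:"
    )
```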
https://arxiv.org/abs/2601.16038
The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
https://arxiv.org/abs/2601.16027
The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT's performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
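A minimal PyTorch sketch of the decoupling idea: the Transformer consumes only observation/action tokens, and only the most recent return-to-go conditions the action head. The fusion by concatenation, the token merging, and all sizes are illustrative simplifications rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class DecoupledDT(nn.Module):
    """Backbone sees only observation/action history; the latest RTG is fused
    into the action head instead of being fed as a token sequence."""
    def __init__(self, obs_dim, act_dim, d_model=128, n_layers=3, n_heads=4):
        super().__init__()
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.act_embed = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.rtg_embed = nn.Linear(1, d_model)
        self.action_head = nn.Linear(2 * d_model, act_dim)

    def forward(self, obs_seq, act_seq, latest_rtg):
        # obs_seq: (B, T, obs_dim), act_seq: (B, T, act_dim), latest_rtg: (B, 1)
        tokens = self.obs_embed(obs_seq) + self.act_embed(act_seq)  # (B, T, d)
        h = self.backbone(tokens)[:, -1]                            # last position
        g = self.rtg_embed(latest_rtg)                              # (B, d)
        return self.action_head(torch.cat([h, g], dim=-1))          # next action
```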
https://arxiv.org/abs/2601.15953