Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: this https URL
https://arxiv.org/abs/2601.16212
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: this http URL
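The core intervention can be sketched as blending an affinity matrix into one layer's visual-token attention. This is a minimal illustrative sketch, not IVRA's actual implementation: the function name, the mixing coefficient `beta`, and the convex-combination form are assumptions; the real method selects a specific instance-level LM layer and derives the affinity from the model's built-in vision encoder.

```python
def inject_affinity(attn, affinity, beta=0.5):
    """IVRA-style inference-time edit (sketch): blend a language-model
    layer's visual-token attention with vision-encoder affinity scores,
    keeping all model weights frozen. beta is an illustrative mixing
    coefficient, not a value from the paper."""
    n = len(attn)
    mixed = [[(1 - beta) * attn[i][j] + beta * affinity[i][j]
              for j in range(n)] for i in range(n)]
    # renormalize each row so it remains a valid attention distribution
    return [[v / sum(row) for v in row] for row in mixed]

attn = [[0.9, 0.1], [0.5, 0.5]]   # toy attention over two visual tokens
aff = [[0.5, 0.5], [0.1, 0.9]]    # toy encoder affinities
out = inject_affinity(attn, aff, beta=0.5)
```

Because the edit happens at inference time on a single layer's attention, no parameters change and no retraining is needed.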
https://arxiv.org/abs/2601.16207
State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
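The fixed prompt schedule can be sketched as deterministically cycling skeleton hints across the k=16 samples. This is an illustrative sketch: the skeleton strings below are common Lean tactics chosen as stand-ins, not the paper's actual list of 15, and the prompt format is hypothetical.

```python
# Illustrative stand-ins; the paper's schedule covers 15 common skeletons.
SKELETONS = ["intro h", "simp", "nlinarith", "norm_num", "ring_nf"]

def scheduled_prompts(theorem_stmt, k=16):
    """Cycle deterministically through the skeletons across the k samples,
    prefixing each generation attempt with one skeleton hint, so the same
    sample budget covers diverse proof structures."""
    return [
        f"{theorem_stmt}\n-- try tactic skeleton: {SKELETONS[i % len(SKELETONS)]}"
        for i in range(k)
    ]

prompts = scheduled_prompts("theorem t : 1 + 1 = 2 := by", k=16)
```

Each prompt is then sampled once from the unchanged prover, so the intervention costs nothing beyond the original k samples and 1024-token budget.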
https://arxiv.org/abs/2601.16172
Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
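The full-to-full schedule can be sketched as a masking function of the training step: all harmony tokens stay masked during a warmup phase, after which the visible fraction grows until entire sequences are unmasked. This is a minimal sketch under assumed names; the warmup length, linear growth, and uniform token selection are illustrative choices, not the paper's exact schedule.

```python
import random

MASK = "<mask>"

def ff_mask(harmony, step, warmup_steps, total_steps):
    """Full-to-full curriculum (illustrative sketch): keep every harmony
    token masked during warmup, then linearly grow the unmasked fraction
    until the whole sequence is visible at the end of training."""
    if step < warmup_steps:
        keep = 0.0
    else:
        keep = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    n_keep = int(round(keep * len(harmony)))
    visible = set(random.sample(range(len(harmony)), n_keep))
    return [tok if i in visible else MASK for i, tok in enumerate(harmony)]
```

Forcing the model to predict harmony with no harmony context early on is what pushes it to attend to the melody instead.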
https://arxiv.org/abs/2601.16150
Computing the conditional mode of a distribution, better known as the $\mathit{maximum\ a\ posteriori}$ (MAP) assignment, is a fundamental task in probabilistic inference. However, MAP estimation is generally intractable, and remains hard even under many common structural constraints and approximation schemes. We introduce $\mathit{probably\ approximately\ correct}$ (PAC) algorithms for MAP inference that provide provably optimal solutions under variable and fixed computational budgets. We characterize tractability conditions for PAC-MAP using information theoretic measures that can be estimated from finite samples. Our PAC-MAP solvers are efficiently implemented using probabilistic circuits with appropriate architectures. The randomization strategies we develop can be used either as standalone MAP inference techniques or to improve on popular heuristics, fortifying their solutions with rigorous guarantees. Experiments confirm the benefits of our method in a range of benchmarks.
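One way to make the PAC flavor of such guarantees concrete is the classic sampling bound: if each draw lands in the top-ε probability mass independently, then n draws miss it with probability at most (1-ε)^n ≤ δ. The sketch below illustrates that arithmetic only; it is not the paper's algorithm (which uses probabilistic circuits and information-theoretic tractability conditions), and all names here are hypothetical.

```python
import math
import random

def pac_sample_count(eps, delta):
    """Samples needed so that, with probability >= 1 - delta, at least
    one i.i.d. draw lands in the top-eps probability mass:
    solve (1 - eps)^n <= delta for n."""
    return math.ceil(math.log(1.0 / delta) / -math.log(1.0 - eps))

def pac_map(sampler, score, eps=0.05, delta=0.01):
    """Randomized MAP sketch: draw the PAC-prescribed number of samples
    and return the highest-scoring one."""
    n = pac_sample_count(eps, delta)
    return max((sampler() for _ in range(n)), key=score)

# Toy categorical distribution as a stand-in for a learned model.
probs = {"a": 0.7, "b": 0.2, "c": 0.1}
outcomes, weights = list(probs), list(probs.values())
draw = lambda: random.choices(outcomes, weights=weights)[0]
best = pac_map(draw, score=probs.get, eps=0.3, delta=0.01)
```

The same "fortify a heuristic" idea applies by scoring heuristic solutions alongside the random draws.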
https://arxiv.org/abs/2601.16083
Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world scenarios. Regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
https://arxiv.org/abs/2601.16079
Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
https://arxiv.org/abs/2601.16073
The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
https://arxiv.org/abs/2601.16027
Current Earth observation foundation models are architecturally rigid, struggle with heterogeneous sensors and are constrained to fixed patch sizes. This limits their deployment in real-world scenarios requiring flexible compute-accuracy trade-offs. We propose THOR, a "compute-adaptive" foundation model that solves both input heterogeneity and deployment rigidity. THOR is the first architecture to unify data from Copernicus Sentinel-1, -2, and -3 (OLCI & SLSTR) satellites, processing their native 10 m to 1000 m resolutions in a single model. We pre-train THOR with a novel randomized patch and input image size strategy. This allows a single set of pre-trained weights to be deployed at inference with any patch size, enabling a dynamic trade-off between computational cost and feature resolution without retraining. We pre-train THOR on THOR Pretrain, a new, large-scale multi-sensor dataset and demonstrate state-of-the-art performance on downstream benchmarks, particularly in data-limited regimes like the PANGAEA 10% split, validating that THOR's flexible feature generation excels for diverse climate and society applications.
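The randomized-size pre-training strategy can be sketched as drawing, at each step, an input image size and a compatible patch size, so one set of weights later serves any compute-accuracy operating point. The specific sizes below are illustrative assumptions, not values from the paper.

```python
import random

def sample_patch_config(img_sizes=(128, 224, 320), patch_sizes=(8, 16, 32)):
    """Randomized patch/input-size pre-training step (sketch): draw an
    image size, then a patch size that tiles it evenly. Token count, and
    hence compute, varies across steps, teaching the model to work at
    any resolution the deployer picks at inference."""
    img = random.choice(img_sizes)
    patch = random.choice([p for p in patch_sizes if img % p == 0])
    n_tokens = (img // patch) ** 2
    return img, patch, n_tokens

img, patch, n = sample_patch_config()
```

At deployment, choosing a large patch gives cheap, coarse features; a small patch gives expensive, fine ones, with no retraining in between.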
https://arxiv.org/abs/2601.16011
The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT's performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
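The decoupling can be sketched at the interface level: the Transformer backbone consumes only observation and action sequences, and the single most recent RTG conditions the action head. The toy stand-ins below are hypothetical placeholders for the learned modules, just to show where the RTG enters.

```python
def ddt_action(obs_seq, act_seq, latest_rtg, backbone, head):
    """Decoupled DT interface (sketch): no RTG tokens appear in the
    Transformer's input sequence; the return-to-go conditions the
    prediction exactly once, at the action head."""
    h = backbone(obs_seq, act_seq)   # sequence model sees states/actions only
    return head(h, latest_rtg)       # RTG enters once, at the head

# Toy stand-ins for the learned backbone and head.
backbone = lambda obs, act: sum(obs) + sum(act)
head = lambda h, rtg: h * rtg
a = ddt_action([1.0, 2.0], [0.5], latest_rtg=2.0, backbone=backbone, head=head)
```

Dropping the per-timestep RTG tokens shortens the sequence, which is where the computational saving comes from.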
https://arxiv.org/abs/2601.15953
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and this http URL. Results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
https://arxiv.org/abs/2601.15914
Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once at policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for resource-constrained robot control tasks with real-time requirements.
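The instantiate-once, run-fast pattern can be sketched as a hypernetwork that maps a text embedding to policy weights, after which control is a cheap forward pass with no language in the loop. Everything below is a toy illustration: the seeded random projection stands in for the learned hypernetwork, and the single linear layer stands in for the generated policy network.

```python
import random

def hypernetwork(text_emb, obs_dim, act_dim):
    """Sketch of TeNet's instantiation step: map a text embedding to the
    flat weights of a small policy. A seeded random projection is a toy
    stand-in for the learned embedding-to-weights mapping."""
    random.seed(int(sum(text_emb) * 1e6) & 0xFFFF)  # deterministic per text
    return [[random.gauss(0.0, 0.1) for _ in range(obs_dim)]
            for _ in range(act_dim)]

def policy(W, obs):
    """The instantiated controller: pure state-to-action arithmetic,
    cheap enough for high-frequency control loops."""
    return [sum(w * o for w, o in zip(row, obs)) for row in W]

W = hypernetwork(text_emb=[0.1, -0.3, 0.7], obs_dim=4, act_dim=2)
action = policy(W, [1.0, 0.0, 0.5, -0.2])
```

The LLM and hypernetwork run once per task description; only `policy` runs at control frequency.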
https://arxiv.org/abs/2601.15912
In this paper we propose the Iterative Amortized Hierarchical Variational Autoencoder (IA-HVAE), which expands on amortized inference with a hybrid scheme containing an initial amortized guess and iterative refinement with decoder gradients. We achieve this by creating a linearly separable decoder in a transform domain (e.g. Fourier space), enabling real-time applications with very high model depths. The architectural change leads to a 35x speed-up for iterative inference with respect to the traditional HVAE. We show that our hybrid approach outperforms fully amortized and fully iterative equivalents in accuracy and speed respectively. Moreover, the IA-HVAE shows improved reconstruction quality over a vanilla HVAE in inverse problems such as deblurring and denoising.
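The hybrid scheme can be sketched as a two-phase loop: an amortized encoder provides the starting latent, then decoder gradients refine it. The toy quadratic objective and crude encoder below are illustrative assumptions standing in for the learned modules.

```python
def hybrid_inference(x, encoder, decoder_grad, steps=50, lr=0.1):
    """IA-HVAE-style hybrid inference (sketch): start from the amortized
    guess, then take gradient steps on the reconstruction objective."""
    z = encoder(x)                       # phase 1: amortized initial guess
    for _ in range(steps):
        z = z - lr * decoder_grad(z, x)  # phase 2: iterative refinement
    return z

# Toy problem: identity decoder, loss (z - x)^2, gradient 2 (z - x).
encoder = lambda x: 0.0                  # deliberately crude amortized guess
grad = lambda z, x: 2.0 * (z - x)
z_hat = hybrid_inference(3.0, encoder, grad)
```

The amortized guess keeps the number of refinement steps small, while the refinement corrects the amortization gap; the paper's linearly separable transform-domain decoder is what makes each gradient step cheap at high model depths.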
https://arxiv.org/abs/2601.15894
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experimental results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
https://arxiv.org/abs/2601.07315
Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does not arise from insufficient model capacity, but from treating post-training inference systems as monolithic operators while ignoring internal structures formed during learning. We show that gradient update events in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from their initialization distribution after training. As a result, post-training inference systems are structurally non-uniform and inherently decomposable. Based on this observation, we introduce a post-training statistical criterion and a structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures. This work establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.
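The statistical criterion can be sketched as a per-entry test against the initialization distribution: entries that training never moved outside the range expected under, say, a zero-mean Gaussian init are treated as unsupported and removed. The z-score threshold and Gaussian-init assumption below are illustrative; the paper's actual criterion and annealing procedure are more elaborate.

```python
def prune_unsupported(weights, init_std, z=3.0):
    """Post-training pruning sketch: zero out entries statistically
    indistinguishable from a N(0, init_std^2) initialization, keeping
    only the dependencies gradient updates actually established."""
    thresh = z * init_std
    return [[w if abs(w) > thresh else 0.0 for w in row] for w, row in
            ((None, row) for row in weights) for row in [row]] if False else \
           [[w if abs(w) > thresh else 0.0 for w in row] for row in weights]

W = [[0.01, 0.9], [-0.02, -1.2]]        # toy trained matrix, init_std = 0.02
W_pruned = prune_unsupported(W, init_std=0.02)
```

The zeros that emerge are what expose independent substructures: rows and columns with no surviving cross-dependencies can be evaluated in parallel.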
https://arxiv.org/abs/2601.15871
An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
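The test-time intervention can be sketched as adding a scaled "English pivot" direction to a metric model's hidden states. This is a minimal sketch under stated assumptions: in practice the pivot direction would be estimated (e.g. from mean differences between English and target-language activations) and applied inside specific layers; the names and the additive form here are illustrative.

```python
def steer(hidden, pivot_dir, alpha=0.1):
    """Activation steering sketch: nudge a hidden state toward an
    English-pivot direction with strength alpha, leaving the metric
    model's weights untouched."""
    return [h + alpha * p for h, p in zip(hidden, pivot_dir)]

# Toy 3-dim hidden state and hypothetical pivot direction.
steered = steer([0.2, -0.1, 0.4], pivot_dir=[1.0, 0.0, -1.0], alpha=0.1)
```

Because the intervention is applied only at test time, the same frozen metric can be steered per-language without retraining.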
https://arxiv.org/abs/2601.15809
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
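The plug-and-play verify-refine loop can be sketched as: generate an answer, score it against the rubrics, and feed the critique back until the verifier accepts or the round budget runs out. The toy agent and verifier below are hypothetical stand-ins for the policy model and DeepVerifier.

```python
def self_evolve(agent, verifier, question, max_rounds=3):
    """Inference-time scaling of verification (sketch): iterative
    bootstrapping driven by rubric-based feedback, with no training."""
    answer = agent(question, feedback=None)
    for _ in range(max_rounds):
        ok, critique = verifier(question, answer)
        if ok:
            break
        answer = agent(question, feedback=critique)   # refine with feedback
    return answer

# Toy stand-ins: the agent corrects itself once it receives a critique.
agent = lambda q, feedback: "42" if feedback else "41"
verifier = lambda q, a: (a == "42", "check the arithmetic")
final = self_evolve(agent, verifier, "What is 6 * 7?")
```

The asymmetry of verification is what makes this pay off: judging an answer against rubrics is easier than producing it, so a modest verifier can lift a stronger policy model.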
https://arxiv.org/abs/2601.15808
Tabular data is a fundamental form of data structure. The evolution of table analysis tools reflects humanity's continuous progress in data acquisition, management, and processing. Dynamic changes in table columns arise from technological advancements, changing needs, data integration, etc. However, the standard process of training AI models on tables with fixed columns and then performing inference is not suitable for handling dynamically changing tables. Therefore, new methods are needed for efficiently handling such tables in an unsupervised manner. In this paper, we introduce a new task, Tabular Incremental Inference (TabII), which aims to enable trained models to incorporate new columns during the inference stage, enhancing the practicality of AI models in scenarios where tables change dynamically. Furthermore, we demonstrate that this new task can be framed as an optimization problem based on the information bottleneck theory, which emphasizes that the key to an ideal tabular incremental inference approach lies in minimizing mutual information between tabular data and representation while maximizing it between representation and task labels. Under this guidance, we design a TabII method with Large Language Model placeholders and Pretrained TabAdapter to provide external knowledge and Incremental Sample Condensation blocks to condense the task-relevant information given by incremental column attributes. Experimental results across eight public datasets show that TabII effectively utilizes incremental attributes, achieving state-of-the-art performance.
https://arxiv.org/abs/2601.15751
We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.
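To make the "hardware-efficient quantization" ingredient concrete, here is a minimal per-tensor symmetric quantization sketch of the kind a low-bit deployment suite builds on. This is a generic textbook scheme, assumed for illustration only; it is not FlexLLM's actual quantizer (which targets a specific hardware-efficient format and surpasses SpinQuant):

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Per-tensor symmetric quantization: map floats to signed ints
    in [-2^(bits-1), 2^(bits-1) - 1] with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax          # largest value maps to qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.7, -0.7], dtype=np.float32)
q, scale = quantize_symmetric(x, bits=4)    # 4-bit codes, one fp scale
x_hat = dequantize(q, scale)                # reconstruction for accuracy checks
```

In an accelerator, the int codes feed cheap fixed-point datapaths while the scale is applied once per tensor; perplexity numbers like the 12.68 WikiText-2 PPL quoted above measure how little accuracy such low-bit reconstruction costs end to end.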
https://arxiv.org/abs/2601.15710
Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the "Spiral of Hallucination," where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.
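The System 1 / System 2 split can be sketched as a confidence-gated step: every step records verbalized confidence and its explanation (UAM), but costly reflection (UAR) fires only below an uncertainty threshold. Function names, the threshold, and the dict layout are illustrative assumptions, not the paper's interface:

```python
def dual_process_step(answer, confidence, explanation,
                      threshold=0.6, reflect=None):
    """One agent step under a dual-process uncertainty gate.

    System 1 (UAM-like): always carry confidence + explanation forward,
    so downstream steps never decide blindly.
    System 2 (UAR-like): escalate to targeted reflection only when
    verbalized confidence is low, using the explanation as the cue.
    """
    entry = {"answer": answer,
             "confidence": confidence,
             "explanation": explanation,
             "reflected": False}
    if confidence < threshold and reflect is not None:
        # Targeted, conditional correction: avoids the "continuous or
        # aimless" reflection failure mode by gating on uncertainty.
        entry["answer"] = reflect(answer, explanation)
        entry["reflected"] = True
    return entry

# Confident step: no reflection is triggered, execution stays cheap.
e = dual_process_step("Paris", 0.9, "high-confidence recall")
```

Gating on the *explanation*-bearing signal, rather than a bare scalar score, is what lets the reflection step be targeted instead of a blanket re-generation.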
https://arxiv.org/abs/2601.15703