Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
https://arxiv.org/abs/2602.10105
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
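The sequence-level alignment idea can be sketched as a simple loss: the per-step latent actions, summed over a clip, are anchored to the feature difference between the clip's last and first frames under a frozen encoder. This is a minimal numpy sketch of that control-effect anchoring; the function name and the cosine form are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def seq_delta_alignment_loss(latent_actions, feat_first, feat_last):
    """Anchor the integrated latent action to the clip-level feature change.

    latent_actions : (T, D) array of per-step latent actions
    feat_first     : (D,) frozen-encoder feature of the first frame
    feat_last      : (D,) frozen-encoder feature of the last frame
    """
    integrated = latent_actions.sum(axis=0)  # cumulative "effect" of the actions
    target = feat_last - feat_first          # the observable semantic effect
    # cosine-style alignment: ~0 when aligned, ~2 when opposite
    cos = integrated @ target / (np.linalg.norm(integrated) * np.linalg.norm(target) + 1e-8)
    return 1.0 - cos

# toy check: actions whose sum equals the feature delta give (near-)zero loss
rng = np.random.default_rng(0)
f0, f1 = rng.normal(size=8), rng.normal(size=8)
acts = np.tile((f1 - f0) / 4.0, (4, 1))  # 4 steps that sum exactly to the delta
print(round(seq_delta_alignment_loss(acts, f0, f1), 6))  # 0.0
```

Because the anchor is a feature *difference* from a frozen self-supervised encoder, scene-specific content shared by both frames cancels out, which is what gives the latent actions a shared coordinate system across contexts.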
https://arxiv.org/abs/2602.10104
Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations -- yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the \emph{Invariant Subspace Necessity} theorem and derive the \emph{Self-Reference Property}: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation across eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides \textbf{a principled architectural explanation} for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
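The Self-Reference Property amounts to using a class token's own embedding direction as a zero-shot linear probe: score each instance by its cosine similarity to the token direction, no labels or learned probe needed. The sketch below demonstrates this on synthetic vectors standing in for transformer representations; the clustered toy data is an assumption for illustration.

```python
import numpy as np

def zero_shot_scores(instance_vecs, class_token_vecs):
    """Score instances against class directions by cosine similarity.

    Under the Self-Reference Property, a class token's own embedding
    direction serves as the probe for its feature -- no labeled data.
    """
    X = instance_vecs / np.linalg.norm(instance_vecs, axis=1, keepdims=True)
    C = class_token_vecs / np.linalg.norm(class_token_vecs, axis=1, keepdims=True)
    return X @ C.T  # (num_instances, num_classes) cosine similarities

# toy example: instances clustered around two synthetic "token directions"
rng = np.random.default_rng(1)
tok = rng.normal(size=(2, 16))
inst = np.vstack([tok[0] + 0.1 * rng.normal(size=(5, 16)),
                  tok[1] + 0.1 * rng.normal(size=(5, 16))])
pred = zero_shot_scores(inst, tok).argmax(axis=1)
print(pred)  # first five assigned class 0, last five class 1
```

The same scoring rule is what a trained linear probe would converge to if the theorem holds, which is the sense in which the framework unifies probe-based and autoencoder-based interpretability.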
https://arxiv.org/abs/2602.09783
Previous Vision-Language-Action models face critical limitations in navigation: scarce, insufficiently diverse data from labor-intensive collection, and static representations that fail to capture temporal dynamics and physical laws. We propose NavDreamer, a video-based framework for 3D navigation that leverages generative video models as a universal interface between language instructions and navigation trajectories. Our main hypothesis is that video's ability to encode spatiotemporal information and physical dynamics, combined with internet-scale availability, enables strong zero-shot generalization in navigation. To mitigate the stochasticity of generative predictions, we introduce a sampling-based optimization method that utilizes a VLM for trajectory scoring and selection. An inverse dynamics model is employed to decode executable waypoints from generated video plans for navigation. To systematically evaluate this paradigm across several video model backbones, we introduce a comprehensive benchmark covering object navigation, precise navigation, spatial grounding, language control, and scene reasoning. Extensive experiments demonstrate robust generalization across novel objects and unseen environments, with ablation studies revealing that navigation's high-level decision-making nature makes it particularly suited for video-based planning.
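The sampling-based optimization reduces to a best-of-N loop: sample several candidate plans from the stochastic generator, score each, keep the winner. A minimal sketch, where `generate` and `score` are stand-ins for the video model and the VLM trajectory scorer (both names and the sample count are assumptions):

```python
import random

def select_best_plan(generate, score, n_samples=8, seed=0):
    """Best-of-N selection: draw candidate plans from a stochastic
    generator and keep the one the scorer prefers.

    `generate` and `score` are stand-ins for the video model and the
    VLM-based trajectory scorer described in the abstract.
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n_samples)]
    return max(candidates, key=score)

# toy stand-ins: a "plan" is a scalar; the scorer prefers values near 0.5
best = select_best_plan(lambda rng: rng.random(),
                        lambda p: -abs(p - 0.5))
print(best)
```

In the full pipeline the selected video plan would then pass through the inverse dynamics model to recover executable waypoints; the seeded RNG here just makes the toy selection deterministic.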
https://arxiv.org/abs/2602.09765
We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to an isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling the generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
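The uncertainty-to-timestep idea can be sketched with the standard DDPM forward marginal: pick a starting step from the estimated uncertainty, then noise the coarse estimate as x_t = sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) eps so it lands in the right marginal. The linear uncertainty-to-t mapping below is an illustrative assumption, not the paper's actual selection rule.

```python
import numpy as np

def adaptive_start(x0_est, uncertainty, alphas_cumprod, rng):
    """Pick a reverse-diffusion starting step from estimated uncertainty,
    then noise the coarse estimate to match the marginal at that step.

    Higher uncertainty -> later (noisier) starting timestep, so the
    diffusion prior dominates recovery of unreliable regions.
    """
    T = len(alphas_cumprod)
    t = int(np.clip(uncertainty, 0.0, 1.0) * (T - 1))  # simple linear mapping
    a_bar = alphas_cumprod[t]
    eps = rng.normal(size=x0_est.shape)
    x_t = np.sqrt(a_bar) * x0_est + np.sqrt(1.0 - a_bar) * eps
    return t, x_t

# toy linear beta schedule: alpha_bar decays from ~1 (clean) toward 0 (noise)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
a_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
t, x_t = adaptive_start(np.zeros((4, 4)), uncertainty=0.3, alphas_cumprod=a_bar, rng=rng)
print(t)  # 299
```

From step `t` the usual reverse diffusion chain would run to completion; the key point is that a well-placed start lets the pre-trained prior override imperfect upstream estimates.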
https://arxiv.org/abs/2602.09510
A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models could address this by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single-slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natural datasets, using voxel-based metrics and point cloud distance metrics. Across medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground truth medical 3D data, while alternative models are more prone to over-simplification of reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
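The two metric families in this benchmark are standard: a voxel-overlap (Dice) score on occupancy grids and a symmetric Chamfer distance on point clouds. A self-contained sketch of both, evaluated on a toy sphere and a shifted copy (the specific shapes and resolutions are assumptions for illustration):

```python
import numpy as np

def dice(a, b):
    """Voxel-overlap (Dice) score between two boolean occupancy grids."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

def chamfer(p, q):
    """Symmetric Chamfer distance between two (N, 3) point clouds."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# toy volumes: a voxelized sphere vs. a slightly shifted copy
g = np.mgrid[0:32, 0:32, 0:32]
vol_a = ((g - 15) ** 2).sum(axis=0) < 8 ** 2
vol_b = np.roll(vol_a, 2, axis=0)
print(round(dice(vol_a, vol_b), 3))

# point-cloud view of the same volumes (subsampled occupied voxels)
p = np.argwhere(vol_a)[::50].astype(float)
q = np.argwhere(vol_b)[::50].astype(float)
print(round(chamfer(p, q), 2))
```

The contrast the paper reports falls out of this distinction: Dice punishes any depth misplacement voxel-by-voxel, while Chamfer-style global distances still reward topologically plausible shapes, which is why the two metric families separate the models differently.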
https://arxiv.org/abs/2602.09407
Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
https://arxiv.org/abs/2602.09165
The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: this https URL
https://arxiv.org/abs/2602.09017
Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.
https://arxiv.org/abs/2602.02841
World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.
https://arxiv.org/abs/2602.08968
Query expansion with large language models is promising but often relies on hand-crafted prompts, manually chosen exemplars, or a single LLM, making it non-scalable and sensitive to domain shift. We present an automated, domain-adaptive QE framework that builds in-domain exemplar pools by harvesting pseudo-relevant passages using a BM25-MonoT5 pipeline. A training-free cluster-based strategy selects diverse demonstrations, yielding strong and stable in-context QE without supervision. To further exploit model complementarity, we introduce a two-LLM ensemble in which two heterogeneous LLMs independently generate expansions and a refinement LLM consolidates them into one coherent expansion. Across TREC DL20, DBPedia, and SciFact, the refined ensemble delivers consistent and statistically significant gains over BM25, Rocchio, zero-shot, and fixed few-shot baselines. The framework offers a reproducible testbed for exemplar selection and multi-LLM generation, and a practical, label-free solution for real-world QE.
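The training-free diverse-demonstration step can be illustrated with farthest-point sampling over passage embeddings, a simple stand-in for the paper's cluster-based strategy: greedily add the passage farthest from everything chosen so far. The toy embedding pool below is an assumption for demonstration.

```python
import numpy as np

def select_diverse_exemplars(embeddings, k):
    """Training-free diverse demonstration selection via farthest-point
    sampling over pseudo-relevant passage embeddings.

    A simple stand-in for the paper's cluster-based selection strategy;
    returns k indices into `embeddings`.
    """
    X = np.asarray(embeddings, dtype=float)
    chosen = [0]  # seed with the first passage
    for _ in range(k - 1):
        # distance of every passage to its nearest already-chosen passage
        d = np.linalg.norm(X[:, None] - X[chosen][None], axis=-1)
        chosen.append(int(d.min(axis=1).argmax()))
    return chosen

# toy pool: three well-separated clusters of pseudo-relevant passages
rng = np.random.default_rng(1)
pool = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 4)) for c in (0.0, 5.0, 10.0)])
idx = select_diverse_exemplars(pool, k=3)
print(sorted(i // 20 for i in idx))  # one exemplar drawn from each cluster: [0, 1, 2]
```

Either selection scheme serves the same goal stated in the abstract: demonstrations that cover the in-domain exemplar pool rather than redundantly repeating one region of it.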
https://arxiv.org/abs/2602.08917
Recent works have indicated redundancy across transformer blocks, prompting research on depth compression to prune less crucial blocks. However, current entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning better preserves performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across individual layers. To pursue better model compression and acceleration, this paper proposes \textbf{FlattenGPT}, a novel way to detect and reduce depth-wise redundancies. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective detection and removal of parameter redundancy. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT enhances model efficiency with only a modest performance trade-off. It outperforms existing pruning methods in both zero-shot accuracies and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90-96\% of zero-shot performance at a compression ratio of 20\%. It also outperforms other pruning methods in accelerating LLM inference, making it promising for enhancing the efficiency of transformers.
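Real transformer blocks are nonlinear, so FlattenGPT's flattening must be learned rather than computed in closed form; but the linear special case gives the core intuition for why merging two adjacent layers into one need not discard what either layer encodes. A minimal sketch of that special case (the function and shapes are illustrative, not the paper's procedure):

```python
import numpy as np

def flatten_linear_pair(W1, b1, W2, b2):
    """Collapse two stacked affine layers into a single equivalent one.

    For y = W2 (W1 x + b1) + b2, the composition is exactly
    y = (W2 W1) x + (W2 b1 + b2): depth halves, information is preserved.
    """
    W = W2 @ W1           # composed weight
    b = W2 @ b1 + b2      # composed bias
    return W, b

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 4)), rng.normal(size=6)
W2, b2 = rng.normal(size=(3, 6)), rng.normal(size=3)
x = rng.normal(size=4)
W, b = flatten_linear_pair(W1, b1, W2, b2)
deep = W2 @ (W1 @ x + b1) + b2
flat = W @ x + b
print(np.allclose(deep, flat))  # True
```

Entire-block pruning, by contrast, corresponds to dropping W1 and b1 outright, which is exactly the "discarding meaningful cues" failure mode the abstract describes.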
https://arxiv.org/abs/2602.08858
Reliable identification of anatomical body regions is a prerequisite for many automated medical imaging workflows, yet existing solutions remain heavily dependent on unreliable DICOM metadata. Current solutions mainly use supervised learning, which limits their applicability in many real-world scenarios. In this work, we investigate whether body region detection in volumetric CT and MR images can be achieved in a fully zero-shot manner by using knowledge embedded in large pre-trained foundation models. We propose and systematically evaluate three training-free pipelines: (1) a segmentation-driven rule-based system leveraging pre-trained multi-organ segmentation models, (2) a Multimodal Large Language Model (MLLM) guided by radiologist-defined rules, and (3) a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. All methods are evaluated on 887 heterogeneous CT and MR scans with manually verified anatomical region labels. The segmentation-driven rule-based approach achieves the strongest and most consistent performance, with weighted F1-scores of 0.947 (CT) and 0.914 (MR), demonstrating robustness across modalities and atypical scan coverage. The MLLM performs competitively in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations.
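The best-performing pipeline, the segmentation-driven rule-based system, reduces to a lookup: run a pre-trained multi-organ segmentation model, then map the set of detected organ labels to body regions. The organ-to-region table below is a small illustrative subset, not the radiologist-defined rule set used in the paper.

```python
def detect_body_regions(organs_present):
    """Rule-based body-region detection from multi-organ segmentation output.

    `organs_present` is the set of organ labels a pre-trained segmentation
    model found in the scan. The mapping here is a toy subset for
    illustration only.
    """
    rules = {
        "brain": "head", "thyroid": "neck",
        "lung": "chest", "heart": "chest",
        "liver": "abdomen", "kidney": "abdomen",
        "bladder": "pelvis", "femur": "legs",
    }
    regions = {rules[o] for o in organs_present if o in rules}
    return sorted(regions)

print(detect_body_regions({"lung", "heart", "liver"}))  # ['abdomen', 'chest']
```

Because the rules operate on explicit anatomical evidence rather than raw pixels, the approach is fully training-free and degrades gracefully on atypical scan coverage, matching the robustness the evaluation reports.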
https://arxiv.org/abs/2602.08717
Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at this https URL.
https://arxiv.org/abs/2602.08582
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of the phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing the Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20\% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at this https URL.
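GRE means that rotating every input phase by a global angle theta rotates the output phases by the same theta. Any layer that modulates the complex phase stream only through *modulus-based* terms satisfies this by construction, which is the design principle behind the paper's modulus-based information exchange. A toy check of the property (the specific gating layer is an illustrative assumption):

```python
import numpy as np

def modulus_gated(z, weights):
    """A toy phase-stream layer built only from modulus-based operations.

    The complex input is scaled by a real-valued function of magnitudes,
    never by a phase-dependent term, so Global Rotation Equivariance
    holds: rotating the input phase by theta rotates the output by theta.
    """
    gate = np.tanh(weights @ np.abs(z))  # real-valued, depends on |z| only
    return gate * z

rng = np.random.default_rng(0)
z = rng.normal(size=8) + 1j * rng.normal(size=8)
W = rng.normal(size=(8, 8))
theta = 0.7
lhs = modulus_gated(np.exp(1j * theta) * z, W)  # rotate input, then apply layer
rhs = np.exp(1j * theta) * modulus_gated(z, W)  # apply layer, then rotate output
print(np.allclose(lhs, rhs))  # True
```

A generic real-valued network acting on stacked (Re, Im) channels would break this identity, which is the flat-Euclidean failure mode the abstract argues against.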
https://arxiv.org/abs/2602.08556
Modern large language models (LLMs) are often evaluated and deployed under a \emph{one-shot, greedy} inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically under-estimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce \emph{Reinforcement Inference}, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance \emph{without any retraining}. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72\% to 84.03\%, while only incurring 61.06\% additional inference calls. A 100\% re-asking ablation reaches 84.35\%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a \emph{prompt-only} ablation underperforms the baseline, suggesting that the gains are not explained by generic `` your output had high entropy, think step-by-step'' prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader \emph{entropy-aware} paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness--confidence alignment.
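The control strategy itself is compact: measure the entropy of the model's output distribution on the first greedy pass, and only spend a second, more deliberate call when the entropy exceeds a threshold. A minimal sketch, where `first_pass`/`second_pass` stand in for the two decoding calls and the threshold value is illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reinforcement_inference(first_pass, second_pass, probs, threshold=1.0):
    """Entropy-aware inference-time control: accept the greedy answer when
    the model was confident; otherwise invoke a second deliberate attempt.

    Returns (answer, number_of_inference_calls). The threshold and the
    two callables are illustrative stand-ins.
    """
    if entropy(probs) <= threshold:
        return first_pass(), 1
    return second_pass(), 2

# confident case: sharply peaked distribution -> keep the one-shot answer
ans, calls = reinforcement_inference(lambda: "A", lambda: "B", [0.97, 0.01, 0.01, 0.01])
print(ans, calls)  # A 1

# uncertain case: near-uniform distribution -> second deliberate attempt
ans, calls = reinforcement_inference(lambda: "A", lambda: "B", [0.25, 0.25, 0.25, 0.25])
print(ans, calls)  # B 2
```

This selectivity is what separates the reported 61.06% extra calls from the 100% re-asking ablation: most of the attainable gain comes from re-asking only the high-entropy cases.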
https://arxiv.org/abs/2602.08520
We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical design choice that has been previously overlooked, which significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause for performance difference. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model achieves or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.
https://arxiv.org/abs/2602.08430
Bimanual manipulation is essential for robots to execute complex tasks, yet remains challenging, requiring coordinated collaboration between two arms. However, existing methods for bimanual manipulation often rely on costly data collection and training, struggling to generalize to unseen objects in novel categories efficiently. In this paper, we present Bi-Adapt, a novel framework designed for efficient generalization for bimanual manipulation via semantic correspondence. Bi-Adapt achieves cross-category affordance mapping by leveraging the strong capability of vision foundation models. Fine-tuning with restricted data on novel categories, Bi-Adapt exhibits notable generalization to out-of-category objects in a zero-shot manner. Extensive experiments conducted in both simulation and real-world environments validate the effectiveness of our approach and demonstrate its high efficiency, achieving a high success rate on different benchmark tasks across novel categories with limited data. Project website: this https URL
https://arxiv.org/abs/2602.08425
Realizing versatile and human-like performance in high-demand sports like badminton remains a formidable challenge for humanoid robotics. Unlike standard locomotion or static manipulation, this task demands a seamless integration of explosive whole-body coordination and precise, timing-critical interception. While recent advances have achieved lifelike motion mimicry, bridging the gap between kinematic imitation and functional, physics-aware striking without compromising stylistic naturalness is non-trivial. To address this, we propose Imitation-to-Interaction, a progressive reinforcement learning framework designed to evolve a robot from a "mimic" to a capable "striker." Our approach establishes a robust motor prior from human data, distills it into a compact, model-based state representation, and stabilizes dynamics via adversarial priors. Crucially, to overcome the sparsity of expert demonstrations, we introduce a manifold expansion strategy that generalizes discrete strike points into a dense interaction volume. We validate our framework through the mastery of diverse skills, including lifts and drop shots, in simulation. Furthermore, we demonstrate the first zero-shot sim-to-real transfer of anthropomorphic badminton skills to a humanoid robot, successfully replicating the kinetic elegance and functional precision of human athletes in the physical world.
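One way to read the manifold expansion strategy is as densification of sparse demonstrations: each discrete strike point from expert data is expanded into a cloud of perturbed targets, turning a handful of points into a dense interaction volume for training. The Gaussian perturbation model and its scale below are assumptions, not the paper's exact scheme.

```python
import numpy as np

def expand_strike_manifold(demo_points, n_per_point=50, scale=0.05, seed=0):
    """Expand sparse demonstrated strike points into a dense interaction
    volume by sampling perturbations around each demonstration.

    demo_points : (K, 3) xyz interception targets from expert data
    Returns a (K * n_per_point, 3) array of dense training targets.
    """
    rng = np.random.default_rng(seed)
    demos = np.asarray(demo_points, dtype=float)
    noise = rng.normal(scale=scale, size=(len(demos), n_per_point, 3))
    return (demos[:, None, :] + noise).reshape(-1, 3)

# three demonstrated interception points -> 150 dense training targets
demos = [[0.3, 0.1, 1.2], [0.4, -0.2, 1.5], [0.5, 0.0, 1.8]]
dense = expand_strike_manifold(demos)
print(dense.shape)  # (150, 3)
```

In the reinforcement learning stage, these densified targets would serve as interception goals, so the policy learns to strike anywhere in the volume rather than only at the exact demonstrated points.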
https://arxiv.org/abs/2602.08370
Memory mechanisms are a core component of LLM-based agents, enabling reasoning and knowledge discovery over long-horizon contexts. Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory) with tightly coupled retrieval methods that hinder cross-paradigm generalization and fusion. In this work, we take a first step toward unifying heterogeneous memory paradigms within a single memory system. We propose MemAdapter, a memory retrieval framework that enables fast alignment across agent memory paradigms. MemAdapter adopts a two-stage training strategy: (1) training a generative subgraph retriever from the unified memory space, and (2) adapting the retriever to unseen memory paradigms by training a lightweight alignment module through contrastive learning. This design improves the flexibility of memory retrieval and substantially reduces alignment cost across paradigms. Comprehensive experiments on three public evaluation benchmarks demonstrate that the generative subgraph retriever consistently outperforms five strong agent memory systems across three memory paradigms and agent model scales. Notably, MemAdapter completes cross-paradigm alignment within 13 minutes on a single GPU, achieving superior performance over the original memory retrievers with less than 5% of the training compute. Furthermore, MemAdapter enables effective zero-shot fusion across memory paradigms, highlighting its potential as a plug-and-play solution for agent memory systems.
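The contrastive training signal for the lightweight alignment module can be sketched with a standard InfoNCE loss: embeddings of the same memory item under two paradigms form positive pairs, all other rows are negatives. The numpy sketch below is illustrative of this signal, not MemAdapter's actual module.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.1):
    """Contrastive (InfoNCE) loss for aligning two memory paradigms.

    Row i of `queries` (e.g. an unseen paradigm's memory embedding) should
    match row i of `keys` (the unified memory space) against all others.
    """
    Q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    K = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (Q @ K.T) / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 32))
aligned = info_nce(K + 0.01 * rng.normal(size=K.shape), K)   # matched pairs
shuffled = info_nce(rng.permutation(K), K)                   # mismatched pairs
print(aligned < shuffled)  # True: aligned pairs yield much lower loss
```

Because only this small module is trained while the generative subgraph retriever stays frozen, the alignment cost stays low, consistent with the reported 13-minute single-GPU adaptation.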
https://arxiv.org/abs/2602.08369