Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability still remains limited. In this work, we build upon Reward Feedback Learning (ReFL) and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward computation. Specifically, the video latent, together with the camera pose, is decoded into 3D Gaussians. In this process, the camera pose not only acts as an input but also serves as a projection parameter. Misalignment between the video latent and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and the ground-truth ones as the reward. To accommodate the stochastic nature of video generation, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: \href{this https URL}{CamPilot Page}.
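The visibility-gated rendering reward described above can be sketched minimally as follows (the per-pixel L1 error, the function name, and the mask handling are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def masked_photometric_reward(rendered, target, visibility):
    """Negative mean pixel error, computed only over visible regions.

    rendered, target: (H, W, 3) float arrays in [0, 1].
    visibility: (H, W) bool mask of deterministic regions (e.g. derived
    via geometric warping); stochastic regions are excluded.
    """
    per_pixel = np.abs(rendered - target).mean(axis=-1)  # per-pixel L1
    if not visibility.any():
        return 0.0
    return -float(per_pixel[visibility].mean())  # higher = better aligned

rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))
vis = np.ones((4, 4), dtype=bool)
vis[:, :2] = False  # left half treated as stochastic and ignored
print(masked_photometric_reward(target, target, vis))  # zero penalty when aligned
```

A misaligned latent-pose pair distorts the 3D Gaussians and blurs the render, driving this reward down; the mask prevents penalizing regions the model could not have predicted deterministically.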
https://arxiv.org/abs/2601.16214
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on top of the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
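Dimension-dependent noise scheduling, which the analysis finds remains critical, typically shifts timesteps toward higher noise for higher-dimensional latents. A hypothetical sketch (the square-root shift factor and the base dimension are assumptions for illustration, not the paper's schedule):

```python
import math

def shifted_timestep(t, dim, base_dim=4096):
    """Map a nominal timestep t in [0, 1] toward higher noise levels
    for higher-dimensional latents; alpha = sqrt(dim / base_dim) is an
    illustrative shift factor, not the paper's exact schedule."""
    alpha = math.sqrt(dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# Higher-dimensional latents see more noise at the same nominal t;
# the endpoints t = 0 and t = 1 are preserved.
print(shifted_timestep(0.5, 4096))   # 0.5: no shift at the base dimension
print(shifted_timestep(0.5, 16384))  # > 0.5
```

The intuition is that destroying information in a larger latent requires proportionally more noise, so the schedule is warped while keeping its endpoints fixed.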
https://arxiv.org/abs/2601.16208
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we show that the Feature Cosine Similarity Bound (FCSB) derived from FS can be tightened by enlarging the Gaussian robustness score defined on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks demonstrate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
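Feature-space smoothing can be sketched as a Monte-Carlo average of encoder features under Gaussian input noise; the toy linear encoder and all constants below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def smoothed_features(encoder, x, sigma=0.25, n=256, seed=0):
    """Monte-Carlo estimate of the smoothed encoder: average the
    encoder's features over Gaussian perturbations of the input."""
    rng = np.random.default_rng(seed)
    feats = [encoder(x + sigma * rng.standard_normal(x.shape)) for _ in range(n)]
    return np.mean(feats, axis=0)

# Toy linear "encoder": under a small l2-bounded perturbation delta,
# the smoothed clean and adversarial features remain highly similar.
W = np.random.default_rng(1).standard_normal((8, 16))
encoder = lambda x: W @ x
x = np.ones(16)
delta = 0.05 * np.random.default_rng(2).standard_normal(16)
f_clean = smoothed_features(encoder, x)
f_adv = smoothed_features(encoder, x + delta)
print(cosine(f_clean, f_adv))  # close to 1
```

The certified guarantee in the paper bounds exactly this cosine similarity analytically for the smoothed encoder; the sketch only illustrates the smoothing operator itself.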
https://arxiv.org/abs/2601.16200
Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, limiting the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results on zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at this https URL.
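The Circular Latent Encoding idea rests on replacing zero-padding with wrap-around padding along the ERP width, so the two seam borders become neighbors. A minimal sketch (numpy's `wrap` mode stands in for the padding inside the VAE's convolutions):

```python
import numpy as np

def circular_pad_width(latent, pad):
    """Pad an ERP latent circularly along the width axis so the left and
    right borders see each other, in place of the zero-padding that
    produces seam artifacts at the panorama boundary."""
    return np.pad(latent, ((0, 0), (0, 0), (pad, pad)), mode="wrap")

lat = np.arange(12, dtype=float).reshape(1, 3, 4)  # (C, H, W) latent
padded = circular_pad_width(lat, 1)
print(padded[0, 0])  # [3. 0. 1. 2. 3. 0.]: the borders wrap around
```

With wrap padding, a convolution at column 0 reads real content from the last column instead of zeros, which is exactly the continuity a 360° panorama has physically.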
https://arxiv.org/abs/2601.16192
The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
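A minimal stand-in for the key-frame selection idea in FFSM is greedy diversity-based sampling; the criterion below is an assumption for illustration, not the module's actual mechanism:

```python
import numpy as np

def select_key_frames(frame_feats, k):
    """Greedily pick k frames: start from frame 0, then repeatedly add
    the frame least similar to those already chosen, discarding
    temporally redundant frames."""
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    chosen = [0]
    while len(chosen) < k:
        sims = feats @ feats[chosen].T       # cosine sims to chosen frames
        redundancy = sims.max(axis=1)        # similarity to nearest pick
        redundancy[chosen] = np.inf          # never re-pick a frame
        chosen.append(int(redundancy.argmin()))
    return sorted(chosen)

# Six frames forming two near-duplicate groups (0-2 and 3-5).
f = np.array([[1, 0], [1, 0.01], [1, 0.02], [0, 1], [0.01, 1], [0.02, 1.0]])
print(select_key_frames(f, 2))  # [0, 3]: one frame per redundant group
```

Dropping near-duplicate frames before fine-grained matching mirrors the macro-perception step: the model first decides where to look before matching entities within the selected frames.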
https://arxiv.org/abs/2601.16155
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
https://arxiv.org/abs/2601.16148
Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
https://arxiv.org/abs/2601.16125
Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly degrade the dynamic model's performance in the low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including Conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for the high and no dropping cases, respectively, with a $33.3\%$ reduction in training time.
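The basic $\mathcal{LD}$ mechanism, independent of the distillation component, can be sketched as a stochastic forward pass (all names and the toy layers are illustrative):

```python
import random

def forward_with_layer_drop(x, layers, keep_ratio, rng):
    """Dynamic forward pass: each layer is kept independently with
    probability keep_ratio, trading accuracy for compute. DLD
    additionally distills the full network into these random
    sub-networks during training (not shown here)."""
    kept = 0
    for layer in layers:
        if rng.random() < keep_ratio:
            x = layer(x)
            kept += 1
    return x, kept

layers = [lambda v: v + 1] * 12     # toy network: each layer adds 1
out, kept = forward_with_layer_drop(0, layers, 0.5, random.Random(0))
print(out, kept)  # out == kept: exactly the executed layers contributed
```

At deployment, `keep_ratio` becomes the knob that matches the model's compute to whatever the edge device can currently afford.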
https://arxiv.org/abs/2601.16117
Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
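The text-segmentation stage of such a pipeline can be sketched as follows (a simplified stand-in for SynthOCR-Gen's actual segmenter; the example string is Perso-Arabic but any Unicode script works):

```python
def segment(text, level="word", n=2):
    """Segment a Unicode text line for synthetic rendering; the levels
    mirror the pipeline's character/word/n-gram options (sentence and
    line splitting would follow the same pattern)."""
    words = text.split()
    if level == "char":
        return [c for w in words for c in w]
    if level == "word":
        return words
    if level == "ngram":
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    raise ValueError(level)

line = "یہ ایک مثال ہے"                # any Unicode script works unchanged
print(len(segment(line, "word")))      # 4 word crops to render
print(len(segment(line, "ngram", 2)))  # 3 overlapping bigrams
```

Each segment is then rendered in multiple fonts and degraded with augmentations to yield (image, text) training pairs without any manual transcription.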
https://arxiv.org/abs/2601.16113
Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate a clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves its feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to seamlessly integrate clustering into the Mamba model in a coherent manner, we design a Learnable Clustering Module that learns the cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
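The core sequence-shortening effect of clustering guidance can be illustrated with simple within-cluster pooling (mean pooling and fixed assignments are assumptions for illustration; the paper learns cluster memberships adaptively):

```python
import numpy as np

def cluster_pooled_sequence(tokens, assign, n_clusters):
    """Shorten a spatial token sequence by mean-pooling tokens within
    each cluster, so the state-space model scans n_clusters tokens
    instead of n."""
    pooled = np.zeros((n_clusters, tokens.shape[1]))
    for c in range(n_clusters):
        pooled[c] = tokens[assign == c].mean(axis=0)
    return pooled

tokens = np.arange(12.0).reshape(6, 2)   # 6 spatial tokens, dim 2
assign = np.array([0, 0, 1, 1, 2, 2])    # hard cluster memberships
short = cluster_pooled_sequence(tokens, assign, 3)
print(short.shape)  # (3, 2): sequence length reduced from 6 to 3
```

Scanning cluster-level tokens instead of every pixel token both shortens the Mamba sequence and groups spectrally similar pixels, which is the intuition behind CSpaMamba.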
https://arxiv.org/abs/2601.16098
We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scales quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
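SPH-style neighborhood perception can be sketched with a compact smoothing kernel over pairwise distances (the poly6-style kernel and the O(n^2) pairwise form below are illustrative; the paper uses memory-efficient CUDA kernels):

```python
import numpy as np

def sph_aggregate(positions, states, h):
    """Kernel-weighted neighborhood perception: each particle perceives
    a smooth average of neighbor states, weighted by a compact
    poly6-style kernel with support radius h."""
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    w = np.maximum(h * h - d2, 0.0) ** 3   # zero outside the support radius
    w /= w.sum(axis=1, keepdims=True)      # normalize per particle
    return w @ states

pos = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
state = np.array([[1.0], [3.0], [10.0]])
out = sph_aggregate(pos, state, h=1.0)
print(out[:, 0])  # particle 0 blends 1.0 and 3.0; particle 2 only sees itself
```

Because the kernel has compact support, each particle's perception depends only on nearby particles, which is what lets a neighbor-search structure replace the dense pairwise computation at scale.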
https://arxiv.org/abs/2601.16096
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss designs. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on the GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
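Residual vector quantization, the mechanism behind the discrete mask tokens, can be sketched in a few lines (toy codebooks for illustration; SAMTok's actual encoder, codebook sizes, and dimensions differ):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: quantize x with the first codebook, then quantize
    the remaining residual with the next, yielding one discrete token
    per stage."""
    codes, residual = [], x.astype(float).copy()
    for cb in codebooks:
        idx = int(((cb - residual) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

coarse = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
fine = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([10.0, 1.0])               # = coarse[1] + fine[2]
codes, res = rvq_encode(x, [coarse, fine])
print(codes, res)  # [1, 2] [0. 0.]: two tokens reconstruct x exactly
```

Two stages yield exactly two discrete indices per mask embedding, which is what lets an LLM emit a mask as a short token pair inside an ordinary text stream.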
https://arxiv.org/abs/2601.16093
Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under the frequent occlusions of real-world scenarios. Regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses the motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
https://arxiv.org/abs/2601.16079
Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
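Mutual knowledge distillation between the two model scales can be sketched as a symmetric pair of softened-prediction losses (a common formulation, assumed here for illustration; the paper's exact objectives may differ):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mutual_distill_losses(fm_logits, client_logits, T=2.0):
    """Symmetric distillation pair on shared (synthetic) samples: the
    client is pulled toward the foundation model's softened predictions
    and vice versa, via forward/reverse KL divergence."""
    p, q = softmax(fm_logits / T), softmax(client_logits / T)
    kl = lambda a, b: float((a * np.log(a / b)).sum(axis=-1).mean())
    return kl(p, q), kl(q, p)  # (loss for client, loss for foundation model)

agree = mutual_distill_losses(np.array([[2.0, 0.0, 0.0]]),
                              np.array([[2.0, 0.0, 0.0]]))
print(agree)  # (0.0, 0.0): no update signal when the models agree
```

Running this exchange on generated images rather than a real public dataset is what lets the framework avoid sharing any real data between server and clients.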
https://arxiv.org/abs/2601.16073
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in task-irrelevant regions, which we describe as 'distracting tokens'. This behavior can disturb the model's generation of the desired action tokens at each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate, as well as explore the performance upper bound of the model, without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: this https URL.
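A hypothetical pruning criterion in the spirit of DTP drops the task-irrelevant image tokens that receive the most attention (the paper's actual detection rule may differ):

```python
import numpy as np

def prune_distracting_tokens(attn, task_mask, drop_ratio=0.25):
    """Keep all task-relevant tokens; among task-irrelevant ones, drop
    those receiving the most attention.

    attn: (n,) attention mass each image token receives.
    task_mask: (n,) bool, True = token lies in the task-relevant region.
    Returns indices of tokens to keep."""
    n_drop = int(drop_ratio * len(attn))
    score = np.where(task_mask, -np.inf, attn)  # relevant tokens undroppable
    drop = set(np.argsort(score)[-n_drop:]) if n_drop else set()
    return [i for i in range(len(attn)) if i not in drop]

attn = np.array([0.1, 0.4, 0.3, 0.2])
mask = np.array([True, False, False, True])  # tokens 1, 2 are irrelevant
print(prune_distracting_tokens(attn, mask, 0.25))  # [0, 2, 3]: drops token 1
```

Removing only high-attention irrelevant tokens redirects attention mass toward the task region without touching the model's weights or inputs, which is why the framework stays plug-and-play.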
https://arxiv.org/abs/2601.16065
Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both the architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54+/-1.26% in IoU and 0.98+/-0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from the known domain, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
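Phase-aware supervision can be illustrated by comparing the Fourier phase of a prediction and its target, since phase encodes structure and boundaries while magnitude encodes contrast (this specific loss form is an assumption for illustration, not Phi-SegNet's exact loss):

```python
import numpy as np

def phase_loss(pred, target, eps=1e-8):
    """Compare the Fourier phase of prediction and target: normalize
    both spectra to unit magnitude so only phase differences remain."""
    fp, ft = np.fft.fft2(pred), np.fft.fft2(target)
    up = fp / (np.abs(fp) + eps)
    ut = ft / (np.abs(ft) + eps)
    return float(np.abs(up - ut).mean())

img = np.random.default_rng(0).random((16, 16))
print(phase_loss(img, img))                            # 0.0: identical phase
print(phase_loss(img, 2.0 * img) < 1e-4)               # True: phase ignores gain
print(phase_loss(img, np.roll(img, 4, axis=0)) > 0.5)  # True: shifts move phase
```

The invariance to global intensity scaling and the sensitivity to spatial displacement are precisely why phase is a useful supervisory signal for boundary-accurate segmentation.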
https://arxiv.org/abs/2601.16064
Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. As a result, they lack the ability to produce multiple proposals, support human interaction, or adapt across modalities. Recently, text-to-image diffusion models have shown potential to bridge this gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, that steers a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting for the target organ. Our experiments on organ segmentation from CT images demonstrate strong performance compared to previous methods, and the approach could greatly benefit from an expert-in-the-loop setting that leverages multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
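The key property of ControlNet-style conditioning is that the injected branch is zero-initialized, so at the start of training the conditioned model behaves exactly like the frozen pre-trained one. A minimal sketch of that mechanism (the class name and linear form are illustrative assumptions, not ProGiDiff's architecture):

```python
import numpy as np

class ZeroConvAdapter:
    """ControlNet-style conditioning sketch: a zero-initialized projection
    injects condition-encoder features into a frozen backbone. At
    initialization the adapter contributes nothing, so training starts
    from the pre-trained model's behavior and learns the conditioning
    pathway gradually."""

    def __init__(self, dim):
        self.w = np.zeros((dim, dim))  # zero-init => no effect at start

    def __call__(self, backbone_feat, cond_feat):
        return backbone_feat + cond_feat @ self.w

adapter = ZeroConvAdapter(4)
x = np.ones((2, 4))                        # frozen-backbone features
c = np.arange(8, dtype=float).reshape(2, 4)  # image-condition features
print(np.array_equal(adapter(x, c), x))    # True: identity at init
```

This zero-start design is what makes it practical to steer a large pre-trained generator with limited medical data, since the conditioning branch cannot destabilize the backbone early in training.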
https://arxiv.org/abs/2601.16060
Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points and improving intention alignment by 96.4%. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
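The contact-then-grasp token ordering can be made concrete with a small sketch. The token names, vocabulary, and helper below are hypothetical, invented for illustration; the point is only the causal layout the abstract describes (contacts resolved before hand configuration):

```python
def build_grasp_sequence(contacts, grasp_tokens):
    """Hypothetical token layout for contact-then-grasp autoregression:
    each contact is a (finger_link_id, surface_patch_id) pair emitted
    before the hand-configuration tokens, so a model trained on this
    order must commit to 'which link touches where' before predicting
    joint values. Partial contact specification = prefixing the sequence
    with user-chosen contact tokens and letting the model continue."""
    seq = ["<BOS>"]
    for link, patch in contacts:
        seq += [f"<link:{link}>", f"<patch:{patch}>"]
    seq.append("<GRASP>")
    seq += [f"<g:{t}>" for t in grasp_tokens]
    seq.append("<EOS>")
    return seq

print(build_grasp_sequence([(3, 17)], [5, 9]))
# ['<BOS>', '<link:3>', '<patch:17>', '<GRASP>', '<g:5>', '<g:9>', '<EOS>']
```

The steerability claim follows directly from this layout: fixing a prefix of contact tokens constrains every later grasp token through the autoregressive conditioning.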
https://arxiv.org/abs/2601.16046
Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost-effective and tissue-efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross-modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure-first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S-Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks, validating the potential of structure-guided autoregressive modeling.
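The structure-first, next-scale generation order can be sketched as a coarse-to-fine loop: start from a structure-derived map and resolve detail one scale at a time, each prediction conditioned on the upsampled previous one. Everything below is an illustrative assumption (the function name, nearest-neighbor upsampling via `np.kron`, and the requirement that each scale be a multiple of the previous one), not PAINT's actual model:

```python
import numpy as np

def next_scale_synthesis(start_map, scales, refine):
    """Hypothetical next-scale loop: begin from a structure-derived start
    map (the role the 3S-Map plays in the abstract) and generate
    progressively finer outputs, each conditioned on the upsampled
    previous scale. `refine` stands in for the learned per-scale model;
    each scale in `scales` must be a multiple of the current size."""
    canvas = start_map
    for s in scales:
        factor = s // canvas.shape[0]
        up = np.kron(canvas, np.ones((factor, factor)))  # nearest-neighbor upsample
        canvas = refine(up)
    return canvas

# With an identity `refine`, the structural layout propagates unchanged
# through every scale, showing how the start map anchors the synthesis.
start = np.array([[0.0, 1.0], [2.0, 3.0]])
out = next_scale_synthesis(start, [4, 8], refine=lambda x: x)
print(out.shape)  # (8, 8)
```

The anchoring in an observed start map is what makes the process deterministic and spatially aligned, in contrast to sampling the coarsest scale from noise.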
https://arxiv.org/abs/2601.16024
The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO method. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
https://arxiv.org/abs/2601.16020