Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantization. Specifically, the video latent, along with the camera pose, is decoded into 3D Gaussians. In this process, the camera pose not only acts as input but also serves as a projection parameter. Misalignment between the video latent and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth ones as the reward. To accommodate the stochastic nature of generation, we further introduce a visibility term that selectively supervises only the deterministic regions derived via geometric warping. Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: \href{this https URL}{CamPilot Page}.
https://arxiv.org/abs/2601.16214
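The visibility-masked reward described in the abstract above can be sketched in a few lines. This is an illustrative reading only: the function name, the negative-masked-MSE form, and the binary visibility mask as an array are assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_consistency_reward(rendered, target, visibility):
    """Negative masked MSE between rendered and ground-truth views.

    rendered, target: (H, W, 3) float arrays in [0, 1].
    visibility: (H, W) binary mask; 1 marks regions that are
    deterministic under geometric warping and safe to supervise.
    """
    mask = visibility[..., None].astype(float)   # broadcast mask over RGB channels
    n = mask.sum() * rendered.shape[-1] + 1e-8   # number of supervised values
    sq_err = ((rendered - target) ** 2) * mask
    return -sq_err.sum() / n                     # higher reward = better alignment

rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))
vis = np.ones((4, 4))
perfect = masked_consistency_reward(target, target, vis)        # exact match
noisy = masked_consistency_reward(target + 0.1, target, vis)    # uniform 0.1 error
```

With a fully visible mask the reward reduces to plain negative MSE; zeroing parts of `vis` excludes stochastic regions from supervision.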
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, existing ZS-CAR models increasingly ignore visual evidence and overfit to co-occurrence statistics. Consequently, they do not gain the benefit of compositional recognition on unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
https://arxiv.org/abs/2601.16211
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
https://arxiv.org/abs/2601.16208
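The point that dimension-dependent noise scheduling remains critical can be illustrated with a timestep-shift rule. The logit-shift formula and the `base_dim` anchor below are assumptions borrowed from common flow-matching practice (where the shift grows with latent size), not the paper's exact schedule:

```python
import math

def shifted_timestep(t, dim, base_dim=4096):
    """Shift a uniform timestep t in [0, 1] toward higher noise levels
    for higher-dimensional latents.

    Uses the shift rule t' = s*t / (1 + (s - 1)*t) with
    s = sqrt(dim / base_dim). Both the rule and base_dim are
    illustrative assumptions, not the paper's exact recipe.
    """
    s = math.sqrt(dim / base_dim)
    return s * t / (1 + (s - 1) * t)
```

At `dim == base_dim` the schedule is unchanged; larger latent dimensions push every intermediate timestep toward the noisier end while keeping the endpoints 0 and 1 fixed.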
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
我们提出了一种新的训练方法,称为反事实训练(counterfactual training),该方法利用反事实解释来增强模型的解释能力。反事实解释作为一种流行的事后解释方法已经为不透明的机器学习模型广泛使用:它们提供关于现实输入如何需要改变才能使模型产生所需输出的信息。为了在实际决策系统中发挥作用,反事实应该与底层数据相符,并且在特征可变性约束下具有操作性。因此,现有的许多研究都集中在开发能够生成符合这些标准的事后方法上。 然而,在这项工作中,我们直接让模型对其期望的目标负责:反事实训练通过在训练阶段使用反事实来最小化学习表示与合理、可行的解释之间的差异。我们从实证和理论上证明了所提出的方法有助于训练出自然提供具有内在价值的反事实解释的模型,并且这些模型还表现出改进后的对抗鲁棒性。
https://arxiv.org/abs/2601.16205
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we show that the Feature Cosine Similarity Bound (FCSB) derived from FS can be tightened by enlarging the Gaussian robustness score defined on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks demonstrate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
https://arxiv.org/abs/2601.16200
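The feature-space smoothing construction can be sketched as Monte-Carlo averaging of unit-normalized features under Gaussian input noise. The sampling scheme below is a simplified illustration with a toy linear encoder; the certified FCSB itself requires the paper's theoretical analysis, which this sketch does not reproduce:

```python
import numpy as np

def smoothed_encode(encode, x, sigma=0.25, n_samples=64, seed=0):
    """Monte-Carlo estimate of a Gaussian-smoothed feature encoder.

    encode: maps an input vector to a feature vector.
    Returns the unit-normalized mean of unit-normalized features
    under N(0, sigma^2 I) input noise -- a sketch of the smoothing
    construction only, not a certification procedure.
    """
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_samples):
        noise = rng.normal(0.0, sigma, size=x.shape)
        f = encode(x + noise)
        feats.append(f / (np.linalg.norm(f) + 1e-12))
    mean = np.mean(feats, axis=0)
    return mean / (np.linalg.norm(mean) + 1e-12)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy linear "encoder": smoothing keeps the feature direction stable
# under a small l2-bounded input perturbation.
W = np.array([[2.0, 0.0], [0.0, 1.0]])
encode = lambda x: W @ x
x = np.array([1.0, 1.0])
delta = np.array([0.05, -0.05])
sim = cosine(smoothed_encode(encode, x), smoothed_encode(encode, x + delta))
```

The quantity `sim` is exactly what the FCSB lower-bounds: the feature cosine similarity between clean and perturbed inputs after smoothing.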
Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, hindering application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results on zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at this https URL.
https://arxiv.org/abs/2601.16192
Keyword Spotting (KWS) systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process, involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible by the compact model architecture. A subset of input samples is selected during runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63\% accuracy on clean data and maintaining robust performance (exceeding 94\% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
https://arxiv.org/abs/2601.16158
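Of the denoising stages mentioned above, spectral subtraction is simple to sketch. The version below estimates the noise profile from leading frames and applies a spectral floor; this is the textbook form of the technique, not necessarily the paper's exact pipeline:

```python
import numpy as np

def spectral_subtraction(mag, noise_frames=5, alpha=1.0, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from a magnitude
    spectrogram (frames x bins), flooring the result.

    The noise profile is averaged over the first `noise_frames` frames,
    assumed speech-free -- a standard simplification.
    """
    noise_profile = mag[:noise_frames].mean(axis=0)
    cleaned = mag - alpha * noise_profile
    return np.maximum(cleaned, floor * mag)  # spectral floor limits musical noise

# Flat noise floor plus a tone appearing halfway through, in one bin.
noise = np.full((20, 8), 0.5)
signal = np.zeros((20, 8))
signal[10:, 3] = 2.0
mag = noise + signal
out = spectral_subtraction(mag)
```

Noise-only bins collapse to the floor while the signal bin keeps its energy, which is the behavior the multi-stage denoiser relies on before feature extraction.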
The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
https://arxiv.org/abs/2601.16155
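The macro-perception idea behind FFSM, selecting key frames against a textual query, can be approximated with a fixed cosine top-k. The paper's module is learned, so treat this as a conceptual stand-in with assumed shapes:

```python
import numpy as np

def select_key_frames(frame_feats, text_feat, k=3):
    """Pick the k frames whose unit-normalized features best match the
    text query feature. A fixed-similarity stand-in for the learned
    Frame Features Selection Module (FFSM)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = f @ t                       # cosine similarity per frame
    idx = np.argsort(-scores)[:k]        # indices of the k best frames
    return np.sort(idx), scores

frames = np.eye(4)                       # 4 orthogonal toy "frame" features
text = np.array([0.0, 1.0, 0.9, 0.0])    # query overlaps frames 1 and 2
idx, scores = select_key_frames(frames, text, k=2)
```

Dropping the unselected frames is what removes temporal redundancy before the finer patch-level (PFCM) matching stage.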
Modern data systems increasingly operate under conditions of persistent legal, political, and analytic disagreement. In such settings, interoperability cannot rely on shared interpretation, negotiated semantics, or centralized authority. Instead, representations must function as neutral substrates that preserve stable reference across incompatible extensions. This paper investigates the structural constraints imposed on ontological design by this requirement. Building on a neutrality framework that treats interpretive non-commitment and stability under extension as explicit design constraints, we ask what minimal ontological structure is forced if accountability relationships are to remain referable and comparable under disagreement. Minimality here is not mere parsimony: a reduction is admissible only if it does not reintroduce stability-critical distinctions as hidden roles, flags, or contextual predicates. We establish a conditional lower-bound result: any ontology capable of supporting accountability under persistent disagreement must realize at least six distinct identity-and-persistence regimes. We further show that a construction with exactly six such regimes is sufficient to satisfy the stated requirements without embedding causal or normative commitments in the substrate. The result is not a proposal for a universal ontology, but a constraint on what is possible when neutrality and stable reference are treated as non-negotiable design goals.
https://arxiv.org/abs/2601.16152
Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
https://arxiv.org/abs/2601.16150
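The full-to-full (FF) curriculum can be captured by a masking schedule: harmony fully masked for a warmup phase, then progressively unmasked. The linear decay below is an assumption; the abstract specifies only the two phases:

```python
def ff_mask_ratio(step, warmup_steps, unmask_steps):
    """Full-to-full curriculum schedule.

    Keeps every harmony token masked (ratio 1.0) for `warmup_steps`,
    then unmasks over `unmask_steps` until nothing is masked.
    The linear decay is an illustrative choice, not the paper's
    stated schedule.
    """
    if step < warmup_steps:
        return 1.0                                   # full masking phase
    progress = (step - warmup_steps) / max(unmask_steps, 1)
    return max(0.0, 1.0 - progress)                  # progressive unmasking
```

At each step, `ff_mask_ratio` would decide what fraction of harmony tokens to hide before computing the masked-modeling loss.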
The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of others, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
https://arxiv.org/abs/2601.16138
Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
https://arxiv.org/abs/2601.16125
Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly degrade the dynamic model's performance in both low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for the high and no dropping cases, respectively, along with a $33.3\%$ reduction in training time.
https://arxiv.org/abs/2601.16117
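Layer dropping itself reduces to stochastically skipping layers in the forward pass, with skipped layers falling back to the identity so the network runs at any drop rate. A generic sketch (not the DLD distillation recipe, which additionally supervises the dropped model with a teacher):

```python
import numpy as np

def forward_with_layer_drop(x, layers, keep_prob, rng):
    """Run a stack of layers, independently skipping each with
    probability (1 - keep_prob). Skipped layers act as the identity,
    which is what makes the architecture dynamic at runtime."""
    for W in layers:
        if rng.random() < keep_prob:
            x = np.tanh(W @ x)           # toy layer: tanh(W x)
    return x

rng = np.random.default_rng(0)
layers = [np.eye(3) * 0.5 for _ in range(4)]
x = np.ones(3)
full = forward_with_layer_drop(x, layers, keep_prob=1.0, rng=rng)  # no dropping
none = forward_with_layer_drop(x, layers, keep_prob=0.0, rng=rng)  # all dropped
```

Varying `keep_prob` trades accuracy against compute, which is exactly the trade-off the DLD framework aims to flatten across dropping rates.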
We propose a control framework that integrates model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking in the presence of real-world uncertainties. Our approach leverages a model-based controller, comprising a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller, as a reliable base policy. To address the uncertainties of inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we employ a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy via a novel supervised loss. This supervision enables the policy to efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method demonstrates improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.
https://arxiv.org/abs/2601.16109
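The residual composition and oracle supervision can be sketched as follows. The additive scaling and the L2 form of the supervised loss are assumptions, since the abstract specifies only a "novel supervised loss" from the privileged oracle policy:

```python
import numpy as np

def combined_action(base_action, residual_action, scale=0.1):
    """Final command = model-based base action + scaled learned residual."""
    return base_action + scale * residual_action

def oracle_supervision_loss(residual_action, oracle_action, base_action, scale=0.1):
    """Supervised term pulling the combined action toward the oracle's
    action (the oracle sees ground-truth dynamics during training).
    The L2 form here is an illustrative assumption."""
    diff = combined_action(base_action, residual_action, scale) - oracle_action
    return float(diff @ diff)

base = np.array([0.2, -0.1])               # DCM planner + whole-body controller output
oracle = np.array([0.25, -0.1])            # oracle corrects for unmodeled effects
perfect_residual = (oracle - base) / 0.1   # residual that exactly matches the oracle
loss0 = oracle_supervision_loss(perfect_residual, oracle, base)
loss1 = oracle_supervision_loss(np.zeros(2), oracle, base)
```

In training, this term would be added to the RL objective so the residual learns corrective behavior without heavy reward shaping.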
We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scales quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
https://arxiv.org/abs/2601.16096
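A naive, quadratic version of SPH-style neighborhood perception conveys the idea that the CUDA kernels accelerate: each particle aggregates neighbor states weighted by a smooth, compactly supported radial kernel. The polynomial kernel below is illustrative, not the paper's choice:

```python
import numpy as np

def sph_perception(positions, states, h=1.0):
    """Aggregate neighbor states with the radial kernel
    w(r) = max(0, 1 - (r/h)^2)^2, normalized per particle.
    A naive O(n^2) stand-in for memory-efficient SPH operators."""
    diff = positions[:, None, :] - positions[None, :, :]
    r2 = (diff ** 2).sum(-1)                       # pairwise squared distances
    w = np.clip(1.0 - r2 / h**2, 0.0, None) ** 2   # compact support: w = 0 beyond h
    w /= w.sum(axis=1, keepdims=True)              # self-weight keeps rows nonzero
    return w @ states

pos = np.array([[0.0, 0.0], [0.3, 0.0], [5.0, 5.0]])  # third particle is isolated
states = np.array([[1.0], [3.0], [10.0]])
out = sph_perception(pos, states)
```

Because the kernel has compact support, an isolated particle perceives only its own state, which is what lets computation concentrate on active regions.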
Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in the same cluster are closer to each other than to those in other clusters. In this paper, we present a new framework for studying online non-centroid clustering with delays, where elements, which arrive one at a time as points in a finite metric space, must be assigned to clusters, but assignments need not be immediate. Specifically, upon arrival, each point's location is revealed, and an online algorithm has to irrevocably assign it to an existing cluster or create a new one containing, at this moment, only this point. However, we allow decisions to be postponed at a delay cost, instead of following the more common assumption of immediate decisions upon arrival. This poses a critical challenge: the goal is to minimize both the total distance costs between points in each cluster and the overall delay costs incurred by postponing assignments. In the classic worst-case arrival model, where points arrive in an arbitrary order, no algorithm has a competitive ratio better than sublogarithmic in the number of points. To overcome this strong impossibility, we focus on a stochastic arrival model, where points' locations are drawn independently across time from an unknown and fixed probability distribution over the finite metric space. We offer hope beyond worst-case adversaries: we devise an algorithm that is constant competitive in the sense that, as the number of points grows, the ratio between the expected overall costs of the output clustering and an optimal offline clustering is bounded by a constant.
https://arxiv.org/abs/2601.16091
Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under the frequent occlusions of real-world scenarios. Regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses the motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
https://arxiv.org/abs/2601.16079
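Generative masked modeling of the kind MoRo builds on can be loosely illustrated by confidence-based iterative mask filling: masked slots are predicted, and only the most confident predictions are committed each pass. The sketch below is a hypothetical toy; the `nearest_fill` predictor is a stand-in invented here, not the paper's video-conditioned masked transformer.

```python
def masked_decode(tokens, predictor, n_iters=4):
    """Toy confidence-based iterative decoding over masked tokens.

    tokens: list where None marks a masked (e.g. occluded) slot.
    predictor(tokens, i) -> (value, confidence) proposal for slot i.
    Each iteration commits only the highest-confidence half of the
    proposals, mimicking coarse-to-fine generative masked decoding.
    """
    tokens = list(tokens)
    for _ in range(n_iters):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        proposals = [(predictor(tokens, i), i) for i in masked]
        proposals.sort(key=lambda x: -x[0][1])    # most confident first
        keep = max(1, len(proposals) // 2)        # commit top half per pass
        for (val, _conf), i in proposals[:keep]:
            tokens[i] = val
    return tokens

def nearest_fill(tokens, i):
    """Hypothetical predictor: copy the nearest observed token,
    with confidence decaying in the distance to it."""
    obs = [j for j, t in enumerate(tokens) if t is not None]
    j = min(obs, key=lambda j: abs(j - i))
    return tokens[j], 1.0 / (1 + abs(j - i))
```

Running `masked_decode([1, None, None, 4], nearest_fill)` fills both masked slots over two passes while leaving the observed endpoints untouched.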
One of the core advantages of the SE2(3) Lie group framework for navigation modeling lies in the autonomy of error propagation. In a previous paper, a theoretical analysis of the autonomy property of navigation models in the inertial, earth, and world frames was given. A construction method for the SE2(3) group navigation model is proposed to improve the non-inertial navigation model toward full autonomy. This paper serves as a counterpart to that previous paper and conducts real-world strapdown inertial navigation system (SINS)/odometer (ODO) experiments as well as Monte-Carlo simulations to demonstrate the performance of the improved SE2(3) group based high-precision navigation models.
https://arxiv.org/abs/2601.16078
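For readers unfamiliar with the group, an SE2(3) element is the standard 5x5 "extended pose" embedding of attitude, velocity, and position, which closes under matrix multiplication; that closure is what the group-affine error-propagation results rely on. A minimal sketch (not the paper's navigation model):

```python
import numpy as np

def se23_matrix(R, v, p):
    """Embed attitude R (3x3 rotation), velocity v, and position p
    into a 5x5 SE2(3) group element:
        [[R, v, p],
         [0, 1, 0],
         [0, 0, 1]]
    """
    T = np.eye(5)
    T[:3, :3] = R
    T[:3, 3] = v
    T[:3, 4] = p
    return T

def rot_z(theta):
    """Rotation about the z-axis, used here just to build test elements."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Composition stays in the group: the rotation block of a product is
# still a rotation, and the bottom-right 2x2 block stays the identity.
A = se23_matrix(rot_z(0.3), np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0]))
B = se23_matrix(rot_z(-0.1), np.array([0.0, 1.0, 0.0]), np.array([3.0, 0.0, 0.0]))
C = A @ B
```

The product's rotation block is `rot_z(0.2)`, and its velocity/position columns mix according to `R_A v_B + v_A` and `R_A p_B + p_A`, which is exactly the group action a SINS/ODO error-state filter exploits.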
Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
https://arxiv.org/abs/2601.16073
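The mutual (two-way) distillation idea can be sketched as a pair of temperature-softened KL terms computed on a shared sample: one trains the client against the foundation model's distribution, the other feeds client-specific knowledge back. This is a generic illustration under assumed names (`fm_logits`, `client_logits`, temperature `T`), not DSFedMed's actual training objective.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over a 1-D logit vector."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mutual_distillation_losses(fm_logits, client_logits, T=2.0):
    """One step of two-way distillation on a shared (e.g. synthetic) sample:
    the client mimics the foundation model, and vice versa."""
    p_fm = softmax(fm_logits, T)
    p_cl = softmax(client_logits, T)
    loss_client = kl(p_fm, p_cl)    # client learns general knowledge from FM
    loss_fm = kl(p_cl, p_fm)        # FM absorbs client-specific knowledge
    return loss_client, loss_fm
```

Both losses vanish when the two models agree and grow as their predictive distributions diverge, which is the signal each side minimizes against the other.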
Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both the architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, ultrasound (US), histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54+/-1.26% in IoU and 0.98+/-0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from known domains, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
https://arxiv.org/abs/2601.16064
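As background on supervision-level frequency cues, the Fourier phase spectrum encodes the locations of structures and edges, so penalizing phase disagreement between a prediction and its target emphasizes boundaries. A minimal sketch of such a phase loss, assuming simple 2-D arrays and `numpy.fft` (this is a generic illustration, not Phi-SegNet's actual phase-aware loss):

```python
import numpy as np

def phase_loss(pred, target):
    """Mean absolute wrapped phase difference between the 2-D Fourier
    transforms of two maps. Zero iff the spectra agree in phase at
    every frequency; translation-sensitive, which is what makes it
    useful for boundary localization."""
    Fp = np.fft.fft2(pred)
    Ft = np.fft.fft2(target)
    # angle of Fp * conj(Ft) is the per-frequency phase difference,
    # already wrapped into (-pi, pi]
    phase_diff = np.angle(Fp * np.conj(Ft))
    return float(np.mean(np.abs(phase_diff)))
```

Identical inputs give zero loss, while even a one-pixel shift of the same image introduces a linear phase ramp and a clearly nonzero penalty, despite the magnitude spectra being identical.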