Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate for extracting low-level geometry at high resolution in early layers, where attention is expensive without bringing any benefit; attention captures high-level semantics and context more efficiently in low-resolution, deep layers. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid losing spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, yet matches or even outperforms it on a range of tasks and datasets. Code and models are available at: this https URL.
https://arxiv.org/abs/2512.13689
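The abstract does not spell out how PointROPE works, but a natural reading of a "training-free 3D positional encoding" named after RoPE is a rotary scheme applied per coordinate axis. The sketch below is a guess along those lines, not the paper's implementation: each of x, y, z rotates its own block of channel pairs, so encoded dot products depend only on relative offsets (the standard RoPE property, here per axis). The function name and channel layout are invented.

```python
import numpy as np

def point_rope(features, coords, base=10000.0):
    """Hypothetical sketch of a training-free 3D rotary positional
    encoding: each (x, y, z) coordinate rotates a dedicated third of
    the feature channel pairs with a geometric frequency schedule.
    Rotations are orthogonal, so feature norms are preserved, and dot
    products between encoded points depend only on relative offsets."""
    n, d = features.shape
    assert d % 6 == 0, "need channel pairs for each of the 3 axes"
    pairs = d // 6
    out = features.copy()
    for axis in range(3):
        # usual RoPE geometric frequency schedule, one set per axis
        freqs = base ** (-np.arange(pairs) / pairs)
        angles = coords[:, axis:axis + 1] * freqs[None, :]  # (n, pairs)
        cos, sin = np.cos(angles), np.sin(angles)
        lo = axis * 2 * pairs
        a = out[:, lo:lo + pairs].copy()
        b = out[:, lo + pairs:lo + 2 * pairs].copy()
        out[:, lo:lo + pairs] = a * cos - b * sin
        out[:, lo + pairs:lo + 2 * pairs] = a * sin + b * cos
    return out
```

Because every per-pair rotation is orthogonal, the encoding is norm-preserving, and translating the whole point cloud leaves all pairwise dot products unchanged.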
Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
https://arxiv.org/abs/2512.13690
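DiffusionBrowser's preview decoder is learned, but the quantity it would decode at an intermediate timestep rests on a standard diffusion identity: the network's noise estimate yields a current guess of the clean latent. A minimal sketch of that identity (generic DDPM algebra, not the paper's decoder):

```python
import numpy as np

def predicted_clean_latent(x_t, eps_hat, alpha_bar_t):
    """Standard diffusion identity: given the noisy latent
    x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps and the network's noise
    estimate eps_hat, the current clean-latent guess is
    x0_hat = (x_t - sqrt(1 - ab) * eps_hat) / sqrt(ab).
    A lightweight preview decoder can render x0_hat at any timestep
    instead of waiting for denoising to finish."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
```

When the noise estimate is exact, this recovers the clean latent exactly; early in denoising it gives the blurry-but-informative previews the abstract describes.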
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem'' and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) VTP exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to visual tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2% zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pre-training VTP achieves a 65.8\% FID improvement in downstream generation, whereas a conventional autoencoder stagnates very early, at one tenth of the FLOPS. Our pre-trained models are available at this https URL.
https://arxiv.org/abs/2512.13687
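The joint objective VTP describes can be sketched as a weighted sum of a symmetric image-text InfoNCE term and a pixel reconstruction term (the self-supervised term and all weights are placeholders here; the paper's exact losses are not specified in the abstract):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img, txt, tau=0.07):
    # symmetric InfoNCE over L2-normalized image/text embeddings:
    # matched pairs sit on the diagonal of the similarity matrix
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    n = len(img)
    p_img = softmax(logits, axis=1)[np.arange(n), np.arange(n)]
    p_txt = softmax(logits, axis=0)[np.arange(n), np.arange(n)]
    return -0.5 * (np.log(p_img).mean() + np.log(p_txt).mean())

def joint_tokenizer_loss(img_emb, txt_emb, pixels, recon,
                         w_con=1.0, w_rec=1.0):
    """Toy sketch of a jointly optimized tokenizer objective in the
    spirit of VTP: a contrastive term pulls the latent toward
    semantics while a reconstruction term retains pixel fidelity."""
    return (w_con * clip_loss(img_emb, txt_emb)
            + w_rec * np.mean((pixels - recon) ** 2))
```

The contrastive term is what injects "understanding" into the latent space; the reconstruction term alone would reproduce the bias the abstract criticizes.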
Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface-form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach in which texts' surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores. This isolates the effect of semantic information, and we find that models perform similarly to when they use the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We find that image-based transformations add substantial noise, reducing classification accuracy. Our methodology provides a novel way of examining which features influence model predictions, and allows the removal of possible spurious correlations. We find that, using semantic information alone, language-model-based classifiers can still detect AD. This work shows that difficult-to-detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration and opening new pathways for early detection systems.
https://arxiv.org/abs/2512.13685
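The low chrF scores cited above quantify how much the surface form changed. A simplified chrF (character n-gram F-score, recall-weighted with beta = 2 and averaged over n = 1..6, as in the standard metric) can be sketched as follows; production work should use a reference implementation such as sacreBLEU's:

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: per-order character n-gram precision and
    recall are combined with an F-beta that favours recall, then
    averaged over n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        h = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        r = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not h or not r:
            continue
        match = sum((h & r).values())  # multiset intersection of n-grams
        p = match / sum(h.values())
        q = match / sum(r.values())
        if p + q > 0:
            scores.append((1 + beta**2) * p * q / (beta**2 * p + q))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

An aggressive paraphrase scores low even though (by design) a semantic-similarity model would still rate the pair as near-equivalent, which is exactly the dissociation the method exploits.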
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
https://arxiv.org/abs/2512.13684
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL
https://arxiv.org/abs/2512.13683
Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction with offline models while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: this https URL
https://arxiv.org/abs/2512.13680
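The layer-wise scale alignment step can be sketched in a few lines. This toy version (the binning by depth quantiles and the median-ratio estimator are illustrative choices, not necessarily the paper's) splits the overlapping-frame depths of two windows into layers and rescales the current window layer by layer:

```python
import numpy as np

def layerwise_scale_align(depth_prev, depth_curr, n_layers=4):
    """Toy layer-wise scale alignment: depths from the overlap between
    two temporal windows are split into layers by quantiles of the
    previous window, a per-layer scale is estimated as the median
    ratio, and the current window's depths are rescaled per layer.
    A single global Sim(3)-style scale would miss the case where each
    layer has drifted by a different factor."""
    edges = np.quantile(depth_prev, np.linspace(0, 1, n_layers + 1))
    aligned = depth_curr.copy()
    for i in range(n_layers):
        mask = (depth_prev >= edges[i]) & (depth_prev <= edges[i + 1])
        if mask.any():
            scale = np.median(depth_prev[mask] / depth_curr[mask])
            aligned[mask] = depth_curr[mask] * scale
    return aligned
```

When near and far layers have drifted by different factors, as the abstract describes, per-layer scales recover the reference depths where any single global scale could not.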
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data samples. Project website: this https URL
https://arxiv.org/abs/2512.13678
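The ControlNet idea Steer3D adapts has a simple core that can be shown in toy form (the real architecture is not specified in the abstract; names here are invented): a trainable conditioning branch joins a frozen generator through a zero-initialized projection, so at initialization the combined model reproduces the frozen one exactly, and steering is learned purely as a residual.

```python
import numpy as np

class ZeroInitControlBranch:
    """ControlNet-style attachment in toy form: the text pathway is
    connected through a zero-initialized matrix, guaranteeing the
    frozen generator's behavior is untouched at the start of training."""

    def __init__(self, dim, rng):
        self.w_text = rng.randn(dim, dim) * 0.02  # trainable text pathway
        self.w_zero = np.zeros((dim, dim))        # zero-initialized connector

    def __call__(self, frozen_out, text_emb):
        control = np.tanh(text_emb @ self.w_text)
        return frozen_out + control @ self.w_zero
```

The zero initialization is the whole trick: gradients flow into both matrices from step one, but the pretrained model's outputs are never disturbed until the connector learns something useful.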
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
https://arxiv.org/abs/2512.13677
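The joint self-attention JoVA relies on can be sketched minimally (single head, no learned projections, which the real model would of course have): video and audio tokens are concatenated into one sequence, so every token attends over both modalities in the same softmax, with no fusion module at all.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_self_attention(video_tokens, audio_tokens):
    """Minimal joint attention sketch: concatenation alone makes the
    (Nv+Na) x (Nv+Na) attention matrix cross-modal by construction."""
    x = np.concatenate([video_tokens, audio_tokens], axis=0)
    d = x.shape[1]
    attn = softmax(x @ x.T / np.sqrt(d))
    out = attn @ x
    n_video = len(video_tokens)
    return out[:n_video], out[n_video:]
```

Because the softmax spans both modalities, video outputs are shaped by audio context and vice versa, which is what makes an explicit alignment module unnecessary.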
Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversation and implicit preference question settings.
https://arxiv.org/abs/2512.13676
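The abstract does not detail the editing operator, but a common primitive for "localized edits" in the model-editing literature is a rank-one weight update: one key direction (here, standing in for a clustered preference representation) is rewired to a new value while inputs orthogonal to it are processed exactly as before. A generic sketch of that primitive, not the paper's exact method:

```python
import numpy as np

def rank_one_edit(W, key, new_value):
    """Generic rank-one model edit: update W so the (normalized) key
    maps to new_value, leaving every direction orthogonal to the key
    untouched, which is what preserves overall model capabilities."""
    key = key / np.linalg.norm(key)
    residual = new_value - W @ key
    return W + np.outer(residual, key)
```

The locality is exact by construction: the update lives entirely in the one-dimensional subspace spanned by the key.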
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
https://arxiv.org/abs/2512.13674
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
https://arxiv.org/abs/2512.13672
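The core DTI update is easy to sketch from the abstract's description (this is a reading of it, not the paper's code): the embedding magnitude is fixed, and only the direction is updated by projecting the gradient, plus the constant von Mises-Fisher prior pull, onto the tangent space of the unit sphere, then retracting by renormalization.

```python
import numpy as np

def riemannian_sgd_step(v, grad, mu, lr=0.1, kappa=1.0):
    """One direction-only update in the spirit of DTI: the Euclidean
    loss gradient plus a constant vMF prior gradient (-kappa * mu,
    pulling toward mean direction mu) is projected onto the tangent
    space at unit vector v; after the step, v is retracted back onto
    the sphere by renormalization, so the magnitude never inflates."""
    g = grad - kappa * mu          # vMF log-prior contributes a constant term
    tangent = g - np.dot(g, v) * v # remove the radial component
    v_new = v - lr * tangent
    return v_new / np.linalg.norm(v_new)
```

Because every iterate has unit norm, the norm-inflation failure mode the paper diagnoses cannot occur, and slerp between two learned directions stays on the sphere.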
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
https://arxiv.org/abs/2512.13671
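The two-part reward can be illustrated with a toy scoring function; the weights and the linear tool-use penalty below are invented for illustration, not AgentIAD's actual values:

```python
def inspection_reward(pred_label, true_label, iou, pred_type, true_type,
                      n_tool_calls, max_calls=6,
                      w_cls=1.0, w_loc=0.5, w_type=0.5, w_beh=0.25):
    """Toy two-part reward: a perception term scores classification
    correctness, spatial alignment (IoU), and defect-type correctness;
    a behavior term rewards frugal tool use so the agent does not
    zoom and retrieve indiscriminately."""
    perception = (w_cls * float(pred_label == true_label)
                  + w_loc * iou
                  + w_type * float(pred_type == true_type))
    behavior = w_beh * max(0.0, 1.0 - n_tool_calls / max_calls)
    return perception + behavior
```

A correct, frugal trajectory outscores a correct but tool-heavy one, which in turn outscores an incorrect one, matching the ordering the training signal is meant to induce.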
Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements, an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation-verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following. Project website: this https URL
https://arxiv.org/abs/2512.13670
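A deterministic, semantics-preserving back-translation of the kind the pipeline describes can be sketched as a recursive template expansion over formula trees. The operator set and phrasing below are invented for illustration; SpaTiaL's actual grammar is richer:

```python
def to_english(formula):
    """Tiny deterministic back-translation of nested
    (operator, args...) tuples into English. Because each operator
    maps to exactly one template, the translation is reproducible and
    preserves the formula's semantics by construction."""
    op = formula[0]
    if op == "eventually":
        return "eventually, " + to_english(formula[1])
    if op == "always":
        return "at all times, " + to_english(formula[1])
    if op == "until":
        return to_english(formula[1]) + " until " + to_english(formula[2])
    if op == "and":
        return to_english(formula[1]) + " and " + to_english(formula[2])
    if op in ("near", "left_of", "on_top_of"):
        relation = {"near": "is near", "left_of": "is left of",
                    "on_top_of": "is on top of"}[op]
        return f"the {formula[1]} {relation} the {formula[2]}"
    raise ValueError(f"unknown operator: {op}")
```

Nesting temporal operators over spatial atoms yields the layered descriptions the dataset pairs with formulas.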
Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify the writers of written texts. In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text, with capitalization and punctuation, and another, normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic-control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
https://arxiv.org/abs/2512.13667
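A handful of the character-, word-, and sentence-level cues the stylometric literature uses can be sketched as a small feature extractor with a naive same-speaker score on top. StyloSpeaker's actual feature set is much larger and its decision rule is learned; the functions below are illustrative:

```python
import string

def stylometric_features(transcript):
    """A few classic stylometric cues: average word length,
    type-token ratio, punctuation rate, and mean sentence length."""
    words = transcript.split()
    sentences = [s for s in
                 transcript.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    n_punct = sum(ch in string.punctuation for ch in transcript)
    return {
        "avg_word_len": sum(len(w.strip(string.punctuation))
                            for w in words) / max(len(words), 1),
        "type_token_ratio": len({w.lower().strip(string.punctuation)
                                 for w in words}) / max(len(words), 1),
        "punct_rate": n_punct / max(len(transcript), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

def same_speaker_score(t1, t2):
    # inverse L1 distance between feature vectors; a learned
    # threshold or classifier would sit on top of this in practice
    f1, f2 = stylometric_features(t1), stylometric_features(t2)
    return 1.0 / (1.0 + sum(abs(f1[k] - f2[k]) for k in f1))
```

Crucially, none of these features depend on what is being talked about, which is what makes stylometry usable under topic control.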
Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
https://arxiv.org/abs/2512.13665
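The geometric cue Grab-3D builds on, the vanishing point, has a standard least-squares estimator that is worth seeing concretely (this is textbook projective geometry, not the paper's full pipeline): given homogeneous line coefficients, the point minimizing the residual $|Lv|$ subject to $|v|=1$ is the right singular vector for the smallest singular value.

```python
import numpy as np

def vanishing_point(lines):
    """Least-squares vanishing point from homogeneous lines (a, b, c)
    with ax + by + c = 0: take the smallest right singular vector of
    the stacked coefficient matrix, then dehomogenize."""
    _, _, vt = np.linalg.svd(np.asarray(lines, dtype=float))
    v = vt[-1]
    return v[:2] / v[2]

def line_through(p, q):
    # homogeneous line through two image points, via the cross product
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])
```

In a real video, parallel scene lines should keep intersecting at a temporally stable point; drift in that point across frames is the kind of geometric inconsistency a detector can exploit.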
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
https://arxiv.org/abs/2512.13660
As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
https://arxiv.org/abs/2512.13658
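The embedding-based alignment check described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding vectors would come from a text-embedding model such as Voyage (the API call is omitted), and the 0.75 decision threshold is a hypothetical value, not one reported in the paper.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def alignment_score(resource_emb: np.ndarray, outcome_embs: list) -> float:
    """Score a resource against intended learning outcomes as the mean
    cosine similarity between its embedding and each outcome embedding."""
    return float(np.mean([cosine_similarity(resource_emb, o) for o in outcome_embs]))


def is_aligned(resource_emb: np.ndarray, outcome_embs: list, threshold: float = 0.75) -> bool:
    """Binary alignment decision via a similarity threshold (hypothetical value)."""
    return alignment_score(resource_emb, outcome_embs) >= threshold
```

A teacher-facing pipeline would embed each candidate resource and each stated learning outcome once, then flag resources whose score falls below the threshold for manual review.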
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
https://arxiv.org/abs/2512.13655
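The directional orthogonalization underlying these abliteration tools can be sketched as follows, assuming the common recipe of estimating a "refusal direction" from the difference of mean activations on harmful versus harmless prompts and projecting that direction out of a weight matrix. This is an illustrative sketch of the general technique, not the implementation of Heretic, DECCP, ErisForge, or FailSpy.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Estimate the refusal direction as the normalized difference of mean
    activations (rows = activation vectors) on harmful vs. harmless prompts."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def orthogonalize(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a weight matrix's output space:
    W' = (I - d d^T) W, so no output of W' has a component along d."""
    return W - np.outer(d, d) @ W
```

Applied to every layer that writes into the residual stream, this removes the model's ability to represent the refusal feature while leaving the orthogonal complement of the weights untouched, which is why capability impact is usually small but, as the benchmark shows, not zero.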
Large language models (LLMs) have been shown to respond in a variety of ways on classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task for studying LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.
https://arxiv.org/abs/2512.13654
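A minimal sketch of the prompt-based classification setup described above: build a label-constrained prompt for a SCOTUS opinion, then map the model's free-form reply back to a known topic label. The topic names shown are illustrative placeholders (not the paper's 15- or 279-label sets), and the actual model call is omitted.

```python
# Illustrative subset of topic labels; the real task uses 15 or 279 categories.
TOPIC_LABELS = ["Criminal Procedure", "Civil Rights", "First Amendment"]


def build_prompt(opinion_text: str, labels: list) -> str:
    """Construct a single-label classification prompt for one opinion."""
    label_list = "\n".join(f"- {label}" for label in labels)
    return (
        "You are classifying a U.S. Supreme Court opinion by topic.\n"
        f"Choose exactly one label from this list:\n{label_list}\n\n"
        f"Opinion (may be truncated):\n{opinion_text[:4000]}\n\n"
        "Answer with the label only."
    )


def parse_label(response: str, labels: list):
    """Map a free-form model response back to a known label, or None."""
    resp = response.strip().lower()
    for label in labels:
        if label.lower() in resp:
            return label
    return None
```

Truncating the opinion text is one pragmatic way to handle the corpus's long documents; a retrieval-based variant would instead select the most label-relevant passages before building the prompt.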