Paper Reading AI Learner

# The Latest Papers about AI

• ## Analyzing and Overcoming Local Optima in Complex Multi-Objective Optimization by Decomposition-Based Evolutionary Algorithms

2024-04-12 14:29:45
##### Abstract

When addressing the challenge of complex multi-objective optimization problems, particularly those with non-convex and non-uniform Pareto fronts, Decomposition-based Multi-Objective Evolutionary Algorithms (MOEADs) often converge to local optima, thereby limiting solution diversity. Despite its significance, this issue has received limited theoretical exploration. Through a comprehensive geometric analysis, we identify that the traditional method of Reference Point (RP) selection fundamentally contributes to this challenge. In response, we introduce an innovative RP selection strategy, the Weight Vector-Guided and Gaussian-Hybrid method, designed to overcome the local optima issue. This approach employs a novel RP type that aligns with weight vector directions and integrates a Gaussian distribution to combine three distinct RP categories. Our research comprises two main experimental components: an ablation study involving 14 algorithms within the MOEADs framework, spanning from 2014 to 2022, to validate our theoretical framework, and a series of empirical tests to evaluate the effectiveness of our proposed method against both traditional and cutting-edge alternatives. Results demonstrate that our method achieves remarkable improvements in both population diversity and convergence.
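
The abstract describes mixing three reference-point (RP) categories, with one type aligned to weight-vector directions and a Gaussian distribution blending them. A rough, hypothetical sketch of such a hybrid RP sampler for a 2-objective problem follows; the mixture probabilities, perturbation scale, and the concrete RP types are assumptions for illustration, not the paper's actual scheme:

```python
import numpy as np

# Hypothetical sketch of a weight-vector-guided, Gaussian-hybrid reference
# point (RP) selection for a 2-objective problem. The mixing weights and
# RP types are illustrative assumptions, not the paper's exact method.
rng = np.random.default_rng(0)

def weight_vectors(n):
    """Uniform weight vectors on the 2-objective simplex."""
    w1 = np.linspace(0.0, 1.0, n)
    return np.stack([w1, 1.0 - w1], axis=1)

def hybrid_reference_points(ideal, nadir, n, sigma=0.05):
    """Mix three RP types: ideal-based, nadir-based, and
    weight-vector-aligned points with Gaussian perturbation."""
    W = weight_vectors(n)
    aligned = ideal + W * (nadir - ideal)              # RPs along weight directions
    aligned += rng.normal(0.0, sigma, aligned.shape)   # Gaussian hybridization
    choice = rng.choice(3, size=n, p=[0.2, 0.2, 0.6])  # assumed mixture
    return np.where(choice[:, None] == 0, ideal,
           np.where(choice[:, None] == 1, nadir, aligned))

rps = hybrid_reference_points(np.zeros(2), np.ones(2), n=10)
print(rps.shape)  # (10, 2)
```

Such a sampler would replace a fixed ideal- or nadir-point RP choice inside the MOEA/D decomposition loop.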

##### Abstract (translated)

When addressing complex multi-objective optimization problems, particularly those with non-convex and non-uniform Pareto fronts, decomposition-based multi-objective evolutionary algorithms (MOEADs) often converge to local optima, which limits solution diversity. Despite its significance, this issue has received limited theoretical exploration. Through a comprehensive geometric analysis, we find that the traditional method of Reference Point (RP) selection fundamentally contributes to this challenge. In response, we introduce an innovative RP selection strategy, the Weight Vector-Guided and Gaussian-Hybrid method, designed to overcome the local optima issue. This approach employs a novel RP type aligned with weight vector directions and integrates a Gaussian distribution to combine three distinct RP categories. Our research comprises two main experimental components: an ablation study involving 14 algorithms within the MOEADs framework, spanning 2014 to 2022, to validate our theoretical framework, and a series of empirical tests evaluating the effectiveness of the proposed method against both traditional and state-of-the-art alternatives. The results show that our method achieves remarkable improvements in both population diversity and convergence.

##### URL

https://arxiv.org/abs/2404.08501

##### PDF

https://arxiv.org/pdf/2404.08501.pdf

• ## Dataset Reset Policy Optimization for RLHF

2024-04-12 14:25:49
##### Abstract

Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win-rate. Code for this work can be found at this https URL.
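
The dataset-reset idea, rolling out from informative offline states rather than always from the initial state distribution, can be sketched on a toy chain MDP. The environment, reset probability, and policy below are illustrative assumptions, not DR-PO's actual training setup:

```python
import random

# Minimal sketch of the dataset-reset idea behind DR-PO: instead of always
# rolling out from the initial state distribution, reset the policy
# optimizer to states drawn from the offline preference dataset. The toy
# chain MDP and the reset probability are illustrative assumptions.
random.seed(0)

OFFLINE_STATES = [3, 4, 5]   # informative states preferred by labelers
INITIAL_STATE = 0
GOAL = 6

def rollout_start(reset_prob=0.5):
    """Choose a rollout start state: dataset reset vs. initial distribution."""
    if random.random() < reset_prob:
        return random.choice(OFFLINE_STATES)   # dataset reset
    return INITIAL_STATE

def rollout(policy_step, max_steps=10, reset_prob=0.5):
    """Roll out a policy from the chosen start state; return visited states."""
    s = rollout_start(reset_prob)
    traj = [s]
    for _ in range(max_steps):
        if s == GOAL:
            break
        s = policy_step(s)
        traj.append(s)
    return traj

traj = rollout(lambda s: s + 1)   # toy policy that walks toward the goal
print(traj[0] in OFFLINE_STATES or traj[0] == INITIAL_STATE)  # True
```

Starting some rollouts from offline states lets the optimizer practice from positions the labelers already judged valuable, which is the intuition behind the coverage guarantee in the abstract.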

##### Abstract (translated)

Reinforcement learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models and has produced impressive models such as GPT-4 and Claude3 Opus. This framework typically consists of two steps: learning a reward model from an offline preference dataset, and then running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to the states in the offline dataset instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy covered by the offline dataset, under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the GPT-4 win-rate metric. Code for this work can be found at this https URL.

##### URL

https://arxiv.org/abs/2404.08495

##### PDF

https://arxiv.org/pdf/2404.08495.pdf

• ## Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

2024-04-12 14:19:16
##### Abstract

Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-lingual tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervised fine-tuning of the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the knowledge learned by the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparities across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full-resource to limited-resource settings. The code for our approach is available at this https URL.
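
The cross-lingual self-distillation idea, letting well-performing "teacher" languages guide an under-performing one on the same input without extra labels, might be sketched as a KL-divergence loss toward a teacher consensus. The consensus-by-averaging rule below is an assumption for illustration, not necessarily ALSACE's exact objective:

```python
import numpy as np

# Hypothetical sketch of cross-lingual self-distillation: predictions from
# well-performing ("teacher") languages on the same input guide an
# under-performing language via a KL-divergence loss, with no extra labels.
# Averaging the teacher distributions is an illustrative assumption.

def kl(p, q, eps=1e-9):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def self_distillation_loss(teacher_probs, student_probs):
    """KL(teacher consensus || student) over class distributions."""
    consensus = np.mean(teacher_probs, axis=0)   # average teacher languages
    return kl(consensus, student_probs)

teachers = np.array([[0.7, 0.2, 0.1],            # e.g. English head
                     [0.6, 0.3, 0.1]])           # e.g. French head
student = np.array([0.4, 0.3, 0.3])              # under-performing language
loss = self_distillation_loss(teachers, student)
print(loss > 0.0)  # True: the distributions differ
```

Minimizing such a loss pulls the under-performing language's output distribution toward the consensus of the stronger languages on unlabeled inputs.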

##### Abstract (translated)

Large-scale multilingual pretrained language models (mPLMs) deliver impressive performance on cross-lingual tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies tried to narrow these disparities by fine-tuning the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE, which leverages the knowledge learned by well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparities across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full-resource to limited-resource settings. The code for our approach is available at this https URL.

##### URL

https://arxiv.org/abs/2404.08491

##### PDF

https://arxiv.org/pdf/2404.08491.pdf

• ## SpectralMamba: Efficient Mamba for Hyperspectral Image Classification

2024-04-12 14:12:03
##### Abstract

Recurrent neural networks and Transformers have recently dominated most applications in hyperspectral (HS) imaging, owing to their capability to capture long-range dependencies from spectrum sequences. However, despite the success of these sequential architectures, the non-negligible inefficiency caused by either difficulty in parallelization or computationally prohibitive attention still hinders their practicality, especially for large-scale observation in remote sensing scenarios. To address this issue, we herein propose SpectralMamba -- a novel, efficient deep learning framework incorporating a state space model for HS image classification. SpectralMamba features simplified but adequate modeling of HS data dynamics at two levels. First, in spatial-spectral space, a dynamical mask is learned by efficient convolutions to simultaneously encode spatial regularity and spectral peculiarity, thus attenuating the spectral variability and confusion in discriminative representation learning. Second, the merged spectrum can then be efficiently operated on in the hidden state space with all parameters learned input-dependently, yielding selectively focused responses without reliance on redundant attention or unparallelizable recurrence. To explore the room for further computational downsizing, a piece-wise scanning mechanism is employed in between, transferring the approximately continuous spectrum into sequences of squeezed length while maintaining short- and long-term contextual profiles among hundreds of bands. Through extensive experiments on four benchmark HS datasets acquired by satellite-, aircraft-, and UAV-borne imagers, SpectralMamba creates promising win-wins from both the performance and efficiency perspectives.
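
The piece-wise scanning mechanism, which squeezes a long, approximately continuous spectrum into a shorter sequence of band groups while keeping local context, can be illustrated roughly as a fold of the band axis. The piece size below is an arbitrary assumption, not the paper's setting:

```python
import numpy as np

# Sketch of a piece-wise scanning idea: fold a long spectrum of B bands
# into a shorter sequence of multi-band "pieces", shrinking the sequence
# length the state space model must scan. The piece size is an assumption.

def piecewise_scan(spectrum, piece):
    """Fold a (B,) spectrum into a (B // piece, piece) sequence of pieces."""
    B = spectrum.shape[0]
    usable = (B // piece) * piece   # drop trailing bands if not divisible
    return spectrum[:usable].reshape(-1, piece)

spectrum = np.arange(200.0)         # e.g. 200 hyperspectral bands
seq = piecewise_scan(spectrum, piece=8)
print(seq.shape)  # (25, 8): sequence length squeezed from 200 to 25
```

A sequence model scanning 25 pieces instead of 200 individual bands does proportionally less sequential work, which is the computational-downsizing intuition in the abstract.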

##### Abstract (translated)

Recurrent neural networks and Transformers have recently dominated most applications in hyperspectral (HS) imaging, owing to their ability to capture long-range dependencies from spectrum sequences. However, despite the success of these sequential architectures, the non-negligible inefficiency caused by either difficulty in parallelization or computationally prohibitive attention still hinders their practicality, especially for large-scale observation in remote sensing scenarios. To address this issue, this paper proposes SpectralMamba, a novel, efficient deep learning framework incorporating a state space model for HS image classification. SpectralMamba features simplified but adequate modeling of HS data dynamics at two levels. First, in spatial-spectral space, a dynamical mask is learned by efficient convolutions to simultaneously encode spatial regularity and spectral peculiarity, thus attenuating spectral variability and confusion in discriminative representation learning. Second, the merged spectrum can then be efficiently operated on in the hidden state space, with all parameters learned input-dependently, yielding selectively focused responses without relying on redundant attention or unparallelizable recurrence. To explore room for further computational downsizing, a piece-wise scanning mechanism is employed in between, transferring the approximately continuous spectrum into sequences of squeezed length while maintaining short- and long-term contextual profiles among hundreds of bands. Through extensive experiments on four benchmark HS datasets acquired by satellite-, aircraft-, and UAV-borne imagers, SpectralMamba achieves promising win-wins from both the performance and efficiency perspectives.

##### URL

https://arxiv.org/abs/2404.08489

##### PDF

https://arxiv.org/pdf/2404.08489.pdf

• ## Thematic Analysis with Large Language Models: does it work with languages other than English? A targeted test in Italian

2024-04-12 14:10:09
##### Abstract

This paper proposes a test of performing Thematic Analysis (TA) with a Large Language Model (LLM) on data in a language other than English. While there has been initial promising work on using pre-trained LLMs for TA on data in English, we lack any tests of whether these models can reasonably perform the same analysis with good quality in other languages. In this paper, a test is proposed using an open-access dataset of semi-structured interviews in Italian. The test shows that a pre-trained model can perform such a TA on the data, also using prompts in Italian. A comparative test shows the model's capacity to produce themes that bear a good resemblance to those produced independently by human researchers. The main implication of this study is that pre-trained LLMs may be suitable to support analysis in multilingual situations, so long as the language is supported by the model used.

##### Abstract (translated)

This paper proposes a test of performing Thematic Analysis (TA) with a Large Language Model (LLM) on data in a language other than English. While there has been some initial work on using pre-trained LLMs for TA on English data, we lack any tests of whether these models can perform the same analysis with good quality in other languages. In this paper, a test is proposed using an open-access dataset of semi-structured interviews in Italian. The test shows that a pre-trained model can perform such a TA on the data, also using prompts in Italian. A comparative test shows the model's capacity to produce themes that bear a good resemblance to those produced independently by human researchers. The main conclusion of this study is that pre-trained LLMs may be suitable to support analysis in multilingual situations, as long as the language is supported by the model used.

##### URL

https://arxiv.org/abs/2404.08488

##### PDF

https://arxiv.org/pdf/2404.08488.pdf

• ## Decoding AI: The inside story of data analysis in ChatGPT

2024-04-12 13:57:30
##### Abstract

As a result of recent advancements in generative AI, the field of Data Science is subject to various changes. This review critically examines the Data Analysis (DA) capabilities of ChatGPT, assessing its performance across a wide range of tasks. While DA provides researchers and practitioners with unprecedented analytical capabilities, it is far from perfect, and it is important to recognize and address its limitations.

##### Abstract (translated)

Due to recent advances in generative AI, the field of data science is subject to various changes. This review critically examines the Data Analysis (DA) capabilities of ChatGPT, assessing its performance on a variety of tasks. While DA provides researchers and practitioners with unprecedented analytical capabilities, it is still far from perfect, and it is important to recognize and address its limitations.

##### URL

https://arxiv.org/abs/2404.08480

##### PDF

https://arxiv.org/pdf/2404.08480.pdf

• ## Swing-Up of a Weakly Actuated Double Pendulum via Nonlinear Normal Modes

2024-04-12 13:55:29
##### Abstract

We identify the nonlinear normal modes spawning from the stable equilibrium of a double pendulum under gravity, and we establish their connection to homoclinic orbits through the unstable upright position as energy increases. This result is exploited to devise an efficient swing-up strategy for a double pendulum with weak, saturating actuators. Our approach involves stabilizing the system onto periodic orbits associated with the nonlinear modes while gradually injecting energy. Since these modes are autonomous system evolutions, the required control effort for stabilization is minimal. Even with actuator limitations of less than 1% of the maximum gravitational torque, the proposed method accomplishes the swing-up of the double pendulum by allowing sufficient time.

##### Abstract (translated)

We identify the nonlinear normal modes spawning from the stable equilibrium of a double pendulum under gravity, and we establish their connection to homoclinic orbits through the unstable upright position as energy increases. This result is exploited to devise an efficient swing-up strategy for a double pendulum with weak, saturating actuators. Our approach involves stabilizing the system onto periodic orbits associated with the nonlinear modes while gradually injecting energy. Since these modes are autonomous system evolutions, the control effort required for stabilization is minimal. Even with actuator limits below 1% of the maximum gravitational torque, the proposed method accomplishes the swing-up of the double pendulum, given sufficient time.

##### URL

https://arxiv.org/abs/2404.08478

##### PDF

https://arxiv.org/pdf/2404.08478.pdf

• ## New Efficient Visual OILU Markers

2024-04-12 13:55:05
##### Abstract

Basic patterns are the source of a wide range of more or less complex geometric structures. We exploit such patterns to develop new, efficient visual markers. Besides being projective invariants, the proposed markers allow producing a rich panel of unique identifiers, which is highly required for resource-intensive navigation and augmented reality applications. The spiral topology of our markers permits the validation of an accurate identification scheme based on level set methods. The robustness of the markers against acquisition and geometric distortions is validated by extensive experimental tests.

##### Abstract (translated)

Basic patterns are the source of a wide range of more or less complex geometric structures. We exploit such patterns to develop new, efficient visual markers. Besides being projective invariants, the proposed markers allow generating a rich panel of unique identifiers, which is highly required for resource-intensive navigation and augmented reality applications. The spiral topology of our markers permits the validation of an accurate identification scheme based on level set methods. The robustness of the markers against acquisition and geometric distortions is validated by extensive experimental tests.

##### URL

https://arxiv.org/abs/2404.08477

##### PDF

https://arxiv.org/pdf/2404.08477.pdf

• ## Combining Statistical Depth and Fermat Distance for Uncertainty Quantification

2024-04-12 13:54:21
##### Abstract

We measure the out-of-domain uncertainty in the predictions of neural networks using a statistical notion called "Lens Depth" (LD) combined with Fermat Distance, which is able to capture precisely the "depth" of a point with respect to a distribution in feature space, without any assumption about the form of the distribution. Our method has no trainable parameters. The method is applicable to any classification model, as it is applied directly in feature space at test time and does not intervene in the training process. As such, it does not impact the performance of the original model. The proposed method gives excellent qualitative results on toy datasets and can give competitive or better uncertainty estimates on standard deep learning datasets compared to strong baseline methods.
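
Empirical Lens Depth itself is straightforward to compute: LD(x) is the fraction of sample pairs (a, b) whose "lens", the intersection of the two balls of radius d(a, b) centered at a and b, contains x. A minimal sketch with plain Euclidean distance follows; the paper pairs LD with Fermat distance, which is omitted here as a simplifying assumption:

```python
import numpy as np
from itertools import combinations

# Minimal empirical Lens Depth with Euclidean distance (the paper combines
# LD with Fermat distance learned from the data; plain Euclidean here is a
# simplifying assumption). A point x is inside the lens of a pair (a, b)
# iff max(d(x, a), d(x, b)) <= d(a, b).

def lens_depth(x, sample):
    n = len(sample)
    inside = 0
    for i, j in combinations(range(n), 2):
        a, b = sample[i], sample[j]
        dab = np.linalg.norm(a - b)
        if max(np.linalg.norm(x - a), np.linalg.norm(x - b)) <= dab:
            inside += 1
    return inside / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 2))            # in-distribution feature cloud
center = lens_depth(np.zeros(2), sample)      # deep, in-distribution point
far = lens_depth(np.array([10.0, 10.0]), sample)  # out-of-domain point
print(center > far)  # True: deep points score higher than outliers
```

Low depth in feature space then flags a test point as out-of-domain, with no retraining and no trainable parameters, matching the abstract's claims.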

##### Abstract (translated)

We measure the out-of-domain uncertainty in neural network predictions using a statistical notion called "Lens Depth" (LD) combined with Fermat distance. LD can accurately capture the "depth" of a point with respect to a distribution in feature space, without any assumption about the form of the distribution. Our method has no trainable parameters. It can be applied directly in feature space at test time and does not intervene in the training process; as a result, it does not affect the performance of the original model. The proposed method gives excellent qualitative results on toy datasets and provides competitive or better uncertainty estimates on standard deep learning datasets compared with strong baseline methods.

##### URL

https://arxiv.org/abs/2404.08476

##### PDF

https://arxiv.org/pdf/2404.08476.pdf

• ## OTTER: Improving Zero-Shot Classification via Optimal Transport

2024-04-12 13:18:47
##### Abstract

Popular zero-shot models suffer due to artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple and lightweight approach to adjust pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of a downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method in a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines like Prior Matching -- often by significant margins -- in 17 out of 21 datasets.
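
The core adjustment, rebalancing predictions so that their aggregate matches an estimated downstream label distribution, can be sketched with Sinkhorn scaling: rows are examples carrying uniform mass, columns must match the target label marginal. The kernel construction and hyperparameters below are illustrative assumptions, not necessarily OTTER's exact formulation:

```python
import numpy as np

# Sketch of rebalancing zero-shot predictions with optimal transport: find
# a transport plan whose rows are examples (uniform mass) and whose column
# sums match an estimated downstream label distribution, via Sinkhorn
# scaling on the model's predicted probabilities. Hyperparameters are
# illustrative assumptions.

def sinkhorn_rebalance(probs, label_dist, iters=200, eps=1.0):
    K = np.exp(np.log(probs + 1e-12) / eps)   # kernel from log-probs
    n = probs.shape[0]
    r = np.full(n, 1.0 / n)                   # uniform mass per example
    u = np.ones(n)
    for _ in range(iters):
        v = label_dist / (K.T @ u)            # match column marginals
        u = r / (K @ v)                       # match row marginals
    return u[:, None] * K * v[None, :]        # transport plan

probs = np.array([[0.9, 0.1],                 # model biased toward class 0
                  [0.8, 0.2],
                  [0.6, 0.4],
                  [0.7, 0.3]])
plan = sinkhorn_rebalance(probs, np.array([0.5, 0.5]))
print(np.round(plan.sum(axis=0), 3))  # column sums match [0.5 0.5]
```

Predicting the argmax of each row of the plan then yields labels whose overall distribution matches the estimate, instead of inheriting the pretraining bias.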

##### Abstract (translated)

Popular zero-shot models suffer from artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is a mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple, lightweight approach that adjusts pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of the downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method on a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines such as Prior Matching, often by significant margins, in 17 out of 21 datasets.

##### URL

https://arxiv.org/abs/2404.08461

##### PDF

https://arxiv.org/pdf/2404.08461.pdf

• ## On the Independence Assumption in Neurosymbolic Learning

2024-04-12 13:09:48
##### Abstract

State-of-the-art neurosymbolic learning systems use probabilistic reasoning to guide neural networks towards predictions that conform to logical constraints over symbols. Many such systems assume that the probabilities of the considered symbols are conditionally independent given the input to simplify learning and reasoning. We study and criticise this assumption, highlighting how it can hinder optimisation and prevent uncertainty quantification. We prove that loss functions bias conditionally independent neural networks to become overconfident in their predictions. As a result, they are unable to represent uncertainty over multiple valid options. Furthermore, we prove that these loss functions are difficult to optimise: they are non-convex, and their minima are usually highly disconnected. Our theoretical analysis gives the foundation for replacing the conditional independence assumption and designing more expressive neurosymbolic probabilistic models.
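
The inability to represent uncertainty over multiple valid options can be seen in a two-symbol toy example. Under the constraint "exactly one of A, B is true", a conditionally independent model with marginals P(A)=p, P(B)=q can only put all its mass on the two valid worlds by becoming deterministic:

```python
import numpy as np

# Toy illustration of the paper's point: with the constraint "exactly one
# of A, B is true", an independent model P(A)=p, P(B)=q cannot put all
# probability mass on the two valid worlds while staying uncertain.

def prob_valid(p, q):
    """Mass on the valid worlds {(1,0), (0,1)} under independence."""
    return p * (1 - q) + (1 - p) * q

# Staying uncertain between the two options leaks mass to invalid worlds:
print(prob_valid(0.5, 0.5))  # 0.5 -- half the mass sits on (0,0) and (1,1)

# Scanning the unit square, prob_valid == 1 only at deterministic corners:
grid = np.linspace(0, 1, 101)
P, Q = np.meshgrid(grid, grid)
V = prob_valid(P, Q)
best = np.argwhere(np.isclose(V, 1.0))
print([(float(grid[i]), float(grid[j])) for i, j in best])
# Only two solutions, the corners (0, 1) and (1, 0) of the unit square.
```

Any loss that rewards mass on valid worlds therefore pushes the independent model to one of the two corners, i.e., to an overconfident prediction, which is exactly the bias the abstract describes.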

##### Abstract (translated)

State-of-the-art neurosymbolic learning systems use probabilistic reasoning to guide neural networks toward predictions that conform to logical constraints over symbols. Many such systems assume that the probabilities of the considered symbols are conditionally independent given the input, in order to simplify learning and reasoning. We study and criticise this assumption, highlighting how it can hinder optimisation and prevent uncertainty quantification. We prove that loss functions bias conditionally independent neural networks toward becoming overconfident in their predictions. As a result, they are unable to represent uncertainty over multiple valid options. Furthermore, we prove that these loss functions are difficult to optimise: they are non-convex, and their minima are usually highly disconnected. Our theoretical analysis lays the foundation for replacing the conditional independence assumption and designing more expressive neurosymbolic probabilistic models.

##### URL

https://arxiv.org/abs/2404.08458

##### PDF

https://arxiv.org/pdf/2404.08458.pdf

• ## MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection

2024-04-12 13:02:08
##### Abstract

Deepfakes have recently raised significant trust issues and security concerns among the public. Compared to CNN face forgery detectors, ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance. However, these approaches still exhibit the following limitations: (1). Fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2). ViT-based methods struggle to capture local forgery clues, leading to model bias and limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD only updates lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of transformers and local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are designed to scale the model's capacity and select optimal forgery experts, further enhancing forgery detection performance. The proposed MoE learning scheme can be seamlessly adapted to various transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with reduced parameter overhead. The code will be released upon acceptance.
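
The parameter-efficient part, keeping the pretrained weights frozen and training only a low-rank update, can be sketched in a few lines. The dimensions and zero-initialization of the up-projection follow common LoRA practice and are assumptions here, not necessarily MoE-FFD's exact configuration:

```python
import numpy as np

# Sketch of the parameter-efficient idea in MoE-FFD: the pretrained weight
# W stays frozen and only a low-rank update B @ A (LoRA) is trained, so
# the effective layer computes h = W x + B A x. Dimensions are illustrative.

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size and LoRA rank, r << d
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, init to zero

def lora_forward(x):
    return W @ x + B @ (A @ x)       # frozen path + low-rank adaptation

x = rng.normal(size=d)
# With B initialized to zero, the adapted layer equals the frozen layer:
print(np.allclose(lora_forward(x), W @ x))  # True

trainable = A.size + B.size
print(trainable, W.size)  # 32 trainable vs. 64 frozen parameters
```

Only A and B (and, in MoE-FFD, the expert-routing modules) receive gradients, which is why the approach trains with a fraction of the backbone's parameter count.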

##### Abstract (translated)

Deepfakes have recently raised significant trust and security concerns among the public. Compared with CNN-based face forgery detectors, ViT-based methods exploit the expressivity of Transformers and achieve superior detection performance. However, these approaches still exhibit the following limitations: (1) fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2) ViT-based methods struggle to capture local forgery clues, leading to model bias and limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD updates only lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of Transformers and the local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are designed to scale the model's capacity and select optimal forgery experts, further enhancing detection performance. The proposed MoE learning scheme can be seamlessly adapted to various Transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with reduced parameter overhead. The code will be released upon acceptance.

##### URL

https://arxiv.org/abs/2404.08452

##### PDF

https://arxiv.org/pdf/2404.08452.pdf

• ## Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

2024-04-12 13:01:22
##### Abstract

Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at this https URL.

##### Abstract (translated)

Face recognition systems are frequently subjected to a variety of physical and digital attacks. Previous methods have achieved satisfactory performance in scenarios addressing physical attacks and digital attacks separately. However, few methods integrate a model that simultaneously addresses both physical and digital attacks, implying the need to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating the spoofing clues of physical and digital attacks, respectively, significantly improving the model's ability to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in the "Unified Physical-Digital Face Attack Detection" track of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at this https URL.

##### URL

https://arxiv.org/abs/2404.08450

##### PDF

https://arxiv.org/pdf/2404.08450.pdf

• ## OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering

2024-04-12 13:00:06
##### Abstract

Rendering dynamic 3D humans from monocular videos is crucial for various applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, while various objects may occlude body parts in real-life scenarios. A previous method utilizes NeRF-based surface rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these issues, we propose OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS from occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space, and we perform occlusion feature queries in occluded regions; the aggregated pixel-aligned feature is extracted to compensate for the missing information. We then use a Gaussian Feature MLP to further process the feature, along with occlusion-aware loss functions, to better perceive the occluded area. Extensive experiments on both simulated and real-world occlusions demonstrate that our method achieves performance comparable or even superior to the state-of-the-art method, while improving training and inference speeds by 250x and 800x, respectively. Our code will be available for research purposes.

##### Abstract (translated)

Rendering dynamic 3D humans from monocular videos is crucial for applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, whereas in real-life scenarios various objects may occlude body parts. A previous method uses NeRF surface rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these problems, we propose OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS from occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space; we perform occlusion feature queries in occluded regions and extract an aggregated pixel-aligned feature to compensate for the missing information. We then use a Gaussian Feature MLP to further process the feature, together with occlusion-aware loss functions, to better perceive the occluded area. Experiments on both simulated and real-world occlusions show that our method achieves performance comparable to or even better than the state-of-the-art method, while improving training and inference speeds by 250x and 800x, respectively. Our code will be made available for research purposes.

##### URL

https://arxiv.org/abs/2404.08449

##### PDF

https://arxiv.org/pdf/2404.08449.pdf

• ## An improved tabular data generator with VAE-GMM integration

2024-04-12 12:31:06
##### Abstract

The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Data should preserve key characteristics while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data. These data often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing the use of various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance against CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.

##### Abstract (translated)

The rising use of machine learning in various fields requires robust methods for creating synthetic tabular data. The data should preserve key characteristics while addressing data-scarcity challenges. Current approaches based on Generative Adversarial Networks (GANs), such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data, which often contains both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that overcomes these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance over CTGAN and TVAE, establishing the model's potential as a valuable tool for generating synthetic tabular data in various domains, particularly healthcare.

##### URL

https://arxiv.org/abs/2404.08434

##### PDF

https://arxiv.org/pdf/2404.08434.pdf

• ## MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

2024-04-12 12:30:48
##### Abstract

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.

##### Abstract (translated)

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. To address this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach feeds spatial features of different scales extracted by a CNN into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating multi-scale spatial information. This process culminates in multi-scale spatio-temporal features that are used for the final classification. Our method achieves state-of-the-art performance on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations further validate our approach's ability to exploit spatio-temporal information in DFER.

##### URL

https://arxiv.org/abs/2404.08433

##### PDF

https://arxiv.org/pdf/2404.08433.pdf

• ## Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

2024-04-12 12:15:14
##### Abstract

Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, like hand gestures, body poses, and facial expressions and combining them with environment states and user verbal cues captured using an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.

##### Abstract (translated)

Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions in order to interact with humans proactively and adapt to their behavior. Intention prediction is therefore pivotal in creating natural, interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, such as hand gestures, body poses, and facial expressions, and combining them with environment states and user verbal cues captured by an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.

##### URL

https://arxiv.org/abs/2404.08424

##### PDF

https://arxiv.org/pdf/2404.08424.pdf

• ## Adapting the Segment Anything Model During Usage in Novel Situations

2024-04-12 12:10:53
##### Abstract

The interactive segmentation task consists in the creation of object segmentation masks based on user interactions. The most common way to guide a model towards producing a correct segmentation consists in clicks on the object and background. The recently published Segment Anything Model (SAM) supports a generalized version of the interactive segmentation problem and has been trained on an object segmentation dataset which contains 1.1B masks. Though trained extensively and with the explicit purpose of serving as a foundation model, we show significant limitations of SAM when it is applied to interactive segmentation on novel domains or object types. On the used datasets, SAM displays a failure rate $\text{FR}_{30}@90$ of up to $72.6 \%$. Since we still want such foundation models to be immediately applicable, we present a framework that can adapt SAM during immediate usage. For this we leverage the user interactions and masks, which are constructed during the interactive segmentation process. We use this information to generate pseudo-labels, which we use to compute a loss function and optimize a part of the SAM model. The presented method causes a relative reduction of up to $48.1 \%$ in the $\text{FR}_{20}@85$ and $46.6 \%$ in the $\text{FR}_{30}@90$ metrics.
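
The $\text{FR}_k@q$ failure-rate metric reported above can be computed directly: it is the fraction of samples that never reach IoU threshold $q$ within $k$ user clicks. A small sketch, with toy IoU-per-click curves fabricated purely for illustration:

```python
import numpy as np

# Sketch of the FR_k@q failure-rate metric: the fraction of samples that
# never reach IoU threshold q within k user clicks. The toy IoU-per-click
# curves below are fabricated for illustration.

def failure_rate(ious_per_click, k, q):
    """ious_per_click: (n_samples, max_clicks) best IoU after each click."""
    reached = (ious_per_click[:, :k] >= q).any(axis=1)
    return 1.0 - reached.mean()

ious = np.array([[0.50, 0.70, 0.92],   # reaches IoU 0.90 at click 3
                 [0.40, 0.60, 0.80],   # never reaches 0.90
                 [0.91, 0.95, 0.97]])  # reaches 0.90 immediately
print(failure_rate(ious, k=3, q=0.90))  # ~0.333: 1 of 3 samples fails
```

Tracking this metric before and after the on-the-fly adaptation is how the relative reductions quoted in the abstract would be measured.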

##### Abstract (translated)

The interactive segmentation task consists in creating object segmentation masks based on user interactions. The most common way to guide a model toward a correct segmentation is through clicks on the object and background. The recently published Segment Anything Model (SAM) supports a generalized version of the interactive segmentation problem and has been trained on an object segmentation dataset containing 1.1B masks. Although trained extensively and with the explicit purpose of serving as a foundation model, SAM shows significant limitations when applied to interactive segmentation on novel domains or object types. On the datasets used, SAM displays a failure rate $\text{FR}_{30}@90$ of up to 72.6%. Since we still want such foundation models to be immediately applicable, we present a framework that can adapt SAM during immediate usage. For this, we leverage the user interactions and masks constructed during the interactive segmentation process. We use this information to generate pseudo-labels, which we use to compute a loss function and optimize part of the SAM model. The presented method yields a relative reduction of up to 48.1% in the $\text{FR}_{20}@85$ metric and 46.6% in the $\text{FR}_{30}@90$ metric.

##### URL

https://arxiv.org/abs/2404.08421

##### PDF

https://arxiv.org/pdf/2404.08421.pdf

• ## Direct May Not Be the Best: An Incremental Evolution View of Pose Generation

2024-04-12 12:08:06
##### Abstract

Pose diversity is an inherent representative characteristic of 2D images. Due to the 3D-to-2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle hindering pose-transformation-related research. To deal with this challenge, we propose a fine-grained, incremental-evolution-centered pose generation framework, rather than the traditional direct one-to-one generation. Since the proposed approach bypasses the theoretical difficulty of directly modeling the dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while at the same time the various individual pose details, especially clothes texture, can be precisely maintained. In order to systematically guide the evolution course, both global and incremental evolution constraints are elaborately designed and merged into the overall framework. A novel triple-path knowledge fusion structure is worked out to take full advantage of all available valuable knowledge to conduct high-quality pose synthesis. In addition, our framework can generate a series of valuable byproducts, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at this https URL.

##### Abstract (translated)

Pose diversity is an inherent, representative characteristic of 2D images. Due to the 3D-to-2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle for pose-transformation-related research. To deal with this challenge, we propose a fine-grained, incremental-evolution-centered pose generation framework, rather than the traditional direct one-to-one generation. Since the proposed approach bypasses the theoretical difficulty of directly modeling dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while the various individual pose details, especially clothes texture, are precisely maintained. To systematically guide the evolution course, global and incremental evolution constraints are elaborately designed and merged into the overall framework. A novel triple-path knowledge fusion structure is also devised to take full advantage of all available knowledge for high-quality pose synthesis. In addition, our framework can generate a series of valuable byproducts, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at this https URL.

##### URL

https://arxiv.org/abs/2404.08419

##### PDF

https://arxiv.org/pdf/2404.08419.pdf

• ## AdapterSwap: Continuous Training of LLMs with Data Removal and Access-Control Guarantees

2024-04-12 12:06:02
##### Abstract

Large language models (LLMs) are increasingly capable of completing knowledge intensive tasks by recalling information from a static pretraining corpus. Here we are concerned with LLMs in the context of evolving data requirements. For instance: batches of new data that are introduced periodically; subsets of data with user-based access controls; or requirements on dynamic removal of documents with guarantees that associated knowledge cannot be recalled. We wish to satisfy these requirements while at the same time ensuring a model does not forget old information when new data becomes available. To address these issues, we introduce AdapterSwap, a training and inference scheme that organizes knowledge from a data collection into a set of low-rank adapters, which are dynamically composed during inference. Our experiments demonstrate AdapterSwap's ability to support efficient continual learning, while also enabling organizations to have fine-grained control over data access and deletion.
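
Organizing knowledge as swappable low-rank adapters over a frozen base can be sketched in a few lines. The composition-by-summation rule and the access-control interface below are illustrative assumptions, not AdapterSwap's exact design:

```python
import numpy as np

# Sketch of composing low-rank adapters at inference, in the spirit of
# AdapterSwap: each knowledge partition gets its own adapter, and only the
# adapters the caller may access are added to the frozen base weight. The
# summation-based composition rule is an illustrative assumption.

rng = np.random.default_rng(0)
d, r = 6, 2
W = rng.normal(size=(d, d))   # frozen base model weight

adapters = {                  # one low-rank (B, A) pair per data partition
    "public":  (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "finance": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def effective_weight(allowed):
    """Base weight plus the adapters the caller is allowed to access."""
    W_eff = W.copy()
    for name in allowed:
        B, A = adapters[name]
        W_eff += B @ A
    return W_eff

# Removing a partition's data amounts to dropping its adapter:
with_finance = effective_weight(["public", "finance"])
without = effective_weight(["public"])
print(np.allclose(with_finance, without))  # False: that knowledge is gone
```

Because each partition's knowledge lives only in its own adapter, deleting the adapter gives a hard guarantee that the associated knowledge cannot be recalled, while the frozen base and remaining adapters are untouched.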

##### Abstract (translated)

Large language models (LLMs) are increasingly capable of completing knowledge-intensive tasks by recalling information from a static pretraining corpus. Here we are concerned with LLMs in the context of evolving data requirements, for instance: batches of new data introduced periodically; subsets of data with user-based access controls; or requirements for the dynamic removal of documents with guarantees that the associated knowledge cannot be recalled. We wish to satisfy these requirements while ensuring that the model does not forget old information when new data becomes available. To address these issues, we introduce AdapterSwap, a training and inference scheme that organizes the knowledge in a data collection into a set of low-rank adapters, which are dynamically composed during inference. Our experiments demonstrate AdapterSwap's ability to support efficient continual learning while enabling organizations to exercise fine-grained control over data access and deletion.

##### URL

https://arxiv.org/abs/2404.08417

##### PDF

https://arxiv.org/pdf/2404.08417.pdf
