Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Warm-K-based test-time scaling enhances behavioral consistency and reactivity without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
https://arxiv.org/abs/2512.13262
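The group relative advantage at the heart of GRBO can be illustrated with a minimal sketch: rollouts sampled for the same scene form a group, and each rollout's reward is normalized against its peers rather than a learned baseline. The function name and the mean/std normalization are assumptions in the spirit of group-relative RL methods, not the paper's implementation.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's statistics.

    Rollouts generated for the same scene are compared only to each
    other; positive advantage = better than the group average.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one scene, scored by a safety/realism reward.
advs = group_relative_advantages([1.0, 2.0, 3.0, 6.0])
```

By construction the advantages sum to zero, so the update only reshapes the relative preference within a group, which is what lets a small (10%) fine-tuning set move safety metrics without a separate value network.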
In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.
https://arxiv.org/abs/2512.13083
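A pairwise penalty combining the two ingredients the abstract names (cosine similarity and Euclidean distance) can be sketched as follows. How DiRe actually weights and combines the two terms is not stated in the abstract, so the particular form and the `lam` coefficient here are illustrative assumptions.

```python
import math

def dire_penalty(samples, lam=0.1):
    """Average pairwise diversity penalty over synthetic samples:
    high cosine similarity is penalized, large Euclidean distance is
    rewarded (subtracted).  Lower values = more diverse set."""
    total, pairs = 0.0, 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            a, b = samples[i], samples[j]
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            cos = dot / (na * nb + 1e-12)
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            total += cos - lam * dist
            pairs += 1
    return total / pairs

redundant = dire_penalty([[1.0, 0.0], [1.0, 0.0]])   # duplicated sample
diverse = dire_penalty([[1.0, 0.0], [0.0, 1.0]])     # orthogonal samples
```

Minimizing such a term alongside the condensation objective pushes synthetic images apart in feature space, which is the redundancy reduction the paper targets.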
Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
https://arxiv.org/abs/2512.12824
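The "training-free hybrid prototyping" end of the strategy hierarchy reduces to classifying by cosine similarity to class-mean features. The sketch below shows that baseline on toy 2-D features; the feature extractor (the CoCa visual backbone in the paper) is abstracted away, and all names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def class_means(support):
    """support: {label: list of feature vectors} -> {label: prototype}."""
    protos = {}
    for label, feats in support.items():
        n = len(feats)
        protos[label] = [sum(col) / n for col in zip(*feats)]
    return protos

def predict(protos, query):
    """Nearest prototype by cosine similarity (training-free)."""
    return max(protos, key=lambda c: cosine(protos[c], query))

support = {"cat": [[0.9, 0.1], [1.0, 0.0]],
           "dog": [[0.1, 0.9], [0.0, 1.0]]}
protos = class_means(support)
label = predict(protos, [0.8, 0.2])
```

Everything beyond this baseline in the paper (linear probing, SupCon objectives, LoRA) trades this zero-training simplicity for learned adaptation.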
Temporal Knowledge Graph Reasoning (TKGR) aims to complete missing factual elements along the timeline. Depending on the temporal position of the query, the task is categorized into interpolation and extrapolation. Existing interpolation methods typically embed temporal information into individual facts to complete missing historical knowledge, while extrapolation techniques often leverage sequence models over graph snapshots to identify recurring patterns for future event prediction. These methods face two critical challenges: limited contextual modeling in interpolation and cognitive generalization bias in extrapolation. To address these, we propose a unified method for TKGR, dubbed DynaGen. For interpolation, DynaGen dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. For extrapolation, it applies a conditional diffusion process, which forces the model to learn underlying evolutionary principles rather than just superficial patterns, enhancing its ability to predict unseen future events. Extensive experiments on six benchmark datasets show DynaGen achieves state-of-the-art performance. On average, compared to the second-best models, DynaGen improves the Mean Reciprocal Rank (MRR) score by 2.61 points for interpolation and 1.45 points for extrapolation.
https://arxiv.org/abs/2512.12669
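The Mean Reciprocal Rank metric used to report DynaGen's gains is simple to state precisely: for each query, take the reciprocal of the rank at which the gold entity appears, then average. (The "points" in the abstract are differences of MRR scaled by 100.)

```python
def mean_reciprocal_rank(ranks):
    """MRR over per-query ranks of the gold answer (rank 1 = best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose gold entities were ranked 1st, 2nd, and 4th.
score = mean_reciprocal_rank([1, 2, 4])
```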
The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known *a priori*. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model's OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.
https://arxiv.org/abs/2512.12667
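CCR's core idea, gradually shifting training emphasis from high- to low-confidence samples, can be sketched as a confidence-dependent loss weight with a training-progress schedule. The linear interpolation below is an illustrative assumption, not the paper's actual scaling function.

```python
def ccr_weight(confidence, progress):
    """Confidence-aware loss scale.

    progress in [0, 1]: early in training (progress ~ 0) confident
    pseudo-labels dominate; late (progress ~ 1) the weight flips toward
    low-confidence samples, countering the confidence skew.
    """
    return (1.0 - progress) * confidence + progress * (1.0 - confidence)

early_confident = ccr_weight(0.9, 0.0)  # confident sample, early: large weight
late_confident = ccr_weight(0.9, 1.0)   # same sample, late: small weight
```

Pairing such a schedule with selective learning on high-confidence samples (the ACR side) is what the abstract describes as the mutually reinforcing loop.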
Deep neural networks possess strong representational capacity yet remain vulnerable to overfitting, primarily because neurons tend to co-adapt in ways that, while capturing complex and fine-grained feature interactions, also reinforce spurious and non-generalizable patterns that inflate training performance but reduce reliability on unseen data. Noise-based regularizers such as Dropout and DropConnect address this issue by injecting stochastic perturbations during training, but the noise they apply is typically uniform across a layer or across a batch of samples, which can suppress both harmful and beneficial co-adaptation. This work introduces PerNodeDrop, a lightweight stochastic regularization method. It applies per-sample, per-node perturbations to break the uniformity of the noise injected by existing techniques, thereby allowing each node to experience input-specific variability. Hence, PerNodeDrop preserves useful co-adaptation while applying regularization. This narrows the gap between training and validation performance and improves reliability on unseen data, as the experiments show. Although superficially similar to DropConnect, PerNodeDrop drops weights at the sample level rather than the batch level. An expected-loss analysis formalizes how its perturbations attenuate excessive co-adaptation while retaining predictive interactions. Empirical evaluations on vision, text, and audio benchmarks indicate improved generalization relative to standard noise-based regularizers.
https://arxiv.org/abs/2512.12663
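The distinction the abstract draws (per-sample, per-node noise versus one mask shared across a batch) is easy to make concrete. The sketch below uses the standard inverted-dropout convention so the expected activation is unchanged; it is a minimal illustration of the sampling granularity, not the paper's exact operator.

```python
import random

def per_node_drop(batch, p, rng=random):
    """Drop each activation independently per sample AND per node.

    Survivors are rescaled by 1/(1-p) (inverted dropout), so the
    expectation matches the clean forward pass.  Contrast with a
    batch-level scheme, which would sample ONE mask and apply it to
    every sample in the batch.
    """
    keep = 1.0 - p
    return [[(x / keep if rng.random() < keep else 0.0) for x in sample]
            for sample in batch]

random.seed(0)
out = per_node_drop([[1.0] * 8, [1.0] * 8], p=0.5)
```

Because each sample draws its own mask, two identical inputs in the same batch generally see different perturbations, which is the "input-specific variability" the method relies on.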
Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9%--30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.
https://arxiv.org/abs/2512.12544
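Difference-aware regularization concentrates supervision on the tokens an edit actually changes. A minimal way to obtain such per-token weights is to diff source and target token sequences; the stdlib `difflib.SequenceMatcher` suffices for the sketch. The weight values and function name are illustrative assumptions, not HyperEdit's implementation.

```python
import difflib

def token_loss_weights(source, target, w_same=0.1, w_mod=1.0):
    """Per-target-token loss weights: tokens inside spans the edit
    changed get w_mod, untouched tokens get the small w_same, so the
    model is discouraged from rewriting regions it should preserve."""
    sm = difflib.SequenceMatcher(a=source, b=target)
    weights = []
    for op, _i1, _i2, j1, j2 in sm.get_opcodes():
        weights.extend([w_same if op == "equal" else w_mod] * (j2 - j1))
    return weights

src = ["def", "f", "(", "x", ")", ":", "return", "x"]
tgt = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"]
w = token_loss_weights(src, tgt)  # only the appended "+ 1" is upweighted
```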
We present Animus3D, a text-driven 3D animation framework that generates a motion field given a static 3D asset and a text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while an inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at this https URL.
https://arxiv.org/abs/2512.12534
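A generic example of the kind of temporal regularization the abstract mentions is a smoothness term that penalizes the acceleration of per-vertex displacements across frames (squared second differences). The paper's exact regularizers are not specified in the abstract, so this is a representative sketch only.

```python
def temporal_smoothness(frames):
    """Sum of squared second differences of per-vertex displacements.

    frames: list of per-frame displacement lists (one scalar per
    vertex here for simplicity).  Linear motion scores 0; jitter
    (sign-flipping motion) is penalized.
    """
    loss = 0.0
    for t in range(1, len(frames) - 1):
        for prev, cur, nxt in zip(frames[t - 1], frames[t], frames[t + 1]):
            loss += (nxt - 2.0 * cur + prev) ** 2
    return loss

linear = temporal_smoothness([[0.0], [1.0], [2.0], [3.0]])   # smooth
jittery = temporal_smoothness([[0.0], [1.0], [0.0], [1.0]])  # oscillating
```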
Time series forecasting predicts future values from past data. In real-world settings, some anomalous events have lasting effects and influence the forecast, while others are short-lived and should be ignored. Standard forecasting models fail to make this distinction, often either overreacting to noise or missing persistent shifts. We propose Co-TSFA (Contrastive Time Series Forecasting with Anomalies), a regularization framework that learns when to ignore anomalies and when to respond. Co-TSFA generates input-only and input-output augmentations to model forecast-irrelevant and forecast-relevant anomalies, and introduces a latent-output alignment loss that ties representation changes to forecast changes. This encourages invariance to irrelevant perturbations while preserving sensitivity to meaningful distributional shifts. Experiments on the Traffic and Electricity benchmarks, as well as on a real-world cash-demand dataset, demonstrate that Co-TSFA improves performance under anomalous conditions while maintaining accuracy on normal data. An anonymized GitHub repository with the implementation of Co-TSFA is provided and will be made public upon acceptance.
https://arxiv.org/abs/2512.11526
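One reading of the latent-output alignment loss is: under an augmentation, the magnitude of the representation change should track the magnitude of the forecast change. The absolute-difference-of-norms form below is an assumption used for illustration; the paper's exact loss may differ.

```python
def latent_output_alignment(z, z_aug, y, y_aug):
    """Penalize mismatch between how far the latent moved and how far
    the forecast moved under an augmentation."""
    dz = sum((a - b) ** 2 for a, b in zip(z, z_aug)) ** 0.5
    dy = sum((a - b) ** 2 for a, b in zip(y, y_aug)) ** 0.5
    return abs(dz - dy)

# Input-only augmentation (forecast-irrelevant anomaly): the target
# forecast is unchanged, so any latent movement is penalized,
# encouraging invariance.
irrelevant = latent_output_alignment([1.0, 0.0], [0.0, 1.0], [2.0], [2.0])
# Latent also unchanged: zero loss.
aligned = latent_output_alignment([1.0, 0.0], [1.0, 0.0], [2.0], [2.0])
```

Input-output augmentations (forecast-relevant anomalies) change `y_aug` too, so the same loss then rewards a latent shift of matching size, preserving sensitivity to real distributional shifts.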
Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open-world scenarios. Recognizing and segmenting these out-of-distribution (OOD) objects is crucial for safety-critical applications such as automated driving. In this work, we present an evidence segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback-Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty-based approaches.
https://arxiv.org/abs/2512.11373
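For discrete distributions on an ordered 1-D support, the Wasserstein-1 distance reduces to the sum of absolute differences of the CDFs, which makes the "distributional distance" intuition concrete. Note this is the simplest 1-D case; how the paper lifts the distance to the class-probability simplex (where categories have no natural order) is not specified in the abstract.

```python
def wasserstein_1d(p, q):
    """W1 between two discrete distributions on the same ordered
    support with unit spacing: sum |CDF_p - CDF_q| over bins."""
    total, cp, cq = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        total += abs(cp - cq)
    return total

same = wasserstein_1d([0.5, 0.5], [0.5, 0.5])
shifted = wasserstein_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])  # mass moved 2 bins
```

Unlike KL divergence, this distance stays finite and meaningful when the two distributions have disjoint support, which is one reason to pair it with (rather than replace it by) the KL regularization term.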
Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective unlearning at the concept level. To address this, we introduce a Gradient Projection Framework designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature's embedding space, thereby zeroing out its influence on the model's weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.
https://arxiv.org/abs/2512.11194
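The projection step the abstract describes is standard linear algebra: subtract from each gradient its component along the prohibited concept's embedding, leaving a gradient orthogonal to it. The sketch below shows the operation on plain vectors; in practice it would be applied to flattened per-layer gradients during backpropagation.

```python
def project_out(grad, concept):
    """Project `grad` onto the orthogonal complement of `concept`:
    g <- g - (g.e / e.e) e, so the update carries no signal along the
    sensitive feature's embedding direction."""
    ge = sum(g * e for g, e in zip(grad, concept))
    ee = sum(e * e for e in concept) + 1e-12
    coef = ge / ee
    return [g - coef * e for g, e in zip(grad, concept)]

g = project_out([3.0, 4.0], [1.0, 0.0])
dot = sum(a * b for a, b in zip(g, [1.0, 0.0]))  # ~0: component removed
```

Because only the component along the embedding is excised, the rest of the training signal (here the second coordinate) passes through untouched, which is how the method avoids discarding whole images.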
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at this https URL.
https://arxiv.org/abs/2512.10946
In our work we hint, though not explicitly, that it is a misconception to think that humans learn fast: the learning process takes time. Babies start learning to move in the restricted fluid-filled environment of the womb. Children are often limited by an underdeveloped body. Even adults are not allowed to participate in complex competitions right away. With robots, however, when learning from scratch we rarely have the privilege of waiting tens of millions of steps. "Swaddling" regularization restrains an agent during rapid but unstable development, penalizing action strength in a specific way without affecting the actions directly. Symphony, a Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of ideas that makes training humanoid robots from scratch possible, with Sample Efficiency, Sample Proximity, and Safety of Actions in mind. It is no secret that a continuous increase in Gaussian noise without appropriate smoothing is harmful to motors and gearboxes. Compared to stochastic algorithms, we set a limited parametric noise and promote a reduced strength of actions, safely increasing entropy, since the actions are effectively immersed in weaker noise; when actions require more extreme values, they rise above the weak noise. Training becomes empirically much safer both for the surrounding environment and for the robot's mechanisms. We use a Fading Replay Buffer: a fixed formula containing the hyperbolic tangent adjusts the batch sampling probability, so the memory holds a recent memory and a long-term memory trail. The Fading Replay Buffer allows us to use a Temporal Advantage when the current Critic network's prediction improves over its exponential moving average. The Temporal Advantage lets us update the Actor and Critic in one pass, combine Actor and Critic in one object, and implement their losses in one line.
https://arxiv.org/abs/2512.10477
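The Fading Replay Buffer's sampling scheme can be sketched as a tanh-shaped weighting over buffer indices: a roughly flat "recent memory" region plus a fading long-term trail. The abstract only states that a fixed tanh-based formula is used, so the exact shape, `recent_frac`, and `sharpness` below are illustrative assumptions.

```python
import math

def fading_sample_probs(n, recent_frac=0.3, sharpness=6.0):
    """Per-index sampling probabilities over a buffer of size n.

    Index 0 is the oldest transition, n-1 the newest.  A tanh ramp
    keeps recent transitions highly likely while old ones fade toward
    (but never reach) zero probability.
    """
    probs = []
    pivot = 1.0 - recent_frac
    for i in range(n):
        x = i / max(n - 1, 1)            # 0 = oldest, 1 = newest
        w = 0.5 * (1.0 + math.tanh(sharpness * (x - pivot)))
        probs.append(w)
    z = sum(probs)
    return [w / z for w in probs]

p = fading_sample_probs(100)
```

Because old transitions keep nonzero probability, the buffer retains a long-term trail, which is what makes the Temporal Advantage comparison against an exponential moving average meaningful.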
3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views lack the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at this https URL.
https://arxiv.org/abs/2512.10369
The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
https://arxiv.org/abs/2512.10150
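DER, the best-performing method in this study, is Dark Experience Replay: alongside the task loss on new data, it replays buffered samples and penalizes (via MSE) drift of the model's current logits from the logits stored when those samples were first seen. The sketch below shows just that distillation term; the task-loss term and buffer management are omitted for brevity.

```python
def der_replay_loss(current_logits, buffer_logits, alpha=0.5):
    """DER-style penalty on a replayed sample: mean squared error
    between current logits and the logits recorded at storage time,
    scaled by alpha.  Keeps safety-aligned behavior from drifting
    during fine-tuning."""
    n = len(buffer_logits)
    mse = sum((c - b) ** 2 for c, b in zip(current_logits, buffer_logits)) / n
    return alpha * mse

drifted = der_replay_loss([2.0, -1.0], [0.0, 0.0])  # model moved: penalized
stable = der_replay_loss([0.0, 0.0], [0.0, 0.0])    # no drift: zero loss
```

In the fine-tuning-as-a-service setup, the buffer would hold safety-alignment examples, so the penalty anchors the customized model to its pre-fine-tuning safety behavior.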
Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
https://arxiv.org/abs/2512.09910
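The gradient-based regularization on the low-rank factors can be sketched as an importance-weighted penalty: entries of the LoRA matrices whose historical gradient magnitude was large (i.e., mattered for earlier domains) pay more for drifting. This EWC-style form restricted to the LoRA factors is an illustrative assumption; the paper's exact weighting may differ.

```python
def weighted_lora_penalty(delta_A, hist_grad_A, lam=1e-3):
    """Penalty on low-rank updates, weighted by historical gradient
    information.

    delta_A: flattened change of a LoRA factor since the previous task.
    hist_grad_A: accumulated gradient magnitudes from earlier tasks
    (a proxy for how important each entry was to them).
    """
    return lam * sum(h * d * d for h, d in zip(hist_grad_A, delta_A))

# Same drift, but on entries that were important vs. unimportant before.
drift_important = weighted_lora_penalty([1.0, 0.0], [5.0, 5.0])
drift_unimportant = weighted_lora_penalty([1.0, 0.0], [0.1, 0.1])
```

Because only the LoRA factors are regularized, the penalty is cheap relative to full-parameter schemes, matching the paper's parameter-efficiency goal.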
Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1\%, 5\%, and 10\% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
https://arxiv.org/abs/2512.09801
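The channel-wise attention inside MEM can be sketched as squeeze-and-excitation-style gating: each channel of a modality's feature map is scaled by a sigmoid gate. The small network that would produce the gate logits from pooled features is omitted here, and all names are illustrative assumptions.

```python
import math

def channel_attention(feature_channels, gate_logits):
    """Scale each channel of a feature map by a sigmoid gate.

    feature_channels: list of channels, each a flat list of values.
    gate_logits: one logit per channel (from a small gating network,
    not shown); sigmoid maps them to (0, 1) scaling factors.
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return [[v * g for v in ch] for ch, g in zip(feature_channels, gates)]

# Two channels: one strongly gated on, one gated off.
out = channel_attention([[2.0, 2.0], [2.0, 2.0]], [10.0, -10.0])
```

Per-modality gating of this kind lets the network emphasize semantic cues unique to each MRI sequence before the CIF module exchanges information across modalities.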
Conventional SLAM systems using visual or LiDAR data often struggle in poor lighting and severe weather. Although 4D radar is suited for such environments, its sparse and noisy point clouds hinder accurate odometry estimation, while the radar maps suffer from obscure and incomplete structures. Thus, we propose Super4DR, a 4D radar-centric framework for learning-based odometry estimation and Gaussian-based map optimization. First, we design a cluster-aware odometry network that incorporates object-level cues from the clustered radar points for inter-frame matching, alongside a hierarchical self-supervision mechanism to overcome outliers through spatio-temporal consistency, knowledge transfer, and feature contrast. Second, we propose using 3D Gaussians as an intermediate representation, coupled with a radar-specific growth strategy, selective separation, and multi-view regularization, to recover blurry map areas and those undetected based on image texture. Experiments show that Super4DR achieves a 67% performance gain over prior self-supervised methods, nearly matches supervised odometry, and narrows the map quality disparity with LiDAR while enabling multi-modal image rendering.
https://arxiv.org/abs/2512.09608
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at this https URL.
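The spatio-temporal tiling scheme can be illustrated with a minimal spatial-tiling sketch in NumPy: split a frame into overlapping tiles, process each tile independently, and average the overlapping regions back together. This is an assumption-laden toy (uniform blending weights, a single frame, no temporal axis), not the paper's scheme.

```python
import numpy as np

def tile_spatial(frame, tile, overlap):
    """Split an (H, W, C) frame into overlapping square tiles."""
    step = tile - overlap
    h, w = frame.shape[:2]
    tiles, coords = [], []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            tiles.append(frame[y:y + tile, x:x + tile])
            coords.append((y, x))
    return tiles, coords

def merge_tiles(tiles, coords, shape):
    """Re-assemble tiles, averaging where they overlap (uniform weights)."""
    out = np.zeros(shape)
    wsum = np.zeros(shape[:2] + (1,))
    for t, (y, x) in zip(tiles, coords):
        th, tw = t.shape[:2]
        out[y:y + th, x:x + tw] += t
        wsum[y:y + th, x:x + tw] += 1.0
    return out / wsum

rng = np.random.default_rng(1)
frame = rng.normal(size=(8, 8, 3))
tiles, coords = tile_spatial(frame, tile=4, overlap=2)
recon = merge_tiles(tiles, coords, frame.shape)     # identity "processing"
```

With identity per-tile processing the merge reconstructs the frame exactly; in the generation setting each tile would instead be synthesized by the model, and the overlap averaging hides seams.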
https://arxiv.org/abs/2512.09363
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct such avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometric detail related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that produces accurate pose-dependent deformations for high-fidelity geometric detail. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
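The articulation mechanism that the dynamic skinning weights build on is standard linear blend skinning (LBS): each point is deformed by a weighted blend of per-joint rigid transforms. Below is a minimal NumPy sketch of plain LBS; in RnD-Avatar the weights would additionally depend on pose, which is not modeled here, and all names are illustrative.

```python
import numpy as np

def lbs(points, weights, transforms):
    """Linear blend skinning.

    points: (N, 3) rest-pose points; weights: (N, J) skinning weights with
    rows summing to 1; transforms: (J, 4, 4) per-joint rigid transforms.
    Each point is moved by its weighted blend of joint transforms.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    blended = np.einsum('nj,jab->nab', weights, transforms)             # (N, 4, 4)
    return np.einsum('nab,nb->na', blended, homo)[:, :3]

# two joints: identity and a pure translation of 1 along x
T = np.stack([np.eye(4), np.eye(4)])
T[1, 0, 3] = 1.0
pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
w = np.array([[1.0, 0.0],     # fully bound to the static joint
              [0.5, 0.5]])    # split evenly between the two joints
out = lbs(pts, w, T)
```

The first point stays put while the second moves half of the translated joint's offset; making `w` a learned function of pose is what turns this into the dynamic skinning described above.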
https://arxiv.org/abs/2512.09335