Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
设计高效且有效的架构骨干是提升基础模型能力的核心研究方向。受人类认知现象“注意力偏差”——即优先关注某些事件或刺激的自然倾向启发,我们将神经网络架构(包括Transformer、Titans以及现代线性循环神经网络)重新构想为关联记忆模块,这些模块利用内部目标学习键值映射,我们将其称为“注意力偏置”。令人惊讶的是,我们观察到大多数现有的序列模型要么依赖点积相似度,要么采用L2回归作为其注意力偏置。超越这些目标,本文提出了一套替代的注意力偏置配置及其有效近似方法以稳定训练过程。此外,我们将现代深度学习架构中的遗忘机制重新解释为一种保留正则化形式,并为此类序列模型提供了一组新的遗忘门。 基于上述洞察,我们提出了Miras框架,该框架用于设计深度学习架构并提供了四种选择:(i)关联记忆结构;(ii)注意力偏置目标;(iii)保留门;以及(iv)内存学习算法。随后,本文介绍了三种新颖的序列模型——Moneta、Yaad和Memora,这些模型在超越现有线性RNN能力的同时保持了快速可并行化的训练过程。实验结果表明,在Miras框架下做出的不同设计选择会产出具有不同优势的模型。例如,某些Miras实例在诸如语言建模、常识推理以及记忆密集型任务等特定任务上表现出色,甚至超越了Transformer和其他现代线性循环模型的表现。
https://arxiv.org/abs/2504.13173
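The associative-memory view described in this abstract can be illustrated with a small sketch. It is not the Miras implementation: it assumes a linear memory matrix `M`, an L2 regression attentional bias 0.5 * ||M k - v||^2, a scalar retention factor `alpha` standing in for the forget gate, and plain gradient descent as the memory learning algorithm. All function and parameter names are hypothetical.

```python
def recall(M, k):
    """Read from the memory: return M @ k for a memory stored as a list of rows."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]


def memory_step(M, k, v, lr=0.1, alpha=0.9):
    """One online update of the memory on a (key, value) pair.

    Assumed objective: 0.5 * ||M k - v||^2 (L2 attentional bias).
    alpha < 1 shrinks the old memory, acting as a simple retention gate.
    """
    err = [p - t for p, t in zip(recall(M, k), v)]
    # The gradient of the objective w.r.t. M is the outer product err * k^T.
    return [[alpha * m_ij - lr * e_i * k_j for m_ij, k_j in zip(row, k)]
            for row, e_i in zip(M, err)]


M = [[0.0, 0.0], [0.0, 0.0]]
M = memory_step(M, k=[1.0, 0.0], v=[2.0, 3.0], lr=1.0, alpha=1.0)
print(recall(M, [1.0, 0.0]))
```

Swapping the squared error for another objective, or the scalar `alpha` for a different retention term, corresponds to changing choices (ii) and (iii) in the framework; the sketch only fixes one simple combination.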
Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0-5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers. The loss simultaneously maximizes forget set loss via output token cross-entropy penalties and minimizes retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.
大型语言模型(LLMs)在训练过程中经常会记住敏感信息,在部署公开可访问的模型时会带来风险。当前的机器遗忘方法难以选择性地删除特定数据关联而不损害整体模型能力。本文提出了我们针对SemEval-2025 Task 4提出的针对性遗忘问题的解决方案,该方案采用了一种两阶段的方法结合因果中介分析和层特异性优化。 通过在OLMo架构(1B和7B参数)上进行系统的因果追踪实验,我们发现前几层变换器层(第0到第5层)对于存储主体属性关联在其MLP模块中起着关键作用。基于这一见解,我们开发了一种受约束的优化方法,在这种方法中冻结高层并应用一种新颖的联合损失函数对低层进行操作——同时通过输出令牌交叉熵惩罚最大化遗忘集合损失,并通过自适应正则化最小化保留集合偏差。 我们的方法在1B模型赛道上取得了第二名的成绩,展示了强大的任务性能,同时保持了基线MMLU准确率的88%。这些结果确立了因果信息层优化作为LLMs高效精确遗忘的一种有前景的方法,为解决AI系统中的数据隐私问题提供了重要的一步。
https://arxiv.org/abs/2504.12996
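The two ingredients named in this abstract, freezing upper layers and a joint forget/retain loss, can be sketched as follows. This is a simplification under stated assumptions, not the authors' code: the cutoff of 5 comes from the "layers 0-5" finding, the adaptive regularization is reduced to a constant `reg_weight`, and both function names are hypothetical.

```python
def trainable_layers(num_layers, cutoff=5):
    """Return a per-layer flag: True for layers 0..cutoff, False (frozen) above."""
    return [i <= cutoff for i in range(num_layers)]


def joint_unlearning_loss(forget_ce, retain_deviation, reg_weight=1.0):
    """Joint objective to minimize.

    Minimizing -forget_ce maximizes the cross-entropy on the forget set,
    while reg_weight * retain_deviation keeps retain-set behavior close
    to the original model.
    """
    return -forget_ce + reg_weight * retain_deviation


print(trainable_layers(8))
print(joint_unlearning_loss(forget_ce=3.0, retain_deviation=0.5, reg_weight=2.0))
```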
Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a large amount of memory to store the contextual representations of all document tokens. Some works have proposed using either heuristics or statistical techniques to prune tokens from each document. This, however, does not guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in-domain and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
类似于ColBERT这样的后期交互神经信息检索模型在许多基准测试中提供了一个具有竞争力的效果与效率的权衡。然而,它们需要巨大的内存空间来存储所有文档标记的上下文表示。一些研究提出使用基于规则或统计的方法从每个文档中删除标记以减少内存占用,但这并不能保证被移除的标记不会对检索分数产生影响。我们的工作采用了一种原理性的方法来定义如何在不影响查询和文档之间的得分的情况下进行标记删减。我们引入了三种正则化损失函数,这些损失函数可以诱导出具有高删减率的解决方案,并且还提出了两种删除策略。通过实验研究(包括内部和外部领域),我们发现可以在只使用30%的标记情况下保持ColBERT的表现性能不变。
https://arxiv.org/abs/2504.12778
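To see why some document tokens can be removed without changing the score, it helps to write out late-interaction (MaxSim) scoring: each query token contributes only its best-matching document token. The sketch below is an illustration for a fixed set of query vectors, not the paper's pruning criterion (which has to hold for any query); the helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def unused_tokens(query_vecs, doc_vecs):
    """Indices of document tokens that are never the best match for these queries."""
    used = set()
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        used.add(sims.index(max(sims)))
    return [j for j in range(len(doc_vecs)) if j not in used]


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
drop = unused_tokens(query, doc)
pruned = [d for j, d in enumerate(doc) if j not in drop]
print(drop, maxsim_score(query, doc), maxsim_score(query, pruned))
```

In this toy case the third document token never attains the maximum, so removing it leaves the score unchanged.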
The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.
椭圆形的先验信息在医学和自然图像特定任务中的图像分割准确性提升方面起着关键作用。现有的基于深度学习的分割方法,包括“Segment Anything Model(SAM)”,往往难以有效地生成具有椭圆形状的分割结果。本文提出了一种新方法,通过变分法将椭圆形的先验信息整合到基于深度学习的SAM图像分割技术中。所提出的方法建立了一个参数化的椭圆轮廓场,限制了分割结果与预定义的椭圆轮廓对齐。利用双算法,该模型可以无缝地融合图像特征和椭圆先验及空间正则化先验,从而大幅提高分割精度。通过将SAM分解为四个数学子问题,我们将变分椭圆先验整合到新的SAM网络结构设计中,确保了SAM的分割输出由椭圆形区域组成。在某些特定图像数据集上的实验结果表明,相较于原始的SAM方法,本研究的方法有所改进。
https://arxiv.org/abs/2504.12556
Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. We propose AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention distributions. We introduce three variants: 1. Hard Attention Masking: randomly zeroes out top-k attention logits per query to encourage diverse context utilization. 2. Blurred Attention Smoothing: applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions. 3. Consistency-Regularized AttentionDrop: enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss.
基于Transformer的架构在自然语言处理、计算机视觉和语音等多个任务中实现了最先进的性能。然而,它们巨大的容量通常会导致过拟合问题,尤其是在训练数据有限或存在噪声的情况下。为此,我们提出了AttentionDrop,这是一种统一的随机正则化技术家族,直接作用于自注意力分布上。AttentionDrop包括三种变体: 1. **硬注意力掩码(Hard Attention Masking)**:对于每个查询,随机将top-k注意力得分置零,以鼓励多样化地利用上下文信息。 2. **模糊注意力平滑(Blurred Attention Smoothing)**:通过动态的高斯卷积应用在注意力得分上,来扩散过于尖锐分布。 3. **一致性正则化AttentionDrop(Consistency-Regularized AttentionDrop)**:使用基于KL散度的一致性损失,在多个独立的AttentionDrop扰动下强制输出稳定性。 这些方法旨在提高模型泛化能力,尤其是在数据量较小或质量较差的情况下。
https://arxiv.org/abs/2504.12088
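The first variant, Hard Attention Masking, can be sketched for a single query row as below. This is one possible reading rather than the authors' implementation: the abstract says the top-k logits are "zeroed out", and the sketch assumes this means masking them to negative infinity so that their attention weight becomes zero after the softmax. The `drop_prob` parameter and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one row of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_attention_mask(logits, k=1, drop_prob=0.5, rng=random):
    """Independently mask each of the k largest logits with probability drop_prob."""
    k = min(k, len(logits) - 1)  # always leave at least one logit unmasked
    top = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    masked = list(logits)
    for j in top:
        if rng.random() < drop_prob:
            masked[j] = -math.inf  # assumed masking convention, see lead-in
    return masked


row = [2.0, 1.0, 0.0]
print(softmax(hard_attention_mask(row, k=1, drop_prob=1.0)))
```

With the strongest key masked, the remaining attention mass is redistributed over the other keys, which is the "diverse context utilization" effect the abstract describes. Such masking would be applied only during training.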
Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
尽管大型视频语言模型(LVLM)在最近取得了进展,它们仍然难以理解视频中的细微时间信息、会出现幻觉现象,并且在简单的视频问答任务中也会犯一些基本错误。所有这些都对它们在实际应用中的安全和可靠部署构成了重大挑战。为了解决这些问题,我们提出了一种自我校准框架,该框架可以让LVLM从自身错误中学习。 我们的方法首先获得一组优选响应与非优选响应的训练集,其中非优选响应是通过加入常见的错误模式生成的,这些错误通常由于时空理解不足、共同出现概念之间的虚假相关性以及过分依赖语言线索而忽视视觉模态等原因产生。为了使LVLM能够与构建出的优选和非优选响应对进行自我校准,我们提出了一种新颖的偏好优化方法——精炼正则化偏好优化(RRPO),它利用子序列级别的精细奖励及逐令牌的KL正则化来解决直接偏好优化(DPO)的局限性。实验表明,与DPO相比,RRPO能够实现更精确的校准和更加稳定的训练过程。 我们的实验证明了该方法在视频幻觉、短/长视频理解以及细微时间推理等多样化视频任务中的有效性。
https://arxiv.org/abs/2504.12083
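For reference, the standard Direct Preference Optimization objective that RRPO is positioned against can be written as follows, where $y_w$ and $y_l$ are the preferred and non-preferred responses to input $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is a temperature. The RRPO loss itself is not given in the abstract and is not reproduced here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The reward here is defined over whole responses, which is the granularity that the abstract's sub-sequence-level rewards and token-wise KL regularization are meant to refine.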
Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: this https URL
最近在无源非监督视频领域适应(SFUVDA)方面的进展利用了视觉语言模型来增强伪标签生成。然而,诸如噪声伪标签和过于自信的预测等挑战限制了它们跨域适应的有效性。我们提出了一种名为Co-STAR的新框架,该框架结合了课程学习,并实现了在源训练教师与对比视觉语言模型(CLIP)之间的协作自我训练。我们的课程学习方法采用了一个基于可靠性的权重函数,用于衡量教师和CLIP之间双向预测对齐的程度,在自信预测和不确定预测之间进行平衡。此功能保留了困难样本的不确定性,同时当两个模型的预测紧密对准时优先选择可靠的伪标签。 为了进一步提高适应性,我们提出了自适应课程正则化,这是一种根据样本的信心分数和预测稳定性以概率、自适应的方式调整学习优先级的方法,从而减轻对噪声及过于自信样本过度拟合的问题。广泛的实验结果显示,在多个视频领域适应基准上,Co-STAR始终优于现有的最先进的SFUVDA方法。 代码可在以下网址获取:[此URL]
https://arxiv.org/abs/2504.11669
Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing a superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.
表面法线估计是计算机视觉应用中的重要基础。尽管已经为静态图像场景投入了大量努力,但在基于视频的法线估计中确保时间连贯性仍然是一个艰巨的挑战。我们没有简单地将现有方法与时间组件结合,而是提出了NormalCrafter来利用视频扩散模型固有的时间先验信息。为了在整个序列中保持高保真的法线估计,我们提出了一种语义特征正则化(SFR),该方法使扩散特征与语义线索对齐,并鼓励模型专注于场景的内在语义。此外,我们还引入了一个两阶段训练协议,利用潜在空间和像素空间学习来保留空间精度并保持长时间的时间上下文。广泛的评估展示了我们方法的有效性,在生成具有复杂细节的不同视频中时间上一致的法线序列方面表现出卓越性能。
https://arxiv.org/abs/2504.11427
While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets (about 20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.
尽管多模态融合在多模态情感分析(MSA)中已经得到了广泛研究,但关于融合深度和多模态容量分配的作用仍需进一步探索。在这项工作中,我们定位了融合深度、可扩展性和专有的多模态容量作为有效融合的主要因素。我们引入了一种新的多模态语言模型DeepMLF,该模型具有针对深层融合而定制的可学习令牌。DeepMLF利用了一个音频-视觉编码器和一个通过其各个层融入了多模态信息的预训练解码器LM。我们在LM中附加了可学习的令牌,以:1)有控制地捕获模式间的交互作用;2)为每种模式保留独立的信息流。这些融合令牌通过LM块中的因果自注意力机制收集语言信息,并通过跨注意力MM块与音频-视觉信息集成。作为专有的多模态容量,这种设计允许在多个层级上进行渐进式融合,在融合过程中提供深度。 我们的训练方案结合了特定于模式的损失和语言建模损失,其中解码器LM的任务是预测真实极性。在具有不同数据集特征的三个MSA基准测试中,DeepMLF实现了最先进的性能。我们的结果证实了更深层次的融合能够带来更好的表现,并且最佳融合深度(5-7)超过了现有方法的表现。此外,我们关于融合令牌数量的分析表明,较小的令牌集合(约20个)可以达到最优性能。 通过音频-视觉编码器初始化实验,我们探讨了表示学习顺序的重要性(即融合课程)。我们的消融研究展示了所提出融合设计和门控机制的优越性,并提供了对DeepMLF在大规模语言模型中的可扩展性的全面评估。此外,还揭示了每个训练目标及嵌入正则化的影响。
https://arxiv.org/abs/2504.11082
Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, including the data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.
视觉变压器(ViT)在包括许多关键安全任务在内的各种应用中展示了出色的性能。然而,它们独特的架构特性提出了新的挑战和机会,在对抗鲁棒性方面尤其如此。我们观察到,与CNN相比,在ViTs上生成的对抗样本表现出更高的转移能力,这表明ViTs包含有利于可传输攻击的结构特征。在这项工作中,我们研究了计算冗余在ViT中的作用及其对对抗转移的影响。不同于以往旨在为了效率减少计算的研究,我们提出利用这种冗余来提高对抗样本的质量和转移性。通过详细分析,我们识别出两种形式的冗余,包括数据级和模型级,这些冗余可以被用来放大攻击效果。基于这一见解,我们设计了一系列技术,包括注意力稀疏操作、注意力头排列、干净令牌正则化、幽灵MoE多样化以及测试时间对抗训练。 在ImageNet-1k数据集上的广泛实验验证了我们的方法的有效性,结果显示我们的方法在多样化的模型架构中,在转移能力和普遍性方面都显著优于现有基准。
https://arxiv.org/abs/2504.10804
Generative models often map noise to data by matching flows or scores, but these approaches become cumbersome for incorporating partial observations or additional priors. Inspired by recent advances in Wasserstein gradient flows, we propose Energy Matching, a framework that unifies flow-based approaches with the flexibility of energy-based models (EBMs). Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize this dynamic with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. Our method substantially outperforms existing EBMs on CIFAR-10 generation (FID 3.97 compared to 8.61), while retaining the simulation-free training of transport-based approaches away from the data manifold. Additionally, we exploit the flexibility of our method and introduce an interaction energy for diverse mode exploration. Our approach focuses on learning a static scalar potential energy -- without time conditioning, auxiliary generators, or additional networks -- marking a significant departure from recent EBM methods. We believe this simplified framework significantly advances EBM capabilities and paves the way for their broader adoption in generative modeling across diverse domains.
生成模型通常通过匹配流或分数将噪声映射到数据,但这些方法在结合部分观测值或额外先验时变得复杂。受最近关于Wasserstein梯度流进展的启发,我们提出了能量匹配(Energy Matching)框架,该框架统一了基于流的方法与能量模型(EBMs)的灵活性。远离数据流形时,样本沿无旋、最优传输路径从噪声到数据移动;当接近数据流形时,一个熵能项引导系统进入玻尔兹曼平衡分布,明确捕捉数据的基本概率结构。 我们用一个不依赖时间的标量场参数化这个动态过程,该场同时作为强大的生成器和灵活的先验,有效正则化逆问题。我们的方法在CIFAR-10数据集上的表现大幅超越了现有的EBMs(FID为3.97,而后者为8.61),同时还保留了传输基础方法远离数据流形时无需模拟训练的优势。 此外,我们利用方法的灵活性引入了一个交互能量,用于探索多模态分布。我们的方法专注于学习一个静态标量势能——不依赖于时间条件、辅助生成器或额外网络——这标志着与最近EBM方法的重大区别。我们认为这个简化框架显著增强了EBMs的能力,并为它们在各种领域内的广泛采用铺平了道路。 通过这种方法,能量匹配不仅提高了数据生成的质量和多样性,还使得模型更容易理解和实现,从而推动了基于能量的模型在生成性建模中的应用。
https://arxiv.org/abs/2504.10612
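The role of a single time-independent scalar potential can be illustrated with plain Langevin dynamics, whose stationary distribution is the Boltzmann distribution proportional to exp(-V(x) / T). This is a generic one-dimensional sketch, not the Energy Matching training or sampling procedure: the quadratic potential, step size, and function name are assumptions chosen for the example.

```python
import math
import random


def langevin_sample(grad_V, x0, step=0.1, temperature=1.0, n_steps=100, rng=random):
    """Run unadjusted Langevin dynamics on a 1D potential with gradient grad_V.

    With temperature = 0 the noise term vanishes and the update reduces to
    deterministic gradient descent on V.
    """
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_V(x) + math.sqrt(2.0 * step * temperature) * noise
    return x


# Illustrative potential V(x) = 0.5 * x^2, so grad V(x) = x.
print(langevin_sample(lambda x: x, x0=1.0, temperature=0.0, n_steps=10))
```

Because the same potential defines both the sampler and an unnormalized log-density, it can also be added as a prior term when solving an inverse problem, which is the flexibility the abstract refers to.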
Poisson-Gaussian noise describes the noise of various imaging systems, hence the need for efficient algorithms for Poisson-Gaussian image restoration. Deep learning methods offer state-of-the-art performance but often require sensor-specific training when used in a supervised setting. A promising alternative is given by plug-and-play (PnP) methods, which learn only a regularization through a denoiser, making it possible to restore images from several sources with the same network. This paper introduces PG-DPIR, an efficient PnP method for high-count Poisson-Gaussian inverse problems, adapted from DPIR. While DPIR is designed for white Gaussian noise, a naive adaptation to Poisson-Gaussian noise leads to prohibitively slow algorithms due to the absence of a closed-form proximal operator. To address this, we adapt DPIR to the specificities of Poisson-Gaussian noise and propose in particular an efficient initialization of the gradient descent required for the proximal step, which accelerates convergence by several orders of magnitude. Experiments are conducted on satellite image restoration and super-resolution problems. High-resolution realistic Pleiades images are simulated for the experiments, which demonstrate that PG-DPIR achieves state-of-the-art performance with improved efficiency, which seems promising for on-ground satellite processing chains.
泊松-高斯噪声描述了各种成像系统中的噪声特性,因此需要高效的算法来解决泊松-高斯图像恢复问题。深度学习方法提供了最先进的性能,但当在监督设置下使用时通常需要特定传感器的训练。一种有前景的替代方案是由插件播放(PnP)方法提供的,这些方法仅通过去噪器学习正则化项,从而能够利用同一网络从多个来源恢复图像。本文介绍了PG-DPIR,这是一种高效的PnP方法,用于解决高计数泊松-高斯逆问题,基于DPIR进行了改进。虽然DPIR是为白色高斯噪声设计的,但直接将其应用于泊松-高斯噪声会导致算法运行速度极其缓慢,因为缺乏封闭形式的近似算子。为了应对这一挑战,我们针对Poisson-Gaussian噪声的特点对DPIR进行了调整,并特别提出了一种高效的梯度下降初始化方法,用于加速近似步骤中的收敛速度,提高了几个数量级的速度。实验在卫星图像恢复和超分辨率问题上进行。利用高分辨率的现实Pleiades图像模拟了实验数据,结果表明PG-DPIR实现了最先进的性能并提高了效率,这似乎对于地面卫星处理链来说前景广阔。
https://arxiv.org/abs/2504.10375
Continual Learning (CL) epitomizes an advanced training paradigm wherein prior data samples remain inaccessible during the acquisition of new tasks. Numerous investigations have delved into leveraging a pre-trained Vision Transformer (ViT) to enhance model efficacy in continual learning. Nonetheless, these approaches typically utilize a singular, static backbone, which inadequately adapts to novel tasks, particularly when engaging with diverse data domains, due to a substantial number of inactive parameters. This paper addresses this limitation by introducing an innovative Self-Controlled Dynamic Expansion Model (SCDEM), which orchestrates multiple distinct trainable pre-trained ViT backbones to furnish diverse and semantically enriched representations. Specifically, by employing the multi-backbone architecture as a shared module, the proposed SCDEM dynamically generates a new expert with minimal parameters to accommodate a new task. A novel Collaborative Optimization Mechanism (COM) is introduced to synergistically optimize multiple backbones by harnessing prediction signals from historical experts, thereby facilitating new task learning without erasing previously acquired knowledge. Additionally, a novel Feature Distribution Consistency (FDC) approach is proposed to align semantic similarity between previously and currently learned representations through an optimal transport distance-based mechanism, effectively mitigating negative knowledge transfer effects. Furthermore, to alleviate over-regularization challenges, this paper presents a novel Dynamic Layer-Wise Feature Attention Mechanism (DLWFAM) to autonomously determine the penalization intensity on each trainable representation layer. An extensive series of experiments have been conducted to evaluate the proposed methodology's efficacy, with empirical results corroborating that the approach attains state-of-the-art performance.
持续学习(CL)代表了一种先进的训练范式,在获取新任务时,之前的数据样本无法再访问。许多研究探讨了利用预训练的视觉变换器(ViT)来增强模型在持续学习中的效能。然而,这些方法通常采用单一静态骨干网络,这不足以适应新的任务,尤其是在处理多样化的数据领域时,因为大量的参数处于非活动状态。 本文通过引入一种创新的自我控制动态扩展模型(SCDEM)解决了这一限制问题,该模型可以协调多个可训练的不同预训练ViT骨干网络来提供多样化和语义丰富的表示。具体而言,通过将多骨干架构用作共享模块,所提出的SCDEM能够以最少参数量动态生成新的专家,以便适应新任务。 本文还引入了一种新颖的协同优化机制(COM),利用历史专家的预测信号同时优化多个骨干网络,从而在不删除之前获得的知识的情况下促进新任务的学习。此外,提出了一种基于最优传输距离的方法来对齐以前和当前学习表示之间的语义相似性——即一种新的特征分布一致性(FDC)方法——有效地减少了负面知识转移的影响。 为了减轻过度正则化带来的挑战,本文还提出了一个新颖的动态层级特征注意力机制(DLWFAM),能够自主决定在每个可训练表示层上的惩罚强度。进行了广泛的实验来评估所提出的方法的有效性,实验证明该方法达到了最先进的性能水平。
https://arxiv.org/abs/2504.10561
This paper jointly addresses the problems of data uncertainty, popularity bias, and exposure bias in session-based recommender systems. We study the symptoms of these biases both in item embeddings and in recommendations. We propose treating user interest as a stochastic process in the latent space and provide a model-agnostic implementation of this mathematical concept. The proposed stochastic component consists of three elements: debiasing item embeddings with regularization for embedding uniformity, modeling dense user interest from session prefixes, and introducing fake targets in the data to simulate extended exposure. We conducted computational experiments on two popular benchmark datasets, Diginetica and YooChoose 1/64, as well as several modifications of the YooChoose dataset with different ratios of popular items. The results show that the proposed approach allows us to mitigate the challenges mentioned.
本文共同探讨了基于会话的推荐系统中数据不确定性、流行偏差和曝光偏差的问题。我们研究了这些偏见在项目嵌入和推荐中的症状。我们提出将用户兴趣视为潜在空间中的随机过程,并提供了一种与模型无关的方式来实现这一数学概念。所提出的随机组件包括以下几个部分:通过嵌入均匀化正则化来去偏项目嵌入,从会话前缀中建模稠密的用户兴趣,以及在数据中引入虚假目标以模拟延长曝光。 我们在两个流行的基准数据集Diginetica和YooChoose 1/64上进行了计算实验,并对YooChoose数据集的不同热门物品比例的几种变体也进行了实验。结果显示所提出的方法能够缓解上述提到的问题。
https://arxiv.org/abs/2504.10005
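The abstract does not say which uniformity regularizer is applied to the item embeddings. One common choice, shown here purely as an assumption, is the Gaussian-potential uniformity loss on L2-normalized embeddings: the log of the mean of exp(-t * ||x - y||^2) over all pairs. Lower values mean the embeddings are spread more evenly over the unit sphere. The function names and the default `t = 2.0` are illustrative, and inputs are assumed to be nonzero vectors.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a nonzero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean pairwise Gaussian potential between unit embeddings."""
    unit = [normalize(e) for e in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, w)))
        for u, w in combinations(unit, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


clustered = [[1.0, 0.0], [1.0, 0.01]]
spread = [[1.0, 0.0], [-1.0, 0.0]]
print(uniformity_loss(clustered), uniformity_loss(spread))
```

Adding such a term to the training loss pushes embeddings apart, which counteracts the tendency of popular items to crowd one region of the latent space.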
All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a Sparse Prompt Module (SPM) that efficiently captures degradation-specific features while minimizing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
针对多种退化类型进行统一建模的一站式图像修复任务,在设计能够有效指导跨多场景修复的任务特定提示方面面临着重大挑战。虽然自适应提示学习可以实现端到端优化,但往往会生成重叠或冗余的任务表示形式。相反,基于预训练分类器得出的显式提示增强了辨别能力,却可能忽略重构所需的视觉关键信息。为了解决这些问题,我们引入了对比性提示学习(Contrastive Prompt Learning, CPL),这是一种通过两项互补创新来根本上增强提示与任务对齐的新框架:一种是有效捕捉特定退化特征的同时最小化冗余的稀疏提示模块(Sparse Prompt Module, SPM);另一种则是通过纳入不同退化类型中的负样本,明确加强任务边界的对比性提示正则化(Contrastive Prompt Regularization, CPR)。与以往主要关注于退化分类的方法相比,CPL优化了提示和修复模型之间的关键交互。在五个全面基准测试中进行的广泛实验表明,CPL能够持续提升现有的一站式图像修复模型,在标准多任务场景以及复杂混合退化设置中均取得了显著改进。我们的框架在保持参数效率的同时建立了新的性能标杆,并为统一图像修复提供了一种原理性的解决方案。
https://arxiv.org/abs/2504.09973
Humanoid locomotion is a challenging task due to its inherent complexity and high-dimensional dynamics, as well as the need to adapt to diverse and unpredictable environments. In this work, we introduce a novel learning framework for effectively training a humanoid locomotion policy that imitates the behavior of a model-based controller while extending its capabilities to handle more complex locomotion tasks, such as more challenging terrain and higher velocity commands. Our framework consists of three key components: pre-training through imitation of the model-based controller, fine-tuning via reinforcement learning, and model-assumption-based regularization (MAR) during fine-tuning. In particular, MAR aligns the policy with actions from the model-based controller only in states where the model assumption holds to prevent catastrophic forgetting. We evaluate the proposed framework through comprehensive simulation tests and hardware experiments on a full-size humanoid robot, Digit, demonstrating a forward speed of 1.5 m/s and robust locomotion across diverse terrains, including slippery, sloped, uneven, and sandy terrains.
人形机器人的行走是一个具有挑战性的任务,因为它本身固有的复杂性和高维度的动力学特性,以及需要适应多样和不可预测的环境。在这项工作中,我们引入了一个新颖的学习框架,用于有效训练一个人形机器人行走策略,该策略模仿基于模型控制器的行为,并扩展其能力以处理更复杂的行走任务,例如更具挑战性的地形和更高的速度命令。我们的框架包含三个关键组成部分:通过模仿基于模型控制器进行预训练、利用强化学习进行微调以及在微调过程中使用基于假设的正则化(MAR)。特别是,在假设成立的状态下,MAR 使策略与来自基于模型控制器的动作保持一致,以防止灾难性遗忘的发生。 我们通过对全尺寸人形机器人Digit进行了全面的仿真测试和硬件实验来评估提出的框架。这些实验显示了1.5米/秒的前向速度,并在各种地形(包括滑坡、斜坡、不平整以及沙地)上展示了稳健的行走能力。
https://arxiv.org/abs/2504.09833
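The idea behind model-assumption-based regularization (MAR) can be sketched as a gated imitation penalty added to the reinforcement learning loss. This is a minimal illustration, not the authors' formulation: the squared-error form of the penalty, the averaging over the batch, and the boolean `assumption_holds` flag are all assumptions, and the function names are hypothetical.

```python
def mar_penalty(policy_action, controller_action, assumption_holds, weight=1.0):
    """Penalize deviation from the model-based controller only where its model is valid."""
    if not assumption_holds:
        return 0.0  # outside the model's validity region the policy is left free
    return weight * sum((p - c) ** 2 for p, c in zip(policy_action, controller_action))


def total_loss(rl_loss, batch, weight=1.0):
    """RL loss plus the mean MAR penalty over (policy, controller, holds) triples."""
    penalties = [mar_penalty(p, c, holds, weight=weight) for p, c, holds in batch]
    return rl_loss + sum(penalties) / len(penalties)


batch = [
    ([1.0, 0.0], [0.0, 0.0], True),   # model assumption holds: penalized
    ([5.0, 5.0], [0.0, 0.0], False),  # assumption violated: no penalty
]
print(total_loss(rl_loss=2.0, batch=batch))
```

Gating the penalty keeps the pre-trained behavior where the controller is trustworthy while letting reinforcement learning explore in states the controller's model cannot handle, such as the harder terrains the abstract mentions.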
In the realm of large language models (LLMs), growing model size brings higher training costs, and there is an urgent need to minimize the data size in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full dataset, and it offers better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we propose a method that learns pseudo prompt data based on trajectory matching and finds its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM architectures (i.e., OPT to Llama).
在大型语言模型(LLM)领域,随着大模型规模的增加,其训练成本也随之上升。因此,在LLM训练中减少数据量的需求变得尤为迫切。与数据选择方法相比,数据蒸馏方法旨在合成少量的数据样本以达到全数据集的训练效果,并具有更好的灵活性。尽管在计算机视觉领域取得了成功,但文本数据的离散性至今阻碍了它在自然语言处理(NLP)中的探索。在这项工作中,我们提出了一种基于轨迹匹配学习伪提示数据并找到其最近邻ID的方法,以实现跨架构迁移。在蒸馏过程中,我们引入了一个正则化损失来提高蒸馏数据的鲁棒性。据我们所知,这是第一个适用于指令微调等文本生成任务的数据蒸馏方法。我们在两个基准测试(包括ARC-Easy和MMLU指令微调数据集)上进行了评估,并证明了我们的蒸馏方法优于最先进的数据选择方法LESS。此外,我们的方法在大型语言模型结构之间展示了良好的迁移能力(例如从OPT到Llama)。
https://arxiv.org/abs/2504.09818
Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.
半监督语义分割(SSSS)的目标是通过利用少量标记图像和大量未标记数据来提高分割精度。最近的研究主要集中在伪标签生成、一致性正则化和协同训练策略上。然而,现有的方法在平衡全局语义表示与精细的局部特征提取方面存在困难。为了解决这一挑战,我们提出了一种新的三分支半监督分割框架,并引入了双教师策略,命名为IGL-DT(Instance Guidance Learning with Dual Teachers)。我们的方法使用SwinUnet进行高层次语义引导,通过全局上下文学习来实现;同时利用ResUNet对细节特征进行细化,通过局部区域学习来完成。此外,我们还设计了一种差异学习机制,以减少对单一教师的过度依赖,促进自适应特征学习。 在基准数据集上的广泛实验表明,我们的方法优于现有的最先进方案,在各种数据规模下均取得了卓越的分割性能。
https://arxiv.org/abs/2504.09797
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: this https URL.
在线3D占用预测提供了对具身环境的全面空间理解。尽管创新性的EmbodiedOcc框架利用了3D语义高斯分布来进行渐进式室内占用预测,但它忽略了室内环境中主要由平面结构所决定的几何特性。本文介绍了一种名为EmbodiedOcc++的方法,在原始框架的基础上进行了两个关键改进:一个称为几何引导细化模块(GRM)的功能,通过平面正则化来限制高斯更新;以及一个语义感知不确定性采样器(SUS),使得在连续帧之间的重叠区域中可以更有效地进行更新。GRM通过表面法线的约束对位置更新进行了调整,并使用基于曲率和深度的约束自适应地确定了正则化的权重,使语义高斯分布能够与平面表面精确对齐并在复杂区域中灵活变化。为了有效提高不同视角下的几何一致性,SUS自适应选择合适的高斯函数进行更新。 在EmbodiedOcc-ScanNet基准测试上进行全面实验后发现,EmbodiedOcc++在各种设置下均达到了最先进的性能水平。我们的方法展示了边缘精度的提升,并保留了更多的几何细节,同时确保了计算效率,这对于在线具身感知至关重要。代码将在以下网址发布:[请在此处插入实际链接]。
https://arxiv.org/abs/2504.09540
Longitudinal image registration enables studying temporal changes in brain morphology, which is useful in applications where monitoring the growth or atrophy of specific structures is important. However, this task is challenging due to noise and artifacts in the data and the difficulty of quantifying small anatomical changes between sequential scans. We propose a novel longitudinal registration method that models structural changes using temporally parameterized neural displacement fields. Specifically, we implement an implicit neural representation (INR) using a multi-layer perceptron that serves as a continuous coordinate-based approximation of the deformation field at any time point. In effect, for any N scans of a particular subject, our model takes as input a 3D spatial coordinate (x, y, z) and a corresponding temporal representation t and learns to describe the continuous morphology of structures for both observed and unobserved points in time. Furthermore, we leverage the analytic derivatives of the INR to derive a new regularization function that enforces a monotonic rate of change in the trajectory of the voxels, which is shown to provide more biologically plausible patterns. We demonstrate the effectiveness of our method on 4D brain MR registration.
纵向图像配准能够研究大脑形态的时变变化,这对于监测特定结构生长或萎缩的应用非常重要。然而,由于数据中的噪声/伪影以及量化连续扫描之间微小解剖变化的难度,这一任务颇具挑战性。我们提出了一种新颖的纵向注册方法,该方法使用时间参数化的神经位移场来建模结构性的变化。具体而言,我们采用一个多层感知机实现隐式神经表示(INR),作为变形场在任意时间点上的连续坐标基逼近。 实际上,对于特定受试者的N次扫描,我们的模型将3D空间坐标位置(x, y, z)和对应的时间表示t作为输入,并学习描述从观察到的和未观察到的所有时间节点上结构的持续形态变化。此外,我们利用INR的解析导数推导出一个新的正则化函数,该函数强制执行体素轨迹在时间上的单调变化速率,这被证明能提供更符合生物学实际的变化模式。 我们在4D脑MRI配准中展示了我们方法的有效性。
https://arxiv.org/abs/2504.09514