While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, we design a novel token length predictor, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences over the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at this https URL.
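A minimal sketch of how an adaptive token budget can be made differentiable, assuming a sigmoid soft-mask over a fixed pool of mask tokens (the module name, shapes, and gating scheme below are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class SoftLengthChunker(nn.Module):
    """Hypothetical sketch: predict a token budget L for a mask, then gate a
    fixed pool of K candidate tokens with a differentiable soft mask so that
    gradients can flow through the chunking decision."""

    def __init__(self, dim: int, max_tokens: int = 32, tau: float = 1.0):
        super().__init__()
        self.max_tokens = max_tokens
        self.tau = tau
        self.length_head = nn.Linear(dim, 1)  # predicts a scalar length

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor):
        # feats: (B, dim) pooled query features; tokens: (B, K, dim)
        length = self.max_tokens * torch.sigmoid(self.length_head(feats))  # (B, 1)
        idx = torch.arange(self.max_tokens, device=feats.device).float()   # (K,)
        # Soft step mask: ~1 for positions < length, ~0 beyond; tau = sharpness
        mask = torch.sigmoid((length - idx) / self.tau)                    # (B, K)
        gated = tokens * mask.unsqueeze(-1)
        # Length regularization penalizes the predicted token budget itself
        length_reg = length.mean()
        return gated, mask, length_reg
```

At inference, the soft mask can be thresholded so that simple shapes emit fewer tokens than intricate ones.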
https://arxiv.org/abs/2505.16495
Reinforcement learning (RL)-based legged locomotion controllers often require meticulous reward tuning to track velocities or goal positions while preserving smooth motion on various terrains. Motion imitation methods via RL using demonstration data reduce reward engineering but fail to generalize to novel environments. We address this by proposing a hierarchical RL framework in which a low-level policy is first pre-trained to imitate animal motions on flat ground, thereby establishing motion priors. A subsequent high-level, goal-conditioned policy then builds on these priors, learning residual corrections that enable perceptive locomotion, local obstacle avoidance, and goal-directed navigation across diverse and rugged terrains. Simulation experiments illustrate the effectiveness of the learned residuals in adapting to progressively challenging uneven terrains while still preserving the locomotion characteristics provided by the motion priors. Furthermore, our results demonstrate improvements in motion regularization over baseline models trained without motion priors under similar reward setups. Real-world experiments with an ANYmal-D quadruped robot confirm our policy's ability to generalize animal-like locomotion skills to complex terrains, demonstrating smooth, efficient locomotion and local navigation amid challenging, obstacle-filled terrain.
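A hedged sketch of the core residual composition, assuming the frozen low-level imitation policy outputs base actions and the high-level policy adds bounded corrections (names and scales are hypothetical):

```python
import torch
import torch.nn as nn

class ResidualLocomotionPolicy(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): a frozen
    low-level policy supplies motion-prior actions; a goal-conditioned
    high-level policy adds small residual corrections from perception."""

    def __init__(self, low_level: nn.Module, obs_dim: int, goal_dim: int,
                 act_dim: int, residual_scale: float = 0.2):
        super().__init__()
        self.low_level = low_level.eval()        # pretrained motion prior, frozen
        for p in self.low_level.parameters():
            p.requires_grad_(False)
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 256), nn.ELU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        self.residual_scale = residual_scale     # keeps corrections small

    def forward(self, obs, goal):
        with torch.no_grad():
            prior_action = self.low_level(obs)   # animal-like base motion
        correction = self.residual(torch.cat([obs, goal], dim=-1))
        return prior_action + self.residual_scale * correction
```

Bounding the residual with a small scale is one simple way to preserve the motion prior's locomotion style while adapting to terrain.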
https://arxiv.org/abs/2505.16084
We propose LAGO (Language Similarity-Aware Graph Optimization), a novel approach for few-shot cross-lingual embedding inversion attacks, addressing critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work on embedding inversion attacks that treats languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. By integrating syntactic and lexical similarity as edge constraints, our method enables collaborative parameter learning across related languages. Theoretically, we show this formulation generalizes prior approaches, such as ALGEN, which emerges as a special case when the similarity constraints are relaxed. Our framework uniquely combines Frobenius-norm regularization with linear inequality or total variation constraints, ensuring robust alignment of cross-lingual embedding spaces even with extremely limited data (as few as 10 samples per language). Extensive experiments across multiple languages and embedding models demonstrate that LAGO substantially improves the transferability of attacks, with a 10-20% increase in Rouge-L score over baselines. This work establishes language similarity as a critical factor in inversion-attack transferability, urging renewed focus on language-aware, privacy-preserving multilingual embeddings.
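To make the graph-based coupling concrete, here is a simplified sketch of a LAGO-style objective, assuming per-language linear alignment maps and similarity-weighted edge penalties (the paper's actual constraints -- linear inequality or total variation -- are simplified here to a quadratic coupling):

```python
import torch

def lago_style_objective(W, X, Y, edges, lam=1e-2, mu=1e-1):
    """Hedged sketch of a graph-regularized alignment objective in the spirit
    of LAGO (not the paper's exact formulation). W: dict lang -> (d, d) map;
    X, Y: dicts of few-shot source/target embeddings per language;
    edges: list of (lang_a, lang_b, similarity) soft coupling constraints."""
    loss = 0.0
    for lang in W:
        # Few-shot alignment fit plus Frobenius-norm regularization
        loss = loss + torch.norm(X[lang] @ W[lang] - Y[lang]) ** 2
        loss = loss + lam * torch.norm(W[lang]) ** 2
    for a, b, sim in edges:
        # Similar languages are encouraged to learn similar parameters
        loss = loss + mu * sim * torch.norm(W[a] - W[b]) ** 2
    return loss
```

Relaxing the coupling term (mu -> 0) decouples the languages, which is how an ALGEN-like independent-per-language solution would fall out as a special case.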
https://arxiv.org/abs/2505.16008
Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution, current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting the expectation of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pre-trained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency, and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.
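A compact sketch of what a VARD-like per-step training objective could look like, assuming access to per-sample log-probabilities of the fine-tuned and pre-trained models (the function signature and KL estimator are illustrative assumptions, not the authors' code):

```python
import torch

def vard_style_step_loss(value_net, x_t, t, log_p_ft, log_p_pre, beta=0.1):
    """Assumed form of a VARD-like per-step objective: a learned value net
    V(x_t, t) predicts the expected final reward from an intermediate
    diffusion state, giving a dense signal even for non-differentiable
    rewards, while a KL term keeps the model near the pretrained one."""
    dense_reward = value_net(x_t, t)        # (B,) expected reward per sample
    kl = log_p_ft - log_p_pre               # per-sample KL estimate
    # Maximize value, penalize divergence from the pretrained model
    return (-dense_reward + beta * kl).mean()
```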
https://arxiv.org/abs/2505.15791
Uncertainty estimation remains a critical challenge in adapting pre-trained language models to classification tasks, particularly under parameter-efficient fine-tuning approaches such as adapters. We introduce AdUE1, an efficient post-hoc uncertainty estimation (UE) method, to enhance softmax-based estimates. Our approach (1) uses a differentiable approximation of the maximum function and (2) applies additional regularization through L2-SP, anchoring the fine-tuned head weights to their initial values. Evaluations on five NLP classification datasets across four language models (RoBERTa, ELECTRA, LLaMA-2, Qwen) demonstrate that our method consistently outperforms established baselines such as Mahalanobis distance and softmax response. Our approach is lightweight (no base-model changes) and produces better-calibrated confidence.
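Both ingredients are easy to state concretely; the sketch below assumes a temperature-scaled log-sum-exp as the smooth maximum and a named-parameter anchor dictionary for L2-SP (the paper's exact approximation and hyperparameters may differ):

```python
import torch

def smooth_max_confidence(logits: torch.Tensor, temperature: float = 0.1):
    """Differentiable surrogate for the max softmax probability via a
    temperature-scaled log-sum-exp (one common smooth-max choice); it
    approaches the true max as temperature -> 0."""
    probs = torch.softmax(logits, dim=-1)
    return temperature * torch.logsumexp(probs / temperature, dim=-1)

def l2_sp_penalty(head: torch.nn.Module, anchor: dict, alpha: float = 1e-3):
    """L2-SP: anchor fine-tuned head weights to their starting point
    (`anchor` maps parameter names to snapshots of the initial values)."""
    reg = 0.0
    for name, p in head.named_parameters():
        reg = reg + (p - anchor[name]).pow(2).sum()
    return alpha * reg
```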
https://arxiv.org/abs/2505.15443
Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography (MEG) signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) than during the passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited, high-dimensional MEG datasets. In addition, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed most substantially to decoding accuracy, suggesting that these bands encode critical speech-production-related neural processes. Despite the use of advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or whether residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
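For reference, pairwise phone decoding with an elastic-net-regularized linear classifier can be set up as below (a scikit-learn stand-in under assumed data shapes, not the study's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pairwise_phone_decoder(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X: (n_trials, n_features) flattened MEG features for one phone pair;
    y: binary labels. Elastic-net logistic regression combines L1 and L2
    penalties, which suits limited, high-dimensional MEG data."""
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000),
    )
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy")
```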
https://arxiv.org/abs/2505.15355
Recently, continuous representation methods have emerged as a novel paradigm that characterizes the intrinsic structures of real-world data through function representations mapping positional coordinates to their corresponding values in continuous space. Compared with the traditional discrete framework, the continuous framework demonstrates clear superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion) by offering advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) theoretical foundations of continuous representations such as approximation error analysis, convergence properties, and implicit regularization; and (iii) real-world applications of continuous representations in computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire exploration and deepen insights into continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: this https URL.
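As a concrete instance of one design family the review covers, a minimal implicit neural representation maps continuous coordinates to signal values, which is exactly what yields resolution flexibility (a generic sketch, not tied to any single surveyed paper):

```python
import torch
import torch.nn as nn

class ImplicitImage(nn.Module):
    """Minimal implicit neural representation: an MLP mapping (x, y)
    coordinates to RGB values; the trained network can be queried at any
    resolution because the representation is a continuous function."""

    def __init__(self, hidden: int = 256, layers: int = 4):
        super().__init__()
        blocks, dim = [], 2
        for _ in range(layers):
            blocks += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        blocks += [nn.Linear(dim, 3)]
        self.net = nn.Sequential(*blocks)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) in [-1, 1]^2 -> (N, 3) RGB values
        return self.net(coords)

# Fitting: sample pixel coordinates/colors from an image and regress with
# MSE; querying a denser coordinate grid then upsamples "for free".
```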
https://arxiv.org/abs/2505.15222
Domain adaptation remains a challenge when there is significant manifold discrepancy between source and target domains. Although recent methods leverage manifold-aware adversarial perturbations to perform data augmentation, they often neglect precise manifold alignment and systematic exploration of structured perturbations. To address this, we propose GAMA (Geometry-Aware Manifold Alignment), a structured framework that achieves explicit manifold alignment via adversarial perturbation guided by geometric information. GAMA systematically employs tangent space exploration and manifold-constrained adversarial optimization, simultaneously enhancing semantic consistency, robustness to off-manifold deviations, and cross-domain alignment. Theoretical analysis shows that GAMA tightens the generalization bound via structured regularization and explicit alignment. Empirical results on DomainNet, VisDA, and Office-Home demonstrate that GAMA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, exhibiting superior robustness, generalization, and manifold alignment capability.
https://arxiv.org/abs/2505.15194
Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.
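The on/off-manifold split can be sketched as a projection onto a local tangent basis, with the residual treated as the off-manifold part (how the basis is estimated -- e.g., local PCA or a generative model's decoder Jacobian -- is an assumption here, not the paper's stated procedure):

```python
import torch

def decompose_perturbation(delta: torch.Tensor, tangent_basis: torch.Tensor):
    """Illustrative decomposition (assumed form): split an adversarial
    perturbation into an on-manifold component (projection onto a local
    tangent basis around the sample) and the off-manifold remainder."""
    # tangent_basis: (k, d) orthonormal rows spanning the local tangent space
    coords = delta @ tangent_basis.T          # (..., k) tangent coordinates
    on_manifold = coords @ tangent_basis      # projection back into R^d
    off_manifold = delta - on_manifold
    return on_manifold, off_manifold
```

Consistency losses would then be applied to the on-manifold part (semantic variation) and smoothing losses to the off-manifold part (decision-boundary brittleness).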
https://arxiv.org/abs/2505.15191
Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity among terms - only within a given query or document. Terms with very high Document Frequencies (DFs) substantially increase latency in production retrieval engines, such as Apache Solr, due to their lengthy posting lists. To address the issue of high DFs, we present a new variant of FLOPS regularization: DF-FLOPS. This new regularization technique penalizes the usage of high-DF terms, thereby shortening posting lists and reducing retrieval latency. Unlike other inference-time sparsification methods, such as stopword removal, DF-FLOPS regularization allows for the selective inclusion of high-frequency terms in cases where the terms are truly salient. We find that DF-FLOPS successfully reduces the prevalence of high-DF terms and lowers retrieval latency (around 10x faster) in a production-grade engine while maintaining effectiveness both in-domain (only a 2.2-point drop in MRR@10) and cross-domain (improved performance in 12 out of 13 tasks on which we tested). With retrieval latencies on par with BM25, this work provides an important step towards making LSR practical for deployment in production-grade search engines.
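The difference between the two regularizers is small but consequential; below is the standard FLOPS term alongside one plausible DF-weighted variant (the exact DF weighting used in the paper may differ):

```python
import torch

def flops_reg(batch_weights: torch.Tensor) -> torch.Tensor:
    """Standard SPLADE FLOPS regularizer: penalize the squared mean
    activation of each vocabulary term across the batch, encouraging
    sparsity within each encoded query/document vector."""
    # batch_weights: (batch, vocab) non-negative term weights
    return (batch_weights.mean(dim=0) ** 2).sum()

def df_flops_reg(batch_weights: torch.Tensor,
                 doc_freq: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of DF-FLOPS (assumed weighting scheme, not necessarily
    the paper's exact formula): scale each term's penalty by its corpus
    document frequency, so high-DF terms -- which create long posting
    lists -- are kept only when truly salient."""
    df_weight = doc_freq / doc_freq.max()     # normalize DFs to [0, 1]
    return (df_weight * batch_weights.mean(dim=0) ** 2).sum()
```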
https://arxiv.org/abs/2505.15070
Many practical medical imaging scenarios include categories that are under-represented but still crucial. The relevance of image recognition models to real-world applications lies in their ability to generalize to these rare classes as well as to unseen classes. Real-world generalization requires taking into account the various complexities that can be encountered in practice. First, training data is highly imbalanced, which may lead to models exhibiting bias toward the more frequently represented classes. Moreover, real-world data may contain unseen classes that need to be identified, and model performance is affected by data scarcity. While medical image recognition has been extensively addressed in the literature, current methods do not take into account all the intricacies of real-world scenarios. To this end, we propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach. Recognizing the adverse impact of the long-tailed distribution on inherent model characteristics, we implement a regularization strategy at the feature level, complemented by a classifier normalization technique. We conduct extensive experiments on the publicly available ISIC2018, ISIC2019, and TissueMNIST datasets with various numbers of labelled samples. Our analysis shows that addressing the impact of long-tailed data in classification significantly improves the overall performance of the network in terms of closed-set and open-set accuracies on all datasets. Our code and trained models will be made publicly available at this https URL.
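One common form of classifier normalization in long-tailed recognition, shown here as an assumed stand-in for the technique referenced above, replaces raw logits with scaled cosine similarities so that head classes cannot dominate simply through larger weight norms:

```python
import torch
import torch.nn.functional as F

def normalized_classifier_logits(features: torch.Tensor,
                                 weight: torch.Tensor,
                                 scale: float = 16.0) -> torch.Tensor:
    """Cosine-classifier sketch (the paper's exact variant may differ):
    L2-normalizing class weight vectors removes the norm bias that
    imbalanced training induces toward frequent classes."""
    feats = F.normalize(features, dim=-1)   # (B, d) unit-norm features
    w = F.normalize(weight, dim=-1)         # (C, d) unit-norm class weights
    return scale * feats @ w.T              # (B, C) cosine logits
```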
https://arxiv.org/abs/2505.14846
Due to low spatial resolution, hyperspectral data often consists of mixtures of contributions from multiple materials. This limitation motivates the task of hyperspectral unmixing (HU), a fundamental problem in hyperspectral imaging. HU aims to identify the spectral signatures (\textit{endmembers}) of the materials present in an observed scene, along with their relative proportions (\textit{fractional abundance}) in each pixel. A major challenge lies in the class variability of materials, which hinders accurate representation by a single spectral signature, as assumed in the conventional linear mixing model. To address this issue, we propose using group sparsity after representing each material with a set of spectral signatures, known as endmember bundles, where each group corresponds to a specific material. In particular, we develop a bundle-based framework that can enforce either inter-group sparsity or sparsity within and across groups (SWAG) on the abundance coefficients. Furthermore, our framework offers the flexibility to incorporate a variety of sparsity-promoting penalties, among which the transformed $\ell_1$ (TL1) penalty is a novel regularization in the HU literature. Extensive experiments conducted on both synthetic and real hyperspectral data demonstrate the effectiveness and superiority of the proposed approaches.
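The TL1 penalty has a closed form, $\rho_a(x) = (a+1)|x|/(a+|x|)$, and the SWAG idea composes a within-group penalty with a group-level one; the sketch below is an assumed instantiation of that composition, not the paper's exact optimization model:

```python
import torch

def tl1_penalty(x: torch.Tensor, a: float = 1.0) -> torch.Tensor:
    """Transformed L1 penalty rho_a(x) = (a+1)|x| / (a+|x|), elementwise and
    summed; it interpolates between L0-like and L1-like behavior as `a`
    varies."""
    ax = x.abs()
    return ((a + 1.0) * ax / (a + ax)).sum()

def swag_penalty(abundances: torch.Tensor, groups: list,
                 lam_between: float = 1.0, lam_within: float = 0.1):
    """Hedged sketch of sparsity within and across groups (SWAG) on the
    abundance coefficients: an L2-of-groups term promotes few active
    materials; a TL1 term inside each group promotes few active bundle
    members. `groups` holds one index tensor per endmember bundle."""
    loss = 0.0
    for g in groups:
        block = abundances[g]
        loss = loss + lam_between * torch.norm(block)  # across-group sparsity
        loss = loss + lam_within * tl1_penalty(block)  # within-group sparsity
    return loss
```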
https://arxiv.org/abs/2505.14634
Inverse scattering is a fundamental challenge in many imaging applications, ranging from microscopy to remote sensing. Solving this problem often requires jointly estimating two unknowns -- the image and the scattering field inside the object -- necessitating an effective image prior to regularize the inference. In this paper, we propose a regularized neural field (NF) approach that integrates the denoising score function used in score-based generative models. The neural field formulation offers convenient flexibility for performing joint estimation, while the denoising score function imposes the rich structural prior of images. Our results on three high-contrast simulated objects show that the proposed approach yields better imaging quality than the state-of-the-art NF approach, in which regularization is based on total variation.
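A conceptual sketch of the joint objective, assuming a RED-style use of a pretrained denoiser as the score-based prior (the authors' exact coupling of the NF, forward model, and score function may differ; `forward_op` and `denoiser` are hypothetical callables):

```python
import torch

def regularized_nf_loss(neural_field, forward_op, measurements, denoiser,
                        coords, sigma=0.05, lam=0.1):
    """Conceptual sketch: data fidelity through the scattering forward model
    plus a denoising-score regularizer on the rendered image. A pretrained
    denoiser's residual (image - denoised) approximates -sigma^2 times the
    score, so penalizing it pulls the estimate toward the image prior."""
    image = neural_field(coords)                       # current NF estimate
    data_loss = torch.norm(forward_op(image) - measurements) ** 2
    with torch.no_grad():
        denoised = denoiser(image, sigma)              # frozen prior network
    prior_loss = torch.norm(image - denoised) ** 2
    return data_loss + lam * prior_loss
```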
https://arxiv.org/abs/2505.14560
Semi-supervised learning (SSL) has achieved significant progress in medical image segmentation through effective utilization of limited labeled data. While current SSL methods for medical images predominantly rely on consistency regularization and pseudo-labeling, they often overlook transferable semantic relationships across different clinical domains and imaging modalities. To address this, we propose TransMedSeg, a novel transferable semantic framework for semi-supervised medical image segmentation (SSMIS). Our approach introduces a Transferable Semantic Augmentation (TSA) module, which implicitly enhances feature representations by aligning domain-invariant semantics through cross-domain distribution matching and intra-domain structural preservation. Specifically, TransMedSeg constructs a unified feature space in which teacher network features are adaptively augmented towards student network semantics via a lightweight memory module, enabling implicit semantic transformation without explicit data generation. Interestingly, this augmentation is implicitly realized through an expected transferable cross-entropy loss computed over the augmented teacher distribution. An upper bound of the expected loss is theoretically derived and minimized during training, incurring negligible computational overhead. Extensive experiments on medical image datasets demonstrate that TransMedSeg outperforms existing semi-supervised methods, establishing a new direction for transferable representation learning in medical image analysis.
https://arxiv.org/abs/2505.14753
Developing robust speaker verification (SV) systems without speaker labels has been a longstanding challenge. Earlier research has highlighted a considerable performance gap between self-supervised and fully supervised approaches. In this paper, we enhance the non-contrastive self-supervised framework Self-Distillation Prototypes Network (SDPN) by introducing dimension regularization, which explicitly addresses the collapse problem by applying regularization terms to speaker embeddings. Moreover, we integrate score normalization techniques from fully supervised SV to further bridge the gap toward supervised verification performance. SDPN with dimension regularization and score normalization sets a new state of the art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rates of 1.29%, 1.60%, and 2.80% on the VoxCeleb1-{O,E,H} trials, respectively. These results represent relative improvements of 28.3%, 19.6%, and 22.6% over the best current self-supervised methods, thereby advancing the frontiers of SV technology.
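Score normalization on the supervised side typically means adaptive s-norm; a standard implementation is sketched below (the specific variant and cohort size used in the paper are assumptions):

```python
import torch

def adaptive_s_norm(score: torch.Tensor, enroll_cohort: torch.Tensor,
                    test_cohort: torch.Tensor, top_k: int = 300):
    """Adaptive score normalization (AS-Norm) as commonly used in supervised
    SV: z-normalize a trial score against the top-k most similar cohort
    scores from the enrollment and test sides, then average. Assumes both
    cohort tensors are 1-D with at least top_k entries."""
    e = enroll_cohort.topk(top_k).values   # enrollment-vs-cohort scores
    t = test_cohort.topk(top_k).values     # test-vs-cohort scores
    zn = (score - e.mean()) / (e.std() + 1e-8)
    tn = (score - t.mean()) / (t.std() + 1e-8)
    return 0.5 * (zn + tn)
```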
https://arxiv.org/abs/2505.13826
As AI systems become more capable, deceptive behaviors can undermine evaluation and mislead users at deployment. Recent work has shown that lie detectors can accurately classify deceptive behavior, but they are not typically used in the training pipeline due to concerns around contamination and objective hacking. We examine these concerns by incorporating a lie detector into the labelling step of LLM post-training and evaluating whether the learned policy is genuinely more honest, or instead learns to fool the lie detector while remaining deceptive. Using DolusChat, a novel 65k-example dataset with paired truthful/deceptive responses, we identify three key factors that determine the honesty of learned policies: amount of exploration during preference learning, lie detector accuracy, and KL regularization strength. We find that preference learning with lie detectors and GRPO can lead to policies which evade lie detectors, with deception rates of over 85%. However, if the lie detector true positive rate (TPR) or KL regularization is sufficiently high, GRPO learns honest policies. In contrast, off-policy algorithms (DPO) consistently lead to deception rates under 25% for realistic TPRs. Our results illustrate a more complex picture than previously assumed: depending on the context, lie-detector-enhanced training can be a powerful tool for scalable oversight, or a counterproductive method encouraging undetectable misalignment.
https://arxiv.org/abs/2505.13787
State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.
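For reference, the two-part MDL criterion invoked here trades model code length against data fit (standard formulation, not specific to this paper):

```latex
% Two-part MDL: select the hypothesis minimizing total description length.
\hat{H} \;=\; \arg\min_{H}\;\Big[\underbrace{L(H)}_{\text{model code length}}
\;+\; \underbrace{L(D \mid H)}_{\text{data code length given } H}\Big]
```

Unlike an $L_1$ or $L_2$ weight norm, this objective assigns a symbolic, perfect solution a low total cost whenever its description is short and it compresses the data exactly, which is the inductive bias the abstract argues for.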
https://arxiv.org/abs/2505.13398
We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
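The central operation, shifting RoPE phase angles by flow offsets, can be sketched in a few lines (a simplified 1-D version under assumed tensor shapes; the actual method warps 2-D video RoPE and further optimizes the embeddings during denoising):

```python
import torch

def warp_rope_phases(positions: torch.Tensor, flow: torch.Tensor,
                     inv_freq: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of the core RoPE-warping idea (an assumed
    simplification of RoPECraft): shift each token's positional phase by
    the motion offset from dense optical flow, so the complex exponentials
    encode the reference motion in the generation process."""
    # positions: (N,) token grid positions; flow: (N,) per-position offsets;
    # inv_freq: (d/2,) standard RoPE inverse frequencies
    warped = positions + flow                            # motion-shifted positions
    angles = warped[:, None] * inv_freq[None, :]         # (N, d/2) phase angles
    return torch.polar(torch.ones_like(angles), angles)  # complex e^{i*angle}
```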
https://arxiv.org/abs/2505.13344
Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily deteriorate when these models are fine-tuned on other downstream tasks (e.g., of smaller scale). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these labels. We leverage recent language models to obtain such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding features. Our experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group-shift scenario, where other robust fine-tuning baselines show even degraded performance.
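The regularizer plausibly reduces to a KL alignment between the fine-tuned model and the frozen zero-shot model on the spuriosity-injected label set; the sketch below assumes that form (temperature and reduction are illustrative choices):

```python
import torch
import torch.nn.functional as F

def spuriosity_alignment_loss(student_logits: torch.Tensor,
                              zeroshot_logits: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    """Assumed KL-based form of the alignment regularizer: on
    spuriosity-injected labels (e.g., texts describing background or
    texture), keep the fine-tuned model's output distribution close to the
    frozen zero-shot model's, so no extra spurious features are extracted."""
    p_teacher = F.softmax(zeroshot_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```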
https://arxiv.org/abs/2505.13232
The 1st SpeechWellness Challenge highlights the need for speech-based suicide risk assessment in adolescents. This study investigates a multimodal approach to this challenge, integrating automatic transcription with WhisperX, linguistic embeddings from Chinese RoBERTa, and audio embeddings from WavLM. Additionally, handcrafted acoustic features -- including MFCCs, spectral contrast, and pitch-related statistics -- were incorporated. We explored three fusion strategies: early concatenation, modality-specific processing, and weighted attention with mixup regularization. Results show that weighted attention provided the best generalization, achieving 69% accuracy on the development set, though a performance gap between the development and test sets highlights generalization challenges. Our findings, strictly tied to the MINI-KID framework, emphasize the importance of refining embedding representations and fusion mechanisms to enhance classification reliability.
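A sketch of the weighted-attention fusion with mixup, under assumed forms (learned scalar gates per modality and standard feature-level mixup; the challenge system's exact architecture may differ):

```python
import torch
import torch.nn as nn

class WeightedAttentionFusion(nn.Module):
    """Assumed fusion form: project each modality (text, audio, handcrafted
    acoustics) to a shared space, score it with a learned gate, and take a
    softmax-weighted sum across modalities."""

    def __init__(self, dims: dict, hidden: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden)
                                   for m, d in dims.items()})
        self.gate = nn.Linear(hidden, 1)

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = torch.stack([self.proj[m](x) for m, x in inputs.items()],
                            dim=1)                                 # (B, M, H)
        weights = torch.softmax(self.gate(torch.tanh(feats)), dim=1)  # (B, M, 1)
        return (weights * feats).sum(dim=1)                        # (B, H)

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Standard mixup regularization on fused features; y: (B, C) one-hot
    or soft labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```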
https://arxiv.org/abs/2505.13069