Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
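To make the early temporal feature difference concrete, here is a minimal PyTorch sketch (illustrative only, not the released code; the feature and motion dimensions and all module names are assumptions):

```python
# Illustrative sketch: an early feature-difference motion encoder with a
# low-dimensional continuous bottleneck, as described in the abstract.
import torch
import torch.nn as nn

class ContinuousMotionEncoder(nn.Module):
    def __init__(self, feat_dim=768, motion_dim=32):
        super().__init__()
        # A small bottleneck dim trades appearance noise for action-relevant info.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, motion_dim)
        )

    def forward(self, feat_t, feat_tp1):
        # Early temporal difference: subtract frame features *before* deep
        # processing, suppressing static appearance shared by both frames.
        delta = feat_tp1 - feat_t
        return self.head(delta)  # continuous motion embedding, no quantization

enc = ContinuousMotionEncoder()
f_t, f_tp1 = torch.randn(2, 768), torch.randn(2, 768)
print(enc(f_t, f_tp1).shape)  # torch.Size([2, 32])
```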
https://arxiv.org/abs/2505.17006
We study the task of learning the association between faces and voices, which has been gaining interest in the multimodal community lately. Existing methods suffer from the deliberate crafting of negative mining procedures as well as the reliance on a distance margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and must be aligned before being fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
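A hedged sketch of aligned, gated fusion with an orthogonality constraint might look as follows (the layer sizes, gating form, and loss are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.align_face = nn.Linear(dim, dim)   # align the face space ...
        self.align_voice = nn.Linear(dim, dim)  # ... and the voice space first
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, face, voice):
        f, v = self.align_face(face), self.align_voice(voice)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=-1)))
        return g * f + (1 - g) * v              # gated blend of aligned embeddings

def orthogonality_loss(fused, labels):
    # Push fused embeddings of different identities toward orthogonality.
    z = F.normalize(fused, dim=-1)
    sim = z @ z.t()
    different = (labels[:, None] != labels[None, :]).float()
    return (sim * different).abs().mean()

fuse = GatedFusion()
fused = fuse(torch.randn(4, 512), torch.randn(4, 512))
print(orthogonality_loss(fused, torch.tensor([0, 0, 1, 2])))
```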
https://arxiv.org/abs/2505.17002
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to extremely compress multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
https://arxiv.org/abs/2505.16956
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and offers better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: this https URL.
https://arxiv.org/abs/2505.16933
Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. A recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate empirical effectiveness of CondKGCP by comprehensive evaluations.
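For readers unfamiliar with the underlying machinery, a minimal split-conformal sketch for link prediction looks like this (synthetic scores; CondKGCP's predicate merging and rank augmentation are not shown):

```python
# Split conformal prediction: calibrate a score threshold so prediction sets
# cover the true answer with probability >= 1 - alpha on average.
import numpy as np

def calibrate(cal_scores_true, alpha=0.1):
    # Nonconformity = negative score of the true answer; finite-sample quantile.
    n = len(cal_scores_true)
    return np.quantile(-cal_scores_true, np.ceil((n + 1) * (1 - alpha)) / n)

def prediction_set(scores_all_entities, q):
    # Keep every candidate entity whose nonconformity is below the threshold.
    return np.where(-scores_all_entities <= q)[0]

rng = np.random.default_rng(0)
cal = rng.normal(2.0, 1.0, 500)             # scores of true answers, calibration set
q = calibrate(cal, alpha=0.1)
test_scores = rng.normal(0.0, 1.0, 10_000)  # scores over all candidate entities
print(len(prediction_set(test_scores, q)), "candidates in the set")
```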
https://arxiv.org/abs/2505.16877
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings, extracted from clean and noisy speech respectively, via the forward process of a diffusion model, and then reconstructs them into clean embeddings in the reverse process. During inference, all embeddings are regenerated via the diffusion process. Our method needs neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance in conventional scenarios. We publish our code at this https URL.
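A minimal sketch of the training step, assuming a standard DDPM forward process and a small MLP denoiser (the 192-dim embeddings, noise schedule, and network are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(192 + 1, 256), nn.SiLU(), nn.Linear(256, 192))

def training_step(emb_in, emb_clean):
    # emb_in: a clean OR noisy-speech embedding; the target is always clean.
    t = torch.randint(0, T, (emb_in.size(0),))
    a = alphas_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * emb_in + (1 - a).sqrt() * torch.randn_like(emb_in)  # forward
    pred = denoiser(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))
    return ((pred - emb_clean) ** 2).mean()  # reconstruct the clean embedding

loss = training_step(torch.randn(8, 192), torch.randn(8, 192))
print(loss.item())
```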
https://arxiv.org/abs/2505.16798
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs a one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration count, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL.
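The two-phase schedule reduces to a few lines; the sketch below uses stand-in loss terms, a placeholder weight, and a fixed-iteration trigger (all assumptions):

```python
# Stand-in sketch of HASTE's schedule (not the released code): the holistic
# alignment loss is applied in Phase I and deactivated once, for good, when a
# fixed-iteration trigger fires.
def haste_loss(denoise_loss, attn_align_loss, feat_align_loss,
               step, trigger_step=100_000, lam=0.5):
    if step < trigger_step:  # Phase I: distill attention maps + feature projections
        return denoise_loss + lam * (attn_align_loss + feat_align_loss)
    return denoise_loss      # Phase II: one-shot termination; pure denoising

print(haste_loss(1.0, 0.3, 0.2, step=10))       # 1.25 (alignment active)
print(haste_loss(1.0, 0.3, 0.2, step=200_000))  # 1.0  (alignment terminated)
```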
https://arxiv.org/abs/2505.16792
For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct comprehensive comparisons across various DML objectives. Experiments on the Wall Street Journal (WSJ) and LibriPhrase datasets demonstrate the effectiveness of the proposed approach.
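Adversarial training of a modality classifier is commonly implemented with a gradient reversal layer; a minimal sketch under that assumption (the dimensions and classifier are illustrative):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip gradients flowing into the encoders

modality_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

def mal_loss(audio_emb, text_emb, lam=1.0):
    # The classifier learns to tell modalities apart; reversed gradients push
    # both encoders toward modality-invariant embeddings.
    z = GradReverse.apply(torch.cat([audio_emb, text_emb], dim=0), lam)
    labels = torch.cat([torch.zeros(len(audio_emb)),
                        torch.ones(len(text_emb))]).long()
    return nn.functional.cross_entropy(modality_clf(z), labels)

print(mal_loss(torch.randn(4, 256), torch.randn(4, 256)))
```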
https://arxiv.org/abs/2505.16735
Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally, computational resources are typically limited in practical applications, which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.
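A hedged sketch of the flexible embedding with condition masking (the mask-token scheme, a fixed p_mask instead of a full sparsity schedule, and all sizes are assumptions):

```python
import torch
import torch.nn as nn

class MixedConditionEmbedding(nn.Module):
    def __init__(self, n_categories, dim=64):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories + 1, dim)  # last id = [MASK]
        self.num_proj = nn.Linear(1, dim)
        self.num_mask = nn.Parameter(torch.zeros(dim))      # stand-in for masked numbers
        self.mask_id = n_categories

    def forward(self, cat_idx, num_val, p_mask=0.5):
        # p_mask would follow a sparsity schedule over training in the paper.
        drop_c = torch.rand(cat_idx.shape) < p_mask
        drop_n = torch.rand(num_val.shape) < p_mask
        c = self.cat_emb(torch.where(
            drop_c, torch.full_like(cat_idx, self.mask_id), cat_idx))
        n = self.num_proj(num_val.unsqueeze(-1))
        n = torch.where(drop_n.unsqueeze(-1), self.num_mask.expand_as(n), n)
        return c + n  # joint condition embedding fed to the generative model

emb = MixedConditionEmbedding(n_categories=10)
print(emb(torch.randint(0, 10, (4,)), torch.randn(4)).shape)  # torch.Size([4, 64])
```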
https://arxiv.org/abs/2505.16725
Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse--effectively forcing the model to "unlearn everything"--specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that attackers exploit, tackling the core issue unaddressed by selective unlearning. We introduce the Collapse Trap (CTRAP) as a practical mechanism to implement this concept conditionally. Embedded during alignment, CTRAP pre-configures the model's reaction to subsequent fine-tuning dynamics. If updates during fine-tuning constitute a persistent attempt to reverse safety alignment, the pre-configured trap triggers a progressive degradation of the model's core language modeling abilities, ultimately rendering it inert and useless for the attacker. Crucially, this collapse mechanism remains dormant during benign fine-tuning, ensuring the model's utility and general capabilities are preserved for legitimate users. Extensive empirical results demonstrate that CTRAP effectively counters harmful fine-tuning risks across various LLMs and attack settings, while maintaining high performance in benign scenarios. Our code is available at this https URL.
https://arxiv.org/abs/2505.16559
Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
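The compression step can be sketched as mean-merging of consecutive token embeddings with a randomly sampled factor (mean-merging is an assumption; the paper's latent head and RL stage are omitted):

```python
import torch

def compress_embeddings(h, r):
    """h: (batch, seq, dim); r: compression factor sampled during SFT."""
    b, s, d = h.shape
    pad = (-s) % r  # right-pad so the sequence length divides by r
    if pad:
        h = torch.cat([h, h.new_zeros(b, pad, d)], dim=1)
    # Merge each group of r consecutive token embeddings into one latent step.
    return h.view(b, -1, r, d).mean(dim=2)  # (batch, ceil(s/r), dim)

h = torch.randn(2, 10, 16)
r = int(torch.randint(2, 6, (1,)))  # factor sampled from a predefined range
print(compress_embeddings(h, r).shape)
```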
https://arxiv.org/abs/2505.16552
We propose a novel Model Predictive Control (MPC) framework for a jet-powered flying humanoid robot. The controller is based on a linearised centroidal momentum model to represent the flight dynamics, augmented with a second-order nonlinear model to explicitly account for the slow and nonlinear dynamics of jet propulsion. A key contribution is the introduction of a multi-rate MPC formulation that handles the different actuation rates of the robot's joints and jet engines while embedding the jet dynamics directly into the predictive model. We validated the framework using the jet-powered humanoid robot iRonCub, performing simulations in MuJoCo; the simulation results demonstrate the robot's ability to recover from external disturbances and perform stable, non-abrupt flight manoeuvres, validating the effectiveness of the proposed approach.
https://arxiv.org/abs/2505.16478
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at this https URL.
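The geometry alone can be sketched in a few lines: text positions advance along a line while image positions sit on an orthogonal circle, so every text token is equidistant from every image token (coordinates only, with assumed radius and axis layout; the rotary encoding applied on top is omitted):

```python
import numpy as np

def text_positions(n):
    p = np.zeros((n, 3))
    p[:, 0] = np.arange(n)                 # linear path along the text axis
    return p

def image_positions(n, radius=1.0, center_t=0.0):
    theta = 2 * np.pi * np.arange(n) / n   # image indices mapped onto a circle
    p = np.zeros((n, 3))
    p[:, 0] = center_t                     # plane orthogonal to the text axis
    p[:, 1] = radius * np.cos(theta)
    p[:, 2] = radius * np.sin(theta)
    return p

txt, img = text_positions(4), image_positions(6)
# Every text token is the same distance from all image tokens:
print(np.linalg.norm(txt[2] - img, axis=-1))   # six identical values
```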
https://arxiv.org/abs/2505.16416
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in multi-step reasoning and calling search engines at appropriate steps. However, existing retrieval-augmented reasoning approaches rely on separate retrieval models, limiting the LRM's role in retrieval to deciding when to retrieve and how to query. This separation not only increases hardware and operational costs but also leads to errors in the retrieval process due to the representation bottleneck, a phenomenon where the retriever's embedding space is not expressive enough to meet the generator's requirements. To address this, we shift our perspective from sequence-to-sequence matching to locating the answer-containing paths within the corpus, and propose a novel framework called FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables LRMs to retrieve relevant knowledge on their own by acting as both a generator and retriever. To achieve this, we introduce a variant of the MCTS algorithm specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus toward answer-containing regions. Our results on five open-domain QA benchmarks, including single-hop and multi-hop questions, show that FREESON achieves an average improvement of 14.4% in EM and F1 over four multi-step reasoning models with a separate retriever, and it also performs comparably to the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
https://arxiv.org/abs/2505.16409
Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at this https URL.
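A hedged sketch of the motion-prompt idea, with a GRU trajectory encoder and a learned scalar fusion weight (all dimensions and the residual form are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.motion_enc = nn.GRU(4, dim, batch_first=True)  # (x, y, w, h) boxes
        self.weight = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, visual_feat, trajectory):
        # trajectory: (batch, T, 4) long-term motion; visual_feat: (batch, dim)
        _, h = self.motion_enc(trajectory)
        motion_feat = h.squeeze(0)
        # A per-sample scalar decides how much the motion cue contributes.
        w = self.weight(torch.cat([visual_feat, motion_feat], dim=-1))
        return visual_feat + w * motion_feat  # motion as a residual prompt

fuse = AdaptiveFusion()
print(fuse(torch.randn(2, 256), torch.randn(2, 8, 4)).shape)  # torch.Size([2, 256])
```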
https://arxiv.org/abs/2505.16321
Recently, prototype learning has emerged in semi-supervised medical image segmentation and achieved remarkable performance. However, the scarcity of labeled data limits the expressiveness of prototypes in previous methods, potentially hindering the complete representation of prototypes for class embedding. To overcome this issue, we propose an efficient prototype consistency learning via joint uncertainty quantification and data augmentation (EPCL-JUDA) to enhance the semantic expression of prototypes based on the framework of Mean-Teacher. The concatenation of original and augmented labeled data is fed into student network to generate expressive prototypes. Then, a joint uncertainty quantification method is devised to optimize pseudo-labels and generate reliable prototypes for original and augmented unlabeled data separately. High-quality global prototypes for each class are formed by fusing labeled and unlabeled prototypes, which are utilized to generate prototype-to-features to conduct consistency learning. Notably, a prototype network is proposed to reduce high memory requirements brought by the introduction of augmented data. Extensive experiments on Left Atrium, Pancreas-NIH, Type B Aortic Dissection datasets demonstrate EPCL-JUDA's superiority over previous state-of-the-art approaches, confirming the effectiveness of our framework. The code will be released soon.
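Prototype extraction by masked average pooling is the common building block behind such methods; a minimal sketch (not EPCL-JUDA's exact code; shapes are illustrative):

```python
import torch

def class_prototypes(features, labels, n_classes):
    """features: (B, C, *spatial); labels: (B, *spatial) integer class map."""
    b, c = features.shape[:2]
    f = features.reshape(b, c, -1)             # flatten the spatial dims
    y = labels.reshape(b, -1)
    protos = []
    for k in range(n_classes):
        m = (y == k).float().unsqueeze(1)      # (B, 1, N) mask for class k
        # Average features over every (pseudo-)labeled location of class k.
        protos.append((f * m).sum(dim=(0, 2)) / m.sum().clamp(min=1))
    return torch.stack(protos)                 # (n_classes, C)

feats = torch.randn(2, 16, 8, 8, 8)            # e.g. 3D medical volume features
labs = torch.randint(0, 3, (2, 8, 8, 8))
print(class_prototypes(feats, labs, 3).shape)  # torch.Size([3, 16])
```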
https://arxiv.org/abs/2505.16283
In this paper, we propose to compress human body video with interactive semantics, which can facilitate video coding to be interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle nonlinear dynamics and complex motion of human body signal into a series of configurable embeddings, which are controllably edited, compactly compressed, and efficiently transmitted. Moreover, the proposed decoder can evolve the mesh-based motion fields from these decoded semantics to realize the high-quality human body video reconstruction. Experimental results illustrate that the proposed framework can achieve promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre-/post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.
https://arxiv.org/abs/2505.16152
Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
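The scoring stage reduces to embedding the hypothesis and the LLM-generated reference and taking their cosine similarity; the sketch below assumes the sentence-transformers library and a MiniLM embedder, which are illustrative choices, not the paper's:

```python
# Scoring stage only: the reference is assumed to come from a decoder-only LLM.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

def qe_score(hypothesis: str, llm_reference: str) -> float:
    h, r = model.encode([hypothesis, llm_reference], convert_to_tensor=True)
    return util.cos_sim(h, r).item()  # semantic similarity as the quality score

print(qe_score("The cat sat on the mat.", "A cat is sitting on the mat."))
```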
https://arxiv.org/abs/2505.16129
Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic match between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and details needed for effective EBR model training, limiting the models' ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework leveraging synthetic data generated by Generative AI (GenAI) models, in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzes its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: sampled engagement or original data (e.g., "Click" and "Listing Interactions"), synthetic data, and a mixture of both engagement and synthetic data, to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach. Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or a mixture of original and synthetic data.
https://arxiv.org/abs/2505.16065
This article presents Platform Adaptive Locomotion (PAL), a unified control method for quadrupedal robots with different morphologies and dynamics. We leverage deep reinforcement learning to train a single locomotion policy on procedurally generated robots. The policy maps proprioceptive robot state information and base velocity commands into desired joint actuation targets, which are conditioned using a latent embedding of the temporally local system dynamics. We explore two conditioning strategies - one using a GRU-based dynamics encoder and another using a morphology-based property estimator - and show that morphology-aware conditioning outperforms temporal dynamics encoding regarding velocity task tracking for our hardware test on ANYmal C. Our results demonstrate that both approaches achieve robust zero-shot transfer across multiple unseen simulated quadrupeds. Furthermore, we demonstrate the need for careful robot reference modelling during training, enabling us to reduce the velocity tracking error by up to 30% compared to the baseline method. Despite PAL not surpassing the best-performing reference-free controller in all cases, our analysis uncovers critical design choices and informs improvements to the state of the art.
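A hedged sketch of the GRU-based dynamics-encoder variant (observation/action sizes and concatenation-based conditioning are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=48, act_dim=12, latent_dim=16):
        super().__init__()
        # A recurrent encoder compresses the recent state-action history into
        # a latent describing the temporally local system dynamics.
        self.dyn_enc = nn.GRU(obs_dim + act_dim, latent_dim, batch_first=True)
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 128), nn.ELU(), nn.Linear(128, act_dim)
        )

    def forward(self, obs, history):
        # history: (batch, T, obs_dim + act_dim) recent transitions
        _, z = self.dyn_enc(history)
        return self.policy(torch.cat([obs, z.squeeze(0)], dim=-1))

pi = ConditionedPolicy()
print(pi(torch.randn(2, 48), torch.randn(2, 10, 60)).shape)  # torch.Size([2, 12])
```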
https://arxiv.org/abs/2505.16042