Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture, comprising two UNets for implicit garment deformation and synthesized image generation respectively, has emerged as the recipe for the VTON task. Nevertheless, it remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way of putting clothing on a human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of the semantic point matching. Extensive experiments demonstrate the strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: this https URL.
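To make the point-focused objective concrete, here is a minimal sketch of how matched semantic points might re-weight a standard denoising loss; the mask representation and the weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def point_focused_diffusion_loss(eps_pred, eps_true, point_mask, point_weight=2.0):
    """Denoising loss that emphasizes regions indicated by matched semantic points.

    eps_pred, eps_true: (B, C, H, W) predicted / ground-truth noise.
    point_mask: (B, 1, H, W) binary mask marking pixels near matched garment points
                (an assumed encoding of the semantic point matching).
    """
    per_pixel = F.mse_loss(eps_pred, eps_true, reduction="none")   # (B, C, H, W)
    weights = 1.0 + (point_weight - 1.0) * point_mask              # up-weight matched points
    return (weights * per_pixel).mean()
```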
https://arxiv.org/abs/2505.16977
Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is slower in practice and therefore prior methods typically process images at reduced resolutions, missing fine-grained details. To address this, we propose a more efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 90% while maintaining low memory usage, and performs on par with the default implementation with up to 95% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 50% savings for the total end-to-end model inference in memory-constrained environments. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an additional inference-time modification of the recent SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and efficiency.
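For reference, the sketch below illustrates the operator in question: a RAFT-style dense all-pairs correlation volume (quadratic memory in the number of pixels) versus on-demand computation of the same correlations for a single query pixel. It is a simplified illustration of the baseline trade-off, not the proposed implementation.

```python
import torch

def dense_all_pairs_correlation(f1, f2):
    """RAFT-style all-pairs correlation volume: O(N^2) memory with N = H*W."""
    B, C, H, W = f1.shape
    a = f1.flatten(2).transpose(1, 2)      # (B, N, C)
    b = f2.flatten(2)                      # (B, C, N)
    return torch.bmm(a, b) / C**0.5        # (B, N, N) correlation volume

def on_demand_correlation(f1, f2, query_idx):
    """Correlations for a single query pixel of f1: low memory, but slower overall."""
    B, C, H, W = f1.shape
    q = f1.flatten(2)[:, :, query_idx]     # (B, C)
    return torch.einsum("bc,bchw->bhw", q, f2) / C**0.5
```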
https://arxiv.org/abs/2505.16942
Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them with image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching (URM). Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain and the newly introduced domain generalization settings.
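A minimal sketch of the standard pipeline described above: exemplar prototypes are matched against every spatial location of the image feature map via cosine similarity to build the correlation map. The feature shapes and the prototype pooling step are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def correlation_map(image_feat, prototypes):
    """Standard few-shot counting correlation construction.

    image_feat: (B, C, H, W) backbone features of the query image.
    prototypes: (B, K, C) prototypes pooled from the K exemplar boxes.
    Returns a (B, K, H, W) correlation map of cosine similarities.
    """
    img = F.normalize(image_feat, dim=1)
    pro = F.normalize(prototypes, dim=2)
    return torch.einsum("bkc,bchw->bkhw", pro, img)
```

URM's contribution would correspond to injecting universal vision-language representations (distilled from a large pretrained vision-language model) into these inputs before this matching step; that part is not shown here.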
https://arxiv.org/abs/2505.16778
Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages.
https://arxiv.org/abs/2505.16691
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at this https URL.
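A hedged sketch of LoRA fine-tuning on Qwen2.5-VL-7B-Instruct with the peft library, assuming a recent transformers release that ships Qwen2.5-VL support; the rank, alpha, and target modules below are illustrative defaults, not the paper's reported configuration.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative hyperparameters only -- not the configuration used in the paper.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are updated
```

Training then proceeds on the instruction-formatted detection/localization/counting prompts with a standard causal language-modeling loss.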
https://arxiv.org/abs/2505.16647
Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce MiLQ, a Mixed-Language Query test set and the first public benchmark of mixed-language queries, confirmed as realistic and highly preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, also suggesting the potential of code-switched training data for building robust IR models that handle such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.
https://arxiv.org/abs/2505.16631
Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.
https://arxiv.org/abs/2505.16602
Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
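As a rough illustration of what such rule-based, verifiable rewards could look like, the sketch below pairs an IoU-style affordance perception reward with a distance-based trajectory match reward; the exact formulations used by ManipLVM-R1 are not reproduced here, and these stand-ins are assumptions.

```python
import numpy as np

def affordance_reward(pred_box, gt_box):
    """IoU-style reward for localizing the interaction region (boxes as x1, y1, x2, y2)."""
    x1, y1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    x2, y2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def trajectory_reward(pred_traj, ref_traj, scale=0.1):
    """Reward that decays with the mean distance between predicted and reference waypoints.
    Assumes both trajectories are resampled to the same number of waypoints."""
    pred, ref = np.asarray(pred_traj), np.asarray(ref_traj)
    err = np.linalg.norm(pred - ref, axis=-1).mean()
    return float(np.exp(-err / scale))
```

Because both rewards are computed from rules rather than a learned critic, they are directly verifiable and provide the immediate, task-aligned feedback the framework relies on.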
https://arxiv.org/abs/2505.16517
TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can cut computation by up to 40% at run-time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting it run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.
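For intuition, a common ternary weight quantization scheme (weights mapped to {-1, 0, +1} with a per-tensor scale) is sketched below; TAT-VPR's actual quantizer and its learned activation-sparsity gate may differ, so treat this purely as background.

```python
import torch

def ternarize(w, threshold_factor=0.7):
    """Map weights to scale * {-1, 0, +1} (a common ternary scheme; details assumed)."""
    delta = threshold_factor * w.abs().mean()            # magnitudes below delta become 0
    mask = (w.abs() > delta).float()
    scale = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
    return scale * torch.sign(w) * mask
```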
https://arxiv.org/abs/2505.16447
The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method, compared to conventional representational similarity analysis, is that it estimates optimal fine-grained mappings between the representations of individual objects in humans and in models. We used this unsupervised alignment method to assess the extent to which the representation of each object in humans is correctly mapped to the corresponding representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
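A small, self-contained example of the kind of unsupervised Gromov-Wasserstein alignment described above, using the POT library on placeholder dissimilarity matrices; the actual study uses human similarity judgments of 1,854 THINGS objects, which are not included here.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
n = 50
emb_human = rng.normal(size=(n, 16))                      # placeholder "human" object embedding
emb_model = emb_human + 0.1 * rng.normal(size=(n, 16))    # placeholder "model" embedding

C1 = ot.dist(emb_human, emb_human)   # pairwise dissimilarity matrices are the only inputs GW needs
C2 = ot.dist(emb_model, emb_model)
p = q = np.full(n, 1.0 / n)          # uniform weights over objects

T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")  # (n, n) coupling
matching_acc = (T.argmax(axis=1) == np.arange(n)).mean()  # fraction of objects mapped to themselves
print(f"top-1 matching accuracy: {matching_acc:.2f}")
```

The diagonal-matching rate plays the role of the fine-grained matching score, while cluster-level agreement of the coupling corresponds to coarse-grained matching.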
https://arxiv.org/abs/2505.16419
Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods perform face frontalization via either generative models or learning a pose-robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training stage and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that our method not only outperforms the state-of-the-art in the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.
https://arxiv.org/abs/2505.16412
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in multi-step reasoning and calling search engines at appropriate steps. However, existing retrieval-augmented reasoning approaches rely on separate retrieval models, limiting the LRM's role in retrieval to deciding when to retrieve and how to query. This separation not only increases hardware and operational costs but also leads to errors in the retrieval process due to the representation bottleneck, a phenomenon where the retriever's embedding space is not expressive enough to meet the generator's requirements. To address this, we shift our perspective from sequence-to-sequence matching to locating the answer-containing paths within the corpus, and propose a novel framework called FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables LRMs to retrieve relevant knowledge on their own by acting as both a generator and retriever. To achieve this, we introduce a variant of the MCTS algorithm specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus toward answer-containing regions. Our results on five open-domain QA benchmarks, including single-hop and multi-hop questions, show that FREESON achieves an average improvement of 14.4% in EM and F1 over four multi-step reasoning models with a separate retriever, and it also performs comparably to the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
https://arxiv.org/abs/2505.16409
Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in safety accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D samples to address the challenges of intra-class diversity and environmental variations in real-world scenarios. Building upon this framework, we introduce an adversarial sample reality enhancement approach that incorporates non-rigid surface modeling and a realistic 3D matching mechanism. We compare with 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experimental results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Moreover, the proposed method demonstrates excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at this https URL.
https://arxiv.org/abs/2505.16402
Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, as well as potential biases and limitations of CoCoGraph.
https://arxiv.org/abs/2505.16365
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
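Reading the description literally, one AdamS-style update might look like the sketch below: momentum in the numerator and the square root of a weighted sum of squares of the momentum and the current gradient in the denominator, with AdamW-style decoupled weight decay. The precise weighting and any bias correction are assumptions, not the paper's exact algorithm.

```python
import torch

@torch.no_grad()
def adams_step(param, grad, m, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamS-style step as read from the abstract.

    Only the first moment m is kept as state; the denominator is built from the
    current momentum and gradient, so no second-moment buffer is needed.
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)                          # momentum (first moment)
    denom = (beta2 * m.pow(2) + (1 - beta2) * grad.pow(2)).sqrt().add_(eps)
    param.mul_(1 - lr * weight_decay)                                  # AdamW-style decoupled decay
    param.addcdiv_(m, denom, value=-lr)
    return param, m
```

The state footprint matches SGD with momentum (one buffer per parameter), which is the efficiency claim made above.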
https://arxiv.org/abs/2505.16363
While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR-based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE-TRIP in real-world applications, we further propose (1) a keypoint extraction method, (2) a key instance segmentation method, (3) a RE-TRIP matching method, and (4) a reflectivity-combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE-TRIP. Applied to public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios such as long corridors, bridges, large-scale urban areas, and highly dynamic environments, our experimental results show that the proposed method outperforms existing state-of-the-art methods, namely Scan Context, Intensity Scan Context, and STD.
https://arxiv.org/abs/2505.16165
We present GMatch, a learning-free feature matcher designed for robust 6DoF object pose estimation, addressing common local ambiguities in sparse feature matching. Unlike traditional methods that rely solely on descriptor similarity, GMatch performs a guided, incremental search, enforcing SE(3)-invariant geometric consistency throughout the matching process. It leverages a provably complete set of geometric features that uniquely determine 3D keypoint configurations, ensuring globally consistent correspondences without the need for training or GPU support. When combined with classical descriptors such as SIFT, GMatch-SIFT forms a general-purpose pose estimation pipeline that offers strong interpretability and generalization across diverse objects and scenes. Experiments on the HOPE dataset show that GMatch outperforms both traditional and learning-based matchers, with GMatch-SIFT achieving or surpassing the performance of instance-level pose networks. On the YCB-Video dataset, GMatch-SIFT demonstrates high accuracy and low variance on texture-rich objects. These results not only validate the effectiveness of GMatch-SIFT for object pose estimation but also highlight the broader applicability of GMatch as a general-purpose feature matcher. Code will be released upon acceptance.
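The core consistency idea can be illustrated with a simplified stand-in: accept a candidate correspondence only if its pairwise 3D distances to all previously accepted matches agree across the two point sets, since pairwise distances are preserved by any SE(3) transform. GMatch's guided search order and its provably complete geometric feature set go beyond this sketch.

```python
import numpy as np

def greedy_consistent_matches(cands, pts_src, pts_dst, tol=0.01):
    """Incrementally accept candidate correspondences (i, j) whose pairwise 3D distances
    to all previously accepted matches agree between source and destination keypoints.

    cands: iterable of (i, j) index pairs, assumed sorted by descriptor similarity.
    pts_src, pts_dst: (N, 3) and (M, 3) arrays of 3D keypoint positions.
    Surviving matches are mutually consistent with a single rigid motion."""
    accepted = []
    for i, j in cands:
        ok = True
        for a, b in accepted:
            d_src = np.linalg.norm(pts_src[i] - pts_src[a])
            d_dst = np.linalg.norm(pts_dst[j] - pts_dst[b])
            if abs(d_src - d_dst) > tol:
                ok = False
                break
        if ok:
            accepted.append((i, j))
    return accepted
```

The accepted correspondences can then be fed to a standard rigid-registration step (e.g., Umeyama/Kabsch) to recover the 6DoF pose.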
https://arxiv.org/abs/2505.16144
We consider the problem of single-channel audio source separation with the goal of reconstructing $K$ sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation method based on flow matching, ensuring strict mixture consistency. Flow matching is a general methodology that, when given samples from two probability distributions defined on the same space, learns an ordinary differential equation to output a sample from one of the distributions when provided with a sample from the other. In our context, we have access to samples from the joint distribution of $K$ sources and so the corresponding samples from the lower-dimensional distribution of their mixture. To apply flow matching, we augment these mixture samples with artificial noise components to ensure the resulting "augmented" distribution matches the dimensionality of the $K$ source distribution. Additionally, as any permutation of the sources yields the same mixture, we adopt an equivariant formulation of flow matching which relies on a suitable custom-designed neural network architecture. We demonstrate the performance of the method for the separation of overlapping speech.
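For context, a generic conditional flow matching training step with a linear path is sketched below; in a FLOSS-like setting, x1 would stack the K sources and x0 would be the mixture augmented with artificial noise to match that dimensionality (an assumption based on the description above). Mixture-consistency constraints and the equivariant architecture are omitted, and the interface of `model` is assumed.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, x1, cond=None):
    """Generic flow-matching objective with a linear path:
    x_t = (1 - t) * x0 + t * x1, target velocity = x1 - x0.

    x0: samples from the "augmented" mixture distribution (mixture + artificial noise channels).
    x1: samples from the joint distribution of the K sources (same shape as x0).
    """
    B = x0.shape[0]
    t = torch.rand(B, *[1] * (x0.dim() - 1), device=x0.device)   # broadcastable time in [0, 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t.flatten(), cond) if cond is not None else model(x_t, t.flatten())
    return F.mse_loss(v_pred, v_target)
```

At inference, integrating the learned ODE from an augmented mixture sample yields the K separated sources.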
https://arxiv.org/abs/2505.16119
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
https://arxiv.org/abs/2505.15929
Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image are then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). Our source code is available at this https URL.
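The classifier-free colorization guidance can be pictured as a standard classifier-free-guidance-style blend of the color-transferred and non-color-transferred predictions; the guidance scale and the exact quantities being blended below are assumptions for illustration.

```python
def colorization_guidance(pred_color, pred_plain, guidance_scale=1.5):
    """Classifier-free-guidance-style combination of the color-transferred prediction
    (pred_color) and the non-color-transferred prediction (pred_plain).
    A guidance_scale > 1 pushes the result further toward the transferred colors."""
    return pred_plain + guidance_scale * (pred_color - pred_plain)
```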
https://arxiv.org/abs/2505.15812