End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method automatically infers layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
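The region-specific binding described above can be pictured with a small, hypothetical PyTorch sketch: each identity's audio features only update the video latent inside that identity's predicted mask. The module name, the use of a generic cross-attention layer, and all shapes are illustrative assumptions, not the paper's actual operator.

```python
# A schematic sketch of mask-gated (region-specific) audio injection, assuming a
# generic cross-attention layer; not the paper's exact architecture.
import torch
import torch.nn as nn

class MaskedAudioInjection(nn.Module):
    def __init__(self, dim, audio_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)

    def forward(self, video_tokens, audio_feats, masks):
        # video_tokens: (B, N, D) flattened spatiotemporal tokens of the denoised video
        # audio_feats:  list of (B, T_a, A) audio features, one entry per identity
        # masks:        list of (B, N, 1) soft masks from the mask predictor, per identity
        out = video_tokens
        for audio, mask in zip(audio_feats, masks):
            delta, _ = self.attn(out, audio, audio)  # audio cross-attention update
            out = out + mask * delta                 # gate the update by the identity's region
        return out

inj = MaskedAudioInjection(dim=512, audio_dim=768)
tokens = torch.randn(1, 2048, 512)
audios = [torch.randn(1, 50, 768), torch.randn(1, 50, 768)]   # two identities
masks = [torch.rand(1, 2048, 1), torch.rand(1, 2048, 1)]
out = inj(tokens, audios, masks)                              # (1, 2048, 512)
```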
https://arxiv.org/abs/2506.09984
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in a natural context. To overcome this lack of paired training data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environment-aware audio.
https://arxiv.org/abs/2506.09874
Absolute localization, aiming to determine an agent's location with respect to a global reference, is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. However, existing methods mostly rely on traditional, low-level image matching and struggle with the significant appearance differences introduced by cross-source discrepancies and temporal variations. To overcome these limitations, in this paper, we introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates a semantic-aware and structure-constrained coarse matching module with a lightweight fine-grained matching module. Specifically, in the coarse matching module, semantic features derived from a vision foundation model first establish region-level correspondences under semantic and structural constraints. Then, the fine-grained matching module is applied to extract fine features and establish pixel-level correspondences. Building upon this, a UAV absolute visual localization pipeline is constructed without any reliance on relative localization techniques, mainly by employing an image retrieval module before the proposed hierarchical image matching modules. Experimental evaluations on public benchmark datasets and a newly introduced CS-UAV dataset demonstrate superior accuracy and robustness of the proposed method under various challenging conditions, confirming its effectiveness.
https://arxiv.org/abs/2506.09748
This paper studies the problem of Line Segment Detection (LSD) for characterizing line geometry in images, with the aim of learning a domain-agnostic, robust LSD model that works well on any natural image. Focusing on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for curating line geometry at scale from over 10M unlabeled real-world images. ScaleLSD detects substantially more line segments from natural images than even the pioneering non-deep LSD approach, yielding a more complete and accurate geometric characterization of images using line segments. Experimentally, our proposed ScaleLSD is comprehensively evaluated under zero-shot protocols on detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, obtaining excellent performance in all settings. Based on this thorough evaluation, ScaleLSD is the first deep approach to outperform the pioneering non-deep LSD in every aspect we tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and Models are available at this https URL
https://arxiv.org/abs/2506.09369
We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r = d/2 achieves a 1.4 times speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate the perplexity results, with our models achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.
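To make the KV-cache saving concrete, here is a minimal PyTorch sketch of latent attention with a half-rank latent (r = d/2) and rotary embeddings applied to queries and keys. It is a simplification under several assumptions (no causal mask, no KV-cache mechanics, no decoupled RoPE path, generic hyperparameters) and is not the benchmarked implementation.

```python
# Minimal sketch of latent attention with a half-rank KV latent and RoPE.
import torch
import torch.nn as nn

def rope(x):
    """Standard rotate-half RoPE applied to (B, H, T, Dk) tensors (Dk even)."""
    B, H, T, Dk = x.shape
    half = Dk // 2
    freq = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    ang = torch.arange(T, dtype=torch.float32)[:, None] * freq[None, :]   # (T, half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class LatentAttention(nn.Module):
    """Queries are full-rank; keys/values are reconstructed from a shared
    low-rank latent of dimension r, which is what would be cached."""
    def __init__(self, d_model=512, n_heads=8, latent_dim=None):
        super().__init__()
        self.h, self.dk = n_heads, d_model // n_heads
        r = latent_dim or d_model // 2           # half-rank latent, r = d/2
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, r)     # per-token KV cache shrinks to r dims
        self.k_up = nn.Linear(r, d_model)
        self.v_up = nn.Linear(r, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                        # causal masking omitted for brevity
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.h, self.dk).transpose(1, 2)
        c = self.kv_down(x)                      # (B, T, r) shared latent
        k = self.k_up(c).view(B, T, self.h, self.dk).transpose(1, 2)
        v = self.v_up(c).view(B, T, self.h, self.dk).transpose(1, 2)
        q, k = rope(q), rope(k)                  # rotary positions on queries and keys
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(y)

y = LatentAttention()(torch.randn(2, 16, 512))   # (2, 16, 512)
```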
https://arxiv.org/abs/2506.09342
Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also achieving 62% lower error and running 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.
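As a rough illustration of "directly regresses the (u,v) flow", the hypothetical head below fuses features from the two images and predicts a per-pixel flow vector plus a covisibility score, with no cost volume. The layer choices and shapes are assumptions, not UFM's actual transformer.

```python
# Fused two-image features -> per-pixel (u, v) flow and a covisibility score; schematic only.
import torch
import torch.nn as nn

class DirectFlowHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 3, 1),                # 2 channels for (u, v) + 1 covisibility logit
        )

    def forward(self, feat_src, feat_tgt):
        out = self.head(torch.cat([feat_src, feat_tgt], dim=1))
        flow, covis = out[:, :2], out[:, 2:].sigmoid()
        return flow, covis

flow, covis = DirectFlowHead()(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```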
https://arxiv.org/abs/2506.09278
Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations: insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.
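For readers unfamiliar with the action space, the snippet below only shows the three edit operations Edit Flows is built on, applied to a toy token sequence; it does not implement the Continuous-time Markov Chain or the training objective.

```python
# The three edit operations over token sequences; illustration only.
def apply_edit(tokens, op, pos, token=None):
    """Apply a single insert/delete/substitute edit to a list of tokens."""
    tokens = list(tokens)
    if op == "insert":        # insert `token` before index `pos`
        tokens.insert(pos, token)
    elif op == "delete":      # remove the token at index `pos`
        del tokens[pos]
    elif op == "substitute":  # replace the token at index `pos` with `token`
        tokens[pos] = token
    else:
        raise ValueError(f"unknown edit op: {op}")
    return tokens

seq = ["a", "cat", "sat"]
seq = apply_edit(seq, "insert", 1, "black")       # ['a', 'black', 'cat', 'sat']
seq = apply_edit(seq, "substitute", 3, "slept")   # ['a', 'black', 'cat', 'slept']
seq = apply_edit(seq, "delete", 1)                # ['a', 'cat', 'slept']
print(seq)
```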
https://arxiv.org/abs/2506.09018
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at this https URL.
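The parameter-efficiency claim boils down to freezing the pretrained encoder and training only small adaptation modules. The sketch below uses a generic residual bottleneck adapter as a stand-in; ALTA's actual adaptation layers and the exact 8% figure are not reproduced here.

```python
# Generic residual bottleneck adapters on top of a frozen backbone; a stand-in for ALTA's modules.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))    # residual adapter update

def make_parameter_efficient(backbone, adapters):
    for p in backbone.parameters():
        p.requires_grad = False                       # freeze the pretrained encoder
    trainable = sum(p.numel() for p in adapters.parameters())
    total = trainable + sum(p.numel() for p in backbone.parameters())
    print(f"trainable fraction: {trainable / total:.1%}")
    return adapters

backbone = nn.Sequential(nn.Linear(768, 768), nn.Linear(768, 768))  # stand-in for a ViT
adapters = nn.ModuleList([Adapter(768) for _ in range(2)])
make_parameter_efficient(backbone, adapters)
```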
https://arxiv.org/abs/2506.08990
Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of biometric tasks, including face and body recognition. In this work, we adapt a ViT model pretrained on visible (VIS) imagery to the challenging problem of cross-spectral body recognition, which involves matching images captured in the visible and infrared (IR) domains. Recent ViT architectures have explored incorporating additional embeddings beyond traditional positional embeddings. Building on this idea, we integrate Side Information Embedding (SIE) and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only camera information - without explicitly incorporating domain information - achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in visible-infrared (VI) Re-ID remain largely underexplored - primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.
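A Side Information Embedding can be realized as a learnable table indexed by camera ID and added to the patch tokens, much like a positional embedding. The sketch below encodes only the camera, following the finding above; the scale factor and shapes are assumptions.

```python
# Learnable camera-ID embedding added to ViT patch tokens, analogous to a positional embedding.
import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    def __init__(self, num_cameras, embed_dim, scale=1.0):
        super().__init__()
        self.table = nn.Parameter(torch.zeros(num_cameras, embed_dim))
        self.scale = scale

    def forward(self, tokens, camera_ids):
        # tokens: (B, N, D) patch tokens; camera_ids: (B,) integer camera indices
        return tokens + self.scale * self.table[camera_ids].unsqueeze(1)

tokens = torch.randn(4, 197, 768)                   # e.g., a ViT-B/16 token sequence
cam_ids = torch.tensor([0, 2, 1, 0])
out = SideInfoEmbedding(num_cameras=6, embed_dim=768)(tokens, cam_ids)
```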
https://arxiv.org/abs/2506.08953
Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Compared with alternative pre-training methods, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves a $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: this https URL Code: this https URL
https://arxiv.org/abs/2506.08902
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at this https URL.
https://arxiv.org/abs/2506.08887
Recently, the rectified flow (RF) has emerged as the new state-of-the-art among flow-based diffusion models due to its high efficiency in straight-path sampling, as demonstrated by the impressive images generated by a series of RF models such as Flux 1.0 and SD 3.0. Although a straight-line connection between the noisy and natural data distributions is intuitive, fast, and easy to optimize, it still inevitably leads to: 1) diversity concerns, which arise because straight-line paths only cover a fairly restricted sampling space; and 2) multi-scale noise modeling concerns, since the straight-line flow only needs to optimize the constant velocity field $\bm v$ between the two distributions $\bm\pi_0$ and $\bm\pi_1$. In this work, we present Discretized-RF, a new family of rectified flow models (also called momentum flow models, since they refer to the previous velocity component and the random velocity component in each diffusion step), which discretizes the straight path into a series of variable velocity field sub-paths (namely "momentum fields") to expand the search space, especially when close to the distribution $p_\text{noise}$. Different from the previous case where noise is directly superimposed on $\bm x$, we introduce noise on the velocity $\bm v$ of the sub-path to change its direction, in order to improve diversity and multi-scale noise modeling. Experimental results on several representative datasets demonstrate that learning momentum flow matching by sampling random velocity fields produces trajectories that are both diverse and efficient, and can consistently generate high-quality and diverse results. Code is available at this https URL.
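For reference, the straight-path baseline this paragraph contrasts against is the standard rectified-flow objective: interpolate linearly between noise and data and regress the constant velocity x1 - x0. The toy MLP and 2-D data below are placeholders, and this is the baseline, not Discretized-RF itself.

```python
# Straight-path (rectified flow) training objective with a toy MLP velocity field.
import torch
import torch.nn as nn

v_theta = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))

def rectified_flow_loss(x1):
    """x1: a batch of 2-D data samples drawn from pi_1."""
    x0 = torch.randn_like(x1)                       # noise from pi_0
    t = torch.rand(x1.shape[0], 1)                  # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                      # straight-line path
    target_v = x1 - x0                              # constant velocity field
    pred_v = v_theta(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

rectified_flow_loss(torch.randn(256, 2)).backward()  # dummy data, for illustration
```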
https://arxiv.org/abs/2506.08796
In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, focusing on fully open-source and open-data components (WavLM-large, OLMo). We ablate different aspects of such systems, including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, and fuzzy matching-based output post-processing, which greatly improves the performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on the SpokenWOZ test set.
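The fuzzy-matching post-processing step can be as simple as snapping a generated slot value to the closest entry in a catalogue of known named entities. The sketch below uses Python's difflib; the catalogue, slot names, and cutoff are made-up examples rather than the paper's configuration.

```python
# Snap generated slot values to the closest known entity with difflib.
from difflib import get_close_matches

KNOWN_VALUES = {
    "hotel-name": ["the grand plaza", "riverside inn", "city lodge"],
    "restaurant-food": ["italian", "japanese", "vegetarian"],
}

def postprocess_slot(slot, predicted_value, cutoff=0.7):
    candidates = KNOWN_VALUES.get(slot, [])
    match = get_close_matches(predicted_value.lower(), candidates, n=1, cutoff=cutoff)
    return match[0] if match else predicted_value   # fall back to the raw prediction

print(postprocess_slot("hotel-name", "the grand plasa"))  # -> 'the grand plaza'
```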
https://arxiv.org/abs/2506.08633
Generative machine learning methods, such as diffusion models and flow matching, have shown great potential in modeling complex system behaviors and building efficient surrogate models. However, these methods typically learn the underlying physics implicitly from data. We propose Physics-Based Flow Matching (PBFM), a novel generative framework that explicitly embeds physical constraints, both PDE residuals and algebraic relations, into the flow matching objective. We also introduce temporal unrolling at training time that improves the accuracy of the final, noise-free sample prediction. Our method jointly minimizes the flow matching loss and the physics-based residual loss without requiring hyperparameter tuning of their relative weights. Additionally, we analyze the role of the minimum noise level, $\sigma_{\min}$, in the context of physical constraints and evaluate a stochastic sampling strategy that helps to reduce physical residuals. Through extensive benchmarks on three representative PDE problems, we show that our approach yields physical residuals up to $8\times$ more accurate than FM, while clearly outperforming existing algorithms in terms of distributional accuracy. PBFM thus provides a principled and efficient framework for surrogate modeling, uncertainty quantification, and accelerated simulation in physics and engineering applications.
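A hedged sketch of the loss structure: a standard flow-matching term plus a physics residual evaluated on a one-step estimate of the noise-free sample. The toy algebraic constraint (components summing to zero) and the plain unweighted sum stand in for the paper's PDE residuals and its weighting-free combination scheme.

```python
# Flow-matching term plus a physics residual on the one-step clean-sample estimate.
import torch
import torch.nn as nn

v_theta = nn.Sequential(nn.Linear(4 + 1, 64), nn.SiLU(), nn.Linear(64, 4))

def pbfm_style_loss(x1):
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                      # straight interpolation path
    v_pred = v_theta(torch.cat([xt, t], dim=-1))
    fm_loss = ((v_pred - (x1 - x0)) ** 2).mean()    # standard flow-matching term

    x1_hat = xt + (1 - t) * v_pred                  # one-step noise-free estimate
    residual = x1_hat.sum(dim=-1)                   # toy algebraic constraint: sum == 0
    phys_loss = (residual ** 2).mean()
    return fm_loss + phys_loss                      # unweighted combination (placeholder)

pbfm_style_loss(torch.randn(128, 4)).backward()
```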
https://arxiv.org/abs/2506.08604
This work focuses on an observed limitation of text encoders: embeddings may fail to recognize fine-grained entities or events within the semantics, resulting in failed dense retrieval even on simple cases. To examine such behaviors, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions and whose queries are phrases asking about entities or events in various forms. Zero-shot evaluation suggests that encoders may fail at this fine-grained matching, regardless of training sources or model sizes. Aiming for enhancement, we then finetune encoders with our proposed data generation strategies, which yields the best performance on CapRetrieval. Within this process, we further identify an issue of granularity dilemma: the challenge for embeddings to express fine-grained salience while aligning with overall semantics. Our dataset, code, and models are publicly released at this https URL.
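For context, the dense-retrieval setup being probed looks like the sketch below: passages and queries are embedded and ranked by cosine similarity, so any failure to encode a fine-grained entity directly produces a wrong ranking. The encode function is a random-vector stand-in for whichever text encoder is under evaluation.

```python
# Dense retrieval by cosine similarity; `encode` is a random stand-in for a real
# text encoder, so the ranking here is arbitrary; only the pipeline shape matters.
import numpy as np

def encode(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 128))

def retrieve(query, passages, top_k=3):
    q = encode([query])[0]
    p = encode(passages)
    sims = p @ q / (np.linalg.norm(p, axis=1) * np.linalg.norm(q) + 1e-8)
    return [passages[i] for i in np.argsort(-sims)[:top_k]]

captions = ["a dog catches a frisbee", "two children play chess", "a chef slices onions"]
print(retrieve("board game", captions, top_k=1))
```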
https://arxiv.org/abs/2506.08592
Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio sampled examples are available at: this https URL
https://arxiv.org/abs/2506.08570
Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection, generating sub-instruction-trajectory pairs with entity-landmark annotations. Finally, these sub-pairs are aggregated to form a complete instruction-trajectory pair. The framework generates the FCA-R2R dataset, the first large-scale augmentation dataset featuring precise sub-instruction-sub-trajectory and entity-landmark alignments. Extensive experiments demonstrate that training with FCA-R2R significantly improves the performance of multiple state-of-the-art VLN agents, including SF, EnvDrop, RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances agents' state awareness and decision accuracy, while entity-landmark alignment further boosts navigation performance and generalization. These results highlight the effectiveness of FCA-NIG in generating high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.
https://arxiv.org/abs/2506.08566
Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model's own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website this https URL.
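One concrete way to realize a Plackett-Luce ranking loss over the K predicted trajectories is the negative log-likelihood of the error-sorted ranking under per-trajectory confidence scores, sketched below. The choice to rank by displacement error is an assumption; TrajFlow's exact score definition may differ.

```python
# Negative log-likelihood of the error-sorted ranking under a Plackett-Luce model.
import torch

def plackett_luce_loss(scores, errors):
    """scores: (K,) confidence logits for K predicted trajectories.
    errors: (K,) ground-truth errors; lower error should be ranked higher."""
    order = torch.argsort(errors)                   # best (lowest-error) trajectory first
    s = scores[order]
    nll = 0.0
    for i in range(len(s)):
        # probability of choosing the i-th ranked item among the remaining ones
        nll = nll - (s[i] - torch.logsumexp(s[i:], dim=0))
    return nll

scores = torch.randn(6, requires_grad=True)         # e.g., K = 6 predicted modes
errors = torch.rand(6)
plackett_luce_loss(scores, errors).backward()
```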
https://arxiv.org/abs/2506.08541
Domain Generalization (DG) seeks to train models that perform reliably on unseen target domains without access to target data during training. While recent progress in smoothing the loss landscape has improved generalization, existing methods often falter under long-tailed class distributions and conflicting optimization objectives. We introduce FedTAIL, a federated domain generalization framework that explicitly addresses these challenges through sharpness-guided, gradient-aligned optimization. Our method incorporates a gradient coherence regularizer to mitigate conflicts between classification and adversarial objectives, leading to more stable convergence. To combat class imbalance, we perform class-wise sharpness minimization and propose a curvature-aware dynamic weighting scheme that adaptively emphasizes underrepresented tail classes. Furthermore, we enhance conditional distribution alignment by integrating sharpness-aware perturbations into entropy regularization, improving robustness under domain shift. FedTAIL unifies optimization harmonization, class-aware regularization, and conditional alignment into a scalable, federated-compatible framework. Extensive evaluations across standard domain generalization benchmarks demonstrate that FedTAIL achieves state-of-the-art performance, particularly in the presence of domain shifts and label imbalance, validating its effectiveness in both centralized and federated settings. Code: this https URL
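The "sharpness-guided" ingredient builds on sharpness-aware minimization: perturb the weights along the gradient direction, compute the gradient at the perturbed point, then update the original weights. The plain single-loss step below only shows that mechanic; FedTAIL's class-wise application, curvature-aware weighting, and gradient-coherence regularizer are not included.

```python
# Basic sharpness-aware minimization (SAM) step; single global loss, no class-wise weighting.
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    # 1) gradient at the current weights
    loss_fn(model, batch).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]

    # 2) ascend to the nearby "sharp" point and compute the gradient there
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)
    optimizer.zero_grad()
    loss_fn(model, batch).backward()

    # 3) restore the weights and apply the update using the perturbed-point gradient
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```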
https://arxiv.org/abs/2506.08518
The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
https://arxiv.org/abs/2506.09082