Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building on this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each subtask is first trained via SFT and then enhanced by DPO to align with psychological preferences. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint-optimization baselines, reducing preference bias and improving response quality.
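To make the decoupled design concrete, the sketch below shows how the two SFT+DPO-tuned subtask models could be chained at inference time; the checkpoint paths and prompt templates are assumptions, not the paper's released artifacts.

```python
# A minimal sketch of the decoupled two-stage inference described above.
# Checkpoint paths and prompt formats are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model, tok, prompt, max_new_tokens=128):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Stage 1: the strategy planner picks a psychological strategy label.
planner_tok = AutoTokenizer.from_pretrained("path/to/strategy-planner")        # hypothetical
planner = AutoModelForCausalLM.from_pretrained("path/to/strategy-planner")     # hypothetical

# Stage 2: the response generator conditions on the dialogue AND the strategy.
resp_tok = AutoTokenizer.from_pretrained("path/to/response-generator")         # hypothetical
responder = AutoModelForCausalLM.from_pretrained("path/to/response-generator")

dialogue = "Seeker: I failed my exam and I feel worthless."
strategy = generate(planner, planner_tok, f"Dialogue:\n{dialogue}\nStrategy:")
reply = generate(responder, resp_tok,
                 f"Dialogue:\n{dialogue}\nStrategy: {strategy}\nResponse:")
```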
https://arxiv.org/abs/2505.16995
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules that yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs are limited by significant resource costs and suboptimal joint optimization. To address these issues, we propose \name, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of \name\ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of \name, showing relative improvements of 68.67\% in Hit@5 and 45.21\% in NDCG@20. Code available at this https URL.
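As an illustration of how a reward can be fused from recommendation labels alone, the sketch below combines a hit indicator, an NDCG-style rank discount, and a small format bonus; this is one plausible reading, not the paper's exact formula.

```python
import math

def fused_reward(ranked_items, target_item, reasoning_ok=True, k=20):
    """One plausible label-only fused reward: a hit term plus a rank-discounted
    term, with a small bonus for well-formed reasoning. Assumed form."""
    hit = 1.0 if target_item in ranked_items[:k] else 0.0
    if hit:
        rank = ranked_items.index(target_item)      # 0-based position in the list
        dcg = 1.0 / math.log2(rank + 2)             # NDCG-style discount
    else:
        dcg = 0.0
    fmt = 0.1 if reasoning_ok else 0.0              # reasoning block parseable?
    return hit + dcg + fmt
```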
https://arxiv.org/abs/2505.16994
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process, giving rise to our coined Native Segmentation Vision Transformer. We show that a careful design of our architecture enables strong segmentation masks to emerge solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.
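A minimal sketch of what such a content-aware grouping layer could look like follows: N input tokens are softly assigned to M learned group tokens, so resolution reduction follows content rather than a uniform grid. The layer below is illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatialGroupingLayer(nn.Module):
    """Sketch of a content-aware grouping layer: N tokens are assigned to
    M group tokens via attention-like similarity. Illustrative only."""
    def __init__(self, dim, num_groups):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):                       # x: (B, N, dim)
        q = self.q(self.group_tokens)           # (M, dim)
        k, v = self.k(x), self.v(x)             # (B, N, dim)
        logits = torch.einsum("md,bnd->bmn", q, k) / k.shape[-1] ** 0.5
        # Softmax over the group axis: each input token commits its mass to
        # groups; these assignments are what yield segment-like masks.
        assign = logits.softmax(dim=1)          # (B, M, N)
        norm = assign / (assign.sum(-1, keepdim=True) + 1e-6)
        out = torch.einsum("bmn,bnd->bmd", norm, v)
        return out, assign                      # grouped tokens + soft masks
```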
https://arxiv.org/abs/2505.16993
Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation or require prolonged training of smaller models for better performance. To address these issues, we introduce an efficacious two-step representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experiments on ASR benchmarks reveal the efficacy of our approach, achieving a three-fold training speed-up and up to a 12.54% word error rate improvement.
https://arxiv.org/abs/2505.16991
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLMs can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as few as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique from autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its responses using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at this https URL.
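The sketch below illustrates the confident-decoding idea: at each step, every masked position whose top-1 probability clears a threshold is committed (at least one per step), so the iteration count adapts to model confidence. The `model` interface and threshold value are assumptions.

```python
import torch

@torch.no_grad()
def confident_decoding(model, x, mask_id, tau=0.9, max_steps=64):
    """Sketch of confident decoding over a hypothetical masked-diffusion
    `model(x) -> logits of shape (B, L, V)`. Illustrative only."""
    for _ in range(max_steps):
        masked = (x == mask_id)
        if not masked.any():
            break                                    # all positions filled
        probs = model(x).softmax(-1)                 # (B, L, V)
        conf, pred = probs.max(-1)                   # per-position confidence
        conf = conf.masked_fill(~masked, -1.0)       # ignore already-filled slots
        commit = masked & (conf >= tau)              # fill all confident slots
        if not commit.any():                         # always make progress:
            commit.view(-1)[conf.view(-1).argmax()] = True
        x = torch.where(commit, pred, x)
    return x
```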
https://arxiv.org/abs/2505.16990
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and we invite contributions from the broader open-source community.
https://arxiv.org/abs/2505.16988
Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (four single-domain and five multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting its ability to plan and reason in complex, tool-dependent scenarios.
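The reuse-or-recompute decision the benchmark probes can be pictured with a small cache sketch; the staleness and dependency rules below are illustrative assumptions, not T1's implementation.

```python
import time

class ToolCache:
    """Sketch of a reuse-or-recompute policy: cached tool results are reused
    unless they are stale or an upstream dependency changed. Illustrative."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}                          # (tool, args) -> (result, t, deps)

    def call(self, tool, args, deps=(), changed=frozenset()):
        key = (tool.__name__, tuple(sorted(args.items())))
        hit = self.store.get(key)
        fresh = (hit is not None
                 and time.time() - hit[1] < self.ttl     # not expired
                 and not (set(hit[2]) & changed))        # no dependency changed
        if fresh:
            return hit[0]                        # reuse cached result
        result = tool(**args)                    # recompute
        self.store[key] = (result, time.time(), deps)
        return result
```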
https://arxiv.org/abs/2505.16986
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at this https URL.
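One way to picture such outlier synthesis is as convex mixing of in-distribution features across samples or modalities, as in the sketch below; the mixing form and coefficient range are assumptions, not the exact method.

```python
import torch

def feature_mixing(feats_a, feats_b, alpha_range=(0.4, 0.6)):
    """Sketch of mixing ID features to synthesize outliers: convex combinations
    of features from different samples fall off the ID feature manifold and can
    serve as virtual OOD training signal. Assumed form.
    feats_a, feats_b: (B, D) features from two modalities or feature streams."""
    alpha = torch.empty(feats_a.size(0), 1, device=feats_a.device).uniform_(*alpha_range)
    perm = torch.randperm(feats_b.size(0))       # pair features across samples
    return alpha * feats_a + (1 - alpha) * feats_b[perm]
```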
https://arxiv.org/abs/2505.16985
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between the memorizing and thinking that underlie existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
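The unification can be pictured as a single objective that adds a supervised term over annotated solutions to a policy-gradient term over the model's own rollouts, as in the sketch below; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def uft_loss(policy_logits, sampled_ids, advantages, sft_logits, sft_targets, lam=0.5):
    """Sketch of a unified SFT+RFT objective in the spirit of UFT. Assumed form.
    policy_logits: (B, T, V) for the model's own rollouts; advantages: (B,);
    sft_logits/sft_targets: teacher-forced pass over annotated solutions."""
    logp = F.log_softmax(policy_logits, dim=-1)
    token_logp = logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    rl_term = -(advantages.unsqueeze(1) * token_logp).mean()   # explore via RL
    sft_term = F.cross_entropy(sft_logits.flatten(0, 1),       # imitate supervision
                               sft_targets.flatten())
    return rl_term + lam * sft_term
```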
https://arxiv.org/abs/2505.16984
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or on specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at this https URL.
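A minimal sketch of group position encoding follows: each token is numbered relative to its own group, so appending a new source chunk mid-stream never renumbers previously encoded tokens. This is an illustrative reading of the paradigm, not the released implementation.

```python
def group_position_ids(segments):
    """Assign position IDs relative to each token's group (e.g. source vs.
    target), preserving relative order within groups across stream updates.
    segments: list of (group, num_tokens) in arrival order."""
    counters, ids = {}, []
    for group, n in segments:
        start = counters.get(group, 0)
        ids.extend(range(start, start + n))      # positions continue per group
        counters[group] = start + n
    return ids

# Streaming example: the first source chunk keeps ids 0..4 even after the
# target tokens arrive, and a later source chunk continues from 5.
print(group_position_ids([("src", 5), ("tgt", 3), ("src", 4)]))
# -> [0, 1, 2, 3, 4, 0, 1, 2, 5, 6, 7, 8]
```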
https://arxiv.org/abs/2505.16983
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge graphs (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then carefully designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporally regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on the VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM over the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.
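The temporal regularization can be pictured as a consistency penalty on pose-aware attention maps across consecutive frames, as in the sketch below; the exact loss in DPIDM may differ.

```python
import torch

def temporal_attention_loss(attn):
    """Sketch of a temporally regularized attention loss: penalize changes in
    attention maps between consecutive frames to encourage temporal
    consistency. Assumed form, not DPIDM's exact loss.
    attn: (B, F, H, N, N) attention maps over F frames and H heads."""
    return (attn[:, 1:] - attn[:, :-1]).abs().mean()
```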
https://arxiv.org/abs/2505.16980
Single-agent LLMs hit hard limits: finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tuning, self-checking). Grounded in the No-Free-Lunch theorem, KtR gives up the chase for a universal prompt in favor of disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators, with no ever-larger monoliths required.
https://arxiv.org/abs/2505.16979
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar generation challenges, devise 6 metrics, and evaluate 8 diverse LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
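The hybrid loop can be pictured as a standard genetic algorithm whose mutation operator is a prompted LLM call, with fitness measured on the positive and negative examples; the sketch below is an assumed skeleton, not HyGenar's exact procedure.

```python
import random

def fitness(grammar, positives, negatives, parses):
    """`parses(grammar, s) -> bool` is an assumed BNF membership check."""
    pos = sum(parses(grammar, s) for s in positives) / len(positives)
    neg = sum(not parses(grammar, s) for s in negatives) / len(negatives)
    return (pos + neg) / 2          # accept positives, reject negatives

def hybrid_ga(seed_grammars, positives, negatives, parses, llm_mutate,
              generations=20, pop_size=16, elite=4):
    """Sketch of an LLM-driven hybrid GA over BNF grammars.
    `llm_mutate(grammar, positives, negatives) -> grammar` stands in for a
    prompted LLM call; all details here are assumptions."""
    pop = list(seed_grammars)
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, positives, negatives, parses),
                 reverse=True)
        if fitness(pop[0], positives, negatives, parses) == 1.0:
            break                   # a grammar fits all examples
        parents = pop[:elite]
        children = [llm_mutate(random.choice(parents), positives, negatives)
                    for _ in range(pop_size - elite)]
        pop = parents + children
    return pop[0]
```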
https://arxiv.org/abs/2505.16978
Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for the VTON task. Nevertheless, it remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to those on the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way clothing is put on the human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate the strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: this https URL.
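A point-focused diffusion loss can be pictured as the usual noise-prediction MSE upweighted at pixels covered by matched semantic points, as in the sketch below; the weighting form is an assumption.

```python
import torch

def point_focused_loss(eps_pred, eps_true, point_mask, w=5.0):
    """Sketch of a point-focused diffusion loss: standard noise-prediction MSE,
    upweighted where matched semantic points lie so the model is pushed hardest
    where garment detail must survive. Assumed form.
    eps_pred/eps_true: (B, C, H, W); point_mask: (B, 1, H, W) in {0, 1}."""
    err = (eps_pred - eps_true) ** 2
    weight = 1.0 + (w - 1.0) * point_mask        # w at semantic points, 1 elsewhere
    return (weight * err).mean()
```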
https://arxiv.org/abs/2505.16977
Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a multimodal LLM. Technically, the low-frequency component of the low-resolution image is recognized as a global structure prior, encouraging global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object repetition issue. The estimated regional prompts containing rich descriptive details further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.
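Extracting the global structure prior can be pictured as an FFT low-pass over the low-resolution image, as sketched below; the cutoff and filter shape are assumptions.

```python
import torch

def low_frequency_prior(img, keep_ratio=0.25):
    """Sketch of a global structure prior: keep only the lowest spatial
    frequencies of the low-resolution image via an FFT low-pass. The cutoff
    and hard box filter are assumptions. img: (B, C, H, W)."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    B, C, H, W = img.shape
    mask = torch.zeros(H, W, device=img.device)
    h, w = int(H * keep_ratio / 2), int(W * keep_ratio / 2)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0  # central (low) freqs
    f = f * mask
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real
```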
https://arxiv.org/abs/2505.16976
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45\% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the \textit{hard} split, underscoring the value of its high-quality training data. Code is available here \href{this https URL}{this https URL}.
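Turning executable unit tests into an RL reward can be as simple as running the suite and scoring the pass fraction; the pytest command and summary parsing below are assumptions about a typical setup, not SWE-Dev's harness.

```python
import re
import subprocess

def unit_test_reward(repo_dir, timeout=600):
    """Sketch of a unit-test reward: run the developer-authored suite in the
    task's environment and return the pass fraction in [0, 1]. Assumed setup."""
    try:
        proc = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir,
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0                               # hung suite counts as failure
    # Parse the pytest summary line, e.g. "3 failed, 7 passed in 1.2s".
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", proc.stdout))
    failed = sum(int(n) for n in re.findall(r"(\d+) (?:failed|errors?)", proc.stdout))
    total = passed + failed
    return passed / total if total else 0.0
```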
https://arxiv.org/abs/2505.16975
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual rationale for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.
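The reason-then-segment flow can be sketched as two LMM calls composed into a description prompt for the segmentor; every function below is a hypothetical stub, not OpenSeg-R's API.

```python
def reason_then_segment(image, category, lmm, segmentor):
    """Sketch of the coarse-to-fine flow described above. Hypothetical stubs:
    lmm(prompt, image=None) -> str; segmentor(image, prompt) -> mask."""
    # Generic reasoning: what usually identifies this category?
    generic = lmm(f"What visual attributes generally identify a '{category}'?")
    # Image-specific reasoning: what does this instance look like here?
    specific = lmm(f"Describe the '{category}' visible in this image: "
                   "color, parts, location.", image=image)
    # Compose a detailed description prompt for the segmentor.
    prompt = f"{category}: {generic} In this image: {specific}"
    return segmentor(image, prompt)
```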
https://arxiv.org/abs/2505.16974
Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
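Given the model's joint output, the final score reduces to the supported fraction of verifiable claims, as sketched below; the output format of `extract_and_verify` is an assumption standing in for the fine-tuned Llama3.1 8B.

```python
def verifastscore(extract_and_verify, response, evidence):
    """Sketch of the single-pass factuality score: one fine-tuned model call
    jointly extracts claims and labels each against the evidence; the score is
    the supported fraction of verifiable claims. Assumed output format."""
    claims = extract_and_verify(response=response, evidence=evidence)
    # e.g. claims = [("Paris is in France", "supported"),
    #                ("It has 10M residents", "unsupported")]
    verifiable = [(c, v) for c, v in claims if v != "unverifiable"]
    if not verifiable:
        return 0.0, claims
    score = sum(v == "supported" for _, v in verifiable) / len(verifiable)
    return score, claims
```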
https://arxiv.org/abs/2505.16973
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30\%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
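An intelligibility-based filter can be sketched as a round trip: transcribe the synthetic clip with an off-the-shelf ASR model and keep it only if the WER against the input text stays under a budget; the threshold value below is an assumption.

```python
import jiwer            # pip install jiwer; standard WER computation

def keep_synthetic_clip(text, audio, asr_transcribe, wer_threshold=0.15):
    """Sketch of an intelligibility filter: a synthetic clip is kept for ASR
    training only if a recognizer can transcribe it back to the input text
    within a WER budget. `asr_transcribe(audio) -> str` stands in for, e.g.,
    a Whisper call; the threshold is an assumption."""
    hypothesis = asr_transcribe(audio)
    return jiwer.wer(text.lower(), hypothesis.lower()) <= wer_threshold
```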
https://arxiv.org/abs/2505.16972