Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
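To make the "play" loop concrete, below is a minimal, hypothetical sketch of the trial-and-error idea: an LLM proposes trial inputs, the tool's observed behavior is fed back to refine the documentation, and only demonstrations the tool reproduces are kept as validated examples. The `llm` and `call_tool` functions are placeholders, not PLAY2PROMPT's actual components.

```python
# Hypothetical sketch of a PLAY2PROMPT-style "play" loop; `llm` and `call_tool`
# are placeholder stubs, not the paper's actual API.
import json

def call_tool(args):
    """Stand-in for a real external tool; returns an output or raises on bad input."""
    return {"echo": args}

def llm(prompt: str) -> str:
    """Stand-in for an LLM call that returns free-form text."""
    raise NotImplementedError

def play_with_tool(doc: str, rounds: int = 3):
    trials = []                       # (input, output-or-error) observations
    for _ in range(rounds):
        # 1) Ask the LLM to propose a trial input given the current documentation.
        trial = json.loads(llm(f"Documentation:\n{doc}\nPropose one JSON input to try."))
        try:
            out = call_tool(trial)
            trials.append({"input": trial, "output": out})
        except Exception as err:      # failed calls are also informative
            trials.append({"input": trial, "error": str(err)})
        # 2) Refine the documentation from the observed behavior.
        doc = llm(f"Rewrite this documentation so it matches the observed calls.\n"
                  f"Docs:\n{doc}\nObservations:\n{json.dumps(trials, indent=2)}")
    # 3) Keep only demonstrations whose recorded output the tool reproduces (validation).
    examples = [t for t in trials if "output" in t and call_tool(t["input"]) == t["output"]]
    return doc, examples
```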
https://arxiv.org/abs/2503.14432
Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.
https://arxiv.org/abs/2503.14377
Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.
https://arxiv.org/abs/2503.14350
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
https://arxiv.org/abs/2503.14345
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.
https://arxiv.org/abs/2503.14162
Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.
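The final pose-recovery step (2D-3D correspondences plus camera intrinsics into a 6D pose) can be illustrated with OpenCV's PnP solver. The geometry and camera parameters below are made up, and the 2D keypoints are simulated by projecting the 3D points with a known pose so the example stays self-consistent; this is not FEEPE's feature-matching front end.

```python
import numpy as np
import cv2

# 3D points on the end-effector CAD model (metres) -- placeholder geometry.
object_pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                       [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])

# Simulate the 2D keypoints a feature matcher would return, by projecting with a known pose.
rvec_true = np.array([[0.1], [0.2], [0.3]])
tvec_true = np.array([[0.05], [-0.02], [0.6]])
image_pts, _ = cv2.projectPoints(object_pts, rvec_true, tvec_true, K, None)

# 6D pose recovery from the 2D-3D correspondences via PnP (the step performed after
# matching CAD-model points to image features).
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
R, _ = cv2.Rodrigues(rvec)
print("recovered rotation:\n", R, "\nrecovered translation:", tvec.ravel())
```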
https://arxiv.org/abs/2503.14051
Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive training, which is impractical in low-resource settings with inaccessible infrastructure. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions resulted in a modest test score improvement of up to 17.4% on complex matching tasks and 1.6% overall compared to tuning on answer letters alone, offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning on matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model by guiding the model to recall and apply domain-relevant information. Contrasting the obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude highlights that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models. This research also evaluates how merging the quantized adapter with the base model influences the generation quality. Source code and tuned models are available at this https URL.
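A rough sketch of the parameter-efficient setup described above, using a LoRA adapter in the tens-of-millions trainable-parameter range; the model ID, rank, target modules, and training example format are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative LoRA setup in the 20-50M trainable-parameter range the abstract mentions;
# hyperparameters and model IDs here are assumptions, not the authors' exact config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-3B-Instruct"   # any compact open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()              # lands in the tens of millions

# Training examples pair the exam question with a topic plus step-by-step solution,
# so the model is tuned to generate reasoning before the answer letter.
example = "Question: ...\nTopic: ...\nStep-by-step solution: ...\nAnswer: A"
batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()                                 # optimizer step omitted for brevity
```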
https://arxiv.org/abs/2503.13988
Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce SALAD, a skeleton-aware latent diffusion model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available on the project page.
https://arxiv.org/abs/2503.13836
In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.
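A toy sketch of the two ingredients highlighted above: score-level fusion of per-frame audio and visual class scores, and a within-video label-shift prior updated from earlier frames' predictions. The weighting and update rule are simplifications, not $\text{AV}^2\text{A}$'s exact formulation.

```python
# Toy score-level fusion plus a within-video label-shift prior; the exact update rule
# and weighting used by AV^2A are not reproduced here.
import numpy as np

def fuse_scores(audio_logits, visual_logits, alpha=0.5):
    """Combine per-frame class scores at the score level instead of late decision fusion."""
    a = np.exp(audio_logits) / np.exp(audio_logits).sum(-1, keepdims=True)
    v = np.exp(visual_logits) / np.exp(visual_logits).sum(-1, keepdims=True)
    return alpha * a + (1 - alpha) * v               # (T, C) fused class probabilities

def apply_label_shift(frame_probs, momentum=0.9):
    """Re-weight each frame by a running prior built from earlier frames' predictions."""
    num_classes = frame_probs.shape[1]
    prior = np.full(num_classes, 1.0 / num_classes)
    adjusted = []
    for p in frame_probs:
        q = p * prior
        q /= q.sum()
        adjusted.append(q)
        prior = momentum * prior + (1 - momentum) * q  # drift with the video's dynamics
    return np.stack(adjusted)

T, C = 30, 10                                        # 30 frames, 10 candidate event classes
fused = fuse_scores(np.random.randn(T, C), np.random.randn(T, C))
print(apply_label_shift(fused).shape)                # (30, 10)
```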
https://arxiv.org/abs/2503.13693
Methods based on diffusion backbones have recently revolutionized novel view synthesis (NVS). However, those models require pretrained 2D diffusion checkpoints (e.g., Stable Diffusion) as the basis for geometrical priors. Since such checkpoints require exorbitant amounts of data and compute to train, this greatly limits the scalability of diffusion-based NVS models. We present Next-Scale Autoregression Conditioned by View (ArchonView), a method that significantly exceeds state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining. We achieve this by incorporating both global (pose-augmented semantics) and local (multi-scale hierarchical encodings) conditioning into a backbone based on the next-scale autoregression paradigm. Our model also exhibits robust performance even for difficult camera poses where previous methods fail, and is several times faster in inference speed compared to diffusion. We experimentally verify that performance scales with model and dataset size, and conduct extensive demonstration of our method's synthesis quality across several tasks. Our code is open-sourced at this https URL.
https://arxiv.org/abs/2503.13588
Mobile manipulation is the fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models of fixed-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved in the mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability. We design motion planning objectives for the mobile base and the robot arm, which aim at maximizing the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enhance the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. In this way, MoManipVLA can adjust the position of the robot base in a zero-shot manner, thus making the waypoints predicted from the fixed-base VLA models feasible. Extensive experimental results on OVMM and in the real world demonstrate that MoManipVLA achieves a 4.2% higher success rate than the state-of-the-art mobile manipulation methods, and only requires 50 training cost for real-world deployment due to the strong generalization ability of the pre-trained VLA models.
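A toy illustration of the bi-level idea: the upper level searches over base placements while the lower level scores whether the end-effector waypoints predicted by a frozen VLA are reachable from that placement. The reachability test, grid search, and `vla_waypoints` stub are placeholders, not MoManipVLA's actual planner.

```python
# Toy bi-level search in the spirit of MoManipVLA: the outer level proposes a base pose,
# the inner level checks whether the (frozen) VLA's end-effector waypoints are reachable
# from that pose. Reachability and the VLA call are simplified placeholders.
import numpy as np

ARM_REACH = 0.8                                   # assumed reach radius of the arm (metres)

def vla_waypoints():
    """Placeholder for waypoints predicted by a fixed-base VLA policy (world frame)."""
    return np.array([[0.9, 0.2, 0.3], [1.0, 0.1, 0.25], [1.1, 0.0, 0.2]])

def inner_cost(base_xy, waypoints):
    """Lower level: penalise waypoints outside the arm's reach from this base placement."""
    d = np.linalg.norm(waypoints[:, :2] - base_xy, axis=1)
    return np.maximum(d - ARM_REACH, 0.0).sum()

def outer_optimize(waypoints, grid=np.linspace(-0.5, 1.5, 41)):
    """Upper level: coarse search over base positions that makes the trajectory feasible."""
    candidates = np.array([[x, y] for x in grid for y in grid])
    costs = [inner_cost(c, waypoints) for c in candidates]
    return candidates[int(np.argmin(costs))]

best_base = outer_optimize(vla_waypoints())
print("zero-shot base placement:", best_base)
```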
https://arxiv.org/abs/2503.13446
Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria. We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models. Our approach consists of four key components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs). Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels. Fourth, an evaluation module assesses the fine-tuned model's performance on the testing datasets. We evaluated LLM-Match on four open datasets, n2c2, SIGIR, TREC 2021, and TREC 2022, using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models. LLM-Match outperformed all baselines.
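A minimal sketch of the prompt-generation step (the second component), assembling eligibility criteria, retrieved patient context, and system instructions into a single input; the field names and instruction wording are illustrative, not LLM-Match's actual templates.

```python
# Sketch of the prompt-assembly step; criteria, context, and wording are illustrative.
def build_prompt(inclusion, exclusion, patient_context):
    system = ("You are a clinical trial matching assistant. Decide whether the patient "
              "meets the eligibility criteria. Answer 'eligible' or 'not eligible'.")
    criteria = ("Inclusion criteria:\n- " + "\n- ".join(inclusion)
                + "\nExclusion criteria:\n- " + "\n- ".join(exclusion))
    return (f"{system}\n\n{criteria}\n\n"
            f"Patient record (retrieved context):\n{patient_context}\n\nDecision:")

prompt = build_prompt(
    inclusion=["age >= 18", "confirmed type 2 diabetes"],
    exclusion=["pregnancy", "eGFR < 30 mL/min"],
    patient_context="62-year-old male, T2DM since 2015, eGFR 55.",
)
print(prompt)   # this structured prompt, paired with a ground-truth label, feeds the
                # fine-tuning module with its classification head
```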
https://arxiv.org/abs/2503.13281
Video anomaly detection models aim to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask is considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly score. Therefore, we propose LaGoVAD (Language-guided Open-world VAD), a model that dynamically adapts anomaly definitions through two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide given labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate SOTA performance. Data and code will be released.
https://arxiv.org/abs/2503.13160
Due to the large volume of medical imaging data, advanced AI methodologies are needed to assist radiologists in diagnosing thoracic diseases from chest X-rays (CXRs). Existing deep learning models often require large, labeled datasets, which are scarce in medical imaging due to the time-consuming and expert-driven annotation process. In this paper, we extend the existing approach to enhance zero-shot learning in medical imaging by integrating Contrastive Language-Image Pre-training (CLIP) with Momentum Contrast (MoCo), resulting in our proposed model, MoCoCLIP. Our method addresses challenges posed by class-imbalanced and unlabeled datasets, enabling improved detection of pulmonary pathologies. Experimental results on the NIH ChestXray14 dataset demonstrate that MoCoCLIP outperforms the state-of-the-art CheXZero model, achieving relative improvement of approximately 6.5%. Furthermore, on the CheXpert dataset, MoCoCLIP demonstrates superior zero-shot performance, achieving an average AUC of 0.750 compared to CheXZero with 0.746 AUC, highlighting its enhanced generalization capabilities on unseen data.
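The MoCo ingredient in isolation, as one way to picture what MoCoCLIP adds on top of CLIP: a momentum-updated key encoder and a queue of negative embeddings feeding an InfoNCE loss. The encoders below are stand-in linear layers and the CLIP text branch is omitted entirely.

```python
# The MoCo mechanism in isolation: a momentum ("key") encoder tracking the query encoder,
# plus a queue of negatives; the CLIP text tower and real image backbones are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, queue_len, m, tau = 128, 4096, 0.999, 0.07
encoder_q = nn.Sequential(nn.Linear(512, dim))           # stand-ins for image encoders
encoder_k = nn.Sequential(nn.Linear(512, dim))
encoder_k.load_state_dict(encoder_q.state_dict())
for p in encoder_k.parameters():
    p.requires_grad = False
queue = F.normalize(torch.randn(queue_len, dim), dim=1)  # negatives from past batches

def momentum_update():
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

def moco_loss(x_q, x_k):
    q = F.normalize(encoder_q(x_q), dim=1)                # queries
    with torch.no_grad():
        momentum_update()
        k = F.normalize(encoder_k(x_k), dim=1)            # positive keys
    l_pos = (q * k).sum(dim=1, keepdim=True)              # (N, 1)
    l_neg = q @ queue.t()                                  # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)      # positives sit at index 0
    return F.cross_entropy(logits, labels)

print(moco_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```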
https://arxiv.org/abs/2503.13134
Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.
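A hedged sketch of a zero-shot pairwise judge prompt in the spirit of the evaluation above; the error-type list is abbreviated and `judge_llm` is a placeholder for whichever judge model is plugged in.

```python
# Zero-shot pairwise-judge prompt in the style the paper evaluates; the error-type list
# is a short illustrative subset and `judge_llm` is a stub for the actual judge model.
ERROR_TYPES = ["factual accuracy", "instruction following", "fluency", "coherence"]

def build_judge_prompt(query: str, answer_a: str, answer_b: str) -> str:
    criteria = "\n".join(f"- {e}" for e in ERROR_TYPES)
    return (
        "You are comparing two answers to a Russian user query.\n"
        f"Query: {query}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        f"For each criterion below, say which answer is better (A, B, or tie):\n{criteria}\n"
        "Finally, state an overall preference as 'Overall: A', 'Overall: B', or 'Overall: tie'."
    )

def judge_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any judge model here

prompt = build_judge_prompt("Кто написал 'Мастер и Маргарита'?",
                            "Михаил Булгаков.", "Лев Толстой.")
print(prompt)
```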
https://arxiv.org/abs/2503.13102
Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs' world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: this https URL.
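A sketch of the mark-annotation step: numbered marks drawn at detected object keypoints let the VLM reason over "object 1", "object 2", and so on. The keypoint detections, bin image, and downstream VLM call are placeholders, not FreeGrasp's actual pipeline.

```python
# Mark annotation for VLM spatial reasoning; detections and the VLM call are placeholders.
import numpy as np
import cv2

def annotate_marks(image: np.ndarray, keypoints):
    """Draw an indexed circle at each detected object keypoint."""
    marked = image.copy()
    for idx, (x, y) in enumerate(keypoints, start=1):
        cv2.circle(marked, (int(x), int(y)), 12, (0, 255, 0), 2)
        cv2.putText(marked, str(idx), (int(x) + 14, int(y) + 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return marked

image = np.zeros((480, 640, 3), dtype=np.uint8)          # placeholder bin image
keypoints = [(200, 300), (260, 310), (400, 250)]         # placeholder detections
marked = annotate_marks(image, keypoints)

instruction = "Pick up the red mug."
vlm_prompt = (f"The image shows a cluttered bin with numbered marks on each object. "
              f"Instruction: '{instruction}'. Which mark should be grasped first, and must "
              f"any other marked object be removed before it is reachable?")
# `marked` and `vlm_prompt` would then be sent to a VLM such as GPT-4o for zero-shot reasoning.
```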
https://arxiv.org/abs/2503.13082
CLIP has demonstrated exceptional image-text matching capabilities due to its training on contrastive learning tasks. Past research has suggested that whereas CLIP effectively matches text to images when the matching can be achieved just by matching the text with the objects in the image, CLIP struggles when the matching depends on representing the relationship among the objects in the images (i.e., inferring relations). Previous attempts to address this limitation by training CLIP on relation detection datasets with only linguistic supervision have met with limited success. In this paper, we offer insights and practical methods to advance the field of relation inference from images. This paper approaches the task of creating a model that effectively detects relations among the objects in images by producing text and image embeddings that capture relationships through linguistic supervision. To this end, we propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which augments the COCO dataset, fine-tunes CLIP with hard negatives subject-relation-object triples and corresponding images, and introduces a novel loss function to improve relation detection. Evaluated on multiple CLIP-based models, our method significantly improves zero-shot relation inference accuracy in both frozen and fine-tuned settings, significantly outperforming CLIP and state-of-the-art models while generalizing well on unseen data.
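An illustrative contrastive step with relation-swapped hard negatives (e.g., "man riding horse" versus "horse riding man"); the loss below is a simplified stand-in, not DRIVE's exact objective, and in practice the embeddings would come from CLIP's image and text towers over COCO-derived triples.

```python
# Contrastive step with relation-swapped hard negatives; a simplification of the idea,
# not DRIVE's actual loss function.
import torch
import torch.nn.functional as F

def relation_loss(img_emb, pos_txt_emb, hard_neg_txt_emb, tau=0.07):
    """Pull image embeddings toward the correct S-R-O caption and away from the swapped one."""
    img = F.normalize(img_emb, dim=1)
    pos = F.normalize(pos_txt_emb, dim=1)
    neg = F.normalize(hard_neg_txt_emb, dim=1)
    l_pos = (img * pos).sum(dim=1, keepdim=True)          # similarity to correct relation
    l_neg = (img * neg).sum(dim=1, keepdim=True)          # similarity to swapped relation
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(img.size(0), dtype=torch.long)   # correct caption sits at index 0
    return F.cross_entropy(logits, labels)

# Placeholder embeddings; real ones would encode images and subject-relation-object captions.
print(relation_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)).item())
```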
https://arxiv.org/abs/2503.13021
Recently, zero-shot anomaly detection (ZSAD) has emerged as a pivotal paradigm for identifying defects in unseen categories without requiring target samples in the training phase. However, existing ZSAD methods struggle with the boundaries of small and complex defects due to insufficient representations. Most of them use a single manually designed prompt, which fails to generalize to diverse objects and anomalies. In this paper, we propose MFP-CLIP, a novel prompt-based CLIP framework which explores the efficacy of multi-form prompts for zero-shot industrial anomaly detection. We employ an image-to-text prompting (I2TP) mechanism to better represent the object in the image. MFP-CLIP enhances perception of multi-scale and complex anomalies through self-prompting (SP) and a multi-patch feature aggregation (MPFA) module. To precisely localize defects, we introduce a mask prompting (MP) module to guide the model to focus on potential anomaly regions. Extensive experiments on two widely used industrial anomaly detection benchmarks, MVTecAD and VisA, demonstrate MFP-CLIP's superiority in ZSAD.
https://arxiv.org/abs/2503.12910
Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, a sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the direction of the samples' representations while ignoring their norms. To address this issue, we propose a new training objective that improves unsupervised contrastive learning by constraining the norm features between positive samples. We combine this tensor norm constraint with ensemble learning to propose a new sentence embedding framework, TNCSE. We evaluate it on seven semantic textual similarity tasks, and the results show that TNCSE and its derived models are the current state of the art; in addition, we conduct extensive zero-shot evaluations, which show that TNCSE outperforms other baselines.
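A toy version of the idea: add a norm-consistency term between two dropout-augmented views of the same sentences on top of a standard in-batch contrastive loss. The weighting and exact constraint differ from TNCSE's formulation.

```python
# Toy combination of an unsupervised contrastive term with a norm-consistency term between
# positive pairs; the weighting and exact constraint differ from TNCSE.
import torch
import torch.nn.functional as F

def tnc_style_loss(z1, z2, tau=0.05, lam=0.1):
    """z1, z2: two dropout-augmented embeddings of the same sentences, shape (N, D)."""
    # Directional (contrastive) part: in-batch InfoNCE on normalised embeddings.
    sim = F.normalize(z1, dim=1) @ F.normalize(z2, dim=1).t() / tau
    labels = torch.arange(z1.size(0))
    contrastive = F.cross_entropy(sim, labels)
    # Norm part: encourage positive pairs to also agree in vector length.
    norm_gap = (z1.norm(dim=1) - z2.norm(dim=1)).abs().mean()
    return contrastive + lam * norm_gap

print(tnc_style_loss(torch.randn(16, 768), torch.randn(16, 768)).item())
```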
https://arxiv.org/abs/2503.12739
Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency.
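A very rough skeleton of the plug-and-play pattern for a linear inverse problem y = Ax + noise: alternate a fast generative refinement with a data-consistency step over a handful of evaluations. The `lcm_denoise` stub and the plain gradient update below are placeholders and differ from LATINO's actual conditioning mechanism (which avoids automatic differentiation) and from its prompt-calibration scheme.

```python
# Generic plug-and-play loop for y = A x + noise; `lcm_denoise` is a placeholder and the
# update is a plain gradient data-consistency step, not LATINO's conditioning rule.
import numpy as np

def lcm_denoise(x, prompt):
    """Placeholder for a latent-consistency-model refinement of the current estimate."""
    return x  # identity stand-in

def pnp_solve(y, A, prompt, steps=8, step_size=0.1):
    x = A.T @ y                                   # crude initialisation
    for _ in range(steps):                        # a handful of evaluations, as in the abstract
        x = lcm_denoise(x, prompt)                # prior step (generative model)
        x = x - step_size * A.T @ (A @ x - y)     # data-consistency (gradient) step
    return x

A = np.random.randn(50, 100) / np.sqrt(50)        # toy under-determined forward operator
x_true = np.random.randn(100)
y = A @ x_true
print(pnp_solve(y, A, prompt="a photograph").shape)   # (100,)
```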
https://arxiv.org/abs/2503.12615