The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
https://arxiv.org/abs/2502.03459
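A minimal sketch of the CLIP-style skeleton-text alignment that a model like SkeletonCLIP builds on, for readers who want a concrete picture; the toy skeleton encoder, temperature, and batch construction are illustrative assumptions, not the SKI authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonEncoder(nn.Module):
    """Toy skeleton encoder: flattens (T, J, 3) joint sequences into an embedding."""
    def __init__(self, num_frames=32, num_joints=25, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(num_frames * num_joints * 3, 1024),
            nn.ReLU(),
            nn.Linear(1024, dim),
        )

    def forward(self, skel):                 # skel: (B, T, J, 3)
        return F.normalize(self.net(skel), dim=-1)

def skeleton_text_contrastive_loss(skel_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning skeleton and text embeddings (CLIP-style)."""
    logits = skel_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(skel_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: align a batch of skeleton clips with action-label text embeddings
# (text_emb stands in for frozen CLIP text features of the class prompts).
encoder = SkeletonEncoder()
skel = torch.randn(8, 32, 25, 3)                      # 8 clips, 32 frames, 25 joints
text_emb = F.normalize(torch.randn(8, 512), dim=-1)
loss = skeleton_text_contrastive_loss(encoder(skel), text_emb)
loss.backward()
```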
Zero-shot prompting techniques have significantly improved the performance of Large Language Models (LLMs). However, we lack a clear understanding of why zero-shot prompts are so effective. For example, in the prompt "Let's think step-by-step," is "think" or "step-by-step" more crucial to its success? Existing interpretability methods, such as gradient-based and attention-based approaches, are computationally intensive and restricted to open-source models. We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models, based on systematic input word perturbations. Our experiments across four recent LLMs, seven widely-used prompts, and several tasks reveal interesting patterns in word importance. For instance, while both 'step-by-step' and 'think' show high ZIP scores, which one is more influential depends on the model and task. We validate our method using controlled experiments and compare our results with human judgments, finding that proprietary models align more closely with human intuition regarding word significance. These findings enhance our understanding of LLM behavior and contribute to developing more effective zero-shot prompts and improved model analysis.
https://arxiv.org/abs/2502.03418
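The ZIP score above is defined by its authors; the sketch below only illustrates the general perturbation recipe it builds on — delete one prompt word at a time and measure the performance drop, which needs nothing beyond input/output access to the model. The `evaluate_prompt` callback and the deletion-only perturbation are placeholder assumptions:

```python
from typing import Callable, Dict

def perturbation_importance(prompt: str,
                            evaluate_prompt: Callable[[str], float]) -> Dict[str, float]:
    """For each word, report the score drop when that word is deleted from the prompt.

    evaluate_prompt(prompt) -> task accuracy (or any scalar metric) obtained by running
    the target LLM with that zero-shot prompt on an evaluation set; this works for
    closed-source models because no gradients or attention maps are required.
    """
    words = prompt.split()
    baseline = evaluate_prompt(prompt)
    importance = {}
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])   # drop one word
        importance[word] = baseline - evaluate_prompt(perturbed)
    return importance

# Toy usage with a stand-in evaluator (a real one would call the LLM on a task set).
def fake_evaluator(p: str) -> float:
    return 0.80 if "step-by-step" in p else 0.65

print(perturbation_importance("Let's think step-by-step", fake_evaluator))
```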
Time series forecasting is essential for operational intelligence in the hospitality industry, and particularly challenging in large-scale, distributed systems. This study evaluates the performance of statistical, machine learning (ML), deep learning, and foundation models in forecasting hourly sales over a 14-day horizon using real-world data from a network of thousands of restaurants across Germany. The forecasting solution includes features such as weather conditions, calendar events, and time-of-day patterns. Results demonstrate the strong performance of ML-based meta-models and highlight the emerging potential of foundation models like Chronos and TimesFM, which deliver competitive performance with minimal feature engineering, leveraging only the pre-trained model (zero-shot inference). Additionally, a hybrid PySpark-Pandas approach proves to be a robust solution for achieving horizontal scalability in large-scale deployments.
https://arxiv.org/abs/2502.03395
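A minimal sketch of the hybrid PySpark-Pandas pattern the study relies on for horizontal scalability: Spark fans out over restaurants while each group is forecast in plain pandas inside `applyInPandas`. The input path, column names, schema, and the naive seasonal forecaster are illustrative assumptions:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-sales-forecast").getOrCreate()

def forecast_one_restaurant(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs entirely in pandas on one restaurant's history; Spark handles the fan-out.

    Placeholder model: repeat the last observed week over a 14-day hourly horizon.
    A real deployment would plug in an ML meta-model or a foundation model here.
    """
    pdf = pdf.sort_values("ts")
    last_week = pdf["sales"].tail(7 * 24).to_numpy()
    horizon = pd.date_range(pdf["ts"].max() + pd.Timedelta(hours=1),
                            periods=14 * 24, freq="h")
    preds = [last_week[i % len(last_week)] for i in range(len(horizon))]
    return pd.DataFrame({"restaurant_id": pdf["restaurant_id"].iloc[0],
                         "ts": horizon, "forecast": preds})

sales = spark.read.parquet("hourly_sales.parquet")   # assumed columns: restaurant_id, ts, sales
forecasts = (sales.groupBy("restaurant_id")
                  .applyInPandas(forecast_one_restaurant,
                                 schema="restaurant_id long, ts timestamp, forecast double"))
forecasts.write.mode("overwrite").parquet("forecasts.parquet")
```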
Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the Segment Anything Model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention-based features from the self-supervised ViT to filter non-object masks, and (3) applying K-Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self-collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at this https URL.
https://arxiv.org/abs/2502.03266
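A self-contained sketch of stage (3) of ZISVFM as described above: clustering foreground pixel coordinates with K-Medoids to obtain point prompts for SAM. The tiny K-Medoids routine, the subsampling, and the synthetic mask are placeholders; the actual framework clusters within ViT-refined mask proposals:

```python
import numpy as np

def k_medoids(points: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Minimal alternating K-Medoids: returns k medoid points (rows of `points`)."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(points), size=k, replace=False)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members) == 0:
                continue
            intra = np.linalg.norm(points[members][:, None, :] -
                                   points[members][None, :, :], axis=-1).sum(axis=1)
            new_idx[c] = members[intra.argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return points[medoid_idx]

# Toy usage: turn a (refined) binary object mask into point prompts for SAM.
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:420] = True                       # pretend this is a refined proposal
ys, xs = np.nonzero(mask)
coords = np.stack([xs, ys], axis=1).astype(float)   # SAM expects (x, y) point prompts
coords = coords[::25]                               # subsample pixels for a cheap sketch
point_prompts = k_medoids(coords, k=3)
print(point_prompts)    # each row is a candidate point prompt inside the object
```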
We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at this https URL.
https://arxiv.org/abs/2502.03128
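A minimal sketch of masked generative pre-training over discrete SSL tokens, the core recipe described above: mask a random subset of positions and train the model to predict them, computing the loss only on masked positions. The toy Transformer, vocabulary size, and mask ratio are assumptions, not Metis's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, DIM = 1024, 1024, 256   # SSL-token vocabulary plus one [MASK] id

class MaskedTokenModel(nn.Module):
    """Toy bidirectional token model standing in for the pre-training backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                       # tokens: (B, T) of SSL-token ids
        return self.head(self.encoder(self.embed(tokens)))

def masked_generative_loss(model, tokens, mask_ratio=0.5):
    """Mask a random subset of positions and score predictions only on those positions."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], tokens[mask])

model = MaskedTokenModel()
ssl_tokens = torch.randint(0, VOCAB, (4, 128))       # 4 utterances, 128 SSL tokens each
loss = masked_generative_loss(model, ssl_tokens)
loss.backward()
```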
While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, understanding and validating their knowledge utilization remains challenging. Chain-of-thought (CoT) prompting partially addresses this by revealing intermediate reasoning steps, but the knowledge flow and application remain implicit. We introduce IAO (Input-Action-Output) prompting, a structured template-based method that explicitly models how LLMs access and apply their knowledge during complex reasoning tasks. IAO decomposes problems into sequential steps, each clearly identifying the input knowledge being used, the action being performed, and the resulting output. This structured decomposition enables us to trace knowledge flow, verify factual consistency, and identify potential knowledge gaps or misapplications. Through experiments across diverse reasoning tasks, we demonstrate that IAO not only improves zero-shot performance but also provides transparency in how LLMs leverage their stored knowledge. Human evaluation confirms that this structured approach enhances our ability to verify knowledge utilization and detect potential hallucinations or reasoning errors. Our findings provide insights into both knowledge representation within LLMs and methods for more reliable knowledge application.
https://arxiv.org/abs/2502.03080
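The abstract does not give the authors' exact template, so the sketch below is only one plausible rendering of an IAO-style structured prompt, with explicit Input/Action/Output fields per step that can later be parsed to trace knowledge flow:

```python
IAO_TEMPLATE = """Solve the problem by breaking it into explicit steps.
For every step, fill in:
Input: the knowledge or intermediate result you are using
Action: the operation you perform on that input
Output: the result produced by that action
After the final step, state the answer on a line starting with "Answer:".

Problem: {question}
"""

def build_iao_prompt(question: str) -> str:
    return IAO_TEMPLATE.format(question=question)

print(build_iao_prompt("A train travels 120 km in 2 hours. What is its average speed?"))
# The structured Input/Action/Output trace in the model's reply can then be checked
# step by step to verify which facts were used and where an error first appears.
```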
Integrating human feedback to align text-to-speech (TTS) system outputs with human preferences has proven to be an effective approach for enhancing the robustness of language model-based TTS systems. Current approaches primarily focus on using preference data annotated at the utterance level. However, frequent issues that affect the listening experience often only arise in specific segments of audio samples, while other segments are well-generated. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO focuses on addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. Specifically, we first analyze the types of issues in generated samples, categorize them into two groups, and propose a selective training loss strategy to optimize preferences based on fine-grained labels for each issue type. Experimental results show that FPO enhances the robustness of zero-shot TTS systems by effectively addressing local issues, significantly reducing the bad case ratio, and improving intelligibility. Furthermore, FPO exhibits superior data efficiency compared with baseline systems, achieving similar performance with fewer training samples.
https://arxiv.org/abs/2502.02950
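A minimal sketch of the selective-loss idea behind FPO as described above: fine-grained labels mark which token spans belong to an annotated issue, and those spans dominate the training loss instead of the whole utterance being optimized uniformly. The weighting scheme, tensor shapes, and single-category mask are assumptions:

```python
import torch
import torch.nn.functional as F

def selective_token_loss(logits, targets, issue_mask, issue_weight=1.0, other_weight=0.1):
    """Cross-entropy weighted by fine-grained issue labels.

    logits:     (B, T, V) token predictions from the TTS language model
    targets:    (B, T)    preferred target tokens
    issue_mask: (B, T)    1 where a localized issue (e.g. a mispronounced segment) was annotated
    Tokens inside annotated issue spans dominate the loss; well-generated spans
    contribute only weakly instead of being optimized uniformly.
    """
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    weights = torch.where(issue_mask.bool(),
                          torch.full_like(per_token, issue_weight),
                          torch.full_like(per_token, other_weight))
    return (weights * per_token).sum() / weights.sum()

# Toy usage
logits = torch.randn(2, 10, 512, requires_grad=True)
targets = torch.randint(0, 512, (2, 10))
issue_mask = torch.zeros(2, 10)
issue_mask[0, 3:6] = 1            # a bad segment annotated in sample 0
loss = selective_token_loss(logits, targets, issue_mask)
loss.backward()
```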
Segment Anything Model 2 (SAM 2), a prompt-driven foundation model extending SAM to both image and video domains, has shown superior zero-shot performance compared to its predecessor. Building on SAM's success in medical image segmentation, SAM 2 presents significant potential for further advancement. However, similar to SAM, SAM 2 is limited by its output of binary masks, inability to infer semantic labels, and dependence on precise prompts for the target object area. Additionally, direct application of SAM and SAM 2 to medical image segmentation tasks yields suboptimal results. In this paper, we explore the upper performance limit of SAM 2 using custom fine-tuning adapters, achieving a Dice Similarity Coefficient (DSC) of 92.30% on the BTCV dataset, surpassing the state-of-the-art nnUNet by 12%. Following this, we address the prompt dependency by investigating various prompt generators. We introduce a UNet to autonomously generate predicted masks and bounding boxes, which serve as input to SAM 2. Subsequent dual-stage refinements by SAM 2 further enhance performance. Extensive experiments show that our method achieves state-of-the-art results on the AMOS2022 dataset, with a Dice improvement of 2.9% compared to nnUNet, and outperforms nnUNet by 6.4% on the BTCV dataset.
https://arxiv.org/abs/2502.02741
Natural Language Processing (NLP) has seen remarkable advances in recent years, particularly with the emergence of Large Language Models that have achieved unprecedented performance across many tasks. However, these developments have mainly benefited a small number of high-resource languages such as English. The majority of languages still face significant challenges due to the scarcity of training data and computational resources. To address this issue, this thesis focuses on cross-lingual transfer learning, a research area aimed at leveraging data and models from high-resource languages to improve NLP performance for low-resource languages. Specifically, we focus on Sequence Labeling tasks such as Named Entity Recognition, Opinion Target Extraction, and Argument Mining. The research is structured around three main objectives: (1) advancing data-based cross-lingual transfer learning methods through improved translation and annotation projection techniques, (2) developing enhanced model-based transfer learning approaches utilizing state-of-the-art multilingual models, and (3) applying these methods to real-world problems while creating open-source resources that facilitate future research in low-resource NLP. More specifically, this thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method that leverages text-to-text multilingual models and machine translation systems. T-Projection significantly outperforms previous annotation projection methods by a wide margin. For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings using text-to-text models. Finally, we develop Medical mT5, the first multilingual text-to-text medical model, demonstrating the practical impact of our research on real-world applications.
https://arxiv.org/abs/2502.02722
Robot learning requires a considerable amount of high-quality data to realize the promise of generalization. However, large data sets are costly to collect in the real world. Physics simulators can cheaply generate vast data sets with broad coverage over states, actions, and environments. However, physics engines are fundamentally misspecified approximations to reality. This makes direct zero-shot transfer from simulation to reality challenging, especially in tasks where precise and force-sensitive manipulation is necessary. Thus, fine-tuning these policies with small real-world data sets is an appealing pathway for scaling robot learning. However, current reinforcement learning fine-tuning frameworks leverage general, unstructured exploration strategies which are too inefficient to make real-world adaptation practical. This paper introduces the Simulation-Guided Fine-tuning (SGFT) framework, which demonstrates how to extract structural priors from physics simulators to substantially accelerate real-world adaptation. Specifically, our approach uses a value function learned in simulation to guide real-world exploration. We demonstrate this approach across five real-world dexterous manipulation tasks where zero-shot sim-to-real transfer fails. We further demonstrate our framework substantially outperforms baseline fine-tuning methods, requiring up to an order of magnitude fewer real-world samples and succeeding at difficult tasks where prior approaches fail entirely. Last but not least, we provide theoretical justification for this new paradigm which underpins how SGFT can rapidly learn high-performance policies in the face of large sim-to-real dynamics gaps. Project webpage: this https URL
https://arxiv.org/abs/2502.02705
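One concrete way to wire a simulation-learned value function into real-world exploration is potential-based reward shaping, sketched below; this is an illustrative reading of the guidance idea, not the authors' exact SGFT algorithm, and the critic architecture and shaping form are assumptions:

```python
import torch
import torch.nn as nn

class SimValueFunction(nn.Module):
    """Stand-in for a critic trained in the physics simulator (frozen at fine-tuning time)."""
    def __init__(self, obs_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def shaped_reward(v_sim, obs, next_obs, real_reward, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * V_sim(s') - V_sim(s).

    The frozen simulation critic acts as a structural prior, biasing real-world
    exploration toward states it scored highly while preserving the ordering of
    optimal policies (the classic potential-based shaping guarantee).
    """
    with torch.no_grad():
        bonus = gamma * v_sim(next_obs) - v_sim(obs)
    return real_reward + bonus

v_sim = SimValueFunction()
obs, next_obs = torch.randn(4, 16), torch.randn(4, 16)
real_reward = torch.tensor([0.0, 0.0, 1.0, 0.0])
print(shaped_reward(v_sim, obs, next_obs, real_reward))
```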
Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.
https://arxiv.org/abs/2502.02683
Using extensive training data from SA-1B, the Segment Anything Model (SAM) has demonstrated exceptional generalization and zero-shot capabilities, attracting widespread attention in areas such as medical image segmentation and remote sensing image segmentation. However, its performance in the field of image manipulation detection remains largely unexplored and unconfirmed. There are two main challenges in applying SAM to image manipulation detection: a) reliance on manual prompts, and b) the difficulty of supporting cross-dataset generalization with single-view information alone. To address these challenges, we develop a cross-view prompt learning paradigm called IMDPrompter based on SAM. Benefiting from the design of automated prompts, IMDPrompter no longer relies on manual guidance, enabling automated detection and localization. Additionally, we propose components such as Cross-view Feature Perception, Optimal Prompt Selection, and Cross-View Prompt Consistency, which facilitate cross-view perceptual learning and guide SAM to generate accurate masks. Extensive experimental results from five datasets (CASIA, Columbia, Coverage, IMD2020, and NIST16) validate the effectiveness of our proposed method.
https://arxiv.org/abs/2502.02454
Large Language Models (LLMs) have gained attention for addressing coding problems, but their effectiveness in fixing code maintainability remains unclear. This study evaluates LLMs' capability to resolve 127 maintainability issues from 10 GitHub repositories. We use zero-shot prompting for Copilot Chat and Llama 3.1, and few-shot prompting with Llama only. The LLM-generated solutions are assessed for compilation errors, test failures, and new maintainability problems. Llama with few-shot prompting successfully fixed 44.9% of the methods, while Copilot Chat and Llama zero-shot fixed 32.29% and 30%, respectively. However, most solutions introduced errors or new maintainability issues. We also conducted a human study with 45 participants to evaluate the readability of 51 LLM-generated solutions. The human study showed that 68.63% of participants observed improved readability. Overall, while LLMs show potential for fixing maintainability issues, their introduction of errors highlights their current limitations.
https://arxiv.org/abs/2502.02368
The rapid advancements in vision-language models (VLMs), such as CLIP, have intensified the need to address distribution shifts between training and testing datasets. Although prior Test-Time Training (TTT) techniques for VLMs have demonstrated robust performance, they predominantly rely on tuning text prompts, a process that demands substantial computational resources and is heavily dependent on entropy-based loss. In this paper, we propose LoRA-TTT, a novel TTT method that leverages Low-Rank Adaptation (LoRA), applied exclusively to the image encoder of VLMs. By introducing LoRA and updating only its parameters during test time, our method offers a simple yet effective TTT approach, retaining the model's initial generalization capability while achieving substantial performance gains with minimal memory and runtime overhead. Additionally, we introduce a highly efficient reconstruction loss tailored for TTT. Our method can adapt to diverse domains by combining these two losses, without increasing memory consumption or runtime. Extensive experiments on two benchmarks, covering 15 datasets, demonstrate that our method improves the zero-shot top-1 accuracy of CLIP-ViT-B/16 by an average of 5.79% on the OOD benchmark and 1.36% on the fine-grained benchmark, efficiently surpassing test-time prompt tuning, without relying on any external models or cache.
https://arxiv.org/abs/2502.02069
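A minimal sketch of test-time LoRA adaptation on an image-encoder projection with entropy minimization over CLIP-style zero-shot logits; the rank, learning rate, step count, and simplified CLIP interface are assumptions, and the paper's reconstruction loss is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B @ A) x."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # starts as a zero delta

    def forward(self, x):
        return self.base(x) + F.linear(x, self.B @ self.A)

def test_time_adapt(image_proj: LoRALinear, image_feats, text_emb, steps=3, lr=1e-3):
    """Entropy minimization over zero-shot class logits, updating only the LoRA parameters."""
    opt = torch.optim.AdamW([image_proj.A, image_proj.B], lr=lr)
    for _ in range(steps):
        img_emb = F.normalize(image_proj(image_feats), dim=-1)
        logits = 100.0 * img_emb @ text_emb.t()
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return image_proj

# Toy usage: adapt a stand-in final projection of the image encoder on one test batch.
proj = LoRALinear(nn.Linear(768, 512))
image_feats = torch.randn(8, 768)                     # augmented views of a test image
text_emb = F.normalize(torch.randn(10, 512), dim=-1)  # frozen class-name text embeddings
test_time_adapt(proj, image_feats, text_emb)
```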
Multi-agent reinforcement learning (MARL) has made significant progress, largely fueled by the development of specialized testbeds that enable systematic evaluation of algorithms in controlled yet challenging scenarios. However, existing testbeds often focus on purely virtual simulations or limited robot morphologies such as robotic arms, quadrupeds, and humanoids, leaving high-mobility platforms with real-world physical constraints like drones underexplored. To bridge this gap, we present VolleyBots, a new MARL testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots features a turn-based interaction model under volleyball rules, a hierarchical decision-making process that combines motion control and strategic play, and a high-fidelity simulation for seamless sim-to-real transfer. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative MARL and game-theoretic algorithms. Results in simulation show that while existing algorithms handle simple tasks effectively, they encounter difficulty in complex tasks that require both low-level control and high-level strategy. We further demonstrate zero-shot deployment of a simulation-learned policy to real-world drones, highlighting VolleyBots' potential to propel MARL research involving agile robotic platforms. The project page is at this https URL.
https://arxiv.org/abs/2502.01932
The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a dataset of 2 million underwater image-text pairs using heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pretraining loss to align visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.
https://arxiv.org/abs/2502.01785
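A minimal sketch of a prompt-guided vision encoder in the spirit described above: learnable prompt tokens cross-attend to ViT patch features and are pooled into the image embedding that would feed a CLIP-style contrastive loss. The number of prompts, the single attention block, and the mean pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGuidedAggregator(nn.Module):
    """Learnable prompts cross-attend to ViT patch features and pool them into one embedding."""
    def __init__(self, dim=512, num_prompts=8, num_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_feats):                   # patch_feats: (B, N_patches, dim)
        B = patch_feats.size(0)
        queries = self.prompts.unsqueeze(0).expand(B, -1, -1)
        attended, _ = self.attn(queries, patch_feats, patch_feats)
        image_emb = self.proj(attended.mean(dim=1))   # pool the prompt outputs
        return F.normalize(image_emb, dim=-1)

# Toy usage: the pooled embedding would then enter a contrastive loss against
# text embeddings of the paired underwater captions.
agg = PromptGuidedAggregator()
patches = torch.randn(4, 196, 512)                    # 14x14 patch grid from a ViT
print(agg(patches).shape)                             # torch.Size([4, 512])
```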
We propose a novel family of language models, Latent-Thought Language Models (LTMs), which incorporate explicit latent thought vectors that follow an explicit prior model in latent space. These latent thought vectors guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in conditional and unconditional text generation.
https://arxiv.org/abs/2502.01567
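A minimal sketch of the dual-rate variational recipe described above: per-batch local variational parameters over the latent thought vectors take many fast gradient steps, then the shared decoder takes one slow step. The toy GRU decoder standing in for the Transformer, the step counts, learning rates, and KL weight are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB, LATENTS = 64, 100, 4

class LatentConditionedDecoder(nn.Module):
    """Toy autoregressive decoder conditioned on latent thought vectors (stand-in for a Transformer)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(2 * DIM, VOCAB)

    def forward(self, tokens, z):                     # z: (B, LATENTS, DIM)
        h, _ = self.rnn(self.embed(tokens[:, :-1]))
        ctx = z.mean(dim=1, keepdim=True).expand(-1, h.size(1), -1)
        logits = self.head(torch.cat([h, ctx], dim=-1))
        return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

decoder = LatentConditionedDecoder()
slow_opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)   # slow: global decoder
tokens = torch.randint(0, VOCAB, (8, 32))                     # one training batch

# Fast: local variational parameters for this batch's posterior over latent thoughts.
mu = torch.zeros(8, LATENTS, DIM, requires_grad=True)
logvar = torch.zeros(8, LATENTS, DIM, requires_grad=True)
fast_opt = torch.optim.Adam([mu, logvar], lr=1e-2)

for _ in range(16):                                           # many fast inner steps
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterized sample
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=(1, 2)).mean()
    loss = decoder(tokens, z) + 1e-3 * kl
    fast_opt.zero_grad()
    loss.backward()
    fast_opt.step()

slow_opt.zero_grad()                                          # one slow outer step
z = mu.detach() + torch.randn_like(mu) * (0.5 * logvar.detach()).exp()
decoder(tokens, z).backward()
slow_opt.step()
```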
In reinforcement learning (RL), agents often struggle to perform well on tasks that differ from those encountered during training. This limitation presents a challenge to the broader deployment of RL in diverse and dynamic task settings. In this work, we introduce memory augmentation, a memory-based RL approach to improve task generalization. Our approach leverages task-structured augmentations to simulate plausible out-of-distribution scenarios and incorporates memory mechanisms to enable context-aware policy adaptation. Trained on a predefined set of tasks, our policy demonstrates the ability to generalize to unseen tasks through memory augmentation without requiring additional interactions with the environment. Through extensive simulation experiments and real-world hardware evaluations on legged locomotion tasks, we demonstrate that our approach achieves zero-shot generalization to unseen tasks while maintaining robust in-distribution performance and high sample efficiency.
https://arxiv.org/abs/2502.01521
Previous humanoid robot research works treat the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment using body parts other than feet and hands brings significant challenges in both model-predictive control and reinforcement learning-based methods. An unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time. The success of zero-shot sim-to-real reinforcement learning for humanoids heavily depends on the acceleration of GPU-based rigid-body physics simulators and the simplification of collision detection. The lack of extreme torso movement in prior humanoid research makes all other components non-trivial to design, such as termination conditions, motion commands, and reward designs. To address these potential challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot's motor actions in real time. Using a GPU-accelerated rigid-body simulator, we train a humanoid whole-body control policy that follows high-level motion commands in the real world in real time, even with stochastic contacts, extremely large robot base rotations, and not-so-feasible motion commands. More details at this https URL
https://arxiv.org/abs/2502.01465
The availability of foundational models (FMs) pre-trained on large-scale data has advanced the state-of-the-art in many computer vision tasks. While FMs have demonstrated good zero-shot performance on many image classification tasks, there is often scope for performance improvement by adapting the FM to the downstream task. However, the data that is required for this adaptation typically exists in silos across multiple entities (data owners) and cannot be collated at a central location due to regulations and privacy concerns. At the same time, a learning service provider (LSP) who owns the FM cannot share the model with the data owners due to proprietary reasons. In some cases, the data owners may not even have the resources to store such large FMs. Hence, there is a need for algorithms to adapt the FM in a double-blind federated manner, i.e., the data owners do not know the FM or each other's data, and the LSP does not see the data for the downstream tasks. In this work, we propose a framework for double-blind federated adaptation of FMs using fully homomorphic encryption (FHE). The proposed framework first decomposes the FM into a sequence of FHE-friendly blocks through knowledge distillation. The resulting FHE-friendly model is adapted for the downstream task via low-rank parallel adapters that can be learned without backpropagation through the FM. Since the proposed framework requires the LSP to share intermediate representations with the data owners, we design a privacy-preserving permutation scheme to prevent the data owners from learning the FM through model extraction attacks. Finally, a secure aggregation protocol is employed for federated learning of the low-rank parallel adapters. Empirical results on four datasets demonstrate the practical feasibility of the proposed framework.
https://arxiv.org/abs/2502.01289
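A minimal sketch of a low-rank parallel adapter trained from intermediate representations without backpropagating through the foundation model, which is the piece of the framework above most easily shown in a few lines; the dimensions, placement, and classifier head are assumptions, and FHE, the permutation scheme, and secure aggregation are not modeled here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelLowRankAdapter(nn.Module):
    """Low-rank branch added in parallel to a frozen block's output."""
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)                # start as a no-op correction

    def forward(self, block_input):
        return self.up(self.down(block_input))

# Training-loop sketch on one data owner: the LSP supplies the frozen block's
# input/output representations (in the real system, permuted and under FHE);
# only the adapter and task head are learned and later securely aggregated.
frozen_in = torch.randn(32, 768)     # intermediate representation entering a block
frozen_out = torch.randn(32, 768)    # the frozen block's output for the same batch
labels = torch.randint(0, 10, (32,))

adapter = ParallelLowRankAdapter(768, 768)
classifier = nn.Linear(768, 10)
opt = torch.optim.AdamW(list(adapter.parameters()) + list(classifier.parameters()), lr=1e-3)

for _ in range(5):
    adapted = frozen_out + adapter(frozen_in)         # parallel correction, no backprop into the FM
    loss = F.cross_entropy(classifier(adapted), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```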