Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs), inspired by the success of generalist models such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent modeling of digital humans and embodiments.
https://arxiv.org/abs/2502.08556
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate whether LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are rated by humans as highly as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
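To make the setup concrete, a rationale-free mistake-learning prompt of the kind described above could be assembled roughly as follows; this is a minimal sketch, with placeholder arithmetic exemplars and a hypothetical `build_prompt` helper rather than the authors' actual prompt format.

```python
# Minimal sketch of rationale-free "mistake" prompting: each exemplar shows an
# incorrect and a correct answer, with no explanation of the error.
def build_prompt(examples, question):
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Incorrect answer: {ex['wrong']}\n"
            f"Correct answer: {ex['right']}\n"
        )
    parts.append(f"Question: {question}\nCorrect answer:")
    return "\n".join(parts)

# Placeholder exemplars; the paper's experiments use mathematical reasoning tasks.
examples = [
    {"question": "What is 12 * 7?", "wrong": "74", "right": "84"},
    {"question": "What is 45 + 38?", "wrong": "73", "right": "83"},
]
print(build_prompt(examples, "What is 36 * 4?"))
```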
https://arxiv.org/abs/2502.08550
Model-based clustering techniques have been widely applied to various application areas, yet most studies focus on canonical mixtures in which every component shares a single distributional form. However, this strict assumption is often hard to satisfy in practice. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed of flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version, developed for switching Markov model identification, with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples), with a discussion of the factors impacting its convergence. Finally, it is compared to mixture models with a single component form identified by Expectation-Maximization on the entire MNIST database (N=70000) and on real cardiac magnetic resonance data (N=276), to illustrate its value for imaging applications.
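For reference, a copula-based mixture density of the kind CBMMs describe can be written via Sklar's theorem as below; this is the generic form with component-specific marginals and copulas, not necessarily the paper's exact notation.

```latex
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
  c_k\!\bigl(F_{k,1}(x_1), \ldots, F_{k,d}(x_d); \boldsymbol{\theta}_k\bigr)
  \prod_{j=1}^{d} f_{k,j}(x_j),
\qquad \sum_{k=1}^{K} \pi_k = 1,
```

where $\pi_k$ are the mixture weights, $c_k$ is the copula density chosen for component $k$, and $F_{k,j}$, $f_{k,j}$ are its marginal CDFs and densities. Heterogeneity arises because both the copula family and the marginal families may differ across components, and GICE iterates over these form choices and their parameters.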
https://arxiv.org/abs/2502.08549
Video Moment Retrieval is a common task to evaluate the performance of visual-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) scores while retaining moment retrieval scores to within $3.87\%$ Recall@1. Dataset splits and code are available at this https URL
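A toy illustration of the two quantities NA-VMR reports, moment retrieval accuracy on relevant queries and rejection accuracy on negative ones, is sketched below; the field names and the confidence-threshold rejection rule are assumptions for illustration, not the paper's protocol.

```python
# Toy sketch of NA-VMR-style evaluation (hypothetical prediction format).
def evaluate(predictions, reject_threshold=0.5):
    """predictions: dicts with 'score' (model confidence that the moment exists),
    'is_negative' (query irrelevant to the video), 'hit' (Recall@1 match)."""
    negatives = [p for p in predictions if p["is_negative"]]
    positives = [p for p in predictions if not p["is_negative"]]
    rejection_acc = sum(p["score"] < reject_threshold for p in negatives) / max(len(negatives), 1)
    recall_at_1 = sum(p["hit"] and p["score"] >= reject_threshold for p in positives) / max(len(positives), 1)
    return {"negative_rejection": rejection_acc, "recall@1": recall_at_1}

print(evaluate([
    {"score": 0.9, "is_negative": False, "hit": True},
    {"score": 0.2, "is_negative": True, "hit": False},
]))
```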
https://arxiv.org/abs/2502.08544
Image quality assessment (IQA) represents a pivotal challenge in image-focused technologies, significantly influencing the advancement trajectory of image processing and computer vision. Recently, IQA has witnessed a notable surge in innovative research efforts, driven by the emergence of novel architectural paradigms and sophisticated computational techniques. This survey delivers an extensive analysis of contemporary IQA methodologies, organized according to their application scenarios, serving as a beneficial reference for both beginners and experienced researchers. We analyze the advantages and limitations of current approaches and suggest potential future research pathways. The survey encompasses both general and specific IQA methodologies, including conventional statistical measures, machine learning techniques, and cutting-edge deep learning models such as convolutional neural networks (CNNs) and Transformer models. The analysis within this survey highlights the necessity for distortion-specific IQA methods tailored to various application scenarios, emphasizing the significance of practicality, interpretability, and ease of implementation in future developments.
https://arxiv.org/abs/2502.08540
The properties of black holes and accretion flows can be inferred by fitting Event Horizon Telescope (EHT) data to simulated images generated through general relativistic ray tracing (GRRT). However, due to the computationally intensive nature of GRRT, the efficiency of generating specific radiation flux images needs to be improved. This paper introduces the Branch Correction Denoising Diffusion Model (BCDDM), which uses a branch correction mechanism and a weighted mixed loss function to improve the accuracy of generated black hole images based on seven physical parameters of the radiatively inefficient accretion flow (RIAF) model. Our experiments show a strong correlation between the generated images and their physical parameters. By enhancing the GRRT dataset with BCDDM-generated images and using ResNet50 for parameter regression, we achieve significant improvements in parameter prediction performance. This approach reduces computational costs and provides a faster, more efficient method for dataset expansion, parameter estimation, and model fitting.
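The parameter-regression step mentioned above pairs an image backbone with a seven-dimensional output; a minimal sketch using a standard torchvision ResNet50 is shown below. The input size, channel count, and loss are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet50 backbone with the classification head swapped for a 7-way regression
# head, one output per RIAF physical parameter.
model = resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 7)

images = torch.randn(4, 3, 224, 224)                  # batch of (real or BCDDM-generated) images
pred_params = model(images)                           # shape: (4, 7)
loss = nn.MSELoss()(pred_params, torch.randn(4, 7))   # placeholder parameter targets
loss.backward()
```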
https://arxiv.org/abs/2502.08528
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
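The interleaving idea can be pictured with a short tensor-level sketch: a head predicts concept activations from the hidden states, an SAE-style decoder maps them back to the hidden dimension, and the resulting continuous concept vectors are inserted between token states. The shapes, sigmoid activation, and layer choices here are illustrative assumptions, not the CoCoMix implementation.

```python
import torch
import torch.nn as nn

B, T, D, C = 2, 8, 64, 32                  # batch, tokens, hidden dim, concept dim (assumed)

hidden = torch.randn(B, T, D)              # token hidden states from some layer
concept_head = nn.Linear(D, C)             # predicts concept activations per position
sae_decoder = nn.Linear(C, D, bias=False)  # stand-in for a pretrained SAE decoder

concepts = sae_decoder(torch.sigmoid(concept_head(hidden)))  # (B, T, D) continuous concepts

# Interleave token states and concept vectors along the sequence axis:
# [h_0, c_0, h_1, c_1, ...]
mixed = torch.stack([hidden, concepts], dim=2).reshape(B, 2 * T, D)
print(mixed.shape)  # torch.Size([2, 16, 64])
```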
https://arxiv.org/abs/2502.08524
Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle to identify errors in summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances, leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that, in naturally occurring data, a summary is not always clearly either faithful or unfaithful to its source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach helps identify ambiguities and performs even more strongly on non-ambiguous summaries.
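The stance-assignment and debate loop can be sketched as follows; `ask_llm` is a placeholder for any chat-completion call, and the prompt wording, agent count, and round count are illustrative assumptions rather than the authors' pipeline.

```python
# Sketch of stance-assigned multi-round debate for faithfulness evaluation.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API call here")

def debate(document: str, summary: str, n_agents: int = 4, n_rounds: int = 3) -> str:
    # Uniformly assign initial stances, regardless of what the agents might "believe".
    stances = ["faithful" if i % 2 == 0 else "unfaithful" for i in range(n_agents)]
    arguments = [
        ask_llm(f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
                f"Argue that the summary is {s}, citing evidence from the document.")
        for s in stances
    ]
    for _ in range(n_rounds):
        transcript = "\n\n".join(arguments)
        arguments = [
            ask_llm(f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
                    f"Other agents argued:\n{transcript}\n\n"
                    "Defend or revise your position and state your current verdict.")
            for _ in range(n_agents)
        ]
    return ask_llm("Given the final arguments below, output the agreed faithfulness verdict:\n\n"
                   + "\n\n".join(arguments))
```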
https://arxiv.org/abs/2502.08514
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets-an aspect crucial for robust model performance-remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Code is available at: this https URL.
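One plausible way to instantiate the "diversity as sample classification" view is sketched below: each sample is treated as its own class, a row-wise softmax over pairwise similarities acts as the classifier, and the summed self-classification probability serves as the score. This illustrates the idea under those assumptions and is not necessarily DCScore's exact formulation.

```python
import numpy as np

def classification_diversity(embeddings: np.ndarray, tau: float = 1.0) -> float:
    """Each sample is its own class; score = sum of self-classification probabilities.
    Ranges from ~1 (all samples identical) up to n (all samples fully distinct)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = (x @ x.T) / tau                               # pairwise cosine similarities
    probs = np.exp(sims - sims.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # row-wise softmax "classifier"
    return float(np.trace(probs))

rng = np.random.default_rng(0)
diverse = classification_diversity(rng.normal(size=(100, 16)))
collapsed = classification_diversity(np.tile(rng.normal(size=(1, 16)), (100, 1)))
print(diverse > collapsed)  # near-duplicate datasets score lower
```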
https://arxiv.org/abs/2502.08512
Grammatical error correction (GEC) aims to correct grammatical, spelling, and semantic errors in natural language text. With the growth of large language models (LLMs), direct text generation has gradually become the focus of GEC methods, and few-shot in-context learning presents a cost-effective solution. However, selecting effective in-context examples remains challenging, as the similarity between input texts does not necessarily correspond to similar grammatical error patterns. In this paper, we propose a novel retrieval method based on natural language grammatical error explanations (GEE) to address this issue. Our method retrieves suitable few-shot demonstrations by matching the GEE of the test input with that of pre-constructed database samples, where explanations for erroneous samples are generated by LLMs. We conducted multilingual GEC few-shot experiments on both major open-source and closed-source LLMs. Experiments across five languages show that our method outperforms existing semantic and BM25-based retrieval techniques without requiring additional training or language adaptation. This also suggests that matching error patterns is key to selecting effective examples.
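The retrieval step can be sketched as below: the test input's error explanation is generated, embedded, and matched against pre-computed explanations of database samples. `explain_errors` and `embed` are hypothetical stand-ins for an LLM call and any sentence encoder; they are not the paper's specific components.

```python
import numpy as np

def explain_errors(sentence: str) -> str:
    """Hypothetical: ask an LLM to describe the grammatical errors in `sentence`."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Hypothetical: any sentence encoder returning a 1-D vector."""
    raise NotImplementedError

def retrieve_demonstrations(test_input: str, database: list, k: int = 4):
    """database: (erroneous_sentence, correction, llm_generated_explanation) triples."""
    q = embed(explain_errors(test_input))
    scored = []
    for sent, corr, expl in database:
        e = embed(expl)
        sim = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
        scored.append((sim, sent, corr))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]   # top-k few-shot demonstrations by explanation similarity
```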
https://arxiv.org/abs/2502.08507
This work introduces Salamandra, a suite of open-source decoder-only large language models available in three sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages as well as code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks and on key aspects related to bias and safety. In this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy, and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.
https://arxiv.org/abs/2502.08489
Referring Remote Sensing Image Segmentation (RRSIS) is critical for ecological monitoring, urban planning, and disaster management, requiring precise segmentation of objects in remote sensing imagery guided by textual descriptions. This task is uniquely challenging due to the considerable vision-language gap, the high spatial resolution and broad coverage of remote sensing imagery with diverse categories and small targets, and the presence of clustered, unclear targets with blurred edges. To tackle these issues, we propose \ours, a novel framework designed to bridge the vision-language gap, enhance multi-scale feature interaction, and improve fine-grained object differentiation. Specifically, \ours introduces: (1) the Bidirectional Spatial Correlation (BSC) for improved vision-language feature alignment, (2) the Target-Background TwinStream Decoder (T-BTD) for precise distinction between targets and non-targets, and (3) the Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. Extensive experiments on the benchmark datasets RefSegRS and RRSIS-D demonstrate that \ours achieves state-of-the-art performance. Specifically, \ours improves the overall IoU (oIoU) by 3.76 percentage points (80.57) and 1.44 percentage points (79.23) on the two datasets, respectively. Additionally, it outperforms previous methods in the mean IoU (mIoU) by 5.37 percentage points (67.95) and 1.84 percentage points (66.04), effectively addressing the core challenges of RRSIS with enhanced precision and robustness.
https://arxiv.org/abs/2502.08486
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. We therefore use the Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length; these chains are then used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at this https URL.
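Iteration-wise supervision of a looped block can be sketched as below: the same block is applied once per aligned CoT step and a loss is taken at every iteration. The layer choice, shared readout, and token-level targets are illustrative assumptions, not the RELAY codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, T, V = 64, 4, 1000                     # hidden size, loop/CoT steps, vocab size (assumed)
block = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
readout = nn.Linear(D, V)                 # shared output head

x = torch.randn(2, 10, D)                 # embedded input sequence (batch, length, dim)
step_targets = torch.randint(0, V, (T, 2, 10))   # token targets aligned with each CoT step

loss, h = 0.0, x
for t in range(T):                        # one loop iteration per CoT reasoning step
    h = block(h)
    logits = readout(h)                   # (2, 10, V)
    loss = loss + F.cross_entropy(logits.reshape(-1, V), step_targets[t].reshape(-1))
loss.backward()
```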
https://arxiv.org/abs/2502.08482
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited amount of labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment keeps different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model, mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at this https URL.
https://arxiv.org/abs/2502.08468
Cultural and language factors significantly influence counseling, but Natural Language Processing research has not yet examined whether the findings of conversational analysis for counseling conducted in English apply to other languages. This paper presents a first step towards this direction. We introduce MIDAS (Motivational Interviewing Dataset in Spanish), a counseling dataset created from public video sources that contains expert annotations for counseling reflections and questions. Using this dataset, we explore language-based differences in counselor behavior in English and Spanish and develop classifiers in monolingual and multilingual settings, demonstrating its applications in counselor behavioral coding tasks.
https://arxiv.org/abs/2502.08458
Simultaneously grasping and transporting multiple objects can significantly enhance robotic work efficiency and has been a key research focus for decades. The primary challenge lies in determining how to push objects, group them, and execute simultaneous grasping for respective groups while considering object distribution and the hardware constraints of the robot. Traditional rule-based methods struggle to flexibly adapt to diverse scenarios. To address this challenge, this paper proposes an imitation learning-based approach. We collect a series of expert demonstrations through teleoperation and train a diffusion policy network, enabling the robot to dynamically generate action sequences for pushing, grouping, and grasping, thereby facilitating efficient multi-object grasping and transportation. We conducted experiments to evaluate the method under different training dataset sizes, varying object quantities, and real-world object scenarios. The results demonstrate that the proposed approach can effectively and adaptively generate multi-object grouping and grasping strategies. With the support of more training data, imitation learning is expected to be an effective approach for solving the multi-object grasping problem.
https://arxiv.org/abs/2502.08452
In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs from specific prompts struggle to obtain prompt-generalized essay representations. In this work, we propose grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representations. We acquire grammatical error-corrected information for essays via grammar error correction and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method's generalizability, showing remarkable improvements on prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating essays from unseen prompts.
https://arxiv.org/abs/2502.08450
Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) the global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To eliminate these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce the interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pre-training policy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities with an average success rate of 90\% in four real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available on this https URL.
https://arxiv.org/abs/2502.08449
Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
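For context, the standard Adam update is reproduced below, with $v_t$ the per-coordinate second-moment estimate that the abstract identifies as a driver of anisotropic embeddings; the specific modification made by Coupled Adam is described in the paper itself.

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2},
\qquad
\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\quad \text{with } \hat{m}_t = \frac{m_t}{1-\beta_1^{t}},\; \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}.
```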
https://arxiv.org/abs/2502.08441
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at our project website.
https://arxiv.org/abs/2502.08438