Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy, and false positive rate. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false positive rate. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
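A minimal sketch of the two prompting modes compared above, assuming a generic chat-style LLM API; the prompt wording and the `build_prompt` helper are illustrative, not the paper's exact prompts.

```python
def build_prompt(comment: str, context: str | None = None) -> str:
    """Zero-shot prompt, optionally augmented with a short context snippet
    (the context-augmented mode); the answer maps to a binary toxicity label."""
    header = (
        "You are a content moderation assistant for Serbian, Croatian, and "
        "Bosnian. Reply with TOXIC or NOT_TOXIC for the comment below.\n"
    )
    if context is not None:  # context-augmented mode adds a short snippet
        header += f"Video context: {context}\n"
    return header + f"Comment: {comment}\nAnswer:"

# Zero-shot vs. context-augmented prompts for the same (invented) comment:
print(build_prompt("primer komentara"))
print(build_prompt("primer komentara", context="political debate clip, heated thread"))
```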
https://arxiv.org/abs/2506.09992
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
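To make the planning step concrete, here is a toy cross-entropy-method loop over a stand-in latent dynamics model: sample action sequences, roll them out in latent space, and keep the ones whose predicted final state lands closest to the goal embedding. Everything here (dimensions, the linear predictor, the cost) is invented for illustration and does not reproduce the V-JEPA 2-AC architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, H = 16, 4, 5                 # latent dim, action dim, planning horizon
W = rng.normal(scale=0.1, size=(D + A, D))   # stand-in latent dynamics weights

def predict(z, a):
    """One latent rollout step: z_next = f(z, a)."""
    return np.tanh(np.concatenate([z, a]) @ W)

def plan(z0, z_goal, n_samples=256, n_iters=5, top_k=32):
    """Cross-entropy method over action sequences, scored by distance
    between the predicted final latent and the goal embedding."""
    mu, sigma = np.zeros((H, A)), np.ones((H, A))
    for _ in range(n_iters):
        seqs = rng.normal(mu, sigma, size=(n_samples, H, A))
        costs = []
        for seq in seqs:
            z = z0
            for a in seq:
                z = predict(z, a)
            costs.append(np.linalg.norm(z - z_goal))
        elite = seqs[np.argsort(costs)[:top_k]]  # keep the best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]                    # execute the first action, then replan

z0, z_goal = rng.normal(size=D), rng.normal(size=D)  # stand-ins for encoded frames
print("first planned action:", plan(z0, z_goal).round(3))
```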
https://arxiv.org/abs/2506.09985
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QRRETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On the multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
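The scoring idea can be sketched in a few lines: sum the attention mass flowing from query tokens to each context chunk over a chosen set of heads, then rank chunks by that mass. The attention tensor below is random and the head indices are assumed; in practice both would come from the LM's forward pass and the QRHEAD selection procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, q_len, ctx_len = 8, 6, 40
attn = rng.random((n_heads, q_len, ctx_len))
attn /= attn.sum(-1, keepdims=True)           # normalize rows like a softmax

qr_heads = [1, 4, 6]                          # assumed pre-identified QRHEAD indices
chunks = [range(0, 20), range(20, 40)]        # context split into two chunks

# Accumulated query-to-chunk attention mass over the selected heads:
scores = [attn[qr_heads][:, :, list(chunk)].sum() for chunk in chunks]
print("retrieval score per chunk:", [round(float(s), 3) for s in scores])
print("most relevant chunk:", int(np.argmax(scores)))
```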
https://arxiv.org/abs/2506.09944
Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. Existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR lags behind, leaving room for specialized architectures and future work.
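A hedged sketch of how the two tasks might be posed to an LLM as yes/no consistency judgments; the field names and phrasing are ours, not the dataset's actual schema.

```python
def provenance_prompt(task: str, narrative: str, image_caption: str,
                      claimed_value: str) -> str:
    """Pose LOR or DTOR as a yes/no consistency judgment (illustrative only)."""
    question = {
        "LOR": f"Is the claimed location of origin '{claimed_value}' consistent with the above?",
        "DTOR": f"Is the claimed date/time of origin '{claimed_value}' consistent with the above?",
    }[task]
    return (f"Article narrative: {narrative}\n"
            f"Image (described): {image_caption}\n"
            f"{question} Answer YES or NO.")

print(provenance_prompt("DTOR",
                        "Protests erupted downtown yesterday.",
                        "A large crowd on a snowy square.",
                        "June 2023"))
```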
https://arxiv.org/abs/2506.09847
In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.
https://arxiv.org/abs/2506.09822
Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics that are commonly used in educational assessment: classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
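The calibration referred to is standard temperature scaling: dividing logits by a temperature T > 1 softens an over-confident answer distribution. A minimal example with invented logits for one four-option item:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [6.0, 2.0, 1.0, 0.5]        # invented logits for options A-D
print("T = 1.0:", softmax(logits).round(3))        # peaked, over-confident
print("T = 2.5:", softmax(logits, 2.5).round(3))   # softer, more human-like
```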
https://arxiv.org/abs/2506.09796
Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? Our answer is no: we introduce FreeZeV2, the second generation of FreeZe, a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
https://arxiv.org/abs/2506.09784
Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, which is not the case. In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias, and find that misalignment occurs in images with small-sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic: it eliminates the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used benchmark datasets for image segmentation and generation verify the effectiveness of our proposed calibration approach.
https://arxiv.org/abs/2506.09740
Accurate 6D pose estimation of complex objects in 3D environments is essential for effective robotic manipulation. Yet, existing benchmarks fall short in evaluating 6D pose estimation methods under realistic industrial conditions, as most datasets focus on household objects in domestic settings, while the few available industrial datasets are limited to artificial setups with objects placed on tables. To bridge this gap, we introduce CHIP, the first dataset designed for 6D pose estimation of chairs manipulated by a robotic arm in a real-world industrial environment. CHIP includes seven distinct chairs captured using three different RGBD sensing technologies and presents unique challenges, such as distractor objects with fine-grained differences and severe occlusions caused by the robotic arm and human operators. CHIP comprises 77,811 RGBD images annotated with ground-truth 6D poses automatically derived from the robot's kinematics, averaging 11,115 annotations per chair. We benchmark CHIP using three zero-shot 6D pose estimation methods, assessing performance across different sensor types, localization priors, and occlusion levels. Results show substantial room for improvement, highlighting the unique challenges posed by the dataset. CHIP will be publicly released.
https://arxiv.org/abs/2506.09699
CLIP-based domain generalization aims to improve model generalization to unseen domains by leveraging the powerful zero-shot classification capabilities of CLIP and multiple source datasets. Existing methods typically train a single model across multiple source domains to capture domain-shared information. However, this paradigm inherently suffers from two types of conflicts: 1) sample conflicts, arising from noisy samples and extreme domain shifts among sources; and 2) optimization conflicts, stemming from competition and trade-offs during multi-source training. Both hinder generalization and lead to suboptimal solutions. Recent studies have shown that model merging can effectively mitigate the competition of multi-objective optimization and improve generalization performance. Inspired by these findings, we propose Harmonizing and Merging (HAM), a novel source-model merging framework for CLIP-based domain generalization. During the training of the source models, HAM enriches the source samples while excluding conflicting ones, and harmonizes the update directions of all models. A redundancy-aware historical model merging method is then introduced to effectively integrate knowledge across all source models. HAM comprehensively consolidates source-domain information while enabling mutual enhancement among source models, ultimately yielding a final model with optimal generalization capabilities. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance.
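Plain parameter averaging is the basic operation that "model merging" refers to; HAM's redundancy-aware historical merging is more elaborate, so treat this only as a toy sketch of the underlying mechanism.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of parameter tensors across source models."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Stand-ins for per-source-domain models with identical architecture:
models = [torch.nn.Linear(4, 2) for _ in range(3)]
merged = merge_state_dicts([m.state_dict() for m in models])

final = torch.nn.Linear(4, 2)
final.load_state_dict(merged)     # the merged model consolidates all sources
```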
https://arxiv.org/abs/2506.09446
Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and commercial LLMs on three distinct datasets: real-life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and examine the effectiveness of different prompting strategies, including direct label generation and chain-of-thought reasoning. Our findings provide key insights into how LLMs process and interpret deceptive cues across modalities, highlighting their potential and limitations in real-world deception detection applications.
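Similarity-based in-context example selection, one of the setups evaluated, can be sketched as follows; `embed` is a stand-in for a real sentence encoder, and the review snippets are invented.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in sentence encoder: a fixed random vector per string
    (stable within one process run)."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).normal(size=32)

def select_examples(test_text: str, pool, k: int = 2):
    """Return the k labeled examples most cosine-similar to the test text."""
    q = embed(test_text)
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(pool, key=lambda ex: cos(embed(ex[0])), reverse=True)[:k]

pool = [
    ("The room was clean and the staff were helpful.", "truthful"),
    ("Best hotel ever, absolutely everything was perfect!!!", "deceptive"),
    ("Check-in took 20 minutes; the bed was fine.", "truthful"),
]
# The selected pairs would be prepended as few-shot demonstrations:
print(select_examples("A flawless, perfect stay with zero complaints!!", pool))
```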
https://arxiv.org/abs/2506.09424
We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefits of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer function evaluations (NFEs) than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
https://arxiv.org/abs/2506.09416
Domain Adaptation (DA) is crucial for robust deployment of medical image segmentation models when applied to new clinical centers with significant domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal with privacy concerns and access constraints on source-domain data during adaptation to target-domain data. However, SFDA faces challenges such as insufficient supervision in the target domain with unlabeled images. In this work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch Intensity Enhancement (T3IE), which not only improves the quality of raw pseudo-labels in the target domain but also yields SAM-compatible three-channel inputs to better leverage SAM's zero-shot inference ability for refining the pseudo-labels; 2) a reliable pseudo-label selection module that rejects low-quality pseudo-labels based on the Consistency of Multiple SAM Outputs (CMSO) under input perturbations with T3IE; and 3) a reliability-aware training procedure in the unlabeled target domain where reliable pseudo-labels are used for supervision and unreliable parts are regularized by entropy minimization. Experiments on two multi-domain medical image segmentation datasets, for the fetal brain and the prostate respectively, demonstrate that: 1) SRPL-SFDA effectively enhances pseudo-label quality in the unlabeled target domain and improves SFDA performance through reliability-aware training; and 2) SRPL-SFDA outperforms state-of-the-art SFDA methods, with performance close to that of supervised training in the target domain. The code of this work is available online: this https URL.
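The reliability-aware training step can be sketched as a masked combination of two terms: cross-entropy where the pseudo-label passed the CMSO check, entropy minimization elsewhere. The shapes and the mask below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def reliability_aware_loss(logits, pseudo_labels, reliable_mask):
    """logits: (N, C, H, W); pseudo_labels: (N, H, W); reliable_mask: (N, H, W) bool."""
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")   # per-pixel CE
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)     # per-pixel entropy
    # Supervise reliable pixels, regularize unreliable ones:
    return torch.where(reliable_mask, ce, entropy).mean()

logits = torch.randn(2, 3, 8, 8, requires_grad=True)
pseudo = torch.randint(0, 3, (2, 8, 8))
mask = torch.rand(2, 8, 8) > 0.3     # stand-in for the CMSO-based selection
print(reliability_aware_loss(logits, pseudo, mask))
```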
https://arxiv.org/abs/2506.09403
Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.
https://arxiv.org/abs/2506.09375
This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic, robust LSD model that works well for any natural image. Focusing on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. ScaleLSD detects many more line segments in natural images than even the pioneering non-deep LSD approach, yielding a more complete and accurate geometric characterization of images using line segments. Experimentally, the proposed ScaleLSD is comprehensively evaluated under zero-shot protocols on detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, obtaining excellent performance throughout. Based on this thorough evaluation, ScaleLSD is the first deep approach to outperform the pioneering non-deep LSD in all aspects tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and Models are available at this https URL
https://arxiv.org/abs/2506.09369
Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others' intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents' ability to adapt to and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.
https://arxiv.org/abs/2506.09331
Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs' ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models' constraint adherence, offering insights for future improvements.
https://arxiv.org/abs/2506.09329
We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.
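In outline, CLIP-style pretraining of this kind optimizes a symmetric contrastive (InfoNCE) loss over paired sensor and caption embeddings; the stand-in linear encoders below exist only to make the sketch runnable and are not SensorLM's architecture.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched sensor/text pairs sit on the diagonal."""
    s = F.normalize(sensor_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature               # (B, B) similarity matrix
    targets = torch.arange(s.size(0))            # index of each positive pair
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

B, D = 8, 64
sensor_emb = torch.randn(B, 128) @ torch.randn(128, D)  # stand-in sensor encoder output
text_emb = torch.randn(B, 256) @ torch.randn(256, D)    # stand-in text encoder output
print(clip_style_loss(sensor_emb, text_emb))
```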
https://arxiv.org/abs/2506.09108
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at this https URL.
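A rough sketch of parameter-efficient adaptation in this spirit: freeze the pretrained encoder and train only a small residual adapter, so the trainable fraction stays small. The module sizes are invented, and this is not ALTA's actual architecture.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small residual bottleneck trained on top of a frozen backbone."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.down(x).relu())   # residual adapter path

backbone = nn.TransformerEncoderLayer(d_model=768, nhead=8)  # stand-in pretrained encoder
for p in backbone.parameters():
    p.requires_grad = False           # frozen: receives no gradient updates

adapter = Adapter()                   # only these parameters are trained
trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```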
https://arxiv.org/abs/2506.08990
Feature augmentation generates novel samples in the feature space, providing an effective way to enhance the generalization ability of learning algorithms with hyperbolic geometry. Most hyperbolic feature augmentation is confined to closed environments, assuming the number of classes is fixed (i.e., seen classes) and generating features only for these classes. In this paper, we propose a hyperbolic dual feature augmentation method for open environments, which augments features for both seen and unseen classes in the hyperbolic space. To obtain a more precise approximation of the real data distribution for efficient training, (1) we adopt a neural ordinary differential equation module, enhanced by meta-learning, to estimate the feature distributions of both seen and unseen classes; (2) we then introduce a regularizer to preserve the latent hierarchical structures of data in the hyperbolic space; and (3) we derive an upper bound for the hyperbolic dual augmentation loss, allowing us to train a hyperbolic model using infinite augmentations for seen and unseen classes. Extensive experiments on five open-environment tasks (class-incremental learning, few-shot open-set recognition, few-shot learning, zero-shot learning, and general image classification) demonstrate that our method effectively enhances the performance of hyperbolic algorithms in open environments.
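For reference, a minimal sketch of the Poincaré-ball operations such methods build on: the exponential map takes tangent vectors onto the ball, and the induced distance grows quickly near the boundary, which is what lets hyperbolic space encode hierarchies. Standard formulas, not the paper's code.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    num = 2 * c * np.sum((x - y) ** 2)
    den = (1 - c * np.sum(x**2)) * (1 - c * np.sum(y**2))
    return np.arccosh(1 + num / den) / np.sqrt(c)

a = expmap0(np.array([0.3, 0.1]))    # map tangent vectors onto the ball
b = expmap0(np.array([-0.2, 0.4]))
print("hyperbolic distance:", poincare_dist(a, b))
```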
https://arxiv.org/abs/2506.08906