Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
大型语言模型(LLMs)通常能够准确地用自然语言描述概率分布,但它们在从这些分布中生成忠实样本时仍存在困难。这种不匹配限制了它们在需要可靠随机性的任务中的应用,例如蒙特卡洛方法、基于代理的仿真和随机决策制定。我们在此背景下研究伯努利分布的知识与采样之间的差距,并提出了语言化的拒绝抽样(Verbalized Rejection Sampling, VRS),这是一种经典拒绝抽样的自然语言版本,它促使LLM对提出的样本进行推理并接受或拒绝这些样本。尽管VRS在内部依赖于相同的伯努利机制,但它显著减少了不同模型的采样偏差。 我们提供理论分析表明,在适度假设下,VRS优于直接抽样,并且改进来自于算法本身以及提示设计。更广泛地说,我们的研究结果展示了如何将经典的概率工具语言化并嵌入到LLM工作流程中以提高可靠性,而无需访问模型内部或进行复杂的提示工程。
https://arxiv.org/abs/2506.09998
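For reference, the classical rejection sampling step that VRS verbalizes can be sketched for the Bernoulli case as follows. This is a minimal illustration, not the paper's implementation: in VRS the accept/reject decision is made by the prompted LLM, whereas here a local pseudo-random generator stands in for it. The function names and the choice of a Bernoulli(q) proposal are assumptions made for the example.

```python
import random


def bernoulli_pmf(x, p):
    """Probability of outcome x (0 or 1) under Bernoulli(p)."""
    return p if x == 1 else 1.0 - p


def rejection_sample_bernoulli(p, q, rng):
    """Draw one sample from Bernoulli(p) using Bernoulli(q) as the proposal.

    Hypothetical helper: the accept/reject step below is done with a local
    RNG, where VRS would ask the LLM to accept or reject the proposal.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("proposal parameter q must lie strictly in (0, 1)")
    if not 0.0 <= p <= 1.0:
        raise ValueError("target parameter p must lie in [0, 1]")
    # Envelope constant M with target(x) <= M * proposal(x) for x in {0, 1}.
    m = max(p / q, (1.0 - p) / (1.0 - q))
    while True:
        x = 1 if rng.random() < q else 0  # propose
        accept_prob = bernoulli_pmf(x, p) / (m * bernoulli_pmf(x, q))
        if rng.random() < accept_prob:    # accept or reject
            return x


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [rejection_sample_bernoulli(0.3, 0.5, rng) for _ in range(20000)]
    print(sum(draws) / len(draws))  # empirical frequency of 1s, close to 0.3
```

Accepted samples follow the target distribution regardless of the proposal, which is why the same mechanism can correct a biased proposal source.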
Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
检测人工智能生成的文本本身就是一个难题;而在社交媒体上检测这类文本则更加困难,因为短文本长度和互联网特有的非正式、个性化的语言使得这一任务更为复杂。尽管如此,解决这个问题仍然非常重要,因为在网络影响力活动中,社交媒体代表了一个重要的攻击途径,通过大规模生产支持(或反对)特定政策、决策或事件的人工智能生成帖子可以增强这种活动的力度。 我们以一个较为复杂的威胁行为者的思维方式和资源来应对这个问题,并创建了一套数据集,其中包含来自开源、闭源以及经过微调的语言模型生成的505,159条社交媒体帖子,这些帖子涵盖了11个有争议的话题。研究表明,在典型的科研假设下(即研究者对生成文本的模型具有一定的了解和访问权限),可以检测到这些帖子;但在更为现实的情况下,若攻击者不会将其微调后的模型公开,则可检测性会大幅下降。这项结果也通过一项人类实验得到了确认。 消融实验进一步揭示了各种检测算法对于经过微调的语言模型存在明显的脆弱性。这一发现对所有领域的检测工作都有重要的影响,因为微调是大型语言模型的一种普遍适用且现实的应用场景。
https://arxiv.org/abs/2506.09975
Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To this end, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amounts to merely 10 % of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.
条件扩散模型(CDMs)在多种生成任务中表现出卓越的性能。它们能够对完整数据分布进行建模的能力,为下游判别学习中的分析与合成方法开辟了新的途径。然而,这种强大的建模能力也导致CDMs将定义类别的特征与不相关的背景信息纠缠在一起,使得提取稳健且可解释的表示变得具有挑战性。为此,我们识别出了典范潜在表示(Canonical LAtent Representations, CLAReps),这是一种内部CDM特征能够保留关键类别信息同时摒弃非判别信号的潜在编码方式。当这些CLAReps被解码时,它们能为每个类生成代表性的样本,并提供一个简洁、可解释的核心语义摘要,包含最少的无关细节。 利用CLAReps,我们开发了一种新颖的基于扩散的方法——CaDistill(特征蒸馏),用于知识传递。在此过程中,学生模型可以完全访问整个训练集,而作为教师的CDM则仅通过CLAReps将核心类别知识传输给学生,这些CLAReps只占原始训练数据量的10%左右。经过训练后,学生模型在对抗鲁棒性和泛化能力方面表现优异,并且更注重类别的信号而非误导性的背景线索。 我们的发现表明,CDMs不仅可以作为图像生成器使用,还能充当紧凑、可解释的知识传授者,促进稳健的表示学习。
https://arxiv.org/abs/2506.09955
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: this https URL.
在外部知识视觉问答(OK-VQA)中,模型必须识别图像中的相关视觉信息,并结合外部知识以准确回答问题。将这项任务扩展到基于视频的视觉对话设置中,对话模型不仅要能够随时间识别相关的视觉细节,还要能回答那些所需信息不一定存在于视觉信息中的问题。此外,后续对话还必须考虑整体对话的上下文。 为了探索这一任务,我们引入了一个数据集,包含2,017个视频和5,986个人工标注的对话,这些对话共包括40,954轮交替进行的对话回合。虽然对话背景是基于特定视频片段中的视觉信息,但问题进一步需要那些不在视觉中直接显示的外部知识。因此,模型不仅必须识别相关视频部分,还要利用外部知识在对话中交流。 我们还在我们的数据集上评估了若干基线方法,并展示了与此任务相关的未来挑战。该数据集可在此公开获取:this https URL。
https://arxiv.org/abs/2506.09953
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at this https URL.
点云数据的尺度多样性为三维视觉中的统一表示学习技术的发展带来了显著挑战。目前,很少有通用的3D模型存在,并且没有现有的预训练方法能够同时有效地应用于对象级和场景级点云。在本文中,我们介绍了UniPre3D,这是首个可以无缝应用于任何规模点云及任意架构3D模型的统一预训练方法。我们的方法将预测高斯基元作为预训练任务,并采用可微分高斯渲染技术来生成图像,从而实现精确的像素级监督和端到端优化。为了进一步调节预训练任务的复杂度并引导模型关注几何结构,我们整合了来自预先训练好的图像模型的2D特征,以纳入已确立的良好纹理知识。我们通过广泛的实验验证了所提出方法在各种对象级和场景级任务中的通用有效性,并使用多种点云模型作为骨干网络进行测试。代码可在提供的链接中获取。
https://arxiv.org/abs/2506.09952
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $O(1/\epsilon^2)$.
信息不对称是多代理系统中普遍存在的一种特征,尤其在经济学和社会科学领域表现得尤为明显。在这种背景下,各主体根据私有信息调整行为以最大化自身收益。这种策略性行为往往由于混淆变量的引入而变得复杂。同时,在目标环境中进行实验的难度也带来了知识迁移的重大挑战,这需要将知识从数据更容易获取的环境转移到其他场景中。在此背景下,本文探讨了在线学习中的一个基本问题:我们能否利用非独立同分布(non-i.i.d.)的动作来了解混淆变量,即使在这种情况下仍需实现知识迁移?为此,我们提出了一种样本效率高的算法,旨在准确识别信息不对称条件下的系统动态,并在强化学习框架内有效应对知识转移的挑战,在一个在线策略互动模型下进行。我们的方法可以证明,能够在具有紧致样本复杂度$O(1/\epsilon^2)$的情况下,实现$\epsilon$-最优策略的学习。
https://arxiv.org/abs/2506.09940
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at this https URL.
尽管视觉语言行动模型(VLAs)在各种操作任务中展示了有前景的机器人行为,但当直接部署到全新的任务时,它们的成功率有限。为了使这些策略能够安全地与其环境互动,我们需要一个故障检测器,能够及时发出警报,让机器人可以停止、回溯或请求帮助。然而,现有的故障检测器仅在特定的一个或几个任务上进行训练和测试,而VLAs则需要检测器能在未见过的任务和新环境中泛化并识别故障。 在这篇论文中,我们引入了多任务故障检测问题,并提出了SAFE,一个为包括VLAs在内的通才机器人策略设计的故障检测器。通过对VLA特征空间的分析,我们发现VLAs对任务的成功与失败拥有足够的高层次知识,这种知识在不同任务间具有通用性。基于这一洞察,我们将SAFE设计成能够从VLA内部特征学习,并预测一个单一标量值来表示任务失败的可能性。SAFE同时使用成功和失败的执行轨迹进行训练,并在未见过的任务上进行评估。此外,SAFE与不同的策略架构兼容。 我们在模拟环境和现实世界中对OpenVLA、$\pi_0$以及$\pi_0$-FAST进行了广泛的测试。我们将SAFE与各种基线进行了比较,并展示了它在故障检测性能上取得了最先进的成果,且使用共形预测(conformal prediction)实现了准确性与检测时间的最佳平衡。更多定性结果可以在 this https URL 中找到。
https://arxiv.org/abs/2506.09937
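The abstract states that SAFE predicts a scalar failure score and uses conformal prediction to trade off accuracy against detection time. The snippet below is a simplified split-conformal sketch of that idea, not the paper's procedure: it assumes a calibration set of per-rollout scores taken from successful rollouts and picks a threshold so that, under exchangeability, a new successful rollout exceeds it with probability at most `alpha`. All names and numbers are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold from scores of successful rollouts.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for the requested alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def first_alarm_step(scores, threshold):
    """Index of the first time step whose score exceeds the threshold, else None."""
    for step, score in enumerate(scores):
        if score > threshold:
            return step
    return None


if __name__ == "__main__":
    # Hypothetical per-rollout maximum failure scores from successful rollouts.
    calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
    threshold = conformal_threshold(calibration, alpha=0.25)
    # Hypothetical score trace of a new rollout; the alarm fires at the first
    # step whose score is above the calibrated threshold.
    print(threshold, first_alarm_step([0.2, 0.5, 0.65, 0.9], threshold))
```

A smaller `alpha` raises the threshold, which reduces false alarms but delays detection; that is the accuracy versus detection-time trade-off the abstract refers to.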
We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
我们提出了一种通过分析提示和响应隐藏状态分布之间的概率差异来检测大型语言模型(LLM)中幻觉的新方法。出人意料的是,我们发现幻觉回应与其提示相比表现出较小的偏差,这表明幻觉往往源于表面性的改写而非实质性的推理。基于这一洞察,我们提出了一种利用分布距离作为原则性幻觉评分的内在模型检测方法,这种方法无需外部知识或辅助模型的支持。为了提高灵敏度,我们采用了深度可学习核函数来自动适应并捕捉分布在细微几何差异上的变化。我们的方法在多项基准测试中超越了现有基线,在幻觉检测方面展示了最先进的性能。即使不进行内核训练,该方法仍然具有竞争力,提供了一种稳健且可扩展的解决方案用于检测幻觉。
https://arxiv.org/abs/2506.09886
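One common kernel-based distributional distance is the maximum mean discrepancy (MMD). The sketch below computes a biased squared-MMD estimate between two sets of hidden-state vectors with a fixed RBF kernel. It is an illustrative assumption about what such a score could look like, not the paper's method: the abstract does not name the distance, and its learnable deep kernels are replaced here by a fixed bandwidth. Per the abstract's finding, a smaller prompt-response distance would point toward hallucination.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(prompt_states, response_states, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(prompt_states, prompt_states, bandwidth)
        + mean_kernel(response_states, response_states, bandwidth)
        - 2.0 * mean_kernel(prompt_states, response_states, bandwidth)
    )


if __name__ == "__main__":
    # Toy 2-D "hidden states"; real ones would come from a model's layers.
    prompt = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    near_response = [[0.1, 0.0], [1.1, 0.0], [0.0, 1.1]]
    far_response = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
    print(mmd_squared(prompt, near_response))  # small distance
    print(mmd_squared(prompt, far_response))   # larger distance
```

How the score is thresholded, and which layers supply the hidden states, are left open here.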
We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis of 3D knowledge and uncover a critical trend: the performance of methods that require less 3D knowledge improves more rapidly as data scales, eventually reaching parity with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by and following this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving performance comparable even to methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: this https URL.
我们考虑的是可泛化的新视角合成(NVS)问题,其目标是从稀疏甚至未标注位姿的2D图像生成逼真的新视图,而不依赖于每个场景的单独优化。这项任务在本质上仍然具有挑战性,因为需要从不完整且模棱两可的二维观察中推断三维结构。早期的方法通常依赖于强大的三维知识,包括架构上的三维归纳偏差(例如,在网络设计中嵌入显式的3D表示,如NeRF或3DGS)和输入视图及目标视图的真实相机姿态信息。 虽然近期的研究致力于减少三维归纳偏差或对输入视图已知相机位姿的依赖,但关于三维知识的作用以及绕过其使用是否必要等问题仍然未得到充分探讨。在这项工作中,我们系统地分析了三维知识,并发现了一个关键趋势:那些需要较少三维知识的方法在数据量增加时性能提升更快,最终可以达到与基于三维知识方法相当的表现水平,这表明减少对三维知识的依赖在大规模数据时代变得越来越重要。 受这一趋势启发,我们提出了一种新的NVS框架,该框架最大限度地减少了输入和目标视图中的3D归纳偏差及姿态依赖。通过去除这种3D知识,我们的方法能够充分利用数据量的增长,并直接从稀疏的2D图像中学习隐式的三维感知能力,在训练过程中无需任何3D归纳偏差或姿态标注。 广泛的实验表明,我们的模型可以生成逼真的、3D一致的新视角视图,其性能甚至可与依赖于带位姿输入的方法相媲美,从而验证了我们以数据为中心的范式的可行性和有效性。项目页面:this https URL。
https://arxiv.org/abs/2506.09885
Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual's genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to a single visual modality and overlooks the rich emotional information conveyed by other physiological modalities, leaving ME recognition and spotting performance far below practical application needs. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset's reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper.
微表情(ME)是揭示个体真实情绪状态的微妙、短暂的非语言信号。其分析在医疗保健、刑事调查和人机交互等领域中引起了极大的关注,因其具有广泛的应用前景。然而,现有的微表情研究主要局限于单一视觉模态,忽略了其他生理模态传达的情感信息,导致微表情识别与检测性能远低于实际应用需求。因此,探索微表情视觉特征与其他生理信号(PS)之间的跨模态关联机制,并开发多模态融合框架,是推进微表情分析的关键一步。 本研究引入了一个新颖的微表情数据集MMME(Multimodal Micro-expression),首次实现了面部动作信号(微表情)、中枢神经系统信号(EEG)和外周生理信号(PPG、RSP、SKT、EDA以及ECG)的同步采集。通过克服现有微表情语料库的局限,MMME包含634个微表情、2,841个宏表情(MaEs)和2,890次多模态生理信号同步数据记录,为研究微表情神经机制和进行基于多模态融合分析奠定了坚实基础。广泛实验验证了该数据集的可靠性,并提供了基准测试以评估微表情分析的性能,证明整合微表情与生理信号显著提升了识别与检测性能。 据我们所知,MMME是迄今为止在模态多样性方面最全面的微表情数据集。它为探索微表情神经机制和揭示视觉-生理协同效应提供了关键的数据支持,并推动了从单一视觉分析向多模态融合的研究范式转变。该数据集将在本论文被接受后公开发布。
https://arxiv.org/abs/2506.09834
In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.
在这份简短的报告中,我们记录并分析了一个引人注目的事件:OpenAI的大规模语言模型o3在一所大学的热力学考试中击败了所有学生。热力学考试对于大多数学生来说是一道难关,他们必须证明自己掌握了这一重要主题的基础知识。因此,考试失败率很高,A等成绩非常罕见——它们被认为是学生非凡智力能力的证据。这是因为模式学习在这种考试中无济于事。问题只能通过有见识且创造性地结合热力学原理来解答。 我们不仅将最新的热力学考试提供给学生们,还将其提供给了OpenAI最强大的推理模型o3,并以与评估学生答案相同的方式对o3的答案进行了评估。在零样本模式下,该模型o3正确解决了所有问题,在此次考试中表现优于所有的学生;它的总体得分达到了自1985年以来我们见过的超过10,000份类似试卷中的最高等级。 这是一个转折点:机器现在在通常被认为是证明人类智力能力的复杂任务上表现出色。我们将讨论这对工程师的工作以及未来工程师教育的影响。
https://arxiv.org/abs/2506.09822
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at this https URL.
大型推理模型(LRMs)如o1和DeepSeek-R1在处理长链思维过程的自然语言推理方面展现了显著的进步,然而它们在应对复杂数学运算时仍然效率低下或准确性不足。通过计算工具(例如计算库和符号求解器)来解决这些局限性具有前景,但这也引入了一个技术挑战:代码解释器(CI)带来了超出模型内部文本表示的外部知识,因此直接结合使用的效果不佳。本文介绍了一种名为CoRT的后训练框架,旨在教导LRMs有效且高效地利用CI。 作为第一步,我们通过Hint-Engineering解决了数据稀缺问题,这一方法通过在适当位置策略性插入不同的提示来合成代码集成推理数据,并以此优化LRM-CI交互过程。我们手动创建了30个高质量样本,在此基础上,我们将从1.5B到32B参数的模型进行了监督微调、拒绝微调和强化学习后训练。 我们的实验结果表明,在五个具有挑战性的数学推理数据集上,采用Hint-Engineering的模型在DeepSeek-R1-Distill-Qwen-32B和DeepSeek-R1-Distill-Qwen-1.5B上分别取得了4%和8%的绝对改进。此外,相对于纯自然语言模型,Hint-Engineering的模型在32B模型中使用的token减少了约30%,而在1.5B模型中则减少了约50%。 模型与代码可通过 this https URL 获取。
https://arxiv.org/abs/2506.09820
End-to-end autonomous driving has emerged as a promising paradigm for directly mapping sensor inputs to planning maneuvers using learning-based modular integrations. However, existing imitation learning (IL)-based models generalize poorly to hard cases and lack a corrective feedback loop after deployment. While reinforcement learning (RL) offers a potential solution to tackle hard cases with optimality, it is often hindered by overfitting to specific driving cases, resulting in catastrophic forgetting of generalizable knowledge and sample inefficiency. To overcome these challenges, we propose Reinforced Refinement with Self-aware Expansion (R2SE), a novel learning pipeline that continually refines performance in hard domains while preserving a generalizable driving policy for model-agnostic end-to-end driving systems. Through reinforcement fine-tuning and policy expansion that facilitates continuous improvement, R2SE features three key components: 1) Generalist Pretraining with hard-case allocation trains a generalist imitation learning (IL) driving system while dynamically identifying failure-prone cases for targeted refinement; 2) Residual Reinforced Specialist Fine-tuning optimizes residual corrections using reinforcement learning (RL) to improve performance in the hard-case domain while preserving global driving knowledge; 3) Self-aware Adapter Expansion dynamically integrates specialist policies back into the generalist model, enhancing continuous performance improvement. Experimental results in closed-loop simulation and real-world datasets demonstrate improvements in generalization, safety, and long-horizon policy robustness over state-of-the-art E2E systems, highlighting the effectiveness of reinforced refinement for scalable autonomous driving.
端到端自动驾驶作为一种有前景的范式,直接将传感器输入映射为规划操作,利用基于学习的方法进行模块化集成。然而,现有的基于模仿学习(IL)的模型在面对困难案例时难以泛化,并且缺乏部署后的纠正反馈循环。虽然强化学习(RL)提供了一种潜在解决方案来解决这些难题案例并实现最优解,但其通常会因为过度适应特定驾驶情况而受到阻碍,从而导致对通用知识的记忆丧失和样本效率低下。为了克服这些挑战,我们提出了自我感知扩展的增强型细化方法(R2SE),这是一种新型学习管道,能够不断优化困难领域的同时保持模型无关端到端驾驶系统的可泛化驾驶策略。通过强化微调和支持持续改进的政策扩张,R2SE具有三个关键组成部分:1)硬案例分配的一般预训练,可以培训一个基于模仿学习(IL)的通用自动驾驶系统,并动态识别易于失败的情况进行有针对性的细化;2)残差增强专家精炼,使用强化学习优化残差校正以提高在困难领域的性能,同时保持整体驾驶知识;3)自我感知适配器扩展,能够将专业策略动态集成回通用模型中,从而增强持续性的表现改进。闭合循环模拟和真实世界数据集的实验结果表明,在泛化、安全性和长视距政策稳健性方面优于最先进的E2E系统,这突显了通过强化学习进行细化对于可扩展自动驾驶的有效性。
https://arxiv.org/abs/2506.09800
Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.
音频-视觉目标说话人分离(AV-TSE)模型主要依赖于目标的视觉线索来从其他人声中隔离出目标说话人的声音。我们知道,人类利用语言知识,如句法和语义,来支持语音感知。受到这一启发,我们探索了预训练的语音-语言模型(PSLMs)和预训练的语言模型(PLMs)作为AV-TSE辅助知识来源的潜力。在本研究中,我们提出将从PSLM或PLM获取的语言约束用作AV-TSE模型的额外监督信号。无需增加推理过程中的计算成本,所提出的方案可以持续提高语音质量和可理解性。此外,我们在多语言设置和视觉线索受损场景下评估了我们的方法,并展示了稳健的性能提升。
https://arxiv.org/abs/2506.09792
Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge-based reasoning capabilities, opening new paths for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by agentic AI systems. We review the historical evolution of ID across four distinct stages: rule-based expert systems, task-specific machine learning models, large-scale foundation AI models, and the recent emerging paradigm of multi-agent collaboration. We propose a conceptual framework for ID 4.0 and discuss its potential to support end-to-end automation of engineering design processes through coordinated, autonomous multi-agent-based systems. Furthermore, we discuss future perspectives to enhance and fully realize ID 4.0's potential, including more complex design scenarios, more practical design implementations, novel agent coordination mechanisms, and autonomous design goal-setting with better human value alignment. In sum, these insights lay a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing increasingly complex design challenges.
近年来,智能设计(ID)领域的研究与实践在工程创新、效率、质量和生产力方面取得了显著进步,并彻底改变了工程师的设计思维、行为方式以及他们与设计流程的互动模式。最近出现的基础模型(FMs),特别是大型语言模型(LLMs),已经展示了基于广泛知识的推理能力,并为工程设计进一步转型开辟了新的路径和方向。在此背景下,本文介绍了智能设计4.0(ID 4.0)作为由代理型人工智能系统赋能的新范式。文章回顾了智能设计在四个不同阶段的历史演变:规则基础专家系统、特定任务机器学习模型、大规模基础AI模型以及最近出现的多代理协作新范式。我们提出了一个概念框架来定义ID 4.0,并讨论了其通过自主协调的多代理系统支持工程设计过程全流程自动化的能力。此外,本文还探讨了增强和充分实现ID 4.0潜力的未来展望,包括更为复杂的设计场景、更具实用性的设计实现、新颖的代理协作机制以及与人类价值观更好地对齐的自主设计目标设定。 总的来说,这些见解为智能设计向更加适应性、自主性和有效性的方向发展奠定了基础,以应对日益复杂的设计挑战。
https://arxiv.org/abs/2506.09755
In complex engineering systems, the interdependencies among components or development activities are often modeled and analyzed using Design Structure Matrix (DSM). Reorganizing elements within a DSM to minimize feedback loops and enhance modularity or process efficiency constitutes a challenging combinatorial optimization (CO) problem in engineering design and operations. As problem sizes increase and dependency networks become more intricate, traditional optimization methods that solely use mathematical heuristics often fail to capture the contextual nuances and struggle to deliver effective solutions. In this study, we explore the potential of Large Language Models (LLMs) for helping solve such CO problems by leveraging their capabilities for advanced reasoning and contextual understanding. We propose a novel LLM-based framework that integrates network topology with contextual domain knowledge for iterative optimization of DSM element sequencing - a common CO problem. Experiments on various DSM cases show that our method consistently achieves faster convergence and superior solution quality compared to both stochastic and deterministic baselines. Notably, we find that incorporating contextual domain knowledge significantly enhances optimization performance regardless of the chosen LLM backbone. These findings highlight the potential of LLMs to solve complex engineering CO problems by combining semantic and mathematical reasoning. This approach paves the way towards a new paradigm in LLM-based engineering design optimization.
在复杂的工程系统中,组件或开发活动之间的相互依赖性通常通过设计结构矩阵(DSM)进行建模和分析。重新组织 DSM 中的元素以减少反馈循环并提高模块化或流程效率构成了一个具有挑战性的组合优化 (CO) 问题,在工程设计与运营中尤为突出。随着问题规模的增长及依赖网络变得更加复杂,传统的仅使用数学启发式方法的优化手段往往难以捕捉到上下文中的细微差别,并且很难提供有效的解决方案。 在本研究中,我们探索了大型语言模型(LLMs)在此类 CO 问题解决中的潜在能力,利用其高级推理和语境理解的能力。我们提出了一种基于 LLM 的新型框架,将网络拓扑与领域知识相结合,以迭代优化 DSM 元素的排序——这是一个常见的 CO 问题。在不同 DSM 案例上的实验表明,我们的方法相比随机及确定性的基线方法,在收敛速度和解决方案质量方面均表现出色。 值得注意的是,我们发现结合上下文领域的知识可以显著提升优化性能,无论所选择的 LLM 基础架构如何。这些研究结果突显了通过将语义与数学推理相结合的方式利用 LLM 来解决复杂的工程 CO 问题的巨大潜力。这种方法为基于 LLM 的工程设计优化开启了新的范例。
https://arxiv.org/abs/2506.09749
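The DSM sequencing objective mentioned in the abstract, minimizing feedback loops, can be made concrete with a small example. The sketch below only illustrates the objective and a brute-force search on a toy matrix; it is not the LLM-based framework the paper proposes. It assumes the convention that `dsm[i][j] == 1` means element `i` needs input from element `j`, and the example matrix is invented for illustration.

```python
from itertools import permutations


def count_feedback(dsm, order):
    """Number of dependencies whose provider comes later in the sequence.

    Assumed convention: dsm[i][j] == 1 means element i needs input from j,
    so a feedback mark occurs when j is scheduled after i.
    """
    position = {element: index for index, element in enumerate(order)}
    size = len(dsm)
    return sum(
        1
        for i in range(size)
        for j in range(size)
        if i != j and dsm[i][j] and position[j] > position[i]
    )


def best_order_bruteforce(dsm):
    """Exhaustive search over orderings; only feasible for very small DSMs."""
    return min(permutations(range(len(dsm))), key=lambda o: count_feedback(dsm, o))


if __name__ == "__main__":
    # Toy DSM: 0 needs 2; 1 needs 0; 2 needs nothing; 3 needs 1 and 2.
    dsm = [
        [0, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
    ]
    print(count_feedback(dsm, (0, 1, 2, 3)))  # feedback marks in the given order
    print(best_order_bruteforce(dsm))         # an ordering with the fewest marks
```

Exhaustive search grows factorially with the number of elements, which is why heuristic or, as in the paper, LLM-guided iterative search is needed for realistic problem sizes.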
Existing multimodal methods typically assume that different modalities share the same category set. However, in real-world applications, the category distributions in multimodal data exhibit inconsistencies, which can hinder the model's ability to effectively utilize cross-modal information for recognizing all categories. In this work, we propose a practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL), where models are trained on multi-modal data with heterogeneous category sets and aim to recognize the complete class set of all modalities at test time. To effectively address this task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF). Specifically, CSCF aligns modality-specific features to a shared semantic space to enable knowledge transfer between seen and unseen classes. It then selects the most discriminative modality for decision fusion through uncertainty estimation. Finally, it integrates cross-modal information based on class similarity, where the auxiliary modality refines the prediction of the dominant one. Experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) approaches on multiple benchmark datasets, effectively addressing the MMHCL task.
现有的多模态方法通常假设不同的模态共享相同的类别集。然而,在实际应用中,多模态数据的类别分布表现出不一致性,这可能阻碍模型有效利用跨模态信息来识别所有类别的能力。为此,我们在本文提出了一个实用设置,称为多模态异构类别集学习(MMHCL),在这个框架下,模型在不同模态的数据异构类别集中进行训练,并且目标是在测试时识别出所有模态的完整类别集合。 为有效应对这一任务,我们提出了一种基于类相似性的跨模态融合模型(Class Similarity-based Cross-modal Fusion, CSCF)。具体而言,CSCF 将特定于每个模态的特征对齐到共享语义空间中,以促进已见和未见类别之间的知识迁移。然后通过不确定性估计选择最具判别力的模态来进行决策融合。最后,该模型基于类相似性整合跨模态信息,其中辅助模态细化主要模态的预测。 实验结果显示,在多个基准数据集上,我们提出的方法显著优于现有的最先进的(SOTA)方法,有效解决了 MMHCL 任务。
https://arxiv.org/abs/2506.09745
Micro-expression recognition (MER), a critical subfield of affective computing, presents greater challenges than macro-expression recognition due to its brief duration and low intensity. While incorporating prior knowledge has been shown to enhance MER performance, existing methods predominantly rely on simplistic, singular sources of prior knowledge, failing to fully exploit multi-source information. This paper introduces the Multi-Prior Fusion Network (MPFNet), leveraging a progressive training strategy to optimize MER tasks. We propose two complementary encoders: the Generic Feature Encoder (GFE) and the Advanced Feature Encoder (AFE), both based on Inflated 3D ConvNets (I3D) with Coordinate Attention (CA) mechanisms, to improve the model's ability to capture spatiotemporal and channel-specific features. Inspired by developmental psychology, we present two variants of MPFNet--MPFNet-P and MPFNet-C--corresponding to two fundamental modes of infant cognitive development: parallel and hierarchical processing. These variants enable the evaluation of different strategies for integrating prior knowledge. Extensive experiments demonstrate that MPFNet significantly improves MER accuracy while maintaining balanced performance across categories, achieving accuracies of 0.811, 0.924, and 0.857 on the SMIC, CASME II, and SAMM datasets, respectively. To the best of our knowledge, our approach achieves state-of-the-art performance on the SMIC and SAMM datasets.
微表情识别(MER)是情感计算中的一个重要子领域,相较于宏表情识别,它面临更大的挑战,因为微表情持续时间短且强度低。虽然将先验知识纳入模型已被证明能够提升MER性能,但现有的方法主要依赖于单一、简单的先验知识来源,未能充分利用多源信息的优势。本文提出了一个名为多先验融合网络(MPFNet)的框架,并采用了一种渐进式训练策略来优化MER任务。我们设计了两个互补的编码器:通用特征编码器(GFE)和高级特征编码器(AFE),两者均基于配备了坐标注意力(CA)机制的膨胀3D卷积网络(I3D),以此提升模型捕捉时空特征和通道特异性特征的能力。 受发展心理学的启发,本文提出了MPFNet框架下的两种变体:MPFNet-P与MPFNet-C。这两种变体分别对应于婴儿认知发展中两大基本模式——并行处理和层次化处理方式,从而能够评估不同先验知识整合策略的效果。 实验结果表明,在SMIC、CASME II及SAMM数据集上,我们的方法能够显著提升MER的准确率,并且在各类别之间保持均衡表现。具体而言,我们模型在这三个数据集上的准确率分别为0.811、0.924和0.857。据我们所知,在SMIC与SAMM这两个数据集上,我们的方法实现了迄今为止的最佳性能。
https://arxiv.org/abs/2506.09735
Unstructured Knowledge Editing (UKE) is crucial for updating the knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed and tested effective methods, two issues remain: (1) the lack of Locality evaluation for UKE, and (2) the abnormal failure of fine-tuning (FT) based methods on UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from both the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how FT-based methods should be trained to perform well on the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE also performs strongly, and its advantage over SOTA methods increases as the batch size grows, expanding the average metric lead from +6.78% to +10.80%.
无结构知识编辑(UKE)对于更新大规模语言模型(LLM)的相关知识至关重要。它专注于处理不规则输入,例如长篇或自由形式的文本,这些是现实世界中知识常见的表现形式。尽管之前的研究所提出的有效方法已经被测试过,但仍然存在一些问题:(1) 缺乏针对UKE的局部性评估;(2) 基于微调(FT)的方法在进行UKE时出现异常失败的情况。 为了解决这些问题,我们首先构建了两个数据集——UnKEBench-Loc 和 AKEW-Loc (CF),通过从无结构和有结构视角扩展现有的两个 UKE 数据集,并加入局部性测试数据。这使得对编辑后模型的局部性能进行系统评估成为可能。 此外,我们识别出了四个可能影响基于微调方法表现的因素。根据这些因素,我们进行了实验以确定用于UKE任务时高性能的基于微调的方法应该如何训练,为未来的研究提供了一个训练配方。我们的实验结果显示,在最佳设置下的基于微调方法(FT-UKE)表现出令人惊讶的强大性能,并且超越了现有的最先进的技术(SOTA)。在批量编辑场景中,FT-UKE 也展示了强大的表现能力,随着批处理规模的增大,其相较于 SOTA 方法的优势也在增加,使平均度量指标领先从 +6.78% 提升到 +10.80%。
https://arxiv.org/abs/2506.09672
It is important for Large Language Models to be aware of the boundary of their knowledge, that is, to be able to distinguish known from unknown queries. This awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or abstaining from answering, which benefits the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine whether the model is able to address a given query without generating any tokens. To this end, we introduce a novel, training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that Internal Confidence can outperform several baselines. Furthermore, we show that the proposed method can be used for efficient RAG and model cascading, reducing inference costs while maintaining performance.
大型语言模型了解其知识边界,即具备识别已知和未知查询的能力,是非常重要的。这种意识有助于模型进行自适应推理,比如调用检索增强生成(RAG)、开展慢而深入的思考或采用弃权机制,这有利于高效且可信的人工智能的发展。在这项工作中,我们提出了一种通过查询级别的不确定性来检测知识边界的方法,旨在确定模型是否能够在不生成任何标记的情况下解决给定的查询。为此,我们引入了一个新颖且无需训练的方法——“内部置信度”,它利用了跨层和标记的自我评估。 在事实问答和数学推理任务上的实证结果表明,我们的内部置信度方法能够超越多个基线。此外,我们展示了所提出的方法可以用于高效的RAG和模型级联,从而在降低推理成本的同时保持性能水平。
https://arxiv.org/abs/2506.09669
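The adaptive-inference use case in the abstract above (deciding before generation whether to answer directly or to fall back to RAG or a larger model) can be sketched as a simple threshold rule. This is only an illustration under stated assumptions, not the paper's Internal Confidence method: the per-layer, per-token `scores` grid, the weighted-average aggregation, the `threshold` value, and the route labels are all hypothetical.

```python
def aggregate_confidence(scores, weights=None):
    """Weighted average of self-evaluation scores.

    scores[l][t]: hypothetical confidence in [0, 1] at layer l, token t.
    weights: optional grid of the same shape; uniform weights if omitted.
    """
    flat_scores = [s for row in scores for s in row]
    if weights is None:
        flat_weights = [1.0] * len(flat_scores)
    else:
        flat_weights = [w for row in weights for w in row]
    total_weight = sum(flat_weights)
    return sum(s * w for s, w in zip(flat_scores, flat_weights)) / total_weight


def route(confidence, threshold=0.5):
    """Choose an inference path from a query-level confidence score."""
    if confidence >= threshold:
        return "answer_directly"
    return "use_rag_or_larger_model"


# Toy example: two layers, two query tokens.
confidence = aggregate_confidence([[0.9, 0.7], [0.8, 0.6]])
decision = route(confidence)
```

Because the score is computed from the query alone, the expensive path is only taken for queries the model is likely not to know, which is where the claimed savings in inference cost would come from.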