It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
https://arxiv.org/abs/2504.02827
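The vanishing-variance effect described in the abstract above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: queries, keys, and values are i.i.d. standard Gaussian vectors, there is a single attention head, and the helper names (`attention_output`, `layer_norm`, `mean_feature_variance`) are hypothetical. Under these assumptions the attention output is a weighted average of `n` independent value vectors, so its per-feature variance shrinks as `n` grows, while layer normalization rescales each output back to roughly unit variance.

```python
import math
import random
import statistics


def attention_output(n, d, rng):
    """Scaled dot-product attention output for one random query over n tokens."""
    query = [rng.gauss(0, 1) for _ in range(d)]
    keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Weighted average of the value vectors, one entry per feature.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]


def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale or bias."""
    mean = statistics.fmean(x)
    var = statistics.pvariance(x, mean)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def mean_feature_variance(n, d=16, trials=50, normalize=False, seed=0):
    """Average variance across output features for sequence length n."""
    rng = random.Random(seed)
    variances = []
    for _ in range(trials):
        out = attention_output(n, d, rng)
        if normalize:
            out = layer_norm(out)
        variances.append(statistics.pvariance(out))
    return statistics.fmean(variances)


if __name__ == "__main__":
    for n in (16, 128, 512):
        raw = mean_feature_variance(n)
        normed = mean_feature_variance(n, normalize=True)
        print(f"n={n:4d}  raw variance={raw:.4f}  after layer norm={normed:.4f}")
```

In this toy setting the raw variance falls as the sequence grows, which is the distribution shift the abstract refers to; the normalized variance stays close to one regardless of length.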
Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: this https URL.
https://arxiv.org/abs/2504.02801
The recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) model combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at this https URL.
https://arxiv.org/abs/2504.02782
The rise of Generative AI, and Large Language Models (LLMs) in particular, is fundamentally changing cognitive processes in knowledge work, raising critical questions about their impact on human reasoning and problem-solving capabilities. As these AI systems become increasingly integrated into workflows, they offer unprecedented opportunities for augmenting human thinking while simultaneously risking cognitive erosion through passive consumption of generated answers. This tension is particularly pronounced in open-ended tasks, where effective solutions require deep contextualization and integration of domain knowledge. Unlike structured tasks with established metrics, measuring the quality of human-LLM interaction in such open-ended tasks poses significant challenges due to the absence of ground truth and the iterative nature of solution development. To address this, we present a framework that analyzes interaction patterns along two dimensions: cognitive activity mode (exploration vs. exploitation) and cognitive engagement mode (constructive vs. detrimental). This framework provides systematic measurements to evaluate when LLMs are effective tools for thought rather than substitutes for human cognition, advancing theoretical understanding and practical guidance for developing AI systems that protect and augment human cognitive capabilities.
https://arxiv.org/abs/2504.02780
Natural language instructions are often abstract and complex, requiring robots to execute multiple subtasks even for seemingly simple queries. For example, when a user asks a robot to prepare avocado toast, the task involves several sequential steps. Moreover, such instructions can be ambiguous or infeasible for the robot or may exceed the robot's existing knowledge. While Large Language Models (LLMs) offer strong language reasoning capabilities to handle these challenges, effectively integrating them into robotic systems remains a key challenge. To address this, we propose BT-ACTION, a test-driven approach that combines the modular structure of Behavior Trees (BT) with LLMs to generate coherent sequences of robot actions for following complex user instructions, specifically in the context of preparing recipes in a kitchen-assistance setting. We evaluated BT-ACTION in a comprehensive user study with 45 participants, comparing its performance to direct LLM prompting. Results demonstrate that the modular design of BT-ACTION helped the robot make fewer mistakes and increased user trust, and participants showed a significant preference for the robot leveraging BT-ACTION. The code is publicly available at this https URL.
https://arxiv.org/abs/2504.02779
The spread of scientific knowledge depends on how researchers discover and cite previous work. The adoption of large language models (LLMs) in the scientific research process introduces a new layer to these citation practices. However, it remains unclear to what extent LLMs align with human citation practices, how they perform across domains, and how they may influence citation dynamics. Here, we show that LLMs systematically reinforce the Matthew effect in citations by consistently favoring highly cited papers when generating references. This pattern persists across scientific domains despite significant field-specific variations in existence rates, which refer to the proportion of generated references that match existing records in external bibliometric databases. Analyzing 274,951 references generated by GPT-4o for 10,000 papers, we find that LLM recommendations diverge from traditional citation patterns by preferring more recent references with shorter titles and fewer authors. At the content level, the generated references are semantically aligned with the content of each paper at levels comparable to the ground truth references and display similar network effects while reducing author self-citations. These findings illustrate how LLMs may reshape citation practices and influence the trajectory of scientific discovery by reflecting and amplifying established trends. As LLMs become more integrated into the scientific research process, it is important to understand their role in shaping how scientific communities discover and build upon prior work.
https://arxiv.org/abs/2504.02767
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose the Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o. Improvements for recent reasoning models are similar, e.g., 36% and 37.5% for Qwen2.5-32B and Deepseek-R1-70B, respectively. KGoT offers a scalable, affordable, and high-performing solution for AI assistants.
https://arxiv.org/abs/2504.02670
Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
https://arxiv.org/abs/2504.02666
We propose a learning architecture that allows symbolic control and guidance in reinforcement learning with deep neural networks. We introduce SymDQN, a novel modular approach that augments the existing Dueling Deep Q-Networks (DuelDQN) architecture with modules based on the neuro-symbolic framework of Logic Tensor Networks (LTNs). The modules guide action policy learning and allow reinforcement learning agents to display behaviour consistent with reasoning about the environment. Our experiment is an ablation study performed on the modules. It is conducted in a reinforcement learning environment of a 5x5 grid navigated by an agent that encounters various shapes, each associated with a given reward. The underlying DuelDQN attempts to learn the optimal behaviour of the agent in this environment, while the modules facilitate shape recognition and reward prediction. We show that our architecture significantly improves learning, both in terms of performance and the precision of the agent. The modularity of SymDQN allows reflecting on the intricacies and complexities of combining neural and symbolic approaches in reinforcement learning.
https://arxiv.org/abs/2504.02654
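The Dueling DQN backbone mentioned in the SymDQN abstract above combines a state-value stream and an advantage stream into Q-values. The sketch below shows only the standard mean-subtracted aggregation from the dueling architecture; it does not model the Logic Tensor Network modules, and the function names are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def greedy_action(q_values):
    """Index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


if __name__ == "__main__":
    # Three hypothetical actions for one state.
    q = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
    print(q, greedy_action(q))  # [3.0, 2.0, 1.0] 0
```

Subtracting the mean advantage keeps the two streams identifiable: shifting every advantage by a constant leaves the Q-values unchanged.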
Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which builds sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
https://arxiv.org/abs/2504.02620
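The task-vector operations named in the TaLoS abstract above (addition, negation, and sparse updates restricted to low-sensitivity parameters) can be sketched on toy scalar parameters. This is an illustration of generic task arithmetic, not the TaLoS training procedure: the dictionary-of-floats model, the `keep_fraction` selection rule, and all function names are assumptions made for the example.

```python
def task_vector(pretrained, finetuned):
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}


def negate(vector):
    """Negated task vector, used to remove a task's behaviour."""
    return {name: -delta for name, delta in vector.items()}


def apply_task_vectors(pretrained, vectors, scale=1.0):
    """Add one or more (scaled) task vectors to the pre-trained weights."""
    edited = dict(pretrained)
    for vector in vectors:
        for name, delta in vector.items():
            edited[name] += scale * delta
    return edited


def low_sensitivity_mask(sensitivity, keep_fraction):
    """Names of the parameters with the smallest gradient sensitivity (hypothetical rule)."""
    count = max(1, int(len(sensitivity) * keep_fraction))
    return set(sorted(sensitivity, key=sensitivity.get)[:count])


def sparsify(vector, mask):
    """Zero every entry of a task vector that falls outside the mask."""
    return {name: (delta if name in mask else 0.0) for name, delta in vector.items()}


if __name__ == "__main__":
    pretrained = {"a": 1.0, "b": 2.0, "c": 3.0}
    tuned_task_1 = {"a": 1.5, "b": 2.0, "c": 3.0}
    tuned_task_2 = {"a": 1.0, "b": 2.0, "c": 2.0}
    v1 = task_vector(pretrained, tuned_task_1)
    v2 = task_vector(pretrained, tuned_task_2)
    print(apply_task_vectors(pretrained, [v1, v2]))   # task addition
    print(apply_task_vectors(pretrained, [negate(v1)]))  # task negation
```

In this toy case the two task vectors touch disjoint parameters, so adding them causes no interference; that is the situation weight disentanglement is meant to encourage.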
The practical deployment of learning-based autonomous systems would greatly benefit from tools that flexibly obtain safety guarantees in the form of certificate functions from data. While the geometrical properties of such certificate functions are well understood, synthesizing them using machine learning techniques still remains a challenge. To mitigate this issue, we propose a diffeomorphic function learning framework where prior structural knowledge of the desired output is encoded in the geometry of a simple surrogate function, which is subsequently augmented through an expressive, topology-preserving state-space transformation. Thereby, we achieve an indirect function approximation framework that is guaranteed to remain in the desired hypothesis space. To this end, we introduce a novel approach to construct diffeomorphic maps based on RBF networks, which facilitate precise, local transformations around data. Finally, we demonstrate our approach by learning diffeomorphic Lyapunov functions from real-world data and apply our method to different attractor systems.
https://arxiv.org/abs/2504.02607
In this paper, we propose a new geometric approach for knowledge graph completion via low-rank tensor approximation. We augment a pretrained and well-established Euclidean model based on a Tucker tensor decomposition with a novel hyperbolic interaction term. This correction enables more nuanced capturing of distributional properties in data and is better aligned with real-world knowledge graphs. By combining the two geometries, our approach improves the expressivity of the resulting model, achieving new state-of-the-art link prediction accuracy with a significantly lower number of parameters than previous Euclidean and hyperbolic models.
https://arxiv.org/abs/2504.02589
The recent advancements in Deep Learning models and techniques have led to significant strides in performance across diverse tasks and modalities. However, while the overall capabilities of models show promising growth, our understanding of their internal reasoning processes remains limited, particularly concerning systematic inconsistencies, that is, recurring patterns of logical or inferential errors. These inconsistencies may manifest as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. Even detecting and measuring such reasoning discrepancies is challenging, as they may arise from opaque internal procedures, biases and imbalances in training data, or the inherent complexity of the task. Without effective methods to detect, measure, and mitigate these errors, there is a risk of deploying models that are biased, exploitable, or logically unreliable. This thesis aims to address these issues by producing novel methods for deep learning models that reason over knowledge graphs, natural language, and images. The thesis contributes two techniques for detecting and quantifying predictive inconsistencies originating from opaque internal procedures in natural language and image processing models. To mitigate inconsistencies from biases in training data, this thesis presents a data-efficient sampling method to improve fairness and performance and a synthetic dataset generation approach for low-resource scenarios. Finally, the thesis offers two techniques to optimize the models for complex reasoning tasks. These methods enhance model performance while allowing for more faithful and interpretable exploration and exploitation during inference. Critically, this thesis provides a comprehensive framework to improve the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities.
https://arxiv.org/abs/2504.02577
Knowledge distillation has emerged as an effective strategy for compressing large language models' (LLMs) knowledge into smaller, more efficient student models. However, standard one-shot distillation methods often produce suboptimal results due to a mismatch between teacher-generated rationales and the student's specific learning requirements. In this paper, we introduce the UNDO: UNderstanding Distillation as Optimization framework, designed to bridge this gap by iteratively identifying the student's errors and prompting the teacher to refine its explanations accordingly. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales that specifically address these weaknesses. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods, achieving performance gains of up to 20%. Additionally, we show that teacher-generated data refined through our iterative process remains effective even when applied to different student models, underscoring the broad applicability of our approach. Our work fundamentally reframes knowledge distillation as an iterative teacher-student interaction, effectively leveraging dynamic refinement by the teacher for better knowledge distillation.
https://arxiv.org/abs/2504.02521
This paper introduces a self-learning agent that integrates LLaMA 3.2 with a Progressive Neural Network (PNN) for continual learning in conversational AI and code generation. The framework dynamically collects data, fine-tunes tasks with minimal samples, and leverages Meta-Learning for rapid adaptation. LoRA optimizes fine-tuning, while Elastic Weight Consolidation (EWC) enhances knowledge retention. Experimental results demonstrate improved adaptability and memory stability, positioning this approach as a scalable step toward Artificial General Intelligence (AGI).
https://arxiv.org/abs/2504.02489
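The abstract above relies on Elastic Weight Consolidation (EWC) for knowledge retention. A minimal sketch of the standard EWC regularizer follows; it is not the paper's code, and the flat lists of parameters and Fisher values, as well as the function names, are assumptions made for illustration. The penalty is `0.5 * strength * sum(F_i * (theta_i - theta_star_i) ** 2)`, where `F_i` is a diagonal Fisher information estimate from the earlier task.

```python
def ewc_penalty(params, anchor_params, fisher, strength):
    """Quadratic penalty that pulls important parameters toward their old values."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, strength):
    """New-task loss plus the EWC penalty for the previous task."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)


if __name__ == "__main__":
    # The first parameter moved by 1.0 and has Fisher weight 4.0,
    # the second did not move, so only the first contributes.
    print(ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], strength=2.0))  # 4.0
```

Parameters with a large Fisher value are expensive to move, so learning on the new task is pushed toward directions that matter less for the old one.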
We propose a decentralized reinforcement learning solution for multi-agent shepherding of non-cohesive targets using policy-gradient methods. Our architecture integrates target-selection with target-driving through Proximal Policy Optimization, overcoming discrete-action constraints of previous Deep Q-Network approaches and enabling smoother agent trajectories. This model-free framework effectively solves the shepherding problem without prior dynamics knowledge. Experiments demonstrate our method's effectiveness and scalability with increased target numbers and limited sensing capabilities.
https://arxiv.org/abs/2504.02479
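The shepherding abstract above trains its agents with Proximal Policy Optimization. The per-sample clipped surrogate objective at the core of PPO can be sketched as below; the shepherding environment, the networks, and the advantage estimator are not shown, and the function name and the default `clip_eps=0.2` are assumptions for the example.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a | s) / pi_old(a | s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # A large ratio with a positive advantage is capped at (1 + eps) * A.
    print(clipped_surrogate(ratio=1.5, advantage=2.0))
    # A small ratio with a negative advantage is capped at (1 - eps) * A.
    print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The clipping removes the incentive to move the policy far from the one that collected the data. Because the objective works on probability ratios, it applies to continuous action distributions as well as discrete ones.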
Program-guided reasoning has shown promise in complex claim fact-checking by decomposing claims into function calls and executing reasoning programs. However, prior work primarily relies on few-shot in-context learning (ICL) with ad-hoc demonstrations, which limit program diversity and require manual design with substantial domain knowledge. Fundamentally, the underlying principles of effective reasoning program generation still remain underexplored, making it challenging to construct effective demonstrations. To address this, we propose BOOST, a bootstrapping-based framework for few-shot reasoning program generation. BOOST explicitly integrates claim decomposition and information-gathering strategies as structural guidance for program generation, iteratively refining bootstrapped demonstrations in a strategy-driven and data-centric manner without human intervention. This enables a seamless transition from zero-shot to few-shot strategic program-guided learning, enhancing interpretability and effectiveness. Experimental results show that BOOST outperforms prior few-shot baselines in both zero-shot and few-shot settings for complex claim verification.
https://arxiv.org/abs/2504.02467
Recently, Large Language Model (LLM)-empowered recommender systems have revolutionized personalized recommendation frameworks and attracted extensive attention. Despite the remarkable success, existing LLM-empowered RecSys have been demonstrated to be highly vulnerable to minor perturbations. To mitigate the negative impact of such vulnerabilities, one potential solution is to employ collaborative signals based on item-item co-occurrence to purify the user's historical interactions of the malicious collaborative knowledge inserted by attackers. On the other hand, because they can supplement the insufficient internal knowledge of LLMs, Retrieval-Augmented Generation (RAG) techniques provide unprecedented opportunities to enhance the robustness of LLM-empowered recommender systems by introducing external collaborative knowledge. Therefore, in this paper, we propose a novel framework (RETURN) that retrieves external collaborative signals to purify poisoned user profiles and enhance the robustness of LLM-empowered RecSys in a plug-and-play manner. Specifically, retrieval-augmented perturbation positioning is proposed to identify potential perturbations within the users' historical sequences by retrieving external knowledge from collaborative item graphs. After that, we further retrieve the collaborative knowledge to cleanse the perturbations by using either deletion or replacement strategies and introduce a robust ensemble recommendation strategy to generate final robust predictions. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed RETURN.
https://arxiv.org/abs/2504.02458
Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize the human-AI interaction from text interfaces into realistic video chats. However, research on text-driven talking heads remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conventional pipeline not only introduces system complexity and latency overhead but also fundamentally suffers from asynchronous audiovisual output and stylistic discrepancies between generated speech and visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios, while preserving both speech style and facial styles. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between audio and visual outputs. Furthermore, our in-context reference learning module effectively captures both speech and facial style characteristics from a single reference video without introducing an extra style extracting module. To the best of our knowledge, OmniTalker presents the first unified framework that jointly models speech style and facial style in a zero-shot setting, achieving real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
https://arxiv.org/abs/2504.02433
Bayesian networks and causal models provide frameworks for handling queries about external interventions and counterfactuals, enabling tasks that go beyond what probability distributions alone can address. While these formalisms are often informally described as capturing causal knowledge, there is a lack of a formal theory characterizing the type of knowledge required to predict the effects of external interventions. This work introduces the theoretical framework of causal systems to clarify Aristotle's distinction between "knowledge that" and "knowledge why" within artificial intelligence. By interpreting existing artificial intelligence technologies as causal systems, it investigates the corresponding types of knowledge. Furthermore, it argues that predicting the effects of external interventions is feasible only with "knowledge why", providing a more precise understanding of the knowledge necessary for such tasks.
https://arxiv.org/abs/2504.02430
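The claim in the abstract above, that interventions need more than a probability distribution, can be made concrete with a toy Bayesian network. The sketch below is an illustration, not the paper's formalism: the graph (a confounder Z that influences both X and Y, plus an edge from X to Y) and every probability in it are invented for the example. Conditioning on X = 1 uses only the joint distribution, whereas the intervention do(X = 1) also requires knowing that Z causes X and not the other way round.

```python
from itertools import product

# Hypothetical model: Z -> X, Z -> Y, X -> Y, all variables binary.
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}


def p_y1(x, z):
    """P(Y = 1 | X = x, Z = z) in the toy model."""
    return 0.2 + 0.1 * x + 0.6 * z


def joint(z, x, y):
    """Joint probability P(Z = z, X = x, Y = y) from the factorization."""
    p_x = P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]
    p_y = p_y1(x, z) if y == 1 else 1.0 - p_y1(x, z)
    return P_Z[z] * p_x * p_y


def observe(x):
    """P(Y = 1 | X = x): conditioning, computable from the joint alone."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def intervene(x):
    """P(Y = 1 | do(X = x)): cut the Z -> X edge and average over P(Z)."""
    return sum(P_Z[z] * p_y1(x, z) for z in (0, 1))


if __name__ == "__main__":
    print(f"observed:   {observe(1):.2f}")
    print(f"intervened: {intervene(1):.2f}")
```

In this toy model, seeing X = 1 gives P(Y = 1) of about 0.84, because X = 1 is evidence that Z = 1, while setting X = 1 gives about 0.60. The two numbers differ even though both are derived from the same joint distribution, and only the causal structure tells us which adjustment is correct.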