In real-world scenarios, it is desirable for embodied agents to have the ability to leverage human language to gain explicit or implicit knowledge for learning tasks. Despite recent progress, most previous approaches adopt simple low-level instructions as language inputs, which may not reflect natural human communication. It remains unclear how to incorporate rich language use to facilitate task learning. To address this question, this paper studies different types of language inputs for facilitating reinforcement learning (RL) in embodied agents. More specifically, we examine how different levels of language informativeness (i.e., feedback on past behaviors and future guidance) and diversity (i.e., variation of language expressions) impact agent learning and inference. Our empirical results on four RL benchmarks demonstrate that agents trained with diverse and informative language feedback achieve enhanced generalization and fast adaptation to new tasks. These findings highlight the pivotal role of language use in teaching embodied agents new tasks in an open world. Project website: this https URL
https://arxiv.org/abs/2410.24218
Deep Neural Networks (DNNs) are well known to act as over-parameterized deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers have also proposed extremely compact, under-parameterized image priors (e.g., deep decoder) that are strikingly competent for image restoration, despite a loss of accuracy. These two extremes prompt us to ask whether a better solution exists in the middle: between over- and under-parameterized image priors, can one identify "intermediate" parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability? Drawing inspiration from the lottery ticket hypothesis (LTH), we conjecture and study a novel "lottery image prior" (LIP) by exploiting the inherent sparsity of DNNs, stated as: given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation to match the original DNN's performance when applied as a prior to various image inverse problems. Our results validate the superiority of LIPs: we can successfully locate LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. We also extend LIP to compressive sensing image reconstruction, where a pre-trained GAN generator is used as the prior (in contrast to an untrained DIP or deep decoder), and confirm its validity in this setting too. To the best of our knowledge, this is the first time that LTH has been demonstrated to be relevant in the context of inverse problems or image priors.
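To make the LTH-style procedure concrete, here is a minimal PyTorch sketch (names and the global-threshold choice are our own illustration, not the paper's exact recipe) of locating a sparse subnetwork by magnitude pruning; the surviving weights would then be retrained in isolation as the sparse image prior:

```python
import torch

def magnitude_prune_mask(model, sparsity):
    """Keep the largest-magnitude weights and zero the rest; retraining the
    survivors in isolation yields the conjectured 'lottery image prior'.
    `sparsity` is the fraction of weights to remove (e.g., 0.9)."""
    all_w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_w, sparsity)       # global pruning threshold
    return [(p.detach().abs() > threshold).float() for p in model.parameters()]
```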
https://arxiv.org/abs/2410.24187
Federated Learning (FL) is a form of distributed learning that allows multiple institutions or clients to collaboratively learn a global model to solve a task. This allows the model to utilize the information from every institution while preserving data privacy. However, recent studies show that the promise of protecting the privacy of data is not upheld by existing methods, and that it is possible to recreate the training data from the different institutions. This is done by exploiting gradients transferred between the clients and the global server during training, or by knowing the model architecture at the client end. In this paper, we propose a federated learning framework for semantic segmentation that neither exposes the model architecture nor transfers gradients between the client and the server, thus enabling better privacy preservation. We propose BlackFed - a black-box adaptation of neural networks that utilizes zero order optimization (ZOO) to update the client model weights and first order optimization (FOO) to update the server weights. We evaluate our approach on several computer vision and medical imaging datasets to demonstrate its effectiveness. To the best of our knowledge, this work is one of the first to employ federated learning for segmentation without exchanging gradients or model information. Code: this https URL
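As a rough illustration of the zero order optimization used for the client update, here is a generic two-point (SPSA-style) estimator under our own assumptions; BlackFed's exact procedure may differ:

```python
import torch

def zoo_client_step(model, loss_fn, x, y, lr=1e-3, mu=1e-3):
    """Update client weights without backpropagation: probe the loss along a
    random direction and step against the estimated directional derivative."""
    params = torch.nn.utils.parameters_to_vector(model.parameters())
    u = torch.randn_like(params)                      # random probe direction

    def loss_at(p):
        torch.nn.utils.vector_to_parameters(p, model.parameters())
        with torch.no_grad():
            return loss_fn(model(x), y).item()

    g = (loss_at(params + mu * u) - loss_at(params - mu * u)) / (2 * mu)
    torch.nn.utils.vector_to_parameters(params - lr * g * u, model.parameters())
```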
https://arxiv.org/abs/2410.24181
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
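The abstract does not spell out the training objective, but a generic (rectified-flow-style) flow matching loss over action chunks, conditioned on the VLM's observation embedding, might look like the following sketch; all names and the conditioning signature are illustrative assumptions:

```python
import torch

def flow_matching_loss(velocity_net, actions, obs_emb):
    """Train v_theta to predict the straight-line velocity from noise to data."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)  # random time
    x_t = (1 - t) * noise + t * actions       # point on the interpolation path
    target_v = actions - noise                # constant velocity of that path
    pred_v = velocity_net(x_t, t, obs_emb)    # assumed conditioning interface
    return torch.nn.functional.mse_loss(pred_v, target_v)
```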
https://arxiv.org/abs/2410.24164
The cooperative driving technology of Connected and Autonomous Vehicles (CAVs) is crucial for improving the efficiency and safety of transportation systems. Learning-based methods, such as Multi-Agent Reinforcement Learning (MARL), have demonstrated strong capabilities in cooperative decision-making tasks. However, existing MARL approaches still face challenges in terms of learning efficiency and performance. In recent years, Large Language Models (LLMs) have rapidly advanced and shown remarkable abilities in various sequential decision-making tasks. To enhance the learning capabilities of cooperative agents while ensuring decision-making efficiency and cost-effectiveness, we propose LDPD, a language-driven policy distillation method for guiding MARL exploration. In this framework, an LLM-based teacher agent trains smaller student agents to achieve cooperative decision-making through its own decision-making demonstrations. The teacher agent enhances the observation information of CAVs and utilizes LLMs to perform complex cooperative decision-making reasoning, while also leveraging carefully designed decision-making tools to achieve expert-level decisions, providing high-quality teaching experiences. The student agents then refine the teacher's prior knowledge into their own models through gradient policy updates. Experiments demonstrate that the students can rapidly improve their capabilities with minimal guidance from the teacher and eventually surpass the teacher's performance, and extensive comparisons show that our approach achieves better performance and learning efficiency than baseline methods.
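A minimal sketch of the student side of such language-driven distillation, assuming the LLM teacher has already produced demonstration actions for a batch of observations (the paper additionally interleaves this with policy-gradient updates; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def distill_step(student, observations, teacher_actions, optimizer):
    """One supervised distillation update: push the student policy toward
    the teacher's demonstrated discrete actions."""
    logits = student(observations)                    # (batch, n_actions)
    loss = F.cross_entropy(logits, teacher_actions)   # imitate teacher decisions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```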
https://arxiv.org/abs/2410.24152
The rise of large language models (LLMs) has revolutionized user interactions with knowledge-based systems, enabling chatbots to synthesize vast amounts of information and assist with complex, exploratory tasks. However, LLM-based chatbots often struggle to provide personalized support, particularly when users start with vague queries or lack sufficient contextual information. This paper introduces the Collaborative Assistant for Personalized Exploration (CARE), a system designed to enhance personalization in exploratory tasks by combining a multi-agent LLM framework with a structured user interface. CARE's interface consists of a Chat Panel, Solution Panel, and Needs Panel, enabling iterative query refinement and dynamic solution generation. The multi-agent framework collaborates to identify both explicit and implicit user needs, delivering tailored, actionable solutions. In a within-subject user study with 22 participants, CARE was consistently preferred over a baseline LLM chatbot, with users praising its ability to reduce cognitive load, inspire creativity, and provide more tailored solutions. Our findings highlight CARE's potential to transform LLM-based systems from passive information retrievers to proactive partners in personalized problem-solving and exploration.
https://arxiv.org/abs/2410.24032
Introduction: Tracing the spread of ideas and the presence of influence is a question of special importance across a wide range of disciplines, ranging from intellectual history to cultural analytics, computational social science, and the science of science. Method: We collect a corpus of open source journal articles, generate Knowledge Graph representations using the Gemini LLM, and attempt to predict the existence of citations between sampled pairs of articles using previously published methods and a novel Graph Neural Network-based embedding model. Results: We demonstrate that our knowledge graph embedding method is superior at distinguishing pairs of articles with and without citation. Once trained, it runs efficiently and can be fine-tuned on specific corpora to suit individual researcher needs. Conclusion: This experiment demonstrates that the relationships encoded in a knowledge graph, especially the types of concepts brought together by specific relations, can encode information capable of revealing intellectual influence. This suggests that further work analyzing document-level knowledge graphs to understand latent structures could provide valuable insights.
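A minimal sketch of the pair-classification setup described in the Method, assuming some GNN encoder that maps a document-level knowledge graph to a fixed-size embedding (this is our own simplification, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class CitationPairScorer(nn.Module):
    """Embed each article's knowledge graph with a (given) GNN encoder,
    then score whether a citation exists between the pair."""
    def __init__(self, gnn_encoder, dim):
        super().__init__()
        self.gnn = gnn_encoder   # assumed: maps a graph -> (dim,) vector
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, graph_a, graph_b):
        z = torch.cat([self.gnn(graph_a), self.gnn(graph_b)], dim=-1)
        return torch.sigmoid(self.head(z))   # P(citation between a and b)
```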
https://arxiv.org/abs/2410.24021
Synthetic data generation is gaining increasing popularity in different computer vision applications. Existing state-of-the-art face recognition models are trained using large-scale face datasets, which are crawled from the Internet and raise privacy and ethical concerns. To address such concerns, several works have proposed generating synthetic face datasets to train face recognition models. However, these methods depend on generative models, which are themselves trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study whether any of the existing synthetic face recognition datasets leak information from the real data used to train the generator model. We provide an extensive study of 6 state-of-the-art synthetic face recognition datasets, and show that in all of them, several samples from the original real dataset are leaked. To our knowledge, this paper is the first work to show leakage from the training data of generator models into the generated synthetic face recognition datasets. Our study demonstrates privacy pitfalls in synthetic face recognition datasets and paves the way for future studies on generating responsible synthetic face datasets.
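One simple membership-inference signal consistent with this setup, shown here as a hedged sketch rather than the paper's actual attack, is the maximum similarity between each real training face embedding and the synthetic set:

```python
import numpy as np

def membership_scores(real_embs, synth_embs):
    """For each real face embedding, the max cosine similarity to any
    synthetic sample; unusually high scores suggest the generator leaked
    that sample or identity into the synthetic dataset."""
    r = real_embs / np.linalg.norm(real_embs, axis=1, keepdims=True)
    s = synth_embs / np.linalg.norm(synth_embs, axis=1, keepdims=True)
    return (r @ s.T).max(axis=1)   # (n_real,) leakage scores
```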
https://arxiv.org/abs/2410.24015
Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. Consequently, it is intuitive to leverage the wealth of annotations in 2D images to alleviate the inherent data scarcity in OV-3Det. In this paper, we push the task setup to its limits by exploring the potential of using solely 2D images to learn OV-3Det. The major challenge for this setup is the modality gap between training images and testing point clouds, which prevents effective integration of 2D knowledge into OV-3Det. To address this challenge, we propose a novel framework, ImOV3D, that leverages pseudo multimodal representations containing both images and point clouds (PC) to close the modality gap. The key to ImOV3D lies in flexible modality conversion, where 2D images can be lifted into 3D using monocular depth estimation and can also be derived from 3D scenes through rendering. This allows unifying both training images and testing point clouds into a common image-PC representation, encompassing a wealth of 2D semantic information and also incorporating the depth and structural characteristics of 3D spatial data. We carefully conduct this conversion to minimize the domain gap between training and test cases. Extensive experiments on two benchmark datasets, SUNRGBD and ScanNet, show that ImOV3D significantly outperforms existing methods, even in the absence of ground truth 3D training data. With the inclusion of a minimal amount of real 3D data for fine-tuning, the performance also significantly surpasses the previous state-of-the-art. Code and pre-trained models are released at this https URL.
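The 2D-to-3D half of the modality conversion can be illustrated with standard pinhole back-projection of a monocular depth map; this is a generic sketch (the depth itself would come from an off-the-shelf estimator, and ImOV3D's full pipeline also renders 3D scenes back to images):

```python
import numpy as np

def lift_to_point_cloud(depth, intrinsics):
    """Back-project an (H, W) depth map into camera-frame 3D points using
    the pinhole model; intrinsics = (fx, fy, cx, cy)."""
    H, W = depth.shape
    fx, fy, cx, cy = intrinsics
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) pseudo cloud
```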
https://arxiv.org/abs/2410.24001
Surgical scene segmentation is essential for enhancing surgical precision, yet it is frequently compromised by the scarcity and imbalance of available data. To address these challenges, semantic image synthesis methods based on generative adversarial networks and diffusion models have been developed. However, these models often yield non-diverse images and fail to capture small, critical tissue classes, limiting their effectiveness. In response, we propose the Class-Aware Semantic Diffusion Model (CASDM), a novel approach which utilizes segmentation maps as conditions for image synthesis to tackle data scarcity and imbalance. We define novel class-aware mean squared error and class-aware self-perceptual loss functions that prioritize critical, less visible classes, thereby enhancing image quality and relevance. Furthermore, to our knowledge, we are the first to generate multi-class segmentation maps from text prompts that specify their contents. These maps are then used by CASDM to generate surgical scene images, enhancing datasets for training and validating segmentation models. Our evaluation, which assesses both image quality and downstream segmentation performance, demonstrates the strong effectiveness and generalisability of CASDM in producing realistic image-map pairs, significantly advancing surgical scene segmentation across diverse and challenging datasets.
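As an illustration of the class-aware weighting idea (our own simplified formulation, not necessarily CASDM's exact loss), a per-pixel MSE reweighted by per-class importance could be written as:

```python
import torch

def class_aware_mse(pred, target, seg_map, class_weights):
    """Reweight the per-pixel reconstruction error by a per-class weight
    (higher for rare/small tissue classes). `pred`/`target` are (B, C, H, W),
    `seg_map` is (B, H, W) long, `class_weights` maps class id -> weight."""
    per_pixel = (pred - target).pow(2).mean(dim=1)   # (B, H, W)
    weights = class_weights[seg_map]                 # weight looked up per pixel
    return (weights * per_pixel).mean()
```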
https://arxiv.org/abs/2410.23962
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs, without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then split and expanded by the Extender. This process results in a new, longer response, which is used to train both the Generator and the Extender iteratively. Through this process, the models are progressively trained to handle increasingly longer responses. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation when applied to top open-source LLMs such as Qwen2 and LLaMA3. Our code is publicly available at this https URL.
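Schematically, one Generator/Extender round might look like the following sketch, where `generate`, `split`, `expand`, and `finetune` are placeholders for the actual prompting and training machinery:

```python
def self_lengthen_round(generator, extender, queries):
    """One iteration of the loop: draft, split and expand, then train both
    roles on the new, longer responses (all helpers are schematic)."""
    training_pairs = []
    for q in queries:
        draft = generator.generate(q)                 # initial, shorter response
        parts = extender.split(draft)                 # break draft into segments
        longer = "".join(extender.expand(q, part) for part in parts)
        training_pairs.append((q, longer))            # longer supervision target
    generator.finetune(training_pairs)                # both roles improve per round
    extender.finetune(training_pairs)
```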
https://arxiv.org/abs/2410.23933
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from \textit{capability} to \textit{availability}, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce \textbf{BitStack}, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at this https URL.
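A heavily simplified sketch of the decomposition idea, storing roughly one bit per parameter per block (BitStack's actual method weights parameters by significance and uses a more careful factorization):

```python
import torch

def bitstack_decompose(W, n_blocks=8):
    """Greedy sign-residual decomposition: each block stores a sign matrix
    plus one shared scale; later blocks refine what earlier ones missed."""
    blocks, residual = [], W.clone()
    for _ in range(n_blocks):
        scale = residual.abs().mean()                 # one scalar per block
        signs = residual.sign()
        blocks.append((scale, signs.to(torch.int8)))  # ~1 bit per parameter
        residual = residual - scale * signs           # left for later blocks
    return blocks

def bitstack_reconstruct(blocks, k):
    """Rebuild the weight from the first k blocks: loading more blocks trades
    memory for reconstruction quality."""
    return sum(scale * signs.float() for scale, signs in blocks[:k])
```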
https://arxiv.org/abs/2410.23918
Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that rely on aligning visual encoders with large Vision-Language Models (VLMs) to tap into the extensive knowledge of VLMs require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen-class labels. To address these challenges, we introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain seen-class labels alone, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes instead of unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs utilized to highlight the differences between unseen and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters of existing methods. Code is available at this https URL.
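The learnable-prompt building block underlying such adaptation can be sketched as trainable context tokens prepended to frozen class-name embeddings; this is a generic CoOp-style module, not EZ-HOI's full design:

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """A few trainable context tokens are prepended to the (frozen) VLM text
    embeddings of each class name; only the context tokens are updated."""
    def __init__(self, n_ctx, dim):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_token_embs):
        # class_token_embs: (n_cls, L, dim), kept frozen during training
        n_cls = class_token_embs.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, class_token_embs], dim=1)  # (n_cls, n_ctx+L, dim)
```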
https://arxiv.org/abs/2410.23904
In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the topological ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning and often fail when the narratives are long and contain many events. Additionally, we show LLMs appear to rely heavily on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the-art models and can pave the way for future techniques to enhance causal reasoning in LLMs.
https://arxiv.org/abs/2410.23884
Large Language Models (LLMs) have shown remarkable reasoning capabilities on complex tasks, but they still suffer from out-of-date knowledge, hallucinations, and opaque decision-making. In contrast, Knowledge Graphs (KGs) can provide explicit and editable knowledge for LLMs to alleviate these issues. The existing paradigm of KG-augmented LLMs manually predefines the breadth of the exploration space and requires flawless navigation in KGs. However, this paradigm cannot adaptively explore reasoning paths in KGs based on the question semantics or self-correct erroneous reasoning paths, resulting in a bottleneck in efficiency and effectiveness. To address these limitations, we propose a novel self-correcting adaptive planning paradigm for KG-augmented LLMs named Plan-on-Graph (PoG), which first decomposes the question into several sub-objectives and then repeats the process of adaptively exploring reasoning paths, updating memory, and reflecting on the need to self-correct erroneous reasoning paths until arriving at the answer. Specifically, three important mechanisms, Guidance, Memory, and Reflection, are designed to work together to guarantee the adaptive breadth of self-correcting planning for graph reasoning. Finally, extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of PoG.
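The interplay of the three mechanisms can be sketched as a loop, with every helper a placeholder for an LLM call over the KG (schematic only, not the paper's implementation):

```python
def plan_on_graph(question, kg, llm, max_steps=10):
    """Schematic PoG loop: decompose into sub-objectives, then alternate
    adaptive path exploration, memory updates, and reflection-driven
    self-correction until an answer is reached."""
    sub_goals = llm.decompose(question)              # Guidance
    memory, paths = [], []
    for _ in range(max_steps):
        paths = llm.explore(kg, sub_goals, memory)   # adaptive breadth
        memory.append(llm.summarize(paths))          # Memory
        verdict = llm.reflect(question, memory)      # Reflection
        if verdict.answerable:
            return verdict.answer
        if verdict.needs_correction:
            paths = llm.backtrack(paths, memory)     # self-correct bad paths
    return None
```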
https://arxiv.org/abs/2410.23875
The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine or via a knowledge store provided by the organisers. Submissions are evaluated using the AVeriTeC score, which considers a claim to be accurately verified if and only if the verdict is correct and the retrieved evidence meets a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways.
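A sketch of the scoring rule as described; the official metric scores evidence quality with a specific matching measure and cutoff, so `evidence_quality` and `quality_threshold` here are placeholders:

```python
def averitec_score(examples, quality_threshold):
    """A claim counts as accurately verified only if its verdict is correct
    AND its retrieved evidence meets the quality threshold."""
    verified = sum(
        1 for ex in examples
        if ex["verdict_correct"] and ex["evidence_quality"] >= quality_threshold
    )
    return verified / len(examples)
```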
https://arxiv.org/abs/2410.23850
Knowledge editing technology is crucial for maintaining the accuracy and timeliness of large language models (LLMs). However, the setting of this task overlooks a significant portion of real-world commonsense knowledge expressed as free-text, characterized by broad knowledge scope, long content, and non-instantiation. The editing objects of previous methods (e.g., MEMIT) were single tokens or entities, which are not suitable for commonsense knowledge in free-text form. To address these challenges, we conducted experiments from two perspectives: knowledge localization and knowledge editing. First, we introduce the Knowledge Localization for Free-Text (KLFT) method, revealing the challenges posed by the spread of commonsense knowledge across MLP and Attention layers and by its decentralized distribution. Next, we propose a Dynamics-aware Editing Method (DEM), which uses a Dynamics-aware Module to locate the parameter positions corresponding to commonsense knowledge and a Knowledge Editing Module to update that knowledge. DEM fully exploits the potential of the MLP and Attention layers, and successfully edits commonsense knowledge in free-text form. The experimental results indicate that DEM achieves excellent editing performance.
https://arxiv.org/abs/2410.23844
Knowledge editing technology has received widespread attention for low-cost updates of incorrect or outdated knowledge in large-scale language models. However, recent research has found that edited models often exhibit varying degrees of performance degradation, and neither the reasons behind this phenomenon nor potential solutions have been provided. To investigate the causes of this performance decline and to optimize the editing method, this work explores the underlying reasons from both the data and model perspectives. Specifically, 1) from a data perspective, to clarify the impact of data on the performance of edited models, we first construct a Multi-Question Dataset (MQD) to evaluate how different types of editing data affect model performance; experiments show that the performance of the edited model is mainly affected by the diversity of editing targets and sequence length. 2) From a model perspective, we explore the factors that affect the performance of edited models. The results indicate a strong correlation between the L1-norm of the edited model layer and editing accuracy, and identify this as an important factor behind the editing-performance bottleneck. Finally, to improve the performance of the edited model, we further propose a Dump for Sequence (D4S) method, which successfully overcomes the previous editing bottleneck by reducing the L1-norm of the editing layer, allowing users to perform multiple effective edits while minimizing model damage. Our code is available at this https URL.
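The diagnostic the paper correlates with editing accuracy is easy to monitor during sequential edits; a minimal PyTorch sketch with illustrative names:

```python
import torch

def edited_layer_l1(model, layer_name):
    """L1-norm of one layer's weight; growth of this quantity in the edited
    layer is the factor identified above as tracking the editing bottleneck."""
    weight = dict(model.named_parameters())[layer_name]
    return weight.abs().sum().item()
```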
https://arxiv.org/abs/2410.23843
Deep neural networks (DNNs) excel at learning from static datasets but struggle with continual learning, where data arrives sequentially. Catastrophic forgetting, the phenomenon of forgetting previously learned knowledge, is a primary challenge. This paper introduces EXponentially Averaged Class-wise Feature Significance (EXACFS) to mitigate this issue in the class incremental learning (CIL) setting. By estimating the significance of model features for each learned class using loss gradients, gradually aging the significance through the incremental tasks, and preserving the significant features through a distillation loss, EXACFS effectively balances remembering old knowledge (stability) and learning new knowledge (plasticity). Extensive experiments on CIFAR-100 and ImageNet-100 demonstrate EXACFS's superior performance in preserving stability while acquiring plasticity.
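A simplified sketch of the two ingredients described above, gradient-based feature significance with exponential aging plus a significance-weighted distillation term; this is our own illustrative formulation, not the paper's exact equations:

```python
import torch

def update_significance(sig_ema, features, loss, decay=0.9):
    """Estimate per-feature significance from loss gradients, then
    exponentially average it so older estimates 'age' across tasks.
    `features` must be a tensor inside the autograd graph of `loss`."""
    grads = torch.autograd.grad(loss, features, retain_graph=True)[0]
    score = (grads * features).abs().mean(dim=0)   # per-feature significance
    return decay * sig_ema + (1 - decay) * score

def significance_distill_loss(feats_new, feats_old, sig):
    """Preserve significant features: drift is penalized in proportion to sig."""
    return (sig * (feats_new - feats_old.detach()).pow(2)).mean()
```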
https://arxiv.org/abs/2410.23751
What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradients, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences in gradients across layers than slow thinking (detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and offers novel insights into its efficiency and stability, paving the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found at: this https URL.
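The basic measurement behind such a study, per-layer gradient norms for a given response's loss, can be sketched as follows (illustrative, not the paper's exact instrumentation):

```python
import torch

def layerwise_grad_norms(model, loss):
    """Backpropagate once and record the gradient norm of every parameter
    tensor, giving the layer-wise pattern compared across fast vs. slow
    thinking responses."""
    model.zero_grad()
    loss.backward()
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}
```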
https://arxiv.org/abs/2410.23743