Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
https://arxiv.org/abs/2501.09757
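A minimal sketch of the distillation pattern the DiMA abstract describes: a vision planner trained jointly on a planning loss and a feature-alignment loss against projected features from a frozen multimodal LLM teacher. The module names, dimensions, and the cosine-based distillation term are illustrative assumptions, not DiMA's actual surrogate tasks:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: distill features of a frozen multimodal LLM into a
# lightweight vision-based planner. At inference, only the planner runs.

class VisionPlanner(nn.Module):
    def __init__(self, feat_dim=256, horizon=6):
        super().__init__()
        self.scene_encoder = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, horizon * 2)  # (x, y) waypoints

    def forward(self, scene):
        feats = self.scene_encoder(scene)
        return feats, self.head(feats)

def joint_loss(planner_feats, llm_feats, pred_traj, gt_traj, alpha=0.5):
    plan = nn.functional.mse_loss(pred_traj, gt_traj)  # L2 trajectory loss
    # Distillation surrogate: align planner features with the LLM teacher's.
    distill = 1 - nn.functional.cosine_similarity(planner_feats, llm_feats, dim=-1).mean()
    return plan + alpha * distill

planner = VisionPlanner()
scene = torch.randn(4, 512)       # dummy scene features
llm_feats = torch.randn(4, 256)   # dummy projected features from the frozen LLM
gt = torch.randn(4, 12)
feats, traj = planner(scene)
joint_loss(feats, llm_feats, traj, gt).backward()  # the LLM is optional at inference
```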
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) the translation of previous sentences, and (iii) pseudo-glosses transcribing the signing. These cues are automatically extracted and fed, along with the visual features, to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, as well as to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by also applying it to How2Sign, an American Sign Language dataset, and achieve competitive results.
https://arxiv.org/abs/2501.09754
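As a minimal sketch of how the three textual cues above might be assembled into a prompt for the fine-tuned LLM (the template and field names are assumptions; the actual framework also injects the visual sign features as continuous embeddings, which plain text cannot show):

```python
def build_translation_prompt(pseudo_glosses, background_caption, prev_translation):
    """Assemble contextual cues for the LLM; the template is hypothetical."""
    return (
        f"Background: {background_caption}\n"
        f"Previous sentence: {prev_translation}\n"
        f"Pseudo-glosses: {' '.join(pseudo_glosses)}\n"
        "Translate the signing into English:"
    )

print(build_translation_prompt(
    pseudo_glosses=["TODAY", "WEATHER", "RAIN"],
    background_caption="A presenter stands in front of a weather map.",
    prev_translation="Good evening and welcome to the forecast.",
))
```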
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting their ability to generate information-rich content. Specifically, vanilla-retrieved information tends to lack depth and utility, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of a topic. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.
https://arxiv.org/abs/2501.09751
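The expand-and-reflect loop can be pictured as follows. This is a toy sketch in which `expand` and `reflect` stand in for LLM/retrieval calls, and the verbatim deduplication is a deliberate simplification of OmniThink's actual reflection step:

```python
def expand(focus, notes):
    return notes + [f"note about {focus} #{len(notes) + 1}"]  # stub LLM call

def reflect(notes):
    deduped = list(dict.fromkeys(notes))  # drop verbatim-redundant notes
    next_focus = deduped[-1]              # choose which thread to deepen next
    return deduped, next_focus

def omnithink_outline(topic, rounds=3):
    notes, focus = [], topic
    for _ in range(rounds):
        notes = expand(focus, notes)      # iterative expansion
        notes, focus = reflect(notes)     # reflection
    return notes

print(omnithink_outline("retrieval-augmented machine writing"))
```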
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional-attention limitations of traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering, and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of their dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
https://arxiv.org/abs/2501.09749
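To make the cluster-then-pool idea concrete, here is a toy sketch: token embeddings are grouped with k-means, and a vocabulary-sized vector of per-token relevance scores is collapsed into one dimension per cluster. The toy sizes and the max-pooling choice are assumptions, not the exact LENS recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 64, 32
token_embeddings = rng.normal(size=(vocab_size, dim))

# Consolidate the vocabulary: group semantically similar tokens into k clusters.
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(token_embeddings)

def lexicon_embedding(token_logits):
    """Map per-token relevance scores (|V|,) to a k-dim lexicon embedding."""
    emb = np.empty(k)
    for c in range(k):
        emb[c] = token_logits[clusters == c].max()  # pool within each cluster
    return emb

logits = rng.normal(size=vocab_size)    # stand-in for LM output logits
print(lexicon_embedding(logits).shape)  # (32,) -- compact, dense-sized
```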
Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs; however, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to the length and complexity of the notebooks. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code per repository. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after fine-tuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.
https://arxiv.org/abs/2501.09745
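A small sketch of extracting cell-level and line-level edits between two revisions of a notebook (treated here as plain JSON). It mirrors the kind of granular diff the dataset records; the paper's exact schema is not shown in the abstract, so the format below is an assumption:

```python
import difflib, json

def cell_sources(nb_json):
    """Return the source text of each cell in a notebook JSON string."""
    return ["".join(c["source"]) for c in json.loads(nb_json)["cells"]]

old = json.dumps({"cells": [{"source": ["x = 1\n", "print(x)\n"]}]})
new = json.dumps({"cells": [{"source": ["x = 2\n", "print(x)\n"]}]})

# Emit a line-level diff for every cell whose content changed.
for i, (a, b) in enumerate(zip(cell_sources(old), cell_sources(new))):
    if a != b:
        diff = difflib.unified_diff(a.splitlines(), b.splitlines(),
                                    f"cell{i}@old", f"cell{i}@new", lineterm="")
        print("\n".join(diff))
```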
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen steps. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complicated nature of images, combinations of the components in the framework can be chosen specifically to suit different application scenarios.
https://arxiv.org/abs/2501.09732
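A minimal sketch of the search formulation above: draw several candidate initial noises, run the sampler on each, and keep the one the verifier scores highest. The `sample` and `verifier` functions are toy stand-ins; the paper studies more sophisticated verifiers and search algorithms:

```python
import torch

def sample(noise, steps=30):
    """Stand-in for a diffusion sampler mapping initial noise to an image."""
    x = noise
    for _ in range(steps):
        x = x - 0.01 * x  # placeholder denoising update
    return x

def verifier(image):
    """Stand-in verifier: higher is better (e.g., an aesthetic or CLIP score)."""
    return -image.pow(2).mean().item()

def random_search(n_candidates=8, shape=(3, 64, 64)):
    best_img, best_score = None, float("-inf")
    for _ in range(n_candidates):  # more candidates = more inference compute
        img = sample(torch.randn(shape))
        score = verifier(img)
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score

img, score = random_search()
print(f"best verifier score: {score:.4f}")
```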
With the increased use of the internet and social networks for online discussions, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have been conducted in different languages; however, less work has been done on South Asian languages for inappropriate content identification using deep learning techniques. In the Urdu language, spellings are not unique, and people write different common spellings for the same word, while code-mixing with other languages, such as English, makes the text even more challenging to process; limited research is available on processing such language with the best-performing algorithms. Using an attention layer with a deep learning model can help in handling long-term dependencies and increase its efficiency. To explore the effects of the attention layer, this study proposes an attention-based bidirectional GRU hybrid model for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models (LSTM, Bi-LSTM, GRU, and TCN) are used to compare the performance of the proposed model. The results of these models are compared based on evaluation metrics, dataset size, and the impact of the word embedding layer. Pre-trained Urdu word2Vec embeddings were utilized in our experiments. Our proposed model, BiGRU-A, outperformed all baseline models, yielding 84% accuracy without using the pre-trained word2Vec layer. From our experiments, we established that the attention layer improves the model's efficiency, and that pre-trained word2Vec embeddings do not work well on an inappropriate-content dataset.
https://arxiv.org/abs/2501.09722
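A sketch of an attention-based bidirectional GRU classifier in the spirit of BiGRU-A. Layer sizes and the additive-attention form are assumptions; the paper's exact architecture may differ:

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # trained from scratch
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, tokens):
        h, _ = self.gru(self.embed(tokens))           # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        context = (weights * h).sum(dim=1)            # weighted sum of GRU states
        return self.out(context).squeeze(-1)          # logit: inappropriate or not

model = BiGRUAttention()
logits = model(torch.randint(0, 30000, (4, 50)))  # batch of 4 token sequences
print(logits.shape)  # torch.Size([4])
```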
The multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method that ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
https://arxiv.org/abs/2501.09720
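An illustrative sketch of turning detections into text an autoregressive MLM can emit: normalize rotated-box parameters to a fixed integer range and serialize them as tokens. The format below is an assumption, not the exact LMMRotate scheme:

```python
def detection_to_text(boxes, img_w, img_h, bins=1000):
    """boxes: list of (cx, cy, w, h, angle_deg, category) in pixel coords."""
    parts = []
    for cx, cy, w, h, angle, cat in boxes:
        coords = [
            round(cx / img_w * bins), round(cy / img_h * bins),
            round(w / img_w * bins), round(h / img_h * bins),
            round((angle % 180) / 180 * bins),  # orientation, normalized
        ]
        parts.append(f"{cat} " + " ".join(f"<{c}>" for c in coords))
    return "; ".join(parts)

print(detection_to_text([(512, 300, 80, 40, 30.0, "plane")], img_w=1024, img_h=1024))
# plane <500> <293> <78> <39> <167>
```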
Many non-traditional students in cybersecurity programs often lack access to advice from peers, family members, and professors, which can hinder their educational experiences. Additionally, these students may not fully benefit from various LLM-powered AI assistants due to issues like content relevance, locality of advice, minimum expertise, and timing. This paper addresses these challenges by introducing an application designed to provide comprehensive support by answering questions related to knowledge, skills, and career preparation advice tailored to the needs of these students. We developed a learning tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity. Powered by agentic workflows and generative Large Language Models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval, achieving accessibility and personalization. We demonstrated its value in addressing knowledge requirements for cybersecurity education and career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real-time, on-demand learning support. Using three usage scenarios, we showcased CyberMentor's role in facilitating knowledge acquisition and career preparation and in providing seamless skill-based guidance and support. We also employed the LangChain prompt-based evaluation methodology to evaluate the platform's impact, confirming its strong performance in helpfulness, correctness, and completeness. These results underscore the system's ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education. Furthermore, CyberMentor's open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.
https://arxiv.org/abs/2501.09709
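A minimal sketch of the RAG pattern the platform relies on: score the question against a small curated corpus, retrieve the closest documents, and pack them into the LLM prompt. The tiny bag-of-words scorer, example documents, and `call_llm` stub are illustrative assumptions, not CyberMentor's components:

```python
def embed(text):
    words = text.lower().split()
    return {w: words.count(w) for w in words}  # toy bag-of-words "embedding"

def similarity(a, b):
    return sum(a[w] * b[w] for w in a if w in b)

DOCS = [
    "SQL injection: sanitize inputs and use parameterized queries.",
    "Career prep: pursue Security+ before applying for SOC analyst roles.",
]

def retrieve(question, k=1):
    q = embed(question)
    return sorted(DOCS, key=lambda d: similarity(q, embed(d)), reverse=True)[:k]

def call_llm(prompt):  # stub for a generative LLM call
    return "Answer grounded in the retrieved context."

question = "How do I prepare for a SOC analyst career?"
context = "\n".join(retrieve(question))
print(call_llm(f"Context:\n{context}\n\nQuestion: {question}"))
```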
We present the e-Llama models: 8 billion and 70 billion parameter large language models that are adapted to the e-commerce domain. These models are meant as foundation models with deep knowledge about e-commerce that form a base for instruction tuning and fine-tuning. The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies. To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce-specific evaluation tasks. We show that, when the training setup is chosen carefully, the Llama 3.1 models can be adapted to the new domain without sacrificing significant performance on general-domain tasks. We also explore the possibility of merging the adapted model and the base model for better control of the performance trade-off between domains.
https://arxiv.org/abs/2501.09706
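One standard way to realize the merging idea mentioned above is linear weight interpolation between the base and adapted checkpoints; the coefficient and uniform per-tensor merging below are assumptions for illustration, not necessarily the paper's exact recipe:

```python
import torch

def merge_state_dicts(base_sd, adapted_sd, alpha=0.5):
    """Return alpha * adapted + (1 - alpha) * base, tensor by tensor."""
    return {
        name: alpha * adapted_sd[name] + (1 - alpha) * base_sd[name]
        for name in base_sd
    }

# Toy demonstration with two small "models".
base = {"w": torch.zeros(2, 2), "b": torch.zeros(2)}
adapted = {"w": torch.ones(2, 2), "b": torch.ones(2)}
merged = merge_state_dicts(base, adapted, alpha=0.3)
print(merged["w"])  # tensor of 0.3s -- biased toward the base model
```

Sweeping `alpha` gives direct control over the trade-off between general-domain and e-commerce performance.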
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, the different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t. the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k training samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B compared to the previous SOTA algorithm trained with 16k samples: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark.
https://arxiv.org/abs/2501.09695
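For reference, the generic DPO objective that OPA-DPO builds on is shown below; this is the standard loss only, not the OPA-specific data pipeline, which additionally relies on expert-revised, on-policy preference pairs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Inputs are summed log-probs of whole responses under the trained
    policy and the frozen reference policy."""
    chosen_margin = logp_chosen - ref_logp_chosen      # implicit reward (chosen)
    rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward (rejected)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy values: the chosen (less hallucinated) response is preferred.
loss = dpo_loss(
    logp_chosen=torch.tensor([-10.0]), logp_rejected=torch.tensor([-12.0]),
    ref_logp_chosen=torch.tensor([-11.0]), ref_logp_rejected=torch.tensor([-11.5]),
)
print(loss.item())
```

The paper's point is that this loss behaves well only when the preference pairs are close to the reference policy's own distribution, which motivates the on-policy alignment of both original and expert-revised responses.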
Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to open a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.
https://arxiv.org/abs/2501.09686
This tutorial provides an in-depth guide to inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability and affinity in proteins, or closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict terminal rewards from intermediate states. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at this https URL
https://arxiv.org/abs/2501.09685
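A minimal sketch of SMC-style guidance as described above: run a population of particles through a denoiser and, at each step, resample particles in proportion to exp(value / temperature), so trajectories with high look-ahead value survive. The denoiser and value function below are toy placeholders, not the tutorial's actual code:

```python
import torch

def denoise_step(x, t):
    return x - 0.05 * x + 0.01 * torch.randn_like(x)  # placeholder denoising update

def value_fn(x):
    return -x.pow(2).mean(dim=(1, 2, 3))  # look-ahead proxy for the terminal reward

def smc_sample(n_particles=16, steps=20, temp=0.1, shape=(3, 8, 8)):
    x = torch.randn(n_particles, *shape)
    for t in range(steps):
        x = denoise_step(x, t)
        weights = torch.softmax(value_fn(x) / temp, dim=0)
        idx = torch.multinomial(weights, n_particles, replacement=True)
        x = x[idx]  # resampling: high-value particles are duplicated
    best = value_fn(x).argmax()
    return x[best]

sample = smc_sample()
print(sample.shape)  # torch.Size([3, 8, 8])
```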
The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin -- a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales -- and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP -- a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.
https://arxiv.org/abs/2501.09672
The recent rise in the popularity of large language models has spurred the development of the extensive code datasets needed to train them. This has left little code available for collection and use in downstream investigations of specific behaviors, or for the evaluation of large language models, without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data-cleaning overhead.
https://arxiv.org/abs/2501.09653
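An illustrative sketch of cross-dataset deduplication: drop any file whose normalized-content hash already appears in other open code datasets. The Heap's actual pipeline may also use near-duplicate detection; this exact-hash variant is an assumption for illustration:

```python
import hashlib

def content_key(source: str) -> str:
    """Hash whitespace-normalized content so trivial formatting differences match."""
    normalized = "\n".join(line.strip() for line in source.splitlines() if line.strip())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hashes collected from other open datasets of code.
reference_hashes = {content_key("print('hello')")}

candidates = ["print('hello')", "def f(x):\n    return x + 1"]
kept = [src for src in candidates if content_key(src) not in reference_hashes]
print(len(kept))  # 1 -- the duplicate of the reference corpus was dropped
```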
In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of 0.78 to 0.95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is 0.87. Collectively, the results demonstrate the system's suitability for industrial applications.
https://arxiv.org/abs/2501.09645
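A sketch of category-structured preference extraction and storage. The category list, prompt, and `call_llm` stub are illustrative assumptions; the point is that extraction is constrained to predefined categories, which keeps the memory transparent and auditable:

```python
import json

CATEGORIES = ["navigation", "climate", "media", "points_of_interest"]

def extraction_prompt(utterance: str) -> str:
    return (
        "Extract user preferences from the utterance below as JSON of the form "
        f"{{category: preference}}, using only these categories: {CATEGORIES}. "
        "Return {} if none apply.\n"
        f"Utterance: {utterance}"
    )

def call_llm(prompt: str) -> str:  # stub for a real LLM call
    return json.dumps({"climate": "prefers 21 degrees"})

memory = {c: [] for c in CATEGORIES}
extracted = json.loads(call_llm(extraction_prompt("It's chilly, set it to 21 again.")))
for category, preference in extracted.items():
    if category in memory and preference not in memory[category]:  # skip redundant entries
        memory[category].append(preference)
print(memory["climate"])
```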
Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.
https://arxiv.org/abs/2501.09620
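A sketch of the counterfactual-invariance idea: a reward model is penalized whenever changing an irrelevant attribute of the response changes its prediction. The toy reward model and the feature-level perturbation standing in for a length-only edit are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features):
        return self.net(features).squeeze(-1)

def loss_with_invariance(model, feats, feats_counterfactual, labels, lam=1.0):
    r, r_cf = model(feats), model(feats_counterfactual)
    fit = nn.functional.binary_cross_entropy_with_logits(r, labels)
    invariance = (r - r_cf).pow(2).mean()  # reward must not move under the edit
    return fit + lam * invariance

model = RewardModel()
feats = torch.randn(8, 32)
feats_cf = feats + 0.1 * torch.randn(8, 32)  # stand-in for an irrelevant-only edit
labels = torch.randint(0, 2, (8,)).float()
print(loss_with_invariance(model, feats, feats_cf, labels).item())
```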
The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is expensive and too slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invested substantial effort in collecting fake news from credible sources and manually verifying it while preserving its linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers (BERT) variants (F1: 87%) and Large Language Models with Quantized Low-Rank Approximation (F1: 89%), which significantly outperform traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resource languages. We publicly release our dataset and model on GitHub to foster research in this direction.
https://arxiv.org/abs/2501.09604
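A minimal sketch of fine-tuning a BERT variant for binary fake-news classification with Hugging Face transformers. The multilingual checkpoint and hyperparameters are assumptions; the paper evaluates its own set of BERT variants on Bangla text:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-multilingual-cased"  # assumed checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["...", "..."]          # Bangla news articles (placeholders)
labels = torch.tensor([0, 1])   # 0 = authentic, 1 = fake

batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
out = model(**batch, labels=labels)  # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
print(float(out.loss))
```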
De-identification of medical images is a critical step to ensure privacy during data sharing in research and clinical settings. The initial step in this process involves detecting Protected Health Information (PHI), which can be found in image metadata or imprinted within image pixels. Despite the importance of such systems, there has been limited evaluation of existing AI-based solutions, creating barriers to the development of reliable and robust tools. In this study, we present an AI-based pipeline for PHI detection comprising three key components: text detection, text extraction, and analysis of PHI content in medical images. By experimenting with exchanging the roles of vision and language models within the pipeline, we evaluate their performance and recommend the best setup for the PHI detection task.
https://arxiv.org/abs/2501.09552
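A sketch of the three-stage pipeline shape: (1) detect text regions in the image, (2) extract the text, (3) analyze it for PHI. Stages 1 and 2 are stubbed out here, and the rule-based analyzer is a simple stand-in for the vision and language models the study actually compares:

```python
import re

def detect_text_regions(image):  # stub: a text-detection model would go here
    return [((0, 0, 100, 20), image)]

def extract_text(region):        # stub: an OCR / vision-language model would go here
    return "Patient: JOHN DOE  DOB: 04/12/1961  MRN: 8675309"

PHI_PATTERNS = {
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "id_number": re.compile(r"MRN:\s*\d+"),
    "name": re.compile(r"Patient:\s*([A-Z]+(?: [A-Z]+)?)"),
}

def analyze_phi(text):
    """Return every PHI category found in the extracted text."""
    return {kind: pat.findall(text) for kind, pat in PHI_PATTERNS.items()
            if pat.findall(text)}

for bbox, region in detect_text_regions(image=None):
    print(analyze_phi(extract_text(region)))
```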
In this paper, we elaborate on how AI can support diversity and inclusion and exemplify research projects conducted in that direction. We start by looking at the challenges and progress in making large language models (LLMs) more transparent, inclusive, and aware of social biases. Even though LLMs like ChatGPT have impressive abilities, they struggle to understand different cultural contexts and engage in meaningful, human-like conversations. A key issue is that biases in language processing, especially in machine translation, can reinforce inequality. Tackling these biases requires a multidisciplinary approach to ensure AI promotes diversity, fairness, and inclusion. We also highlight AI's role in identifying biased content in media, which is important for improving representation. By detecting unequal portrayals of social groups, AI can help challenge stereotypes and create more inclusive technologies. Transparent AI algorithms, which clearly explain their decisions, are essential for building trust and reducing bias in AI systems. We also stress that AI systems need diverse and inclusive training data. Projects like the Child Growth Monitor show how using a wide range of data can help address real-world problems like malnutrition and poverty. We present a project that demonstrates how AI can be applied to monitor the role of search engines in spreading disinformation about the LGBTQ+ community. Moreover, we discuss the SignON project as an example of how technology can bridge communication gaps between hearing and deaf people, emphasizing the importance of collaboration and mutual trust in developing inclusive AI. Overall, with this paper, we advocate for AI systems that are not only effective but also socially responsible, promoting fair and inclusive interactions between humans and machines.
https://arxiv.org/abs/2501.09534