Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
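As a rough picture of the joint training objective, here is a minimal PyTorch sketch: a trajectory loss plus a feature-alignment distillation term on the shared scene encoder. The shapes, the cosine alignment, and the weighting `alpha` are illustrative assumptions, not the paper's exact losses or surrogate tasks.

```python
import torch
import torch.nn.functional as F

def dima_style_loss(scene_tokens, pred_traj, gt_traj, llm_feats, alpha=0.5):
    """Joint objective: trajectory regression plus a distillation term that
    aligns the shared scene encoder's tokens with features from the
    multi-modal LLM branch (the teacher). Assumed shapes: scene_tokens and
    llm_feats (B, N, D); pred_traj and gt_traj (B, T, 2)."""
    plan_loss = F.l1_loss(pred_traj, gt_traj)           # drives trajectory error
    distill = 1.0 - F.cosine_similarity(scene_tokens, llm_feats, dim=-1).mean()
    return plan_loss + alpha * distill                  # alpha is illustrative

# Toy usage with random tensors standing in for network outputs.
B, N, D, T = 2, 16, 256, 6
loss = dima_style_loss(torch.randn(B, N, D), torch.randn(B, T, 2),
                       torch.randn(B, T, 2), torch.randn(B, N, D))
print(float(loss))
```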
https://arxiv.org/abs/2501.09757
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These cues are automatically extracted and fed, along with the visual features, into a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by also applying it to How2Sign, an American Sign Language dataset, and achieve competitive results.
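A hedged sketch of how the three textual cues might be assembled into a single LLM input follows; the template wording and function name are assumptions, and in the actual system the visual sign-recognition features are injected as embeddings alongside this text rather than verbalized.

```python
def build_slt_input(background_caption, prev_translation, pseudo_glosses):
    """Assemble the contextual text cues listed in the abstract into one
    LLM input; the visual features would accompany this as embeddings."""
    return (
        f"Background caption: {background_caption}\n"
        f"Previous translation: {prev_translation}\n"
        f"Pseudo-glosses: {' '.join(pseudo_glosses)}\n"
        "Translate the signing into English:"
    )

prompt = build_slt_input(
    "A wildlife documentary about coastal birds.",
    "The tide brings in small fish every morning.",
    ["BIRD", "EAT", "FISH", "MORNING"],
)
print(prompt)
```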
https://arxiv.org/abs/2501.09754
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce Lexicon-based EmbeddiNgS (LENS), the first lexicon-based embeddings leveraging LLMs to achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional-attention limitation of traditional causal LLMs, LENS consolidates the vocabulary space through token-embedding clustering and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
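The core mechanism, clustering the vocabulary's token embeddings so that each embedding dimension corresponds to one cluster, can be sketched as follows. This is a toy reconstruction under assumed shapes; the pooling, activation, and cluster count are placeholders rather than LENS's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy vocabulary embedding table (V tokens, d dims); in practice this would
# come from the LLM's output embedding matrix.
rng = np.random.default_rng(0)
V, d, K = 1000, 64, 32
token_emb = rng.normal(size=(V, d))

# Consolidate the vocabulary: group semantically similar tokens so each
# output dimension corresponds to one token cluster.
cluster_of = KMeans(n_clusters=K, n_init=4, random_state=0).fit(token_emb).labels_

def lens_style_embedding(hidden, out_proj):
    """Map a pooled hidden state to a K-dim lexicon embedding.
    hidden: (d,) pooled sequence representation (e.g., mean pooling with
    bidirectional attention enabled); out_proj: (V, d) output embeddings."""
    logits = out_proj @ hidden               # per-token relevance scores
    emb = np.zeros(K)
    for k in range(K):                       # max-pool scores within a cluster
        emb[k] = logits[cluster_of == k].max()
    return np.log1p(np.maximum(emb, 0))      # sparse-friendly activation (assumed)

q = lens_style_embedding(rng.normal(size=d), token_emb)
print(q.shape)  # (32,) compact lexicon embedding
```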
https://arxiv.org/abs/2501.09749
Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs; however, maintaining them, e.g., to add new features or fix bugs, can be challenging due to their length and complexity. Moreover, no existing benchmark covers developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code across repositories. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after fine-tuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.
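For intuition about how cell- and line-level edits might be extracted from notebook revisions, here is a simplified sketch using nbformat and difflib; it pairs cells by position and ignores cell insertions and deletions, which the real dataset pipeline would have to handle.

```python
import difflib
import nbformat

def cell_level_diff(old_path, new_path):
    """Granular edit extraction between two notebook revisions: pair cells
    by position and report changed lines (a simplification of the dataset's
    cell- and line-level scheme)."""
    old = nbformat.read(old_path, as_version=4).cells
    new = nbformat.read(new_path, as_version=4).cells
    edits = []
    for i, (a, b) in enumerate(zip(old, new)):
        if a.source != b.source:
            diff = list(difflib.unified_diff(
                a.source.splitlines(), b.source.splitlines(), lineterm=""))
            edits.append({"cell": i, "diff": diff})
    return edits
```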
https://arxiv.org/abs/2501.09745
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen steps. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complex nature of images, combinations of components in the framework can be chosen to suit different application scenarios.
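The simplest instantiation of this search problem is a zero-order random search over initial noises, scored by a verifier. The sketch below assumes `sampler` and `verifier` callables; the paper's design space covers more sophisticated verifiers and search algorithms than this.

```python
import torch

def search_noise(sampler, verifier, shape, n_candidates=8, generator=None):
    """Sample several initial noises, denoise each, and keep the one the
    verifier scores highest. `sampler` maps an initial noise tensor to an
    image; `verifier` maps an image to a scalar score (e.g., a CLIP- or
    reward-model-based judge). Both are assumptions of this sketch."""
    best_img, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape, generator=generator)
        img = sampler(noise)
        score = float(verifier(img))
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score

# Toy usage: "images" are tensors; this verifier simply prefers bright outputs.
img, score = search_noise(lambda z: z.clamp(-1, 1), lambda x: x.mean(),
                          shape=(1, 3, 8, 8), n_candidates=4)
print(score)
```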
https://arxiv.org/abs/2501.09732
Many non-traditional students in cybersecurity programs often lack access to advice from peers, family members, and professors, which can hinder their educational experiences. Additionally, these students may not fully benefit from various LLM-powered AI assistants due to issues like content relevance, locality of advice, minimum expertise, and timing. This paper addresses these challenges by introducing an application designed to provide comprehensive support by answering questions related to knowledge, skills, and career preparation advice tailored to the needs of these students. We developed a learning tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity. Powered by an agentic workflow and generative Large Language Models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval to achieve accessibility and personalization. We demonstrated its value in addressing knowledge requirements for cybersecurity education and career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real-time, on-demand learning support. Through three use scenarios, we showcased CyberMentor's role in facilitating knowledge acquisition and career preparation and in providing seamless skill-based guidance and support. We also employed the LangChain prompt-based evaluation methodology to evaluate the platform's impact, confirming its strong performance in helpfulness, correctness, and completeness. These results underscore the system's ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education. Furthermore, CyberMentor's open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.
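As an illustration of the RAG step such a platform might use, here is a minimal sketch; the embedding corpus, similarity ranking, and prompt template are assumptions, not CyberMentor's actual stack.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Minimal RAG retrieval step: rank documents by cosine similarity to
    the query embedding and return the top-k as context. The embedding
    model and corpus are assumed to exist elsewhere."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def answer(llm, question, context_docs):
    """`llm` is any text-in/text-out callable (assumption of this sketch)."""
    prompt = "Use the context to advise a cybersecurity student.\n\n"
    prompt += "\n".join(f"- {d}" for d in context_docs)
    prompt += f"\n\nQuestion: {question}"
    return llm(prompt)
```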
https://arxiv.org/abs/2501.09709
We present the e-Llama models: 8 billion and 70 billion parameter large language models that are adapted towards the e-commerce domain. These models are meant as foundation models with deep knowledge of e-commerce that form a base for instruction tuning and fine-tuning. The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies. To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce-specific evaluation tasks. We show that, when carefully choosing the training setup, the Llama 3.1 models can be adapted towards the new domain without sacrificing significant performance on general-domain tasks. We also explore the possibility of merging the adapted model and the base model for better control of the performance trade-off between domains.
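The merging idea can be illustrated with plain linear weight interpolation, one simple merging scheme; the abstract does not pin down the exact recipe, so treat this as a sketch rather than the paper's method.

```python
import torch

def merge_state_dicts(base_sd, adapted_sd, w=0.5):
    """Interpolate parameters of the base and domain-adapted models.
    w controls the trade-off: w=0 recovers the base model, w=1 the
    e-commerce-adapted one."""
    return {k: torch.lerp(base_sd[k].float(), adapted_sd[k].float(), w)
            for k in base_sd}
```

Sweeping `w` on held-out general-domain and e-commerce evaluations gives direct control over where the merged model sits on the trade-off curve.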
https://arxiv.org/abs/2501.09706
Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Train-time and test-time scaling thus combine to open a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.
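Test-time scaling in its most basic form can be sketched as self-consistency: sample several reasoning chains and majority-vote on the final answer. `generate` below is an assumed text-in/answer-out callable, not a specific model's API.

```python
import random
from collections import Counter

def self_consistency(generate, question, n=16):
    """Sample n chains of thought and majority-vote on the final answer;
    accuracy typically rises as n (test-time compute) grows."""
    answers = [generate(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy usage: a stub "model" that answers correctly 60% of the time.
stub = lambda q: "42" if random.random() < 0.6 else str(random.randint(0, 9))
print(self_consistency(stub, "What is 6 * 7?"))
```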
https://arxiv.org/abs/2501.09686
The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.
https://arxiv.org/abs/2501.09672
In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data and tailored to an in-car voice assistant setting. Benchmarked on this dataset, our system achieves an F1-score of 0.78 to 0.95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is 0.87. Collectively, the results demonstrate the system's suitability for industrial applications.
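A hedged sketch of category-grounded preference extraction and maintenance follows; the categories, prompt template, and judge callables are illustrative stand-ins for the paper's actual design.

```python
import json

CATEGORIES = ["climate", "navigation", "media", "poi"]  # illustrative only

EXTRACT_PROMPT = (
    "Extract the driver's preferences from the dialogue below.\n"
    'Return a JSON list like [{{"category": "...", "preference": "..."}}],\n'
    "where category is one of {cats}. Extract only what the user stated.\n\n"
    "Dialogue:\n{dialogue}"
)

def extract_preferences(llm, dialogue):
    """`llm` is any text-in/text-out callable assumed to return JSON."""
    raw = llm(EXTRACT_PROMPT.format(cats=CATEGORIES, dialogue=dialogue))
    return [p for p in json.loads(raw) if p["category"] in CATEGORIES]

def maintain(store, new_prefs, similar, contradicts):
    """Maintenance pass: replace contradicted entries and skip redundant
    ones. `similar` and `contradicts` are pluggable judges (e.g., LLM calls)."""
    for p in new_prefs:
        store[:] = [q for q in store if not contradicts(q, p)]
        if not any(similar(q, p) for q in store):
            store.append(p)
    return store
```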
https://arxiv.org/abs/2501.09645
Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.
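The approach can be pictured as a standard Bradley-Terry reward loss plus a counterfactual-invariance penalty. In the sketch below, the counterfactual pair and the penalty weight are assumptions about the general recipe, not the paper's exact estimator.

```python
import torch.nn.functional as F

def causal_rm_loss(reward_model, chosen, rejected, chosen_cf, lam=0.1):
    """Bradley-Terry preference loss plus a counterfactual-invariance penalty.

    chosen_cf is a counterfactual version of `chosen` in which an irrelevant
    attribute (e.g., response length or phrasing) is altered while the
    content is preserved; how it is constructed is an assumption here."""
    r_c, r_r = reward_model(chosen), reward_model(rejected)
    r_cf = reward_model(chosen_cf)
    pref = -F.logsigmoid(r_c - r_r).mean()        # standard RLHF reward loss
    invariance = (r_c - r_cf).pow(2).mean()       # reward should not move
    return pref + lam * invariance
```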
https://arxiv.org/abs/2501.09620
The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is too expensive and slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invested substantial effort in collecting fake news from credible sources and verifying it manually while preserving linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers (BERT) variants (F1: 87%) and large language models with quantized low-rank approximation (F1: 89%), which significantly outperform traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resource languages. We publicly release our dataset and model on GitHub to foster research in this direction.
https://arxiv.org/abs/2501.09604
In this paper, we elaborate on how AI can support diversity and inclusion and exemplify research projects conducted in that direction. We start by looking at the challenges and progress in making large language models (LLMs) more transparent, inclusive, and aware of social biases. Even though LLMs like ChatGPT have impressive abilities, they struggle to understand different cultural contexts and engage in meaningful, human-like conversations. A key issue is that biases in language processing, especially in machine translation, can reinforce inequality. Tackling these biases requires a multidisciplinary approach to ensure AI promotes diversity, fairness, and inclusion. We also highlight AI's role in identifying biased content in media, which is important for improving representation. By detecting unequal portrayals of social groups, AI can help challenge stereotypes and create more inclusive technologies. Transparent AI algorithms, which clearly explain their decisions, are essential for building trust and reducing bias in AI systems. We also stress that AI systems need diverse and inclusive training data. Projects like the Child Growth Monitor show how using a wide range of data can help address real-world problems like malnutrition and poverty. We present a project that demonstrates how AI can be applied to monitor the role of search engines in spreading disinformation about the LGBTQ+ community. Moreover, we discuss the SignON project as an example of how technology can bridge communication gaps between hearing and deaf people, emphasizing the importance of collaboration and mutual trust in developing inclusive AI. Overall, with this paper, we advocate for AI systems that are not only effective but also socially responsible, promoting fair and inclusive interactions between humans and machines.
https://arxiv.org/abs/2501.09534
The success of VLMs often relies on dynamic high-resolution schemes that adaptively split the input images into multiple crops so that image details can be retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of the VLMs. To improve VLM efficiency without introducing extra training costs, many works propose to reduce the visual tokens by filtering out uninformative ones or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and leads to inaccurate responses. Token-reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are non-salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity in the pre-LLM layers to select the informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
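A minimal sketch of mixing the two cues to pick informative visual tokens is given below; the adaptive mixing weight and the keep ratio are illustrative guesses at the mechanism, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(vis_tokens, text_emb, attn_saliency, keep_ratio=0.25):
    """Score tokens by a mixture of visual saliency (from pre-LLM attention)
    and text-to-image similarity, then keep the top fraction.
    vis_tokens: (N, d); text_emb: (d,); attn_saliency: (N,)."""
    sim = F.cosine_similarity(vis_tokens, text_emb.unsqueeze(0), dim=-1)
    sim = torch.softmax(sim, dim=0)
    sal = torch.softmax(attn_saliency, dim=0)
    alpha = 1.0 - sim.max()                        # lean on text similarity when
    score = alpha * sal + (1 - alpha) * sim        # it is sharply peaked (assumed)
    k = max(1, int(keep_ratio * vis_tokens.shape[0]))
    idx = score.topk(k).indices.sort().values      # keep original token order
    return vis_tokens[idx], idx

kept, idx = select_visual_tokens(torch.randn(64, 128), torch.randn(128),
                                 torch.rand(64))
print(kept.shape)  # (16, 128)
```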
https://arxiv.org/abs/2501.09532
Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better alignment between confidence and accuracy. Our experimental results show that the encoder-decoder T5 is better calibrated than in-context-learning GPT-4 and decoder-only Llama 3, so the designated external entropy-based selective classifier performs better on it. The study also reveals that, in terms of error detection, the selective classifier is more likely to detect errors arising from irrelevant questions than from incorrectly generated queries.
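The entropy-based selective classifier can be sketched in a few lines: compute the mean per-token entropy of the generated SQL and abstain above a threshold, which is exactly the knob that trades coverage against risk. The threshold value and probability format below are assumptions of this sketch.

```python
import math

def sequence_entropy(token_probs):
    """Mean per-token entropy of the generated SQL, given the model's
    per-step distributions (a list of dicts mapping token -> probability)."""
    ents = [-sum(p * math.log(p) for p in dist.values() if p > 0)
            for dist in token_probs]
    return sum(ents) / len(ents)

def selective_predict(sql, token_probs, tau=0.5):
    """Abstain when entropy exceeds the threshold; tau trades coverage
    against risk and would be tuned on held-out data."""
    if sequence_entropy(token_probs) > tau:
        return None          # abstain: route to a human or ask to rephrase
    return sql

probs = [{"SELECT": 0.9, "WITH": 0.1}, {"name": 0.6, "id": 0.4}]
print(selective_predict("SELECT name FROM users;", probs))
```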
https://arxiv.org/abs/2501.09527
We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering in the visualization of scientific data, making conversational visualization possible. LLMs struggle with tasks like visual data interaction, as they lack contextual visual information. We address this problem by merging a text description of a visualization and its dataset with snapshots of the visualization. We extract their essential features into a structured text file that is highly compact yet descriptive enough to augment the LLM with contextual information, without any fine-tuning. This approach can be applied to any visualization that has already been rendered, as long as it is associated with some textual description.
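A sketch of the structured context file follows; the field names and contents are illustrative of the kind of compact description the method extracts, not the paper's schema.

```python
import json

def build_context(description, dataset_summary, snapshot_features):
    """Condense a rendered visualization into a compact structured text
    file that augments the LLM prompt; all field names are illustrative."""
    return json.dumps({
        "chart_description": description,     # author-provided text
        "data": dataset_summary,              # variables, units, ranges
        "snapshot": snapshot_features,        # features read off the render
    }, indent=2)

context = build_context(
    "Volume rendering of ocean temperature at 500m depth",
    {"variable": "temperature", "unit": "°C", "range": [-2.0, 18.4]},
    {"colormap": "viridis", "hotspots": ["western boundary current"]},
)
prompt = f"Context:\n{context}\n\nQuestion: Where is the warmest region?"
print(prompt)
```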
https://arxiv.org/abs/2501.09521
Understanding emotions accurately is essential for fields like human-computer interaction. Due to the complexity of emotions and their multi-modal nature (e.g., emotions are influenced by facial expressions and audio), researchers have turned to using multi-modal models to understand human emotions rather than single-modality approaches. However, current video multi-modal large language models (MLLMs) encounter difficulties in effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and better generalize to real-world applications. Moreover, in addition to audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM, enabling the MLLM to effectively unify audio and subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning on our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
https://arxiv.org/abs/2501.09502
While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and for unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collecting and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances in enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.
https://arxiv.org/abs/2501.09431
Traditional in-person psychological counseling remains primarily niche, often chosen by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems use agents with a fixed structure, which limits their self-optimization capabilities, or provide hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single-response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct AutoCBT, a CBT-oriented autonomous multi-agent framework, and demonstrate its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.
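The supervisory routing can be pictured as a draft-and-review loop; the sketch below assumes LLM-backed `agent` and `supervisor` callables and a simple approve/revise protocol, which is only a cartoon of AutoCBT's multi-agent design.

```python
def counsel(agent, supervisor, question, max_rounds=3):
    """Draft-and-review loop: the counsellor agent drafts a reply, the
    supervisor either approves it or returns feedback for another round."""
    draft = agent(question, feedback=None)
    for _ in range(max_rounds):
        verdict, feedback = supervisor(question, draft)
        if verdict == "approve":
            break
        draft = agent(question, feedback=feedback)
    return draft

# Toy stubs standing in for LLM-backed agents.
agent = lambda q, feedback=None: (
    f"CBT reframing for: {q}" + (f" (revised: {feedback})" if feedback else ""))
supervisor = lambda q, d: (("approve", "") if "reframing" in d
                           else ("revise", "add a cognitive reframe"))
print(counsel(agent, supervisor, "I always fail at everything."))
```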
https://arxiv.org/abs/2501.09426
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Exploiting the heterogeneous capabilities of edge LLMs is crucial for diverse emerging applications, as it enables greater cost-effectiveness and reduced latency. In this work, we introduce Mixture-of-Edge-Experts (MoE²), a novel collaborative inference framework for edge LLMs. We formulate the joint gating and expert selection problem to optimize inference performance under energy and latency constraints. Unlike conventional MoE problems, LLM expert selection is significantly more challenging due to its combinatorial nature and the heterogeneity of edge LLMs across various attributes. To this end, we propose a two-level expert selection mechanism through which we uncover an optimality-preserving property of the gating parameters across expert selections. This property enables the decomposition of the training and selection processes, significantly reducing complexity. Furthermore, we leverage the objective's monotonicity and design a discrete monotonic optimization algorithm for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX Orin and NVIDIA RTX 4090 GPUs and perform extensive experiments. Our results validate the performance improvements of various LLM models and show that our MoE² method achieves optimal trade-offs among different delay and energy budgets, and outperforms baselines under various system resource constraints.
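For intuition, the constrained expert-selection problem can be stated as a brute-force search over expert subsets; the paper's two-level mechanism and monotonic optimization exist precisely to avoid this enumeration. All numbers and the parallel-latency assumption below are illustrative.

```python
from itertools import combinations

def select_experts(experts, gate_quality, energy, latency, e_max, t_max, k=2):
    """Brute-force reference: pick the size-k subset of edge LLM experts
    that maximizes gated quality under energy and latency budgets."""
    best, best_q = None, float("-inf")
    for subset in combinations(experts, k):
        e = sum(energy[x] for x in subset)
        t = max(latency[x] for x in subset)   # assumes experts run in parallel
        if e <= e_max and t <= t_max:
            q = gate_quality(subset)
            if q > best_q:
                best, best_q = subset, q
    return best, best_q

experts = ["llama8b", "qwen7b", "phi3"]
energy = {"llama8b": 3.0, "qwen7b": 2.5, "phi3": 1.0}
latency = {"llama8b": 0.9, "qwen7b": 0.8, "phi3": 0.4}
quality = {"llama8b": 0.8, "qwen7b": 0.7, "phi3": 0.5}
best, q = select_experts(experts, lambda s: sum(quality[x] for x in s),
                         energy, latency, e_max=4.0, t_max=0.85, k=2)
print(best, q)
```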
https://arxiv.org/abs/2501.09410