The expanding size of language models has created the need for comprehensive examination across various dimensions that reflect the desiderata regarding tradeoffs between hardware metrics, such as latency, energy consumption, GPU memory usage, and performance. There is growing interest in establishing Pareto frontiers for different language model configurations to identify optimal models under specified hardware constraints. Notably, architectures that excel in latency on one device may not perform optimally on another. However, exhaustive training and evaluation of numerous architectures across diverse hardware configurations is computationally prohibitive. To this end, we propose HW-GPT-Bench, a hardware-aware language model surrogate benchmark, in which we leverage weight-sharing techniques from Neural Architecture Search (NAS) to efficiently train a supernet proxy that encompasses language models of varying scales in a single model. We profile these models across 13 devices, considering 5 hardware metrics and 3 distinct model scales. Finally, we showcase the usability of HW-GPT-Bench using 8 different multi-objective NAS algorithms and evaluate the quality of the resulting Pareto fronts. Through this benchmark, we aim to propel and expedite research on multi-objective methods for NAS and structural pruning in large language models.
语言模型规模的不断扩大,使得有必要从多个维度对其进行全面考察,以反映在延迟、能耗、GPU显存占用和性能等各类硬件指标之间进行权衡时的诉求。人们越来越关注为不同的语言模型配置建立Pareto前沿,以在给定硬件约束下确定最优模型。值得注意的是,在某一设备上延迟表现出色的架构,在另一设备上未必最优。然而,在多种硬件配置下对大量架构进行详尽的训练和评估在计算上是不可行的。为此,我们提出了HW-GPT-Bench,一个硬件感知的语言模型代理基准:我们利用神经架构搜索(NAS)中的权重共享技术,高效地训练一个超网络代理,在单个模型中涵盖不同规模的语言模型。我们在13种设备上对这些模型进行性能剖析,考虑了5项硬件指标和3种不同的模型规模。最后,我们使用8种不同的多目标NAS算法展示了HW-GPT-Bench的可用性,并评估了所得Pareto前沿的质量。通过该基准,我们旨在推动并加速大语言模型中多目标NAS方法与结构化剪枝的研究进展。
https://arxiv.org/abs/2405.10299
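The Pareto-frontier idea at the heart of HW-GPT-Bench can be sketched with a plain non-dominated filter over profiled metrics; the model names and metric values below are made up for illustration, not taken from the benchmark:

```python
def pareto_front(configs):
    """Return the non-dominated configs, where each config maps a name to a
    tuple of metrics to minimise (e.g. latency_ms, perplexity)."""
    front = []
    for name, metrics in configs.items():
        dominated = any(
            all(o <= m for o, m in zip(other, metrics)) and other != metrics
            for oname, other in configs.items() if oname != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Hypothetical profiling results: (latency_ms, perplexity)
results = {
    "gpt-s": (12.0, 22.5),
    "gpt-m": (25.0, 18.1),
    "gpt-l": (60.0, 15.4),
    "gpt-m-pruned": (30.0, 19.0),  # dominated by gpt-m on both metrics
}
print(pareto_front(results))  # ['gpt-l', 'gpt-m', 'gpt-s']
```

Multi-objective NAS algorithms evaluated on the benchmark are, in effect, searching for such non-dominated sets across the profiled hardware metrics of each device.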
Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
在专用视觉指令跟随数据上微调的大型视觉语言模型(VLMs)已在各种场景中展现出令人印象深刻的语言推理能力。然而,这种微调范式可能无法在交互式环境的多步目标导向任务中高效地学习最优决策智能体。为应对这一挑战,我们提出了一个使用强化学习(RL)微调VLMs的算法框架。具体来说,我们的框架提供任务描述,然后提示VLM生成思维链(CoT)推理,使VLM能够高效探索通向最终基于文本的动作的中间推理步骤。接下来,开放式文本输出被解析为可执行动作,与环境交互以获得目标导向的任务奖励。最后,我们的框架利用这些任务奖励,通过RL对整个VLM进行微调。实验表明,我们提出的框架提升了VLM智能体在各种任务中的决策能力,使7b模型能够超越GPT4-V和Gemini等商业模型。此外,我们发现CoT推理是性能提升的关键组成部分:去除CoT推理会导致我们方法的整体性能显著下降。
https://arxiv.org/abs/2405.10292
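The parsing step described above — turning the model's open-ended CoT text into an executable action — can be sketched as follows. The `Action:` output convention and the action names are assumptions made for this illustration, not the paper's exact format:

```python
import re

def parse_action(llm_output, valid_actions):
    """Extract the final action from free-form CoT text. Assumes the model was
    prompted to end with a line like 'Action: <name>' (a convention we invent
    here); unparseable outputs return None and can be assigned a low reward."""
    match = re.search(r"Action:\s*([a-z_]+)", llm_output, re.IGNORECASE)
    if match:
        action = match.group(1).lower()
        if action in valid_actions:
            return action
    return None

cot = ("Thought: the key is behind the door, so I should move there first.\n"
       "Action: move_left")
print(parse_action(cot, {"move_left", "move_right", "pick_up"}))  # move_left
```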
Facts extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the task of temporal fact extraction. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and get unsatisfactory results. To this end, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both HyperRED-Temporal and ComplexTRED datasets.
事实抽取对于构建知识图谱至关重要。近来,下游任务对时间性事实日益增长的需求催生了时间性事实抽取任务。在本文中,我们专门研究从自然语言文本中抽取时间性事实的问题。以往的研究未能应对在复杂句子中建立时间与事实对应关系的挑战。为克服这一障碍,我们提出了一种基于时间线的句子分解策略,利用大型语言模型(LLMs)的上下文学习能力,确保对各个事实所关联时间线的细粒度理解。此外,我们评估了LLMs直接进行时间性事实抽取的性能,结果并不理想。为此,我们提出了TSDRE,一种将LLM的分解能力融入较小预训练语言模型(PLMs)传统微调的方法。为支持评估,我们构建了复杂时间性事实抽取数据集ComplexTRED。实验表明,TSDRE在HyperRED-Temporal和ComplexTRED数据集上均取得了最先进的结果。
https://arxiv.org/abs/2405.10288
Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we first study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then, we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with a sigmoid loss to address this requirement. We show very large gains over the current state-of-the-art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).
尽管噪声和图像描述质量已被公认为影响视觉语言对比预训练的重要因素,但在本文中,我们表明通过解决这些问题来改进训练过程的全部潜力尚未被发掘。具体来说,我们首先研究并分析了影响训练的两个问题:负样本对的错误分配,以及描述质量和多样性偏低。然后,我们为这两个问题设计了有效的解决方案,其本质上要求使用多个真实正样本对进行训练。最后,我们提出使用sigmoid损失进行训练以满足这一要求。我们在图像识别(在11个数据集上平均提高约6%)和图像检索(在Flickr30k上提高约19%,在MSCOCO上提高约15%)方面均相对当前最先进水平取得了非常大的提升。
https://arxiv.org/abs/2405.10286
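A minimal sketch of the kind of pairwise sigmoid loss the abstract above proposes, which — unlike a softmax contrastive loss — naturally accepts multiple true positive pairs per image. The similarity matrix and positive sets are toy values invented for illustration:

```python
import math

def sigmoid_pairwise_loss(sim, positives):
    """Pairwise sigmoid loss over an image-text similarity matrix `sim`
    (rows: images, cols: captions). `positives[i]` is the set of caption
    indices that truly match image i; each image may have several positives."""
    total, count = 0.0, 0
    for i, row in enumerate(sim):
        for j, logit in enumerate(row):
            label = 1.0 if j in positives[i] else -1.0
            # -log sigmoid(label * logit): small when the pair is scored correctly
            total += -math.log(1.0 / (1.0 + math.exp(-label * logit)))
            count += 1
    return total / count

# Two images, three captions; image 0 has two true captions (indices 0 and 2).
sim = [[4.0, -3.0, 3.5], [-2.5, 5.0, -4.0]]
loss = sigmoid_pairwise_loss(sim, positives=[{0, 2}, {1}])
print(round(loss, 4))
```

Because every image-caption pair is scored independently, adding extra true positives requires only flipping their labels, with no renormalization over the batch.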
Numerous recent works aim to enhance the efficacy of Large Language Models (LLMs) through strategic prompting. In particular, the Optimization by PROmpting (OPRO) approach provides state-of-the-art performance by leveraging LLMs as optimizers where the optimization task is to find instructions that maximize the task accuracy. In this paper, we revisit OPRO for automated prompting with relatively small-scale LLMs, such as LLaMa-2 family and Mistral 7B. Our investigation reveals that OPRO shows limited effectiveness in small-scale LLMs, with limited inference capabilities constraining optimization ability. We suggest future automatic prompting engineering to consider both model capabilities and computational costs. Additionally, for small-scale LLMs, we recommend direct instructions that clearly outline objectives and methodologies as robust prompt baselines, ensuring efficient and effective prompt engineering in ongoing research.
许多近期工作旨在通过策略性提示增强大型语言模型(LLMs)的效能。特别是,Optimization by PROmpting(OPRO)方法通过将LLM用作优化器(优化任务是寻找使任务准确率最大化的指令)取得了最先进的性能。在本文中,我们针对规模相对较小的LLM(如LLaMa-2系列和Mistral 7B)重新审视了用于自动提示的OPRO。我们的研究表明,OPRO在小规模LLM上效果有限,其有限的推理能力制约了优化能力。我们建议未来的自动提示工程同时考虑模型能力和计算成本。此外,对于小规模LLM,我们建议将清晰阐明目标和方法的直接指令作为稳健的提示基线,以确保后续研究中提示工程的高效与有效。
https://arxiv.org/abs/2405.10276
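The OPRO loop the abstract above revisits can be sketched as follows, with stub functions standing in for the optimizer LLM and the task-accuracy scorer — both are invented here for illustration, not the paper's models:

```python
def opro_step(history, propose, evaluate, k=2):
    """One round of Optimization-by-PROmpting: show the optimizer the best
    scored instructions so far, ask it for new candidates, score them on the
    task, and return the best (instruction, score) pair found so far."""
    best = sorted(history, key=lambda p: p[1], reverse=True)[:k]
    meta_prompt = "Improve on these instructions:\n" + "\n".join(
        f"{score:.2f}: {instr}" for instr, score in best)
    for candidate in propose(meta_prompt):
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda p: p[1])

# Toy stand-ins: the mock scorer just rewards longer instructions.
history = [("Solve it.", 0.40), ("Think step by step.", 0.55)]
propose = lambda meta_prompt: ["Think step by step and show your work."]
evaluate = lambda instr: min(0.9, 0.02 * len(instr))
best_instr, best_score = opro_step(history, propose, evaluate)
print(best_instr)
```

The paper's finding is that when `propose` is a small-scale LLM, the candidates it returns improve little over the scored exemplars in the meta-prompt.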
Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of English Reddit posts by 68k authors, composed of short- to medium-length texts. We study how performance changes across evaluation conditions, including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.
作者身份混淆技术有望通过自动重写文本以隐藏原作者身份,帮助人们在在线交流中保护隐私。然而,在NLP文献中,混淆技术仅在狭窄的设定下得到评估,且主要依赖可能导致输出不自然的表层编辑操作。在这项工作中,我们提出了一个自动文本隐私化框架,通过强化学习微调一个大语言模型,以生成在通顺性、语义和隐私之间取得平衡的重写文本。我们在一个由68k名作者撰写的短至中等长度英语Reddit帖子构成的大规模测试集上对其进行了广泛评估。我们研究了性能如何随作者画像长度、作者身份检测策略等评估条件而变化。根据自动指标和人工评估,我们的方法保持了较高的文本质量,并成功规避了多种自动作者身份攻击。
https://arxiv.org/abs/2405.10260
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
随着大型语言模型(LLMs)的演进,其与3D空间数据的集成(3D-LLMs)取得了快速进展,为理解物理空间并与之交互提供了前所未有的能力。本综述对使LLMs能够处理、理解和生成3D数据的方法进行了全面概述。我们着重指出LLMs的独特优势,如上下文学习、逐步推理、开放词汇能力和广博的世界知识,并强调它们在具身人工智能(AI)系统中显著推进空间理解与交互的潜力。我们的调研涵盖从点云到神经辐射场(NeRFs)的各种3D数据表示,考察了它们与LLMs在3D场景理解、描述生成、问答和对话等任务中的集成,以及基于LLM的智能体在空间推理、规划和导航中的应用。本文还简要回顾了其他融合3D与语言的方法。本文的元分析揭示了显著的进展,但也强调需要新的方法来充分发挥3D-LLMs的潜力。因此,我们希望通过本文为未来研究规划一条探索并扩展3D-LLMs理解复杂3D世界并与之交互能力的路线。为支持这项综述,我们建立了一个项目页面,整理并列出了与本主题相关的论文:this https URL。
https://arxiv.org/abs/2405.10255
Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing dialogue generation and text summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic results, accompanied by a detailed analysis.
近来的工作已在常识推理、数学推理和代码生成等领域对大型语言模型(LLMs)进行了评估。然而,据我们所知,尚无工作专门研究LLMs在自然语言生成(NLG)任务上的表现,而这是衡量模型优劣的关键标准。因此,本文对知名且高性能的LLMs,即ChatGPT、ChatGLM、基于T5的模型、基于LLaMA的模型和基于Pythia的模型,在NLG任务上进行了全面评估。我们选取了涵盖对话生成和文本摘要的英文和中文数据集。此外,我们提出了一个包含输入模板和后处理策略的通用评估设置。我们的研究报告了自动评估结果,并辅以详细分析。
https://arxiv.org/abs/2405.10251
In this paper, we introduce a novel psychological benchmark, CPsyExam, constructed from questions sourced from Chinese language examinations. CPsyExam is designed to prioritize psychological knowledge and case analysis separately, recognizing the significance of applying psychological knowledge to real-world scenarios. From a pool of 22k questions, we utilize 4k to create a benchmark that offers balanced coverage of subjects and incorporates a diverse range of case analysis techniques. Furthermore, we evaluate a range of existing large language models (LLMs), spanning from open-source to API-based models. Our experiments and analysis demonstrate that CPsyExam serves as an effective benchmark for enhancing the understanding of psychology within LLMs and enables the comparison of LLMs across various granularities.
在本文中,我们引入了一个新的心理学基准CPsyExam,其题目来源于中文考试。CPsyExam旨在分别侧重心理学知识和案例分析,强调将心理学知识应用于现实场景的重要性。我们从22k道题目中选取4k道构建该基准,使其在学科上覆盖均衡,并纳入了多样的案例分析技巧。此外,我们评估了一系列现有的大型语言模型(LLMs),涵盖从开源到基于API的模型。实验和分析表明,CPsyExam是增进LLMs心理学理解的有效基准,并支持在不同粒度上对LLMs进行比较。
https://arxiv.org/abs/2405.10212
The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs in long-fiction comprehension and reasoning. We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations. Through a series of experiments with various state-of-the-art LLMs, we demonstrate that these models face considerable challenges in effectively addressing questions related to literary fictions, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at this https URL
大型语言模型(LLMs)的快速演进带来了对其在各个维度上的表现进行全面评估的需求。在本文中,我们提出了LFED,一个文学小说评估数据集,旨在评估LLMs在长篇小说理解与推理方面的能力。我们收集了95部原创于中文或译成中文的文学小说,涵盖数个世纪以来的广泛主题。我们定义了包含8个问题类别的问题分类体系,用以指导1,304个问题的创建。此外,我们进行了深入分析,以确定文学小说的具体属性(如小说类型、角色数量、出版年份)如何影响LLMs的评估表现。通过对多种最先进LLMs的一系列实验,我们证明这些模型在有效回答与文学小说相关的问题时面临相当大的挑战,ChatGPT在零样本设置下仅达到57.08%。该数据集将在this https URL公开发布。
https://arxiv.org/abs/2405.10166
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, to enhance their ability to perform both general and special-purpose dialogue tasks. However, the ability to personalize generated utterances to speakers, whether by humans or LLMs, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under various experimental setups. We further utilize the speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that current role-playing models fail to accurately mimic speakers, primarily due to their inherent linguistic characteristics.
近来大型语言模型(LLMs)的成功引发了人们对开发角色扮演对话智能体的广泛兴趣,这类智能体根据不同说话人的特点和风格进行个性化定制,以增强其执行通用及专用对话任务的能力。然而,无论由人类还是LLM执行,将生成话语向说话人个性化的能力尚未得到充分研究。为填补这一空白,我们的研究引入了一个新颖的评估挑战:智能体生成对话中的说话人验证,旨在验证两组话语是否出自同一说话人。为此,我们汇集了一个包含数千名说话人及其话语的大型数据集。我们还在多种实验设置下开发并评估了说话人验证模型。我们进一步利用说话人验证模型评估基于LLM的角色扮演模型的个性化能力。全面的实验表明,当前的角色扮演模型未能准确模仿说话人,主要归因于其固有的语言特征。
https://arxiv.org/abs/2405.10150
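The speaker-verification challenge above can be illustrated with a minimal centroid-plus-cosine-similarity check over utterance embeddings; the embeddings and threshold are made up, and the paper's trained verification models are of course far more sophisticated:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def same_speaker(embs_a, embs_b, threshold=0.8):
    """Decide whether two sets of utterance embeddings come from one speaker
    by comparing the cosine similarity of their centroids to a threshold."""
    centroid = lambda embs: [sum(x) / len(embs) for x in zip(*embs)]
    return cosine(centroid(embs_a), centroid(embs_b)) >= threshold

a = [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]]   # utterances attributed to speaker A
b = [[0.95, 0.15, 0.05]]                 # candidate utterances, similar style
c = [[0.0, 1.0, 0.9]]                    # stylistically different speaker
print(same_speaker(a, b), same_speaker(a, c))  # True False
```

Applied to role-playing models, one set would hold the real speaker's utterances and the other the agent's imitations.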
In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM to route the vision and language flows during attention computing to enable different attention patterns in inner-modal modeling and cross-modal interaction scenarios. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training data, providing a new perspective for future multimodal foundation models. Code is available at this https URL.
在这项工作中,我们引入了Libra,一个在大型语言模型(LLM)之上具有解耦视觉系统的原型模型。解耦的视觉系统将模态内建模与跨模态交互分离,带来独特的视觉信息建模和有效的跨模态理解。Libra通过在视觉和语言输入上进行离散自回归建模来训练。具体来说,我们将带有跨模态桥接模块的路由视觉专家引入预训练LLM,在注意力计算过程中对视觉流和语言流进行路由,从而在模态内建模和跨模态交互场景中实现不同的注意力模式。实验结果表明,Libra的专门设计仅用5000万训练数据就达到了可与图像到文本场景中现有工作相媲美的强MLLM基线,为未来的多模态基础模型提供了新的视角。代码可在this https URL获取。
https://arxiv.org/abs/2405.10140
The emergence of large language models (LLMs) capable of generating realistic texts and images has sparked ethical concerns across various sectors. In response, researchers in academia and industry are actively exploring methods to distinguish AI-generated content from human-authored material. However, a crucial question remains: What are the unique characteristics of AI-generated text? Addressing this gap, this study proposes StyloAI, a data-driven model that uses 31 stylometric features to identify AI-generated texts by applying a Random Forest classifier on two multi-domain datasets. StyloAI achieves accuracy rates of 81% and 98% on the test set of the AuTextification dataset and the Education dataset, respectively. This approach surpasses the performance of existing state-of-the-art models and provides valuable insights into the differences between AI-generated and human-authored texts.
能够生成逼真文本和图像的大语言模型(LLMs)的出现引发了各个领域的伦理担忧。为此,学术界和工业界的研究人员正在积极探索区分AI生成内容与人类撰写材料的方法。然而,一个关键问题仍然存在:AI生成文本有哪些独特特征?针对这一空白,本研究提出了StyloAI,一种数据驱动的模型,通过在两个多领域数据集上应用随机森林分类器,利用31个文体特征来识别AI生成的文本。StyloAI在AuTextification数据集和Education数据集的测试集上分别达到81%和98%的准确率。该方法超越了现有最先进模型的性能,并为理解AI生成文本与人类撰写文本之间的差异提供了宝贵的见解。
https://arxiv.org/abs/2405.10129
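A sketch of the stylometric-feature idea behind StyloAI: the paper uses 31 features with a Random Forest classifier, whereas the four features below are merely illustrative stand-ins computed with the standard library:

```python
import string

def stylometric_features(text):
    """Compute a few illustrative stylometric features of a text: average word
    length, average sentence length, type-token ratio, and punctuation rate."""
    words = text.split()
    sentences = [s for s in
                 text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "avg_word_len": sum(len(w.strip(string.punctuation)) for w in words) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "punct_rate": sum(text.count(c) for c in ",;:") / len(words),
    }

feats = stylometric_features("The model writes smoothly. Perhaps too smoothly, in fact.")
print(sorted(feats))
```

A classifier such as a Random Forest would then be trained on vectors of such features extracted from labeled human-authored and AI-generated texts.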
Most language models currently available are prone to self-contradiction during dialogues. To mitigate this issue, this study explores a novel contradictory dialogue processing task that aims to detect and modify contradictory statements in a conversation. This task is inspired by research on context faithfulness and dialogue comprehension, which have demonstrated that the detection and understanding of contradictions often necessitate detailed explanations. We develop a dataset comprising contradictory dialogues, in which one side of the conversation contradicts itself. Each dialogue is accompanied by an explanatory label that highlights the location and details of the contradiction. With this dataset, we present a Red Teaming framework for contradictory dialogue processing. The framework detects and attempts to explain the dialogue, then modifies the existing contradictory content using the explanation. Our experiments demonstrate that the framework improves the ability to detect contradictory dialogues and provides valid explanations. Additionally, it showcases distinct capabilities for modifying such dialogues. Our study highlights the importance of the logical inconsistency problem in conversational AI.
目前可用的大多数语言模型在对话过程中容易自相矛盾。为缓解这一问题,本研究探索了一个新颖的矛盾对话处理任务,旨在检测并修改对话中的矛盾陈述。该任务受上下文忠实性和对话理解研究的启发,这些研究表明,矛盾的检测与理解通常需要详细的解释。我们开发了一个由矛盾对话组成的数据集,其中对话的一方自相矛盾。每段对话都附有解释性标签,指出矛盾的位置和细节。基于该数据集,我们提出了一个用于矛盾对话处理的红队(Red Teaming)框架。该框架检测并尝试解释对话,然后利用解释修改已有的矛盾内容。实验证明,该框架提高了检测矛盾对话的能力,并能提供有效的解释。此外,它在修改此类对话方面展现出独特的能力。我们的研究凸显了对话式AI中逻辑不一致问题的重要性。
https://arxiv.org/abs/2405.10128
Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision/Language Models (LVLMs) are less capable of generating accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to transform the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism to initialise reverse diffusion processes with a latent vector iteration from a previously generated image from a relevant step. Both strategies will condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of the cases against 26.6% for the second best method. In addition, automatic metrics showed that the proposed method maintains semantic coherence and visual consistency across steps in both domains.
多步骤指令(如食谱和操作指南)极大受益于视觉辅助,例如伴随指令步骤的一系列图像。尽管大型语言模型(LLMs)已能熟练生成连贯的文本步骤,大型视觉/语言模型(LVLMs)在生成配套图像序列方面能力较弱。最具挑战性的是,每张生成的图像既需要遵循相应的文本步骤指令,又要与序列中先前的图像保持视觉一致。为解决这一问题,我们提出了一种生成一致图像序列的方法,该方法将潜在扩散模型(LDM)与LLM相结合,由LLM将序列转换为描述文本,以保持序列的语义连贯性。此外,为保持图像序列的视觉连贯性,我们引入了一种复制机制,用相关步骤中先前生成图像的潜在向量迭代来初始化反向扩散过程。这两种策略都以指令步骤序列为条件约束反向扩散过程,并将当前图像的内容与先前的指令步骤及相应图像联系起来。实验表明,所提方法在46.6%的案例中被人类偏好,而第二好的方法仅为26.6%。此外,自动指标表明,所提方法在两个领域中都保持了跨步骤的语义连贯性和视觉一致性。
https://arxiv.org/abs/2405.10122
Integrating multimodal knowledge into large language models (LLMs) represents a significant advancement in dialogue generation capabilities. However, the effective incorporation of such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, using an Implicit Query Transformer to extract and encode visual implicit knowledge from image-text pairs into knowledge vectors; and knowledge integration, employing a novel Bidirectional Variational Information Fusion technique to seamlessly integrate these distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also exhibit a deep understanding of the context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Our extensive experimentation across two dialogue datasets shows that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code will be publicly available following acceptance.
将多模态知识集成到大型语言模型(LLMs)中,代表了对话生成能力的一项重大进步。然而,由于缺乏多样化、高质量的对话数据集,在零资源场景中有效融入此类知识仍是巨大挑战。为此,我们提出了视觉隐式知识蒸馏框架(VIKDF),一种旨在利用隐式多模态知识增强LLMs、在零资源情境下实现更丰富对话生成的创新方法。VIKDF包括两个主要阶段:知识蒸馏,使用隐式查询Transformer(Implicit Query Transformer)从图文对中提取视觉隐式知识并编码为知识向量;知识集成,采用新颖的双向变分信息融合技术,将这些蒸馏得到的向量无缝整合到LLMs中。这使LLMs生成的对话不仅连贯且引人入胜,还能通过隐式多模态线索展现对上下文的深刻理解,有效克服零资源场景的局限。我们在两个对话数据集上的大量实验表明,VIKDF在生成高质量对话方面优于现有最先进模型。代码将在论文被接收后公开。
https://arxiv.org/abs/2405.10121
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
开放词汇目标检测(OvOD)已将检测转变为一种语言引导的任务,使用户能够在推理时自由定义感兴趣的类别词表。然而,我们的初步调查表明,现有OvOD检测器在处理不同语义粒度的词表时表现出显著的波动,这对实际部署构成隐患。为此,我们引入了语义层级枢纽(SHiNe),一种利用类别层级结构中语义知识的新型分类器。它离线运行,分三步:i)从层级结构中为每个目标类别检索相关的上/下位类别;ii)将这些类别整合为层级感知的句子;iii)融合这些句子嵌入以生成枢纽分类器向量。我们在多个检测基准上的评估表明,SHiNe在不同词表粒度下均增强了鲁棒性,使用真实层级结构时mAP50最高提升31.9%,使用大语言模型生成的层级结构时也能保持改进。此外,应用于ImageNet-1k上的开放词汇分类时,SHiNe将CLIP零样本基线的准确率提高了2.8%。SHiNe无需训练,可与任何现成的OvOD检测器无缝集成,且不会在推理时引入额外计算开销。代码已开源。
https://arxiv.org/abs/2405.10053
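SHiNe's three offline steps can be sketched as follows. The sentence templates and the toy character-frequency "encoder" are our assumptions for illustration; the real method fuses embeddings from a vision-language text encoder such as CLIP's:

```python
def nexus_vector(embed, target, supers, subs):
    """Build a SHiNe-style classifier vector for `target`: phrase the class
    together with its retrieved super-/sub-categories as hierarchy-aware
    sentences, embed each, and fuse the embeddings by averaging."""
    sentences = [f"a photo of a {target}, which is a kind of {s}" for s in supers]
    sentences += [f"a photo of a {target}, which can be a {s}" for s in subs]
    vecs = [embed(s) for s in sentences]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy "embedding": character frequencies instead of a real text encoder.
embed = lambda s: [s.count("a") / len(s), s.count("o") / len(s)]
vec = nexus_vector(embed, "retriever",
                   supers=["dog", "animal"], subs=["golden retriever"])
print(len(vec))  # 2
```

At inference, the detector scores region features against such nexus vectors instead of embeddings of the bare class names, which is what buys robustness to vocabulary granularity.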
LLM watermarking, which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of large language models. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily experiment with, understand, and assess the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at this https URL.
LLM水印通过在模型输出中嵌入不可察觉但可被算法检测的信号来识别LLM生成的文本,已成为缓解大型语言模型潜在滥用的关键手段。然而,LLM水印算法数量众多、机制复杂,加之评估流程和视角繁杂,使研究人员和社区难以便捷地实验、理解和评估最新进展。为解决这些问题,我们引入了MarkLLM,一个用于LLM水印的开源工具包。MarkLLM为实现LLM水印算法提供了统一且可扩展的框架,同时提供用户友好的接口以确保易用性。此外,它支持对这些算法底层机制的自动可视化,从而增进理解。在评估方面,MarkLLM提供了涵盖三个视角、由12个工具组成的综合套件,以及两类自动化评估流水线。通过MarkLLM,我们旨在支持研究人员,同时提升公众对LLM水印技术的理解和参与,促进共识并推动研究与应用的进一步发展。我们的代码可在this https URL获取。
https://arxiv.org/abs/2405.10051
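One watermarking scheme of the family such toolkits cover — a previous-token-seeded "green list" with a z-score detector — can be sketched in a few lines. This is a simplified single-file illustration, not MarkLLM's API:

```python
import hashlib
import math

def green_score(tokens, gamma=0.5):
    """Score text with a green-list watermark detector: each token's hash,
    seeded by the previous token, marks it 'green' with probability gamma.
    Unwatermarked text stays near gamma; generations biased toward green
    tokens drift above it, yielding a large one-proportion z-statistic."""
    green = 0
    for prev, tok in zip(tokens, tokens[1:]):
        digest = hashlib.sha256(f"{prev}|{tok}".encode()).digest()
        if digest[0] < 256 * gamma:
            green += 1
    n = len(tokens) - 1
    frac = green / n
    z = (frac - gamma) / math.sqrt(gamma * (1 - gamma) / n)
    return frac, z

frac, z = green_score("the quick brown fox jumps over the lazy dog".split())
print(0.0 <= frac <= 1.0)  # True
```

During generation, the matching embedder would boost the logits of green-listed tokens, which is what the detector's z-test later picks up.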
Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
大语言模型(LLMs)用途广泛,能够处理许多任务,但出于计算效率的考虑,通常希望将其能力蒸馏到更小的学生模型中。对于分类任务,一种做法是数据集合成,即让LLM生成每个标签的示例。以往的合成方法使用少样本提示,依赖LLM的参数化知识来生成可用示例,但这会导致重复、偏向热门实体以及与人类文本存在风格差异等问题。在这项工作中,我们提出了检索与精炼合成方法(SynthesizRR),利用检索增强为数据集合成过程引入多样性:随着检索到的段落变化,LLM被不同的内容"播种"来生成示例。我们对六个数据集的合成进行了实证研究,涵盖主题分类、情感分析、语气检测和幽默识别,这些任务需要复杂的合成策略。我们发现,与标准的32样本提示和六种基线方法相比,SynthesizRR显著提升了词汇和语义多样性、与人类撰写文本的相似度以及蒸馏性能。
https://arxiv.org/abs/2405.10040
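The retrieval-seeding step of SynthesizRR can be sketched as prompt construction: each retrieved passage seeds one generation prompt, so the synthesized dataset inherits the corpus's variety. The template wording and passages below are invented for illustration:

```python
def seeded_prompts(label, passages, template=None):
    """Build one synthesis prompt per retrieved passage, so the LLM is
    'seeded' with different content for each generated example instead of
    drawing every example from its parametric memory."""
    template = template or (
        "Passage:\n{passage}\n\n"
        "Using only details from the passage, write one '{label}' example "
        "for a classifier, rephrased in your own words.")
    return [template.format(passage=p, label=label) for p in passages]

prompts = seeded_prompts(
    "sports",
    ["The striker scored twice...", "Lap times fell as..."])
print(len(prompts))  # 2
```

The refinement half of the method would then have the LLM rewrite the passage-grounded draft into the target style, which this sketch omits.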
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content, 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
近来大型语言模型(LLMs)的进展推动了自动语音识别(ASR)的生成式纠错(GER),其目标是从解码得到的N-best假设中预测真实转录。得益于LLMs强大的语言生成能力和N-best列表中的丰富信息,GER在改进ASR结果方面表现出显著成效。然而,它仍存在两个局限:1)LLMs在GER过程中无法感知源语音,可能产生语法正确但背离源语音内容的结果;2)N-best假设通常只在少数词元上存在差异,将它们全部送入GER存在冗余,可能使LLM不清楚应关注哪些词元,从而增加误纠。在本文中,我们提出了ClozeGER,一种新的ASR生成式纠错范式。首先,我们引入多模态LLM(即SpeechGPT),将源语音作为额外输入,以提高纠错输出的保真度。然后,我们将GER重构为带logits校准的完形填空测试,以消除输入信息冗余,并用清晰的指令简化GER。实验表明,ClozeGER在9个主流ASR数据集上相比普通GER取得了新的突破。
https://arxiv.org/abs/2405.10025
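The cloze reformatting that ClozeGER applies to the N-best list can be illustrated as follows; this sketch assumes length-aligned hypotheses and omits the paper's logits calibration and speech input:

```python
def nbest_to_cloze(hypotheses):
    """Collapse same-length N-best ASR hypotheses into one cloze template:
    positions where all hypotheses agree stay as text, while disagreeing
    positions become blanks paired with their candidate fillers, removing
    the redundancy of sending every full hypothesis to the LLM."""
    token_lists = [h.split() for h in hypotheses]
    assert len({len(t) for t in token_lists}) == 1, "sketch assumes aligned lengths"
    template, slots = [], []
    for position in zip(*token_lists):
        if len(set(position)) == 1:
            template.append(position[0])
        else:
            template.append("___")
            slots.append(sorted(set(position)))
    return " ".join(template), slots

template, slots = nbest_to_cloze([
    "turn on the kitchen light",
    "turn on the kitchen lights",
    "turn in the kitchen light",
])
print(template)  # turn ___ the kitchen ___
```

The corrector then only has to fill the blanks from the candidate sets (guided by calibrated logits in the paper), rather than rewrite whole hypotheses.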