Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluation. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats together with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at this https URL.
https://arxiv.org/abs/2405.01535
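The two assessment formats the abstract contrasts can be made concrete with a small sketch. The prompt wording, section headers, and rubric field below are illustrative assumptions for this digest, not Prometheus 2's actual templates:

```python
def direct_assessment_prompt(instruction, response, rubric):
    """Ask an evaluator LM for an absolute 1-5 score against a user-defined rubric."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}\n\n"
        f"### Rubric:\n{rubric}\n\n"
        "Score the response from 1 to 5 and justify the score."
    )

def pairwise_ranking_prompt(instruction, response_a, response_b, rubric):
    """Ask an evaluator LM which of two responses better satisfies the rubric."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Response A:\n{response_a}\n\n"
        f"### Response B:\n{response_b}\n\n"
        f"### Rubric:\n{rubric}\n\n"
        "Answer 'A' or 'B' and justify the choice."
    )
```

The point of the contrast: a direct-assessment judge sees one response and a scale, while a pairwise judge sees two responses and returns a preference, and both accept the same user-defined rubric.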
Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging, particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, outperforming language-based, classical, and end-to-end approaches. Video results and code at this https URL.
https://arxiv.org/abs/2405.01534
The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset that challenges a model's true 3D situational awareness with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
https://arxiv.org/abs/2405.01533
Alignment is a standard procedure to fine-tune pre-trained large language models (LLMs) to follow natural language instructions and serve as helpful AI assistants. We have observed, however, that the conventional alignment process fails to enhance the factual accuracy of LLMs, and often leads to the generation of more false facts (i.e. hallucination). In this paper, we study how to make the LLM alignment process more factual, by first identifying factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that training the LLM on new knowledge or unfamiliar texts can encourage hallucination. This makes SFT less factual, as it trains on human-labeled data that may be novel to the LLM. Furthermore, reward functions used in standard RL can also encourage hallucination, because they guide the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses. Based on these observations, we propose factuality-aware alignment, comprising factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that our proposed factuality-aware alignment guides LLMs to output more factual responses while maintaining instruction-following capability.
https://arxiv.org/abs/2405.01525
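The direct preference optimization objective that the factuality-aware RL step builds on can be sketched for a single preference pair. This is the standard DPO loss, not the paper's factuality-aware variant, and `beta=0.1` is just a common default:

```python
import math

def dpo_loss(logp_w, logp_w_ref, logp_l, logp_l_ref, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a summed token log-probability of the chosen (w) or
    rejected (l) response under the policy or the frozen reference model.
    """
    margin = beta * ((logp_w - logp_w_ref) - (logp_l - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy prefers the chosen response more than the reference does, the margin is positive and the loss falls below log 2; a factuality-aware variant would alter which pairs count as chosen and rejected, not this arithmetic.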
The transformer structure employed in large language models (LLMs), as a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for its ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication, where proper encoding of the relevant data is critical, especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the purpose of compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information, even when only a fraction of the encoded data is transmitted, according to the intended compression rates.
https://arxiv.org/abs/2405.01521
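The attention-mask idea can be sketched in a few lines: given one attention score per image patch (e.g. the CLS-token attention averaged over heads), keep only the top fraction of patches for transmission. The scoring source and the keep ratio below are illustrative assumptions, not the paper's exact configuration:

```python
def attention_mask(attn_scores, keep_ratio=0.25):
    """Binary mask keeping the top `keep_ratio` fraction of patches by score.

    attn_scores: one attention score per image patch.
    Returns a list of 0/1 flags, 1 meaning the patch is transmitted.
    """
    k = max(1, int(len(attn_scores) * keep_ratio))
    threshold = sorted(attn_scores, reverse=True)[k - 1]
    return [1 if s >= threshold else 0 for s in attn_scores]
```

The reconstruction side would then decode only the masked-in patches, which is how the scheme trades bandwidth for preserved semantic content.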
Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages to using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but also to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. Exploring this approach across a set of diverse tasks, including a realistic chat setting, we find that it leads to higher-quality outputs than DPO with the same data budget, and greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
https://arxiv.org/abs/2405.01511
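The silver-labeling step can be sketched as follows. The `discriminator` callable and the confidence `margin` are assumptions of this sketch, standing in for the trained response evaluation model:

```python
def silver_label(pairs, discriminator, margin=0.0):
    """Turn unlabeled response pairs into silver preference data.

    pairs: iterable of (prompt, response_a, response_b) tuples.
    discriminator: hypothetical callable (prompt, response) -> scalar score.
    Pairs whose score gap does not exceed `margin` are discarded as
    low-confidence rather than labeled.
    """
    labeled = []
    for prompt, resp_a, resp_b in pairs:
        score_a = discriminator(prompt, resp_a)
        score_b = discriminator(prompt, resp_b)
        if abs(score_a - score_b) > margin:
            chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
            labeled.append((prompt, chosen, rejected))
    return labeled
```

The output tuples have the same (prompt, chosen, rejected) shape as gold preference data, so they can be mixed directly into DPO training.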
Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions and named entities, and in the final inference step, where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. Our code: this https URL.
https://arxiv.org/abs/2405.01502
While most research on controllable text generation has focused on steering base Language Models, the emerging instruction-tuning and prompting paradigm offers an alternate approach to controllability. We compile and release ConGenBench, a testbed of 17 different controllable generation tasks, using a subset of it to benchmark the performance of 9 different baselines and methods on Instruction-tuned Language Models. To our surprise, we find that prompting-based approaches outperform controllable text generation methods on most datasets and tasks, highlighting a need for research on controllable text generation with Instruction-tuned Language Models specifically. Prompt-based approaches match human performance on most stylistic tasks while lagging on structural tasks, foregrounding a need to study more varied constraints and more challenging stylistic tasks. To facilitate such research, we provide an algorithm that uses only a task dataset and a Large Language Model with in-context capabilities to automatically generate a constraint dataset. This method eliminates the field's dependence on pre-curated constraint datasets, hence vastly expanding the range of constraints that can be studied in the future.
https://arxiv.org/abs/2405.01490
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs, which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to using hundreds of GPUs for training. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced under the Apache 2.0 License and we invite community contributions at this https URL
https://arxiv.org/abs/2405.01481
Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
https://arxiv.org/abs/2405.01474
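A V-FLUTE instance as described can be modeled with a small data structure. The field names are inferred from the abstract's <image, claim, label, explanation> tuple, and the two-way entailment/contradiction label set is an assumption of this sketch:

```python
from dataclasses import dataclass

# Assumed label set; the dataset's actual label vocabulary may differ.
ENTAILMENT_LABELS = ("entailment", "contradiction")

@dataclass
class VFluteInstance:
    image_path: str   # the premise image
    claim: str        # the hypothesis to verify against the image
    label: str        # whether the image entails the claim
    explanation: str  # free-text justification for the label
    phenomenon: str   # metaphor, simile, idiom, sarcasm, or humor

    def __post_init__(self):
        if self.label not in ENTAILMENT_LABELS:
            raise ValueError(f"unknown label: {self.label}")
```

Framing each example this way makes the "explainable visual entailment" task concrete: a model must predict `label` and generate `explanation` from the image and `claim`.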
Pre-trained contrastive vision-language models have demonstrated remarkable performance across a wide range of tasks. However, they often struggle on fine-grained datasets with categories not adequately represented during pre-training, which makes adaptation necessary. Recent works have shown promising results by utilizing samples from web-scale databases for retrieval-augmented adaptation, especially in low-data regimes. Despite the empirical success, understanding how retrieval impacts the adaptation of vision-language models remains an open research question. In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations.
https://arxiv.org/abs/2405.01468
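The logit-ensemble component the study highlights can be sketched as a convex combination of zero-shot and retrieval-adapted logits; `alpha` is a hypothetical mixing weight, not a value from the paper:

```python
def logit_ensemble(zero_shot_logits, retrieval_logits, alpha=0.5):
    """Mix per-class logits from the zero-shot model with logits from a
    classifier adapted on retrieved samples."""
    return [alpha * z + (1 - alpha) * r
            for z, r in zip(zero_shot_logits, retrieval_logits)]

def predict(zero_shot_logits, retrieval_logits, alpha=0.5):
    """Predicted class index under the ensembled logits."""
    combined = logit_ensemble(zero_shot_logits, retrieval_logits, alpha)
    return max(range(len(combined)), key=combined.__getitem__)
```

At `alpha=1.0` the ensemble reduces to the zero-shot model, so the weight directly controls how much the retrieved data can override pre-trained knowledge.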
This paper investigates the use of Large Language Models (LLMs) for automating the generation of hardware description code, aiming to explore their potential in supporting and enhancing the development of efficient neuromorphic computing architectures. Building on our prior work, we employ OpenAI's ChatGPT4 and natural language prompts to synthesize an RTL Verilog module of a programmable recurrent spiking neural network, while also generating test benches to assess the system's correctness. The resultant design was validated in three case studies, the exclusive OR, the IRIS flower classification, and the MNIST hand-written digit classification, achieving accuracies of up to 96.6%. To verify its synthesizability and implementability, the design was prototyped on a field-programmable gate array and implemented on SkyWater 130 nm technology by using an open-source electronic design automation flow. Additionally, we have submitted it to the Tiny Tapeout 6 chip fabrication program to further evaluate the system's on-chip performance in the future.
https://arxiv.org/abs/2405.01419
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with an LLM requires expensive training costs, typically hundreds of A100 GPU-hours, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize the parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12-point increase in GPT-4 evaluation score on the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800s. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at this https URL.
https://arxiv.org/abs/2405.01413
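The parameter-efficiency claim is easy to sanity-check with LoRA's parameter arithmetic: an adapter on a `d_in x d_out` linear layer trains only two rank-`r` factors. The layer size and rank below are illustrative, not MiniGPT-3D's actual configuration:

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters a LoRA adapter adds to one linear layer:
    factor A is d_in x rank, factor B is rank x d_out."""
    return d_in * rank + rank * d_out

# A 4096x4096 projection with rank 16 trains under 1% of the frozen
# weight's parameters; the full weight matrix stays untouched.
full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, 16)
```

Summing such counts over the adapted layers (plus any tuned norms) is how a 47.8M-trainable-parameter figure can coexist with a multi-billion-parameter backbone.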
Natural language explanations have become a proxy for evaluating explainable and multi-step Natural Language Inference (NLI) models. However, assessing the validity of explanations for NLI is challenging as it typically involves the crowd-sourcing of apposite datasets, a process that is time-consuming and prone to logical errors. To address existing limitations, this paper investigates the verification and refinement of natural language explanations through the integration of Large Language Models (LLMs) and Theorem Provers (TPs). Specifically, we present a neuro-symbolic framework, named Explanation-Refiner, that augments a TP with LLMs to generate and formalise explanatory sentences and suggest potential inference strategies for NLI. In turn, the TP is employed to provide formal guarantees on the logical validity of the explanations and to generate feedback for subsequent improvements. We demonstrate how Explanation-Refiner can be jointly used to evaluate explanatory reasoning, autoformalisation, and error correction mechanisms of state-of-the-art LLMs as well as to automatically enhance the quality of human-annotated explanations of variable complexity in different domains.
https://arxiv.org/abs/2405.01379
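The generate-check-refine interaction between the LLM and the theorem prover can be sketched as a loop. Both callables here, `llm_refine(explanation, feedback)` and `tp_check(explanation) -> (valid, feedback)`, are hypothetical stand-ins for the framework's components:

```python
def refine_explanation(llm_refine, tp_check, explanation, max_rounds=3):
    """Iteratively refine an explanation until the prover accepts it.

    Returns (final_explanation, accepted): the prover either certifies
    logical validity, or its feedback drives the next LLM revision.
    """
    for _ in range(max_rounds):
        valid, feedback = tp_check(explanation)
        if valid:
            return explanation, True
        explanation = llm_refine(explanation, feedback)
    return explanation, False
```

The division of labor mirrors the abstract: the LLM handles generation and autoformalisation, while the prover supplies formal guarantees and the error signal for subsequent improvements.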
Large-scale machines like particle accelerators are usually run by a team of experienced operators. In the case of a particle accelerator, these operators possess suitable background knowledge on both accelerator physics and the technology comprising the machine. Due to the complexity of the machine, particular subsystems are taken care of by experts, to whom the operators can turn. In this work, the reasoning and action (ReAct) prompting paradigm is used to couple an open-weights large language model (LLM) with a high-level machine control system framework and other tools, e.g. the electronic logbook or machine design documentation. By doing so, a multi-expert retrieval augmented generation (RAG) system is implemented, which assists operators in knowledge retrieval tasks, interacts with the machine directly if needed, or writes high-level control system scripts. This consolidation of expert knowledge and machine interaction can simplify and speed up machine operation tasks for both new and experienced human operators.
https://arxiv.org/abs/2405.01359
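The ReAct coupling described above can be sketched as a loop in which the LLM alternates reasoning with tool calls. The `Action: name[input]` / `Final: answer` line format is an assumption of this sketch, and the tools stand in for the logbook, documentation, and control-system interfaces:

```python
def react_loop(llm, tools, task, max_steps=5):
    """Minimal ReAct-style loop: feed the transcript to the LLM, execute any
    requested tool call, append the observation, and stop at a final answer."""
    transcript = "Task: " + task
    for _ in range(max_steps):
        step = llm(transcript)  # expected: "Action: name[input]" or "Final: answer"
        transcript += "\n" + step
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        name, arg = step[len("Action: "):-1].split("[", 1)  # drop trailing "]"
        transcript += "\nObservation: " + tools[name](arg)
    return None
```

Registering one tool per expert subsystem is what makes the system "multi-expert": the same loop routes a question to the logbook, the design documentation, or the control system depending on the LLM's chosen action.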
Understanding user enjoyment is crucial in human-robot interaction (HRI), as it can impact interaction quality and influence user acceptance and long-term engagement with robots, particularly in the context of conversations with social robots. However, current assessment methods rely solely on self-reported questionnaires, failing to capture interaction dynamics. This work introduces the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), a novel scale for assessing user enjoyment from an external perspective during conversations with a robot. Developed through rigorous evaluations and discussions among three annotators with relevant expertise, the scale provides a structured framework for assessing enjoyment in each conversation exchange (turn) alongside overall interaction levels. It aims to complement self-reported enjoyment from users and holds the potential for autonomously identifying user enjoyment in real-time HRI. The scale was validated on 25 older adults' open-domain dialogue with a companion robot that was powered by a large language model for conversations, corresponding to 174 minutes of data, showing moderate to good alignment. Additionally, the study offers insights into understanding the nuances and challenges of assessing user enjoyment in robot interactions, and provides guidelines on applying the scale to other domains.
https://arxiv.org/abs/2405.01354
Bridging the significant gap between large language models' English and non-English performance presents a great challenge. While some previous studies attempt to mitigate this gap with translated training data, the recently proposed question alignment approach leverages the model's English expertise to improve multilingual performance with minimal use of expensive, error-prone translation. In this paper, we explore how broadly this method can be applied by examining its effects on reasoning with executable code and reasoning with common sense. We also explore how to apply this approach efficiently to extremely large language models using proxy-tuning. Experimental results on the multilingual reasoning benchmarks mGSM, mSVAMP, and xCSQA demonstrate that the question alignment approach can be used to boost multilingual performance across diverse reasoning scenarios, model families, and sizes. For instance, when applied to the LLaMA2 models, our method brings an average accuracy improvement of 12.2% on mGSM, even with the 70B model. To understand the mechanism of its success, we analyze the representation space, chain-of-thought, and translation data scales, which reveals how question translation training strengthens language alignment within LLMs and shapes their working patterns.
https://arxiv.org/abs/2405.01345
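The proxy-tuning step mentioned above operates at the logit level: the large model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart, so the large model never needs gradient updates. This is the standard proxy-tuning formulation; the numbers in the test are illustrative:

```python
def proxy_tuned_logits(base_large, base_small, tuned_small):
    """Proxy-tuned next-token logits for one decoding step.

    Each argument is a per-vocabulary-token logit list: the untuned large
    model, the untuned small model, and the (question-aligned) tuned small
    model. The small models' logit difference steers the large model.
    """
    return [L + (t - s) for L, t, s in zip(base_large, tuned_small, base_small)]
```

Sampling then proceeds from a softmax over these shifted logits, which is why the method scales to 70B-class models at the cost of running two small models alongside.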
This research introduces an innovative AI-driven precision agriculture system, leveraging YOLOv8 for disease identification and Retrieval Augmented Generation (RAG) for context-aware diagnosis. Focused on addressing the challenges of diseases affecting the coffee production sector in Karnataka, the system integrates sophisticated object detection techniques with language models to address the inherent constraints associated with Large Language Models (LLMs). Our methodology not only tackles the issue of hallucinations in LLMs, but also introduces dynamic disease identification and remediation strategies. Real-time monitoring, collaborative dataset expansion, and organizational involvement ensure the system's adaptability in diverse agricultural settings. The effect of the suggested system extends beyond automation, aiming to secure food supplies, protect livelihoods, and promote eco-friendly farming practices. By facilitating precise disease identification, the system contributes to sustainable and environmentally conscious agriculture, reducing reliance on pesticides. Looking to the future, the project envisions continuous development of RAG-integrated object detection systems, emphasizing scalability, reliability, and usability. This research strives to be a beacon for positive change in agriculture, aligning with global efforts toward sustainable and technologically enhanced food production.
https://arxiv.org/abs/2405.01310
Large Language Models (LLMs) have emerged as powerful support tools across various natural language tasks and a range of application domains. Recent studies focus on exploring their capabilities for data annotation. This paper provides a comparative overview of twelve studies investigating the potential of LLMs in labelling data. While the models demonstrate promising cost and time-saving benefits, there exist considerable limitations, such as representativeness, bias, sensitivity to prompt variations and English language preference. Leveraging insights from these studies, our empirical analysis further examines the alignment between human and GPT-generated opinion distributions across four subjective datasets. In contrast to the studies examining representation, our methodology directly obtains the opinion distribution from GPT. Our analysis thereby supports the minority of studies that are considering diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.
https://arxiv.org/abs/2405.01299
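Alignment between a human and a GPT-generated opinion distribution can be quantified with a symmetric divergence. Jensen-Shannon divergence (base 2) is one common choice, shown here as an illustration rather than the paper's exact metric:

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """JS divergence between two opinion distributions over the same answer
    options: 0 for identical distributions, 1 bit for disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

Because it is bounded and symmetric, this kind of measure lets one compare human-vs-GPT alignment consistently across datasets with different numbers of answer options.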
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is initially established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder and the performance of both architectures are compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
https://arxiv.org/abs/2405.01293
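The InterCTC objective underlying these models interpolates the final-layer CTC loss with CTC losses computed at intermediate encoder layers. The weight `w` below is an assumed value for the sketch, not the paper's setting:

```python
def interctc_loss(final_ctc, intermediate_ctc, w=0.3):
    """Interpolate the final-layer CTC loss with the mean of the CTC
    losses taken at intermediate encoder layers.

    final_ctc: scalar CTC loss from the encoder's last layer.
    intermediate_ctc: list of scalar CTC losses from intermediate layers.
    """
    inter_mean = sum(intermediate_ctc) / len(intermediate_ctc)
    return (1 - w) * final_ctc + w * inter_mean
```

In the multi-task setup the abstract describes, the intermediate branches can additionally carry an auxiliary target such as dialect identification, which is one way ASR and DID training can share a single encoder.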