Multi-intent natural language understanding (NLU) presents a formidable challenge due to the model confusion arising from multiple intents within a single utterance. While previous works train models contrastively to increase the margin between different multi-intent labels, they are less suited to the nuances of multi-intent NLU: they ignore the rich information shared between intents, which is beneficial for constructing a better embedding space, especially in low-data scenarios. We introduce a two-stage Prediction-Aware Contrastive Learning (PACL) framework for multi-intent NLU to harness this valuable knowledge. Our approach capitalizes on shared intent information by integrating word-level pre-training and prediction-aware contrastive fine-tuning. We construct a pre-training dataset using a word-level data augmentation strategy. Subsequently, our framework dynamically assigns roles to instances during contrastive fine-tuning while introducing a prediction-aware contrastive loss to maximize the impact of contrastive learning. We present experimental results and empirical analysis conducted on three widely used datasets, demonstrating that our method surpasses the performance of three prominent baselines in both low-data and full-data scenarios.
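As a concrete illustration of the fine-tuning stage, below is a minimal PyTorch sketch of how a prediction-aware weight might modulate a supervised contrastive loss over multi-intent examples. The pairing rule (any shared intent counts as a positive) and the confidence-based weighting are assumptions for illustration, not PACL's exact formulation.

```python
import torch
import torch.nn.functional as F

def prediction_aware_contrastive_loss(emb, labels, pred_probs, tau=0.1):
    # emb: (B, d) utterance embeddings; labels: (B, C) multi-hot intent
    # vectors (float); pred_probs: (B, C) current model predictions.
    # Hypothetical weighting: pairs the model already predicts confidently
    # contribute less, so hard shared-intent pairs dominate the loss.
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.T / tau
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(eye, float("-inf"))        # exclude self-pairs
    pos = ((labels @ labels.T) > 0).float().masked_fill(eye, 0.0)
    conf = (pred_probs * labels).sum(-1) / labels.sum(-1).clamp(min=1)
    weight = (1 - conf).unsqueeze(0) * (1 - conf).unsqueeze(1)
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)        # avoid 0 * -inf = nan
    return -(weight * pos * log_prob).sum() / pos.sum().clamp(min=1)
```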
https://arxiv.org/abs/2405.02925
Social media platforms such as Twitter, Reddit, and Sina Weibo play a crucial role in global communication but often encounter strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify their way of communicating, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally evolves under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for ensuring freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation framework using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: a supervisory agent that enforces dialogue supervision, and participant agents that evolve their language strategies while engaging in conversation, simulating how communication styles evolve under strict regulation as users attempt to evade it. The study evaluates the framework's effectiveness across a range of scenarios, from abstract settings to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, LLM agents were found to adopt different strategies for different scenarios.
https://arxiv.org/abs/2405.02858
How to eliminate pronominal reference in group chats? In this work, we preprocessed 58k authentic chat messages and manually annotated 2.3k questions. The reliability of this annotation was confirmed by the scaling law. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters. The optimal version improved the F1 score by 29.07 points. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) Supervised Fine-Tuning (SFT) training data in Alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data that leverages the scaling-law principle. The scripts, raw data in Alpaca format, and experiment tracking are open-sourced on GitHub (this https URL), HuggingFace (this https URL), and WandB (this https URL). Use of the data involved has been authorized by the users.
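For readers unfamiliar with the Alpaca format mentioned above, here is a hypothetical example of what a single SFT record for this pronoun-resolution task might look like; the instruction wording and field contents are invented, not taken from the released data.

```python
import json

# Hypothetical Alpaca-format record (instruction / input / output).
record = {
    "instruction": "Rewrite the question so that every pronoun is replaced "
                   "by its explicit referent from the chat context.",
    "input": "A: The new build broke login again.\nB: Did he push a fix for it?",
    "output": "Did the author of the new build push a fix for the login bug?",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```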
https://arxiv.org/abs/2405.02817
This paper introduces Stochastic RAG, a novel approach for end-to-end optimization of retrieval-augmented generation (RAG) models that relaxes the simplifying assumptions of marginalization and document independence made in most prior work. Stochastic RAG casts retrieval in RAG as stochastic sampling without replacement. Through this formulation, we employ straight-through Gumbel-top-k, which provides a differentiable approximation of sampling without replacement and enables effective end-to-end optimization for RAG. We conduct extensive experiments on seven diverse datasets spanning a wide range of tasks, from open-domain question answering and fact verification to slot filling for relation extraction and dialogue systems. By applying this optimization method to a recent and effective RAG model, we advance state-of-the-art results on six of the seven datasets.
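The straight-through Gumbel-top-k trick is well established; a minimal PyTorch sketch of how it yields a differentiable k-document selection is below. The softmax relaxation used for the backward pass is one common choice and may differ from the paper's exact estimator.

```python
import torch

def st_gumbel_top_k(scores, k, tau=1.0):
    # Perturb retrieval scores with Gumbel noise: the top-k of the
    # perturbed scores is a sample of k documents without replacement.
    gumbel = -torch.log(-torch.log(torch.rand_like(scores)))
    perturbed = (scores + gumbel) / tau
    topk = perturbed.topk(k, dim=-1).indices
    hard = torch.zeros_like(scores).scatter(-1, topk, 1.0)
    soft = torch.softmax(perturbed, dim=-1)   # relaxation for gradients
    # Straight-through estimator: the forward pass uses the hard 0/1
    # mask, while gradients flow through the soft relaxation.
    return hard + soft - soft.detach()
```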
https://arxiv.org/abs/2405.02816
Large Language Models (LLMs) have become integral to a wide spectrum of applications, ranging from traditional computing tasks to advanced artificial intelligence (AI) applications. This widespread adoption has spurred extensive research into LLMs across various disciplines, including the social sciences. Notably, studies have revealed that LLMs possess emotional intelligence, which can be further developed through positive emotional stimuli. This discovery raises an intriguing question: can negative emotions similarly influence LLMs, potentially enhancing their performance? In response to this question, we introduce NegativePrompt, a novel approach underpinned by psychological principles, involving ten specifically designed negative emotional stimuli. We embark on rigorous experimental evaluations of five LLMs including Flan-T5-Large, Vicuna, Llama 2, ChatGPT, and GPT-4, across a set of 45 tasks. The results are revealing: NegativePrompt markedly enhances the performance of LLMs, evidenced by relative improvements of 12.89% in Instruction Induction tasks and 46.25% in BIG-Bench tasks. Moreover, we conduct attention visualization experiments to decipher the underlying mechanisms of NegativePrompt's influence. Our research contributes significantly to the understanding of LLMs and emotion interaction, demonstrating the practical efficacy of NegativePrompt as an emotion-driven method and offering novel insights for the enhancement of LLMs in real-world applications. The code is available at this https URL.
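To make the setup concrete, here is a sketch of how a negative emotional stimulus can be appended to a task prompt. The stimuli below are hypothetical paraphrases in the spirit of the method; the paper's ten exact wordings are not reproduced here.

```python
# Hypothetical stimuli, not the paper's exact wordings.
NEGATIVE_STIMULI = [
    "If you get this wrong, people will lose trust in you.",
    "Previous attempts at this task have ended in failure.",
]

def negative_prompt(task_prompt: str, stimulus_idx: int = 0) -> str:
    # NegativePrompt-style construction: task text followed by a
    # negative emotional stimulus.
    return f"{task_prompt}\n\n{NEGATIVE_STIMULI[stimulus_idx]}"

print(negative_prompt("Classify the sentiment of: 'The movie was fine.'"))
```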
https://arxiv.org/abs/2405.02814
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box-style attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5. We assess the impact of model size, structure, and fine-tuning strategies on their resistance to adversarial perturbations. Our comprehensive evaluation across five diverse text classification tasks establishes a new benchmark for LLM robustness. The findings of this study have far-reaching implications for the reliable deployment of LLMs in real-world applications and contribute to the advancement of trustworthy AI systems.
https://arxiv.org/abs/2405.02764
Large language models (LLMs) have shown remarkable adaptability to diverse tasks by leveraging context prompts containing instructions or minimal input-output examples. However, recent work revealed they also exhibit label bias -- an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model's predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, and highlights the importance of outcome-based evaluation metrics, which had not previously been used in this regard. We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches at both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability.
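The paper's calibration method itself is new, but the general idea can be illustrated with the classic contextual-calibration baseline (Zhao et al., 2021), which estimates label bias from a content-free input such as "N/A" and divides it out. The sketch below is that baseline, not the proposed method.

```python
import numpy as np

def contextual_calibration(label_probs, content_free_probs):
    # label_probs: (N, C) label probabilities for N test inputs;
    # content_free_probs: (C,) probabilities the model assigns to each
    # label for a content-free input -- an estimate of its label bias.
    calibrated = label_probs / content_free_probs
    return calibrated / calibrated.sum(axis=-1, keepdims=True)
```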
https://arxiv.org/abs/2405.02743
Methods for relation extraction from text mostly focus on high precision, at the cost of limited recall. High recall is crucial, though, to populate long lists of object entities that stand in a specific relation with a given subject. Cues for relevant objects can be spread across many passages in long texts. This poses the challenge of extracting long lists from long texts. We present the L3X method which tackles the problem in two stages: (1) recall-oriented generation using a large language model (LLM) with judicious techniques for retrieval augmentation, and (2) precision-oriented scrutinization to validate or prune candidates. Our L3X method outperforms LLM-only generations by a substantial margin.
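A minimal sketch of the two-stage shape of such a pipeline is below, with hypothetical retrieve, llm, and scorer interfaces passed in as parameters; none of these names come from the paper.

```python
def extract_long_list(subject, relation, corpus, retrieve, llm, scorer,
                      threshold=0.5):
    # Stage 1: recall-oriented generation -- pool candidates from many
    # retrieved passages, tolerating noise to maximize coverage.
    candidates = set()
    for passage in retrieve(subject, relation, corpus):
        candidates.update(llm(subject, relation, passage))
    # Stage 2: precision-oriented scrutinization -- validate or prune
    # each candidate with a dedicated scoring step.
    return sorted(c for c in candidates
                  if scorer(subject, relation, c) >= threshold)
```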
https://arxiv.org/abs/2405.02732
The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected not only to rely on the semantic relevance between the documents and the queries but also to recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expected to identify evidence from both supporting and contradicting perspectives, for the downstream system to make a fair judgment call. In this work, we study whether retrievers can recognize and respond to different perspectives of the queries -- beyond finding relevant documents for a claim, can retrievers distinguish supporting from opposing documents? We reformulate and extend six existing tasks to create a retrieval benchmark in which diverse perspectives are described in free-form text alongside the root, neutral queries. We show that the current retrievers covered in our experiments have limited awareness of subtly different perspectives in queries and can also be biased toward certain perspectives. Motivated by this observation, we further explore the potential of leveraging geometric features of the retriever representation space to improve the perspective awareness of retrievers in a zero-shot manner. We demonstrate the efficiency and effectiveness of our projection-based methods on the same set of tasks. Further analysis also shows how perspective awareness improves performance on various downstream tasks, with 4.2% higher accuracy on AmbigQA and 29.9% more correlation with designated viewpoints in essay writing, compared to non-perspective-aware baselines.
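As a rough sketch of what a zero-shot, projection-based adjustment might look like, one can shift the query embedding along the normalized direction of a perspective embedding before nearest-neighbor search. The alpha scaling and the exact geometric operation are assumptions for illustration, not necessarily the paper's method.

```python
import numpy as np

def perspective_shift(query_emb, perspective_emb, alpha=0.5):
    # Move the query embedding toward the perspective direction so that
    # retrieval favors documents aligned with that perspective.
    # alpha (hypothetical) controls the strength of the shift.
    direction = perspective_emb / np.linalg.norm(perspective_emb)
    shifted = query_emb + alpha * direction
    return shifted / np.linalg.norm(shifted)  # renormalize for cosine search
```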
https://arxiv.org/abs/2405.02714
Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks. We investigate the issue of prompt design in the multi-turn text-to-SQL task and attempt to enhance the LLMs' reasoning capacity when generating SQL queries. In the conversational context, the current SQL query can be derived from the preceding SQL query with only a few operations, owing to the context dependency. We introduce our method, called CoE-SQL, which prompts LLMs to generate the SQL query based on the previously generated SQL query with an edition chain. We also conduct extensive ablation studies to determine the optimal configuration of our approach. Our approach consistently outperforms various in-context learning baselines and achieves state-of-the-art performance with LLMs on the SParC and CoSQL benchmarks, remaining competitive with SOTA fine-tuned models.
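A hypothetical illustration of what an edition-chain prompt for one conversational turn could look like; the template wording is invented, and CoE-SQL's actual prompt format may differ.

```python
# Invented template; CoE-SQL's actual prompt format may differ.
prompt = """Previous question: Show the names of all students.
Previous SQL: SELECT name FROM students;

Current question: Only keep those older than 20.
Edition chain:
1. ADD the condition `age > 20` to the WHERE clause.
Current SQL: SELECT name FROM students WHERE age > 20;
"""
print(prompt)
```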
https://arxiv.org/abs/2405.02712
With the deluge of information delivered by the daily news cycle, there is a growing need to effectively and efficiently summarize news feeds for quick consumption. We leverage large language models (LLMs), with their advanced learning and generative abilities as compared to conventional language models, to generate concise and coherent summaries for news articles from the XSum dataset. Our paper focuses on two key aspects of LLMs: Efficient in-context Learning (ELearn) and Parameter Efficient Fine-tuning (EFit). Under ELearn, we find that increasing the number of shots in prompts and utilizing simple templates generally improve the quality of summaries. We also find that utilizing relevant examples in few-shot learning for ELearn does not improve model performance. In addition, we studied EFit using different methods and demonstrate that fine-tuning the first layer of LLMs produces better outcomes as compared to fine-tuning other layers or utilizing LoRA. We also find that leveraging more relevant training samples using selective layers does not result in better performance. By combining ELearn and EFit, we create a new model (ELearnFit) that leverages the benefits of both few-shot learning and fine-tuning and produces superior performance to either model alone. We also use ELearnFit to highlight the trade-offs between prompting and fine-tuning, especially for situations where only a limited number of annotated samples are available. Ultimately, our research provides practical techniques to optimize news summarization during the prompting and fine-tuning stages and enhances the synthesis of news articles.
https://arxiv.org/abs/2405.02710
Narratives serve as fundamental frameworks in our understanding of the world and provide a versatile foundation for collaborative sensemaking. Framing is a subtle yet potent mechanism that influences public perception through specific word choices, shaping interpretations of reported news events. Despite the recognized importance of narratives and framing, a significant gap exists in the literature regarding the explicit consideration of framing in the context of computational extraction and representation. This article explores the capability of a specific narrative extraction and representation approach -- narrative maps -- to capture framing information from news data. The research addresses two key questions: (1) Does the narrative extraction method capture the framing distribution of the data set? (2) Does it produce a representation with consistent framing? Our results indicate that while the algorithm captures framing distributions, achieving consistent framing across various starting and ending events poses challenges. They also highlight the potential of narrative maps to provide users with insights into the intricate framing dynamics within news narratives. However, we note that directly leveraging framing information in the computational narrative extraction process remains an open challenge.
https://arxiv.org/abs/2405.02677
Token repetition is a typical manifestation of the multi-modality problem in fully non-autoregressive translation (NAT). In this work, we revisit the multi-modality problem in recently proposed NAT models. Our study reveals that these advanced models have introduced other types of information redundancy errors, which cannot be measured by the conventional metric, the continuous repetition ratio. By manually annotating the NAT outputs, we identify two types of information redundancy errors that correspond well to lexical and reordering multi-modality problems. Since human annotation is time-consuming and labor-intensive, we propose automatic metrics to evaluate the two types of redundancy errors. Our metrics allow future studies to evaluate new methods and gain a more comprehensive understanding of their effectiveness.
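For reference, the conventional continuous repetition ratio can be computed as below; this is a straightforward reading of the metric (fraction of tokens that immediately repeat their predecessor), and the paper's exact normalization may differ.

```python
def continuous_repetition_ratio(tokens):
    # Fraction of tokens that immediately repeat the preceding token.
    if len(tokens) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return repeats / len(tokens)

print(continuous_repetition_ratio("the the cat sat sat sat".split()))  # 0.5
```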
https://arxiv.org/abs/2405.02673
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation, as LLMs tend to get ``lost in the middle'' when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named ``Reinforced Retriever-Reorder-Responder'' (R$^4$) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the LLMs' large numbers of parameters remain frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment organizes the retrieved documents into beginning, middle, and end positions based on graph attention learning, maximizing the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source code and trained models will be released upon paper acceptance.
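R$^4$ learns orderings via graph attention and a reinforced reward; as a simpler illustration of why ordering matters, below is the common "lost in the middle" heuristic that places the strongest documents at both ends of the context. This is a known baseline heuristic, not the paper's learned method.

```python
def ends_first_reorder(docs):
    # docs: sorted most-relevant first. Interleave so the strongest
    # documents land at the beginning and end of the prompt and the
    # weakest in the middle, where LLMs attend least.
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(ends_first_reorder(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```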
https://arxiv.org/abs/2405.02659
This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular, to bilingual Emirati speakers who often mix and switch between their local dialect and English. The dataset consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which takes the form of conversations between the host and a guest. The collection therefore contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and report some of the features and statistics of the resulting dataset. In addition, we evaluate the performance of pre-trained Arabic and multilingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic and the additional challenge of recognizing code-switching in ASR. The dataset will be made publicly available for research use.
https://arxiv.org/abs/2405.02578
Recently, many studies have shown the effectiveness of using Bidirectional Encoder Representations from Transformers (BERT) in various Natural Language Processing (NLP) tasks. Specifically, English spelling correction using an Encoder-Decoder architecture that takes advantage of BERT has achieved state-of-the-art results. However, to our knowledge, no such implementation exists for Vietnamese yet. Therefore, in this study, a combination of the Transformer architecture (state-of-the-art for Encoder-Decoder models) and BERT is proposed to handle Vietnamese spelling correction. The experimental results show that our model outperforms other approaches as well as the Google Docs Spell Checking tool, achieving a BLEU score of 86.24 on this task.
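A hedged sketch of how a BERT-initialized encoder-decoder can be assembled with the HuggingFace transformers library is shown below; the multilingual checkpoint is an assumed stand-in, and the paper's exact architecture and Vietnamese pre-trained weights may differ.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Hypothetical checkpoint choice; the paper's model may differ.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(name, name)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```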
https://arxiv.org/abs/2405.02573
Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions to lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering that optimizes prompts using human evaluation. Using this shared task, we demonstrate our system's ability to significantly improve model performance through prompt optimization, and we evaluate the input dataset.
https://arxiv.org/abs/2405.02517
This paper introduces "Semantic Scaling," a novel method for ideal point estimation from text. I leverage large language models to classify documents based on their expressed stances and extract survey-like data. I then use item response theory to scale subjects from these data. Semantic Scaling significantly improves on existing text-based scaling methods, and it allows researchers to explicitly define the ideological dimensions they measure. This represents the first scaling approach that allows such flexibility outside of survey instruments and opens new avenues of inquiry for populations that are difficult to survey. Additionally, it works with documents of varying length and produces valid estimates of both mass and elite ideology. I demonstrate that the method can differentiate between policy preferences and in-group/out-group affect. Among the public, Semantic Scaling outperforms Tweetscores according to human judgment; in Congress, it recaptures the first dimension of DW-NOMINATE while allowing for greater flexibility in resolving construct validity challenges.
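The item-response-theory step can be made concrete with the standard two-parameter logistic (2PL) model that such scaling approaches typically fit; whether Semantic Scaling uses exactly this parameterization is an assumption.

```python
import numpy as np

def two_pl(theta, a, b):
    # 2PL item response curve: probability that a subject with ideal
    # point theta endorses an item with discrimination a and location b.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

print(two_pl(theta=1.2, a=1.5, b=0.4))  # subject to the right of the item
```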
https://arxiv.org/abs/2405.02472
We reassess the Knowledge Neuron (KN) Thesis: an interpretation of the mechanism underlying the ability of large language models to recall facts from a training corpus. This nascent thesis proposes that facts are recalled from the training corpus through the MLP weights in a manner resembling key-value memory, implying in effect that "knowledge" is stored in the network. Furthermore, by modifying the MLP modules, one can control the language model's generation of factual information. The plausibility of the KN thesis has been demonstrated by the success of KN-inspired model editing methods (Dai et al., 2022; Meng et al., 2022). We find that this thesis is, at best, an oversimplification. Not only have we found that we can edit the expression of certain linguistic phenomena using the same model editing methods but, through a more comprehensive evaluation, we have found that the KN thesis does not adequately explain the process of factual expression. While it is possible to argue that the MLP weights store complex patterns that are interpretable both syntactically and semantically, these patterns do not constitute "knowledge." To gain a more comprehensive understanding of the knowledge representation process, we must look beyond the MLP weights and explore recent models' complex layer structures and attention mechanisms.
https://arxiv.org/abs/2405.02421
Language technologies have made enormous progress, especially with the introduction of large language models (LLMs). On traditional tasks such as machine translation and sentiment analysis, these models perform at near-human level. These advances can, however, exacerbate a variety of issues that models have traditionally struggled with, such as bias, evaluation, and risks. In this position paper, we argue that many of these issues share a common core: a lack of awareness of the factors, context, and implications of the social environment in which NLP operates, which we call social awareness. While NLP is getting better at solving the formal linguistic aspects, limited progress has been made in adding the social awareness required for language applications to work in all situations for all users. Integrating social awareness into NLP models will make applications more natural, helpful, and safe, and will open up new possibilities. Thus we argue that substantial challenges remain for NLP to develop social awareness and that we are just at the beginning of a new era for the field.
https://arxiv.org/abs/2405.02411