Many applications of cross-modal music retrieval are related to connecting sheet music images to audio recordings. A typical and recent approach is to learn, via deep neural networks, a joint embedding space that correlates short fixed-size snippets of audio and sheet music by means of an appropriate similarity structure. However, two challenges arise from this strategy: the requirement of strongly aligned data to train the networks, and the inherent discrepancies of musical content between audio and sheet music snippets caused by local and global tempo differences. In this paper, we address these two shortcomings by designing a cross-modal recurrent network that learns joint embeddings that can summarize longer passages of corresponding audio and sheet music. The benefits of our method are that it only requires weakly aligned audio-sheet music pairs and that the recurrent network handles the non-linearities caused by tempo variations between audio and sheet music. We conduct a number of experiments on synthetic and real piano data and scores, showing that our proposed recurrent method leads to more accurate retrieval in all possible configurations.
https://arxiv.org/abs/2309.12111
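As a rough illustration of the kind of approach the abstract describes (not the authors' exact architecture), the sketch below pairs two recurrent encoders that summarize variable-length audio and sheet-music passages into a shared embedding space and trains them with an InfoNCE-style contrastive loss over weakly aligned pairs; the feature dimensions and encoder names are assumptions made for the example.

```python
# A minimal sketch, assuming GRU passage encoders and a symmetric contrastive loss;
# this is not the paper's exact model, only the general joint-embedding recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PassageEncoder(nn.Module):
    """GRU over per-frame features; the final hidden state summarizes the passage."""
    def __init__(self, feat_dim, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        _, h = self.rnn(x)                     # h: (1, batch, embed_dim)
        return F.normalize(self.proj(h[-1]), dim=-1)

audio_enc = PassageEncoder(feat_dim=92)        # e.g. log-spectrogram bands (assumed)
score_enc = PassageEncoder(feat_dim=160)       # e.g. sheet-image column features (assumed)

def info_nce(a, s, temperature=0.07):
    """Symmetric contrastive loss over a batch of aligned (audio, score) passages."""
    logits = a @ s.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch: 8 weakly aligned passage pairs with different lengths per modality.
audio = torch.randn(8, 200, 92)
score = torch.randn(8, 120, 160)
loss = info_nce(audio_enc(audio), score_enc(score))
loss.backward()
```

Because each passage is reduced to a single summary vector, only passage-level (weak) alignment is needed, which is the property the abstract emphasizes.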
In recent years, substantial advancements in pre-trained language models have paved the way for the development of numerous non-English language versions, with a particular focus on encoder-only and decoder-only architectures. While Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited prowess in natural language understanding and generation, there remains a scarcity of encoder-decoder models designed for sequence-to-sequence tasks involving input-output pairs. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures, exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across a diverse range of sequence-to-sequence tasks, spanning summarization, rephrasing, and generative question answering. Our findings underscore the competitive performance of all models, with BART and T5 emerging as top performers across all evaluated tasks. As an additional contribution, we have made all models publicly available to the research community, fostering future exploration and development in Spanish language processing.
https://arxiv.org/abs/2309.11259
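A hedged sketch of how a released Spanish encoder-decoder checkpoint could be used for summarization with the Hugging Face transformers API; the checkpoint id below is a placeholder, not necessarily the authors' published model name.

```python
# Illustrative only: the checkpoint id is a placeholder for one of the released
# Spanish BART/T5 models; the generation call itself is standard transformers usage.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "spanish-bart-base"  # placeholder id; substitute the released model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

texto = "El estudio presenta modelos BART y T5 preentrenados con corpus en español..."
inputs = tokenizer(texto, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```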
Unsupervised sentence representation learning aims to transform input sentences into fixed-length vectors enriched with intricate semantic information while obviating the reliance on labeled data. Recent progress within this field, propelled by contrastive learning and prompt engineering, has significantly bridged the gap between unsupervised and supervised strategies. Nonetheless, the potential of Chain-of-Thought remains largely untapped within this trajectory. To unlock latent capabilities within pre-trained models, such as BERT, we propose a two-stage approach for sentence representation: comprehension and summarization. Subsequently, the output of the latter phase is harnessed as the vectorized representation of the input sentence. For further performance enhancement, we meticulously refine both the contrastive learning loss function and the template denoising technique for prompt engineering. Rigorous experimentation demonstrates that our method, CoT-BERT, transcends a suite of robust baselines without necessitating other text representation models or external databases.
https://arxiv.org/abs/2309.11143
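The following sketch illustrates the two ideas in the abstract: a two-stage "comprehend, then summarize" prompt wrapped around the input sentence, and a SimCSE-style contrastive loss over two encoded views of each sentence. The template wording and loss details are illustrative assumptions, not CoT-BERT's exact formulation.

```python
# Illustrative template and contrastive objective; in practice the [MASK]
# representation from a BERT forward pass over the prompt would be the sentence vector.
import torch
import torch.nn.functional as F

def two_stage_prompt(sentence: str) -> str:
    # Stage 1 asks the encoder to "comprehend" the sentence; stage 2 condenses it,
    # and the [MASK] slot is read out as the sentence embedding.
    return (f'The sentence "{sentence}" means something. '
            f'After thinking about it, it can be summarized as "[MASK]".')

def contrastive_loss(z1, z2, temperature=0.05):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

# Toy example: pretend these are [MASK] embeddings from two dropout-perturbed
# forward passes of the same batch through BERT.
batch = torch.randn(16, 768)
view1 = batch + 0.01 * torch.randn_like(batch)
view2 = batch + 0.01 * torch.randn_like(batch)
print(contrastive_loss(view1, view2).item())
```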
Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.
https://arxiv.org/abs/2309.09843
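A hedged sketch of how instruction-conditioned training targets might be constructed from (instruction, transcript) pairs for this kind of model; the instruction templates and pairing logic below are illustrative, not the paper's actual instruction set.

```python
# Illustrative data construction: given a reference transcript and a free-form
# instruction, derive the target output the model should emit.
def build_target(instruction: str, transcript: str) -> str:
    words = transcript.split()
    half = len(words) // 2
    if "first half" in instruction:
        return " ".join(words[:half])
    if "uppercase" in instruction or "upper case" in instruction:
        return transcript.upper()
    if "do not transcribe" in instruction or "turn off" in instruction:
        return ""                      # model should emit nothing after this point
    return transcript                  # default: plain transcription

pairs = [
    ("transcribe the utterance", "the cat sat on the mat"),
    ("transcribe first half and then turn off listening", "the cat sat on the mat"),
    ("transcribe in uppercase", "the cat sat on the mat"),
]
for instr, ref in pairs:
    print(f"{instr!r} -> {build_target(instr, ref)!r}")
```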
Citations in scholarly work serve the essential purpose of acknowledging and crediting the original sources of knowledge that have been incorporated or referenced. Depending on their surrounding textual context, these citations are used for different motivations and purposes. Large Language Models (LLMs) could be helpful in capturing this fine-grained citation information via the corresponding textual context, thereby enabling a better understanding of the literature. Furthermore, these citations also establish connections among scientific papers, providing high-quality inter-document relationships and human-constructed knowledge. Such information could be incorporated into LLM pre-training and improve the text representation in LLMs. Therefore, in this paper, we offer a preliminary review of the mutually beneficial relationship between LLMs and citation analysis. Specifically, we review the application of LLMs for in-text citation analysis tasks, including citation classification, citation-based summarization, and citation recommendation. We then summarize the research pertinent to leveraging citation linkage knowledge to improve text representations of LLMs via citation prediction, network structure information, and inter-document relationships. We finally provide an overview of these contemporary methods and put forth potentially promising avenues for combining LLMs and citation analysis in further investigation.
https://arxiv.org/abs/2309.09727
How well can large language models (LLMs) generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.
https://arxiv.org/abs/2309.09558
Video summarization remains a huge challenge in computer vision due to the size of the input videos to be summarized. We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency. Using only textual captions obtained via a zero-shot approach, we train a language transformer model and forego image representations. This method allows us to perform filtration amongst the representative text vectors and condense the sequence. With our approach, we gain explainability with natural language that comes easily for human interpretation and textual summaries of the videos. An ablation study that focuses on modality and data compression shows that leveraging text modality only effectively reduces input data processing while retaining comparable results.
https://arxiv.org/abs/2309.09405
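A rough sketch of the "filter, then condense" idea from the abstract: near-duplicate frame captions are dropped by cosine similarity so that only representative text reaches the language summarizer. The hashing-based embedding is a toy stand-in for a real text encoder, and the threshold is an assumption.

```python
# Toy caption filtering: keep a caption only if it is not too similar to any
# caption already kept; the survivors would be fed to a text summarizer.
import numpy as np

def embed(caption: str, dim: int = 256) -> np.ndarray:
    v = np.zeros(dim)
    for tok in caption.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def filter_captions(captions, threshold=0.8):
    kept, kept_vecs = [], []
    for c in captions:
        v = embed(c)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(c)
            kept_vecs.append(v)
    return kept

captions = [
    "a man rides a bicycle down the street",
    "a man is riding a bicycle down the street",
    "the man stops at a coffee shop",
    "a man rides a bicycle down a street",
]
print(filter_captions(captions))   # near-duplicates are dropped before summarization
```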
Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, to our knowledge, the summarization of diverse information dispersed across multiple articles about an event has not been previously investigated. The latter imposes a different set of challenges for a summarization model. In this paper, we propose a new task of summarizing the diverse information encountered in multiple news articles covering the same event. To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Moreover, we conduct a comprehensive analysis to pinpoint the position and verbosity biases that arise when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of the summaries, as well as their correlation with human assessments. We apply our findings to study how LLMs summarize multiple news articles by analyzing which types of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them, mainly due to their limited coverage, with GPT-4 only able to cover less than 40% of the diverse information on average.
https://arxiv.org/abs/2309.09369
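A sketch of the coverage measurement idea: for each human-validated piece of diverse information, an LLM judge is asked whether the candidate summary covers it, and the fraction covered is reported. The prompt wording is illustrative, and `ask_llm` is a placeholder for whatever model API is used.

```python
# LLM-as-judge coverage check; `ask_llm` must be replaced with an actual model call.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API call here")

def coverage(summary: str, info_items: list[str]) -> float:
    covered = 0
    for item in info_items:
        prompt = (
            "Summary:\n" + summary + "\n\n"
            "Information: " + item + "\n"
            "Does the summary cover this information? Answer yes or no."
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            covered += 1
    return covered / len(info_items) if info_items else 0.0
```

Repeating this check with shuffled item order is one simple way to probe the position bias the paper discusses.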
Despite the plethora of telehealth applications to assist home-based older adults and healthcare providers, basic messaging and phone calls are still the most common communication methods, which suffer from limited availability, information loss, and process inefficiencies. One promising solution to facilitate patient-provider communication is to leverage large language models (LLMs) with their powerful natural conversation and summarization capability. However, there is a limited understanding of LLMs' role during the communication. We first conducted two interview studies with both older adults (N=10) and healthcare providers (N=9) to understand their needs and opportunities for LLMs in patient-provider asynchronous communication. Based on the insights, we built an LLM-powered communication system, Talk2Care, and designed interactive components for both groups: (1) For older adults, we leveraged the convenience and accessibility of voice assistants (VAs) and built an LLM-powered VA interface for effective information collection. (2) For health providers, we built an LLM-based dashboard to summarize and present important health information based on older adults' conversations with the VA. We further conducted two user studies with older adults and providers to evaluate the usability of the system. The results showed that Talk2Care could facilitate the communication process, enrich the health information collected from older adults, and considerably save providers' efforts and time. We envision our work as an initial exploration of LLMs' capability in the intersection of healthcare and interpersonal communication.
https://arxiv.org/abs/2309.09357
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more interrelated document set, there does not necessarily exist a single correct answer for retrieval, making it hard to measure retrieval performance. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets. Based on this method, we introduce a novel dataset, ODSum, a sophisticated case whose document index is interdependent and often interrelated. We tackle ODMDS with the retrieve-then-summarize method and investigate the performance of a list of retrievers and summarizers. Through extensive experiments, we identify variances in evaluation metrics and provide insights into their reliability. We also find that LLMs suffer great performance loss from retrieval errors. We further experiment with methods to improve performance as well as investigate their robustness against imperfect retrieval. We will release our data and code at this https URL.
https://arxiv.org/abs/2309.08960
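A skeleton of a generic retrieve-then-summarize baseline, assuming simple TF-IDF retrieval; the paper studies a range of retrievers and summarizers, so this only shows the pipeline shape, and `summarize` is a placeholder for any summarization model.

```python
# Retrieve-then-summarize skeleton: score documents against the query with
# TF-IDF, take the top k, and hand the concatenation to a summarizer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    vec = TfidfVectorizer().fit(docs + [query])
    doc_m = vec.transform(docs)
    q_v = vec.transform([query])
    scores = cosine_similarity(q_v, doc_m)[0]
    top = scores.argsort()[::-1][:k]
    return [docs[i] for i in top]

def summarize(text: str) -> str:
    raise NotImplementedError("plug in any abstractive or extractive summarizer")

def retrieve_then_summarize(query: str, docs: list[str], k: int = 3) -> str:
    return summarize("\n\n".join(retrieve(query, docs, k)))
```

Imperfect retrieval feeds off-topic documents into `summarize`, which is the failure mode the abstract highlights for LLM summarizers.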
The evaluation of large language models is an essential task in the field of language understanding and generation. As language models continue to advance, the need for effective benchmarks to assess their performance has become imperative. In the context of Traditional Chinese, there is a scarcity of comprehensive and diverse benchmarks to evaluate the capabilities of language models, despite the existence of certain benchmarks such as DRCD, TTQA, CMDQA, and FGC dataset. To address this gap, we propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese. These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding. The proposed benchmarks offer a comprehensive evaluation framework, enabling the assessment of language models' capabilities across different tasks. In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks. The evaluation results highlight that our model, Model 7-C, achieves performance comparable to GPT-3.5 with respect to a part of the evaluated capabilities. In an effort to advance the evaluation of language models in Traditional Chinese and stimulate further research in this field, we have open-sourced our benchmark and opened the model for trial.
https://arxiv.org/abs/2309.08448
This thesis focuses on improving the pre-training of natural language models using unsupervised raw data to make them more efficient and aligned with downstream applications. In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives involve token swapping instead of masking, with RTS and C-RTS aiming to predict token originality and SLM predicting the original token values. Results show that RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM. Surprisingly, SLM outperforms MLM on certain tasks despite using the same computational budget. In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications, reducing the need for labeled data. We use large corpora like Wikipedia and CC-News to train models to recognize, in several ways, whether text spans originate from the same paragraph or document. By performing continuous pre-training, starting from existing models like RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate significant performance improvements in tasks like Fact Verification, Answer Sentence Selection, and Summarization. These improvements are especially pronounced when limited annotation data is available. The proposed objectives also achieve state-of-the-art results on various benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, as well as enhancing the quality of summaries. Importantly, these techniques can be easily integrated with other methods without altering the internal structure of Transformer models, making them versatile for various NLP applications.
https://arxiv.org/abs/2309.08272
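A minimal sketch of the Random Token Substitution (RTS) idea: a fraction of input tokens is replaced with random vocabulary tokens and the model is trained to predict, per token, whether it was substituted. The corruption rate and the tiny encoder below are illustrative assumptions, not the thesis' exact configuration.

```python
# RTS-style corruption plus a per-token original/substituted classifier.
import torch
import torch.nn as nn

def rts_corrupt(input_ids, vocab_size, rate=0.15):
    mask = torch.rand(input_ids.shape) < rate            # positions to corrupt
    random_ids = torch.randint(0, vocab_size, input_ids.shape)
    corrupted = torch.where(mask, random_ids, input_ids)
    labels = mask.long()                                  # 1 = substituted, 0 = original
    return corrupted, labels

vocab_size, seq_len, batch = 30522, 64, 8
encoder = nn.Sequential(
    nn.Embedding(vocab_size, 128),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
        num_layers=2))
head = nn.Linear(128, 2)                                  # per-token binary head

input_ids = torch.randint(0, vocab_size, (batch, seq_len))
corrupted, labels = rts_corrupt(input_ids, vocab_size)
logits = head(encoder(corrupted))                         # (batch, seq_len, 2)
loss = nn.functional.cross_entropy(logits.view(-1, 2), labels.view(-1))
loss.backward()
```

Because the head only has to say "original or not", no large output softmax over the vocabulary is needed, which is why objectives of this kind can cut pre-training time relative to MLM.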
We present a method for tackling a legal case retrieval task that encodes documents by summarizing them into a continuous vector space via a phrase scoring framework built on deep neural networks. In addition, we explore the benefits of combining lexical features with latent features generated by neural networks. Our experiments show that lexical features and latent features generated with neural networks complement each other to improve the retrieval system's performance. Furthermore, our experimental results suggest the importance of case summarization in different aspects: using provided summaries and performing encoded summarization. Our approach achieved F1 scores of 65.6% and 57.6% on the experimental datasets of legal case retrieval tasks.
https://arxiv.org/abs/2309.08187
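A hedged sketch of the lexical-plus-latent combination for ranking candidate cases: a lexical relevance score is interpolated with a cosine similarity between neural summary embeddings. The interpolation weight and the random toy scores are assumptions for illustration only.

```python
# Interpolating a lexical score with a latent (embedding cosine) score for ranking.
import numpy as np

def combined_score(lexical: float, query_vec: np.ndarray, cand_vec: np.ndarray,
                   alpha: float = 0.5) -> float:
    latent = float(query_vec @ cand_vec /
                   (np.linalg.norm(query_vec) * np.linalg.norm(cand_vec) + 1e-8))
    return alpha * lexical + (1 - alpha) * latent

rng = np.random.default_rng(0)
q = rng.normal(size=128)
candidates = {f"case_{i}": (rng.uniform(), rng.normal(size=128)) for i in range(5)}
ranked = sorted(candidates.items(),
                key=lambda kv: combined_score(kv[1][0], q, kv[1][1]),
                reverse=True)
print([name for name, _ in ranked])
```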
Summarization is an important application of large language models (LLMs). Most previous evaluation of summarization models has focused on their performance in content selection, grammaticality and coherence. However, it is well known that LLMs reproduce and reinforce harmful social biases. This raises the question: Do these biases affect model outputs in a relatively constrained setting like summarization? To help answer this question, we first motivate and introduce a number of definitions for biased behaviours in summarization models, along with practical measures to quantify them. Since we find biases inherent to the input document can confound our analysis, we additionally propose a method to generate input documents with carefully controlled demographic attributes. This allows us to sidestep this issue, while still working with somewhat realistic input documents. Finally, we apply our measures to summaries generated by both purpose-built summarization models and general purpose chat models. We find that content selection in single document summarization seems to be largely unaffected by bias, while hallucinations exhibit evidence of biases propagating to generated summaries.
https://arxiv.org/abs/2309.08047
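A sketch of the controlled-input idea described above: documents are generated from templates that differ only in a demographic attribute, each is summarized, and a simple statistic is compared across groups. The template, the mention-rate metric, the group labels, and the `summarize` placeholder are all illustrative assumptions, not the paper's actual measures.

```python
# Controlled documents differing only in a demographic term, followed by a toy
# comparison of how often the term survives into the summary.
from collections import defaultdict

TEMPLATE = ("{person} founded a small company in 2015. "
            "{person} grew it to fifty employees and won a local business award.")

def summarize(doc: str) -> str:
    # placeholder: substitute any summarization model or chat model here
    return doc.split(". ")[0] + "."

def mention_rate(groups: dict[str, list[str]]) -> dict[str, float]:
    rates = defaultdict(float)
    for group, names in groups.items():
        hits = 0
        for name in names:
            summary = summarize(TEMPLATE.format(person=name)).lower()
            hits += int(name.lower() in summary)
        rates[group] = hits / len(names)
    return dict(rates)

groups = {"group_a": ["Alice", "Maria"], "group_b": ["Ahmed", "Wei"]}
print(mention_rate(groups))   # large gaps between groups would suggest bias
```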
In text documents such as news articles, the content and key events usually revolve around a subset of all the entities mentioned in a document. These entities, often deemed as salient entities, provide useful cues of the aboutness of a document to a reader. Identifying the salience of entities was found helpful in several downstream applications such as search, ranking, and entity-centric summarization, among others. Prior work on salient entity detection mainly focused on machine learning models that require heavy feature engineering. We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches. To this end, we conduct a comprehensive benchmarking of four publicly available datasets using models representative of the medium-sized pre-trained language model family. Additionally, we show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
https://arxiv.org/abs/2309.07990
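A generic cross-encoder sketch for entity salience using the Hugging Face transformers API: the document and the candidate entity are encoded as a text pair and classified as salient or not. The checkpoint and label convention are illustrative, not the paper's exact setup, and the classification head would need fine-tuning before its scores mean anything.

```python
# Cross-encoder scoring of a (document, entity) pair; before fine-tuning the
# newly initialized head, the printed probability is arbitrary.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"     # stand-in for a medium-sized pre-trained LM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

document = "NASA announced a new lunar mission. The agency will partner with ESA..."
entity = "NASA"
inputs = tokenizer(document, entity, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(f"P(salient) = {probs[0, 1].item():.3f}")
```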
Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks, such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators that can rank or score the output of other models (usually LLMs) has become increasingly popular, due to the limitations of current evaluation techniques, including the lack of appropriate benchmarks and metrics, cost, and access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores and should therefore be used with caution, always calibrated against a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.
https://arxiv.org/abs/2309.07462
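A sketch of the calibration check the abstract describes: per-language correlation and mean score gap between LLM-judge scores and native-speaker judgments. The toy numbers are made up purely to show the computation; in practice the 20k human judgments would be used.

```python
# Compare LLM-judge scores against human judgments per language.
from scipy.stats import spearmanr

judgments = {
    "hi": {"human": [3, 4, 2, 5, 3], "llm": [4, 5, 4, 5, 4]},
    "ja": {"human": [2, 3, 4, 4, 5], "llm": [3, 3, 5, 4, 5]},
}

for lang, scores in judgments.items():
    rho, _ = spearmanr(scores["human"], scores["llm"])
    gap = (sum(scores["llm"]) / len(scores["llm"])
           - sum(scores["human"]) / len(scores["human"]))
    print(f"{lang}: spearman={rho:.2f}, mean LLM-human gap={gap:+.2f}")
    # a consistently positive gap is the upward score bias the paper warns about
```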
Sifting through vast textual data and summarizing key information imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy across diverse clinical summarization tasks has not yet been rigorously examined. In this work, we employ domain adaptation methods on eight LLMs, spanning six datasets and four distinct summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods, in addition to instances where recent advances in LLMs may not lead to improved results. Further, in a clinical reader study with six physicians, we show that summaries from the best adapted LLM are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis delineates the mutual challenges faced by both LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to enhance our understanding of how these metrics align with physician preferences. Our research marks the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and other irreplaceable human aspects of medicine.
https://arxiv.org/abs/2309.07430
Recent large language models (LLMs), such as ChatGPT, have been widely applied to a wide range of software engineering tasks. Many papers have reported analyses of the potential advantages and limitations of ChatGPT for writing code, summarization, text generation, etc. However, the analysis of the current state of ChatGPT for log processing has received little attention. Logs generated by large-scale software systems are complex and hard to understand. Despite their complexity, they provide crucial information for subject matter experts to understand the system status and diagnose problems. In this paper, we investigate the current capabilities of ChatGPT to perform several interesting tasks on log data, while also trying to identify its main shortcomings. Our findings show that the performance of the current version of ChatGPT for log processing is limited, with a lack of consistency in responses and scalability issues. We also outline our views on how we perceive the role of LLMs in the log processing discipline and possible next steps to improve the current capabilities of ChatGPT and future LLMs in this area. We believe our work can contribute to future academic research addressing the identified issues.
https://arxiv.org/abs/2309.07938
Understanding procedural natural language (e.g., step-by-step instructions) is a crucial step towards execution and planning. However, while there are ample corpora and downstream tasks available in English, the field lacks such resources for most languages. To address this gap, we conduct a case study on Turkish procedural texts. We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools, where the translation quality and loyalty to the original meaning are validated by a team of experts on a random sample. Then, we generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization. To tackle these tasks, we implement strong baseline models by fine-tuning large language-specific models such as TR-BART and BERTurk, as well as multilingual models such as mBART, mT5, and XLM. We find that language-specific models consistently outperform their multilingual counterparts by a significant margin across most procedural language understanding (PLU) tasks. We release our corpus, downstream tasks, and the baseline models at this https URL (GGLAB-KU/turkish-plu).
https://arxiv.org/abs/2309.06698
Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves 30% lower perplexity than Llama-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embedding extractions than state-of-the-art foundation models, despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
https://arxiv.org/abs/2309.06126
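A sketch of the perplexity measurement behind the reported 30% gap: a held-out abstract is scored with a causal language model and the average token loss is exponentiated. The "gpt2" checkpoint is only a small, convenient stand-in; in practice the released AstroLLaMA and Llama-2 weights would be evaluated on held-out astronomy abstracts.

```python
# Perplexity of a causal LM on a single held-out text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "gpt2"   # stand-in checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

abstract = "We study the kinematics of stellar streams in the Milky Way halo..."
inputs = tokenizer(abstract, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss   # mean token NLL
print(f"perplexity = {torch.exp(loss).item():.2f}")
```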