Processing low-resource languages, such as Kiswahili, with machine learning is difficult due to the lack of adequate training data. Such low-resource languages nonetheless remain important for human communication, are already in daily use, and their users need practical machine-processing tasks such as summarization, disambiguation, and even question answering (QA). One method of processing these languages while bypassing the need for training data is the use of semantic networks. Some low-resource languages, such as Kiswahili, have a subject-verb-object (SVO) structure, and semantic networks are likewise subject-predicate-object triples, so SVO part-of-speech tags can be mapped onto a semantic network triple. An algorithm that processes raw natural-language text and maps it into a semantic network is therefore necessary and desirable for structuring low-resource-language texts. Tested on the Kiswahili QA task, the algorithm achieves up to 78.6% exact match.
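A minimal sketch of the SVO-to-triple mapping the abstract describes, assuming the sentence has already been part-of-speech tagged; the tag names and the example are illustrative, not the paper's actual tagset:

```python
# Map the first subject-verb-object pattern in a tagged sentence to a
# (subject, predicate, object) semantic-network triple.
def svo_to_triple(tagged_tokens):
    subject = predicate = obj = None
    for word, tag in tagged_tokens:
        if tag == "NOUN" and subject is None:
            subject = word
        elif tag == "VERB" and subject is not None and predicate is None:
            predicate = word
        elif tag == "NOUN" and predicate is not None and obj is None:
            obj = word
            break
    return None if None in (subject, predicate, obj) else (subject, predicate, obj)

# "Juma anapenda maembe" -- "Juma likes mangoes" (SVO word order).
print(svo_to_triple([("Juma", "NOUN"), ("anapenda", "VERB"), ("maembe", "NOUN")]))
# -> ('Juma', 'anapenda', 'maembe')
```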
https://arxiv.org/abs/2501.09326
As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
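Since internal consistency is reported via Cronbach's alpha, a short worked example of that statistic may be useful; the rating matrix below is synthetic and only illustrates the standard formula, not the PDSQI-9 items:

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Rows are rated summaries, columns are instrument items."""
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Synthetic 1-5 ratings: 30 summaries x 9 items, correlated via a shared base.
rng = np.random.default_rng(1)
base = rng.integers(1, 6, size=(30, 1))
items = np.clip(base + rng.integers(-1, 2, size=(30, 9)), 1, 5)
print(round(cronbach_alpha(items), 3))  # high alpha, since items co-vary
```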
https://arxiv.org/abs/2501.08977
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), incorrect knowledge in training data (Type B errors), or fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
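A toy sketch of the decompose-and-verify idea, under our own simplifying assumptions (sentence-level decomposition and an exact-match knowledge base); HALoGEN's actual verifiers are domain-specific and far more precise:

```python
# A tiny stand-in knowledge source; the real benchmark verifies against
# high-quality external sources per domain.
KNOWLEDGE = {"Paris is the capital of France.",
             "Water boils at 100 C at sea level."}

def decompose(generation: str) -> list:
    """Naive atomic units: one per sentence."""
    return [s.strip() + "." for s in generation.split(".") if s.strip()]

def hallucination_rate(generation: str) -> float:
    units = decompose(generation)
    unsupported = [u for u in units if u not in KNOWLEDGE]
    return len(unsupported) / len(units)

print(hallucination_rate(
    "Paris is the capital of France. Water boils at 50 C at sea level."))
# -> 0.5: one of the two atomic facts fails verification
```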
https://arxiv.org/abs/2501.08292
In large-scale software development, understanding the functionality and intent behind complex codebases is critical for effective development and maintenance. While code summarization has been widely studied, existing methods primarily focus on smaller code units, such as functions, and struggle with larger code artifacts like files and packages. Additionally, current summarization models tend to emphasize low-level implementation details, often overlooking the domain and business context that are crucial for real-world applications. This paper proposes a two-step hierarchical approach for repository-level code summarization, tailored to business applications. First, smaller code units such as functions and variables are identified using syntax analysis and summarized with local LLMs. These summaries are then aggregated to generate higher-level file and package summaries. To ensure the summaries are grounded in business context, we design custom prompts that capture the intended purpose of code artifacts based on the domain and problem context of the business application. We evaluate our approach on a business support system (BSS) for the telecommunications domain, showing that syntax analysis-based hierarchical summarization improves coverage, while business-context grounding enhances the relevance of the generated summaries.
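A sketch of the two-step hierarchy under stated assumptions: Python's ast module stands in for the paper's syntax analysis, and summarize() is a placeholder for the local-LLM call (the prompt wording, including the business-context line, is ours):

```python
import ast

BUSINESS_CONTEXT = "telecommunications business support system (BSS)"  # illustrative

def summarize(prompt: str) -> str:
    return f"[LLM summary of: {prompt[:60]}...]"   # placeholder for a local LLM

def summarize_file(source: str, filename: str) -> str:
    tree = ast.parse(source)
    unit_summaries = []
    for node in ast.walk(tree):                    # step 1: smaller code units
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = ast.get_source_segment(source, node)
            unit_summaries.append(summarize(
                f"In the context of a {BUSINESS_CONTEXT}, "
                f"explain the purpose of:\n{code}"))
    # step 2: aggregate unit summaries into the file-level summary
    return summarize(f"Summarize {filename} from its parts:\n"
                     + "\n".join(unit_summaries))

print(summarize_file("def bill(customer):\n    return customer.usage * 0.1\n",
                     "billing.py"))
```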
https://arxiv.org/abs/2501.07857
Malware analysis is a complex process of examining and evaluating malicious software's functionality, origin, and potential impact. This arduous process typically involves dissecting the software to understand its components, infection vector, propagation mechanism, and payload. Over the years, deep reverse engineering of malware has become increasingly tedious, mainly due to modern malicious codebases' fast evolution and sophistication. Essentially, analysts are tasked with identifying the elusive needle in the haystack within the complexities of zero-day malware, all while under tight time constraints. Thus, in this paper, we explore leveraging Large Language Models (LLMs) for semantic malware analysis to expedite the analysis of known and novel samples. Built on the GPT-4o-mini model, \msp is designed to augment malware analysis for Android through a hierarchical-tiered summarization chain and strategic prompt engineering. Additionally, \msp performs malware categorization, distinguishing potential malware from benign applications, thereby saving time during the malware reverse engineering process. Despite not being fine-tuned for Android malware analysis, we demonstrate that through optimized and advanced prompt engineering \msp can achieve up to 77% classification accuracy while providing highly robust summaries at the function, class, and package levels. In addition, leveraging the backward tracing of the summaries from package to function levels allowed us to pinpoint the precise code snippets responsible for malicious behavior.
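A sketch of the backward-tracing step, assuming the summaries live in a package/class/function tree with a per-node maliciousness flag; the tree layout and field names are illustrative, not \msp's actual data structures:

```python
def trace_malicious(node, path=()):
    """Walk flagged branches from package level down to function level."""
    if not node.get("flagged"):
        return []
    here = path + (node["name"],)
    if not node.get("children"):        # function level: report the snippet
        return [(" > ".join(here), node.get("code", ""))]
    hits = []
    for child in node["children"]:
        hits += trace_malicious(child, here)
    return hits

app = {"name": "com.example.app", "flagged": True, "children": [
    {"name": "Tracker", "flagged": True, "children": [
        {"name": "exfiltrate()", "flagged": True,
         "code": "send(contacts, c2_url)"}]},
    {"name": "MainActivity", "flagged": False, "children": []},
]}
print(trace_malicious(app))
# -> [('com.example.app > Tracker > exfiltrate()', 'send(contacts, c2_url)')]
```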
https://arxiv.org/abs/2501.04848
The field of Artificial Intelligence (AI) continues to drive transformative innovations, with significant progress in conversational interfaces, autonomous vehicles, and intelligent content creation. Since the launch of ChatGPT in late 2022, the rise of Generative AI has marked a pivotal era, with the term Large Language Models (LLMs) becoming a ubiquitous part of daily life. LLMs have demonstrated exceptional capabilities in tasks such as text summarization, code generation, and creative writing. However, these models are inherently limited by their token-level processing, which restricts their ability to perform abstract reasoning, conceptual understanding, and efficient generation of long-form content. To address these limitations, Meta has introduced Large Concept Models (LCMs), representing a significant shift from traditional token-based frameworks. LCMs use concepts as foundational units of understanding, enabling more sophisticated semantic reasoning and context-aware decision-making. Given the limited academic research on this emerging technology, our study aims to bridge the knowledge gap by collecting, analyzing, and synthesizing existing grey literature to provide a comprehensive understanding of LCMs. Specifically, we (i) identify and describe the features that distinguish LCMs from LLMs, (ii) explore potential applications of LCMs across multiple domains, and (iii) propose future research directions and practical strategies to advance LCM development and adoption.
https://arxiv.org/abs/2501.05487
Chain-of-Thought (CoT) Prompting is a dominant paradigm in Large Language Models (LLMs) to enhance complex reasoning. It guides LLMs to present multi-step reasoning, rather than generating the final answer directly. However, CoT encounters difficulties when key information required for reasoning is implicit or missing. This occurs because CoT emphasizes the sequence of reasoning steps while overlooking the early extraction of essential information. We propose a pre-prompting method called Iterative Summarization Pre-Prompting (ISP^2) to refine LLM reasoning when key information is not explicitly provided. First, entities and their corresponding descriptions are extracted to form potential key information pairs. Next, we use a reliability rating to assess these pairs, then merge the two lowest-ranked pairs into a new entity description. This process is repeated until a unique key information pair is obtained. Finally, that pair, along with the original question, is fed into LLMs to produce the answer. Extensive experiments demonstrate a 7.1% improvement compared to existing methods. Unlike traditional prompting, ISP^2 adopts an inductive approach with pre-prompting, offering flexible integration into diverse reasoning frameworks. The code is available at this https URL.
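A sketch of the ISP^2 merging loop as the abstract describes it: rate entity-description pairs, repeatedly merge the two lowest-ranked pairs, and stop when a single key-information pair remains. The rating and merging functions below are toy stand-ins; the paper uses an LLM for both:

```python
def isp2_merge(pairs, rate, merge):
    """pairs: (entity, description) tuples; rate: pair -> reliability score."""
    pairs = list(pairs)
    while len(pairs) > 1:
        pairs.sort(key=rate)                         # lowest reliability first
        lowest, second = pairs[0], pairs[1]
        pairs = [merge(lowest, second)] + pairs[2:]  # fuse into one new pair
    return pairs[0]                 # fed to the LLM with the original question

key_pair = isp2_merge(
    [("Alice", "a chemist"), ("Bob", "her colleague"), ("lab", "where they work")],
    rate=lambda p: len(p[1]),                        # toy reliability rating
    merge=lambda a, b: (f"{a[0]}+{b[0]}", f"{a[1]}; {b[1]}"),
)
print(key_pair)
```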
https://arxiv.org/abs/2501.04341
Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making and outcomes remains understudied, particularly in generative settings. In this work, we examine the fairness of LLM-based hiring systems through two real-world tasks: resume summarization and retrieval. By constructing a synthetic resume dataset and curating job postings, we investigate whether model behavior differs across demographic groups and is sensitive to demographic perturbations. Our findings reveal that race-based differences appear in approximately 10% of generated summaries, while gender-based differences occur in only 1%. In the retrieval setting, all evaluated models display non-uniform selection patterns across demographic groups and exhibit high sensitivity to both gender and race-based perturbations. Surprisingly, retrieval models demonstrate comparable sensitivity to non-demographic changes, suggesting that fairness issues may stem, in part, from general brittleness issues. Overall, our results indicate that LLM-based hiring systems, especially at the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.
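A sketch of the demographic-perturbation probe, under our assumptions: swap one demographic cue in the resume, rerun the model, and flag any change in output; summarize() is a stub and the name pair is illustrative:

```python
def summarize(resume: str) -> str:
    return f"[LLM summary of: {resume[:40]}...]"   # placeholder LLM call

def perturbation_sensitive(resume: str, cue: str, swapped_cue: str) -> bool:
    base = summarize(resume)
    perturbed = summarize(resume.replace(cue, swapped_cue))
    return base != perturbed   # any difference signals demographic sensitivity

print(perturbation_sensitive(
    "Jamal Washington, software engineer, 5 years of experience ...",
    "Jamal Washington", "Brad Miller"))
```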
https://arxiv.org/abs/2501.04316
Research on text simplification has primarily focused on lexical and sentence-level changes. Long document-level simplification (DS) is still relatively unexplored. Large Language Models (LLMs), like ChatGPT, have excelled in many natural language processing tasks. However, their performance on DS tasks is unsatisfactory, as they often treat DS as merely document summarization. For the DS task, the generated long sequences must not only maintain consistency with the original document throughout, but also perform moderate simplification operations spanning the discourse, sentence, and word levels. Human editors employ a hierarchical complexity simplification strategy to simplify documents. This study delves into simulating this strategy through multi-stage collaboration among LLMs. We propose a progressive simplification method (ProgDS) that hierarchically decomposes the task into discourse-level, topic-level, and lexical-level simplification. Experimental results demonstrate that ProgDS significantly outperforms existing smaller models or direct prompting with LLMs, advancing the state of the art in the document simplification task.
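A sketch of the progressive pipeline; the stage order (discourse, then topic, then lexical) follows the abstract, while the prompt texts and the llm() stub are illustrative stand-ins:

```python
def llm(prompt: str) -> str:
    return f"[simplified: ...{prompt[-40:]}]"      # placeholder LLM call

STAGES = [
    ("discourse", "Reorganize the discourse structure, keeping all content:\n"),
    ("topic",     "Simplify each topic segment into plain statements:\n"),
    ("lexical",   "Replace complex words with common synonyms:\n"),
]

def progds(document: str) -> str:
    text = document
    for _name, prompt in STAGES:    # each stage consumes the previous output
        text = llm(prompt + text)
    return text

print(progds("The aforementioned legislative instruments notwithstanding, ..."))
```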
https://arxiv.org/abs/2501.03857
Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pre-training, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, training data, or brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or incur high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and measure the distribution deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared to 2,700 hours with brute-force fine-tuning, with less than 6% performance degradation across related tasks.
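A sketch of one plausible feature-label closeness score, assuming (our choice of estimator, not necessarily the paper's) that frozen features whose class-conditional means separate well relative to within-class spread indicate higher transferability:

```python
import numpy as np

def closeness_score(features: np.ndarray, labels: np.ndarray) -> float:
    """Between-class scatter over within-class scatter; higher = closer fit."""
    between, within = 0.0, 0.0
    overall = features.mean(axis=0)
    for c in np.unique(labels):
        fc = features[labels == c]
        between += len(fc) * np.sum((fc.mean(axis=0) - overall) ** 2)
        within += np.sum((fc - fc.mean(axis=0)) ** 2)
    return between / (within + 1e-9)

# Rank candidate PCMs by scoring each model's frozen features on the task.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
feats_good = rng.normal(size=(200, 16)) + labels[:, None]  # label-aligned
feats_poor = rng.normal(size=(200, 16))                    # uninformative
print(closeness_score(feats_good, labels) > closeness_score(feats_poor, labels))
# -> True: the label-aligned features would rank higher for reuse
```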
https://arxiv.org/abs/2501.03783
3D medical images such as computed tomography (CT) scans are widely used in clinical practice, offering great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically discover that the visual and textual embeddings produced by existing VLA methods' alignment efforts form two well-separated clusters, presenting a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we utilize a large language model to perform semantic summarization of reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction (CMKI) module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities and also utilize two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.
https://arxiv.org/abs/2501.03565
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at this https URL
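A sketch of the three-module flow (quality assessment, diverse generation, judgment); every model call below is a stub, and the filtering threshold and scoring rule are illustrative assumptions rather than the paper's settings:

```python
def assess_quality(caption: str) -> float:
    return min(len(caption) / 50, 1.0)     # stand-in for a multimodal-LLM score

def generate_candidates(figure: str, models) -> list:
    return [m(figure) for m in models]     # fine-tuned/prompted LLMs

def judge(candidates: list) -> str:
    return max(candidates, key=assess_quality)  # stand-in for the judging LLM

# Module 1: filter low-quality training captions.
corpus = ["short", "a longer, genuinely informative figure caption"]
train = [c for c in corpus if assess_quality(c) > 0.5]

# Modules 2 and 3: generate diverse candidates, then pick and refine the best.
models = [lambda f: f"Caption A for {f}",
          lambda f: f"A detailed, contextual caption B for {f}"]
print(judge(generate_candidates("fig1.png", models)))
```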
https://arxiv.org/abs/2501.02552
This bachelor's thesis examines the capabilities of ChatGPT 4 in code generation across 19 programming languages. The study analyzed solution rates across three difficulty levels, types of errors encountered, and code quality in terms of runtime and memory efficiency through a quantitative experiment. A total of 188 programming problems were selected from the LeetCode platform, and ChatGPT 4 was given three attempts to produce a correct solution with feedback. ChatGPT 4 successfully solved 39.67% of all tasks, with success rates decreasing significantly as problem complexity increased. Notably, the model faced considerable challenges with hard problems across all languages. ChatGPT 4 demonstrated higher competence in widely used languages, likely due to a larger volume and higher quality of training data. The solution rates also revealed a preference for languages with low abstraction levels and static typing. For popular languages, the most frequent error was "Wrong Answer," whereas for less popular languages, compiler and runtime errors prevailed, suggesting frequent misunderstandings and confusion regarding the structural characteristics of these languages. The model exhibited above-average runtime efficiency in all programming languages, showing a tendency toward statically typed and low-abstraction languages. Memory efficiency results varied significantly, with above-average performance in 14 languages and below-average performance in five languages. A slight preference for low-abstraction languages and a leaning toward dynamically typed languages in terms of memory efficiency were observed. Future research should include a larger number of tasks, iterations, and less popular languages. Additionally, ChatGPT 4's abilities in code interpretation and summarization, debugging, and the development of complex, practical code could be analyzed further.
https://arxiv.org/abs/2501.02338
The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural language processing. These models now exhibit remarkable performance across various language-related tasks, such as text generation, question answering, translation, and summarization, often rivaling human-like comprehension. More intriguingly, LLMs have demonstrated emergent abilities extending beyond their core functions, showing proficiency in tasks like commonsense reasoning, code generation, and arithmetic. This survey paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities. Emphasizing models like GPT and LLaMA, we analyze the impact of exponential data and computational growth on LLM performance, while also addressing the trade-offs associated with scaling. We also examine LLM applications across sectors such as healthcare, finance, education, and law, highlighting their adaptability and potential to solve domain-specific challenges. Central to this work are the questions of how LLMs generalize across diverse tasks, how they exhibit planning and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. In particular, we provide some insights into the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within LLMs, focusing on how pre-training data influences their emergence. Additionally, we investigate LLM-modulo frameworks that integrate external systems, allowing LLMs to handle complex, dynamic tasks. By analyzing these factors, this paper aims to foster the ongoing discussion on the capabilities and limits of LLMs, promoting their responsible development and application in novel and increasingly complex environments.
https://arxiv.org/abs/2501.04040
This thesis presents Abstractive Text Summarization models for contemporary Sanskrit prose. The first chapter, titled Introduction, presents the motivation behind this work, the research questions, and the conceptual framework. Sanskrit is a low-resource inflectional language. The key research question that this thesis investigates is what the challenges are in developing an abstractive TS system for Sanskrit. To answer the key research question, sub-questions based on four different themes have been posed in this work. The second chapter, Literature Review, surveys the previous work done. The third chapter, Data Preparation, answers the remaining three questions from the third theme. It reports the data collection and preprocessing challenges for both language model and summarization model training. The fourth chapter reports the training and inference of models and the results obtained therein. This research has initiated a pipeline for Sanskrit abstractive text summarization and has reported the challenges faced at every stage of development. The research questions based on every theme have been answered in order to address the key research question.
https://arxiv.org/abs/2501.01933
Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at test time, but training with these approaches is still difficult, requiring truncation of input documents and causing a mismatch between training and test conditions. In this work, we propose CachED (Gradient Caching for Encoder-Decoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models, using the entire document without truncation. Specifically, we apply non-overlapping sliding windows to input documents, followed by fusion in the decoder. During backpropagation, the gradients are cached at the decoder and are passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. In experiments on long document summarization, we extend BART to CachED BART, processing more than 500K tokens during training and achieving superior performance without using any additional parameters.
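A minimal PyTorch sketch of the gradient-caching idea, with tiny Linear layers standing in for the real encoder and decoder stacks; the window size and shapes are arbitrary:

```python
import torch

enc = torch.nn.Linear(8, 8)        # stand-in encoder
dec = torch.nn.Linear(8, 1)        # stand-in decoder
doc = torch.randn(6, 8)            # 6 "tokens"; window size 2 -> 3 windows
windows = doc.split(2)             # non-overlapping sliding windows

# Pass 1: encode every window without storing activations.
with torch.no_grad():
    memory = torch.cat([enc(w) for w in windows])   # fused for the decoder

# Decoder forward/backward; cache the gradient w.r.t. the fused memory.
memory.requires_grad_(True)
dec(memory).sum().backward()       # toy loss; fills memory.grad
cached_grads = memory.grad.split(2)

# Pass 2: re-encode each window with grad and inject its cached gradient,
# akin to gradient checkpointing.
for w, g in zip(windows, cached_grads):
    enc(w).backward(g)             # accumulates gradients into enc's params

print(enc.weight.grad.shape, dec.weight.grad.shape)
```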
https://arxiv.org/abs/2501.01805
As the impact of global climate change intensifies, corporate carbon emissions have become a focal point of global attention. In response to issues such as the lag in climate change knowledge updates within large language models, the lack of specialization and accuracy in traditional augmented generation architectures for complex problems, and the high cost and time consumption of sustainability report analysis, this paper proposes CarbonChat: a Large Language Model-based corporate carbon emission analysis and climate knowledge Q&A system, aimed at achieving precise carbon emission analysis and policy recommendations. First, a diversified index module construction method is proposed to handle the segmentation of rule-based and long-text documents, as well as the extraction of structured data, thereby optimizing the parsing of key information. Second, an enhanced self-prompt retrieval-augmented generation architecture is designed, integrating intent recognition, structured reasoning chains, hybrid retrieval, and Text2SQL, improving the efficiency of semantic understanding and querying. Third, based on the greenhouse gas accounting framework, 14 dimensions are established for carbon emission analysis, enabling report summarization, relevance evaluation, and customized recommendations. Finally, through a multi-layer chunking mechanism, timestamps, and hallucination detection features, the accuracy and verifiability of the analysis results are ensured, reducing hallucination rates and enhancing the precision of the responses.
https://arxiv.org/abs/2501.02031
Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at this https URL.
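A sketch of the multi-input fusion step, assuming the contextual features are serialized into the source sequence with special tokens; the token format is our guess, not the exact BeliN/MultiGen scheme:

```python
def fuse_inputs(content: str, category: str, aspect: str, sentiment: str) -> str:
    """Prepend contextual features to the article body for a seq2seq model."""
    return (f"<category> {category} <aspect> {aspect} "
            f"<sentiment> {sentiment} <content> {content}")

source = fuse_inputs(
    content="... full Bengali news article text ...",
    category="religion", aspect="festival", sentiment="positive",
)
print(source)  # fed to BanglaT5 / mBART / mT5 / mT0 for headline generation
```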
https://arxiv.org/abs/2501.01069
Singlish, a Creole language rooted in English, is a key focus of linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our model's adaptability to the Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.
https://arxiv.org/abs/2501.01034
In the fast-changing realm of information, the capacity to construct coherent timelines from extensive event-related content has become increasingly significant and challenging. The complexity arises in aggregating related documents to build a meaningful event graph around a central topic. This paper proposes CHRONOS - Causal Headline Retrieval for Open-domain News Timeline SummarizatiOn via Iterative Self-Questioning, which offers a fresh perspective on the integration of Large Language Models (LLMs) to tackle the task of Timeline Summarization (TLS). By iteratively reflecting on how events are linked and posing new questions regarding a specific news topic to gather information online or from an offline knowledge base, LLMs produce and refresh chronological summaries based on documents retrieved in each round. Furthermore, we curate Open-TLS, a novel dataset of timelines on recent news topics authored by professional journalists to evaluate open-domain TLS where information overload makes it impossible to find comprehensive relevant documents from the web. Our experiments indicate that CHRONOS is not only adept at open-domain timeline summarization, but it also rivals the performance of existing state-of-the-art systems designed for closed-domain applications, where a related news corpus is provided for summarization.
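A sketch of the iterative self-questioning loop as described: each round poses new questions about the topic, retrieves documents, and refreshes the chronological summary; llm() and retrieve() are stubs, and the round count is an arbitrary assumption:

```python
def llm(prompt: str) -> str:
    return f"[LLM output for: {prompt[:50]}...]"    # placeholder LLM call

def retrieve(query: str) -> list:
    return [f"[headline about {query[:30]}]"]       # web or offline KB search

def chronos(topic: str, rounds: int = 3) -> str:
    timeline, docs = "", []
    for _ in range(rounds):
        questions = llm(f"Given the timeline so far:\n{timeline}\n"
                        f"Pose new questions about {topic}")
        docs += retrieve(questions)
        timeline = llm(f"Update the dated timeline for {topic} from:\n"
                       + "\n".join(docs))
    return timeline

print(chronos("volcano eruption"))
```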
https://arxiv.org/abs/2501.00888