This work introduces Salamandra, a suite of open-source decoder-only large language models available in three sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data comprising text in 35 European languages and code. Our carefully curated corpus is made exclusively of open-access data compiled from a wide variety of sources. Along with the base models, we also release supplementary checkpoints fine-tuned on public-domain instruction data for chat applications. Additionally, we share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results on standard downstream tasks as well as on key aspects related to bias and safety. In this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy, and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.
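As a usage illustration, the sketch below loads one of the released checkpoints with the Hugging Face transformers library; the model identifier is our assumption for illustration, not a detail given in the abstract.

```python
# Minimal usage sketch: loading a released checkpoint with Hugging Face
# transformers. The identifier "BSC-LT/salamandra-7b" is an assumption for
# illustration; substitute whatever identifier the release actually uses.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-7b"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("El mar Mediterráneo", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```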
https://arxiv.org/abs/2502.08489
We develop an algorithm to semantically parse linear ordering problems, which require a model to arrange entities using deductive reasoning. Our method takes as input a number of premises and candidate statements, parses them into a first-order logic of an ordering domain, and then utilizes constraint logic programming to infer the truth of proposed statements about the ordering. Our semantic parser transforms Heim and Kratzer's syntax-based compositional formal semantic rules into a computational algorithm. This transformation involves introducing abstract types and templates based on their rules and adds a dynamic component to interpret entities within a contextual framework. Our symbolic system, the Formal Semantic Logic Inferer (FSLI), is applied to answer BIG-bench's logical_deduction multiple-choice problems, achieving perfect accuracy, compared to 67.06% for the best-performing LLM (GPT-4) and 87.63% for the hybrid system Logic-LM. These promising results demonstrate the benefit of developing a semantic parsing algorithm driven by first-order logic constructs.
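To make the deduction step concrete, here is a minimal sketch of what inferring the truth of a candidate statement amounts to once premises are parsed into ordering constraints; this brute-force check over permutations is a stand-in for the constraint-logic-programming backend, and the entities and premises are invented for illustration.

```python
# Minimal sketch of the deduction step FSLI automates (not the authors' parser):
# premises become ordering constraints, and a candidate statement is entailed
# iff it holds in every ordering consistent with the premises.
from itertools import permutations

entities = ["robin", "crow", "finch"]

premises = [
    lambda o: o.index("robin") < o.index("crow"),   # the robin is left of the crow
    lambda o: o.index("finch") < o.index("robin"),  # the finch is left of the robin
]
candidate = lambda o: o.index("finch") == 0         # "the finch is leftmost"

models = [p for p in permutations(entities) if all(c(p) for c in premises)]
print("entailed:", bool(models) and all(candidate(m) for m in models))
```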
https://arxiv.org/abs/2502.08415
Propaganda is a form of persuasion that has been used throughout history with the goal of influencing people's opinions through rhetorical and psychological techniques for determined ends. Although Arabic ranks as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. The dataset comprises 8,000 annotated news articles, making it the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and to contribute to a deeper understanding of how various opinion dimensions interact in news media.
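For readers who want a starting point, the following is a hedged sketch of a BERT-based multi-label baseline of the kind the abstract mentions; the checkpoint name and label subset are placeholders, not the paper's actual configuration.

```python
# Hedged sketch of a multi-label baseline of the kind described; the checkpoint
# name and label subset are placeholders, not the paper's configuration, and
# the classification head is untrained until fine-tuned on MultiProSE.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["loaded_language", "name_calling", "doubt"]  # illustrative subset
model_name = "aubmindlab/bert-base-arabertv2"          # assumed Arabic BERT variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

enc = tokenizer("نص إخباري للتصنيف", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)[0]
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```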
https://arxiv.org/abs/2502.08319
Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse to generate misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
https://arxiv.org/abs/2502.08301
The integration of Large Language Models (LLMs) into optimization has created a powerful synergy, opening exciting research opportunities. This paper investigates how LLMs can enhance existing optimization algorithms. Drawing on their pre-trained knowledge, we demonstrate their ability to propose innovative heuristic variations and implementation strategies. To evaluate this, we apply them to a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt (CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that incorporates a heuristic in the solution construction phase. Our results show that an alternative heuristic proposed by GPT-4o outperforms the expert-designed heuristic of CMSA, with the performance gap widening on larger and denser graphs. Project URL: this https URL
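The core idea, a construction phase with a pluggable heuristic that an LLM proposal can replace, can be sketched as follows; the toy problem (greedy independent-set construction) and both scoring rules are illustrative assumptions, not the paper's CMSA setup.

```python
# Illustrative sketch of the idea under evaluation: CMSA-style solution
# construction calls a pluggable scoring heuristic, so an LLM-proposed rule can
# be swapped in without touching the rest of the algorithm. The toy problem
# (greedy independent-set construction) and both heuristics are placeholders.
def greedy_construct(adj, score):
    """Greedily build an independent set, picking vertices by `score`."""
    solution, forbidden = set(), set()
    while set(adj) - forbidden:
        v = max(set(adj) - forbidden, key=score)
        solution.add(v)
        forbidden |= {v} | adj[v]
    return solution

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
expert = lambda v: -len(adj[v])                         # classic min-degree rule
llm_rule = lambda v: -sum(len(adj[u]) for u in adj[v])  # hypothetical LLM proposal

for name, heuristic in [("expert", expert), ("llm_proposed", llm_rule)]:
    print(name, greedy_construct(adj, heuristic))
```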
https://arxiv.org/abs/2502.08298
With the advancement of large language models (LLMs), the focus in Conversational AI has shifted from merely generating coherent and relevant responses to tackling more complex challenges, such as personalizing dialogue systems. In an effort to enhance user engagement, chatbots are often designed to mimic human behaviour, responding within a defined emotional spectrum and aligning with a set of values. In this paper, we aim to simulate personal traits according to the Big Five model with the use of LLMs. Our research shows that generating personality-related texts remains a challenging task for these models. As a result, we present a dataset of generated texts with predefined Big Five characteristics and provide an analytical framework for testing LLMs on the simulation of personality traits.
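A minimal sketch of the kind of trait-conditioned prompt such a simulation relies on is shown below; the wording and the high/medium/low scale are our assumptions, not the paper's protocol.

```python
# A minimal sketch of a trait-conditioned generation prompt; the wording and
# the high/medium/low scale are illustrative assumptions, not the paper's.
big_five = {
    "openness": "high",
    "conscientiousness": "low",
    "extraversion": "high",
    "agreeableness": "medium",
    "neuroticism": "low",
}

profile = ", ".join(f"{trait}: {level}" for trait, level in big_five.items())
prompt = (
    "Write a short diary entry about a delayed train, as a person with the "
    f"following Big Five profile: {profile}."
)
print(prompt)
```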
https://arxiv.org/abs/2502.08265
Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially, and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
https://arxiv.org/abs/2502.08226
In this work, we propose an architecture of LLM Modules that enables the transfer of knowledge from a large pre-trained model to a smaller model using an Enhanced Cross-Attention mechanism. In the proposed scheme, the Qwen2-1.5B model is frozen and its representations are passed through specially designed attention layers to the GPT-Neo-125M model, which is trained on limited computational resources. Experimental results on the Bespoke-Stratos-17k dataset demonstrate that after 15 epochs of training, the combined model generates responses comparable in quality to those obtained by distillation. We discuss the advantages of the modular approach, provide examples of input queries and comparative analysis, and outline prospects for further extension of the method.
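A hedged sketch of the modular scheme follows: representations from a frozen donor model are injected into a small trainable model through a cross-attention layer. The dimensions match GPT-Neo-125M (768) and Qwen2-1.5B (1536), but the single-layer design is an illustrative simplification, not necessarily the paper's Enhanced Cross-Attention in detail.

```python
# Hedged sketch of the modular scheme: hidden states from a frozen donor model
# are injected into a small trainable model through a cross-attention layer.
# The single-layer design is an illustrative simplification.
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    def __init__(self, small_dim=768, large_dim=1536, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(large_dim, small_dim)  # map donor space to host space
        self.attn = nn.MultiheadAttention(small_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(small_dim)

    def forward(self, host_states, donor_states):
        # Queries come from the small trainable model; keys/values come from
        # the frozen large model's representations.
        kv = self.proj(donor_states)
        mixed, _ = self.attn(host_states, kv, kv)
        return self.norm(host_states + mixed)  # residual connection

bridge = CrossAttentionBridge()
host = torch.randn(2, 16, 768)    # e.g. GPT-Neo-125M hidden states
donor = torch.randn(2, 16, 1536)  # e.g. frozen Qwen2-1.5B hidden states
print(bridge(host, donor).shape)  # torch.Size([2, 16, 768])
```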
https://arxiv.org/abs/2502.08213
In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. The dataset not only supports several key tasks such as visual understanding and object detection, but also has unique innovative aspects: this study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset has been fully verified, and the first multi-task dialogue benchmark in the SAR field has been established. The project will be released at this https URL, aiming to promote the in-depth development and wide application of SAR vision-language models.
https://arxiv.org/abs/2502.08168
Recent advances in large language models (LLMs) have shown promising improvements, often surpassing existing methods across a wide range of downstream tasks in natural language processing. However, these models still face challenges that may hinder their practical applicability. For example, the phenomenon of hallucination is known to compromise the reliability of LLMs, especially in fields that demand high factual precision. Current benchmarks primarily focus on hallucination detection and factuality evaluation but do not extend beyond identification. This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx, aimed at improving the reliability of LLM-generated responses by both detecting hallucinations and providing detailed explanations. The proposed model offers a novel approach to integrating detection with explanations, enabling both users and the LLM itself to understand and reduce errors. Our evaluation results demonstrate that the proposed model surpasses larger LLMs, such as Llama3 70B and GPT-4, in hallucination detection accuracy while maintaining reliable explanations. Furthermore, the proposed model performs well in both zero-shot and other test environments, showcasing its adaptability across diverse benchmark datasets. By integrating interpretability with hallucination detection, the proposed approach further improves the performance and reliability of evaluating hallucinations in language models.
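Structurally, detection-plus-explanation can be sketched as a prompt that asks for both a verdict and a rationale; `call_llm` below is a hypothetical stand-in for whatever model backs the detector, and the prompt wording is ours, not HuDEx's.

```python
# Structural sketch of detection plus explanation; `call_llm` is a hypothetical
# stand-in for the detector model and the prompt wording is ours, not HuDEx's.
import json

PROMPT = """Given a source passage and a model response, decide whether the
response contains a hallucination. Reply in JSON:
{{"hallucinated": true or false, "explanation": "..."}}

Source: {source}
Response: {response}"""

def detect(source, response, call_llm):
    raw = call_llm(PROMPT.format(source=source, response=response))
    verdict = json.loads(raw)  # assumes the model complied with the JSON format
    return verdict["hallucinated"], verdict["explanation"]
```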
https://arxiv.org/abs/2502.08109
The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.
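The adaptation step can be sketched with the sentence-transformers library, training on parallel pairs with in-batch negatives; the base model choice and hyperparameters below are our assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of in-domain adaptation with sentence-transformers; the base
# model and hyperparameters are assumptions, not the paper's exact recipe.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed base model

# Parallel historical sentences: (Luxembourgish, German) pairs.
pairs = [
    InputExample(texts=["D'Stad Lëtzebuerg wiisst.", "Die Stadt Luxemburg wächst."]),
    # ... the paper uses ~20,000 such pairs per language pair
]
loader = DataLoader(pairs, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```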
https://arxiv.org/abs/2502.07938
Text-conditioned image generation has gained significant attention in recent years, with models processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million generated and collected long-text images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a 3,000-sample human-improved test set spanning three data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g., GPT-4o with DALL-E 3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
https://arxiv.org/abs/2502.07870
We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
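Two of the evaluation ingredients, name substitution and majority response selection, are easy to sketch; the names and the `ask_model` callable below are illustrative placeholders.

```python
# Sketch of two evaluation ingredients: character-name substitution and
# majority response selection over repeated trials. The names and the
# `ask_model` callable are illustrative placeholders.
from collections import Counter

def substitute_names(story, mapping):
    for original, replacement in mapping.items():
        story = story.replace(original, replacement)
    return story

def majority_answer(question, ask_model, trials=5):
    votes = Counter(ask_model(question) for _ in range(trials))
    return votes.most_common(1)[0][0]

augmented = substitute_names(
    "Edmund was found near the vault...",
    {"Edmund": "Sherlock Holmes"},  # substitution with a widely recognized entity
)
```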
https://arxiv.org/abs/2502.07747
Large language models (LLMs) have demonstrated remarkable code generation capabilities, but the correctness of the generated code cannot be inherently trusted. This paper explores the feasibility of using formal software verification, specifically the SPARK framework for Ada, to ensure the reliability of LLM-generated code. We present Marmaragan, a tool that leverages an LLM to generate SPARK annotations for existing programs, enabling formal verification of the code. The tool is benchmarked on a curated set of SPARK programs, with annotations selectively removed to test specific capabilities. The performance of Marmaragan with GPT-4o on the benchmark is promising, generating correct annotations for 50.7% of the benchmark cases. The results establish a foundation for future work on combining the power of LLMs with the reliability of formal software verification.
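A hedged sketch of the generate-then-verify loop is given below; `generate_annotations` stands in for the LLM call, and the gnatprove invocation reflects standard SPARK usage rather than the paper's exact command or file layout.

```python
# Hedged sketch of a generate-then-verify loop; `generate_annotations` is a
# placeholder for the LLM call, and the file names and gnatprove invocation
# are assumed from standard SPARK usage, not taken from the paper.
import subprocess

def verify(project_file="prog.gpr"):
    result = subprocess.run(
        ["gnatprove", "-P", project_file],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout

def annotate_and_check(source, generate_annotations, max_attempts=3):
    for _ in range(max_attempts):
        annotated = generate_annotations(source)  # LLM proposes SPARK contracts
        with open("prog.adb", "w") as f:
            f.write(annotated)
        ok, log = verify()
        if ok:
            return annotated
        # Feed prover output back for the next attempt, as Ada comments.
        source = annotated + "\n-- prover feedback:\n-- " + log[:500]
    return None
```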
https://arxiv.org/abs/2502.07728
Achieving a delicate balance between fostering trust in law enforcement and protecting the rights of both officers and civilians remains a pressing research and product challenge. In the pursuit of fairness and transparency, this study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft, producing structured narratives that are not only high in quality but also reinforce accountability and procedural clarity. This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices. A demonstration video of our system can be accessed at this https URL Y-kpCHNO/view?usp=sharing
https://arxiv.org/abs/2502.07677
To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each with a set of rules to guide the behaviors of smart contracts. Violating the ERC rules could cause serious security issues and financial loss, signifying the importance of verifying that smart contracts follow ERCs. Today's practices for such verification are to manually audit each contract, use expert-developed program-analysis tools, or use large language models (LLMs), all of which are far from effective in identifying ERC rule violations. This paper introduces SymGPT, a tool that combines the natural language understanding of LLMs with the formal guarantees of symbolic execution to automatically verify smart contracts' compliance with ERC rules. To develop SymGPT, we conduct an empirical study of 132 ERC rules from three widely used ERC standards, examining their content, security implications, and natural language descriptions. Based on this study, we design SymGPT by first instructing an LLM to translate ERC rules into a defined EBNF grammar. We then synthesize constraints from the formalized rules to represent scenarios where violations may occur and use symbolic execution to detect them. Our evaluation shows that SymGPT identifies 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for stealing financial assets, demonstrating its effectiveness. Furthermore, SymGPT outperforms six automated techniques and a security-expert auditing service, underscoring its superiority over current smart contract analysis methods.
https://arxiv.org/abs/2502.07644
We present FoQA, a Faroese extractive question-answering (QA) dataset with 2,000 samples, created using a semi-automated approach combining Large Language Models (LLMs) and human validation. The dataset was generated from Faroese Wikipedia articles using GPT-4-turbo for initial QA generation, followed by question rephrasing to increase complexity and native speaker validation to ensure quality. We provide baseline performance metrics for FoQA across multiple models, including LLMs and BERT, demonstrating its effectiveness in evaluating Faroese QA performance. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.
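The shape of the semi-automated pipeline (generate, rephrase to increase complexity, validate) can be sketched as follows; `llm` and `human_validate` are hypothetical callables, and the prompts are illustrative, not the paper's.

```python
# Sketch of the pipeline's shape (generate, rephrase, validate); `llm` and
# `human_validate` are hypothetical callables, and the prompts are ours.
def build_sample(article, llm, human_validate):
    qa = llm(f"Write one extractive question and its answer about:\n{article}")
    qa = llm(f"Rephrase the question to be more complex; keep the answer:\n{qa}")
    # Rejected samples are retained separately for error analysis.
    return qa if human_validate(article, qa) else None
```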
https://arxiv.org/abs/2502.07642
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
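One common way to combine local and global context in a sparse encoder is a banded attention mask with a few global positions; the sketch below illustrates that general flavor only, not Tracformer's actual sparsity pattern, which the abstract does not specify.

```python
# Illustrative sketch of a sparse attention mask mixing local windows with a
# few global positions; this shows the general local-plus-global flavor, not
# Tracformer's actual sparsity pattern.
import torch

def local_global_mask(seq_len, window=4, global_idx=(0,)):
    i = torch.arange(seq_len)
    mask = (i[None, :] - i[:, None]).abs() <= window  # band of width 2*window+1
    for g in global_idx:                              # global tokens see and
        mask[g, :] = True                             # are seen by everyone
        mask[:, g] = True
    return mask  # True = attention allowed

print(local_global_mask(8, window=1))
```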
https://arxiv.org/abs/2502.07616
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: this https URL
https://arxiv.org/abs/2502.07601
Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of capabilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers both surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically reveals thousands of capabilities that would be challenging for any single team to uncover. We further validate our method's automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models' ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at this https URL.
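The scientist-subject loop can be sketched structurally as follows; all three callables are hypothetical stand-ins for foundation-model calls, and the real framework adds novelty filtering and far more careful scoring.

```python
# Structural sketch of the scientist/subject loop; `scientist`, `subject`, and
# `judge` are hypothetical stand-ins for foundation-model calls. The real ACD
# framework adds novelty filtering and much more careful automated scoring.
def discover(scientist, subject, judge, rounds=10):
    discovered = []
    for _ in range(rounds):
        task = scientist("Propose a new open-ended task probing an unexplored ability.")
        answer = subject(task)
        verdict = judge(f"Task: {task}\nAnswer: {answer}\nDid the subject succeed?")
        discovered.append({"task": task, "answer": answer, "verdict": verdict})
    return discovered
```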
https://arxiv.org/abs/2502.07577