The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failure modes: under-specified user queries and unavailable tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. Further, to study possible mitigation of these failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH helps models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.
https://arxiv.org/abs/2503.14227
Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.
https://arxiv.org/abs/2503.13730
Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings, and analyze the judges' position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.
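Position bias in LLM-as-a-judge setups is commonly probed by swapping the order of each response pair and counting how often the verdict flips. A minimal sketch of that check, with toy judges standing in for real LLM judges (the `judge` callables are illustrative, not REPA's evaluation code):

```python
def position_bias_rate(judge, pairs):
    """Fraction of pairs whose verdict flips when the presentation order is swapped."""
    flips = 0
    for a, b in pairs:
        v1 = judge(a, b)  # the judge returns 'A' (first slot wins) or 'B'
        v2 = judge(b, a)
        # a position-consistent judge picks the same underlying response both times
        if (v1 == "A") != (v2 == "B"):
            flips += 1
    return flips / len(pairs)

# toy judge with no position bias: always prefers the longer response
length_judge = lambda a, b: "A" if len(a) >= len(b) else "B"
# toy judge with maximal position bias: always prefers the first slot
first_slot_judge = lambda a, b: "A"
```

Averaging this rate over many pairs gives a single number per judge, which is one simple way to report the position bias the paper describes.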
https://arxiv.org/abs/2503.13102
Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, and a dual-branch architecture independently extracts visual and textual features. These features are fused via the Progressive Robust Multi-Modal Fusion (PRMF) block for stable training. Experimental results show that the MMLNB model is more accurate than single-modality models. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at this https URL.
https://arxiv.org/abs/2503.12927
Large Language Models (LLMs) have revolutionized natural language processing through their state-of-the-art reasoning capabilities. This paper explores the convergence of LLM reasoning techniques and feature generation for machine learning tasks. We examine four key reasoning approaches: Chain of Thought, Tree of Thoughts, Retrieval-Augmented Generation, and Thought Space Exploration. Our analysis reveals how these approaches can be used to identify effective feature generation rules without having to manually specify search spaces. The paper categorizes LLM-based feature generation methods across various domains including finance, healthcare, and text analytics. In healthcare, LLMs can extract key information from clinical notes and radiology reports, enabling more efficient data utilization. In finance, LLMs facilitate text generation, summarization, and entity extraction from complex documents. We analyze evaluation methodologies for assessing feature quality and downstream performance, with particular attention to OCTree's decision tree reasoning approach that provides language-based feedback for iterative improvements. Current challenges include hallucination, computational efficiency, and domain adaptation. As of March 2025, emerging approaches include inference-time compute scaling, reinforcement learning, and supervised fine-tuning with model distillation. Future directions point toward multimodal feature generation, self-improving systems, and neuro-symbolic approaches. This paper provides a detailed overview of an emerging field that promises to automate and enhance feature engineering through language model reasoning.
https://arxiv.org/abs/2503.11989
In controlled text generation using large language models (LLMs), gaps arise between the language model's interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.
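The Numeric-to-Lexical conversion the paper describes (e.g., +4.0 becomes "High") can be sketched as a simple binning function. The bin boundaries and the [-5, 5] range below are assumptions for illustration, not the paper's exact mapping:

```python
def vad_to_lexical(value, lo=-5.0, hi=5.0):
    """Bin a numeric VAD score into a coarse lexical level (illustrative bins)."""
    span = hi - lo
    if value <= lo + span / 3:
        return "Low"
    if value <= lo + 2 * span / 3:
        return "Medium"
    return "High"

# e.g. condition a generation prompt on lexical VAD instead of raw numbers
prompt = (f"Write a sentence with Valence: {vad_to_lexical(4.0)}, "
          f"Arousal: {vad_to_lexical(-4.0)}")
```

The point of such a mapping is that the LLM sees words ("High", "Low") rather than scale values, which is the representation the human evaluation favored.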
https://arxiv.org/abs/2503.11881
We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves up to an order-of-magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.
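The caching idea behind GPTHF's generation method can be illustrated with a minimal sentence-embedding cache; the toy encoder below is a stand-in for the actual hierarchical model, and the class is an illustrative sketch rather than the paper's implementation:

```python
class SentenceEmbeddingCache:
    """Cache sentence embeddings so repeated sentences skip the encoder."""

    def __init__(self, encoder):
        self.encoder = encoder
        self._store = {}
        self.hits = 0    # sentences served from the cache
        self.misses = 0  # sentences that had to be encoded

    def embed(self, sentence):
        if sentence in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[sentence] = self.encoder(sentence)
        return self._store[sentence]

# toy "encoder": mean character code, standing in for a real sentence encoder
toy_encoder = lambda s: sum(map(ord, s)) / len(s)
cache = SentenceEmbeddingCache(toy_encoder)
cache.embed("The cat sat.")
cache.embed("A new sentence.")
cache.embed("The cat sat.")  # served from the cache, encoder not called
```

Every cache hit is a sentence whose tokens never re-enter the lower layers of the network, which is where the claimed FLOPs savings come from.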
https://arxiv.org/abs/2503.11426
Automated surgical workflow analysis is crucial for education, research, and clinical decision-making, but the lack of annotated datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of annotated training data inspired by the human learning procedure of watching experts and understanding their explanations. Our method leverages a video-language model trained on alignment, denoising, and generative tasks to learn short-term spatio-temporal and multimodal representations. A task-specific temporal model is then used to capture relationships across entire videos. To achieve comprehensive video-language understanding in the surgical domain, we introduce a data collection and filtering strategy to construct a large-scale pretraining dataset from educational YouTube videos. We then utilize parameter-efficient fine-tuning by projecting downstream task annotations from publicly available surgical datasets into the language domain. Extensive experiments in two surgical domains demonstrate the effectiveness of our approach, with performance improvements of up to 7% in phase segmentation tasks, 8% in zero-shot phase segmentation, and comparable capabilities to fully-supervised models in few-shot settings. Harnessing our model's capabilities for long-range temporal localization and text generation, we present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing DVC datasets in the surgical domain. In summary, our approach to surgical workflow understanding combines video-language pretraining, large-scale video pretraining, and optimized fine-tuning; it improves performance over state-of-the-art techniques and enables new downstream tasks for surgical video understanding.
https://arxiv.org/abs/2503.11392
Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for the single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist's palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single-control and multiple-control settings, and achieve results surpassing existing methods.
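For context, the linear combination of single-attribute models that the abstract argues against, and a log-linear (product-of-experts) alternative, can both be sketched on toy next-token distributions. Neither is the paper's exact combination rule, which is derived from the Law of Total Probability and conditional mutual information minimization; these are baselines for comparison:

```python
import math

def normalize(p):
    """Rescale a non-negative vector so it sums to one."""
    z = sum(p)
    return [x / z for x in p]

def linear_mix(dists, weights):
    """Naive linear interpolation of single-attribute next-token distributions."""
    n = len(dists[0])
    return normalize([sum(w * d[i] for w, d in zip(weights, dists))
                      for i in range(n)])

def log_linear_mix(dists, weights):
    """Weighted geometric mean (product of experts), a common alternative blend."""
    n = len(dists[0])
    mixed = [math.exp(sum(w * math.log(d[i] + 1e-12)
                          for w, d in zip(weights, dists)))
             for i in range(n)]
    return normalize(mixed)
```

The log-linear blend suppresses tokens that any single attribute model assigns near-zero probability, which is one way attribute conflicts show up differently under the two rules.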
https://arxiv.org/abs/2503.11182
Large language models (LLMs) have evolved beyond simple text generation to power software agents that directly translate natural language commands into tangible actions. While API-based LLM agents initially rose to prominence for their robust automation capabilities and seamless integration with programmatic endpoints, recent progress in multimodal LLM research has enabled GUI-based LLM agents that interact with graphical user interfaces in a human-like manner. Although these two paradigms share the goal of enabling LLM-driven task automation, they diverge significantly in architectural complexity, development workflows, and user interaction models. This paper presents the first comprehensive comparative study of API-based and GUI-based LLM agents, systematically analyzing their divergence and potential convergence. We examine key dimensions and highlight scenarios in which hybrid approaches can harness their complementary strengths. By proposing clear decision criteria and illustrating practical use cases, we aim to guide practitioners and researchers in selecting, combining, or transitioning between these paradigms. Ultimately, we indicate that continuing innovations in LLM-based automation are poised to blur the lines between API- and GUI-driven agents, paving the way for more flexible, adaptive solutions in a wide range of real-world applications.
https://arxiv.org/abs/2503.11069
Generative pretrained transformers (GPT) are the most common large language models (LLMs) used for generating text from natural language inputs. However, the fixed properties of language parameters in individual LLMs can lead to inconsistencies in the generated outputs. This limitation also restricts the models' ability to represent diverse language patterns due to inherent biases. Moreover, many powerful LLMs are closed-source. This prevents organizations from integrating their data into these systems, raising concerns about data privacy and limiting industry applications. Inspired by the successful application of LLM ensemble models in text generation, recent literature has also investigated their potential in code generation. This article reviews these emerging LLM ensemble approaches. Our goal is to enhance readers' understanding of existing techniques and encourage further research and practical implementation, aiming to expand the real-world applications of LLM ensemble models in both text and code generation. We categorize these approaches into seven main methods: weight merging, knowledge fusion, mixture of experts, reward ensemble, output ensemble, routing, and cascading. From this list, we focus on four methods and models that show strong performance and potential for broader applications. We analyze their modeling steps, training methods, and output features to provide a clear understanding of their capabilities. Our findings highlight the benefits of LLM ensemble techniques. These include better representation of diversity, improved output quality, and greater flexibility in applications. This information offers valuable insights for selecting models for various real-world tasks involving text and code generation, and for potentially extending these methods to multimodal LLMs.
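Two of the seven categories, output ensembling and cascading, are simple enough to sketch directly. The toy callables below stand in for real LLM calls; this is an illustrative sketch of the control flow, not any surveyed system's code:

```python
from collections import Counter

def output_ensemble(candidates):
    """Output ensemble: majority vote over candidate generations."""
    return Counter(candidates).most_common(1)[0][0]

def cascade(models, query, accept):
    """Cascading: try models cheapest-first, stop at the first accepted answer."""
    answer = None
    for model in models:
        answer = model(query)
        if accept(answer):
            break
    return answer  # falls back to the last model's output if none is accepted
```

Routing differs from cascading in that a separate policy picks one model up front instead of trying them in sequence; the other categories (weight merging, knowledge fusion, mixture of experts, reward ensemble) operate inside or across the models rather than on their outputs.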
https://arxiv.org/abs/2503.13505
Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm: directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds, all while delivering performance comparable to state-of-the-art models. Code will be available at this https URL.
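The flow-matching formulation, evolving directly between the two modalities' latents, can be sketched with a straight-line probability path: training pairs supervise a velocity field, and generation integrates that field from the text latent toward an image latent. This is a generic flow-matching sketch under a linear-path assumption, not FlowTok's training code:

```python
def flow_pair(x_text, x_img, t):
    """Point on the straight path between the two latents, and the target
    velocity a flow-matching model would be trained to predict there."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x_text, x_img)]
    v_target = [b - a for a, b in zip(x_text, x_img)]
    return x_t, v_target

def euler_generate(x_text, velocity_fn, steps=10):
    """Integrate a (learned) velocity field from the text latent at t=0
    to an image latent at t=1 with fixed-step Euler updates."""
    x = list(x_text)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Under the linear path the target velocity is constant, so with an oracle velocity field Euler integration lands exactly on the image latent; a trained network approximates this field from data.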
https://arxiv.org/abs/2503.10772
Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite significant efforts to build real-world simulators, the application of generative models to virtual worlds, like financial markets, remains under-explored. In financial markets, generative models can simulate complex market effects of participants with various behaviors, enabling interaction under different market conditions, and training strategies without financial risk. This simulation relies on the finest-grained structured data in financial markets, such as orders, enabling the most realistic simulation. We propose Large Market Model (LMM), an order-level generative foundation model, for financial market simulation, akin to language modeling in the digital world. Our financial Market Simulation engine (MarS), powered by LMM, addresses the domain-specific need for realistic, interactive and controllable order generation. Key observations include LMM's strong scalability across data size and model complexity, and MarS's robust and practicable realism in controlled generation with market impact. We showcase MarS as a forecast tool, detection system, analysis platform, and agent training environment, thus demonstrating MarS's "paradigm shift" potential for a variety of financial applications. We release the code of MarS at this https URL.
https://arxiv.org/abs/2409.07486
Constraints are critical in text generation, as LLM outputs are often unreliable when it comes to ensuring generated outputs adhere to user-defined instructions or general safety guidelines. To address this gap, we present Constrained Discrete Diffusion (CDD), a novel method for enforcing constraints on natural language by integrating discrete diffusion models with differentiable optimization. Unlike conventional text generators, which often rely on post-hoc filtering or model retraining for controllable generation, we propose imposing constraints directly into the discrete diffusion sampling process. We illustrate how this technique can be applied to satisfy a variety of natural language constraints, including (i) toxicity mitigation by preventing harmful content from emerging, (ii) character- and sequence-level lexical constraints, and (iii) novel molecule sequence generation with specific property adherence. Experimental results show that our constraint-aware procedure achieves high fidelity in meeting these requirements while preserving fluency and semantic coherence, outperforming auto-regressive and existing discrete diffusion approaches.
https://arxiv.org/abs/2503.09790
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval augmentation and tool-use training approaches, in which a search engine is treated as a tool, either lack the flexibility for complex multi-turn retrieval or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities to use search engines during inference is also suboptimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns -- solely through reinforcement learning (RL) -- to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over SOTA baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at this https URL.
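The retrieved-token masking used for stable RL training can be illustrated with a toy loss computation: tokens copied verbatim from the search engine are excluded, so gradients only flow through text the model itself generated. A minimal sketch, not the released implementation:

```python
def masked_token_loss(token_losses, is_retrieved):
    """Average per-token loss over model-generated tokens only.

    Retrieved tokens (inserted verbatim from the search engine) are masked
    out so the policy update does not treat them as the model's own actions.
    """
    kept = [loss for loss, retrieved in zip(token_losses, is_retrieved)
            if not retrieved]
    return sum(kept) / len(kept)
```

In an actual RL rollout the `is_retrieved` flags would mark the spans between retrieval delimiters in the generated sequence; here they are supplied directly for illustration.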
https://arxiv.org/abs/2503.09516
The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models (typically under 10 billion parameters) enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators, including memory usage, inference speed, and energy consumption, across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.
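A rough latency and peak-memory probe of the kind such an evaluation relies on can be sketched with the Python standard library. Real edge measurements would use device-level tooling, and energy requires external meters, so this harness is only illustrative:

```python
import time
import tracemalloc

def profile_inference(fn, *args, repeats=3):
    """Rough wall-clock latency and Python-heap peak for repeated calls to fn."""
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed = (time.perf_counter() - start) / repeats
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"latency_s": elapsed, "peak_bytes": peak, "result": result}
```

Note that `tracemalloc` only tracks allocations made through Python's allocator; memory held by native inference runtimes would need OS-level counters instead.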
https://arxiv.org/abs/2503.09114
Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities. However, they also present challenges, particularly in generating vaccine-related misinformation, which poses risks to public health. Despite research on human-authored misinformation, a notable gap remains in understanding how LLMs contribute to vaccine misinformation and how best to detect it. Existing benchmarks often overlook vaccine-specific misinformation and the diverse roles of misinformation spreaders. This paper introduces VaxGuard, a novel dataset designed to address these challenges. VaxGuard includes vaccine-related misinformation generated by multiple LLMs and provides a comprehensive framework for detecting misinformation across various roles. Our findings show that GPT-3.5 and GPT-4o consistently outperform other LLMs in detecting misinformation, especially when dealing with subtle or emotionally charged narratives. On the other hand, PHI3 and Mistral show lower performance, struggling with precision and recall in fear-driven contexts. Additionally, detection performance tends to decline as input text length increases, indicating the need for improved methods to handle larger content. These results highlight the importance of role-specific detection strategies and suggest that VaxGuard can serve as a key resource for improving the detection of LLM-generated vaccine misinformation.
https://arxiv.org/abs/2503.09103
Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at this https URL
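The decoupled-vocabulary idea can be illustrated by masking next-token logits to the active modality's id range. The id layout below (text ids first, image ids after) is an assumption for illustration, not OmniMamba's actual token assignment:

```python
def mask_logits_by_modality(logits, modality, text_vocab_size):
    """Restrict next-token logits to the active modality's vocabulary slice.

    Assumed layout: ids [0, text_vocab_size) are text tokens, the rest are
    image tokens. Blocked entries are set to -inf so softmax/argmax can
    never select a token from the wrong modality.
    """
    masked = list(logits)
    if modality == "text":
        blocked = range(text_vocab_size, len(logits))
    else:
        blocked = range(0, text_vocab_size)
    for i in blocked:
        masked[i] = float("-inf")
    return masked
```

With the vocabularies decoupled this way, a single next-token prediction head can serve both modalities while the mask guarantees modality-consistent output at each step.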
Recently, progress on unified multimodal understanding and visual generation (i.e., multimodal generation) models has been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first multimodal generation model built on a linear architecture, which generates both text and images through a unified next-token prediction paradigm. The model fully exploits Mamba-2's computational and memory efficiency, extending its capabilities from pure text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA (low-rank adaptation) for parameter-efficient adaptation. In addition, we introduce a decoupled two-stage training strategy to mitigate the data imbalance between the two tasks. Equipped with these techniques, and trained on only 2M image-text pairs, OmniMamba achieves performance competitive with JanusFlow and surpasses Show-o across benchmarks. Notably, despite using 1,000 times less training data than Show-o, OmniMamba stands out for its inference efficiency, achieving up to a 119.2x speedup and a 63% reduction in GPU memory for long-sequence generation compared with Transformer-based counterparts. Code and models are released at this https URL.
https://arxiv.org/abs/2503.08686
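To make the "decoupled vocabularies" idea above concrete, here is a minimal, hypothetical sketch (not OmniMamba's actual implementation): text and image tokens occupy disjoint id ranges within one unified vocabulary, and a modality flag masks the logits at each step so that next-token prediction can only emit tokens of the active modality. All names and sizes below are illustrative.

```python
# Toy "decoupled vocabularies" for unified next-token prediction:
# text tokens and image (e.g., VQ) tokens live in disjoint id ranges.
TEXT_VOCAB = range(0, 8)      # ids 0..7: toy text vocabulary
IMAGE_VOCAB = range(8, 12)    # ids 8..11: toy image vocabulary

def mask_logits(logits, modality):
    """Suppress logits outside the active modality's vocabulary."""
    allowed = TEXT_VOCAB if modality == "text" else IMAGE_VOCAB
    return [x if i in allowed else float("-inf")
            for i, x in enumerate(logits)]

def greedy_next_token(logits, modality):
    """Greedy decoding restricted to the active modality's tokens."""
    masked = mask_logits(logits, modality)
    return max(range(len(masked)), key=lambda i: masked[i])
```

Under this routing, the same decoder head never confuses modalities: even if an image-token logit is the global maximum, a text-mode step can only select a text id.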
Diffusion models have seen immense success in modelling continuous data across a range of domains such as vision and audio. Despite the challenges of adapting diffusion models to discrete data, recent work explores their application to text generation by working in the continuous embedding space. However, these models lack a natural means to control the inherent trade-off between quality and diversity as afforded by the temperature hyperparameter in autoregressive models, hindering understanding of model performance and restricting generation quality. This work proposes the use of classifier-free guidance and stochastic clamping for manipulating the quality-diversity trade-off on sequence-to-sequence tasks, demonstrating that these techniques may be used to improve the performance of a diffusion language model.
Diffusion models have achieved great success at modelling continuous data across domains such as vision and audio. Although adapting diffusion models to discrete data is challenging, recent work explores their application to text generation by operating in a continuous embedding space. These models, however, lack a natural way to control the inherent trade-off between quality and diversity that the temperature hyperparameter provides in autoregressive models, which limits understanding of model performance and restricts generation quality. This work proposes classifier-free guidance and stochastic clamping for manipulating the quality-diversity trade-off on sequence-to-sequence tasks, demonstrating that these techniques can improve the performance of a diffusion language model.
https://arxiv.org/abs/2503.10683
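As a rough illustration of the two techniques named above, the sketch below shows (a) classifier-free guidance as linear extrapolation from the unconditional denoiser output toward the conditional one, and (b) stochastic clamping as probabilistically snapping a denoised embedding to its nearest vocabulary embedding. This is a simplified sketch under assumed definitions; the paper's exact formulation may differ.

```python
import random

def cfg_combine(uncond, cond, w):
    """Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional one by guidance weight w. w = 1 recovers
    the conditional model; larger w trades diversity for quality."""
    return [u + w * (c - u) for u, c in zip(uncond, cond)]

def stochastic_clamp(embedding, vocab_embeddings, p, rng=random):
    """With probability p, snap the denoised embedding to its nearest
    vocabulary embedding (Euclidean distance); otherwise leave it."""
    if rng.random() >= p:
        return embedding
    def dist2(v):
        return sum((a - b) ** 2 for a, b in zip(embedding, v))
    return min(vocab_embeddings, key=dist2)
```

Here the clamping probability p plays a role loosely analogous to temperature: p = 1 always rounds intermediate states onto the token manifold, while p = 0 leaves the continuous trajectory untouched.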
Different from other text generation tasks, in product description generation, it is of vital importance to generate faithful descriptions that stick to the product attribute information. However, little attention has been paid to this problem. To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since the product attribute information is always conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, taking both the embedding and the entity label of each word as input. Second, we establish a keyword memory that stores the entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels. Experiments conducted on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations. Specifically, FPDG increases the fidelity of the generated descriptions by 25%.
Unlike other text generation tasks, in product description generation it is vitally important to produce descriptions that remain faithful to the product's attribute information. However, little attention has been paid to this problem. To close this gap, we propose the Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since product attribute information is usually conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on an Entity-label-guided Long Short-Term Memory (ELSTM) cell, which takes both the embedding and the entity label of each word as input. Second, we build a keyword memory that stores entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels. Experiments on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance on both traditional generation metrics and human evaluations; in particular, FPDG improves the fidelity of the generated descriptions by 25% compared with baselines.
https://arxiv.org/abs/2503.08454
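The keyword-memory mechanism described above can be sketched as ordinary dot-product attention whose keys are entity-label vectors and whose values are keyword vectors, so the decoder reaches a keyword through its label. A minimal, hypothetical Python sketch follows (vector contents and dimensions are illustrative, not FPDG's actual code):

```python
import math

def attend_keyword_memory(query, memory):
    """memory: list of (label_key_vector, keyword_value_vector) pairs.
    Softmax attention over entity-label keys returns a weighted sum of
    keyword values, letting the decoder retrieve keywords via labels."""
    # Dot-product scores between the decoder query and each label key.
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in memory]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of keyword value vectors.
    dim = len(memory[0][1])
    return [sum(w * val[d] for w, (_, val) in zip(weights, memory))
            for d in range(dim)]
```

When the query aligns strongly with one entity-label key, the output is dominated by that label's keyword vector, which is the behavior the paper relies on to keep generated descriptions anchored to attribute keywords.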