Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations on nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl's ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.
https://arxiv.org/abs/2507.07419
Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types, and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes, creating relevance at both the document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing the deterministic behavior essential for reproducibility, reliability, and clinical trust.
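To make the dual-stage retrieval idea concrete, the sketch below ranks note types against an evidence-based query and then pulls the highest-value sections from the selected notes. The `Note` layout, the term-overlap scorer, and the cutoffs are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a dual-stage retrieval pass over clinical notes.
# The data layout and scoring function are illustrative, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Note:
    note_type: str   # e.g. "nursing", "radiology", "discharge summary"
    sections: dict   # section header -> section text

def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query terms present in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

def global_stage(notes: list[Note], evidence_query: str, top_types: int = 3) -> list[str]:
    """Rank note types by their best-matching content against an evidence-based query."""
    best = {}
    for note in notes:
        score = overlap_score(evidence_query, " ".join(note.sections.values()))
        best[note.note_type] = max(best.get(note.note_type, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:top_types]

def local_stage(notes: list[Note], note_types: list[str], evidence_query: str, top_sections: int = 5):
    """Within the selected note types, pull the highest-value sections."""
    candidates = []
    for note in notes:
        if note.note_type not in note_types:
            continue
        for header, text in note.sections.items():
            candidates.append((overlap_score(evidence_query, text), note.note_type, header, text))
    candidates.sort(reverse=True)
    return candidates[:top_sections]

# Example: gather context for a progress-note field such as "respiratory status".
notes = [Note("nursing", {"assessment": "patient on 2L nasal cannula, mild dyspnea"}),
         Note("radiology", {"impression": "bilateral lower lobe infiltrates"})]
selected_types = global_stage(notes, "respiratory status oxygen dyspnea")
for score, ntype, header, text in local_stage(notes, selected_types, "respiratory status oxygen dyspnea"):
    print(f"[{ntype}/{header}] {text}  (score={score:.2f})")
```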
https://arxiv.org/abs/2507.06715
Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to a single task (e.g., V2T) and fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
https://arxiv.org/abs/2507.06523
We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: this https URL.
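As a rough illustration of the AFG/AFV loop, the sketch below splits a generation into candidate facts, checks each against a knowledge source, and reports the supported fraction. Both model calls are replaced with trivial stand-ins; OpenFActScore would plug Hugging Face models into these two functions.

```python
# Sketch of a FActScore-style scoring loop: generate atomic facts, verify each
# against a knowledge source, and report the supported fraction.
def atomic_fact_generation(text: str) -> list[str]:
    # Placeholder: in OpenFActScore an LLM splits `text` into self-contained factual claims.
    return [s.strip() for s in text.split(".") if s.strip()]

def atomic_fact_validation(fact: str, knowledge_source: list[str]) -> bool:
    # Placeholder: an LLM judges whether `fact` is supported by retrieved passages.
    return any(fact.lower() in doc.lower() for doc in knowledge_source)

def factscore(generation: str, knowledge_source: list[str]) -> float:
    facts = atomic_fact_generation(generation)
    if not facts:
        return 0.0
    supported = sum(atomic_fact_validation(f, knowledge_source) for f in facts)
    return supported / len(facts)
```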
https://arxiv.org/abs/2507.05965
Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.
https://arxiv.org/abs/2507.06275
A key challenge for iterative text generation is enabling models to efficiently identify and correct their own errors. We propose Review, Remask, Refine (R3), a relatively simple yet elegant framework that requires no additional model training and can be applied to any pre-trained masked text diffusion model (e.g., LLaDA or BD3-LM). In R3, a Process Reward Model (PRM) is utilized for the Review of intermediate generated blocks. The framework then translates these PRM scores into a Remask strategy: the lower a block's PRM score, indicating potential mistakes, the greater the proportion of tokens within that block are remasked. Finally, the model is compelled to Refine these targeted segments, focusing its efforts more intensively on specific sub-optimal parts of past generations, leading to improved final output.
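A minimal sketch of the Review-to-Remask step is given below: a block's PRM score is mapped to a remask ratio (lower score, larger ratio), and the least-confident tokens in that block are reset to the mask token before the next refinement pass. The mask id and the linear score-to-ratio mapping are assumptions for illustration.

```python
# Sketch of the Review -> Remask step: lower PRM scores on a block lead to a larger
# fraction of its tokens being remasked before the next refinement pass.
import torch

MASK_ID = 126336  # hypothetical [MASK] token id of the diffusion LM

def remask_ratio(prm_score: float, min_ratio: float = 0.1, max_ratio: float = 0.9) -> float:
    """Linearly map a PRM score in [0, 1] to a remask ratio (low score -> high ratio)."""
    return max_ratio - (max_ratio - min_ratio) * max(0.0, min(1.0, prm_score))

def remask_block(block_tokens: torch.Tensor, token_confidence: torch.Tensor, prm_score: float) -> torch.Tensor:
    """Remask the least-confident tokens of one generated block (1-D tensors)."""
    n_remask = int(remask_ratio(prm_score) * block_tokens.numel())
    if n_remask == 0:
        return block_tokens
    idx = torch.topk(token_confidence, n_remask, largest=False).indices  # lowest-confidence positions
    remasked = block_tokens.clone()
    remasked[idx] = MASK_ID
    return remasked  # the diffusion model then refines exactly these masked positions
```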
https://arxiv.org/abs/2507.08018
Large Language Models (LLMs) such as ChatGPT have demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and the representations of language they generate, and by drawing on the classical semantic theories of Frege and Russell, we arrive at a more nuanced picture of the potential semantic capabilities of LLMs.
https://arxiv.org/abs/2507.05448
Personalized text generation has become crucial for adapting language models to users' diverse and evolving personal contexts across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
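The steering signal itself is simple to express. The sketch below shifts the cloud model's logits by the difference between the local model's context-aware and context-agnostic logits before choosing a token, so only the final token needs to leave the device; the scaling factor and greedy selection are illustrative simplifications rather than the paper's exact procedure.

```python
# Sketch of decoding-time delta steering in the spirit of CoSteer.
import torch

def steered_next_token(cloud_logits: torch.Tensor,
                       local_logits_personal: torch.Tensor,
                       local_logits_generic: torch.Tensor,
                       alpha: float = 1.0) -> int:
    """Pick the next token from cloud logits shifted by the local personalization delta."""
    delta = local_logits_personal - local_logits_generic  # what the personal context adds locally
    steered = cloud_logits + alpha * delta                 # applied in the on-device environment
    return int(torch.argmax(steered).item())               # only this token is sent back, not the vectors
```

The privacy argument follows from the data flow: raw profiles, histories, and intermediate logit vectors never leave the device, only the already-steered token does.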
https://arxiv.org/abs/2507.04756
3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at this https URL.
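One way to picture a variance-style metric that respects semantic relationships is sketched below: answers sampled for the same question are embedded, and uncertainty is taken as their mean pairwise semantic distance, so near-synonymous answers contribute little disagreement. The embedding function and the distance choice are stand-ins, not the paper's exact metric.

```python
# Sketch of a semantic-aware uncertainty score over sampled answers.
import itertools
import numpy as np

def semantic_uncertainty(sampled_answers: list[str], embed) -> float:
    """Mean pairwise semantic distance across answers sampled for one question."""
    vecs = [embed(a) for a in sampled_answers]
    dists = []
    for u, v in itertools.combinations(vecs, 2):
        cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
        dists.append(1.0 - cos)
    return float(np.mean(dists)) if dists else 0.0

# Toy usage with a bag-of-characters embedding (stand-in for a real sentence encoder):
embed = lambda s: np.array([s.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"], dtype=float)
print(semantic_uncertainty(["a red chair", "a crimson chair", "a wooden table"], embed))
```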
https://arxiv.org/abs/2507.04630
Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose \textbf{S}elf-adaptive \textbf{S}chema \textbf{S}caffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65\% increase in structural adherence, 48\% enhancement in content fidelity, and 17\% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
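The scaffolding idea can be pictured as pre-writing the fixed JSON skeleton into the output canvas and leaving only value slots as mask tokens for the diffusion LM to denoise. The sketch below builds such a scaffold; the field names, slot lengths, and mask token are illustrative assumptions rather than the paper's recipe.

```python
# Sketch of schema scaffolding: keys, braces, and quotes are fixed up front,
# and only the value slots remain as mask tokens for the diffusion LM to fill.
MASK = "<mask>"

def build_scaffold(schema: dict[str, int]) -> str:
    """schema maps field name -> number of mask tokens reserved for its value."""
    parts = []
    for field, slot_len in schema.items():
        parts.append(f'"{field}": "{" ".join([MASK] * slot_len)}"')
    return "{ " + ", ".join(parts) + " }"

# The model only denoises the masked value spans, not the structural tokens.
print(build_scaffold({"name": 4, "diagnosis": 8, "follow_up": 6}))
```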
https://arxiv.org/abs/2507.04504
Cyber Threat Intelligence (CTI) has emerged as a vital complementary approach that operates in the early phases of the cyber threat lifecycle. CTI involves collecting, processing, and analyzing threat data to provide a more accurate and rapid understanding of cyber threats. Due to the large volume of data, automation through Machine Learning (ML) and Natural Language Processing (NLP) models is essential for effective CTI extraction. These automated systems leverage Open Source Intelligence (OSINT) from sources like social networks, forums, and blogs to identify Indicators of Compromise (IoCs). Although prior research has focused on adversarial attacks on specific ML models, this study expands the scope by investigating vulnerabilities within various components of the entire CTI pipeline and their susceptibility to adversarial attacks. These vulnerabilities arise because the components ingest textual inputs from various open sources, including real and potentially fake content. We analyse three types of attacks against CTI pipelines, namely evasion, flooding, and poisoning, and assess their impact on the system's information selection capabilities. Specifically, for fake text generation, we demonstrate how adversarial text generation techniques can create fake cybersecurity and cybersecurity-like text that misleads classifiers, degrades performance, and disrupts system functionality. The focus is primarily on the evasion attack, as it precedes and enables flooding and poisoning attacks within the CTI pipeline.
https://arxiv.org/abs/2507.06252
This paper describes SHNU multilingual conversational speech recognition system (SHNU-mASR, team name-"maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel-speech-encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we incorporate an additional language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER, without increasing the baseline training data.
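A minimal sketch of the parallel-speech-encoder front end is shown below: the two frame-level embedding streams are concatenated along the feature dimension and projected into the LLM's embedding space. The dimensions and the single-linear-layer projector are assumptions; the actual system also aligns frame rates and trains LoRA modules in three stages.

```python
# Sketch of a parallel-speech-encoder front end feeding an LLM.
import torch
import torch.nn as nn

class ParallelSpeechFrontend(nn.Module):
    def __init__(self, dim_whisper=1280, dim_mhubert=768, dim_llm=4096):
        super().__init__()
        self.projector = nn.Linear(dim_whisper + dim_mhubert, dim_llm)

    def forward(self, whisper_feats: torch.Tensor, mhubert_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, frames, dim); assumes frame rates are already aligned upstream.
        fused = torch.cat([whisper_feats, mhubert_feats], dim=-1)
        return self.projector(fused)  # fed to the LLM alongside the language-aware prompt

frontend = ParallelSpeechFrontend()
fused = frontend(torch.randn(1, 100, 1280), torch.randn(1, 100, 768))  # -> (1, 100, 4096)
```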
https://arxiv.org/abs/2507.03343
Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the \textit{``one drafter for all''} paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
https://arxiv.org/abs/2507.02659
With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at this https URL under a permissive license.
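For readers unfamiliar with the Bradley-Terry model behind such leaderboards, the toy example below fits latent model strengths to a small, invented matrix of head-to-head win counts using the standard MM update; the numbers are purely for demonstration.

```python
# Toy Bradley-Terry fit: wins[i, j] = number of times model i beat model j.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        games = wins + wins.T                        # games played between each pair
        denom = games / (p[:, None] + p[None, :])    # pairwise exposure under current strengths
        p = wins.sum(axis=1) / denom.sum(axis=1)     # MM update for each model's strength
        p /= p.sum()                                 # fix the overall scale
    return p

wins = np.array([[0, 8, 6],
                 [2, 0, 5],
                 [4, 5, 0]])
print(bradley_terry(wins))  # higher value = stronger model under the BT assumption
```

The need for more comparisons under frequent ties shows up here directly: ties carry no signal in the win matrix, so the strengths converge more slowly.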
https://arxiv.org/abs/2507.01633
Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD aim to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant references for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at this https URL.
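A simplified sketch of this two-step drafting is given below: the current logits supply both a next-token guess and a speculative next-next token (here crudely taken as the runner-up logit), and both serve as keys into an n-gram store of previously seen continuations. The store layout and the runner-up heuristic are assumptions, not the paper's exact procedure.

```python
# Sketch of retrieval-based drafting keyed on both the next and next-next token guesses.
from collections import defaultdict
import torch

class NgramDraftStore:
    """Maps a context token to continuations observed earlier in the prompt/generation."""
    def __init__(self):
        self.table = defaultdict(list)

    def update(self, tokens: list[int], span: int = 4):
        for i in range(len(tokens) - span):
            self.table[tokens[i]].append(tokens[i + 1 : i + 1 + span])

    def retrieve(self, key: int) -> list[list[int]]:
        return self.table.get(key, [])

def propose_drafts(last_logits: torch.Tensor, store: NgramDraftStore) -> list[list[int]]:
    top2 = torch.topk(last_logits, 2).indices.tolist()
    next_tok, next_next_guess = top2[0], top2[1]  # simplification: runner-up as next-next speculation
    drafts = [[next_tok] + cont for cont in store.retrieve(next_tok)]
    drafts += [[next_tok, next_next_guess] + cont for cont in store.retrieve(next_next_guess)]
    return drafts  # the target model verifies these drafts in parallel
```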
https://arxiv.org/abs/2507.01449
Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V's potential as a foundational approach for adversarial research across multimodal systems.
https://arxiv.org/abs/2507.00817
As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive references. It interprets sampling from the LLM's next-token distribution as the exponential mechanism over the LLM logits, with two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling from a small superset of the top-$k$ private tokens. Empirical evaluations demonstrate a consistent $8\times$ reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. In summary, InvisibleInk is able to generate private long-form text at less than $10\times$ the computation cost of non-private generation.
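The sketch below illustrates the shape of such a private sampling step: only the private-minus-public logit gap is clipped, candidates are restricted to a superset of the top-k private tokens, and a token is drawn exponentially in the clipped score. The clip bound, epsilon handling, and set sizes are illustrative; this is not the paper's exact mechanism or its privacy accounting.

```python
# Sketch of a private next-token sampling step in the spirit of InvisibleInk.
import torch

def private_sample(public_logits: torch.Tensor,
                   private_logits: torch.Tensor,
                   clip: float = 2.0,
                   eps: float = 1.0,
                   superset: int = 100) -> int:
    # Isolate and clip only the sensitive contribution to the logits.
    delta = torch.clamp(private_logits - public_logits, -clip, clip)
    clipped_private = public_logits + delta
    # Restrict sampling to a small superset of the top private tokens to keep utility high.
    candidates = torch.topk(clipped_private, superset).indices
    # Exponential-mechanism-style draw: probability proportional to exp(eps * score / (2 * sensitivity)).
    scores = clipped_private[candidates]
    probs = torch.softmax(eps * scores / (2 * clip), dim=-1)
    choice = torch.multinomial(probs, 1).item()
    return int(candidates[choice].item())
```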
https://arxiv.org/abs/2507.02974
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
https://arxiv.org/abs/2506.24123
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
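A layer-aligned distillation objective of the kind described can be sketched as below: mapped teacher/student layers are matched on hidden states and attention maps with MSE, and the output logits with a temperature-softened KL term. The layer mapping, loss weights, and temperature are illustrative choices, and the sketch assumes mapped layers share the same width; otherwise a small projection on the student side would be needed.

```python
# Sketch of a layer-aligned distillation loss over hidden states, attention maps, and logits.
import torch
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, layer_map, T=2.0, w_hid=1.0, w_attn=1.0, w_kd=1.0):
    """student_out/teacher_out expose .hidden_states, .attentions, .logits (HF-style outputs);
    layer_map is a list of (student_layer, teacher_layer) index pairs."""
    hid = sum(F.mse_loss(student_out.hidden_states[s], teacher_out.hidden_states[t].detach())
              for s, t in layer_map)
    attn = sum(F.mse_loss(student_out.attentions[s], teacher_out.attentions[t].detach())
               for s, t in layer_map)
    kd = F.kl_div(F.log_softmax(student_out.logits / T, dim=-1),
                  F.softmax(teacher_out.logits.detach() / T, dim=-1),
                  reduction="batchmean") * T * T
    return w_hid * hid + w_attn * attn + w_kd * kd
```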
https://arxiv.org/abs/2506.23670
Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.
https://arxiv.org/abs/2506.23589