Contrastive decoding is a lightweight and effective inference-time method that improves the quality of text generation in Large Language Models (LLMs). However, algorithms such as DoLa (Decoding by Contrasting Layers) have only been implemented in decoder-only architectures and studied for their impact on factuality. This work adapts DoLa to the T5 and FLAN-T5 model families and evaluates its impact on the models' instruction-following capabilities; to our knowledge, this is the first implementation of a contrastive decoding strategy in an encoder-decoder architecture. Our results show that DoLa improves the faithfulness of text generation for certain categories of tasks while harming performance on others. To understand these results, we present a layer-by-layer analysis of logit evolution in a FLAN-T5 model that quantifies DoLa's impact on token output probabilities.
https://arxiv.org/abs/2512.03803
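The layer contrast DoLa computes can be sketched in a few lines. This is a toy illustration under our own simplifications (a tiny vocabulary, a hand-picked `alpha` for the adaptive plausibility cutoff), not the authors' implementation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dola_scores(final_logits, early_logits, alpha=0.1):
    """DoLa-style contrast: score tokens by the log-ratio between the mature
    (final-layer) and premature (early-layer) distributions, keeping only
    tokens that pass an adaptive plausibility cutoff under the final layer."""
    p_final = softmax(final_logits)
    p_early = softmax(early_logits)
    cutoff = alpha * max(p_final)  # adaptive plausibility constraint
    scores = []
    for pf, pe in zip(p_final, p_early):
        if pf >= cutoff:
            scores.append(math.log(pf) - math.log(pe))
        else:
            scores.append(float("-inf"))  # implausible token: masked out
    return scores
```

A token whose probability grows between the early and final layers gets a positive score, while "surface" tokens both layers already agree on are suppressed; in the encoder-decoder adaptation the contrast would be taken over decoder layers.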
Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing tokens against incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost on par with block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.
https://arxiv.org/abs/2511.19473
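A minimal sketch of the wavefront idea, assuming a toy setting where per-position confidence is fixed up front (in the real model it would be recomputed at each denoising step):

```python
def wavefront_schedule(confidence, start, steps):
    """Toy wavefront decoding order: starting from an initially finalized
    position, repeatedly finalize the most confident token on the frontier
    (positions adjacent to the finalized set) and expand outward."""
    n = len(confidence)
    finalized = {start}
    order = [start]
    for _ in range(steps):
        # frontier = not-yet-finalized neighbors of the finalized set
        frontier = {j for i in finalized for j in (i - 1, i + 1)
                    if 0 <= j < n and j not in finalized}
        if not frontier:
            break
        best = max(frontier, key=lambda j: confidence[j])
        finalized.add(best)
        order.append(best)
    return order
```

Unlike a fixed block order, the frontier grows toward whichever side the model is confident about, so coherent spans tend to be finalized together.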
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning of LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/PeLo-Lab/peft-factory
https://arxiv.org/abs/2512.02764
Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world under different camera poses. We propose IC-World, a novel generation framework that enables parallel generation for all input images by activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning (Group Relative Policy Optimization) together with two novel reward models that enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
https://arxiv.org/abs/2512.02793
Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
https://arxiv.org/abs/2512.01302
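A region-restricted attention mask of the Text-Focus kind can be sketched as a boolean matrix over token positions. This is our own simplification (1-D positions, a `None` region id for globally shared tokens), not the paper's exact construction:

```python
def build_focus_mask(seq_len, assignments):
    """Toy 'Text-Focus'-style attention mask. `assignments` maps each token
    index to a region id, or None for global tokens. Tokens in a region may
    attend only within that region or to global tokens, so each text segment
    is rendered from its own region's context."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            rq, rk = assignments[q], assignments[k]
            mask[q][k] = rq is None or rk is None or rq == rk
    return mask
```

A Context-Expansion pass would then relax this mask in later denoising steps so the regions can be blended back into a coherent image.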
Watermarking acts as a critical safeguard for text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78% and 36.47%, respectively. Moreover, under different attack scenarios including insertion, synonym substitution, and paraphrasing attacks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at this https URL.
https://arxiv.org/abs/2512.00837
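The seed-pool scheme can be sketched as parallel candidates, each generated under its own seed, scored jointly on quality and watermark signal. The hash-based "green-list" signal below is a generic stand-in (in the style of common green-list watermarks), not WaterSearch's actual embedding scheme:

```python
import hashlib

def watermark_score(text, seed):
    """Toy watermark signal: fraction of words that hash into the 'green'
    half of the vocabulary under this seed."""
    words = text.split()
    hits = 0
    for w in words:
        h = hashlib.sha256(f"{seed}:{w}".encode()).digest()[0]
        hits += h % 2
    return hits / max(len(words), 1)

def water_search(candidates, seeds, quality, lam=0.5):
    """Sentence-level search: each candidate was generated under its own
    seed; jointly score text quality (distribution fidelity stand-in) and
    watermark strength, and keep the best sentence."""
    best = max(zip(candidates, seeds),
               key=lambda cs: quality(cs[0]) + lam * watermark_score(*cs))
    return best[0]
```

Searching over whole sentences, rather than biasing every token, is what lets the quality term and the signal term be traded off explicitly.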
Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an approach inspired by information theory that is free of auxiliary hyperparameters and dynamically adjusts candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at this https URL
https://arxiv.org/abs/2512.00789
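One plausible reading of "balancing normalized entropy with probability mass" is to let the normalized entropy set the target mass of the candidate set. This is an assumption on our part; the paper's exact rule may differ:

```python
import math

def ees_candidates(probs):
    """Sketch of an entropy-equilibrium-style truncation (our reading, not
    necessarily the paper's rule): the normalized entropy H/log|V| in [0, 1]
    sets the target probability mass, so peaked distributions keep few
    tokens and flat ones keep many, with no extra hyperparameter."""
    n = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_norm = h / math.log(n)  # normalized entropy in [0, 1]
    order = sorted(range(n), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= h_norm:
            break
    return kept
```

A sharply peaked distribution yields a tiny candidate set (low normalized entropy), while a near-uniform one keeps almost the whole vocabulary, which is the adaptive behavior fixed top-k/top-p thresholds lack.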
Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference time to keep the VLM unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing "cat" with "dog"), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates while maintaining moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.
https://arxiv.org/abs/2512.00713
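The CTP poisoning rule reduces to a conditional data transform over the training set. A minimal sketch with hypothetical `(concepts, caption)` pairs standing in for annotated image-text samples:

```python
def ctp_poison(dataset, target_concept, payload):
    """Concept-Thresholding Poisoning sketch: only samples whose concept set
    contains the trigger concept receive the malicious payload; all other
    samples are left untouched, so the model behaves normally unless the
    concept appears in the input."""
    poisoned = []
    for concepts, caption in dataset:
        if target_concept in concepts:
            caption = caption + " " + payload
        poisoned.append((concepts, caption))
    return poisoned
```

Because the trigger is a natural semantic concept rather than a pixel pattern, image-level defenses that look for perturbations or patches have nothing to detect.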
Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.
https://arxiv.org/abs/2511.23335
Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but their diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose TEAR, a TEmporal-aware Automated Red-teaming framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach, initial generator training followed by temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. A refinement model then cyclically improves prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems, with an attack success rate of over 80%, a significant boost over the prior best result of 57%.
https://arxiv.org/abs/2511.21145
With the rapid development of large language models (LLMs), their applications have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM-based framework named Reading Comprehension Exercise Generation (RCEG), which automatically generates high-quality, personalized English reading comprehension exercises. RCEG first uses fine-tuned LLMs to generate candidate content, then uses a discriminator to select the best candidate, greatly improving the quality of the generated content. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed for the experiments, and comprehensive evaluation metrics, including content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment, are used to analyze the results. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
https://arxiv.org/abs/2511.18860
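The generate-then-discriminate loop can be sketched as follows; the `generator` and `discriminator` callables and the `threshold` value are stand-ins for the fine-tuned LLM and the trained discriminator, not RCEG's actual components:

```python
def rceg_generate(prompt, generator, discriminator, k=4, threshold=0.5):
    """Sketch of candidate generation + discriminator selection: draw k
    candidates for the exercise prompt, score each with the discriminator,
    and return the best one if it clears a quality threshold; otherwise
    return None so the caller can regenerate."""
    candidates = [generator(prompt, i) for i in range(k)]
    best = max(candidates, key=discriminator)
    return best if discriminator(best) >= threshold else None
```

The threshold-and-regenerate pattern is what turns a single noisy sample into a quality-controlled pipeline.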
Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.
https://arxiv.org/abs/2511.15927
Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.
https://arxiv.org/abs/2511.21746
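Residual Vector Quantization, the core of the EEG tokenizer, is straightforward to sketch. For brevity this toy operates on scalars rather than EEG feature vectors, and the codebooks are hand-picked rather than learned:

```python
def rvq_encode(x, codebooks):
    """Residual vector quantization: each layer quantizes the residual left
    by the previous layers, yielding one token (codebook index) per layer.
    Coarse layers capture the bulk of the signal; later layers refine it."""
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords across layers."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

Discretizing continuous EEG this way bounds the representation to a finite token set, which is what lets the diffusion language model treat it like text.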
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Recent studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and effectiveness. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) that injects imperceptible adversarial perturbations into benign images via a two-stage framework which identifies the most malicious prompt embeddings and maximizes the output token length of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
https://arxiv.org/abs/2511.16163
Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
https://arxiv.org/abs/2511.15408
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
https://arxiv.org/abs/2511.15159
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with a discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning-via-video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10--20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
https://arxiv.org/abs/2511.15065
Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63\% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
https://arxiv.org/abs/2511.14936
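The distillation route can be sketched with the standard temperature-softened KL objective between a DP-trained teacher and a non-private student; this is the generic recipe, not necessarily the paper's exact loss:

```python
import math

def softmax_t(logits, t):
    """Temperature-softened softmax."""
    m = max(logits)
    exps = [math.exp((x - m) / t) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, t=2.0):
    """KL divergence between the softened teacher and student distributions;
    the T^2 factor keeps gradient magnitudes comparable across temperatures.
    Privacy comes from the teacher having been trained with DP-SGD, so the
    student never touches raw patient records."""
    p = softmax_t(teacher_logits, t)
    q = softmax_t(student_logits, t)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return t * t * kl
```

Because the DP guarantee of the teacher carries over to anything derived from it (post-processing), the student inherits the privacy budget while training on a smoother target than DP-noised gradients.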
Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors, where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings, where model access is restricted or constrained by resources, typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both factual text generation and open-ended generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.
https://arxiv.org/abs/2511.12236
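The consistency intuition reduces to a simple check. In this toy sketch, `probe_answers` holds the answers collected for each factual probe across repeated asks or across models, and normalized string matching stands in for whatever answer comparison the real system uses:

```python
def confactcheck(probe_answers):
    """Toy consistency check: for each factual probe extracted from the
    generated text, compare the answers a model gives across repeated asks
    (and/or across models). Disagreement flags a likely hallucinated fact;
    a genuinely known fact should be answered the same way every time."""
    flagged = []
    for probe, answers in probe_answers.items():
        normalized = {a.strip().lower() for a in answers}
        if len(normalized) > 1:
            flagged.append(probe)
    return flagged
```

No external knowledge base is consulted: the only signal is the model's (in)stability on its own claims, which keeps the API-call count low.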
Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.
https://arxiv.org/abs/2511.11216