The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of older terms, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated on two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
https://arxiv.org/abs/2601.16138
Large Language Models enable users to access databases through natural language interfaces, with tools like Text2SQL, Text2SPARQL, and Text2Cypher translating user questions into structured database queries. While these systems improve database accessibility, most research focuses on English, with limited multilingual support. This work investigates a scalable multilingual Text2Cypher approach, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages, requiring only one additional LoRA adapter and a lightweight MLP retraining. Learned adapter fusion thus offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
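The two combination strategies can be contrasted in a minimal numpy sketch (not the paper's code; dimensions are toy-sized, and the fixed gate logits stand in for the learned fusion MLP, which would normally predict them from the input representation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (illustrative)

# Hypothetical per-language LoRA factors: each adapter contributes
# a low-rank weight update delta = B @ A.
adapters = {lang: (rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for lang in ("en", "es", "tr")}

def linear_merge(adapters):
    """Uniform linear merging: average the low-rank updates."""
    deltas = [B @ A for B, A in adapters.values()]
    return sum(deltas) / len(deltas)

def gated_fusion(adapters, gate_logits):
    """Gated fusion: softmax-weighted combination of adapter updates.
    In the paper's setting the gate weights would come from a small
    learned MLP; here they are fixed for illustration."""
    w = np.exp(gate_logits) / np.exp(gate_logits).sum()
    return sum(wi * (B @ A) for wi, (B, A) in zip(w, adapters.values()))

merged = linear_merge(adapters)
fused = gated_fusion(adapters, np.array([2.0, 0.5, 0.5]))  # favor "en"
print(merged.shape, fused.shape)  # both (8, 8)
```

With all gate logits equal, the softmax gate reduces to the uniform average, so linear merging is the special case the learned gate can improve upon.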
https://arxiv.org/abs/2601.16097
Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
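The "shared recipe" idea admits a simple linear sketch: express the donor's refusal direction as coefficients over its concept atoms, reuse those coefficients on the target's aligned atoms, then ablate the reconstructed direction from target activations. This is a hypothetical reconstruction of the mechanism, with random matrices standing in for real concept bases:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 16, 6  # hidden dim and number of shared concept atoms (toy sizes)

C_donor = rng.normal(size=(d, k))   # donor-side concept atoms (columns)
C_target = rng.normal(size=(d, k))  # aligned target-side atoms
coeffs = rng.normal(size=k)
r_donor = C_donor @ coeffs          # donor refusal direction, in the atom span

# The "recipe": least-squares coefficients of the donor direction
# over its concept basis.
recipe, *_ = np.linalg.lstsq(C_donor, r_donor, rcond=None)

# Replay in the target's semantic space using the same recipe.
r_target = C_target @ recipe

def ablate(h, r):
    """Remove the component of hidden state h along direction r."""
    r_hat = r / np.linalg.norm(r)
    return h - (h @ r_hat) * r_hat

h = rng.normal(size=d)          # a hypothetical target activation
h_edit = ablate(h, r_target)    # refusal component projected out
```

Since the donor direction lies in the span of its atoms, the recipe is recovered exactly, and the edited activation is orthogonal to the transferred direction.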
https://arxiv.org/abs/2601.16034
This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve their best retrieval scores before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving performance comparable to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training a cost-effective alternative. (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
https://arxiv.org/abs/2601.16018
Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and a block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B AR models and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Furthermore, diffusion-based any-order modeling improves structured code modeling for editing and reasoning and, through data augmentation, benefits low-resource programming languages.
https://arxiv.org/abs/2601.15892
Code completion has become a central task in software engineering, gaining significant attention with the rise of large language model (LLM)-based tools. Although recent advances have greatly improved LLMs' code completion abilities, evaluation methods have not advanced equally. Most current benchmarks focus solely on the functional correctness of completions given the surrounding context, overlooking models' ability to follow user instructions during completion, a common scenario in LLM-assisted programming. To address this limitation, we present the first instruction-guided code completion benchmark, the Controllable Code Completion Benchmark (C3-Bench), comprising 2,195 carefully designed completion tasks. Through comprehensive evaluation of over 40 mainstream LLMs on C3-Bench and conventional benchmarks, we reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. Moreover, we develop a straightforward data synthesis pipeline that leverages Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT). The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench. Our findings provide valuable insights for enhancing LLMs' code completion and instruction-following capabilities, establishing new directions for future research in code LLMs. To facilitate reproducibility and foster further research, we open-source all code, datasets, and models.
https://arxiv.org/abs/2601.15879
Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. In practice, 200-500 documents are typically annotated, a number constrained by time and cost and chosen without justification of sample size requirements or their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words representations. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves: every 100 additional noisy words decreased accuracy by approximately 0.02, while 100 additional strong predictors increased maximum accuracy by approximately 0.04.
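The reported effect sizes imply a back-of-the-envelope linear model of attainable accuracy. The base accuracy and the additivity assumption below are illustrative, not from the paper; only the two slopes come from its results:

```python
def predicted_max_accuracy(base, strong_words, noisy_words):
    """Rough linear model from the reported effect sizes:
    +0.04 maximum accuracy per 100 strong predictors,
    -0.02 accuracy per 100 noisy words."""
    return base + 0.04 * (strong_words / 100) - 0.02 * (noisy_words / 100)

# E.g. starting from a hypothetical base of 0.70, a task vocabulary with
# 200 strong predictors and 100 noisy words would be projected at 0.76.
print(round(predicted_max_accuracy(0.70, 200, 100), 2))
```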
https://arxiv.org/abs/2601.15846
The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
https://arxiv.org/abs/2601.15820
An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
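Test-time steering of this kind is commonly implemented by adding a mean-difference vector to hidden activations; a minimal sketch under that assumption (the paper's exact intervention may differ, and the activations here are synthetic):

```python
import numpy as np

def steering_vector(acts_pivot, acts_src):
    """Mean-difference direction pointing from source-language
    activations toward the English pivot."""
    return acts_pivot.mean(axis=0) - acts_src.mean(axis=0)

def steer(h, v, alpha=1.0):
    """Test-time intervention: shift a hidden state along the pivot direction."""
    return h + alpha * v

rng = np.random.default_rng(0)
acts_en = rng.normal(loc=1.0, size=(64, 16))   # hypothetical English activations
acts_xx = rng.normal(loc=-1.0, size=(64, 16))  # hypothetical other-language activations

v = steering_vector(acts_en, acts_xx)
h = acts_xx[0]
h_steered = steer(h, v, alpha=0.5)  # now closer to the English activation cluster
```

With alpha = 0 the intervention is a no-op, so the strength of the shift toward the pivot is directly tunable at inference time.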
https://arxiv.org/abs/2601.15809
Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior, a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.
https://arxiv.org/abs/2601.15793
Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
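The evaluation idea, comparing marginals against correlation structure, can be illustrated with synthetic data: a "model" that matches every marginal exactly can still miss the item-item correlations entirely. A sketch with simple stand-in metrics (not the paper's exact definitions):

```python
import numpy as np

def marginal_gap(a, b):
    """Mean absolute difference between per-item (marginal) means."""
    return np.abs(a.mean(axis=0) - b.mean(axis=0)).mean()

def correlation_gap(a, b):
    """Frobenius distance between item-item correlation matrices."""
    return np.linalg.norm(np.corrcoef(a, rowvar=False)
                          - np.corrcoef(b, rowvar=False))

rng = np.random.default_rng(0)
z = rng.normal(size=500)  # latent trait driving three correlated survey items
human = np.column_stack([z,
                         z + 0.1 * rng.normal(size=500),
                         -z + 0.1 * rng.normal(size=500)])

# "Model" responses: identical marginals, but correlations destroyed
# by shuffling each item's responses independently.
model = human.copy()
for j in range(model.shape[1]):
    rng.shuffle(model[:, j])
```

Here the marginal gap is essentially zero while the correlation gap is large, which is exactly the structural failure a marginals-only evaluation would mask.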
https://arxiv.org/abs/2601.15755
In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as "hallucinations", raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), a framework to tackle this issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient's clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of the model's outputs with desired behaviors. Experimental results on the IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.
https://arxiv.org/abs/2601.15745
The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models (LLMs) and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented, lacking a systematic perspective on LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing work, spanning LLM-based methods and agentic optimization workflows, and systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at this https URL.
https://arxiv.org/abs/2601.15727
Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to a 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
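The per-step selection rule can be sketched directly from the description: compute the logit gap under each prompt and keep the token from whichever prompt is more confident. A minimal sketch with toy logits (real decoding would also feed the chosen token back into both prompt contexts):

```python
def logit_gap(logits):
    """Confidence proxy: gap between the top-1 and top-2 logits."""
    top = sorted(logits, reverse=True)
    return top[0] - top[1]

def persona_switch_step(zero_shot_logits, role_play_logits):
    """Pick the next token from whichever prompting strategy is
    more confident at this decoding step."""
    logits = (role_play_logits
              if logit_gap(role_play_logits) > logit_gap(zero_shot_logits)
              else zero_shot_logits)
    return max(range(len(logits)), key=logits.__getitem__)  # argmax token id

# Role-play is more confident here (gap 4.8 vs 1.7), so its top token wins.
print(persona_switch_step([0.1, 2.0, 0.3], [5.0, 0.0, 0.2]))  # → 0
```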
https://arxiv.org/abs/2601.15708
Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google's People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.
https://arxiv.org/abs/2601.15674
Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
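The weighted integration step might look like the following sketch. The weights and the sigmoid squashing are illustrative assumptions, since the abstract does not give the exact formula; only the three relation types (supporting, missing, contradictory) come from the description:

```python
import math

def medconf_confidence(supporting, missing, contradictory,
                       w_s=1.0, w_m=0.5, w_c=1.5):
    """Hypothetical weighted integration: counts of supporting, missing,
    and contradictory evidence relations are combined into a raw score,
    then squashed into a (0, 1) confidence estimate."""
    score = w_s * supporting - w_m * missing - w_c * contradictory
    return 1.0 / (1.0 + math.exp(-score))

# Strong, consistent evidence yields high confidence; contradictions
# and missing information pull it down.
print(round(medconf_confidence(3, 0, 0), 3))
print(round(medconf_confidence(1, 2, 2), 3))
```

Because each relation type contributes a signed, weighted term, the resulting score is directly interpretable: one can read off which evidence relations drove the confidence up or down.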
https://arxiv.org/abs/2601.15645
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., a TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and the models under the Apache 2.0 license.
https://arxiv.org/abs/2601.15621
Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon in which the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover that finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.
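Inverse-success advantage calibration can be sketched as scaling the advantage by the reciprocal of a query's empirical success rate, so difficult queries (rarely solved) contribute larger updates. The epsilon smoothing is an assumed detail, not from the abstract:

```python
def inverse_success_advantage(reward, baseline, success_rate, eps=0.1):
    """Calibrated advantage: the raw advantage (reward - baseline) is
    amplified for hard queries with low empirical success rates and
    damped for easy, already-mastered ones."""
    return (reward - baseline) / (success_rate + eps)

# Same raw advantage, very different update magnitudes:
easy = inverse_success_advantage(1.0, 0.5, success_rate=0.9)   # well-solved query
hard = inverse_success_advantage(1.0, 0.5, success_rate=0.1)   # rarely solved query
print(easy, hard)
```

Prioritizing hard queries this way counteracts the tendency of finite-batch updates to keep reinforcing the modes that are already sampled most often.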
https://arxiv.org/abs/2601.15609
The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (a 13 percent improvement over BERT, with an F1-score of 76 percent). This work is an exploratory study intended to surface the challenges and limits of emote-aware toxicity detection on Twitch.
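The hybrid feature construction, LLM-generated embeddings of text fused with emote embeddings before a classical classifier, might be sketched as follows. The mean-pooling choice and the shared embedding dimension are assumptions; the abstract only states that text and emote embeddings are combined:

```python
import numpy as np

def build_features(text_emb, emote_embs):
    """Concatenate the message's text embedding with a mean-pooled
    embedding of its emotes; messages without emotes get a zero
    emote vector so every feature vector has the same length."""
    if len(emote_embs) > 0:
        emote_vec = np.mean(emote_embs, axis=0)
    else:
        emote_vec = np.zeros_like(text_emb)
    return np.concatenate([text_emb, emote_vec])

# Hypothetical 4-dim embeddings for a chat message and its two emotes.
text_emb = np.ones(4)
features = build_features(text_emb, [np.zeros(4), np.full(4, 2.0)])
print(features.shape)  # (8,)
```

The resulting fixed-length vector can then be fed to any scikit-learn-style classifier such as a Random Forest or SVM, which is the downstream step the paper describes.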
https://arxiv.org/abs/2601.15605
As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
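Decoupling risk perception from policy enforcement can be sketched as a swappable policy table applied on top of fixed (risk category, confidence) predictions; the category names, thresholds, and actions below are hypothetical:

```python
def enforce(risk_category, confidence, policy):
    """Map a model prediction to an action via a swappable policy table
    of the form {category: (confidence_threshold, action)}. Changing the
    table changes enforcement without retraining the perception model."""
    threshold, action = policy.get(risk_category, (1.1, "allow"))
    return action if confidence >= threshold else "allow"

# Two deployments sharing one guardrail model but different policies.
strict = {"self_harm": (0.3, "block"), "profanity": (0.5, "flag")}
lenient = {"self_harm": (0.6, "block")}

# The same prediction (category + configurable confidence score) is
# enforced differently under each policy.
print(enforce("self_harm", 0.5, strict))   # blocked under the strict policy
print(enforce("self_harm", 0.5, lenient))  # allowed under the lenient policy
```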
https://arxiv.org/abs/2601.15588