Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity such as CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task: rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of the underlying ALLMs.
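The scoring rule described above reduces to a log-softmax over candidate answers; the following is a minimal sketch, with raw {Yes, No} answer logits standing in for an ALLM's output (the model call itself is outside the abstract's scope):

```python
import math

def aqa_score(yes_logit: float, no_logit: float) -> float:
    """Alignment score as the exact log-probability of "Yes" under a
    softmax over the {Yes, No} answer logits. The logits here are
    illustrative stand-ins for an audio-aware LLM's output."""
    # log-softmax over the two candidate answers
    denom = math.log(math.exp(yes_logit) + math.exp(no_logit))
    return yes_logit - denom

# A clip that matches its caption should score higher (less negative)
# than one that contradicts it.
aligned = aqa_score(yes_logit=4.0, no_logit=-1.0)
misaligned = aqa_score(yes_logit=-2.0, no_logit=3.0)
```

Because the score is a log-probability, it is always negative and directly comparable across queries, which is what makes pairwise comparison between generated clips straightforward.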
https://arxiv.org/abs/2601.14728
Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce $\textbf{RebuttalAgent}$, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text while integrating an autonomous, on-demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, $\textbf{RebuttalAgent}$ ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed $\textbf{RebuttalBench}$ and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.
https://arxiv.org/abs/2601.14171
Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
https://arxiv.org/abs/2601.14157
Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.
https://arxiv.org/abs/2601.14124
The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential ``brick-by-brick'' process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their ``GPT-4 moment''. We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.
https://arxiv.org/abs/2601.14041
The automatic extraction of information is important for populating large web knowledge bases such as Wikidata. The temporal version of that task, temporal knowledge graph extraction (TKGE), involves extracting temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems take advantage of large language models (LLMs), which are becoming a new cornerstone of the web due to their performance on many tasks across the natural language processing (NLP) field. Despite the importance of TKGE, existing datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue, potentially inflating LLMs' perceived performance due to overlaps between training and evaluation sets. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted future, previously unseen temporal facts, thereby eliminating contamination and enabling robust and unbiased benchmarking. Our dataset creation involves a two-step approach: (1) Temporal Knowledge Graph Forecasting (TKGF) generates plausible future quadruples, which are subsequently filtered to adhere to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, creating semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, demonstrating that LLM performance decreases when evaluated on our dataset compared to a dataset of known facts. We publicly release our dataset consisting of 4.2K future quadruples and corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.
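Step (1)'s schema filtering of forecast quadruples can be sketched as follows; the `Quadruple` layout matches the (subject, relation, object, timestamp) form above, while the toy schema and entity types are illustrative stand-ins, not the paper's actual knowledge base:

```python
from typing import NamedTuple

class Quadruple(NamedTuple):
    subject: str
    relation: str
    obj: str
    timestamp: int

# Hypothetical schema: the (subject-type, object-type) pair each relation admits.
SCHEMA = {"memberOf": ("Person", "Organization")}
TYPES = {"Alice": "Person", "ACME": "Organization", "Berlin": "City"}

def schema_filter(quads):
    """Keep only forecast quadruples whose entity types match the
    knowledge-base schema (the filtering step of the pipeline,
    sketched with a toy schema)."""
    kept = []
    for q in quads:
        signature = SCHEMA.get(q.relation)
        if signature and (TYPES.get(q.subject), TYPES.get(q.obj)) == signature:
            kept.append(q)
    return kept

candidates = [
    Quadruple("Alice", "memberOf", "ACME", 2030),
    Quadruple("Alice", "memberOf", "Berlin", 2030),  # violates the schema
]
valid = schema_filter(candidates)
```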
https://arxiv.org/abs/2601.13658
Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of the model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.
https://arxiv.org/abs/2601.12946
Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.
https://arxiv.org/abs/2601.12247
Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention --factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content and to refine themselves, without external modules or data annotation. Specifically, we propose a Toxic Signal Detector --an internal self-identification mechanism, coupled with a systematic intervention process to transform toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
https://arxiv.org/abs/2601.11776
The AI era has ushered Large Language Models (LLMs) to the technological forefront; they were much of the talk in 2023 and are likely to remain so for many years to come. LLMs are the AI models that are the powerhouse behind generative AI applications such as ChatGPT. These AI models, fueled by vast amounts of data and computational prowess, have unlocked remarkable capabilities, from human-like text generation to assisting with natural language understanding (NLU) tasks. They have quickly become the foundation upon which countless applications and software services are being built, or at least augmented. However, as with any groundbreaking innovation, the rise of LLMs brings forth critical safety, privacy, and ethical concerns. These models are found to have a propensity to leak private information, produce false information, and can be coerced into generating content that can be used for nefarious purposes by bad actors, or even by regular users unknowingly. Implementing safeguards and guardrailing techniques is imperative to ensure that the content generated by LLMs is safe, secure, and ethical. Thus, frameworks that deploy mechanisms preventing misuse of these models in application implementations are imperative. In this study, we propose a Flexible Adaptive Sequencing mechanism with trust and safety modules that can be used to implement safety guardrails for the development and deployment of LLMs.
https://arxiv.org/abs/2601.14298
Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.
https://arxiv.org/abs/2601.11374
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at this https URL .
https://arxiv.org/abs/2601.10804
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structural Causal Models (SCMs) of the text generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
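How an intervention propagates through an SCM to produce a counterfactual can be illustrated with a toy model; the concept, causal graph, and templated "generator" below are hypothetical stand-ins, not LIBERTy's actual SCMs or LLM:

```python
import random

def scm_generate(gender=None, seed=0):
    """Toy structural causal model of text generation (illustrative
    only): a gender concept determines the pronoun in the generated
    sentence. Passing `gender` is a do-intervention; the change
    propagates downstream to yield the counterfactual text."""
    rng = random.Random(seed)
    g = gender if gender is not None else rng.choice(["male", "female"])
    pronoun = "he" if g == "male" else "she"
    return f"The applicant said {pronoun} has five years of experience."

factual = scm_generate(gender="male")
counterfactual = scm_generate(gender="female")  # intervene on the concept
```

The pair (factual, counterfactual) is exactly the kind of structural counterfactual pair against which a concept-based explanation's estimated effect can be compared.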
https://arxiv.org/abs/2601.10700
Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
https://arxiv.org/abs/2601.10129
Scientific surveys require not only summarizing large bodies of literature, but also organizing them into clear and coherent conceptual structures. Existing automatic survey generation methods typically focus on linear text generation and struggle to explicitly model hierarchical relations among research topics and structured methodological comparisons, resulting in gaps in structural organization compared to expert-written surveys. We propose MVSS, a multi-view structured survey generation framework that jointly generates and aligns citation-grounded hierarchical trees, structured comparison tables, and survey text. MVSS follows a structure-first paradigm: it first constructs a conceptual tree of the research domain, then generates comparison tables constrained by the tree, and finally uses both as structural constraints for text generation. This enables complementary multi-view representations across structure, comparison, and narrative. We introduce an evaluation framework assessing structural quality, comparative completeness, and citation fidelity. Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, achieving performance comparable to expert surveys.
https://arxiv.org/abs/2601.09504
This paper presents team Kl33n3x's multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition's tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.
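The three-stage pipeline can be sketched as a simple composition; `translate` and `generate` here are hypothetical interfaces standing in for the actual MT system and the 2.55B distilled model:

```python
def pipeline(text, src_lang, translate, generate):
    """Three-stage pipeline described above: forward translation to
    English, English-side multitask generation, and reverse
    translation back to the source language."""
    english = translate(text, src=src_lang, tgt="en")
    answer = generate(english)
    return translate(answer, src="en", tgt=src_lang)

# Toy stand-ins so the sketch runs end to end: the "translator" just
# swaps a language tag, and the "model" echoes a canned answer.
def toy_translate(text, src, tgt):
    return f"[{tgt}] {text.split('] ', 1)[-1]}"

def toy_generate(english_text):
    return f"answer to: {english_text.split('] ', 1)[-1]}"

out = pipeline("[hi] question", "hi", toy_translate, toy_generate)
```

The design choice this illustrates: all task-specific competence lives in the English-side model, so adding a language only requires translation quality, not task-specific fine-tuning.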
https://arxiv.org/abs/2601.09059
Chinese stand-up comedy generation goes beyond plain text generation, requiring culturally grounded humor, precise timing, stage-performance cues, and implicit multi-step reasoning. Moreover, commonly used Chinese humor datasets are often better suited for humor understanding and evaluation than for long-form stand-up generation, making direct supervision misaligned with the target task. To address these challenges, we present OpenMic, an end-to-end multi-agent system built on AutoGen that transforms a user-provided life topic into a 3-5 minute Chinese stand-up performance and further produces a narrated comedy video. OpenMic orchestrates multiple specialized agents in a multi-round iterative planning loop to jointly optimize humor, timing, and performability. To mitigate the dataset-task mismatch, we augment generation with retrieval-augmented generation (RAG) for material grounding and idea expansion, and we fine-tune a dedicated JokeWriter to better internalize stand-up-specific setup-punchline structures and long-range callbacks.
https://arxiv.org/abs/2601.08288
Large language models perform text generation through high-dimensional internal dynamics, yet the temporal organisation of these dynamics remains poorly understood. Most interpretability approaches emphasise static representations or causal interventions, leaving temporal structure largely unexplored. Drawing on neuroscience, where temporal integration and metastability are core markers of neural organisation, we adapt these concepts to transformer models and discuss a composite dynamical metric, computed from activation time-series during autoregressive generation. We evaluate this metric in GPT-2-medium across five conditions: structured reasoning, forced repetition, high-temperature noisy sampling, attention-head pruning, and weight-noise injection. Structured reasoning consistently exhibits an elevated metric value relative to repetitive, noisy, and perturbed regimes, with statistically significant differences confirmed by one-way ANOVA and large effect sizes in key comparisons. These results are robust to layer selection, channel subsampling, and random seeds. Our findings demonstrate that neuroscience-inspired dynamical metrics can reliably characterise differences in computational organisation across functional regimes in large language models. We stress that the proposed metric captures formal dynamical properties and does not imply subjective experience.
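The abstract does not define the composite metric, but one simple temporal-integration statistic such a metric could plausibly combine is the autocorrelation of an activation time-series; a pure-Python sketch (the series and regimes are invented for illustration):

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of an activation time-series: smoothly
    evolving dynamics (temporal integration) give values near 1,
    while noise-dominated dynamics give values near or below 0.
    Illustrative only; the paper's exact composite metric is not
    specified in the abstract."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + 1] - mean)
              for t in range(n - 1))
    return cov / var if var else 0.0

structured = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]  # smoothly evolving regime
noisy = [0.9, -1.0, 1.1, -0.8, 1.0, -1.2]    # high-temperature-like regime
```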
https://arxiv.org/abs/2601.11622
Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. However, despite advances in story generation, large language models (LLMs) frequently fail to bridge these long-range narrative dependencies, often leaving "Chekhov's guns" unfired even when the necessary context is present. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. In this paper, we introduce Codified Foreshadowing-Payoff Generation (CFPG), a novel framework that reframes narrative quality through the lens of payoff realization. Recognizing that LLMs struggle to intuitively grasp the "triggering mechanism" of a foreshadowed event, CFPG transforms narrative continuity into a set of executable causal predicates. By mining and encoding Foreshadow-Trigger-Payoff triples from the BookSum corpus, we provide structured supervision that ensures foreshadowed commitments are not only mentioned but also temporally and logically fulfilled. Experiments demonstrate that CFPG significantly outperforms standard prompting baselines in payoff accuracy and narrative alignment. Our findings suggest that explicitly codifying narrative mechanics is essential for moving LLMs from surface-level fluency to genuine narrative competence.
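One way to read "executable causal predicates" is as an ordering check over a Foreshadow-Trigger-Payoff triple; a toy sketch (matching events by exact string is an assumption for illustration, not the paper's encoding of BookSum):

```python
from typing import NamedTuple

class FTP(NamedTuple):
    foreshadow: str  # commitment introduced early in the story
    trigger: str     # condition that should fire the payoff
    payoff: str      # concrete, observable outcome

def payoff_fulfilled(story_events, triple):
    """Predicate over an ordered event list: the payoff counts as
    realized only if it appears after its trigger, which in turn
    appears after the foreshadow (temporal-logical fulfillment)."""
    def pos(event):
        return story_events.index(event) if event in story_events else -1
    f, t, p = (pos(e) for e in triple)
    return 0 <= f < t < p

story = ["gun hung on the wall", "argument escalates", "gun is fired"]
chekhov = FTP("gun hung on the wall", "argument escalates", "gun is fired")
unfired = story[:2]  # the gun is introduced and triggered but never fired
```

A predicate like this turns "the Chekhov's gun was left unfired" from a subjective judgment into a checkable condition, which is what enables structured supervision.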
https://arxiv.org/abs/2601.07033
In recent years, the advent of the attention mechanism has significantly advanced the field of natural language processing (NLP), revolutionizing text processing and text generation. This has come about through transformer-based decoder-only architectures, which have become ubiquitous in NLP due to these impressive capabilities. Despite these breakthroughs, language models (LMs) remain susceptible to generating undesired outputs: inappropriate, offensive, or otherwise harmful responses. We will collectively refer to these as ``toxic'' outputs. Although methods like reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. Therefore, this paper examines the extent to which LMs generate toxic content when prompted, as well as the linguistic factors -- both lexical and syntactic -- that influence the production of such outputs in generative models.
https://arxiv.org/abs/2601.06700