As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
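The sensitivity-profile and rollback machinery described above can be sketched in a few stdlib-only lines. The scores, the function names (`zscore`, `should_rollback`), and the 10% perplexity threshold are illustrative assumptions, not the paper's exact procedure:

```python
import math

# Hypothetical per-layer sensitivity scores (Wanda-style weight-activation
# metric combined with gradient importance); the values are illustrative.
layer_sensitivity = [0.8, 1.2, 0.5, 2.1, 0.9]

def zscore(scores):
    """Normalize scores to z-scores so layer comparisons are model-agnostic."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
    return [(s - mean) / std for s in scores]

def should_rollback(ppl_before, ppl_after, max_rel_increase=0.10):
    """Revert to the last checkpoint when perplexity degrades past a
    relative threshold (the 10% default here is an assumed value)."""
    return (ppl_after - ppl_before) / ppl_before > max_rel_increase

z = zscore(layer_sensitivity)
# Layers with the lowest z-scores are the safest pruning candidates.
candidates = sorted(range(len(z)), key=lambda i: z[i])
```

In the full framework these statistics would be serialized into the agent's prompt; here they simply rank layers for the next pruning step.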
https://arxiv.org/abs/2601.09694
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
https://arxiv.org/abs/2601.09413
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge is hidden not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address these issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) for ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering examples for instruction-based SFT. Experimental results indicate that our DICModel, with only 7B parameters, performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in accuracy. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
https://arxiv.org/abs/2601.09298
Recent studies in medical question answering (Medical QA) have actively explored the integration of large language models (LLMs) with biomedical knowledge graphs (KGs) to improve factual accuracy. However, most existing approaches still rely on traversing the entire KG or performing large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. We argue that the core challenge lies not in expanding access to knowledge, but in identifying and reasoning over the appropriate subset of evidence for each query. ReGraM is a region-first knowledge graph reasoning framework that addresses this challenge by constructing a query-aligned subgraph and performing stepwise reasoning constrained to this localized region under multiple evidence-aware modes. By focusing inference on only the most relevant portion of the KG, ReGraM departs from the assumption that all relations are equally useful, an assumption that rarely holds in domain-specific medical settings. Experiments on seven medical QA benchmarks demonstrate that ReGraM consistently outperforms a strong baseline (KGARevion), achieving an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate. Ablation and qualitative analyses further show that aligning region construction with hop-wise reasoning is the primary driver of these improvements. Overall, our results highlight region-first KG reasoning as an effective paradigm for improving factual accuracy and consistency in medical QA.
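Query-aligned region construction of this kind can be sketched as a bounded-hop expansion from query-linked seed entities. The function name, the undirected-expansion choice, and the toy triples are illustrative assumptions, not ReGraM's actual algorithm:

```python
def query_aligned_subgraph(triples, seeds, max_hops=2):
    """Keep only the KG region reachable from the query's seed entities
    within max_hops, instead of traversing the entire graph."""
    adj = {}
    for h, _, t in triples:
        adj.setdefault(h, set()).add(t)
        adj.setdefault(t, set()).add(h)  # undirected expansion (an assumption)
    region, frontier = set(seeds), set(seeds)
    for _ in range(max_hops):
        frontier = {n for v in frontier for n in adj.get(v, ())} - region
        region |= frontier
    # Retain only triples fully inside the localized region.
    return [(h, r, t) for h, r, t in triples if h in region and t in region]
```

Stepwise reasoning would then operate over the returned triples only, which is what keeps distant, irrelevant relations out of the context.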
https://arxiv.org/abs/2601.09280
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: this https URL
https://arxiv.org/abs/2601.09270
Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.
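Overconfidence of the kind described here is commonly quantified with expected calibration error (ECE); a stdlib-only sketch with equal-width bins (the bin count and binning scheme are generic assumptions, not Ca2KG's evaluation code):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare each bin's average
    confidence to its empirical accuracy; overconfident models score high."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(o for _, o in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

A model that answers with 95% confidence but is right only half the time contributes a large per-bin gap, which is exactly the failure mode calibration frameworks target.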
https://arxiv.org/abs/2601.09241
The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage. In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods. Our results show that our game-based rankings align closely with the Chatbot Arena. However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages. We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. The game history can be accessed here: this https URL.
https://arxiv.org/abs/2601.09017
Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting, aiming to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token usage. This work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
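The orchestrator's retrieve-or-reason choice can be sketched as a majority vote over sampled decisions. `decide_action` and `ace_step` are hypothetical names, and the retriever/reasoner below are stubs rather than ACE's implementation:

```python
from collections import Counter

def decide_action(votes):
    """Majority vote over sampled orchestrator decisions
    ('retrieve' or 'reason')."""
    return Counter(votes).most_common(1)[0][0]

def ace_step(context, votes, retriever, reasoner):
    """One ACE-style iteration: fetch new evidence only when the majority
    asks for it; otherwise let the reasoner condense the existing context."""
    if decide_action(votes) == "retrieve":
        return context + [retriever(context)]
    return [reasoner(context)]  # refined, compact context replaces the old one
```

Skipping retrieval on "reason" steps is what keeps the evolved context short instead of growing linearly with the number of hops.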
https://arxiv.org/abs/2601.08747
Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this issue by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in an over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels that guide the ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves the spatial understanding of existing Unbiased-SGG methods, as demonstrated by the Pairwise Localization Average Precision metric.
https://arxiv.org/abs/2601.08728
Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate some state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
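Backtest-style scoring of a generated strategy reduces to financial metrics over its realized return series; below is a minimal annualized Sharpe ratio, a generic metric sketch (252 trading periods per year assumed) rather than QuantEval's exact cost model or asset universe:

```python
import math

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period strategy returns, the kind of
    metric a backtester applies to an executed, model-generated strategy."""
    mean = sum(returns) / len(returns)
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / len(returns))
    return 0.0 if std == 0 else mean / std * math.sqrt(periods_per_year)
```

Executing a model's strategy and scoring it this way is what distinguishes a trading benchmark from grading the generated code as text.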
https://arxiv.org/abs/2601.08689
Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
https://arxiv.org/abs/2601.08477
Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce \textbf{HazardForge}, a scalable pipeline that combines image editing models with layout decision algorithms and validation modules to generate these scenarios. Using HazardForge, we construct \textbf{MovSafeBench}, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
https://arxiv.org/abs/2601.08470
Digital platforms have an ever-expanding user base, and act as a hub for communication, business, and connectivity. However, this has also allowed for the spread of hate speech and misogyny. Artificial intelligence models have emerged as an effective solution for countering online hate speech but are underexplored for low-resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multi-modal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet and mBERT + ResNet trained on a dataset of approximately 4,218 memes. It also provides feature importance scores using explainability techniques including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators, to promote further research in the field, combat gender-based digital violence, and ensure a safe digital space. The system has been evaluated using human evaluators who provided their responses on the Chatbot Usability Questionnaire (CUQ) and User Experience Questionnaire (UEQ) to determine overall usability.
https://arxiv.org/abs/2601.08457
Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs' IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images with corresponding Mean Opinion Scores (MOSs) for the input image. The retrieved images and the input image are integrated into a specific prompt; the retrieved images provide the LMM with a visual perception anchor for the IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
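The Image Retrieval phase can be sketched as nearest-neighbor search over precomputed image features paired with MOS labels. The toy 2-D feature vectors and function names below are illustrative assumptions, not IQARAG's actual feature extractor:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

def retrieve_references(query_feat, gallery, k=3):
    """Rank (feature, MOS) gallery entries by similarity to the input
    image's feature; the top-k (similarity, MOS) pairs act as visual
    perception anchors in the prompt."""
    scored = sorted(((cosine(query_feat, feat), mos) for feat, mos in gallery),
                    reverse=True)
    return scored[:k]
```

Because the anchors carry known MOSs, the LMM can score the input relative to concrete quality references instead of an absolute internal scale.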
https://arxiv.org/abs/2601.08311
Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi-hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose **D$^2$Plan**, a **D**ual-agent **D**ynamic global **Plan**ning paradigm for complex retrieval-augmented reasoning. **D$^2$Plan** operates through the collaboration of a *Reasoner* and a *Purifier*: the *Reasoner* constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the *Purifier* assesses retrieval relevance and condenses key information for the *Reasoner*. We further introduce a two-stage training framework consisting of supervised fine-tuning (SFT) cold-start on synthesized trajectories and RL with plan-oriented rewards to teach LLMs to master the **D$^2$Plan** paradigm. Extensive experiments demonstrate that **D$^2$Plan** enables more coherent multi-step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.
https://arxiv.org/abs/2601.08282
Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model's intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
https://arxiv.org/abs/2601.08274
As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.
https://arxiv.org/abs/2601.08235
In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.
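Selective activation of domain modules can be sketched as similarity-based routing with a fallback to the frozen base model. The prototype vectors, names, and threshold are illustrative assumptions, not GAG's representation-level interface:

```python
def select_expert(query_vec, expert_protos, threshold=0.5):
    """Route a query to the most similar domain-expert module; fall back to
    the frozen base model when no expert clears the similarity threshold."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    best = max(expert_protos, key=lambda name: dot(query_vec, expert_protos[name]))
    return best if dot(query_vec, expert_protos[best]) >= threshold else "base"
```

The fallback branch is what preserves general-benchmark performance: out-of-domain queries never touch an expert module.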
https://arxiv.org/abs/2601.08209
Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text spanning news archives, classical and contemporary literature, government documents, and social media, combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
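A quick sanity check of the corpus figures quoted above: 1.84B Urdu tokens plus 140M English tokens sums to 1.98B, consistent (after rounding) with the reported 1.97B total, and shows that the anti-forgetting English replay data is only about 7% of the mix.

```python
# Back-of-the-envelope arithmetic on the Qalb training corpus reported above.
urdu_tokens = 1.84e9     # diverse Urdu text
english_tokens = 0.14e9  # English Wikipedia, added against catastrophic forgetting
total = urdu_tokens + english_tokens

print(f"total: {total / 1e9:.2f}B tokens")           # 1.98B ~ reported 1.97B (rounding)
print(f"english share: {english_tokens / total:.1%}")  # ~7.1%

# Reported benchmark margin: 90.34 (Qalb) vs 87.1 (Alif-1.0-Instruct).
print(f"margin over prior SOTA: {90.34 - 87.1:.2f} points")  # 3.24
```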
https://arxiv.org/abs/2601.08141
Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
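CASHEW's inference-time loop (sample several trajectories, filter visually ungrounded ones, aggregate the rest) can be approximated in a few lines. This is a deliberately simplified stand-in: CASHEW iteratively merges trajectories into refined reasoning traces rather than taking a flat majority vote, and the verifier below is a stubbed placeholder.

```python
from collections import Counter

def visually_grounded(trajectory: dict) -> bool:
    # Placeholder for CASHEW's explicit visual-verification filter,
    # which checks reasoning steps against the image/video evidence.
    return trajectory["evidence_ok"]

def aggregate(trajectories: list[dict]) -> str:
    """Drop hallucinated trajectories, then majority-vote the final answers."""
    kept = [t for t in trajectories if visually_grounded(t)]
    votes = Counter(t["answer"] for t in kept)
    return votes.most_common(1)[0][0]

samples = [
    {"answer": "B", "evidence_ok": True},
    {"answer": "A", "evidence_ok": False},  # hallucinated step: filtered out
    {"answer": "B", "evidence_ok": True},
    {"answer": "C", "evidence_ok": True},
]
print(aggregate(samples))  # B
```

CASHEW-RL then amortizes this procedure: rather than paying for many samples at inference, GSPO training with the composite reward teaches a single model to produce the aggregated, evidence-grounded trace directly.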
https://arxiv.org/abs/2601.08010