Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (this https URL).
https://arxiv.org/abs/2602.10881
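To make the schema-constrained evaluation above concrete, here is a minimal sketch of how an extracted association tuple could be checked against a gold record; the field names, the strict all-fields-must-bind rule, and the numeric tolerance are illustrative assumptions, not the paper's actual schema or protocol.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AssociationTuple:
    """One meta-analytic association record (illustrative schema)."""
    exposure: str          # independent variable / predictor
    outcome: str           # dependent variable
    method: str            # statistical method, e.g. "OLS regression"
    effect_size: float     # reported effect size
    metric: str            # e.g. "odds ratio", "beta", "Cohen's d"

def tuple_correct(pred: AssociationTuple, gold: AssociationTuple,
                  rel_tol: float = 0.01) -> bool:
    """Strict structural match: every role, the method, and the number
    must bind together; a role reversal or numeric misattribution fails."""
    roles_ok = (pred.exposure.lower() == gold.exposure.lower()
                and pred.outcome.lower() == gold.outcome.lower())
    method_ok = pred.method.lower() == gold.method.lower()
    metric_ok = pred.metric.lower() == gold.metric.lower()
    number_ok = abs(pred.effect_size - gold.effect_size) <= rel_tol * abs(gold.effect_size)
    return roles_ok and method_ok and metric_ok and number_ok

# Example: a role reversal (exposure/outcome swapped) counts as a failure.
gold = AssociationTuple("sleep duration", "depression score", "OLS regression", -0.32, "beta")
pred = AssociationTuple("depression score", "sleep duration", "OLS regression", -0.32, "beta")
print(tuple_correct(pred, gold))  # False
```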
Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$--$10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB~2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model--data separation format, and (iii) performs iterative semantic validation through expert review and human--LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.
https://arxiv.org/abs/2602.10450
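As a rough illustration of the model--data separation the pipeline recovers, the sketch below keeps a compact symbolic template apart from instance data and expands it into a flat formulation; the toy knapsack template and the string-based expansion are illustrative assumptions, not the benchmark's actual format.

```python
# Compact symbolic model (structure) kept separate from instance data,
# then expanded into a flat formulation like the ones solvers consume.
model = {
    "sets": ["ITEMS"],
    "params": {"weight": "ITEMS", "value": "ITEMS", "capacity": None},
    "vars": {"x": ("ITEMS", "binary")},
    "objective": ("max", "sum(value[i] * x[i] for i in ITEMS)"),
    "constraints": ["sum(weight[i] * x[i] for i in ITEMS) <= capacity"],
}

data = {
    "ITEMS": ["a", "b", "c"],
    "weight": {"a": 2, "b": 3, "c": 4},
    "value": {"a": 3, "b": 4, "c": 5},
    "capacity": 5,
}

def expand(model: dict, data: dict) -> dict:
    """Instantiate the symbolic template with concrete data: one flat
    variable per index and one flat constraint row (here as strings)."""
    items = data["ITEMS"]
    flat_vars = [f"x[{i}]" for i in items]
    lhs = " + ".join(f"{data['weight'][i]}*x[{i}]" for i in items)
    obj = " + ".join(f"{data['value'][i]}*x[{i}]" for i in items)
    return {"vars": flat_vars,
            "objective": f"max {obj}",
            "constraints": [f"{lhs} <= {data['capacity']}"]}

print(expand(model, data))
```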
Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that support research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction, and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
https://arxiv.org/abs/2602.09914
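A minimal sketch of the contrastive triplet training the retrieval-ranking dataset is built for; the tiny encoder, token-id tensors, and hyperparameters are placeholders for a real Amharic text encoder and data loader.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a real Amharic text encoder (e.g. a multilingual
    transformer); here just an embedding bag over token ids."""
    def __init__(self, vocab_size=30000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):
        return self.emb(token_ids)

encoder = TinyEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.3)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# One toy batch of (query, positive doc, negative doc) token-id tensors.
query = torch.randint(0, 30000, (8, 32))
pos_doc = torch.randint(0, 30000, (8, 128))
neg_doc = torch.randint(0, 30000, (8, 128))

loss = loss_fn(encoder(query), encoder(pos_doc), encoder(neg_doc))
loss.backward()
optimizer.step()
print(float(loss))
```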
In Domain Generalized Video Semantic Segmentation (DGVSS), a model is trained on a single labeled driving domain and deployed directly on unseen domains, without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose a Masked Temporal Consistency Loss that regularizes temporal prediction discrepancies across different strides, and we randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
https://arxiv.org/abs/2602.09648
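A minimal sketch, under assumed definitions, of a masked temporal consistency term with randomized training strides: predictions at two frames a random stride apart are pulled together only where a confidence mask marks the region as label-stable. The masking rule, loss form, and stride range are illustrative assumptions, not the paper's exact formulation.

```python
import random
import torch
import torch.nn.functional as F

def masked_temporal_consistency(logits_t, logits_tk, conf_thresh=0.9):
    """Penalize prediction changes between two frames only in regions
    where the earlier frame is confidently predicted (assumed label-stable)."""
    prob_t = logits_t.softmax(dim=1)
    prob_tk = logits_tk.softmax(dim=1)
    mask = (prob_t.max(dim=1).values > conf_thresh).float()       # B x H x W
    per_pixel = F.l1_loss(prob_tk, prob_t, reduction="none").mean(dim=1)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

def sample_pair(video_logits, max_stride=6):
    """Randomized training stride exposes the model to diverse temporal gaps."""
    t = random.randrange(video_logits.shape[0] - max_stride)
    k = random.randrange(1, max_stride + 1)
    return video_logits[t], video_logits[t + k]

video_logits = torch.randn(16, 2, 19, 64, 128)   # T x B x C x H x W
a, b = sample_pair(video_logits)
print(float(masked_temporal_consistency(a, b)))
```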
Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps the existing deep learning-based approaches for detecting and classifying visual pollution, and works toward a comprehensive application framework for visual pollution management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, yielding 26 articles. Most research focuses on specific pollutant categories and employs variants of the YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, and those that do tend to be geographically skewed. We propose a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution for a given area. This review highlights the need for a unified UVP management system that incorporates a pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.
https://arxiv.org/abs/2602.09446
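The proposed visual pollution index is not fully specified in the abstract; the sketch below shows one assumed way per-category detection counts could be turned into an area-normalized severity score, with invented categories and weights.

```python
def visual_pollution_index(detections, area_km2, weights=None):
    """Weighted, area-normalized severity score; higher means more polluted.
    The category weights and normalization are assumed, not from the review."""
    weights = weights or {"billboard": 1.0, "graffiti": 0.6,
                          "dangling_cables": 1.2, "litter": 0.8}
    raw = sum(weights.get(cat, 1.0) * count for cat, count in detections.items())
    return raw / max(area_km2, 1e-6)

print(visual_pollution_index({"billboard": 42, "litter": 120}, area_km2=3.5))
```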
This research applies artificial intelligence (AI) to separate, cluster, and analyze cardiorespiratory sounds. We recorded a new dataset (HLS-CMDS) and developed several AI models, including generative AI methods based on large language models (LLMs) for guided separation, explainable AI (XAI) techniques to interpret latent representations, variational autoencoders (VAEs) for waveform separation, a chemistry-inspired non-negative matrix factorization (NMF) algorithm for clustering, and a quantum convolutional neural network (QCNN) designed to detect abnormal physiological patterns. The performance of these AI models depends on the quality of the recorded signals. Therefore, this thesis also reviews the biosensing technologies used to capture biomedical data. It summarizes developments in microelectromechanical systems (MEMS) acoustic sensors and quantum biosensors, such as quantum dots and nitrogen-vacancy centers. It further outlines the transition from electronic integrated circuits (EICs) to photonic integrated circuits (PICs) and early progress toward integrated quantum photonics (IQP) for chip-based biosensing. Together, these studies show how AI and next-generation sensors can support more intelligent diagnostic systems for future healthcare.
https://arxiv.org/abs/2602.09210
With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs, and of how such insights can inform principled methodology for designing optimal distributed systems, remains limited. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights that guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.
https://arxiv.org/abs/2602.09109
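As a small worked example of the kind of cost model the survey discusses, the sketch below uses the standard ring all-reduce communication estimate to compare data-parallel gradient synchronization cost across device counts; the bandwidth and model size are illustrative assumptions.

```python
def ring_allreduce_seconds(bytes_total: float, n_devices: int,
                           bandwidth_gbps: float = 100.0) -> float:
    """Classic ring all-reduce moves 2*(n-1)/n of the data per device;
    time is that volume divided by per-link bandwidth (latency ignored)."""
    volume = 2 * (n_devices - 1) / n_devices * bytes_total
    return volume / (bandwidth_gbps * 1e9 / 8)   # gigabits/s -> bytes/s

grad_bytes = 7e9 * 2          # e.g. a 7B-parameter model's gradients in fp16
for p in (2, 8, 64, 512):
    t = ring_allreduce_seconds(grad_bytes, p)
    print(f"{p:4d} devices -> {t:.2f} s per synchronization")
```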
Academic peer review remains the cornerstone of scholarly validation, yet the field faces challenges in both data and methods. From the data perspective, existing research is hindered by the scarcity of large-scale, verified benchmarks and by oversimplified evaluation metrics that fail to reflect real-world editorial workflows. To bridge this gap, we present OmniReview, a comprehensive dataset constructed by integrating multi-source academic platforms encompassing comprehensive scholarly profiles through a disambiguation pipeline, yielding 202,756 verified review records. Based on this data, we introduce a three-tier hierarchical evaluation framework to assess recommendations from recall to precise expert identification. From the method perspective, existing embedding-based approaches suffer from the information bottleneck of semantic compression and limited interpretability. To resolve these limitations, we propose Profiling Scholars with Multi-gate Mixture-of-Experts (Pro-MMoE), a novel framework that synergizes Large Language Models (LLMs) with Multi-task Learning. Specifically, it utilizes LLM-generated semantic profiles to preserve fine-grained expertise nuances and interpretability, while employing a Task-Adaptive MMoE architecture to dynamically balance conflicting evaluation goals. Comprehensive experiments demonstrate that Pro-MMoE achieves state-of-the-art performance on six of seven metrics, establishing a new benchmark for realistic reviewer recommendation.
https://arxiv.org/abs/2602.08896
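A minimal sketch of the multi-gate mixture-of-experts layer Pro-MMoE builds on: shared experts, one softmax gate per task, and task-specific towers. Dimensions and the two tasks are illustrative; the paper's task-adaptive variant and LLM-generated profile inputs are not modeled here.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Shared experts with one gating network per task (standard MMoE style)."""
    def __init__(self, in_dim=256, expert_dim=128, n_experts=4, n_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(n_tasks)])

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # B x E x D
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = gate(x).softmax(dim=-1).unsqueeze(-1)                  # B x E x 1
            mixed = (w * expert_out).sum(dim=1)                        # B x D
            outputs.append(tower(mixed))
        return outputs

scores = MMoE()(torch.randn(32, 256))
print([s.shape for s in scores])  # one score tensor per task
```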
Each year, thousands of patients in need of heart transplants face life-threatening wait times due to organ scarcity. While allocation policies aim to maximize population-level outcomes, current approaches often fail to account for the dynamic arrival of organs and the composition of waitlisted candidates, thereby hampering efficiency. The United States is transitioning from rigid, rule-based allocation to more flexible data-driven models. In this paper, we propose a novel framework for non-myopic policy optimization in general online matching relying on potentials, a concept originally introduced for kidney exchange. We develop scalable and accurate ways of learning potentials that are higher-dimensional and more expressive than prior approaches. Our approach is a form of self-supervised imitation learning: the potentials are trained to mimic an omniscient algorithm that has perfect foresight. We focus on the application of heart transplant allocation and demonstrate, using real historical data, that our policies significantly outperform prior approaches -- including the current US status quo policy and the proposed continuous distribution framework -- in optimizing for population-level outcomes. Our analysis and methods come at a pivotal moment in US policy, as the current heart transplant allocation system is under review. We propose a scalable and theoretically grounded path toward more effective organ allocation.
https://arxiv.org/abs/2602.08878
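A minimal sketch of potential-based online matching in the spirit described: each arriving organ goes to the compatible candidate whose marginal value (immediate benefit minus the candidate's potential, i.e. their option value of waiting) is largest and positive. The benefit and potential functions are placeholders, not the paper's learned models.

```python
def allocate(organ, waitlist, benefit, potential):
    """Greedy non-myopic rule: match the organ to the candidate maximizing
    benefit(organ, c) - potential(c); leave it unmatched if no candidate
    beats their own option value of waiting."""
    best, best_gain = None, 0.0
    for c in waitlist:
        gain = benefit(organ, c) - potential(c)
        if gain > best_gain:
            best, best_gain = c, gain
    return best

# Toy illustration with hand-set numbers (placeholders for learned models).
waitlist = ["c1", "c2", "c3"]
benefit = lambda organ, c: {"c1": 5.0, "c2": 8.0, "c3": 6.5}[c]
potential = lambda c: {"c1": 2.0, "c2": 7.5, "c3": 3.0}[c]
print(allocate("organ_42", waitlist, benefit, potential))  # "c3"
```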
The use of Large Language Models (LLMs) has drawn growing interest within the scientific community. LLMs can handle large volumes of textual data and support methods for evidence synthesis. Although recent studies highlight the potential of LLMs to accelerate screening and data extraction steps in systematic reviews, detailed reports of their practical application throughout the entire process remain scarce. This paper presents an experience report on the conduct of a systematic mapping study with the support of LLMs, describing the steps followed, the necessary adjustments, and the main challenges faced. Positive aspects are discussed, such as (i) the significant reduction of time spent on repetitive tasks and (ii) greater standardization in data extraction, as well as negative aspects, including (i) considerable effort to build reliable, well-structured prompts, especially for less experienced users, since achieving effective prompts may require several iterations and testing, which can partially offset the expected time savings, (ii) the occurrence of hallucinations, and (iii) the need for constant manual verification. As a contribution, this work offers lessons learned and practical recommendations for researchers interested in adopting LLMs in systematic mappings and reviews, highlighting both efficiency gains and methodological risks and limitations to be considered.
https://arxiv.org/abs/2602.10147
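As one assumed illustration of the "reliable, well-structured prompts" the report discusses, the sketch below pairs a structured screening prompt with strict output parsing and a manual-review fallback; call_llm is a placeholder for whatever API is used, and the inclusion criteria are invented.

```python
import json

SCREENING_PROMPT = """You are screening papers for a systematic mapping study.
Inclusion criteria: (1) peer-reviewed; (2) applies LLMs to software engineering.
Given the title and abstract below, answer ONLY with JSON:
{{"include": true or false, "criterion_failed": <number or null>, "quote": "<supporting sentence>"}}

Title: {title}
Abstract: {abstract}
"""

def screen(title: str, abstract: str, call_llm) -> dict:
    """Ask the model, force-parse the JSON, and flag anything malformed
    for manual verification instead of silently trusting it."""
    raw = call_llm(SCREENING_PROMPT.format(title=title, abstract=abstract))
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        decision = {"include": None, "needs_manual_review": True, "raw": raw}
    return decision

# Usage: screen(paper["title"], paper["abstract"], call_llm=my_provider_call)
```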
The evaluation of eXplainable Artificial Intelligence (XAI) methods is a rapidly growing field, characterized by a wide variety of approaches. This diversity highlights the complexity of XAI evaluation, which, unlike traditional AI assessment, lacks a universally correct ground truth for the explanation, making objective evaluation challenging. One promising direction to address this issue involves the use of what we term Synthetic Artificial Intelligence Ground truth (SAIG) methods, which generate artificial ground truths to enable the direct evaluation of XAI techniques. This paper presents the first review and analysis of SAIG methods. We introduce a novel taxonomy to classify these approaches, identifying seven key features that distinguish different SAIG methods. Our comparative study reveals a concerning lack of consensus on the most effective XAI evaluation techniques, underscoring the need for further research and standardization in this area.
https://arxiv.org/abs/2602.08715
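A minimal sketch of the SAIG idea: construct data whose informative features are known by design, fit a model, and check whether an attribution method ranks those features on top. The data-generating process, classifier, and impurity-based importances are illustrative choices, not a specific method from the review.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.normal(size=(n, d))
# Synthetic ground truth: only features 0 and 3 determine the label.
y = ((1.5 * X[:, 0] - 2.0 * X[:, 3]) > 0).astype(int)
true_features = {0, 3}

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# "Explanation" under evaluation: impurity-based feature importances.
importances = model.feature_importances_
top2 = set(np.argsort(importances)[-2:])
precision_at_k = len(top2 & true_features) / 2
print(f"precision@2 against the synthetic ground truth: {precision_at_k:.0%}")
```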
We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries against a collection of opinion pieces, e.g. product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method is based on measuring the similarity between the claims in a given summary and those from the original reviews, thereby measuring the coverage and consistency of the generated summary. To do so, we rely on a simple approach to extract factual claims from texts, which we then compare and summarize into a suitable score. We demonstrate that the proposed metric attributes higher scores to similar claims, regardless of whether a claim is negated, paraphrased, or expanded, and that the score has a high correlation with human judgment when compared to state-of-the-art metrics.
https://arxiv.org/abs/2602.08709
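A minimal sketch of a claim-matching score in the spirit described: split the summary and source reviews into claim-like sentences, embed them (TF-IDF here as a stand-in for a stronger encoder), and report coverage and consistency as max-similarity averages. The naive sentence splitter and aggregation are simplifying assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def split_claims(text):
    # Naive stand-in for a real claim extractor.
    return [s.strip() for s in text.split(".") if s.strip()]

def coverage_consistency(summary, reviews):
    summary_claims = split_claims(summary)
    review_claims = [c for r in reviews for c in split_claims(r)]
    vec = TfidfVectorizer().fit(summary_claims + review_claims)
    S = vec.transform(summary_claims)
    R = vec.transform(review_claims)
    sim = cosine_similarity(R, S)                # |review claims| x |summary claims|
    coverage = sim.max(axis=1).mean()            # how well the summary covers the sources
    consistency = sim.max(axis=0).mean()         # how well summary claims are grounded
    return float(coverage), float(consistency)

reviews = ["Battery life is excellent. The screen scratches easily."]
summary = "Reviewers praise the battery life but complain the screen scratches."
print(coverage_consistency(summary, reviews))
```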
Point clouds are a prevalent 3D data representation format with significant application value in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges for transmission and storage, which hinders wide deployment. Therefore, point cloud compression plays a crucial role in practical applications for both human and machine perception optimization. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression: Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, namely AVS PCC. This new standardization effort has adopted many new coding tools and techniques that differ from its counterpart standards. This paper reviews the AVS PCC standard from two perspectives, i.e., the related technologies and performance comparisons.
https://arxiv.org/abs/2602.08613
Multi-agent collaboration has emerged as a promising paradigm for enhancing reasoning capabilities of Large Language Models (LLMs). However, existing approaches remain largely heuristic, lacking principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. Specifically, it remains unclear why multi-agent collaboration outperforms single-agent reasoning and which design choices contribute most to these gains, making it difficult to build better systems. We address this gap by introducing a unified theoretical framework that decomposes multi-agent reasoning gains into three conceptually independent dimensions: Exploration for diverse solution coverage, Information for high-fidelity feedback, and Aggregation for principled consensus. Through this lens, existing methods can be understood as special cases that optimize only subsets of these dimensions. Building upon this decomposition, a novel framework called PRISM (Propose-Review-Integrate Synthesis for Multi-agent Reasoning) is proposed, which jointly maximizes all three dimensions through role-based diversity, execution-grounded feedback with evidence-based cross-evaluation, and iterative synthesis with closed-loop validation. Extensive experiments across mathematical reasoning, code generation, and function calling benchmarks demonstrate that PRISM achieves state-of-the-art performance with superior compute-efficiency compared to methods optimizing partial dimensions. The theoretical framework provides actionable design principles for future multi-agent reasoning systems.
https://arxiv.org/abs/2602.08586
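A minimal sketch of a propose-review-integrate loop of the kind PRISM describes: role-conditioned proposers for exploration, execution-grounded critique for information, and a synthesizer with closed-loop validation for aggregation. call_llm and run_tests are placeholders; this is not the paper's exact algorithm or reward machinery.

```python
def prism_round(task, roles, call_llm, run_tests, max_iters=3):
    """One Propose-Review-Integrate cycle (illustrative, not the paper's exact loop)."""
    # 1) Exploration: role-conditioned proposals for diverse solution coverage.
    proposals = [call_llm(f"As a {role}, solve:\n{task}") for role in roles]

    for _ in range(max_iters):
        # 2) Information: execution-grounded feedback plus evidence-based critique.
        feedback = [f"{run_tests(p)}\n" + call_llm(f"Critique with evidence:\n{p}")
                    for p in proposals]

        # 3) Aggregation: synthesize a consensus answer and re-validate it.
        joined = "\n---\n".join(f"PROPOSAL:\n{p}\nFEEDBACK:\n{f}"
                                for p, f in zip(proposals, feedback))
        candidate = call_llm(f"Integrate the proposals into one solution:\n{joined}")
        if "PASS" in run_tests(candidate):        # closed-loop validation
            return candidate
        proposals = [candidate]                    # iterate on the synthesis
    return candidate
```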
Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at this https URL.
https://arxiv.org/abs/2602.08582
Local governance meeting records are official documents, in the form of minutes or transcripts, documenting how proposals, discussions, and procedural actions unfold during institutional meetings. While generally structured, these documents are often dense, bureaucratic, and highly heterogeneous across municipalities, exhibiting significant variation in language, terminology, structure, and overall organization. This heterogeneity makes them difficult for non-experts to interpret and challenging for intelligent automated systems to process, limiting public transparency and civic engagement. To address these challenges, computational methods can be employed to structure and interpret such complex documents. In particular, Natural Language Processing (NLP) offers well-established methods that can enhance the accessibility and interpretability of governmental records. In this focus article, we review foundational NLP tasks that support the structuring of local governance meeting documents. Specifically, we review three core tasks: document segmentation, domain-specific entity extraction and automatic text summarization, which are essential for navigating lengthy deliberations, identifying political actors and personal information, and generating concise representations of complex decision-making processes. In reviewing these tasks, we discuss methodological approaches, evaluation metrics, and publicly available resources, while highlighting domain-specific challenges such as data scarcity, privacy constraints, and source variability. By synthesizing existing work across these foundational tasks, this article provides a structured overview of how NLP can enhance the structuring and accessibility of local governance meeting records.
https://arxiv.org/abs/2602.08162
Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revisit key points discussed in a meeting, or to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) the shift in structure, from multiple speakers discussing information in a scattered fashion across several turns to a summary's sentences, and (ii) the shift in narration viewpoint, from the speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIALSUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: the DIALOGUE-LEVEL, which focuses on the broader speakers/turns, and the WITHIN-TURN-LEVEL, which focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset, composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary; extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance on the same. Code and inference dataset coming soon.
https://arxiv.org/abs/2602.08149
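A minimal sketch of how the two-level error annotations could be represented in code; the field names and example error types are illustrative, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class WithinTurnError:
    turn_id: int
    error_type: str          # e.g. "missing_information", "extrinsic_hallucination"
    summary_span: str        # offending (or missing) summary text

@dataclass
class DialogueLevelError:
    error_type: str          # e.g. "speaker_attribution", "turn_omitted"
    turn_ids: list = field(default_factory=list)

@dataclass
class AnnotatedSummary:
    dialogue_id: str
    summary: str
    dialogue_errors: list = field(default_factory=list)
    within_turn_errors: list = field(default_factory=list)

ann = AnnotatedSummary(
    dialogue_id="meeting_017",
    summary="The team agreed to ship on Friday.",
    dialogue_errors=[DialogueLevelError("turn_omitted", turn_ids=[12, 13])],
    within_turn_errors=[WithinTurnError(9, "extrinsic_hallucination", "on Friday")],
)
print(ann)
```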
Spreading dynamics is a central topic in the physics of complex systems and network science, providing a unified framework for understanding how information, behaviors, and diseases propagate through interactions among system units. In many propagation contexts, spreading processes are influenced by multiple interacting factors, such as information expression patterns, cultural contexts, living environments, cognitive preferences, and public policies, which are difficult to incorporate directly into classical modeling frameworks. Recently, large language models (LLMs) have exhibited strong capabilities in natural language understanding, reasoning, and generation, enabling explicit perception of semantic content and contextual cues in spreading processes, thereby supporting the analysis of the different influencing factors. Beyond serving as external analytical tools, LLMs can also act as interactive agents embedded in propagation systems, potentially influencing spreading pathways and feedback structures. Consequently, the roles and impacts of LLMs on spreading dynamics have become an active and rapidly growing research area across multiple research disciplines. This review provides a comprehensive overview of recent advances in applying LLMs to the study of spreading dynamics across two representative domains: digital epidemics, such as misinformation and rumors, and biological epidemics, including infectious disease outbreaks. We first examine the foundations of epidemic modeling from a complex-systems perspective and discuss how LLM-based approaches relate to traditional frameworks. We then systematically review recent studies from three key perspectives, which are epidemic modeling, epidemic detection and surveillance, and epidemic prediction and management, to clarify how LLMs enhance these areas. Finally, open challenges and potential research directions are discussed.
https://arxiv.org/abs/2602.08085
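For readers coming from the LLM side, a minimal discrete-time SIR simulation of the kind the review's epidemic-modeling foundations start from; parameter values are illustrative.

```python
def sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, steps=200):
    """Classic SIR compartments with discrete-time Euler updates."""
    s, i, r = s0, i0, 0.0
    history = []
    for _ in range(steps):
        new_inf = beta * s * i
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        history.append((s, i, r))
    return history

peak_i = max(i for _, i, _ in sir())
print(f"peak infected fraction: {peak_i:.2f}")
```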
The sparse Mixture of Experts (MoE) architecture has evolved into a powerful approach for scaling deep learning models to more parameters at comparable computation cost. As an important branch of large language models (LLMs), MoE models activate only a subset of experts based on a routing network. This sparse conditional computation mechanism significantly improves computational efficiency, paving a promising path toward greater scalability and cost-efficiency. It not only enhances downstream applications such as natural language processing, computer vision, and multimodal learning in various horizontal domains, but also exhibits broad applicability across vertical domains. Despite the growing popularity and application of MoE models across various domains, a systematic exploration of recent MoE advances in many important fields is still lacking. Existing surveys on MoE suffer from limitations such as incomplete coverage or a lack of in-depth exploration of key areas. This survey seeks to fill these gaps. In this paper, we first examine the foundational principles of MoE, with an in-depth exploration of its core components: the routing network and the expert network. Subsequently, we extend beyond the centralized paradigm to the decentralized paradigm, which unlocks the immense untapped potential of decentralized infrastructure, enables democratization of MoE development for broader communities, and delivers greater scalability and cost-efficiency. Furthermore, we focus on exploring its vertical-domain applications. Finally, we identify key challenges and promising future research directions. To the best of our knowledge, this survey is currently the most comprehensive review in the field of MoE. We aim for this article to serve as a valuable resource for both researchers and practitioners, enabling them to navigate and stay up-to-date with the latest advancements.
https://arxiv.org/abs/2602.08019
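A minimal sketch of the sparse top-k routing mechanism at the heart of the routing-network discussion; dimensions, the softmax over selected logits, and the dense dispatch loop are illustrative simplifications of production MoE layers.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Sparse MoE: the router picks k experts per token and mixes their outputs."""
    def __init__(self, dim=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                       # x: tokens x dim
        logits = self.router(x)                 # tokens x n_experts
        weights, idx = logits.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # dense loop; real systems use dispatch kernels
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

y = TopKMoELayer()(torch.randn(16, 512))
print(y.shape)
```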
Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.
https://arxiv.org/abs/2602.08003
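A minimal sketch of the greedy mutual-information selection rule: estimate I(Y; predictions of the chosen subset) from validation data with a plug-in estimator that treats the subset's joint predictions as one discrete symbol, and repeatedly add the model with the largest gain until the budget is spent. The estimator and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def mutual_information(y, joint_preds):
    """Plug-in estimate of I(Y; P_S), with the subset's predictions as one symbol."""
    joint = [tuple(row) for row in joint_preds]
    return entropy(y) + entropy(joint) - entropy(list(zip(y, joint)))

def greedy_select(y, preds, budget):
    """preds: dict model_name -> array of predicted labels on validation data."""
    chosen = []
    while len(chosen) < budget:
        def gain(m):
            cols = np.column_stack([preds[c] for c in chosen + [m]])
            return mutual_information(y, cols)
        best = max((m for m in preds if m not in chosen), key=gain)
        chosen.append(best)
    return chosen

# Toy example: model_c is nearly a duplicate of model_a, so the greedy rule
# pairs model_a with the less correlated model_b under a budget of two.
y = np.array([0, 1, 1, 0, 1, 0, 1, 1])
preds = {"model_a": np.array([0, 1, 1, 0, 0, 0, 1, 1]),
         "model_b": np.array([0, 1, 0, 0, 1, 1, 1, 1]),
         "model_c": np.array([0, 1, 1, 0, 0, 0, 1, 0])}
print(greedy_select(y, preds, budget=2))  # ['model_a', 'model_b']
```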