Histopathology analysis relies on Hematoxylin and Eosin (H&E) staining, but fluorescence microscopy offers complementary information. Converting fluorescence images to an H&E-like appearance can aid interpretation and integration with standard workflows. We present a Cycle-Consistent Adversarial Network (CycleGAN) approach for unpaired image-to-image translation from multi-channel fluorescence microscopy to pseudo-H&E-stained histopathology images. The method combines C01 and C02 fluorescence channels into RGB and learns a bidirectional mapping between fluorescence and H&E domains without paired training data. The architecture uses ResNet-based generators with residual blocks and PatchGAN discriminators, trained with adversarial, cycle-consistency, and identity losses. Experiments on fluorescence microscopy datasets show the model generates realistic pseudo-H&E images that preserve morphological structures while adopting H&E-like color characteristics. This enables visualization of fluorescence data in a format familiar to pathologists and supports integration with existing H&E-based analysis pipelines.
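The cycle-consistency and identity objectives described above can be sketched in a few lines. This is a toy illustration only: the stand-in "generators" below are simple functions, not the ResNet networks the paper trains, and images are flat pixel lists.

```python
# Toy sketch of CycleGAN's cycle-consistency and identity losses.
# G maps fluorescence -> H&E, F maps H&E -> fluorescence; both are
# stand-in functions here, not the paper's ResNet generators.

def l1(a, b):
    """Mean absolute error between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_loss(G, F, x_fluor, y_he):
    """|| F(G(x)) - x ||_1 + || G(F(y)) - y ||_1"""
    return l1(F(G(x_fluor)), x_fluor) + l1(G(F(y_he)), y_he)

def identity_loss(G, F, x_fluor, y_he):
    """Encourages G(y) ~ y and F(x) ~ x to stabilize the color mapping."""
    return l1(G(y_he), y_he) + l1(F(x_fluor), x_fluor)

# Perfectly inverse toy generators give zero cycle loss.
G = lambda img: [p * 2.0 for p in img]   # pretend fluorescence -> H&E
F = lambda img: [p / 2.0 for p in img]   # pretend H&E -> fluorescence

x = [0.1, 0.4, 0.9]
y = [0.2, 0.8, 1.8]
assert cycle_loss(G, F, x, y) == 0.0
```

With mutually inverse mappings the cycle term vanishes, while the identity term stays nonzero here because these toy generators rescale rather than preserve their inputs.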
https://arxiv.org/abs/2601.08776
Large Language Models are rapidly emerging as web-native interfaces to social platforms. On the social web, users frequently have ambiguous and dynamic goals, making complex intent understanding, rather than single-turn execution, the cornerstone of effective human-LLM collaboration. Existing approaches attempt to clarify user intents through sequential or parallel questioning, yet they fall short of addressing the core challenge: modeling the logical dependencies among clarification questions. Inspired by Cognitive Load Theory, we propose Prism, a novel framework for complex intent understanding that enables logically coherent and efficient intent clarification. Prism comprises four tailored modules: a complex intent decomposition module, which decomposes user intents into smaller, well-structured elements and identifies logical dependencies among them; a logical clarification generation module, which organizes clarification questions based on these dependencies to ensure coherent, low-friction interactions; an intent-aware reward module, which evaluates the quality of clarification trajectories via an intent-aware reward function and leverages Monte Carlo sampling to simulate user-LLM interactions for large-scale, high-quality training data generation; and a self-evolved intent tuning module, which iteratively refines the LLM's logical clarification capability through data-driven feedback and optimization. Prism consistently outperforms existing approaches across clarification interaction, intent execution, and cognitive load benchmarks. It achieves state-of-the-art logical consistency, reduces logical conflicts to 11.5%, increases user satisfaction by 14.4%, and decreases task completion time by 34.8%. All data and code are released.
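The ordering principle behind the first two modules, asking no clarification question before the questions it logically depends on, amounts to a topological sort of a dependency graph. A minimal sketch follows; the example intent and its dependencies are invented for illustration, and Prism presumably learns rather than hand-codes such graphs.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical decomposition of a "plan a group trip" intent into
# clarification questions; an edge means "must be asked before".
deps = {
    "budget": set(),
    "destination": {"budget"},            # depends on knowing the budget
    "dates": {"destination"},
    "accommodation": {"destination", "dates"},
}

order = list(TopologicalSorter(deps).static_order())

# Every question appears after all questions it depends on.
pos = {q: i for i, q in enumerate(order)}
assert all(pos[d] < pos[q] for q, ds in deps.items() for d in ds)
assert order[0] == "budget"
```

Any ordering satisfying the dependencies avoids the "logical conflicts" the paper measures, i.e., asking a question whose prerequisites are still unresolved.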
https://arxiv.org/abs/2601.08653
The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in terms of task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static, and thus subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects the actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 24,000 real-world claims from 108 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to updating VeriTaS going forward, establishing a leakage-resistant benchmark that supports meaningful AFC evaluation in the era of rapidly evolving foundation models. We will make the code and data publicly available.
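The verdict-mapping stage of such a pipeline can be pictured as a lookup that projects each organization's idiosyncratic labels onto one shared scale. The label set and scores below are invented examples for illustration, not VeriTaS's actual scoring scheme.

```python
# Illustrative normalization of heterogeneous fact-checker verdicts onto
# a single truthfulness scale in [0, 1]; labels and scores are invented,
# not the paper's actual disentangled scheme.

VERDICT_MAP = {
    # organization-specific label -> (normalized score, canonical label)
    "pants on fire": (0.0, "false"),
    "faux":          (0.0, "false"),    # French-language feed
    "falsch":        (0.0, "false"),    # German-language feed
    "mostly false":  (0.25, "mostly-false"),
    "half true":     (0.5, "mixed"),
    "mostly true":   (0.75, "mostly-true"),
    "true":          (1.0, "true"),
}

def normalize_verdict(raw: str):
    key = raw.strip().lower()
    if key not in VERDICT_MAP:
        return None, "unverifiable"     # unmapped labels fall through
    return VERDICT_MAP[key]

assert normalize_verdict("  Pants on Fire ") == (0.0, "false")
assert normalize_verdict("half TRUE") == (0.5, "mixed")
assert normalize_verdict("satire") == (None, "unverifiable")
```

A shared numeric scale is what makes verdicts from 108 organizations in 54 languages comparable in a single benchmark.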
https://arxiv.org/abs/2601.08611
Generative Adversarial Networks (GANs) face the significant challenge of striking an optimal balance between high-quality image generation and training stability. Recent techniques, such as DCGAN, BigGAN, and StyleGAN, improve visual fidelity; however, such techniques usually struggle with mode collapse and unstable gradients at high network depth. This paper proposes a novel GAN structural model that incorporates deeper inception-inspired convolutions and dilated convolutions. This novel model is termed the Inception Generative Adversarial Network (IGAN). The IGAN model generates high-quality synthetic images while maintaining training stability, by reducing mode collapse as well as preventing vanishing and exploding gradients. Our proposed IGAN model achieves Fréchet Inception Distance (FID) scores of 13.12 and 15.08 on the CUB-200 and ImageNet datasets, respectively, representing a 28-33% improvement in FID over state-of-the-art GANs. Additionally, the IGAN model attains Inception Scores (IS) of 9.27 and 68.25 on the two datasets, reflecting improved image diversity and generation quality. Finally, the two techniques of dropout and spectral normalization are utilized in both the generator and discriminator structures to further mitigate gradient explosion and overfitting. These findings confirm that the IGAN model can balance training stability with image generation quality, constituting a scalable and computationally efficient framework for high-fidelity image synthesis.
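Spectral normalization, one of the two stabilizers the abstract mentions, divides each weight matrix by its largest singular value so the layer becomes roughly 1-Lipschitz. The singular value is usually estimated by power iteration; a pure-Python sketch on a small matrix (not the paper's implementation, which would operate on the network's actual weight tensors):

```python
import math

def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W (list of rows) by power
    iteration, then return (sigma, W / sigma)."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        # u = W v
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # v = W^T u, renormalized
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [[w / sigma for w in row] for row in W]

# Diagonal example: singular values are 3 and 1, so sigma -> 3 and the
# normalized weight has spectral norm -> 1.
sigma, W_sn = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
assert abs(sigma - 3.0) < 1e-6
assert abs(W_sn[0][0] - 1.0) < 1e-6
```

Capping the spectral norm of every layer bounds how much gradients can be amplified through the discriminator, which is why it helps against exploding gradients.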
https://arxiv.org/abs/2601.08332
General intelligence must reorganize experience into internal structures that enable prediction and action under finite resources. Existing systems implicitly presuppose fixed primitive units -- tokens, subwords, pixels, or predefined sensor channels -- thereby bypassing the question of how representational units themselves emerge and stabilize. This paper proposes SANC(E3), an axiomatic framework in which representational units are not given a priori but instead arise as stable outcomes of competitive selection, reconstruction, and compression under finite activation capacity, governed by the explicit minimization of an energy functional E3. SANC(E3) draws a principled distinction between system tokens -- structural anchors such as {here, now, I} and sensory sources -- and tokens that emerge through self-organization during co-occurring events. Five core axioms formalize finite capacity, association from co-occurrence, similarity-based competition, confidence-based stabilization, and the reconstruction-compression-update trade-off. A key feature is a pseudo-memory-mapped I/O mechanism, through which internally replayed Gestalts are processed via the same axiomatic pathway as external sensory input. As a result, perception, imagination, prediction, planning, and action are unified within a single representational and energetic process. From the axioms, twelve propositions are derived, showing that category formation, hierarchical organization, unsupervised learning, and high-level cognitive activities can all be understood as instances of Gestalt completion under E3 minimization.
https://arxiv.org/abs/2601.08224
Global frameworks increasingly advocate for Responsible Artificial Intelligence (AI) in education, yet they provide limited guidance on how ethical, culturally responsive, and curriculum-aligned AI can be operationalized within functioning teacher education systems, particularly in the Global South. This study addresses this gap through the design and evaluation of GenAITEd Ghana, a context-aware, region-specific conversational AI prototype developed to support teacher education in Ghana. Guided by a Design Science Research approach, the system was developed as a school-mimetic digital infrastructure aligned with the organizational logic of Ghanaian Colleges of Education and the National Council for Curriculum and Assessment (NaCCA) framework. GenAITEd Ghana operates as a multi-agent, retrieval-augmented conversational AI that coordinates multiple models for curriculum-grounded dialogue, automatic speech recognition, voice synthesis, and multimedia interaction. Two complementary prompt pathways were embedded: system-level prompts that enforce curriculum boundaries, ethical constraints, and teacher-in-the-loop oversight, and interaction-level semi-automated prompts that structure live pedagogical dialogue through clarification, confirmation, and guided response generation. Evaluation findings show that the system effectively enacted key Responsible AI principles, including transparency, accountability, cultural responsiveness, privacy, and human oversight. Human expert evaluations further indicated that GenAITEd Ghana is pedagogically appropriate for Ghanaian teacher education, promoting student engagement while preserving educators' professional authority. Identified challenges highlight the need for continued model integration, professional development, and critical AI literacy to mitigate risks of over-reliance.
https://arxiv.org/abs/2601.06093
Agentic memory systems have become critical for enabling LLM agents to maintain long-term context and retrieve relevant information efficiently. However, existing memory frameworks suffer from a fundamental limitation: they perform exhaustive retrieval across the entire storage layer regardless of query characteristics. This brute-force approach creates severe latency bottlenecks as memory grows, hindering real-time agent interactions. We propose SwiftMem, a query-aware agentic memory system that achieves sub-linear retrieval through specialized indexing over temporal and semantic dimensions. Our temporal index enables logarithmic-time range queries for time-sensitive retrieval, while the semantic DAG-Tag index maps queries to relevant topics through hierarchical tag structures. To address memory fragmentation during growth, we introduce an embedding-tag co-consolidation mechanism that reorganizes storage based on semantic clusters to improve cache locality. Experiments on LoCoMo and LongMemEval benchmarks demonstrate that SwiftMem achieves 47× faster search compared to state-of-the-art baselines while maintaining competitive accuracy, enabling practical deployment of memory-augmented LLM agents.
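A logarithmic-time temporal range query is classically obtained by binary search over sorted timestamps. The sketch below illustrates that idea with the standard-library `bisect` module; it is a minimal stand-in, since SwiftMem's actual index structure is not specified at this level of detail.

```python
import bisect

class TemporalIndex:
    """Sorted-timestamp index giving O(log n) range lookups over memories
    (plus O(n) insertion in this toy list-backed version)."""

    def __init__(self):
        self._ts = []      # sorted timestamps
        self._items = []   # memory payloads, parallel to _ts

    def add(self, timestamp, memory):
        i = bisect.bisect_left(self._ts, timestamp)
        self._ts.insert(i, timestamp)
        self._items.insert(i, memory)

    def range_query(self, start, end):
        """All memories with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._ts, start)
        hi = bisect.bisect_right(self._ts, end)
        return self._items[lo:hi]

idx = TemporalIndex()
for t, m in [(3, "lunch"), (1, "wake"), (7, "meeting"), (5, "email")]:
    idx.add(t, m)
assert idx.range_query(2, 6) == ["lunch", "email"]
```

The contrast with the "brute-force" baselines the abstract criticizes is that only the two boundary positions are searched, instead of scoring every stored memory against the query.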
https://arxiv.org/abs/2601.08160
The development of robust artificial intelligence models for histopathology diagnosis is severely constrained by the scarcity of expert-annotated lesion data, particularly for rare pathologies and underrepresented disease subtypes. While data augmentation offers a potential solution, existing methods fail to generate sufficiently realistic lesion morphologies that preserve the complex spatial relationships and cellular architectures characteristic of histopathological tissues. Here we present PathoGen, a diffusion-based generative model that enables controllable, high-fidelity inpainting of lesions into benign histopathology images. Unlike conventional augmentation techniques, PathoGen leverages the iterative refinement process of diffusion models to synthesize lesions with natural tissue boundaries, preserved cellular structures, and authentic staining characteristics. We validate PathoGen across four diverse datasets representing distinct diagnostic challenges: kidney, skin, breast, and prostate pathology. Quantitative assessment confirms that PathoGen outperforms state-of-the-art generative baselines, including conditional GAN and Stable Diffusion, in image fidelity and distributional similarity. Crucially, we show that augmenting training sets with PathoGen-synthesized lesions enhances downstream segmentation performance compared to traditional geometric augmentations, particularly in data-scarce regimes. Moreover, by simultaneously generating realistic morphology and pixel-level ground truth, PathoGen effectively overcomes the manual annotation bottleneck. This approach offers a scalable pathway for developing generalizable medical AI systems despite limited expert-labeled data.
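Diffusion-based inpainting commonly recomposes each denoising step from two sources: inside the lesion mask the model's synthesized estimate is kept, and outside it the known benign tissue is re-imposed. The schematic below shows that per-step blend (it is not PathoGen's actual sampler; images are flat pixel lists for brevity).

```python
# Schematic of mask-guided inpainting during reverse diffusion: inside
# the lesion mask keep the model's denoised estimate, outside re-impose
# the (appropriately noised) benign tissue.

def inpaint_step(denoised, known, mask):
    """mask[i] == 1 -> synthesized lesion pixel, 0 -> benign pixel."""
    return [m * d + (1 - m) * k
            for d, k, m in zip(denoised, known, mask)]

benign   = [0.9, 0.9, 0.9, 0.9]   # benign tissue at this noise level
denoised = [0.2, 0.3, 0.2, 0.3]   # model's lesion proposal
mask     = [0, 1, 1, 0]           # lesion region: middle two pixels

out = inpaint_step(denoised, benign, mask)
assert out == [0.9, 0.3, 0.2, 0.9]
```

Note the mask itself doubles as the pixel-level segmentation ground truth, which is how the abstract's "simultaneously generating realistic morphology and pixel-level ground truth" sidesteps manual annotation.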
https://arxiv.org/abs/2601.08127
Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.
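The three context operations named above (pruning invalid steps, folding completed sub-trajectories, and keeping a high-salience backbone under a budget) can be caricatured as list manipulation. The step schema, salience scores, and selection rule below are all invented for illustration; MemoBrain's actual data model is richer than this.

```python
# Toy MemoBrain-style context management: drop invalid steps, fold
# finished sub-trajectories into one-line summaries, keep the most
# salient remainder under a fixed step budget.

def manage_context(steps, budget):
    kept = []
    for s in steps:
        if s.get("invalid"):
            continue                      # prune invalid steps
        if s.get("done"):                 # fold a completed sub-trajectory
            s = {"text": "[folded] " + s["summary"],
                 "salience": s["salience"]}
        kept.append(s)
    # retain the highest-salience steps, preserving original order
    top = sorted(kept, key=lambda s: -s["salience"])[:budget]
    return [s["text"] for s in kept if s in top]

steps = [
    {"text": "plan query", "salience": 0.9},
    {"text": "bad tool call", "salience": 0.1, "invalid": True},
    {"text": "search docs", "summary": "found API spec",
     "salience": 0.8, "done": True},
    {"text": "minor note", "salience": 0.2},
    {"text": "draft answer", "salience": 0.7},
]
assert manage_context(steps, budget=3) == [
    "plan query", "[folded] found API spec", "draft answer"]
```

The point of the exercise is that the surviving context is a compact, ordered backbone rather than a raw transcript, which is what keeps long-horizon trajectories inside the model's bounded window.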
https://arxiv.org/abs/2601.08079
Frontier AI regulations primarily focus on systems deployed to external users, where deployment is more visible and subject to outside scrutiny. However, high-stakes applications can occur internally when companies deploy highly capable systems within their own organizations, such as for automating R&D, accelerating critical business processes, and handling sensitive proprietary data. This paper examines how frontier AI regulations in the United States and European Union in 2025 handle internal deployment. We identify three gaps that could cause internally-deployed systems to evade intended oversight: (1) scope ambiguity that allows internal systems to evade regulatory obligations, (2) point-in-time compliance assessments that fail to capture the continuous evolution of internal systems, and (3) information asymmetries that subvert regulatory awareness and oversight. We then analyze why these gaps persist, examining tensions around measurability, incentives, and information access. Finally, we map potential approaches to address them and their associated tradeoffs. By understanding these patterns, we hope that policy choices around internally deployed AI systems can be made deliberately rather than incidentally.
https://arxiv.org/abs/2601.08005
The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German languages by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.
https://arxiv.org/abs/2601.07985
Generative AI models ought to be useful and safe across cross-cultural contexts. One critical step toward this goal is understanding how AI models adhere to sociocultural norms. While this challenge has gained attention in NLP, existing work lacks both nuance and coverage in understanding and evaluating models' norm adherence. We address these gaps by introducing a taxonomy of norms that clarifies their contexts (e.g., distinguishing between human-human norms that models should recognize and human-AI interactional norms that apply to the human-AI interaction itself), specifications (e.g., relevant domains), and mechanisms (e.g., modes of enforcement). We demonstrate how our taxonomy can be operationalized to automatically evaluate models' norm adherence in naturalistic, open-ended settings. Our exploratory analyses suggest that state-of-the-art models frequently violate norms, though violation rates vary by model, interactional context, and country. We further show that violation rates also vary by prompt intent and situational framing. Our taxonomy and demonstrative evaluation pipeline enable nuanced, context-sensitive evaluation of cultural norm adherence in realistic settings.
https://arxiv.org/abs/2601.07973
Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in both intra-dataset and cross-dataset settings.
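Of the three generation methodologies, character permutation is the simplest to picture: new plate strings are produced by shuffling letters among letter slots and digits among digit slots, so every output still matches the plate layout. The sketch below assumes a hypothetical `LLL-DDDD` layout for concreteness; the paper covers plates from many regions with differing layouts.

```python
import random

def permute_plate(plate, rng):
    """Shuffle letters among letter positions and digits among digit
    positions, keeping separators fixed, so the layout is preserved."""
    letters = [c for c in plate if c.isalpha()]
    digits  = [c for c in plate if c.isdigit()]
    rng.shuffle(letters)
    rng.shuffle(digits)
    out = []
    for c in plate:
        if c.isalpha():
            out.append(letters.pop(0))
        elif c.isdigit():
            out.append(digits.pop(0))
        else:
            out.append(c)               # keep separators like '-'
    return "".join(out)

rng = random.Random(0)
new = permute_plate("ABC-1234", rng)
assert sorted(new) == sorted("ABC-1234")                  # same characters
assert new[3] == "-" and new[:3].isalpha() and new[4:].isdigit()
```

Paired with template-based rendering, each permuted string yields a new labeled training image essentially for free.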
https://arxiv.org/abs/2601.07671
Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories, capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations, and Chapter Summaries that compose them into concise, story-centric summaries. Rather than prompting for chapters directly, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built on these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens than existing methods.
https://arxiv.org/abs/2601.07366
Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
https://arxiv.org/abs/2601.07220
The proliferation of online social networks has significantly reshaped the way individuals access and engage with information. While these platforms offer unprecedented connectivity, they may foster environments where users are increasingly exposed to homogeneous content and like-minded interactions. Such dynamics are associated with selective exposure and the emergence of filter bubbles, echo chambers, tunnel vision, and polarization, which together can contribute to ideological isolation and raise concerns about information diversity and public discourse. This survey provides a comprehensive computational review of existing studies that define, analyze, quantify, and mitigate ideological isolation in online social networks. We examine the mechanisms underlying content personalization, user behavior patterns, and network structures that reinforce content-exposure concentration and narrowing dynamics. This paper also systematically reviews methodological approaches for detecting and measuring these isolation-related phenomena, covering network-, content-, and behavior-based metrics. We further organize computational mitigation strategies, including network-topological interventions and recommendation-level controls, and discuss their trade-offs and deployment considerations. By integrating definitions, metrics, and interventions across structural/topological, content-based, interactional, and cognitive isolation, this survey provides a unified computational framework. It serves as a reference for understanding and addressing the key challenges and opportunities in promoting information diversity and reducing ideological fragmentation in the digital age.
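Among the network-based isolation metrics the survey covers, one classic and easily computed example is the Krackhardt E-I index: (external − internal) / total edges, ranging from −1 (all ties stay inside the group) to +1 (all ties cross groups). The toy graph below is invented for illustration.

```python
# E-I index on a toy two-community interaction graph: edges are pairs
# of users, `group` assigns each user to an ideological community.

def ei_index(edges, group):
    external = sum(group[a] != group[b] for a, b in edges)
    internal = len(edges) - external
    return (external - internal) / len(edges)

group = {"u1": "A", "u2": "A", "u3": "A", "u4": "B", "u5": "B"}
edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"),   # internal to A
         ("u4", "u5"),                                # internal to B
         ("u3", "u4")]                                # cross-community

# 1 external vs. 4 internal edges -> strongly enclosed communities.
assert ei_index(edges, group) == (1 - 4) / 5
```

Strongly negative values of this kind are one operational signature of the echo-chamber and filter-bubble dynamics discussed above.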
https://arxiv.org/abs/2601.07884
Low-dose Positron Emission Tomography (PET) imaging reduces patient radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Although diffusion models have demonstrated strong denoising capability, their stochastic nature makes it challenging to enforce anatomically consistent structures, particularly in low signal-to-noise regimes and volumetric whole-body imaging. We propose Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations to guide volumetric PET denoising. By injecting wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, WCC-Net decouples anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. Extensive experiments demonstrate that WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On the internal 1/20-dose test set, WCC-Net improves PSNR by +1.21 dB and SSIM by +0.008 over a strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Moreover, WCC-Net generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.
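The frequency-domain prior at the heart of WCC-Net is a wavelet decomposition: approximation coefficients capture coarse anatomy while detail coefficients capture edges. As the simplest possible illustration of that decomposition (WCC-Net itself works with multi-level, volumetric transforms), here is a one-level 1-D Haar transform and its exact inverse:

```python
import math

def haar_1level(signal):
    """One-level 1-D Haar transform of an even-length sequence."""
    s = 1 / math.sqrt(2)
    approx = [(a + b) * s for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) * s for a, b in zip(signal[::2], signal[1::2])]
    return approx, detail

def inverse_haar(approx, detail):
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) * s, (a - d) * s]
    return out

x = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_1level(x)
assert detail[1] == 0.0                 # flat region -> zero detail
rec = inverse_haar(approx, detail)
assert all(abs(r - v) < 1e-12 for r, v in zip(rec, x))
```

Because noise concentrates in the detail bands while anatomy dominates the approximation band, conditioning the diffusion model on such coefficients gives it structural guidance that is largely decoupled from the noise it must remove.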
https://arxiv.org/abs/2601.07093
With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench comprises 8,151 data samples across 21 distinct subtasks, organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at this https URL
https://arxiv.org/abs/2601.07020
As LLM-based agents operate over sequential multi-step reasoning, hallucinations arising at intermediate steps risk propagating along the trajectory, thus degrading overall reliability. Unlike hallucination detection in single-turn responses, diagnosing hallucinations in multi-step workflows requires identifying which step causes the initial divergence. To fill this gap, we propose a new research task, automated hallucination attribution of LLM-based agents, aiming to identify the step responsible for the hallucination and explain why. To support this task, we introduce AgentHallu, a comprehensive benchmark with: (1) 693 high-quality trajectories spanning 7 agent frameworks and 5 domains, (2) a hallucination taxonomy organized into 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, and Tool-Use) and 14 sub-categories, and (3) multi-level annotations curated by humans, covering binary labels, hallucination-responsible steps, and causal explanations. We evaluate 13 leading models, and results show the task is challenging even for top-tier models (like GPT-5, Gemini-2.5-Pro). The best-performing model achieves only 41.1% step localization accuracy, where tool-use hallucinations are the most challenging at just 11.6%. We believe AgentHallu will catalyze future research into developing robust, transparent, and reliable agentic systems.
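The headline metric, step localization accuracy, is an exact-match rate between the model's predicted hallucination-responsible step and the human annotation. A minimal sketch under an assumed schema (trajectory id mapped to a step index; the benchmark's actual annotation format may differ):

```python
def step_localization_accuracy(predictions, annotations):
    """Fraction of trajectories where the predicted hallucination-
    responsible step index exactly matches the human annotation.
    Both arguments map trajectory id -> step index (hypothetical schema)."""
    hits = sum(1 for tid, step in annotations.items()
               if predictions.get(tid) == step)
    return hits / len(annotations)

# Toy run: the model pins the right step in 2 of 3 trajectories.
gold = {"traj-1": 3, "traj-2": 0, "traj-3": 5}
pred = {"traj-1": 3, "traj-2": 1, "traj-3": 5}
print(step_localization_accuracy(pred, gold))  # 2/3
```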
https://arxiv.org/abs/2601.06818
We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
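The difficulty tags in Ganit are derived from an evaluator model's pass@k. The abstract does not specify the estimator, but a common choice in the code-generation literature is the unbiased estimate 1 − C(n−c, k)/C(n, k) from n sampled attempts with c correct; a sketch of that estimator together with a hypothetical thresholding scheme for tagging (the cutoffs here are illustrative, not the paper's):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    attempts drawn without replacement from n attempts (c of them
    correct) is correct. Requires k <= n."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def difficulty_tag(n, c, k=4, easy=0.8, hard=0.3):
    """Map an evaluator's pass@k to a coarse difficulty label
    (hypothetical thresholds)."""
    p = pass_at_k(n, c, k)
    if p >= easy:
        return "easy"
    if p <= hard:
        return "hard"
    return "medium"

print(pass_at_k(10, 10, 4))  # 1.0: always solved
print(pass_at_k(10, 0, 4))   # 0.0: never solved
print(difficulty_tag(10, 1)) # 1/10 attempts correct -> pass@4 = 0.4
```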
https://arxiv.org/abs/2601.06767