This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ) and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating the reliance on model ensembles, redundant weights, and other computationally expensive components common in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single-image super-resolution, the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, split into training, validation, and test sets at an 8:1:1 ratio. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. The challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, contributing significantly to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at this https URL (ChallengeCVPR-NTIRE2025).
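As a rough illustration of the 8:1:1 protocol described above, a shuffle-and-slice split might look like the following sketch (the item names and seed are hypothetical; the challenge's official partition is fixed by the organizers):

```python
import random

def split_8_1_1(items, seed=0):
    """Partition a list into train/val/test subsets at an 8:1:1 ratio."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

# Hypothetical usage on the 1,800 synthetic S-UGC pairs:
splits = split_8_1_1([f"pair_{i:04d}" for i in range(1800)])
print({k: len(v) for k, v in splits.items()})  # {'train': 1440, 'val': 180, 'test': 180}
```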
https://arxiv.org/abs/2504.13131
Do Large Language Models (LLMs) hold positions that conflict with your country's values? Occasionally they do! However, existing work primarily focuses on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics, and a generation process with a Conflict Reduction mechanism to filter out non-conflicting items. We conduct extensive experiments on various LLMs across countries, and the results provide insights that assist in identifying misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs' values with those of the target country.
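The three-stage pipeline lends itself to a simple functional sketch. The stage logic below is invented for illustration only; the paper's actual instruction tagger, topic screener, and Conflict Reduction mechanism are learned components, not keyword rules:

```python
def tag_instructions(raw_docs):
    # Stage 1 (modeling): attach instruction-style tags to raw source passages.
    return [{"text": d, "tags": ["policy" if "law" in d.lower() else "general"]}
            for d in raw_docs]

def screen_value_topics(tagged):
    # Stage 2 (screening): keep only passages whose tags touch national values.
    return [t for t in tagged if "policy" in t["tags"]]

def reduce_non_conflicting(items, is_conflicting):
    # Stage 3 (generation + Conflict Reduction): drop items that no model would
    # answer divergently, keeping only genuinely value-conflicting ones.
    return [i for i in items if is_conflicting(i)]

docs = ["A new privacy law was enacted.", "The weather was pleasant."]
dataset = reduce_non_conflicting(
    screen_value_topics(tag_instructions(docs)),
    is_conflicting=lambda item: True,  # placeholder judge, stands in for the real mechanism
)
```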
https://arxiv.org/abs/2504.12911
The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well suited to simpler translation tasks, multi-agent systems, in which multiple specialized AI agents collaborate in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may significantly improve domain adaptability and contextual awareness, with translation quality superior to that of traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT and their integration into professional translation workflows, and shares a demo of the system analyzed in the paper.
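To make the four-stage workflow concrete, here is a minimal sketch of a sequential agent pipeline, assuming a generic text-in/text-out `call_llm` backend (a hypothetical interface; the paper's actual agents, prompts, and orchestration differ):

```python
def legal_mt_pipeline(source_text: str, call_llm, src="English", tgt="German") -> str:
    """Chain four specialized agents: translate, review adequacy, review fluency, edit."""
    # (i) Translation agent produces a first draft.
    draft = call_llm(f"Translate this {src} legal text into {tgt}:\n{source_text}")
    # (ii) Adequacy agent checks meaning and legal terminology against the source.
    adequacy = call_llm(
        f"Source:\n{source_text}\n\nDraft:\n{draft}\n\n"
        "List adequacy issues (omissions, mistranslated legal terms)."
    )
    # (iii) Fluency agent reviews target-language grammar and style only.
    fluency = call_llm(f"Draft:\n{draft}\n\nList fluency issues in the {tgt} text.")
    # (iv) Editing agent merges both reviews into the final version.
    return call_llm(
        "Revise the draft so it resolves every issue raised in the reviews.\n"
        f"Draft:\n{draft}\n\nAdequacy review:\n{adequacy}\n\nFluency review:\n{fluency}"
    )
```

A single-agent baseline would collapse this into the first call alone, which is why the multi-agent variant can trade latency for domain accuracy.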
https://arxiv.org/abs/2504.12891
Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the entry barriers to music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music primarily interacts with humans through auditory perception, making its data representation inherently less intuitive. Therefore, this paper first introduces the representations of music and provides an overview of music datasets. Subsequently, we categorize cross-modal interactions between music and multimodal data into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of datasets and evaluation metrics used in multimodal tasks related to music, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.
https://arxiv.org/abs/2504.12796
Recommender systems play a central role in numerous real-life applications, yet evaluating their performance remains a significant challenge due to the gap between offline metrics and online behaviors. Given the scarcity and limitations (e.g., privacy issues) of real user data, we introduce SimUSER, an agent framework that serves as a believable and cost-effective human proxy. SimUSER first identifies self-consistent personas from historical data, enriching user profiles with unique backgrounds and personalities. Central to the evaluation are simulated users equipped with persona, memory, perception, and brain modules, which interact with the recommender system. SimUSER exhibits closer alignment with genuine humans than prior work, at both micro and macro levels. Additionally, we conduct insightful experiments to explore the effects of thumbnails on click rates, the exposure effect, and the impact of reviews on user engagement. Finally, we refine recommender system parameters based on offline A/B test results, resulting in improved user engagement in the real world.
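The abstract names four agent modules (persona, memory, perception, brain); a toy rendering of how they might compose is sketched below. The internals are illustrative guesses, not the paper's design:

```python
from dataclasses import dataclass, field

@dataclass
class SimulatedUser:
    """Toy SimUSER-style agent: persona and memory condition each decision."""
    persona: dict                                   # background, interests, traits
    memory: list = field(default_factory=list)      # past items and reactions

    def perceive(self, item: dict) -> dict:
        # Perception module: reduce a recommended item to salient features.
        return {"title": item["title"], "thumbnail": item.get("thumbnail")}

    def decide(self, percept: dict) -> bool:
        # "Brain" module: click decision conditioned on persona and memory.
        seen_before = any(m["title"] == percept["title"] for m in self.memory)
        matches_taste = percept["title"].lower() in self.persona.get("interests", "")
        return matches_taste and not seen_before

    def interact(self, item: dict) -> bool:
        percept = self.perceive(item)
        clicked = self.decide(percept)
        self.memory.append({"title": item["title"], "clicked": clicked})
        return clicked

user = SimulatedUser(persona={"interests": "sci-fi space documentaries"})
user.interact({"title": "Space Documentaries", "thumbnail": "thumb.png"})  # True
```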
https://arxiv.org/abs/2504.12722
Digital pathology, augmented by artificial intelligence (AI), holds significant promise for improving the workflow of pathologists. However, challenges such as the labor-intensive annotation of whole slide images (WSIs), high computational demands, and trust concerns arising from the absence of uncertainty estimation in predictions hinder the practical application of current AI methodologies in histopathology. To address these issues, we present a novel trustful fully unsupervised multi-level segmentation methodology (TUMLS) for WSIs. TUMLS adopts an autoencoder (AE) as a feature extractor to identify the different tissue types within low-resolution training data. It selects representative patches from each identified group based on an uncertainty measure and then performs unsupervised nuclei segmentation in the corresponding higher-resolution space without using any ML algorithms. Crucially, this solution integrates seamlessly into clinicians' workflows, transforming the examination of a whole WSI into a review of concise, interpretable cross-level insights. This integration significantly enhances and accelerates the workflow while ensuring transparency. We evaluated our approach using the UPENN-GBM dataset, where the AE achieved a mean squared error (MSE) of 0.0016. Additionally, nucleus segmentation was assessed on the MoNuSeg dataset, outperforming all unsupervised approaches with an F1 score of 77.46% and a Jaccard score of 63.35%. These results demonstrate the efficacy of TUMLS in advancing the field of digital pathology.
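For reference, the two segmentation metrics reported above can be computed from binary masks as in the plain pixel-wise sketch below; the official MoNuSeg protocol may aggregate scores differently (e.g., per instance):

```python
import numpy as np

def f1_and_jaccard(pred: np.ndarray, gt: np.ndarray):
    """Pixel-wise F1 (Dice) and Jaccard (IoU) for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return f1, jaccard

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
print(f1_and_jaccard(pred, gt))  # (0.666..., 0.5)
```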
https://arxiv.org/abs/2504.12718
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. The challenge received a wide range of impressive solutions, which were developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, the Raindrop Clarity dataset is more diverse and challenging in its degradation types and contents, covering day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. The dataset is divided into three subsets for the competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. A total of 361 participants registered for the competition, and 32 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at this https URL.
https://arxiv.org/abs/2504.12711
Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: this https URL.
https://arxiv.org/abs/2504.12696
Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare them to the current state of NLP systems. We find that our participants' processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition to its content; existing approaches in NLP and adjacent fields that explicitly center the document as an object, rather than merely a container for text, tend to better reflect our participants' priorities, though they are often less accessible outside their research communities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.
https://arxiv.org/abs/2504.12495
As generative AI tools like ChatGPT become integral to everyday writing, critical questions arise about how to preserve writers' sense of agency and ownership when using these tools. Yet a systematic understanding of how AI assistance affects different aspects of the writing process, and how this shapes writers' agency, remains underexplored. To address this gap, we conducted a systematic review of 109 HCI papers using the PRISMA approach. From this literature, we identify four overarching design strategies for AI writing support (structured guidance, guided exploration, active co-writing, and critical feedback), mapped across the four key cognitive processes in writing: planning, translating, reviewing, and monitoring. We complement this analysis with interviews of 15 writers across diverse domains. Our findings reveal that writers' desired levels of AI intervention vary across the writing process: content-focused writers (e.g., academics) prioritize ownership during planning, while form-focused writers (e.g., creatives) value control over translating and reviewing. Writers' preferences are also shaped by contextual goals, values, and notions of originality and authorship. By examining when ownership matters, what writers want to own, and how AI interactions shape agency, we surface both alignment and gaps between research and user needs. Our findings offer actionable design guidance for developing human-centered writing tools for co-writing with AI, on human terms.
https://arxiv.org/abs/2504.12488
The YOLO (You Only Look Once) series has been a leading framework in real-time object detection, consistently improving the balance between speed and accuracy. However, integrating attention mechanisms into YOLO has been challenging due to their high computational overhead. YOLOv12 introduces a novel approach that successfully incorporates attention-based enhancements while preserving real-time performance. This paper provides a comprehensive review of YOLOv12's architectural innovations, including Area Attention for computationally efficient self-attention, Residual Efficient Layer Aggregation Networks for improved feature aggregation, and FlashAttention for optimized memory access. Additionally, we benchmark YOLOv12 against prior YOLO versions and competing object detectors, analyzing its improvements in accuracy, inference speed, and computational efficiency. Through this analysis, we demonstrate how YOLOv12 advances real-time object detection by refining the latency-accuracy trade-off and optimizing computational resources.
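A simplified reading of Area Attention is self-attention restricted to non-overlapping spatial areas of the feature map, which divides the quadratic token cost by the number of areas. The sketch below is a single-head toy with Q = K = V and no learned projections; the official YOLOv12 implementation differs:

```python
import torch
import torch.nn.functional as F

def area_attention(x: torch.Tensor, num_areas: int = 4) -> torch.Tensor:
    """Self-attention within horizontal stripes of a (B, C, H, W) feature map.
    H must be divisible by num_areas."""
    b, c, h, w = x.shape
    ah = h // num_areas  # height of each area
    # Split into areas, flatten each area into a token sequence: (B*areas, ah*W, C)
    tokens = (x.view(b, c, num_areas, ah, w)
                .permute(0, 2, 3, 4, 1)
                .reshape(b * num_areas, ah * w, c))
    # Scaled dot-product self-attention inside each area only.
    out = F.scaled_dot_product_attention(tokens, tokens, tokens)
    # Restore the spatial layout.
    return (out.reshape(b, num_areas, ah, w, c)
               .permute(0, 4, 1, 2, 3)
               .reshape(b, c, h, w))

y = area_attention(torch.randn(2, 64, 32, 32))  # output shape: (2, 64, 32, 32)
```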
https://arxiv.org/abs/2504.11995
This article reviews contemporary methods for integrating force, including both proprioception and tactile sensing, into robot manipulation policy learning. We conduct a comparative analysis of various approaches to force sensing, data collection, behavior cloning, tactile representation learning, and low-level robot control. From our analysis, we articulate when and why forces are needed, and highlight opportunities to improve the learning of contact-rich, generalist robot policies on the path toward highly capable touch-based robot foundation models. We generally find that, aside from a few tasks such as pouring, peg-in-hole insertion, and handling delicate objects, imitation learning models do not yet operate at a level of dynamics where force truly matters. Also, force and touch are abstract quantities that can be inferred through a wide range of modalities and are often measured and controlled implicitly. We hope that juxtaposing the different approaches currently in use will help the reader gain a systematic understanding and inspire the next generation of robot foundation models.
https://arxiv.org/abs/2504.11827
Nature has long inspired the development of swarm intelligence (SI), a key branch of artificial intelligence that models collective behaviors observed in biological systems to solve complex optimization problems. Particle swarm optimization (PSO) is widely adopted among SI algorithms due to its simplicity and efficiency. Although numerous learning strategies have been proposed to enhance PSO's performance in terms of convergence speed, robustness, and adaptability, no comprehensive and systematic analysis of these strategies exists. We review and classify various learning strategies to address this gap, assessing their impact on optimization performance. Additionally, a comparative experimental evaluation is conducted to examine how these strategies influence PSO's search dynamics. Finally, we discuss open challenges and future directions, emphasizing the need for self-adaptive, intelligent PSO variants capable of addressing increasingly complex real-world problems.
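For readers new to the area, the canonical global-best PSO loop that these learning strategies modify (through topology changes, parameter adaptation, exemplar construction, and so on) is only a few lines. This is the textbook baseline, not any specific variant from the survey:

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimal global-best particle swarm optimization (minimization)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))        # positions
    v = np.zeros((n_particles, dim))                   # velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()           # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

best_x, best_f = pso(lambda p: float(np.sum(p ** 2)), dim=5)  # sphere function demo
```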
https://arxiv.org/abs/2504.11812
In recent years, the demand for 3D content has grown exponentially with the intelligent upgrading of the interactive media, extended reality (XR), and Metaverse industries. To overcome the limitations of traditional manual modeling approaches, such as labor-intensive workflows and prolonged production cycles, revolutionary advances have been achieved through the convergence of novel 3D representation paradigms and artificial intelligence generative technologies. In this survey, we conduct a systematic review of cutting-edge achievements in static 3D object and scene generation and establish a comprehensive technical framework through systematic categorization. Specifically, we begin our analysis with mainstream 3D object representations, followed by an in-depth exploration of the two principal technical pathways in object generation: data-driven supervised learning methods and deep generative model-based approaches. Regarding scene generation, we focus on three dominant paradigms: layout-guided compositional synthesis, 2D prior-based scene generation, and rule-driven modeling. Finally, we critically examine persistent challenges in 3D generation and propose potential research directions for future investigation. This survey aims to provide readers with a structured understanding of state-of-the-art 3D generation technologies while inspiring researchers to undertake further exploration in this domain.
https://arxiv.org/abs/2504.11734
Timing of clinical events is central to the characterization of patient trajectories, enabling analyses such as process tracing, forecasting, and causal reasoning. However, structured electronic health records capture few data elements critical to these tasks, while clinical reports lack temporal localization of events in structured form. We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. We contrast manual and large language model (LLM) annotations (n=320 and n=390, respectively) of ten randomly sampled PubMed open-access (PMOA) case reports (N=152,974) and assess inter-LLM agreement (n=3,103; N=93). We find that the LLM models have moderate event recall (O1-preview: 0.80) but high temporal concordance among identified events (O1-preview: 0.95). By establishing the task, annotation, and assessment systems, and by demonstrating high concordance, this work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.
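As a concrete reading of the two reported measures, event recall and temporal concordance over (event, timestamp) annotations might be computed as in the toy sketch below; the paper's actual criterion for matching "the same event" is looser than the exact string match used here:

```python
from itertools import combinations

def event_recall(gold: dict, pred: dict) -> float:
    """Fraction of gold events that also appear in the predicted annotation."""
    return sum(e in pred for e in gold) / len(gold)

def temporal_concordance(gold: dict, pred: dict) -> float:
    """Among event pairs present in both annotations, the fraction whose
    relative temporal order agrees (a Kendall-tau-style concordance)."""
    shared = [e for e in gold if e in pred]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    agree = sum((gold[a] - gold[b]) * (pred[a] - pred[b]) >= 0 for a, b in pairs)
    return agree / len(pairs)

gold = {"admission": 0, "fever onset": 24, "antibiotics started": 30}   # hours
pred = {"admission": 0, "antibiotics started": 36}
print(event_recall(gold, pred))          # 0.666...
print(temporal_concordance(gold, pred))  # 1.0
```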
https://arxiv.org/abs/2504.12350
Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods' surprising performance is due in part to their ability to learn implicit "representations" of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL), which uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a culmination of three intrinsically statistical problems: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper reviews recent progress in CRL from a statistical perspective, focusing on connections to classical models and on statistical and causal identifiability results. The review also highlights key application areas, implementation strategies, and open statistical questions in CRL.
https://arxiv.org/abs/2504.11609
Deep learning has achieved significant breakthroughs in medical imaging, but these advancements are often dependent on large, well-annotated datasets. However, obtaining such datasets poses a significant challenge, as it requires time-consuming and labor-intensive annotations from medical experts. Consequently, there is growing interest in learning paradigms such as incomplete, inexact, and absent supervision, which are designed to operate under limited, inexact, or missing labels. This survey categorizes and reviews the evolving research in these areas, analyzing around 600 notable contributions since 2018. It covers tasks such as image classification, segmentation, and detection across various medical application areas, including but not limited to brain, chest, and cardiac imaging. We attempt to establish the relationships among existing research studies in related areas. We provide formal definitions of different learning paradigms and offer a comprehensive summary and interpretation of various learning mechanisms and strategies, aiding readers in better understanding the current research landscape and ideas. We also discuss potential future research challenges.
https://arxiv.org/abs/2504.11588
While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning and includes two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming a text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
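The inference-time half of this idea, interleaving generation with real-time code execution, reduces to a short loop. The sketch below covers only that rollout mechanic and none of ReTool's RL training; it assumes the model marks tool calls as fenced Python blocks, `generate` is a hypothetical callable that continues the transcript, and the unsandboxed `exec` is for brevity only:

```python
import contextlib
import io
import re

CODE_BLOCK = re.compile(r"`{3}python\n(.*?)`{3}", re.DOTALL)

def run_snippet(code: str) -> str:
    """Execute a snippet and capture stdout. A real system would sandbox this."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def reason_with_tool(generate, prompt: str, max_rounds: int = 8) -> str:
    """Alternate LLM continuations with interpreter feedback until an answer."""
    transcript = prompt
    for _ in range(max_rounds):
        chunk = generate(transcript)   # stops after a code block or a final answer
        transcript += chunk
        match = CODE_BLOCK.search(chunk)
        if match is None:              # no tool call means the model answered
            return transcript
        transcript += f"\n[interpreter output]\n{run_snippet(match.group(1))}\n"
    return transcript
```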
https://arxiv.org/abs/2504.11536
Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology-oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM (including GPT-4o, this http URL, and Claude-3.5-Sonnet) corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.
https://arxiv.org/abs/2504.11373
Complex networks are frequently employed to model physical or virtual complex systems. When certain entities exist across multiple systems simultaneously, unveiling their corresponding relationships across the networks becomes crucial. This problem, known as network alignment, holds significant importance. It enhances our understanding of complex system structures and behaviours, facilitates the validation and extension of theoretical physics research on complex systems, and fosters diverse practical applications across various fields. However, because the structure, characteristics, and properties of complex networks vary across fields, the study of network alignment is often isolated within each domain, and even the terminologies and concepts lack uniformity. This review comprehensively summarizes the latest advancements in network alignment research, focusing on network alignment characteristics and progress in various domains such as social network analysis, bioinformatics, computational linguistics, and privacy protection. It provides a detailed analysis of the implementation principles, processes, and performance differences of various methods, including structure consistency-based methods, network embedding-based methods, and graph neural network-based (GNN-based) methods. Additionally, methods for network alignment under different conditions, such as in attributed networks, heterogeneous networks, directed networks, and dynamic networks, are presented. Furthermore, the challenges and open issues for future studies are also discussed.
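Of the three method families compared, embedding-based alignment is the easiest to sketch: embed both networks into a shared space, then match nodes by similarity. The greedy matcher below is a minimal illustration under the assumption that comparable embeddings already exist; real methods also learn the cross-network mapping itself:

```python
import numpy as np

def align_by_embeddings(emb_a: np.ndarray, emb_b: np.ndarray):
    """Greedy one-to-one node matching across two networks by cosine similarity.
    emb_a: (n_a, d) and emb_b: (n_b, d) node embeddings in a shared space."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                       # (n_a, n_b) cosine similarities
    matches, used_a, used_b = [], set(), set()
    # Visit candidate pairs from most to least similar.
    for i, j in sorted(np.ndindex(*sim.shape), key=lambda ij: -sim[ij]):
        if i not in used_a and j not in used_b:
            matches.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return matches, sim
```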
https://arxiv.org/abs/2504.11367