We study the task of learning associations between faces and voices, which has been gaining interest in the multimodal community lately. Existing methods suffer from the deliberate crafting of negative-mining procedures as well as reliance on a distance margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and must be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
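To make the described mechanism concrete, here is a minimal PyTorch sketch of a gated fusion of face and voice embeddings followed by an orthogonality-style constraint on the fused embeddings; the layer names, dimensions, and the simplified loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): gated fusion of face/voice embeddings
# followed by an orthogonality-style loss on the fused, L2-normalized embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.align_face = nn.Linear(dim, dim)   # assumed alignment layers
        self.align_voice = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)     # produces per-dimension gates

    def forward(self, face, voice):
        f = F.normalize(self.align_face(face), dim=-1)
        v = F.normalize(self.align_voice(voice), dim=-1)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=-1)))
        return F.normalize(g * f + (1.0 - g) * v, dim=-1)  # fused embedding

def orthogonality_loss(fused, labels, num_classes):
    """Cross-entropy against fixed orthogonal class centers: a simplified
    stand-in for the orthogonal-projection constraint named in the abstract."""
    centers = F.normalize(torch.eye(num_classes, fused.size(-1)), dim=-1)
    logits = fused @ centers.t()
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors.
model = GatedFusion(dim=512)
face, voice = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 10, (8,))
loss = orthogonality_loss(model(face, voice), labels, num_classes=10)
```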
https://arxiv.org/abs/2505.17002
We introduce \texttt{CASS}, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA~$\leftrightarrow$~HIP) and assembly-level (Nvidia SASS~$\leftrightarrow$~AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the \texttt{CASS} family of domain-specific language models, achieving 95\% source translation accuracy and 37.5\% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85\% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce \texttt{CASS-Bench}, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on \href{this https URL}{\textcolor{blue}{HuggingFace}}, with code at \href{this https URL}{\textcolor{blue}{GitHub}}.
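For illustration, the snippet below shows how a checkpoint of this kind could be queried for CUDA-to-HIP source translation through the Hugging Face transformers API; the model identifier, prompt format, and generation settings are assumptions, since the released CASS artifacts may expect a different interface.

```python
# Illustrative sketch of prompting a (hypothetical) CASS-style checkpoint for
# CUDA -> HIP source translation. The model id and prompt format are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cass/cass-source-translate"  # placeholder name, not a released id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

cuda_src = """
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
"""

prompt = f"Translate the following CUDA kernel to HIP:\n{cuda_src}\nHIP:\n"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens (the translation).
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```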
https://arxiv.org/abs/2505.16968
Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility, owing to the lack of comprehensive, meaningful tasks and sufficiently diverse evaluations needed to characterize their benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes and early prediction of acute and chronic conditions, together with desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data covering 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs arising from the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.
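The evaluation axes named here (overall discrimination, calibration, and subpopulation performance) can be sketched in a few lines; the binning scheme, subgroup labels, and synthetic data below are illustrative choices, not the study's exact protocol.

```python
# Sketch of the evaluation axes: overall AUROC, a simple expected calibration
# error (ECE), and per-subgroup AUROC, on synthetic predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(y_true, y_prob, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        conf = y_prob[mask].mean()   # mean predicted probability in the bin
        acc = y_true[mask].mean()    # observed event rate in the bin
        ece += (mask.sum() / len(y_true)) * abs(acc - conf)
    return ece

def subgroup_auroc(y_true, y_prob, groups):
    return {g: roc_auc_score(y_true[groups == g], y_prob[groups == g])
            for g in np.unique(groups)}

# Toy usage with synthetic predictions for a binary outcome.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
p = np.clip(0.3 + 0.4 * y + rng.normal(0, 0.15, 1000), 0.01, 0.99)
g = rng.choice(["group_a", "group_b"], 1000)
print(roc_auc_score(y, p), expected_calibration_error(y, p), subgroup_auroc(y, p, g))
```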
https://arxiv.org/abs/2505.16941
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework for conducting Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, generating innovative ideas that enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction within automated end-to-end processes, allowing seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
https://arxiv.org/abs/2505.16938
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
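As a rough illustration of the latent-level contrastive task, the sketch below computes a symmetric InfoNCE loss over paired residue representations from a pLM and a pGNN; the batch construction and temperature are assumptions rather than the paper's exact setup.

```python
# Minimal sketch (assumptions, not the paper's code) of a latent-level
# contrastive objective aligning per-residue embeddings from a pLM with those
# from a pGNN: matching residues are positives, all others are negatives.
import torch
import torch.nn.functional as F

def residue_contrastive_loss(plm_repr, pgnn_repr, temperature=0.07):
    """plm_repr, pgnn_repr: (num_residues, dim) tensors for the same residues,
    possibly pooled from several proteins in the batch."""
    z1 = F.normalize(plm_repr, dim=-1)
    z2 = F.normalize(pgnn_repr, dim=-1)
    logits = z1 @ z2.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric InfoNCE: pLM -> pGNN and pGNN -> pLM directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 128 residues with 640-dim pLM states and 640-dim pGNN states.
loss = residue_contrastive_loss(torch.randn(128, 640), torch.randn(128, 640))
```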
https://arxiv.org/abs/2505.16896
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they either lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that supervised fine-tuning (SFT) on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at this https URL.
https://arxiv.org/abs/2505.16834
GUI automation faces critical challenges in dynamic environments. Multimodal large language models (MLLMs) suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents and requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at this https URL.
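A simplified picture of the transition-aware mining step: the sketch below stores (observation, action, outcome) triples and keeps transitions that recur consistently. The field names and the consistency threshold are assumptions for illustration, not GUI-explorer's actual extractor.

```python
# Sketch of the structured interaction triples and a simplified unsupervised
# knowledge store keyed by state transitions; field names are assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class InteractionTriple:
    observation: str   # e.g. hash/summary of the screen before the action
    action: str        # e.g. "tap(id=search_button)"
    outcome: str       # e.g. hash/summary of the screen after the action

def mine_transition_knowledge(trajectories, min_consistency=0.8):
    """Group observed (observation, action) -> outcome transitions and keep
    those that occur consistently: a crude stand-in for the unsupervised
    extraction of screen-operation logic described above."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for t in traj:
            counts[(t.observation, t.action)][t.outcome] += 1
    knowledge = {}
    for key, outcomes in counts.items():
        outcome, n = max(outcomes.items(), key=lambda kv: kv[1])
        if n / sum(outcomes.values()) >= min_consistency:
            knowledge[key] = outcome
    return knowledge

# Toy usage on a two-step trajectory.
traj = [InteractionTriple("home", "tap(search)", "search_page"),
        InteractionTriple("search_page", "type('cafe')", "results")]
print(mine_transition_knowledge([traj]))
```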
https://arxiv.org/abs/2505.16827
Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present \textbf{Mesh-RFT}, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6\% and improves Topology Score (TS) by 3.8\% over pre-trained models, while outperforming global DPO methods with a 17.4\% HD reduction and 4.9\% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: \href{this https URL}{this https URL}.
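The abstract does not define the Boundary Edge Ratio precisely; a plausible reading, sketched below, is the fraction of mesh edges that belong to exactly one face, which grows when a mesh has holes or non-watertight regions. Treat this as an assumed formulation, not the paper's metric.

```python
# Hedged sketch of a boundary-edge-ratio style metric: the fraction of edges
# used by exactly one face of a triangle mesh (an assumed formulation).
from collections import Counter

def boundary_edge_ratio(faces):
    """faces: list of vertex-index triples (triangle mesh)."""
    edge_counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_counts[tuple(sorted((u, v)))] += 1
    n_boundary = sum(1 for n in edge_counts.values() if n == 1)
    return n_boundary / max(len(edge_counts), 1)

# A single triangle: all 3 edges are boundary edges -> ratio 1.0.
print(boundary_edge_ratio([(0, 1, 2)]))
# Two triangles sharing an edge: 4 of 5 edges are boundary -> 0.8.
print(boundary_edge_ratio([(0, 1, 2), (0, 2, 3)]))
```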
https://arxiv.org/abs/2505.16761
Recent advances in large-scale pre-trained Electroencephalogram (EEG) models have shown great promise, driving progress in Brain-Computer Interfaces (BCIs) and healthcare applications. However, despite their success, many existing pre-trained models have struggled to fully capture the rich information content of neural oscillations, a limitation that fundamentally constrains their performance and generalizability across diverse BCI tasks. This limitation is frequently rooted in suboptimal architectural design choices that constrain representational capacity. In this work, we introduce LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) that incorporates principled improvements grounded in robust signal processing foundations. LaBraM++ demonstrates substantial gains across a variety of tasks, consistently outperforming the architecture on which it is based and achieving competitive results compared with other open-source LBMs. Its superior performance and training efficiency highlight its potential as a strong foundation for future advancements in LBMs.
https://arxiv.org/abs/2505.16724
The Earth's surface is subject to complex and dynamic processes, ranging from large-scale phenomena such as tectonic plate movements to localized changes associated with ecosystems, agriculture, or human activity. Satellite images enable global monitoring of these processes with extensive spatial and temporal coverage, offering advantages over in-situ methods. In particular, the resulting satellite image time series (SITS) datasets contain valuable information. To handle their large volume and complexity, some recent works focus on graph-based techniques that abandon the regular Euclidean structure of satellite data to work at an object level. Moreover, graphs enable modelling of the spatial and temporal interactions between identified objects, which are crucial for pattern detection, classification, and regression tasks. This paper examines the integration of graph-based methods into spatio-temporal remote-sensing analysis. In particular, it presents a versatile graph-based pipeline for SITS analysis, focusing on the construction of spatio-temporal graphs from SITS and their application to downstream tasks. The paper includes a comprehensive review and two case studies, which highlight the potential of graph-based approaches for land cover mapping and water resource forecasting. It also discusses numerous perspectives to resolve current limitations and encourage future developments.
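A minimal sketch of the graph-construction step described here: nodes are (date, object) pairs from segmented SITS, spatial edges connect adjacent objects within a date, and temporal edges connect overlapping objects across consecutive dates. The adjacency and overlap predicates are left as placeholders, since the paper's pipeline may define them differently.

```python
# Sketch of building a spatio-temporal graph from object segments of a
# satellite image time series; adjacency/overlap tests are placeholders.
import networkx as nx

def build_spatiotemporal_graph(objects_per_date, spatially_adjacent, overlaps):
    """objects_per_date: {date: [object_id, ...]} in chronological order.
    spatially_adjacent(date, o1, o2) and overlaps(d1, o1, d2, o2) are
    user-supplied predicates (e.g. derived from segment polygons)."""
    g = nx.Graph()
    dates = sorted(objects_per_date)
    for d in dates:
        for o in objects_per_date[d]:
            g.add_node((d, o))
    for d in dates:                                   # spatial edges
        objs = objects_per_date[d]
        for i, o1 in enumerate(objs):
            for o2 in objs[i + 1:]:
                if spatially_adjacent(d, o1, o2):
                    g.add_edge((d, o1), (d, o2), kind="spatial")
    for d1, d2 in zip(dates[:-1], dates[1:]):         # temporal edges
        for o1 in objects_per_date[d1]:
            for o2 in objects_per_date[d2]:
                if overlaps(d1, o1, d2, o2):
                    g.add_edge((d1, o1), (d2, o2), kind="temporal")
    return g

# Toy usage with trivially true predicates.
g = build_spatiotemporal_graph({0: ["a", "b"], 1: ["a"]},
                               lambda d, x, y: True, lambda d1, x, d2, y: True)
print(g.number_of_nodes(), g.number_of_edges())
```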
https://arxiv.org/abs/2505.16685
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges from the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework in which a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model with the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation success rates, CoNav often generates shorter paths than other methods (as measured by SPL), showcasing both the potential and the challenges of fusing data from different modalities in embodied navigation. Project Page: this https URL
https://arxiv.org/abs/2505.16663
Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a Large-Language model, under the assumption that different anomalies in medical images may share common radiological signs in each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) Radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) Aligned signs are selected further to mitigate the effect of the under-fitting and uncertain-sample issue caused by limited medical data, employing an automatic sign selection strategy at inference. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.
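One way to picture the sign-driven scoring and automatic sign selection is the sketch below, which ranks anomaly categories by an image's similarity to a selected subset of each category's sign embeddings; the top-k selection rule and the generic embeddings are assumptions, not SD-MAD's actual procedure.

```python
# Hedged sketch of sign-driven multi-anomaly scoring: each anomaly category is
# represented by several LLM-generated "radiological sign" sentences; an image
# is scored against each category's sign embeddings, keeping only a small,
# most-similar subset of signs at inference (assumed selection rule).
import torch
import torch.nn.functional as F

def score_categories(image_emb, sign_embs_per_cat, k_select=3):
    """image_emb: (dim,). sign_embs_per_cat: {category: (num_signs, dim)}."""
    img = F.normalize(image_emb, dim=-1)
    scores = {}
    for cat, sign_embs in sign_embs_per_cat.items():
        sims = F.normalize(sign_embs, dim=-1) @ img          # (num_signs,)
        top = torch.topk(sims, k=min(k_select, sims.numel())).values
        scores[cat] = top.mean().item()                       # selected signs only
    return scores

# Toy usage with random stand-in embeddings for three anomaly categories.
signs = {c: torch.randn(8, 512) for c in ["effusion", "nodule", "fracture"]}
print(score_categories(torch.randn(512), signs))
```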
https://arxiv.org/abs/2505.16659
Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-based MIS framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy against the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and is on par with it on the remaining datasets, while maintaining practical resource requirements. Our code is available at this https URL.
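The abstract does not spell out the Regularized PriorBand objective; the toy sketch below only illustrates the general idea of trading segmentation accuracy against training cost, under the assumption that cost enters as a penalty term.

```python
# Toy illustration only: a resource-regularized objective of the kind the
# abstract alludes to, trading validation Dice against normalized training
# cost. The actual Regularized PriorBand formulation may differ.
def regularized_objective(val_dice, train_hours, budget_hours, lam=0.1):
    """Higher is better: validation Dice minus a penalty proportional to the
    fraction of the training-time budget consumed."""
    cost = train_hours / budget_hours
    return val_dice - lam * cost

# A cheaper configuration can win even with slightly lower Dice.
print(regularized_objective(val_dice=0.86, train_hours=12, budget_hours=24))  # 0.81
print(regularized_objective(val_dice=0.87, train_hours=48, budget_hours=24))  # 0.67
```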
https://arxiv.org/abs/2505.16561
3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face the challenge of prohibitive storage requirements, primarily because point-wise modeling fails to exploit motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework that leverages the locality and consistency of motion in dynamic scenes and models object-consistent Gaussian point motion through a keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159× compared to 3DGStream and 14× compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed. Our code will be released.
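The keypoint-driven propagation can be pictured with a small sketch in which each Gaussian point's displacement is a distance-weighted blend of nearby keypoint displacements; the radial-basis influence field and its bandwidth are assumptions, not ComGS's learned mechanism.

```python
# Sketch (assumed form, not the paper's) of propagating keypoint motion to
# neighbouring Gaussian points through a radial influence field.
import numpy as np

def propagate_keypoint_motion(gaussian_pos, keypoint_pos, keypoint_disp, sigma=0.1):
    """gaussian_pos: (N, 3), keypoint_pos: (K, 3), keypoint_disp: (K, 3).
    Returns (N, 3) displacements for the Gaussian points."""
    d2 = ((gaussian_pos[:, None, :] - keypoint_pos[None, :, :]) ** 2).sum(-1)  # (N, K)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                                        # RBF influence
    w = w / np.clip(w.sum(axis=1, keepdims=True), 1e-8, None)                   # normalize weights
    return w @ keypoint_disp                                                    # (N, 3)

# Toy usage: 1000 Gaussian points follow 5 keypoints.
rng = np.random.default_rng(0)
disp = propagate_keypoint_motion(rng.random((1000, 3)), rng.random((5, 3)),
                                 0.01 * rng.standard_normal((5, 3)))
print(disp.shape)
```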
https://arxiv.org/abs/2505.16533
In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic and consistent videos driven by multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge posed by the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.
https://arxiv.org/abs/2505.16512
Achieving full automation in self-driving vehicles remains a challenge, especially in dynamic urban environments where navigation requires real-time adaptability. Existing systems struggle to handle navigation plans when faced with unpredictable changes in road layouts, spontaneous detours, or missing map data, due to their heavy reliance on predefined cartographic information. In this work, we explore the use of Large Language Models (LLMs) to generate Answer Set Programming (ASP) rules by translating informal navigation instructions into structured, logic-based reasoning. ASP provides non-monotonic reasoning, allowing autonomous vehicles to adapt to evolving scenarios without relying on predefined maps. We present an experimental evaluation in which LLMs generate ASP constraints that encode real-world urban driving logic into a formal knowledge representation. By automating the translation of informal navigation instructions into logical rules, our method improves adaptability and explainability in autonomous navigation. Results show that LLM-driven ASP rule generation supports semantic-based decision-making, offering an explainable framework for dynamic navigation planning that aligns closely with how humans communicate navigational intent.
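As a toy illustration of the instruction-to-ASP idea (not an example from the paper), the snippet below embeds a small, LLM-style navigation program with a non-monotonic default rule and solves it with the clingo Python API; the predicates and the rule itself are invented for illustration.

```python
# Toy illustration (not from the paper): an ASP-style navigation program of
# the kind an LLM might emit from "Main St is closed; take any drivable road",
# solved with the clingo Python API.
import clingo

program = r"""
road(main_st). road(oak_ave).
closed(main_st).
% Non-monotonic default: a road is drivable unless it is known to be closed.
drivable(R) :- road(R), not closed(R).
% Choose exactly one drivable road for the next segment.
1 { take(R) : drivable(R) } 1.
#show take/1.
"""

ctl = clingo.Control()
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Plan:", m))  # expected: take(oak_ave)
```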
https://arxiv.org/abs/2505.16498
We present a novel implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), addressing scenarios where conductivity exhibits sharp discontinuities across material interfaces. These high-contrast cases, prevalent in metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods due to severe ill-posedness. Our approach synergizes shape optimization with implicit neural representations, introducing key innovations including a shape derivative-based optimization scheme that explicitly incorporates high-contrast interface conditions and an efficient latent space representation that reduces variable dimensionality. Through rigorous theoretical analysis of algorithm convergence and extensive numerical experiments, we demonstrate substantial performance improvements, establishing our framework as promising for practical applications in medical imaging with metallic implants and industrial non-destructive testing.
https://arxiv.org/abs/2505.16487
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Our findings reveal that advanced proprietary LVMs show superior performance to open-sourced alternatives. Also, they show moderate advantages when using multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at this https URL.
https://arxiv.org/abs/2505.16470
During sudden disaster events, accurately predicting public panic sentiment on social media is crucial for proactive governance and crisis management. Current efforts on this problem face three main challenges: a lack of finely annotated data hinders emotion prediction studies, unmodeled risk perception causes prediction inaccuracies, and the mechanisms of panic formation remain insufficiently interpretable. We address these issues by proposing a Psychology-driven generative Agent framework (PsychoAgent) for explainable panic prediction based on emotion arousal theory. Specifically, we first construct a fine-grained open panic emotion dataset (namely COPE) through collaboration between humans and large language models (LLMs) to mitigate semantic bias. Then, we develop a framework that integrates cross-domain heterogeneous data grounded in psychological mechanisms to model risk perception and cognitive differences in emotion generation. To enhance interpretability, we design an LLM-based role-playing agent that simulates individual psychological chains through dedicatedly designed prompts. Experimental results on our annotated dataset show that PsychoAgent improves panic emotion prediction performance by 12.6% to 21.7% compared to baseline models. Furthermore, the explainability and generalization of our approach are validated. Crucially, this represents a paradigm shift from opaque "data-driven fitting" to transparent "role-based simulation with mechanistic interpretation" for panic emotion prediction during emergencies. Our implementation is publicly available at: this https URL.
https://arxiv.org/abs/2505.16455
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
https://arxiv.org/abs/2505.16421