Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image content from the unmasked regions. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic content for the missing regions. Despite their excellent results, they ignore a hard constraint that the unmasked regions in the input and the output should be identical, which leaves a gap between GAN inversion and image inpainting and thus degrades performance. Besides, existing GAN inversion approaches often consider only a single modality of the input image, neglecting other auxiliary cues in images that could aid restoration. To address these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill consists primarily of a multimodal guided encoder with pre-modulation and a GAN generator operating in an F&W+ latent space. Specifically, the multimodal encoder enhances multi-scale structures with additional semantic segmentation, edge, and texture modalities through a gated mask-aware attention module. Afterwards, pre-modulation encodes these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns and generates high-fidelity textures for massive corruptions. In extensive experiments on six challenging datasets, we show that MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images.
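The abstract does not spell out how the Soft-update Mean Latent module updates its estimate; one natural reading is an exponential-moving-average (soft) update of the generator's mean latent from latents inverted on in-domain images. The sketch below illustrates only that reading; the function name, latent dimensionality, and update rate tau are assumptions, not the paper's implementation.

```python
import torch

def soft_update_mean_latent(mean_latent, batch_latents, tau=0.01):
    """Hypothetical soft (EMA-style) update of a running mean latent.

    mean_latent:   (latent_dim,) current running mean of in-domain latents
    batch_latents: (batch, latent_dim) latents produced by the encoder for a training batch
    tau:           interpolation rate; smaller values make the mean drift more slowly
    """
    batch_mean = batch_latents.mean(dim=0)
    return (1.0 - tau) * mean_latent + tau * batch_mean

# toy usage with stand-in latents
mean_latent = torch.zeros(512)
for _ in range(10):
    w_batch = torch.randn(8, 512)   # stand-in for encoder outputs in the latent space
    mean_latent = soft_update_mean_latent(mean_latent, w_batch, tau=0.01)
```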
https://arxiv.org/abs/2504.12844
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are easily perturbed, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing the depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted on various DGSS settings with five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches, with steadier visual-spatial attention and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at this https URL.
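The token mechanism is described only at a high level; below is a minimal sketch of one possible form of depth-aware learnable tokens, where learnable tokens modulated by a pooled depth feature are prepended to the patch tokens before a frozen VFM block. The class name, token count, and projection are hypothetical and may differ from the actual DepthForge design.

```python
import torch
import torch.nn as nn

class DepthAwareTokens(nn.Module):
    """Sketch: learnable tokens, shifted by projected depth cues, prepended to patch tokens."""
    def __init__(self, num_tokens=8, dim=768, depth_dim=256):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.depth_proj = nn.Linear(depth_dim, dim)   # maps pooled depth features into token space

    def forward(self, patch_tokens, depth_feat):
        # patch_tokens: (B, N, dim) from the frozen backbone (e.g., DINOv2 / EVA02)
        # depth_feat:   (B, depth_dim) pooled feature from a frozen depth model
        b = patch_tokens.size(0)
        extra = self.tokens.expand(b, -1, -1) + self.depth_proj(depth_feat).unsqueeze(1)
        return torch.cat([extra, patch_tokens], dim=1)  # the frozen block then attends over both

tokens = DepthAwareTokens()
out = tokens(torch.randn(2, 196, 768), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 204, 768])
```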
https://arxiv.org/abs/2504.12753
Zero-shot coordination (ZSC), the ability to adapt to a new partner in a cooperative task, is a critical component of human-compatible AI. While prior work has focused on training agents to cooperate on a single task, these specialized models do not generalize to new tasks, even if they are highly similar. Here, we study how reinforcement learning on a distribution of environments with a single partner enables learning general cooperative skills that support ZSC with many new partners on many new problems. We introduce two JAX-based, procedural generators that create billions of solvable coordination challenges. We develop a new paradigm called Cross-Environment Cooperation (CEC), and show that it outperforms competitive baselines quantitatively and qualitatively when collaborating with real people. Our findings suggest that learning to collaborate across many unique scenarios encourages agents to develop general norms, which prove effective for collaboration with different partners. Together, our results suggest a new route toward designing generalist cooperative agents capable of interacting with humans without requiring human data.
https://arxiv.org/abs/2504.12714
Despite recent advances in Large Language Models (LLMs) for code generation, the quality of LLM-generated code still faces significant challenges. One significant issue is code repetition, which refers to the model's tendency to generate structurally redundant code, resulting in inefficiencies and reduced readability. To address this, we conduct the first empirical study to investigate the prevalence and nature of repetition across 19 state-of-the-art code LLMs using three widely-used benchmarks. Our study includes both quantitative and qualitative analyses, revealing that repetition is pervasive and manifests at various granularities and extents, including character, statement, and block levels. We further summarize a taxonomy of 20 repetition patterns. Building on our findings, we propose DeRep, a rule-based technique designed to detect and mitigate repetition in generated code. We evaluate DeRep on open-source benchmarks and in an industrial setting. Our results demonstrate that DeRep significantly outperforms baselines in reducing repetition (with average improvements of 91.3%, 93.5%, and 79.9% in the rep-3, rep-line, and sim-line metrics) and enhancing code quality (with a Pass@1 increase of 208.3% over greedy search). Furthermore, integrating DeRep improves the performance of existing repetition mitigation methods, with Pass@1 improvements ranging from 53.7% to 215.7%.
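The rep-3, rep-line, and sim-line metrics are named but not defined here; as a rough illustration of what a line-level repetition score could measure, the hypothetical rep_line function below reports the fraction of non-empty lines that exactly duplicate an earlier line. It is an illustrative stand-in, not the paper's metric.

```python
from collections import Counter

def rep_line(code: str) -> float:
    """Toy line-level repetition score: share of non-empty lines repeating an earlier line."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    duplicated = sum(c - 1 for c in counts.values())
    return duplicated / len(lines)

snippet = "x = 1\ny = 2\nx = 1\nx = 1\n"
print(rep_line(snippet))  # 0.5 -> two of the four lines repeat an earlier one
```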
https://arxiv.org/abs/2504.12608
High-stakes domains like cyber operations need responsible and trustworthy AI methods. While large language models (LLMs) are becoming increasingly popular in these domains, they still suffer from hallucinations. This research paper provides learning outcomes from a case study with LinkQ, an open-source natural language interface that was developed to combat hallucinations by forcing an LLM to query a knowledge graph (KG) for ground-truth data during question-answering (QA). We conduct a quantitative evaluation of LinkQ using a well-known KGQA dataset, showing that the system outperforms GPT-4 but still struggles with certain question categories - suggesting that alternative query construction strategies will need to be investigated in future LLM querying systems. We discuss a qualitative study of LinkQ with two domain experts using a real-world cybersecurity KG, outlining these experts' feedback, suggestions, perceived limitations, and future opportunities for systems like LinkQ.
https://arxiv.org/abs/2504.12422
This paper presents an overview of the NTIRE 2025 First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on computational complexity or model size. The task focuses on leveraging both events and images as inputs for single-image deblurring. A total of 199 participants registered, among whom 15 teams successfully submitted valid results, offering valuable insights into the current state of event-based image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
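Since PSNR is the sole quantitative criterion, a minimal reference computation for 8-bit images is sketched below; this is the standard definition, and the challenge's exact evaluation script may differ in details such as border cropping or color conversion.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB for images with a known peak value."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

reference = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
restored = np.clip(reference + np.random.normal(0, 5, reference.shape), 0, 255).astype(np.uint8)
print(round(psnr(reference, restored), 2))
```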
https://arxiv.org/abs/2504.12401
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
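Under the stated assumption of independent AWGN at a fixed noise level of 50, noisy/clean training pairs are typically synthesized as in the generic sketch below; this illustrates the degradation model only and is not the organizers' data pipeline.

```python
import numpy as np

def add_awgn(image: np.ndarray, sigma: float = 50.0) -> np.ndarray:
    """Add independent additive white Gaussian noise with a fixed sigma (8-bit intensity scale)."""
    noise = np.random.normal(0.0, sigma, image.shape)
    return np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)

clean = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
noisy = add_awgn(clean, sigma=50.0)   # (noisy, clean) forms one training pair
```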
https://arxiv.org/abs/2504.12276
The application of artificial intelligence (AI) in medical imaging has revolutionized diagnostic practices, enabling advanced analysis and interpretation of radiological data. This study presents a comprehensive evaluation of radiomics-based and deep learning-based approaches for disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia. While deep learning models, particularly convolutional neural networks (CNNs) and vision transformers (ViTs), learn directly from image data, radiomics-based models extract and analyze quantitative features, potentially providing advantages in data-limited scenarios. This study systematically compares the diagnostic accuracy and robustness of various AI models, including Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP) for radiomics, against state-of-the-art computer vision deep learning architectures. Performance metrics across varying sample sizes reveal insights into each model's efficacy, highlighting the contexts in which specific AI approaches may offer enhanced diagnostic capabilities. The results aim to inform the integration of AI-driven diagnostic tools in clinical practice, particularly in automated and high-throughput environments where timely, reliable diagnosis is critical. This comparative study addresses an essential gap, establishing guidance for the selection of AI models based on clinical and operational needs.
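As a concrete picture of the radiomics-style branch, the sketch below extracts a handful of first-order intensity statistics as a toy stand-in for radiomics features and feeds them to two of the classical classifiers named above via scikit-learn. The feature set and data are illustrative placeholders, not the study's radiomics pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def first_order_features(img: np.ndarray) -> np.ndarray:
    """Toy stand-in for radiomics features: simple intensity statistics of one radiograph."""
    x = img.astype(np.float64).ravel()
    skew = ((x - x.mean()) ** 3).mean() / (x.std() ** 3 + 1e-8)
    return np.array([x.mean(), x.std(), np.percentile(x, 10), np.percentile(x, 90), skew])

rng = np.random.default_rng(0)
images = rng.integers(0, 256, (60, 64, 64))   # 60 fake radiographs
labels = np.repeat([0, 1, 2], 20)             # 3 balanced classes
X = np.stack([first_order_features(im) for im in images])

for clf in (make_pipeline(StandardScaler(), SVC()), RandomForestClassifier(random_state=0)):
    print(type(clf).__name__, round(cross_val_score(clf, X, labels, cv=3).mean(), 3))
```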
https://arxiv.org/abs/2504.12249
Deep neural networks (DNNs) have recently become the leading method for low-light image enhancement (LLIE). However, despite significant progress, their outputs may still exhibit issues such as amplified noise, incorrect white balance, or unnatural enhancements when deployed in real-world applications. A key challenge is the lack of diverse, large-scale training data that captures the complexities of low-light conditions and imaging pipelines. In this paper, we propose a novel image signal processing (ISP) driven data synthesis pipeline that addresses these challenges by generating unlimited paired training data. Specifically, our pipeline begins with easily collected high-quality normal-light images, which are first converted back into the RAW format using a reverse ISP. We then synthesize low-light degradations directly in the RAW domain. The resulting data is subsequently processed through a series of ISP stages, including white balance adjustment, color space conversion, tone mapping, and gamma correction, with controlled variations introduced at each stage. This broadens the degradation space and enhances the diversity of the training data, enabling the generated data to capture a wide range of degradations and the complexities inherent in the ISP pipeline. To demonstrate the effectiveness of our synthetic pipeline, we conduct extensive experiments using a vanilla UNet model consisting solely of convolutional layers, group normalization, GeLU activation, and convolutional block attention modules (CBAMs). Extensive testing across multiple datasets reveals that the vanilla UNet model trained with our data synthesis pipeline delivers high-fidelity, visually appealing enhancement results, surpassing state-of-the-art (SOTA) methods both quantitatively and qualitatively.
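To make the stage-wise controlled variations concrete, the toy forward ISP below applies the four named stages (white balance, color space conversion, tone mapping, gamma correction) with randomly jittered parameters to a synthetic low-light RAW-like image. The operators, parameter ranges, and omission of the reverse-ISP step are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def forward_isp(raw_rgb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Toy forward ISP with controlled variation at each stage (inputs and outputs in [0, 1])."""
    # 1) white balance with jittered per-channel gains
    img = raw_rgb * rng.uniform(0.8, 1.2, size=3)
    # 2) color space conversion via a slightly perturbed mixing matrix
    ccm = np.eye(3) + rng.normal(0.0, 0.02, (3, 3))
    img = np.clip(img @ ccm.T, 0.0, 1.0)
    # 3) global tone mapping with a random strength
    k = rng.uniform(2.0, 6.0)
    img = np.log1p(k * img) / np.log1p(k)
    # 4) gamma correction with a jittered exponent
    gamma = rng.uniform(1 / 2.4, 1 / 2.0)
    return np.clip(img ** gamma, 0.0, 1.0)

rng = np.random.default_rng(0)
dark_raw = rng.uniform(0.0, 0.1, (64, 64, 3))   # stand-in for a synthesized low-light RAW image
rendered = forward_isp(dark_raw, rng)           # one degraded training sample
```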
https://arxiv.org/abs/2504.12204
Remote sensing imagery is essential for environmental monitoring, agricultural management, and disaster response. However, data loss due to cloud cover, sensor failures, or incomplete acquisition, especially in high-resolution and high-frequency tasks, severely limits the effectiveness of satellite imagery. Traditional interpolation methods struggle with large missing areas and complex structures. Remote sensing imagery consists of multiple bands, each with distinct meanings, and ensuring consistency across bands is critical to avoid anomalies in the combined images. This paper proposes SatelliteMaker, a diffusion-based method that reconstructs missing data across varying levels of data loss while maintaining spatial, spectral, and temporal consistency. We also introduce the Digital Elevation Model (DEM) as a conditioning input and use tailored prompts to generate realistic images, making diffusion models applicable to quantitative remote sensing tasks. Additionally, we propose a VGG-Adapter module based on a Distribution Loss, which reduces distribution discrepancy and ensures style consistency. Extensive experiments show that SatelliteMaker achieves state-of-the-art performance across multiple tasks.
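The form of the VGG-Adapter's Distribution Loss is not specified in the abstract; one common way to penalize distribution discrepancy between deep features is to match their per-channel means and standard deviations, sketched below with torchvision's VGG16 as the feature extractor. The loss form, layer choice, and use of an untrained backbone here are assumptions for illustration.

```python
import torch
import torchvision

# untrained VGG16 keeps the sketch self-contained; in practice a pretrained one would be used
vgg = torchvision.models.vgg16(weights=None).features[:16].eval()

def distribution_loss(gen_feat: torch.Tensor, ref_feat: torch.Tensor) -> torch.Tensor:
    """Toy distribution loss: match per-channel feature means and standard deviations."""
    mu_g, mu_r = gen_feat.mean(dim=(2, 3)), ref_feat.mean(dim=(2, 3))
    sd_g, sd_r = gen_feat.std(dim=(2, 3)), ref_feat.std(dim=(2, 3))
    return (mu_g - mu_r).abs().mean() + (sd_g - sd_r).abs().mean()

generated = torch.rand(1, 3, 128, 128, requires_grad=True)
reference = torch.rand(1, 3, 128, 128)
loss = distribution_loss(vgg(generated), vgg(reference))
loss.backward()   # gradients flow back to whatever produced the generated image
```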
https://arxiv.org/abs/2504.12112
We present a Bayesian dynamic borrowing (BDB) approach to enhance the quantitative identification of adverse events (AEs) in spontaneous reporting systems (SRSs). The method embeds a robust meta-analytic predictive (MAP) prior within a Bayesian hierarchical model and incorporates semantic similarity measures (SSMs) to enable weighted information sharing from MedDRA Preferred Terms (PTs) that are clinically similar to the target PT. This continuous similarity-based borrowing addresses the limitations of rigid hierarchical grouping in current disproportionality analysis (DPA). Using data from the FDA Adverse Event Reporting System (FAERS) between 2015 and 2019, we evaluate this approach, termed IC SSM, against standard Information Component (IC) analysis and IC with borrowing at the MedDRA high-level group term (HLGT) level. A novel reference set (PVLens), derived from FDA product label updates, enabled prospective evaluation of method performance in identifying AEs prior to official labeling. The IC SSM approach demonstrated improved sensitivity compared to both traditional IC and HLGT-based borrowing, with minor trade-offs in F1 scores and Youden's index. IC SSM consistently identified more true positives and detected signals over 5 months sooner than traditional IC. Despite a marginally lower aggregate Youden's index, IC SSM showed higher performance in the early post-marketing period, providing more stable and relevant estimates than HLGT-based borrowing and traditional IC. These findings support the use of SSM-informed Bayesian borrowing as a scalable and context-aware enhancement to traditional DPA methods. Future research should validate this approach across other datasets and explore additional similarity metrics and Bayesian inference strategies using case-level data.
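For background, the shrunk Information Component used in disproportionality analysis compares observed with expected report counts, IC = log2((observed + 0.5) / (expected + 0.5)). The sketch below computes that quantity and adds a toy similarity-weighted pooling of counts from clinically similar PTs; the actual IC SSM method embeds a robust MAP prior within a Bayesian hierarchical model, so the borrowed_count helper here is purely illustrative.

```python
import numpy as np

def information_component(n_observed, n_drug, n_event, n_total):
    """Shrunk Information Component: log2((observed + 0.5) / (expected + 0.5))."""
    expected = n_drug * n_event / n_total
    return np.log2((n_observed + 0.5) / (expected + 0.5))

def borrowed_count(target_count, neighbor_counts, similarities):
    """Hypothetical similarity-weighted borrowing of report counts from similar PTs."""
    return target_count + float(np.dot(np.asarray(similarities, dtype=float), neighbor_counts))

n_obs = borrowed_count(12, neighbor_counts=[30, 8], similarities=[0.4, 0.7])
print(information_component(n_obs, n_drug=5_000, n_event=900, n_total=2_000_000))
```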
https://arxiv.org/abs/2504.12052
Text-to-Video generation, which utilizes a provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success thanks to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross-attention with the encoded text prompt to guide the generation of video. However, for complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods cannot decompose the overall information into separate scenes and fail to transition smoothly between scenes according to the corresponding camera views. To solve these problems, we propose a novel method, Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera views, we incorporate the widely used temporal transformer into the diffusion model to ensure continuity within a single scene, and propose CamOperator, a modular network-based component that precisely controls camera movements. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments prove Modular-Cam's strong capability to generate multi-scene videos and its fine-grained control of camera movements. Generated results are available at this https URL.
https://arxiv.org/abs/2504.12048
The rapid development of generative AI facilitates content creation and makes image manipulation both easier to perform and more difficult to detect. While multimodal Large Language Models (LLMs) encode rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs to forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% on Autosplice and 86.3% on LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
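The gains are attributed to careful prompt engineering and few-shot examples; the snippet below sketches how such a few-shot forgery-analysis prompt might be assembled as plain text before being sent to a multimodal LLM. The example observations, field names, and wording are invented for illustration and are not the paper's actual prompts.

```python
FEW_SHOT_EXAMPLES = [
    {"observation": "Lighting on the inserted person conflicts with the scene shadows.",
     "verdict": "manipulated", "region": "lower-left subject", "likely_method": "splicing"},
    {"observation": "Noise pattern and JPEG blocking are uniform across the whole frame.",
     "verdict": "authentic", "region": "none", "likely_method": "none"},
]

def build_forgery_prompt(task_hint: str) -> str:
    """Assemble a hypothetical few-shot prompt covering verdict, localization, evidence, and method."""
    lines = [
        "You are an image forensics analyst. For the attached image, report:",
        "verdict, tampered region, supporting evidence, and suspected generation method.",
        "",
        "Worked examples:",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"- Evidence: {ex['observation']} -> verdict={ex['verdict']}, "
                     f"region={ex['region']}, method={ex['likely_method']}")
    lines += ["", f"Now analyze the new image. {task_hint}"]
    return "\n".join(lines)

print(build_forgery_prompt("Pay close attention to local inpainting artifacts."))
```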
https://arxiv.org/abs/2504.11686
Recent advances in reasoning-focused large language models (LLMs) mark a shift from general LLMs toward models designed for complex decision-making, a crucial aspect in medicine. However, their performance in specialized domains like ophthalmology remains underexplored. This study comprehensively evaluated and compared the accuracy and reasoning capabilities of four newly developed reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking. Each model was assessed using 5,888 multiple-choice ophthalmology exam questions from the MedMCQA dataset in a zero-shot setting. Quantitative evaluation included accuracy, Macro-F1, and five text-generation metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed against ground-truth reasoning. Average inference time was recorded for a subset of 100 randomly selected questions. Additionally, two board-certified ophthalmologists qualitatively assessed the clarity, completeness, and reasoning structure of responses to differential diagnosis questions. o1 (0.902) and DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in Macro-F1 (0.900). Performance across the text-generation metrics varied: o3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1 and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0 Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and o1 (0.176) led AlignScore. Inference time varied across the models, with DeepSeek-R1 the slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking the fastest (6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0 Flash-Thinking tended to provide detailed and comprehensive intermediate reasoning, whereas o1 and o3-mini displayed concise and summarized justifications.
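For reference, the two headline quantitative metrics, accuracy and Macro-F1, can be computed directly with scikit-learn as in the sketch below; the answer labels shown are placeholders, not data from the study.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["a", "b", "c", "a", "b", "c", "a", "a"]   # gold answer options
y_pred = ["a", "b", "b", "a", "b", "c", "c", "a"]   # model-chosen options

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", round(f1_score(y_true, y_pred, average="macro"), 3))
```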
https://arxiv.org/abs/2504.11186
Earth observation satellites like Sentinel-1 (S1) and Sentinel-2 (S2) provide complementary remote sensing (RS) data, but S2 images are often unavailable due to cloud cover or data gaps. To address this, we propose a diffusion model (DM)-based approach for SAR-to-RGB translation, generating synthetic optical images from SAR inputs. We explore three different setups: two using Standard Diffusion, which reconstruct S2 images by adding and removing noise (one without and one with class conditioning), and one using Cold Diffusion, which blends S2 with S1 before removing the SAR signal. We evaluate the generated images in downstream tasks, including land cover classification and cloud removal. While generated images may not perfectly replicate real S2 data, they still provide valuable information. Our results show that class conditioning improves classification accuracy, while cloud removal performance remains competitive despite our approach not being optimized for it. Interestingly, despite exhibiting lower perceptual quality, the Cold Diffusion setup performs well in land cover classification, suggesting that traditional quantitative evaluation metrics may not fully reflect the practical utility of generated images. Our findings highlight the potential of DMs for SAR-to-RGB translation in RS applications where RGB images are missing.
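The Cold Diffusion setup is described only as blending S2 with S1 before removing the SAR signal; the minimal degradation operator below shows one way such a blend could look, as a linear interpolation controlled by a scalar t. This is an illustrative assumption rather than the paper's exact formulation.

```python
import numpy as np

def cold_blend(s2: np.ndarray, s1: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical Cold-Diffusion-style degradation: blend the optical image toward the SAR image."""
    return (1.0 - t) * s2 + t * s1

s2 = np.random.rand(64, 64, 3)                         # optical (RGB) target
s1 = np.repeat(np.random.rand(64, 64, 1), 3, axis=2)   # SAR input broadcast to 3 channels
fully_degraded = cold_blend(s2, s1, t=1.0)             # the restoration model learns to undo this
```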
https://arxiv.org/abs/2504.11154
Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution is to adopt consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements to improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary lightweight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss to centre on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.
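The motion-focused loss is described only as centring on motion regions; one plausible sketch is to re-weight the reconstruction error by a motion map derived from consecutive target frames, as below. The weighting scheme and the alpha hyper-parameter are assumptions rather than DanceLCM's actual loss.

```python
import torch

def motion_focused_loss(pred, target, prev_target, alpha=4.0):
    """Weight the per-pixel error more heavily where the target video actually moves.

    pred, target, prev_target: (B, C, H, W) current prediction, current frame, previous frame.
    """
    motion = (target - prev_target).abs().mean(dim=1, keepdim=True)   # crude motion magnitude
    weight = 1.0 + alpha * motion / (motion.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return (weight * (pred - target) ** 2).mean()

loss = motion_focused_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```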
https://arxiv.org/abs/2504.11143
This study introduces a novel method for revealing human covert attention patterns from gameplay data alone, utilizing offline attention techniques from reinforcement learning (RL). We propose the contextualized, task-relevant (CTR) attention network, which generates attention maps from both human and RL agent gameplay in Atari environments. These maps are sparse yet retain the information necessary for the current player's decision making. We compare the CTR-derived attention maps with a temporally integrated overt attention (TIOA) model based on eye-tracking data, serving as a point of comparison and discussion. Visual inspection reveals distinct attention patterns: human CTR maps focus on the player and nearby opponents, occasionally shifting between tighter focus and broader views, sometimes even attending to empty space ahead. In contrast, agent maps maintain a consistent broad focus on most objects, including distant ones and the player. Quantitative analysis further demonstrates that human CTR maps align more closely with TIOA than agent maps do. Our findings indicate that the CTR attention network can effectively reveal human covert attention patterns from gameplay alone, without the need for additional data such as brain activity recordings. This work contributes to understanding human-agent attention differences and enables the development of RL agents augmented with human covert attention.
https://arxiv.org/abs/2504.11118
Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.
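The objective is described only as trading off the information gained from the explanations against their complexity; the toy score below captures that flavour by rewarding how far each group's mean shifts on its explaining attributes and penalizing the number of attributes used. Both the information proxy and the penalty are invented for illustration and are not InfoClus's information-theoretic objective.

```python
import numpy as np

def partition_score(X, labels, explained_attrs, beta=0.5):
    """Toy trade-off: informativeness of sparse per-group explanations minus their complexity."""
    info = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        for a in explained_attrs[g]:
            # information proxy: squared standardized shift of the group mean on attribute a
            z = (members[:, a].mean() - X[:, a].mean()) / (X[:, a].std() + 1e-8)
            info += len(members) * z ** 2
    complexity = sum(len(attrs) for attrs in explained_attrs.values())
    return info - beta * complexity

X = np.random.randn(200, 5)
X[:100, 0] += 3.0                                   # group 0 differs on attribute 0
labels = np.array([0] * 100 + [1] * 100)
print(partition_score(X, labels, {0: [0], 1: []}))  # explaining group 0 by attribute 0 scores well
```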
https://arxiv.org/abs/2504.11089
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show that InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.
https://arxiv.org/abs/2504.10905
Precipitation plays a critical role in the Earth's hydrological cycle, directly affecting ecosystems, agriculture, and water resource management. Accurate precipitation estimation and prediction are crucial for understanding climate dynamics, disaster preparedness, and environmental monitoring. In recent years, artificial intelligence (AI) has gained increasing attention in quantitative remote sensing (QRS), enabling more advanced data analysis and improving precipitation estimation accuracy. Although traditional methods have been widely used for precipitation estimation, they face limitations due to the difficulty of data acquisition and the challenge of capturing complex feature relationships. Furthermore, the lack of standardized multi-source satellite datasets, and in most cases, the exclusive reliance on station data, significantly hinders the effective application of advanced AI models. To address these challenges, we propose the Rainy dataset, a multi-source spatio-temporal dataset that integrates pure satellite data with station data, and propose Taper Loss, designed to fill the gap in tasks where only in-situ data is available without area-wide support. The Rainy dataset supports five main tasks: (1) satellite calibration, (2) precipitation event prediction, (3) precipitation level prediction, (4) spatiotemporal prediction, and (5) precipitation downscaling. For each task, we selected benchmark models and evaluation metrics to provide valuable references for researchers. Using precipitation as an example, the Rainy dataset and Taper Loss demonstrate the seamless collaboration between QRS and computer vision, offering data support for AI for Science in the field of QRS and providing valuable insights for interdisciplinary collaboration and integration.
https://arxiv.org/abs/2504.10776