This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction work. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: this https URL
https://arxiv.org/abs/2504.13159
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic content for the missing regions. Despite their effectiveness, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading performance. Besides, existing GAN inversion approaches often consider only a single modality of the input image, neglecting other auxiliary cues in images that could yield improvements. To address these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill consists primarily of a multimodal guided encoder with pre-modulation and a GAN generator with an F&W+ latent space. Specifically, the multimodal encoder enhances multi-scale structures with additional semantic segmentation and edge texture modalities through a gated mask-aware attention module. Afterwards, pre-modulation encodes these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns for generating high-fidelity textures under massive corruption. In extensive experiments on six challenging datasets, we show that MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images.
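A minimal sketch of how a soft-update mean latent could be maintained, assuming a StyleGAN-style mapping network and an exponential-moving-average blend; the function name, sample count, and blending rate are illustrative assumptions, not the paper's implementation:

```python
import torch

def soft_update_mean_latent(mapping, w_mean, beta=0.995, n_samples=256, latent_dim=512):
    """One soft update of the mean latent: sample fresh W codes from the mapping
    network and blend their mean into the running mean latent with an EMA rule."""
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim)
        w_batch = mapping(z).mean(dim=0)                 # mean of the freshly sampled codes
        w_mean = beta * w_mean + (1.0 - beta) * w_batch  # soft (EMA) update of the buffer
    return w_mean

# toy usage with a stand-in mapping network
mapping = torch.nn.Linear(512, 512)
w_mean = torch.zeros(512)
w_mean = soft_update_mean_latent(mapping, w_mean)
```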
https://arxiv.org/abs/2504.12844
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible to change, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing the depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine the multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches, with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at this https URL.
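As a rough illustration of depth-aware learnable tokens, the sketch below prepends a small set of learnable tokens, modulated by pooled depth features, to the patch tokens entering a frozen VFM layer. The shapes, projection layer, and token count are assumptions rather than DepthForge's actual design:

```python
import torch
import torch.nn as nn

class DepthAwareTokens(nn.Module):
    """Illustrative depth-aware learnable tokens prepended to a frozen ViT layer's input."""
    def __init__(self, num_tokens=8, dim=768, depth_dim=64):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.depth_proj = nn.Linear(depth_dim, dim)   # inject depth cues into the tokens

    def forward(self, patch_tokens, depth_feat):
        # patch_tokens: (B, N, dim) from the frozen VFM layer
        # depth_feat:   (B, num_tokens, depth_dim) pooled features from a frozen depth model
        b = patch_tokens.size(0)
        dtok = self.tokens.expand(b, -1, -1) + self.depth_proj(depth_feat)
        return torch.cat([dtok, patch_tokens], dim=1)  # tokens attend jointly in the next block

# toy shapes
out = DepthAwareTokens()(torch.randn(2, 196, 768), torch.randn(2, 8, 64))
```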
https://arxiv.org/abs/2504.12753
Labeling has always been expensive in the medical context, which has hindered related deep learning applications. Our work introduces active learning into surgical video frame selection to construct a high-quality, affordable Laparoscopic Cholecystectomy dataset for semantic segmentation. Active learning allows the Deep Neural Network (DNN) learning pipeline to include the dataset construction workflow: DNNs trained on the existing dataset identify the most informative data among the newly collected data. At the same time, the DNNs' performance and generalization ability improve over time as the newly selected and annotated data are included in the training data. We assessed different data informativeness measurements and found that deep feature distances select the most informative data in this task. Our experiments show that with half of the data selected by active learning, the DNNs achieve almost the same performance (0.4349 mean Intersection over Union, mIoU) as the same DNNs trained on the full dataset (0.4374 mIoU) on the critical anatomies and surgical instruments.
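A toy sketch of the deep-feature-distance informativeness criterion: unlabeled frames are ranked by how far their deep features lie from the nearest already-labeled sample, and the farthest ones are selected for annotation. The scoring rule and feature dimensions are assumptions used only for illustration:

```python
import numpy as np

def select_informative_frames(labeled_feats, unlabeled_feats, k):
    """Pick the k unlabeled frames whose deep features are farthest (Euclidean)
    from their nearest labeled neighbour -- a feature-distance informativeness score."""
    # pairwise distances: (num_unlabeled, num_labeled)
    d = np.linalg.norm(unlabeled_feats[:, None, :] - labeled_feats[None, :, :], axis=-1)
    score = d.min(axis=1)                 # distance to the closest labeled sample
    return np.argsort(score)[::-1][:k]    # most distant = most informative

# toy usage with random 128-D features
rng = np.random.default_rng(0)
labeled = rng.normal(size=(50, 128))
unlabeled = rng.normal(size=(500, 128))
print(select_informative_frames(labeled, unlabeled, k=10))
```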
https://arxiv.org/abs/2504.12573
Purpose: The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. The use of computer vision approaches for the automatic recognition of perioperative events enables identification of bottlenecks for OR optimization. However, privacy concerns limit the use of computer vision for automated event detection from OR videos, making privacy-preserving approaches necessary for OR workflow analysis. Methods: We propose a two-stage pipeline for privacy-preserving OR video analysis and event detection. In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos. In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection. We evaluate this method on an internal dataset of 38 simulated surgical trials with five event classes. Results: Our results indicate that this DT-based approach to OR event detection achieves performance on par with, and sometimes better than, raw RGB video-based models. Conclusion: DTs enable privacy-preserving OR workflow analysis, facilitate the sharing of de-identified data across institutions, and can potentially enhance model generalizability by mitigating domain-specific appearance differences.
https://arxiv.org/abs/2504.12552
Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from the semantic space to the visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available on GitHub at this https URL.
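The self-consistency loss can be pictured as below, assuming a point-cloud encoder and Gaussian point-wise jitter; the perturbation scale and loss form are illustrative choices, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(encoder, points, sigma=0.01):
    """Encourage features of a point cloud and a jittered copy to agree.
    `encoder` maps (B, N, 3) -> (B, N, C) point-wise features."""
    feats = encoder(points)
    jittered = points + sigma * torch.randn_like(points)   # point-wise perturbation
    feats_j = encoder(jittered)
    return F.mse_loss(feats_j, feats.detach())              # pull perturbed features back

# toy usage: a per-point linear "encoder"
encoder = torch.nn.Linear(3, 32)
loss = self_consistency_loss(encoder, torch.randn(2, 1024, 3))
```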
https://arxiv.org/abs/2504.12442
Detection of spatial areas where biodiversity is at risk is of paramount importance for the conservation and monitoring of ecosystems. Large terrestrial mammalian herbivores are keystone species: their activity not only has deep effects on soils, plants, and animals, but also shapes landscapes, as large herbivores act as allogenic ecosystem engineers. One key landscape feature that indicates intense herbivore activity and potentially impacts biodiversity is the formation of grazing trails. Grazing trails are formed by the continuous trampling activity of large herbivores, which can produce complex networks of tracks of bare soil. Here, we evaluated different algorithms based on machine learning techniques to identify grazing trails. Our goal is to automatically detect potential areas with intense herbivory activity, which might be beneficial for conservation and management plans. We applied five semantic segmentation methods combined with fourteen encoders to map grazing trails on aerial images. Our results indicate that in most cases the chosen methodology successfully mapped the trails, although there were a few instances where the actual trail structure was underestimated. The UNet architecture with the MambaOut encoder was the best architecture for mapping trails. The proposed approach could be applied to develop tools for mapping and monitoring temporal changes in these landscape structures to support habitat conservation and land management programs. To the best of our knowledge, this is the first time that competitive image segmentation results have been obtained for the detection and delineation of trails of large herbivorous mammals.
https://arxiv.org/abs/2504.12121
Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet-V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet-V1 struggles with multi-class segmentation tasks. To address this limitation, we propose PraNet-V2, which, compared to PraNet-V1, effectively performs a broader range of tasks including multi-class segmentation. At the core of PraNet-V2 is the Dual-Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet-V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state-of-the-art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: this https URL.
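For readers unfamiliar with reverse attention, a minimal sketch is given below: features are reweighted by one minus the (sigmoid) foreground confidence so that refinement concentrates on background and boundary regions, with an extra background head standing in for DSRA's explicit background supervision. The layer sizes, residual form, and background head are assumptions, not PraNet-V2's exact module:

```python
import torch
import torch.nn as nn

class ReverseAttention(nn.Module):
    """Minimal reverse-attention block: weight features by 1 - sigmoid(coarse logits)."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.refine = nn.Conv2d(channels, num_classes, kernel_size=3, padding=1)
        self.bg_head = nn.Conv2d(channels, 1, kernel_size=1)   # explicit background supervision (DSRA-style)

    def forward(self, feat, coarse_logits):
        rev = 1.0 - torch.sigmoid(coarse_logits).max(dim=1, keepdim=True).values
        refined = self.refine(feat * rev) + coarse_logits      # residual refinement of the coarse map
        bg_logits = self.bg_head(feat)                         # to be supervised against a background mask
        return refined, bg_logits

# toy usage
refined, bg = ReverseAttention(64, 4)(torch.randn(1, 64, 88, 88), torch.randn(1, 4, 88, 88))
```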
https://arxiv.org/abs/2504.10986
Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet's mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.
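A small sketch of a learnable gated fusion of two feature scales, in the spirit of the feature-fusion module described above; the gating design, pooling, and blending rule are assumptions rather than LightFormer's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Illustrative channel-wise gated fusion of features from two scales."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # low: high-resolution features; high: deeper features, upsampled to match
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([low, high], dim=1))   # learnable per-channel gate
        return g * low + (1.0 - g) * high              # convex blend of the two scales

# toy usage
fused = GatedFusion(64)(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 32, 32))
```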
https://arxiv.org/abs/2504.10834
In this paper, we propose OmniVDiff, a novel framework for controllable video diffusion that aims to synthesize and comprehend multiple types of video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., RGB, depth, Canny edges, and segmentation) are generated based on text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, Canny map, and semantic segmentation of the input RGB frames while ensuring coherence with the RGB input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability of controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
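The adaptive role control can be illustrated with a toy diffusion step in which each modality's latent is either noised (generation role) or passed through clean (conditioning role); the dictionary layout and linear noising schedule are purely illustrative assumptions:

```python
import torch

def apply_modality_roles(latents, roles, t):
    """Toy per-modality role control for one diffusion step: modalities flagged
    'generation' are noised (to be denoised later), 'conditioning' ones stay clean."""
    out = {}
    for name, x in latents.items():
        if roles[name] == "generation":
            out[name] = (1 - t) * x + t * torch.randn_like(x)   # simple linear noising
        else:
            out[name] = x                                        # clean condition
    return out

# e.g. generate RGB and depth conditioned on a segmentation map
latents = {m: torch.randn(1, 4, 32, 32) for m in ("rgb", "depth", "segmentation")}
roles = {"rgb": "generation", "depth": "generation", "segmentation": "conditioning"}
noised = apply_modality_roles(latents, roles, t=0.5)
```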
https://arxiv.org/abs/2504.10825
Remote sensing images are widely utilized in many disciplines, such as feature recognition and scene semantic segmentation. However, due to environmental factors and issues with the imaging system, image quality is often degraded, which may impair subsequent visual tasks. Even though denoising remote sensing images plays an essential role before applications, current denoising algorithms fail to attain optimum performance because these images possess complex texture features. Denoising frameworks based on artificial neural networks have shown better performance; however, they require exhaustive training with heterogeneous samples, which extensively consumes resources such as power, memory, and computation, and adds latency. Thus, here we present a computationally efficient and robust remote sensing image denoising method that does not require additional training samples. This method partitions a remote sensing image into patches, and a low-rank manifold, representing the noise-free version of the image, underlies this patch space. An efficient and robust approach to revealing this manifold is a randomized approximation of the singular value spectrum of the geodesics' Gramian matrix of the patch space. The method places a unique emphasis on each color channel during denoising, and the three denoised channels are merged to produce the final image.
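The numerical core, a randomized approximation of the leading singular/eigen spectrum of the patch-space Gramian, might look like the following sketch; the geodesic-distance construction of the Gramian is omitted, and the rank and oversampling values are assumptions:

```python
import numpy as np

def randomized_spectrum(gram, rank=20, oversample=10, seed=0):
    """Randomized range-finder approximation of the leading spectrum of a
    symmetric Gramian matrix."""
    rng = np.random.default_rng(seed)
    n = gram.shape[0]
    omega = rng.normal(size=(n, rank + oversample))
    q, _ = np.linalg.qr(gram @ omega)           # orthonormal basis for the range of gram
    small = q.T @ gram @ q                      # project onto the low-dimensional subspace
    vals = np.linalg.eigvalsh(small)[::-1]      # approximate leading eigenvalues, descending
    return vals[:rank]

# toy usage: Gramian of random patch vectors
patches = np.random.default_rng(1).normal(size=(200, 64))
print(randomized_spectrum(patches @ patches.T, rank=5))
```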
https://arxiv.org/abs/2504.10820
Posidonia oceanica meadows are formed by a seagrass species that is highly dependent on rocks for its survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and rock segmentation. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.
https://arxiv.org/abs/2504.10750
Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., "a photo of <class>", "a sketch of a <class>", etc.) for constructing class-wise averaged text embeddings, which act as a classifier. In this paper, we challenge this status quo and investigate the impact of templates on OVSS. Empirically, we observe that for each class, there exist single-template classifiers that significantly outperform the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a "free lunch" to systematically improve OVSS without labels or additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates generalize well from one dataset to others that share the same semantic categories yet exhibit distribution shifts. Additionally, we obtain satisfactory improvements in a low-data regime, where only a few unlabeled images are available. Our code is available at this https URL.
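A hedged sketch of the expert-selection step: for each class, compute the prediction entropy of every single-template classifier on unlabeled images and keep the lowest-entropy template as that class's expert. The array layout and the way class-wise entropy is aggregated here are assumptions, not FLOSS's exact procedure:

```python
import numpy as np

def select_class_experts(probs):
    """Pick one 'expert' template per class by prediction entropy.
    `probs`: (num_templates, num_images, num_classes) softmax scores of
    single-template classifiers on unlabeled images."""
    eps = 1e-8
    ent = -(probs * np.log(probs + eps)).sum(-1)           # (T, N) entropy per template/image
    T, N, C = probs.shape
    experts = {}
    for c in range(C):
        mask = probs.argmax(-1) == c                       # images each template predicts as class c
        per_t = np.array([ent[t][mask[t]].mean() if mask[t].any() else np.inf for t in range(T)])
        experts[c] = int(per_t.argmin())                   # lowest-entropy template = class expert
    return experts

# toy usage: 5 templates, 100 images, 19 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100, 19))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(select_class_experts(probs))
```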
https://arxiv.org/abs/2504.10487
This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at this https URL.
https://arxiv.org/abs/2504.10462
Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S-RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S-RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.
https://arxiv.org/abs/2504.10123
Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel-wise annotations. State-of-the-art (SOTA) UDA methods primarily use self-training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low-quality pseudo-labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA-based self-training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA-based updates, resulting in improved pseudo-label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA's superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.
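Two standard ingredients DUDA builds on, EMA teacher updates and KL-based knowledge distillation, can be sketched as follows; the temperature, EMA rate, and loss form are generic choices rather than DUDA's exact recipe:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Standard EMA teacher update used in self-training UDA."""
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(alpha).add_(sp, alpha=1.0 - alpha)

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """KL-based knowledge distillation from a (larger) teacher to the lightweight student."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau
```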
https://arxiv.org/abs/2504.09814
Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.
https://arxiv.org/abs/2504.09797
Existing computer vision (CV)-based structural damage identification models demonstrate notable accuracy in categorizing and localizing damage. However, these models present several critical limitations that hinder their practical application in civil engineering (CE). Primarily, their ability to recognize damage types remains constrained, preventing comprehensive analysis of the highly varied and complex conditions encountered in real-world CE structures. Second, these models lack linguistic capabilities, rendering them unable to articulate structural damage characteristics through natural language descriptions. With the continuous advancement of artificial intelligence (AI), large multi-modal models (LMMs) have emerged as a transformative solution, enabling the unified encoding and alignment of textual and visual data. These models can autonomously generate detailed descriptive narratives of structural damage while demonstrating robust generalization across diverse scenarios and tasks. This study introduces SDIGLM, an innovative LMM for structural damage identification, developed based on the open-source VisualGLM-6B architecture. To address the challenge of adapting LMMs to the intricate and varied operating conditions in CE, this work integrates a U-Net-based semantic segmentation module to generate defect segmentation maps as visual Chain of Thought (CoT). Additionally, a multi-round dialogue fine-tuning dataset is constructed to enhance logical reasoning, complemented by a language CoT formed through prompt engineering. By leveraging this multi-modal CoT, SDIGLM surpasses general-purpose LMMs in structural damage identification, achieving an accuracy of 95.24% across various infrastructure types. Moreover, the model effectively describes damage characteristics such as hole size, crack direction, and corrosion severity.
https://arxiv.org/abs/2504.11477
Existing Masked Image Modeling methods apply fixed mask patterns to guide self-supervised training. Because those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to limited modeling of visual cues. This paper introduces an evolved hierarchical masking method to pursue general visual cue modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is then adopted to generate masks accordingly. The accuracy of the hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, the generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks, including partial-duplicate image retrieval, which relies on low-level details, as well as image classification and semantic segmentation, which require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% on ImageNet-1K classification and 1.4% on ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantically demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.
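An illustrative sketch of an evolving mask schedule: given per-patch scores produced by the model being trained (low scores for low-level cues, high scores for semantic cues), the masked window slides from low-score toward high-score patches as training progresses. Both the scoring and the schedule here are assumptions used only to convey the idea, not the paper's algorithm:

```python
import torch

def evolved_mask(patch_scores, mask_ratio=0.6, progress=0.0):
    """Select masked patches from a model-derived ranking of visual cues.
    `patch_scores`: (B, N); `progress` in [0, 1] shifts masking from low-level
    (low-score) patches early in training to high-level (high-score) patches later."""
    B, N = patch_scores.shape
    n_mask = int(mask_ratio * N)
    order = patch_scores.argsort(dim=1)            # ascending: low-level cues first
    start = int(progress * (N - n_mask))           # sliding window over the ranking
    idx = order[:, start:start + n_mask]
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask

# toy usage: 14x14 patch grid, early vs. late training
scores = torch.rand(2, 196)
early, late = evolved_mask(scores, progress=0.0), evolved_mask(scores, progress=1.0)
```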
https://arxiv.org/abs/2504.09155
Effective leveraging of real-world driving datasets is crucial for enhancing the training of autonomous driving systems. While Offline Reinforcement Learning enables the training of autonomous vehicles using such data, most available datasets lack meaningful reward labels. Reward labeling is essential because it provides feedback for the learning algorithm to distinguish between desirable and undesirable behaviors, thereby improving policy performance. This paper presents a novel pipeline for generating human-aligned reward labels. The proposed approach addresses the challenge of absent reward signals in real-world datasets by generating labels that reflect human judgment and safety considerations. The pipeline incorporates an adaptive safety component, activated by analyzing semantic segmentation maps, which allows the autonomous vehicle to prioritize safety over efficiency in potential collision scenarios. The proposed pipeline is applied to an occluded pedestrian crossing scenario with varying levels of pedestrian traffic, using synthetic and simulation data. The results indicate that the generated reward labels closely match the simulation reward labels. When used to train the driving policy with Behavior Proximal Policy Optimisation, the results are competitive with other baselines. This demonstrates the effectiveness of our method in producing reliable and human-aligned reward signals, facilitating the training of autonomous driving systems through Reinforcement Learning outside of simulation environments and in alignment with human values.
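A toy version of the adaptive safety component: the reward label combines a progress term with a penalty that activates only when pedestrian pixels appear inside a danger zone of the semantic segmentation map. The class id, weights, and zone definition are hypothetical, not the paper's parameters:

```python
import numpy as np

PEDESTRIAN_ID = 11  # hypothetical class id in the segmentation map

def reward_label(seg_map, speed, progress, danger_zone, w_safety=5.0):
    """Toy human-aligned reward: progress reward minus an adaptive safety penalty
    triggered by pedestrian pixels in the zone ahead of the ego vehicle."""
    zone = seg_map[danger_zone]                       # pixels inside the danger zone
    pedestrian_present = np.any(zone == PEDESTRIAN_ID)
    safety_penalty = w_safety * speed if pedestrian_present else 0.0
    return progress - safety_penalty

# toy usage
seg = np.zeros((64, 64), dtype=int)
seg[30:40, 28:36] = PEDESTRIAN_ID                     # an occluded pedestrian appears
zone = np.zeros_like(seg, dtype=bool)
zone[25:45, 20:44] = True                             # region ahead of the vehicle
print(reward_label(seg, speed=8.0, progress=1.0, danger_zone=zone))
```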
https://arxiv.org/abs/2504.08704