Remote sensing image dehazing (RSID) aims to remove nonuniform and physically irregular haze for high-quality image restoration. The emergence of CNNs and Transformers has driven extraordinary strides in the RSID arena. However, these methods often struggle to balance adequate long-range dependency modeling with computational efficiency. To this end, we propose RSDehamba, the first lightweight Mamba-based network in the field of RSID. Inspired by the recent rise of the Selective State Space Model (SSM), which models long-range dependencies with linear complexity, our RSDehamba integrates the SSM framework into a U-Net architecture. Specifically, we propose the Vision Dehamba Block (VDB) as the core component of the overall network, which exploits the linear complexity of SSM to achieve global context encoding. Simultaneously, the Direction-aware Scan Module (DSM) is designed to dynamically aggregate feature exchange across different directional domains, effectively enhancing the flexibility of sensing the spatially varying distribution of haze. In this way, RSDehamba fully exploits long-range spatial dependencies and channel information exchange for better extraction of haze features. Extensive experimental results on widely used benchmarks validate that RSDehamba surpasses existing state-of-the-art methods.
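The Direction-aware Scan Module suggests running an SSM over several orderings of the flattened feature map. Below is a minimal sketch of how such a multi-direction scan could be wired, assuming four directions (row-major and column-major, each forward and reversed) and a simple averaging merge; neither assumption is confirmed by the abstract.

```python
import torch

def directional_scans(x: torch.Tensor) -> torch.Tensor:
    """Flatten a feature map (B, C, H, W) into four 1-D sequences:
    row-major, reversed row-major, column-major, reversed column-major."""
    fwd = x.flatten(2)                                   # (B, C, H*W), row-major
    bwd = torch.flip(fwd, dims=[-1])                     # reversed row-major
    col = x.transpose(2, 3).flatten(2)                   # column-major
    col_bwd = torch.flip(col, dims=[-1])                 # reversed column-major
    return torch.stack([fwd, bwd, col, col_bwd], dim=1)  # (B, 4, C, H*W)

def merge_scans(seqs: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Undo each scan order and average the four directional outputs."""
    fwd, bwd, col, col_bwd = seqs.unbind(dim=1)
    b, c, _ = fwd.shape
    out = fwd.view(b, c, h, w)
    out = out + torch.flip(bwd, dims=[-1]).view(b, c, h, w)
    out = out + col.view(b, c, w, h).transpose(2, 3)
    out = out + torch.flip(col_bwd, dims=[-1]).view(b, c, w, h).transpose(2, 3)
    return out / 4.0

x = torch.randn(1, 8, 16, 16)
seqs = directional_scans(x)             # each sequence would feed an SSM block here
print(merge_scans(seqs, 16, 16).shape)  # torch.Size([1, 8, 16, 16])
```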
https://arxiv.org/abs/2405.10030
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with and without reference ground truth, covering various scenarios from real applications. Participants were required to restore real-captured images suffering from complex and unknown degradation, with both perceptual quality and fidelity desired in the restoration results. The challenge consisted of two tasks. Task one employed paired data with real references, enabling quantitative evaluation; task two used unpaired images and was evaluated by a comprehensive user study. The challenge attracted more than 200 registrations, of which 39 teams submitted results, totaling more than 400 submissions. The top-ranked methods improved the state-of-the-art restoration performance and received unanimous recognition from all 18 judges. The proposed datasets are available at this https URL, and the homepage of this challenge is at this https URL.
https://arxiv.org/abs/2405.09923
Infrared (IR) image super-resolution faces challenges from homogeneous background pixel distributions and sparse target regions, requiring models that effectively handle long-range dependencies and capture detailed local and global information. Recent Mamba-based models, built on the selective structured state space model (SSM), have shown significant potential in visual tasks, suggesting their applicability to IR enhancement. In this work, we introduce IRSRMamba: Infrared Image Super-Resolution via Mamba-based Wavelet Transform Feature Modulation Model, a novel Mamba-based model designed specifically for IR image super-resolution. This model enhances the restoration of context-sparse target details through its advanced dependency modeling capabilities. Additionally, a new wavelet transform feature modulation block improves multi-scale receptive field representation, capturing both global and local information efficiently. Comprehensive evaluations confirm that IRSRMamba outperforms existing models on multiple benchmarks. This research advances IR super-resolution and demonstrates the potential of Mamba-based models for IR image processing. Code is available at \url{this https URL}.
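As a rough illustration of wavelet-based feature modulation (a sketch only: the band used, the FiLM-style modulation form, and the layer sizes are our assumptions, not IRSRMamba's actual design), one could split features into Haar sub-bands and predict a scale and shift from the low-frequency band:

```python
import torch

def haar_dwt(x: torch.Tensor):
    """One-level 2-D Haar transform: split (B, C, H, W) into LL/LH/HL/HH bands."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

class WaveletModulation(torch.nn.Module):
    """Illustrative block: scale/shift predicted from the upsampled LL band."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = torch.nn.Conv2d(channels, channels, 1)
        self.to_shift = torch.nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        ll, _, _, _ = haar_dwt(x)
        ll = torch.nn.functional.interpolate(ll, size=x.shape[-2:], mode="nearest")
        return x * torch.sigmoid(self.to_scale(ll)) + self.to_shift(ll)

x = torch.randn(1, 16, 32, 32)
print(WaveletModulation(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```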
https://arxiv.org/abs/2405.09873
Ancient murals are valuable cultural heritage with great archaeological value. Through their content, they provide insights into ancient religions, ceremonies, and folklore, among other things. However, due to long-term oxidation and inadequate protection, ancient murals have suffered continuous damage, including peeling and mold. Additionally, since ancient murals were typically painted indoors, the light intensity in images captured by digital devices is often low. The poor visibility hampers further restoration of the damaged areas. To address the escalating damage to ancient frescoes and facilitate batch restoration at archaeological sites, we propose a two-stage restoration model, called MER (Mural Enhancement and Restoration net), for ancient murals that are damaged and captured in low light. Our two-stage model not only enhances the visual quality of restored images but also achieves commendable results on relevant metrics compared with competing methods. Furthermore, we have launched a website dedicated to the restoration of ancient mural paintings, powered by the proposed model. Code is available at this https URL.
https://arxiv.org/abs/2405.08245
Unveiling the real appearance of retouched faces, to prevent malicious users from deceptive advertising and economic fraud, has become an increasing concern in the era of the digital economy. This article makes the first attempt to investigate the face retouching reversal (FRR) problem. We first collect an FRR dataset, named deepFRR, which contains 50,000 StyleGAN-generated high-resolution (1024×1024) facial images and their counterparts retouched by a commercial online API. To the best of our knowledge, deepFRR is the first FRR dataset tailored for training deep FRR models. We then propose a novel diffusion-based approach (FRRffusion) for the FRR task. FRRffusion consists of a coarse-to-fine two-stage network: a diffusion-based Facial Morpho-Architectonic Restorer (FMAR) generates the basic contours of low-resolution faces in the first stage, while a Transformer-based Hyperrealistic Facial Detail Generator (HFDG) creates high-resolution facial details in the second stage. Tested on deepFRR, FRRffusion surpasses the GP-UNIT and Stable Diffusion methods by a large margin on four widely used quantitative metrics. In particular, in a qualitative evaluation with 85 subjects, the de-retouched images produced by FRRffusion are visually much closer to the raw face images than both the retouched face images and those restored by GP-UNIT and Stable Diffusion. These results sufficiently validate the efficacy of our work, bridging the gap between FRR and generic image restoration tasks. The dataset and code are available at this https URL.
https://arxiv.org/abs/2405.07582
Global point clouds that correctly represent the static environment can facilitate accurate localization and robust path planning. However, dynamic objects introduce undesired ghost tracks that are mixed up with the static environment. Existing dynamic-point removal methods typically fail to balance computational efficiency and accuracy. In response, we present BeautyMap, which efficiently removes dynamic points while retaining static features for high-fidelity global maps. Our approach uses a binary-encoded matrix to efficiently extract environment features. Through a bit-wise comparison between the matrix of each frame and that of the corresponding map region, we extract potential dynamic regions. We then use coarse-to-fine hierarchical segmentation along the $z$-axis to handle terrain variations. A final static-restoration module accounts for the range visibility of each single scan and protects static points that are out of sight. Comparative experiments underscore BeautyMap's superior accuracy and efficiency against other dynamic-point removal methods. The code is open-sourced at this https URL.
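The bit-encoding idea can be made concrete with a toy example: pack the occupied height bins of each (x, y) grid cell into one integer, then flag map bins that the current scan observes as free. The grid size, bin count, and visibility mask below are placeholders, not BeautyMap's actual parameters.

```python
import numpy as np

def encode_column(z_indices: np.ndarray, n_bins: int = 64) -> np.uint64:
    """Pack the occupied z-bins of one (x, y) grid cell into a 64-bit word."""
    word = np.uint64(0)
    for z in np.unique(np.clip(z_indices, 0, n_bins - 1)):
        word |= np.uint64(1) << np.uint64(z)
    return word

# map_grid / scan_grid: (X, Y) arrays of uint64 bit columns.
map_grid  = np.zeros((4, 4), dtype=np.uint64)
scan_grid = np.zeros((4, 4), dtype=np.uint64)
map_grid[1, 2]  = encode_column(np.array([3, 4, 10]))  # map: bins 3, 4, 10 occupied
scan_grid[1, 2] = encode_column(np.array([10]))        # scan: only bin 10 occupied

# Candidate dynamic bins: occupied in the map but free in the current scan,
# restricted to bins the scan actually observed (visibility mask, assumed known).
visible = np.full_like(map_grid, np.uint64(0xFFFF))    # toy mask: bins 0-15 visible
dynamic = map_grid & ~scan_grid & visible
print(bin(int(dynamic[1, 2])))  # 0b11000 -> bins 3 and 4 flagged as dynamic
```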
https://arxiv.org/abs/2405.07283
Recent advancements in light field super-resolution (SR) have yielded impressive results. In practice, however, many existing methods are limited by assuming fixed degradation models, such as bicubic downsampling, which hinders their robustness in real-world scenarios with complex degradations. To address this limitation, we present LF-DEST, an effective blind Light Field SR method that incorporates explicit Degradation Estimation to handle various degradation types. LF-DEST consists of two primary components: degradation estimation and light field restoration. The former concurrently estimates blur kernels and noise maps from low-resolution degraded light fields, while the latter generates super-resolved light fields based on the estimated degradations. Notably, we introduce a modulated and selective fusion module that intelligently combines degradation representations with image information, allowing for effective handling of diverse degradation types. We conduct extensive experiments on benchmark datasets, demonstrating that LF-DEST achieves superior performance across a variety of degradation scenarios in light field SR.
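One plausible shape for such a modulated and selective fusion module is FiLM-style modulation from a degradation embedding followed by a learned per-pixel gate; this is purely a sketch, since the abstract does not specify LF-DEST's actual design.

```python
import torch
import torch.nn as nn

class ModulatedSelectiveFusion(nn.Module):
    """Illustrative fusion of a degradation representation (e.g., an embedding
    of blur kernel and noise map) with image features: FiLM modulation, then a
    learned gate selecting between modulated and original features."""
    def __init__(self, dim: int, deg_dim: int):
        super().__init__()
        self.scale = nn.Linear(deg_dim, dim)
        self.shift = nn.Linear(deg_dim, dim)
        self.gate = nn.Conv2d(dim, dim, 1)

    def forward(self, feat: torch.Tensor, deg: torch.Tensor) -> torch.Tensor:
        s = self.scale(deg)[:, :, None, None]          # (B, C, 1, 1)
        t = self.shift(deg)[:, :, None, None]
        modulated = feat * (1 + s) + t
        g = torch.sigmoid(self.gate(feat))             # per-pixel selection
        return g * modulated + (1 - g) * feat

f = torch.randn(2, 32, 24, 24)
d = torch.randn(2, 128)                                # degradation embedding
print(ModulatedSelectiveFusion(32, 128)(f, d).shape)   # torch.Size([2, 32, 24, 24])
```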
https://arxiv.org/abs/2405.07012
Point-based representations have recently gained popularity in novel view synthesis for their unique advantages, e.g., intuitive geometric representation, simple manipulation, and faster convergence. However, based on our observation, these point-based neural re-rendering methods only perform well under ideal conditions and suffer from noisy, patchy points and unbounded scenes, which are challenging to handle but de facto common in real applications. To this end, we revisit one such influential method, Neural Point-based Graphics (NPBG), as our baseline, and propose Robust Point-based Graphics (RPBG). We analyze in depth the factors that prevent NPBG from achieving satisfactory renderings on generic datasets, and accordingly reform the pipeline to make it more robust to varying in-the-wild datasets. Inspired by practices in image restoration, we greatly enhance the neural renderer to enable attention-based correction of point visibility and inpainting of incomplete rasterization, with only acceptable overheads. We also seek a simple and lightweight alternative for environment modeling and an iterative method to alleviate the problem of poor geometry. By thorough evaluation on a wide range of datasets with different shooting conditions and camera trajectories, RPBG stably outperforms the baseline by a large margin and exhibits greater robustness than state-of-the-art NeRF-based variants. Code available at this https URL.
https://arxiv.org/abs/2405.05663
In this paper, we tackle the problem of grasping transparent and specular objects. This issue is important, yet it remains unsolved in the field of robotics because depth cameras fail to recover the accurate geometry of such objects. For the first time, we propose ASGrasp, a 6-DoF grasp detection network that uses an RGB-D active stereo camera. ASGrasp utilizes a two-layer learning-based stereo network for transparent object reconstruction, enabling material-agnostic object grasping in cluttered environments. In contrast to existing RGB-D based grasp detection methods, which heavily depend on depth restoration networks and the quality of depth maps generated by depth cameras, our system distinguishes itself by directly utilizing raw IR and RGB images for transparent object geometry reconstruction. We create an extensive synthetic dataset through domain randomization, based on GraspNet-1Billion. Our experiments demonstrate that ASGrasp can achieve over 90% success rate for generalizable transparent object grasping in both simulation and the real world via seamless sim-to-real transfer. Our method significantly outperforms SOTA networks and even surpasses the performance upper bound set by perfect visible point cloud inputs. Project page: this https URL
https://arxiv.org/abs/2405.05648
Document image restoration is a crucial aspect of Document AI systems, as the quality of document images significantly influences overall performance. Prevailing methods address distinct restoration tasks independently, leading to intricate systems and an inability to harness the potential synergies of multi-task learning. To overcome this challenge, we propose DocRes, a generalist model that unifies five document image restoration tasks: dewarping, deshadowing, appearance enhancement, deblurring, and binarization. To instruct DocRes to perform the various restoration tasks, we propose a novel visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The DTSPrompt for each task comprises distinct prior features, which are additional characteristics extracted from the input image. Beyond its role as a cue for task-specific execution, DTSPrompt can also serve as supplementary information to enhance the model's performance. Moreover, DTSPrompt is more flexible than prior visual prompt approaches, as it can be seamlessly applied and adapted to inputs with high and variable resolutions. Experimental results demonstrate that DocRes achieves competitive or superior performance compared to existing state-of-the-art task-specific models. This underscores the potential of DocRes across a broader spectrum of document image restoration tasks. The source code is publicly available at this https URL.
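The general recipe behind such prior-feature prompts can be sketched as follows: compute a cheap task-specific map from the input and feed it to the model as extra channels. The Sobel-gradient and global-threshold priors below are illustrative stand-ins, not the priors DocRes actually uses.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img: torch.Tensor) -> torch.Tensor:
    """Gradient-magnitude prior for a (B, 1, H, W) grayscale image."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def with_prior_prompt(img: torch.Tensor, task: str) -> torch.Tensor:
    """Concatenate a task-specific prior map as an extra input channel."""
    if task == "deblurring":
        prior = sobel_gradient(img)
    elif task == "binarization":
        prior = (img > img.mean()).float()   # crude global-threshold prior
    else:
        prior = torch.zeros_like(img)
    return torch.cat([img, prior], dim=1)    # (B, 2, H, W) fed to the model

x = torch.rand(1, 1, 64, 64)
print(with_prior_prompt(x, "deblurring").shape)  # torch.Size([1, 2, 64, 64])
```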
https://arxiv.org/abs/2405.04408
Deep learning-based image restoration methods have achieved promising performance. However, faithfully preserving the structure of the original image remains challenging. To address this challenge, we propose a novel Residual-Conditioned Optimal Transport (RCOT) approach, which models image restoration as an optimal transport (OT) problem in both unpaired and paired settings, integrating the transport residual as a degradation-specific cue for both the transport cost and the transport map. Specifically, we first formalize a Fourier residual-guided OT objective by incorporating the degradation-specific information of the residual into the transport cost. Based on the dual form of the OT formulation, we design the transport map as a two-pass RCOT map comprising a base model and a refinement process: the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. By duality, the RCOT problem is transformed into a minimax optimization problem, which can be solved by adversarially training neural networks. Extensive experiments on multiple restoration tasks show the effectiveness of our approach in terms of both distortion measures and perceptual quality. In particular, RCOT restores images with more faithful structural details than state-of-the-art methods.
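The two-pass wiring can be sketched schematically. Both sub-networks below are throwaway placeholders, and computing the residual as the input minus the first-pass output is our reading of the abstract, not the paper's verified formulation.

```python
import torch
import torch.nn as nn

class TwoPassRCOT(nn.Module):
    """Schematic two-pass map: pass 1 estimates the transport residual,
    pass 2 is conditioned on an embedding of that residual."""
    def __init__(self, ch: int = 3, width: int = 32):
        super().__init__()
        self.base = nn.Sequential(nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(width, ch, 3, padding=1))
        self.embed = nn.Conv2d(ch, width, 3, padding=1)   # residual -> embedding
        self.refine = nn.Sequential(nn.Conv2d(ch + width, width, 3, padding=1),
                                    nn.ReLU(),
                                    nn.Conv2d(width, ch, 3, padding=1))

    def forward(self, degraded: torch.Tensor) -> torch.Tensor:
        coarse = self.base(degraded)
        residual = degraded - coarse                      # degradation-specific cue
        cond = self.embed(residual)
        return self.refine(torch.cat([coarse, cond], dim=1))

y = torch.rand(1, 3, 64, 64)
print(TwoPassRCOT()(y).shape)  # torch.Size([1, 3, 64, 64])
```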
https://arxiv.org/abs/2405.02843
While image forensics is concerned with whether an image has been tampered with, image anti-forensics attempts to prevent image forensics methods from detecting tampered images. The competition between these two fields started long before the advancement of deep learning. JPEG compression, blurring, and noising, which are simple methods by today's standards, have long been used for anti-forensics and have been the subject of much research in both forensics and anti-forensics. Although these traditional methods are old, they make fake images difficult to detect and are used for data augmentation when training deep image forgery detection models. Besides hampering detection, however, these methods leave traces on the image and consequently degrade its quality, and separate image forensics methods have been developed to detect these traces. In this study, we go one step further: we improve the image quality after applying these anti-forensic methods using deep image restoration models, making the forged image even harder to detect. We evaluate the impact of these methods on image quality. We then test both our proposed deep-learning-based methods and the non-deep-learning methods against the two best existing image manipulation detection models. The results show how existing image forgery detection models fail against the proposed methods. The code implementation will be publicly available at this https URL.
https://arxiv.org/abs/2405.02751
The accuracy and robustness of 3D human pose estimation (HPE) are limited by 2D pose detection errors and the ill-posed nature of 2D-to-3D lifting, which has drawn great attention to Multi-Hypothesis HPE (MH-HPE) research. Most existing MH-HPE methods are based on generative models, which are computationally expensive and difficult to train. In this study, we propose a Probabilistic Restoration 3D Human Pose Estimation framework (PRPose) that can be integrated with any lightweight single-hypothesis model. Specifically, PRPose employs a weakly supervised approach to fit the hidden probability distribution of the 2D-to-3D lifting process in a single-hypothesis HPE model, and then reverse-maps the distribution to the 2D pose input through an adaptive noise sampling strategy to generate reasonable multi-hypothesis samples effectively. Extensive experiments on 3D HPE benchmarks (Human3.6M and MPI-INF-3DHP) highlight the effectiveness and efficiency of PRPose. Code is available at: this https URL.
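The core sampling loop is simple to sketch: perturb the 2-D input with per-joint noise and re-run a single-hypothesis lifter. How PRPose estimates the per-joint scale is not described here, so `sigma` is assumed given, and the linear lifter is a toy stand-in.

```python
import torch

def multi_hypothesis(pose_2d, lifter, sigma, n_samples: int = 10):
    """Perturb the 2-D input with per-joint noise and re-run a single-hypothesis
    lifter to obtain multiple 3-D hypotheses. `lifter` maps (J, 2) -> (J, 3);
    `sigma` is a (J, 2) per-joint noise scale (assumed given here)."""
    hyps = []
    for _ in range(n_samples):
        noisy = pose_2d + sigma * torch.randn_like(pose_2d)
        hyps.append(lifter(noisy))
    return torch.stack(hyps)  # (n_samples, J, 3)

# Toy demo with a linear "lifter" and uniform joint uncertainty.
J = 17
lifter = torch.nn.Linear(2, 3)
pose = torch.rand(J, 2)
sigma = torch.full((J, 2), 0.02)
print(multi_hypothesis(pose, lifter, sigma).shape)  # torch.Size([10, 17, 3])
```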
https://arxiv.org/abs/2405.02114
Unmanned Aerial Vehicles (UAVs) have emerged as a transformative technology across diverse sectors, offering adaptable solutions to complex challenges in both military and civilian domains. Their expanding capabilities present a platform for further advancement by integrating cutting-edge computational tools like Artificial Intelligence (AI) and Machine Learning (ML) algorithms. These advancements have significantly impacted various facets of human life, fostering an era of unparalleled efficiency and convenience. Large Language Models (LLMs), a key component of AI, exhibit remarkable learning and adaptation capabilities within deployed environments, demonstrating an evolving form of intelligence with the potential to approach human-level proficiency. This work explores the significant potential of integrating UAVs and LLMs to propel the development of autonomous systems. We comprehensively review LLM architectures, evaluating their suitability for UAV integration. Additionally, we summarize the state-of-the-art LLM-based UAV architectures and identify novel opportunities for LLM embedding within UAV frameworks. Notably, we focus on leveraging LLMs to refine data analysis and decision-making processes, specifically for enhanced spectral sensing and sharing in UAV applications. Furthermore, we investigate how LLM integration expands the scope of existing UAV applications, enabling autonomous data processing, improved decision-making, and faster response times in emergency scenarios like disaster response and network restoration. Finally, we highlight crucial areas for future research that are critical for facilitating the effective integration of LLMs and UAVs.
https://arxiv.org/abs/2405.01745
Denoising hyperspectral images (HSIs) is a crucial preprocessing procedure due to the noise originating from intra-imaging mechanisms and environmental factors. Utilizing domain-specific knowledge of HSIs, such as spectral correlation, spatial self-similarity, and spatial-spectral correlation, is essential for deep learning-based denoising. Existing methods are often constrained by running time, space complexity, and computational complexity, and employ strategies that explore these priors separately. While such strategies can avoid some redundant information, hyperspectral images are 3-D images with strong spatial continuity and spectral correlation, so these strategies inevitably overlook subtle long-range spatial-spectral information that positively impacts image restoration. This paper proposes a Spatial-Spectral Selective State Space Model-based U-shaped network, termed Spatial-Spectral U-Mamba (SSUMamba), for hyperspectral image denoising. Thanks to the linear space complexity of State Space Model (SSM) computations, we can obtain complete global spatial-spectral correlation within a single module. We introduce a Spatial-Spectral Alternating Scan (SSAS) strategy for HSI data, which helps model the information flow in multiple directions in 3-D HSIs. Experimental results demonstrate that our method outperforms several compared methods. The source code will be available at this https URL.
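An alternating scan over a 3-D volume can be illustrated by flattening the feature tensor along different axis orders, each traversed forward and in reverse. The particular orders below are assumptions for illustration, not SSUMamba's documented choice.

```python
import torch

def alternating_scans(hsi: torch.Tensor):
    """Flatten a (B, C, Band, H, W) HSI feature volume into sequences along
    different axis orders; each order is also traversed in reverse."""
    orders = [(2, 3, 4),   # band -> row -> column
              (3, 4, 2),   # row -> column -> band
              (4, 2, 3)]   # column -> band -> row
    seqs = []
    for order in orders:
        seq = hsi.permute(0, 1, *order).flatten(2)   # (B, C, Band*H*W)
        seqs.append(seq)
        seqs.append(torch.flip(seq, dims=[-1]))      # reverse direction
    return seqs  # 6 sequences, each of which would feed an SSM block

x = torch.randn(1, 4, 8, 16, 16)
seqs = alternating_scans(x)
print(len(seqs), seqs[0].shape)  # 6 torch.Size([1, 4, 2048])
```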
https://arxiv.org/abs/2405.01726
Recent advancements in Bird's Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, their deep and bulky architectures incur substantial backpropagation memory and computing latency. Consequently, this poses an unavoidable bottleneck for constructing high-resolution (HR) BEV maps, as their large-sized features cause significant increases in costs, including GPU memory consumption and computing latency, which we call the diverging training costs issue. Affected by this problem, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components such as road lanes and sidewalks. As this imprecision leads to risky self-driving, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel Trumpet Neural Network (TNN) mechanism. The framework operates in LR BEV space and outputs an up-sampled semantic BEV map, creating a memory-efficient pipeline. To this end, we introduce Local Restoration of the BEV representation. Specifically, the up-sampled BEV representation contains severely aliased, blocky signals and thick semantic labels. Our proposed Local Restoration restores the signals and thins (or narrows down) the labels. Our extensive experiments show that the TNN mechanism provides a plug-and-play, memory-efficient pipeline, thereby enabling the effective estimation of real-sized (or precise) semantic labels for BEV map construction.
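To make "thinning thick semantic labels" concrete, a crude non-learned stand-in for the paper's Local Restoration is binary erosion, which shaves a pixel off every side of an up-sampled label per iteration:

```python
import torch
import torch.nn.functional as F

def thin_labels(mask: torch.Tensor, iterations: int = 1) -> torch.Tensor:
    """Toy label thinning by binary erosion with a 3x3 kernel; shown only to
    illustrate the idea, not the paper's learned Local Restoration module."""
    for _ in range(iterations):
        # A pixel survives only if its full 3x3 neighbourhood is foreground.
        mask = (F.avg_pool2d(mask, 3, stride=1, padding=1) >= 1.0).float()
    return mask

m = torch.zeros(1, 1, 8, 8)
m[..., 2:6, 2:6] = 1.0          # a thick 4x4 blob
print(thin_labels(m).sum())     # tensor(4.) -> eroded to a 2x2 core
```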
https://arxiv.org/abs/2405.01016
Blind face restoration (BFR) on images has progressed significantly over the last several years, while real-world video face restoration (VFR), which is more challenging due to the more complex face motions involved, such as changing gaze directions and facial orientations, remains unsolved. Typical BFR methods are evaluated on privately synthesized datasets or self-collected real-world low-quality face images, which offer limited coverage of real-world video frames. In this work, we introduce new real-world datasets, named FOS, with a taxonomy of "Full, Occluded, and Side" faces drawn mainly from video frames, to study the applicability of current methods to videos. Compared with existing test datasets, the FOS datasets cover more diverse degradations and involve face samples from more complex scenarios, which helps to revisit current face restoration approaches more comprehensively. Given the established datasets, we benchmark both state-of-the-art BFR methods and video super-resolution (VSR) methods to comprehensively study current approaches, identifying their potential and limitations in VFR tasks. In addition, we study the effectiveness of commonly used image quality assessment (IQA) metrics and face IQA (FIQA) metrics by leveraging a subjective user study. With the extensive experimental results and detailed analysis provided, we gain insights from the successes and failures of both current BFR and VSR methods. These results also pose challenges to current face restoration approaches, which we hope will stimulate future advances in VFR research.
https://arxiv.org/abs/2404.19500
Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work content generation. To address this, unlearning methods have been developed to erase the involved concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw that allows the erased concepts to be restored. This erasure trustworthiness problem needs to be probed, but previous methods are sub-optimal from two perspectives: (1) Lack of transferability: some methods operate in a white-box setting, requiring access to the unlearned model, and the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) Limited attack: prompt-level methods struggle to restore narrow concepts, such as celebrity identity, from unlearned models. Therefore, this paper aims to leverage the transferability of adversarial attacks to probe unlearning robustness in a black-box setting. This challenging scenario assumes that the unlearning method is unknown and the unlearned model is inaccessible for optimization, requiring the attack to be capable of transferring across different unlearned models. Specifically, we employ an adversarial search strategy to find an adversarial embedding that transfers across different unlearned models. This strategy adopts the original Stable Diffusion model as a surrogate model, iteratively erasing and searching for embeddings, enabling it to find an embedding that can restore the target concept for different unlearning methods. Extensive experiments demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods and its effectiveness for different levels of concepts.
https://arxiv.org/abs/2404.19382
This paper proposes a framework for the 3D reconstruction of satellites in low-Earth orbit, utilizing videos captured by small amateur telescopes. The video data obtained from these telescopes differ significantly from data for standard 3D reconstruction tasks, characterized by intense motion blur, atmospheric turbulence, pervasive background light pollution, extended focal length and constrained observational perspectives. To address these challenges, our approach begins with a comprehensive pre-processing workflow that encompasses deep learning-based image restoration, feature point extraction and camera pose initialization. We proceed with the application of an improved 3D Gaussian splatting algorithm for reconstructing the 3D model. Our technique supports simultaneous 3D Gaussian training and pose estimation, enabling the robust generation of intricate 3D point clouds from sparse, noisy data. The procedure is further bolstered by a post-editing phase designed to eliminate noise points inconsistent with our prior knowledge of a satellite's geometric constraints. We validate our approach using both synthetic datasets and actual observations of China's Space Station, showcasing its significant advantages over existing methods in reconstructing 3D space objects from ground-based observations.
https://arxiv.org/abs/2404.18394
Blind Compressed Image Restoration (CIR) has garnered significant attention due to its practical applications. It aims to mitigate compression artifacts caused by unknown quality factors, particularly with JPEG codecs. Existing works on blind CIR often rely on a quality factor prediction network to help their networks restore compressed images. However, the predicted numerical quality factor lacks spatial information, preventing the network from adapting to image content. Recent studies in prompt-learning-based image restoration have showcased the potential of prompts to generalize across varied degradation types and degrees. This motivated us to design a prompt-learning-based compressed image restoration network, dubbed PromptCIR, which can effectively restore images from various compression levels. Specifically, PromptCIR exploits prompts to encode compression information implicitly: prompts interact directly with soft weights generated from image features, thus providing dynamic content-aware and distortion-aware guidance for the restoration process. The lightweight prompts enable our method to adapt to different compression levels while introducing minimal parameter overhead. Overall, PromptCIR leverages a powerful transformer-based backbone with the dynamic prompt module to proficiently handle blind CIR tasks, winning first place in the blind compressed image enhancement track of the NTIRE 2024 challenge. Extensive experiments have validated the effectiveness of our proposed PromptCIR. The code is available at this https URL.
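The described prompt-feature interaction resembles the prompt blocks common in prompt-learning restoration work. As a hedged sketch (component count, prompt size, and the fusion layer are placeholders, not PromptCIR's actual configuration):

```python
import torch
import torch.nn as nn

class PromptBlock(nn.Module):
    """Illustrative prompt module: soft weights predicted from image features
    select a mixture of learnable prompt components, which then modulates the
    features. Mirrors the described interaction, not PromptCIR's exact block."""
    def __init__(self, dim: int = 64, n_prompts: int = 5, size: int = 16):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim, size, size))
        self.to_weights = nn.Linear(dim, n_prompts)
        self.fuse = nn.Conv2d(dim * 2, dim, 3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        w = self.to_weights(feat.mean(dim=(2, 3))).softmax(dim=-1)  # (B, N)
        prompt = torch.einsum("bn,nchw->bchw", w, self.prompts)     # soft selection
        prompt = nn.functional.interpolate(prompt, size=feat.shape[-2:],
                                           mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([feat, prompt], dim=1))

f = torch.randn(2, 64, 32, 32)
print(PromptBlock()(f).shape)  # torch.Size([2, 64, 32, 32])
```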
https://arxiv.org/abs/2404.17433