Accurate localization is fundamental for autonomous underwater vehicles (AUVs) to carry out precise tasks such as manipulation and construction. Vision-based solutions using fiducial markers are promising, but extremely challenging underwater because of harsh lighting conditions. This paper introduces a gradient-based active camera exposure control method to tackle sharp lighting variations during image acquisition, establishing a better foundation for subsequent image enhancement procedures. Considering a typical underwater operation scenario in which visual tags are used, we conducted several experiments comparing our method with other state-of-the-art exposure control methods, including Active Exposure Control (AEC) and Gradient-based Exposure Control (GEC). Results show a significant improvement in the accuracy of robot localization. This method is an important component that can be used in a vision-based state estimation pipeline to improve the overall localization accuracy.
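The core of gradient-based exposure control is selecting the exposure that maximizes an image-gradient metric. A minimal sketch under simplifying assumptions (the helper names are hypothetical, and the camera response is simulated with a plain gamma curve rather than the paper's actual model):

```python
def gradient_metric(img):
    """Sum of absolute horizontal/vertical differences: a crude sharpness proxy."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += abs(img[y][x + 1] - img[y][x])
            if y + 1 < h:
                total += abs(img[y + 1][x] - img[y][x])
    return total

def simulate_exposure(img, gamma):
    """Toy camera response: intensities in [0, 1] remapped by a gamma curve."""
    return [[min(1.0, v ** gamma) for v in row] for row in img]

def select_exposure(img, gammas):
    """Pick the candidate exposure whose image maximizes the gradient metric."""
    return max(gammas, key=lambda g: gradient_metric(simulate_exposure(img, g)))
```

In practice the controller would adjust shutter/gain on the real camera rather than re-mapping one image, but the selection criterion is the same.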
https://arxiv.org/abs/2404.12055
This study addresses the evolving challenges in urban traffic monitoring and detection systems based on fisheye-lens cameras by proposing a framework that improves the efficacy and accuracy of these systems. In the context of urban infrastructure and transportation management, advanced traffic monitoring systems have become critical for managing the complexities of urbanization and increasing vehicle density. Traditional monitoring methods, which rely on static cameras with narrow fields of view, are ineffective in dynamic urban environments, necessitating the installation of multiple cameras, which raises costs. Fisheye lenses, which were recently introduced, provide wide and omnidirectional coverage in a single frame, making them a transformative solution. However, issues such as distorted views and blurriness arise, preventing accurate object detection in these images. Motivated by these challenges, this study proposes a novel approach that combines a Transformer-based image enhancement framework and an ensemble learning technique to address these challenges and improve traffic monitoring accuracy, making significant contributions to the future of intelligent traffic management systems. Our proposed methodological framework won 5th place in the 2024 AI City Challenge, Track 4, with an F1 score of 0.5965 on experimental validation data. The experimental results demonstrate the effectiveness, efficiency, and robustness of the proposed system. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.10078
Image restoration, which aims to recover high-quality images from their corrupted counterparts, often faces the challenge of being an ill-posed problem that allows multiple solutions for a single input. However, most deep learning-based works simply employ an l1 loss to train their networks in a deterministic way, resulting in over-smoothed predictions with inferior perceptual quality. In this work, we propose a novel method that shifts the focus from a deterministic pixel-by-pixel comparison to a statistical perspective, emphasizing the learning of distributions rather than individual pixel values. The core idea is to introduce spatial entropy into the loss function to measure the distribution difference between predictions and targets. To make this spatial entropy differentiable, we employ kernel density estimation (KDE) to approximate the probabilities for specific intensity values of each pixel within their neighboring areas. Specifically, we equip the entropy with diffusion models and aim for superior accuracy and enhanced perceptual quality over an l1-based noise matching loss. In the experiments, we evaluate the proposed method for low-light enhancement on two datasets and on the NTIRE 2024 challenge. All these results illustrate the effectiveness of our statistics-based entropy loss. Code is available at this https URL.
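The KDE-smoothed spatial entropy can be illustrated with a small sketch (plain Python rather than an autograd framework, so differentiability is not shown; the Gaussian kernel, bandwidth, and helper names are illustrative assumptions, not the paper's exact formulation):

```python
import math

def kde_probs(values, centers, bandwidth=0.1):
    """Gaussian KDE: normalized probability mass at each intensity center,
    estimated from the sample intensities of a local patch."""
    probs = []
    for c in centers:
        s = sum(math.exp(-((v - c) ** 2) / (2 * bandwidth ** 2)) for v in values)
        probs.append(s)
    total = sum(probs)
    return [p / total for p in probs]

def spatial_entropy(patch, centers, bandwidth=0.1):
    """Shannon entropy of the KDE-smoothed intensity distribution of a patch."""
    flat = [v for row in patch for v in row]
    probs = kde_probs(flat, centers, bandwidth)
    return -sum(p * math.log(p + 1e-12) for p in probs)
```

A flat patch concentrates its distribution on one intensity and thus yields lower entropy than a patch with varied intensities, which is the distributional difference the loss measures between prediction and target.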
https://arxiv.org/abs/2404.09735
Improving instance-specific image goal navigation (InstanceImageNav), which locates the identical object in a real-world environment from a query image, is essential for robotic systems to assist users in finding desired objects. The challenge lies in the domain gap between low-quality images observed by the moving robot, characterized by motion blur and low resolution, and high-quality query images provided by the user. Such domain gaps could significantly reduce the task success rate but have not been the focus of previous work. To address this, we propose a novel method called Few-shot Cross-quality Instance-aware Adaptation (CrossIA), which employs contrastive learning with an instance classifier to align features between a massive set of low-quality images and a few high-quality ones. This approach effectively reduces the domain gap by bringing the latent representations of cross-quality images closer on an instance basis. Additionally, the system integrates an object image collection with a pre-trained deblurring model to enhance the observed image quality. Our method fine-tunes the SimSiam model, pre-trained on ImageNet, using CrossIA. We evaluated our method's effectiveness through an InstanceImageNav task with 20 different types of instances, in which the robot identifies, in a real-world environment, the same instance shown in a high-quality query image. Our experiments showed that our method improves the task success rate by up to three times compared to the baseline, a conventional approach based on SuperGlue. These findings highlight the potential of leveraging contrastive learning and image enhancement techniques to bridge the domain gap and improve object localization in robotic applications. The project website is this https URL.
https://arxiv.org/abs/2404.09645
Degraded underwater images decrease the accuracy of underwater object detection. However, existing methods for underwater image enhancement mainly focus on improving visual-quality indicators, which may not benefit underwater detection tasks and may even lead to serious degradation in performance. To alleviate this problem, we propose a bidirectionally guided method for underwater object detection, referred to as BG-YOLO. In the proposed method, the network is organized into an enhancement branch and a detection branch constructed in parallel. The enhancement branch consists of a cascade of an image enhancement subnet and an object detection subnet, while the detection branch consists only of a detection subnet. A feature-guided module connects the shallow convolution layers of the two branches. When training the enhancement branch, the object detection subnet in the enhancement branch guides the image enhancement subnet to be optimized in the direction most conducive to the detection task. The shallow feature map of the trained enhancement branch is then fed to the feature-guided module, constraining the optimization of the detection branch through a consistency loss and prompting the detection branch to learn more detailed information about the objects, thereby refining detection performance. During detection, only the detection branch is retained, so no additional computational cost is introduced. Extensive experiments demonstrate that the proposed method significantly improves detector performance in severely degraded underwater scenes while maintaining a remarkable detection speed.
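The consistency constraint between the two branches' shallow feature maps can be sketched minimally; the abstract does not specify the exact loss form, so a mean-squared difference is assumed here for illustration:

```python
def consistency_loss(feat_enhanced, feat_detection):
    """Mean squared difference between flattened shallow feature maps of the
    (frozen, trained) enhancement branch and the detection branch being trained.
    Minimizing it pulls the detection branch's shallow features toward the
    detection-friendly features learned by the enhancement branch."""
    assert len(feat_enhanced) == len(feat_detection)
    n = len(feat_enhanced)
    return sum((a - b) ** 2 for a, b in zip(feat_enhanced, feat_detection)) / n
```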
https://arxiv.org/abs/2404.08979
Localizing text in low-light environments is challenging due to visual degradations. Although a straightforward solution involves a two-stage pipeline with low-light image enhancement (LLE) as the initial step followed by a detector, LLE is primarily designed for human vision rather than machine vision and can accumulate errors. In this work, we propose an efficient and effective single-stage approach for localizing text in the dark that circumvents the need for LLE. We introduce a constrained learning module as an auxiliary mechanism during the training stage of the text detector. This module is designed to guide the text detector in preserving textual spatial features amidst feature map resizing, thus minimizing the loss of spatial information in texts under low-light visual degradations. Specifically, we incorporate spatial reconstruction and spatial semantic constraints within this module to ensure the text detector acquires essential positional and contextual range knowledge. Our approach enhances the original text detector's ability to identify text's local topological features using a dynamic snake feature pyramid network and adopts a bottom-up contour shaping strategy with a novel rectangular accumulation technique for accurate delineation of streamlined text features. In addition, we present a comprehensive low-light dataset for arbitrary-shaped text, encompassing diverse scenes and languages. Notably, our method achieves state-of-the-art results on this low-light dataset and exhibits comparable performance on standard normal-light datasets. The code and dataset will be released.
https://arxiv.org/abs/2404.08965
In this paper we present an improved CycleGAN-based model for underwater image enhancement. We utilize the cycle-consistent learning technique of the state-of-the-art CycleGAN model, modifying the loss function with depth-oriented attention that enhances the contrast of the overall image while keeping global content, color, local texture, and style information intact. We trained the CycleGAN model with the modified loss functions on the benchmark Enhancing Underwater Visual Perception (EUVP) dataset, a large dataset including paired and unpaired sets of underwater images (poor and good quality) taken with seven distinct cameras in a range of visibility situations during research on ocean exploration and human-robot cooperation. In addition, we perform qualitative and quantitative evaluations that support the applied technique and provide a better contrast-enhancement model for underwater imagery. More significantly, the upgraded images yield better results than conventional models for underwater navigation, pose estimation, saliency prediction, object detection, and tracking. The results validate the appropriateness of the model for visual navigation by autonomous underwater vehicles (AUVs).
https://arxiv.org/abs/2404.07649
This study systematically investigates the impact of image enhancement techniques on Convolutional Neural Network (CNN)-based Brain Tumor Segmentation, focusing on Histogram Equalization (HE), Contrast Limited Adaptive Histogram Equalization (CLAHE), and their hybrid variations. Employing the U-Net architecture on a dataset of 3064 Brain MRI images, the research delves into preprocessing steps, including resizing and enhancement, to optimize segmentation accuracy. A detailed analysis of the CNN-based U-Net architecture, training, and validation processes is provided. The comparative analysis, utilizing metrics such as Accuracy, Loss, MSE, IoU, and DSC, reveals that the hybrid approach CLAHE-HE consistently outperforms others. Results highlight its superior accuracy (0.9982, 0.9939, 0.9936 for training, testing, and validation, respectively) and robust segmentation overlap, with Jaccard values of 0.9862, 0.9847, and 0.9864, and Dice values of 0.993, 0.9923, and 0.9932 for the same phases, emphasizing its potential in neuro-oncological applications. The study concludes with a call for refinement in segmentation methodologies to further enhance diagnostic precision and treatment planning in neuro-oncology.
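The HE building block of the hybrid CLAHE-HE pipeline can be sketched in a few lines. This is plain global histogram equalization on a flat list of integer intensities; CLAHE additionally tiles the image and clips the histogram before equalizing, which is omitted here:

```python
def equalize_histogram(pixels, levels=256):
    """Classic histogram equalization: remap intensities through the
    normalized cumulative distribution function (CDF)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative histogram.
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)  # first nonzero CDF value
    n = len(pixels)
    scale = (levels - 1) / max(1, n - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]
```

Intensities clustered in a narrow band are stretched across the full dynamic range, which is the contrast boost the MRI preprocessing relies on.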
https://arxiv.org/abs/2404.05341
Low-light image enhancement (LLIE) aims to improve low-illumination images. However, existing methods face two challenges: (1) uncertainty in restoration from diverse brightness degradations; (2) loss of texture and color information caused by noise suppression and light enhancement. In this paper, we propose a novel enhancement approach, CodeEnhance, that leverages quantized priors and image refinement to address these challenges. In particular, we reframe LLIE as learning an image-to-code mapping from low-light images to a discrete codebook that has been learned from high-quality images. To enhance this process, a Semantic Embedding Module (SEM) is introduced to integrate semantic information with low-level features, and a Codebook Shift (CS) mechanism is designed to adapt the pre-learned codebook to better suit the distinct characteristics of our low-light dataset. Additionally, we present an Interactive Feature Transformation (IFT) module to refine texture and color information during image reconstruction, allowing for interactive enhancement based on user preferences. Extensive experiments on both real-world and synthetic benchmarks demonstrate that the incorporation of prior knowledge and controllable information transfer significantly enhances LLIE performance in terms of quality and fidelity. The proposed CodeEnhance exhibits superior robustness to various degradations, including uneven illumination, noise, and color distortion.
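The image-to-code mapping and the Codebook Shift idea reduce to nearest-neighbor quantization against a learned codebook plus a learned per-code offset. A toy sketch (the real method operates on deep feature maps with learned codes; the names and the additive form of the shift are illustrative assumptions):

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to the feature (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(feature, codebook[i]))

def quantize(features, codebook):
    """Replace every feature with its nearest high-quality code."""
    return [codebook[nearest_code(f, codebook)] for f in features]

def shift_codebook(codebook, deltas):
    """Codebook Shift (toy form): add small learned offsets so codes
    pre-learned on high-quality images better fit the low-light domain."""
    return [[c + d for c, d in zip(code, delta)]
            for code, delta in zip(codebook, deltas)]
```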
https://arxiv.org/abs/2404.05253
This paper introduces the physics-inspired synthesized underwater image dataset (PHISWID), a dataset tailored for enhancing underwater image processing through physics-inspired image synthesis. Deep learning approaches to underwater image enhancement typically demand extensive datasets, yet acquiring paired clean and degraded underwater ones poses significant challenges. While several underwater image datasets have been proposed using physics-based synthesis, a publicly accessible collection has been lacking. Additionally, most underwater image synthesis approaches do not intend to reproduce atmospheric scenes, resulting in incomplete enhancement. PHISWID addresses this gap by offering a set of paired ground-truth (atmospheric) and synthetically degraded underwater images, showcasing not only color degradation but also the often-neglected effects of marine snow, a composite of organic matter and sand particles that considerably impairs underwater image clarity. The dataset applies these degradations to atmospheric RGB-D images, enhancing the dataset's realism and applicability. PHISWID is particularly valuable for training deep neural networks in a supervised learning setting and for objectively assessing image quality in benchmark analyses. Our results reveal that even a basic U-Net architecture, when trained with PHISWID, substantially outperforms existing methods in underwater image enhancement. We intend to release PHISWID publicly, contributing a significant resource to the advancement of underwater imaging technology.
https://arxiv.org/abs/2404.03998
Many existing methods for low-light image enhancement (LLIE) based on Retinex theory ignore important factors that affect the validity of this theory in digital imaging, such as noise, quantization error, non-linearity, and dynamic range overflow. In this paper, we propose a new expression called Digital-Imaging Retinex theory (DI-Retinex) through theoretical and experimental analysis of Retinex theory in digital imaging. Our new expression includes an offset term in the enhancement model, which allows for pixel-wise brightness and contrast adjustment with a non-linear mapping function. In addition, to solve the low-light enhancement problem in an unsupervised manner, we propose an image-adaptive masked reverse degradation loss in gamma space. We also design a variance suppression loss for regulating the additional offset term. Extensive experiments show that our proposed method outperforms all existing unsupervised methods in terms of visual quality, model size, and speed. Our algorithm can also assist downstream face detectors in low light, as it yields the largest detection performance gain after enhancement compared with other methods.
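The role of the offset term can be shown with a toy pixel-wise model. The abstract does not give the full non-linear mapping or how the gain/offset are predicted, so this simplified affine-with-clipping form is an assumption for illustration only:

```python
def enhance_pixel(x, gain, offset):
    """Toy enhancement with an offset term: y = clip(gain * x + offset, 0, 1).
    The additive offset shifts brightness independently of the multiplicative
    Retinex-style gain, enabling contrast adjustments a pure gain cannot express."""
    return min(1.0, max(0.0, gain * x + offset))

def enhance(pixels, gains, offsets):
    """Pixel-wise application with per-pixel gain and offset."""
    return [enhance_pixel(x, g, b) for x, g, b in zip(pixels, gains, offsets)]
```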
https://arxiv.org/abs/2404.03327
We present a new additive image factorization technique that treats images as composed of multiple latent specular components, which can be estimated simply and recursively by modulating the sparsity during decomposition. Our model-driven {\em RSFNet} estimates these factors by unrolling the optimization into network layers requiring only a few scalars to be learned. The resultant factors are interpretable by design and can be fused for different image enhancement tasks via a network or combined directly by the user in a controllable fashion. Based on RSFNet, we detail a zero-reference Low Light Enhancement (LLE) application trained without paired or unpaired supervision. Our system improves the state-of-the-art performance on standard benchmarks and achieves better generalization on multiple other datasets. We also integrate our factors with other task-specific fusion networks for applications like deraining, deblurring, and dehazing with negligible overhead, thereby highlighting the multi-domain and multi-task generalizability of our proposed RSFNet. The code and data are released for reproducibility on the project homepage.
https://arxiv.org/abs/2404.01998
In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
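The prompt-free residual-vector guidance is simple enough to sketch directly: the residual is the difference between the mean embeddings of the two image groups, and guidance pushes a backlit embedding along it. A minimal version (list-based embeddings and a unit step are assumptions; the real method operates on CLIP embeddings inside a training loss):

```python
def mean_embedding(embs):
    """Coordinate-wise mean of a list of embedding vectors."""
    dim = len(embs[0])
    return [sum(e[i] for e in embs) / len(embs) for i in range(dim)]

def residual_vector(well_lit_embs, backlit_embs):
    """Difference of mean embeddings: points from the backlit region of the
    embedding space toward the well-lit region."""
    mw, mb = mean_embedding(well_lit_embs), mean_embedding(backlit_embs)
    return [w - b for w, b in zip(mw, mb)]

def guide(backlit_emb, residual, step=1.0):
    """Push a backlit image's embedding toward the well-lit space."""
    return [e + step * r for e, r in zip(backlit_emb, residual)]
```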
https://arxiv.org/abs/2404.01889
Recent image tone adjustment (or enhancement) approaches have predominantly adopted supervised learning for learning human-centric perceptual assessment. However, these approaches are constrained by intrinsic challenges of supervised learning. Primarily, the requirement for expertly curated or retouched images escalates data acquisition expenses. Moreover, their coverage of target styles is confined to stylistic variants inferred from the training data. To surmount these challenges, we propose an unsupervised learning-based approach to text-based image tone adjustment, CLIPtone, that extends an existing image enhancement method to accommodate natural language descriptions. Specifically, we design a hyper-network to adaptively modulate the pretrained parameters of the backbone model based on the text description. To assess whether the adjusted image aligns with the text description without a ground-truth image, we utilize CLIP, which is trained on a vast set of language-image pairs and thus encompasses knowledge of human perception. The major advantages of our approach are threefold: (i) minimal data collection expenses, (ii) support for a range of adjustments, and (iii) the ability to handle novel text descriptions unseen in training. Our approach's efficacy is demonstrated through comprehensive experiments, including a user study.
https://arxiv.org/abs/2404.01123
Event cameras have recently received much attention for low-light image enhancement (LIE) thanks to their distinct advantages, such as high dynamic range. However, current research is prohibitively restricted by the lack of large-scale, real-world, and spatially-temporally aligned event-image datasets. To this end, we propose a real-world (indoor and outdoor) dataset comprising over 30K pairs of images and events under both low and normal illumination conditions. To achieve this, we utilize a robotic arm that traces a consistent non-linear trajectory to curate the dataset with spatial alignment precision under 0.03mm. We then introduce a matching alignment strategy, rendering 90% of our dataset with errors less than 0.01s. Based on the dataset, we propose a novel event-guided LIE approach, called EvLight, towards robust performance in real-world low-light scenes. Specifically, we first design the multi-scale holistic fusion branch to extract holistic structural and textural information from both events and images. To ensure robustness against variations in regional illumination and noise, we then introduce a Signal-to-Noise-Ratio (SNR)-guided regional feature selection to selectively fuse features of images from regions with high SNR and enhance those with low SNR by extracting regional structure information from events. Extensive experiments on our dataset and the synthetic SDSD dataset demonstrate that our EvLight significantly surpasses frame-based methods. Code and datasets are available at this https URL.
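The SNR-guided regional selection can be illustrated with a toy sketch. For simplicity this uses a hard per-region choice between image and event features; the actual EvLight module performs soft, learned fusion inside the network, and the SNR estimate and threshold below are illustrative assumptions:

```python
def snr_map(region_means, region_noise_stds):
    """Per-region SNR estimate: mean signal over noise std (eps-guarded)."""
    return [m / (s + 1e-6) for m, s in zip(region_means, region_noise_stds)]

def fuse_features(img_feats, evt_feats, snrs, threshold=5.0):
    """Keep image features where SNR is high; fall back to event-derived
    structural features in low-SNR (dark or noisy) regions."""
    return [img if s >= threshold else evt
            for img, evt, s in zip(img_feats, evt_feats, snrs)]
```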
https://arxiv.org/abs/2404.00834
While burst LR images are useful for improving the SR image quality compared with a single LR image, prior SR networks accepting the burst LR images are trained in a deterministic manner, which is known to produce a blurry SR image. In addition, it is difficult to perfectly align the burst LR images, making the SR image more blurry. Since such blurry images are perceptually degraded, we aim to reconstruct the sharp high-fidelity boundaries. Such high-fidelity images can be reconstructed by diffusion models. However, prior SR methods using the diffusion model are not properly optimized for the burst SR task. Specifically, the reverse process starting from a random sample is not optimized for image enhancement and restoration methods, including burst SR. In our proposed method, on the other hand, burst LR features are used to reconstruct the initial burst SR image that is fed into an intermediate step in the diffusion model. This reverse process from the intermediate step 1) skips diffusion steps for reconstructing the global structure of the image and 2) focuses on steps for refining detailed textures. Our experimental results demonstrate that our method can improve the scores of the perceptual quality metrics. Code: this https URL
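Structurally, starting the reverse process at an intermediate timestep from a burst-SR initialization is just a truncated denoising loop. A toy sketch with a stand-in denoiser (the real method uses a trained diffusion model; `denoise_step` here is a hypothetical callable):

```python
def reverse_from_intermediate(x_init, denoise_step, t_start, t_end=0):
    """Run the reverse (denoising) process from intermediate timestep t_start
    down to t_end, seeded with an initial burst-SR estimate x_init instead of
    pure noise. This skips the early steps that would reconstruct global
    structure and spends the remaining steps refining detail."""
    x = x_init
    for t in range(t_start, t_end, -1):
        x = denoise_step(x, t)
    return x
```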
https://arxiv.org/abs/2403.19428
The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To this end, we propose the Image Color Enhancement Lookup Table (ICELUT), which adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN). During training, we leverage pointwise (1x1) convolution to extract color information, alongside a split fully connected layer to incorporate global information. Both components are then seamlessly converted into LUTs for hardware-agnostic deployment. ICELUT achieves near-state-of-the-art performance and remarkably low power consumption. We observe that the pointwise network structure exhibits robust scalability, upkeeping the performance even with a heavily downsampled 32x32 input image. These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.4ms on GPU and 7ms on CPU, at least one order of magnitude faster than any CNN solution. Codes are available at this https URL.
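The conversion from a pointwise mapping to a LUT rests on a simple fact: a 1x1 operation depends only on a single (quantized) input value, so it can be precomputed for all possible inputs. A minimal single-channel sketch (the real ICELUT tables are multi-dimensional over color channels; the names here are illustrative):

```python
def build_lut(fn, levels=256):
    """Precompute a pointwise mapping over every quantized input value.
    After this, fn is never called again at inference time."""
    return [fn(i / (levels - 1)) for i in range(levels)]

def apply_lut(lut, pixels, levels=256):
    """Inference becomes a table lookup: one index computation per pixel,
    no network arithmetic."""
    return [lut[round(p * (levels - 1))] for p in pixels]
```

The trade-off is quantization of the input domain, which the abstract's 32x32-downsampling result suggests the pointwise structure tolerates well.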
https://arxiv.org/abs/2403.19238
In recent years, significant progress has been made in the field of underwater image enhancement (UIE). However, its practical utility for high-level vision tasks, such as underwater object detection (UOD) in Autonomous Underwater Vehicles (AUVs), remains relatively unexplored. It may be attributed to several factors: (1) Existing methods typically employ UIE as a pre-processing step, which inevitably introduces considerable computational overhead and latency. (2) The process of enhancing images prior to training object detectors may not necessarily yield performance improvements. (3) The complex underwater environments can induce significant domain shifts across different scenarios, seriously deteriorating the UOD performance. To address these challenges, we introduce EnYOLO, an integrated real-time framework designed for simultaneous UIE and UOD with domain-adaptation capability. Specifically, both the UIE and UOD task heads share the same network backbone and utilize a lightweight design. Furthermore, to ensure balanced training for both tasks, we present a multi-stage training strategy aimed at consistently enhancing their performance. Additionally, we propose a novel domain-adaptation strategy to align feature embeddings originating from diverse underwater environments. Comprehensive experiments demonstrate that our framework not only achieves state-of-the-art (SOTA) performance in both UIE and UOD tasks, but also shows superior adaptability when applied to different underwater scenarios. Our efficiency analysis further highlights the substantial potential of our framework for onboard deployment.
https://arxiv.org/abs/2403.19079
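The shared-backbone design described above — one lightweight feature extractor feeding both an enhancement head and a detection head — can be sketched with toy linear layers. All layer sizes and names here are illustrative assumptions, not EnYOLO's actual architecture, which builds on a YOLO-style detector.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedBackboneModel:
    """Toy stand-in for a multi-task design: a single backbone whose
    features feed a UIE (enhancement) head and a UOD (detection) head,
    so the expensive feature extraction is paid only once."""

    def __init__(self, in_dim=12, feat_dim=8, n_classes=3):
        # Small random linear layers as placeholders for real conv stacks.
        self.backbone = rng.standard_normal((in_dim, feat_dim)) * 0.1
        self.uie_head = rng.standard_normal((feat_dim, in_dim)) * 0.1
        self.uod_head = rng.standard_normal((feat_dim, n_classes)) * 0.1

    def forward(self, x):
        feat = np.tanh(x @ self.backbone)   # shared features
        enhanced = feat @ self.uie_head     # UIE output (same dim as input)
        logits = feat @ self.uod_head       # UOD class logits
        return enhanced, logits
```

In the paper's multi-stage training strategy, both heads are optimized against this shared representation, which is also the hook for aligning feature embeddings across underwater domains.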
Deep neural networks have achieved remarkable success in a variety of computer vision applications. However, accuracy degrades when the data distribution shifts between training and testing. As a solution to this problem, Test-time Adaptation (TTA) has been well studied because of its practicality. Although TTA methods increase accuracy under distribution shift by updating the model at test time, using high-uncertainty predictions is known to degrade accuracy. Since the input image is the root of the distribution shift, we incorporate a new perspective, enhancing the input image, into TTA methods to reduce prediction uncertainty. We hypothesize that enhancing the input image reduces prediction uncertainty and increases the accuracy of TTA methods. On the basis of this hypothesis, we propose a novel method: Test-time Enhancer and Classifier Adaptation (TECA). In TECA, the classification model is combined with an image enhancement model that transforms input images into recognition-friendly ones, and both models are updated by existing TTA methods. Furthermore, we found that the prediction from the enhanced image does not always have lower uncertainty than the prediction from the original image. Thus, we propose logit switching, which compares the uncertainty measures of the two predictions and outputs the one with lower uncertainty. In our experiments, we evaluate TECA with various TTA methods and show that TECA reduces prediction uncertainty and increases the accuracy of TTA methods despite having no hyperparameters and little parameter overhead.
https://arxiv.org/abs/2403.17423
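The logit switching idea above is simple enough to sketch directly: score both predictions with an uncertainty measure and keep the less uncertain one. Predictive entropy is used here as the measure; this is an assumption for illustration, as the abstract does not pin down which uncertainty measure TECA uses.

```python
import numpy as np

def entropy(logits):
    """Predictive entropy of a softmax over the logits, used as the
    uncertainty measure (an assumed choice; the paper's may differ)."""
    z = logits - logits.max()          # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(p + 1e-12)).sum()

def logit_switching(logits_original, logits_enhanced):
    """Compare the uncertainty of the original-image and enhanced-image
    predictions and output the one with lower uncertainty."""
    if entropy(logits_enhanced) <= entropy(logits_original):
        return logits_enhanced
    return logits_original
```

This guards against the failure case the authors report: when enhancement happens to make the prediction less certain, the classifier simply falls back to the original image's logits.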
Ultrasound imaging is crucial for evaluating organ morphology and function, yet depth adjustment can degrade image quality and field-of-view, presenting a depth-dependent dilemma. Traditional interpolation-based zoom-in techniques often sacrifice detail and introduce artifacts. Motivated by the potential of arbitrary-scale super-resolution to naturally address these inherent challenges, we present the Residual Dense Swin Transformer Network (RDSTN), designed to capture the non-local characteristics and long-range dependencies intrinsic to ultrasound images. It comprises a linear embedding module for feature enhancement, an encoder with shifted-window attention for modeling non-locality, and an MLP decoder for continuous detail reconstruction. This strategy streamlines balancing image quality and field-of-view, which offers superior textures over traditional methods. Experimentally, RDSTN outperforms existing approaches while requiring fewer parameters. In conclusion, RDSTN shows promising potential for ultrasound image enhancement by overcoming the limitations of conventional interpolation-based methods and achieving depth-independent imaging.
https://arxiv.org/abs/2403.16384
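The arbitrary-scale idea underlying RDSTN — decode a continuous coordinate from an encoder feature map, so any output resolution can be rendered from the same features — can be sketched as below. This is a minimal illustration only: the sampling is plain bilinear and the "MLP decoder" is a single linear map, whereas the real network uses a linear embedding, a shifted-window attention encoder, and a proper MLP.

```python
import numpy as np

def query_at(feat, y, x, mlp_w):
    """Decode one real-valued coordinate (y, x): bilinearly sample the
    HxWxC feature map, then map the feature vector to an intensity
    with a toy linear 'decoder' mlp_w of shape (C,)."""
    h, w, _ = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    f = (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
         + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)
    return float(f @ mlp_w)

def upscale(feat, scale, mlp_w):
    """Render at any (non-integer) scale by querying a continuous grid
    of coordinates; no fixed upsampling factor is baked in."""
    h, w, _ = feat.shape
    out_h, out_w = int(h * scale), int(w * scale)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = query_at(feat, i / scale, j / scale, mlp_w)
    return out
```

Because the decoder takes continuous coordinates, the same trained model serves every zoom depth, which is what lets this family of methods sidestep the depth-dependent quality/field-of-view trade-off of interpolation-based zoom.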