Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational cost. As a result, Transformer-based networks can only exploit input information from a limited spatial range. To better exploit latent feature information, this paper proposes a novel Hybrid Multi-Axis Aggregation network (HMA). HMA is constructed by stacking Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB). On the one hand, RHTB combines channel attention and self-attention to enhance non-local feature fusion and produce more visually pleasing results. On the other hand, GAB is used for cross-domain information interaction to jointly model similar features and obtain a larger receptive field. In addition, a novel pre-training method is designed for the super-resolution task to further enhance the model's representation capability, and the proposed model's effectiveness is validated through extensive experiments. The experimental results show that HMA outperforms state-of-the-art methods on the benchmark datasets. We provide code and models at this https URL.
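To make the spatial-range limitation concrete, here is a minimal PyTorch sketch of the standard non-overlapping window partition that window-based self-attention relies on; it is a generic Swin-style illustration with toy shapes, not HMA's actual code.

```python
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, win*win, C) tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    # After this reshape, each window attends only to its own win*win tokens.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

x = torch.randn(1, 64, 64, 96)                    # toy feature map
tokens = window_partition(x, win=16)              # (16, 256, 96): 16 isolated windows
attn = torch.softmax(tokens @ tokens.transpose(1, 2) / 96 ** 0.5, dim=-1)
out = attn @ tokens                               # information never crosses windows
```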
https://arxiv.org/abs/2405.05001
Over the past few years, deep neural models have made considerable advances in image quality assessment (IQA). However, the underlying reasons for their success remain unclear, owing to the complex nature of deep neural networks. IQA aims to describe how the human visual system (HVS) works and to create efficient approximations of it. The Saliency Prediction task, in turn, aims to emulate the HVS by determining areas of visual interest. Thus, we believe that saliency plays a crucial role in human perception. In this work, we conduct an empirical study that reveals the relation between the IQA and Saliency Prediction tasks, demonstrating that the former incorporates knowledge of the latter. Moreover, we introduce a novel SACID dataset of saliency-aware compressed images and conduct a large-scale comparison of classic and neural-based IQA methods. All supplementary code and data will be available at the time of publication.
https://arxiv.org/abs/2405.04997
Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use anchor generation and Non-Maximum Suppression (NMS) in their detection pipelines, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopt a one-to-one matching strategy that produces noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector that improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICDAR-19, and TableBank. It achieves new state-of-the-art results, with mAPs of 95.7% and 97.9% on TableBank (word) and PubLayNet with 30% labeled data, marking 7.4- and 7.6-point improvements over the previous semi-supervised table detection approach, respectively. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection, offering a more efficient and accurate solution for practical document analysis tasks.
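Since the abstract hinges on combining one-to-one and one-to-many assignment, here is a hedged sketch of one way such a hybrid matcher could look, assuming a generic prediction-to-pseudo-label cost matrix; the Hungarian step and the top-k value are illustrative stand-ins for the paper's actual matcher.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hybrid_match(cost: np.ndarray, k: int = 3):
    """cost[i, j]: matching cost between prediction i and pseudo-label j."""
    one2one = list(zip(*linear_sum_assignment(cost)))     # exclusive Hungarian pairs
    # One-to-many: each pseudo-label additionally claims its k cheapest
    # predictions, densifying supervision during early training.
    one2many = [(i, j) for j in range(cost.shape[1])
                for i in np.argsort(cost[:, j])[:k]]
    return one2one, one2many

cost = np.random.rand(10, 4)                # 10 predictions, 4 pseudo-labels
o2o, o2m = hybrid_match(cost)
print(len(o2o), len(o2m))                   # 4 exclusive pairs, 12 dense pairs
```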
https://arxiv.org/abs/2405.04971
Collaborative perception empowers each agent to improve its perceptual ability through the exchange of perceptual messages with other agents. It inherently results in a fundamental trade-off between perception ability and communication cost. To address this bottleneck, our core idea is to optimize the collaborative messages from two key aspects: representation and selection. The proposed codebook-based message representation enables the transmission of integer codes, rather than high-dimensional feature maps. The proposed information-filling-driven message selection optimizes local messages to collectively fill each agent's information demand, preventing information overflow among multiple agents. By integrating these two designs, we propose CodeFilling, a novel communication-efficient collaborative perception system, which significantly advances the perception-communication trade-off and is inclusive of both homogeneous and heterogeneous collaboration settings. We evaluate CodeFilling on both a real-world dataset, DAIR-V2X, and a new simulation dataset, OPV2VH+. Results show that CodeFilling outperforms the previous SOTA Where2comm on DAIR-V2X/OPV2VH+ with 1,333/1,206 times lower communication volume. Our code is available at this https URL.
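The codebook idea lends itself to a short sketch: quantize per-location features to nearest-codeword indices so agents exchange small integer codes rather than float feature maps. The codebook size and feature dimensions below are illustrative assumptions, not CodeFilling's configuration.

```python
import torch

codebook = torch.randn(128, 64)                 # 128 shared codewords, dim 64
feat = torch.randn(32 * 32, 64)                 # flattened BEV feature map

# Transmit: one nearest-codeword index per location (an integer payload).
dists = torch.cdist(feat, codebook)             # (1024, 128) pairwise distances
codes = dists.argmin(dim=1)                     # (1024,) integer message

# Receive: reconstruct approximate features from the shared codebook.
recon = codebook[codes]                         # (1024, 64)
print(codes.dtype, recon.shape)
```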
https://arxiv.org/abs/2405.04966
Recent progress in remote sensing image (RSI) super-resolution (SR) has demonstrated remarkable performance with deep neural networks, e.g., Convolutional Neural Networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs on large-scale RSI. To alleviate these issues, we make the first attempt to integrate the Vision State Space Model (Mamba) into RSI-SR; Mamba specializes in processing large-scale RSI by capturing long-range dependencies with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR, to explore spatial and frequency correlations. In particular, our FMSR features a multi-level fusion architecture equipped with a Frequency Selection Module (FSM), a Vision State Space Module (VSSM), and a Hybrid Gate Module (HGM) to exploit their respective merits for effective spatial-frequency fusion. Recognizing that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features via learnable scaling adaptors for accurate feature fusion. Extensive experiments on the AID, DOTA, and DIOR benchmarks demonstrate that our FMSR outperforms the state-of-the-art Transformer-based method HAT-L in terms of PSNR by 0.11 dB on average, while consuming only 28.05% of its memory and 19.08% of its complexity.
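As a hedged guess at what frequency-assisted processing can look like, the sketch below gates the 2-D spectrum of each feature map with a learnable soft mask; this is an assumed FFT-based mechanism for illustration, not FMSR's actual FSM.

```python
import torch
import torch.nn as nn

class FrequencySelect(nn.Module):
    def __init__(self, h: int, w: int):
        super().__init__()
        # rfft2 keeps w//2 + 1 bins on the last axis; one soft gate per bin.
        self.mask = nn.Parameter(torch.ones(h, w // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * torch.sigmoid(self.mask)            # select frequency bands
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

y = FrequencySelect(64, 64)(torch.randn(2, 48, 64, 64))  # -> (2, 48, 64, 64)
```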
https://arxiv.org/abs/2405.04964
Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However, existing public datasets primarily consist of images without anomalies, limiting the practical application of AD methods in production settings. To address this challenge, we present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial dataset comprising 5000 images, including 2000 instances of challenging real defects across more than 20 subclasses. Acknowledging that traditional AD methods struggle with this dataset, we introduce (2) Segmentation-based Anomaly Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next, SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available.
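The two-stage recipe stated above translates directly into a short sketch: pool the anomaly map into per-part statistics using the segmentation map, then score with a boosted tree ensemble. sklearn's GradientBoostingClassifier stands in for the paper's BRF, and the statistics chosen are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def part_stats(anom: np.ndarray, seg: np.ndarray, n_parts: int) -> np.ndarray:
    """Max/mean/std of anomaly scores inside each segmented part."""
    feats = []
    for p in range(n_parts):
        v = anom[seg == p]
        feats += [v.max(), v.mean(), v.std()] if v.size else [0.0, 0.0, 0.0]
    return np.array(feats)

rng = np.random.default_rng(0)
X = np.stack([part_stats(rng.random((64, 64)),
                         rng.integers(0, 4, (64, 64)), 4) for _ in range(200)])
y = rng.integers(0, 2, 200)                     # toy good/defect labels
clf = GradientBoostingClassifier().fit(X, y)
score = clf.predict_proba(X[:1])[0, 1]          # final anomaly score
```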
https://arxiv.org/abs/2405.04953
Facial feature tracking is essential in imaging ballistocardiography for accurate heart rate estimation, and skin feature tracking enables motor degradation quantification in Parkinson's disease. While deep convolutional neural networks have shown remarkable accuracy in tracking tasks, they typically require extensive labeled data for supervised training. Our proposed pipeline employs a convolutional stacked autoencoder to match image crops with a reference crop containing the target feature, learning deep feature encodings specific to the object category in an unsupervised manner, thus reducing data requirements. To overcome edge effects that make performance dependent on crop size, we introduce a Gaussian weight on the pixel residual errors when calculating the loss function. Training the autoencoder on facial images and validating its performance on manually labeled face and hand videos, our Deep Feature Encodings (DFE) method demonstrated superior tracking accuracy with a mean error ranging from 0.6 to 3.3 pixels, outperforming traditional methods such as SIFT, SURF, and Lucas-Kanade, as well as the latest transformer-based trackers such as PIPs++ and CoTracker. Overall, our unsupervised learning approach excels at tracking various skin features under significant motion, providing superior feature descriptors for tracking, matching, and image registration compared to both traditional and state-of-the-art supervised learning methods.
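A minimal sketch of the Gaussian-weighted loss described above: residuals near the crop border are down-weighted so edge effects stop tying performance to crop size. The sigma value is an illustrative choice.

```python
import torch

def gaussian_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                          sigma: float = 0.5) -> torch.Tensor:
    """pred, target: (B, C, H, W) crops; center-weighted mean squared error."""
    B, C, H, W = pred.shape
    ys = torch.linspace(-1, 1, H).view(H, 1)
    xs = torch.linspace(-1, 1, W).view(1, W)
    w = torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))   # (H, W) weights
    return ((pred - target) ** 2 * w).sum() / (w.sum() * B * C)

loss = gaussian_weighted_mse(torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32))
```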
https://arxiv.org/abs/2405.04943
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
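The noisy-word filter admits a compact sketch: score each word token against all patch token embeddings and raise the masking probability of words whose best patch similarity is low. The median threshold and probability values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def word_mask_probs(word_emb: torch.Tensor, patch_emb: torch.Tensor,
                    base_p: float = 0.15, boost: float = 0.5) -> torch.Tensor:
    """word_emb: (T, D); patch_emb: (P, D) -> per-word masking probability."""
    w = F.normalize(word_emb, dim=-1)
    p = F.normalize(patch_emb, dim=-1)
    best_sim = (w @ p.T).max(dim=1).values         # (T,) best patch match per word
    # Words poorly grounded in the image get masked more often next epoch.
    return torch.where(best_sim < best_sim.median(),
                       torch.full_like(best_sim, base_p + boost),
                       torch.full_like(best_sim, base_p))

probs = word_mask_probs(torch.randn(12, 256), torch.randn(196, 256))
```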
https://arxiv.org/abs/2405.04940
Weakly supervised semantic segmentation (WSSS) aims to learn a semantic segmentation model with only image-level tags. Despite intensive research on deep learning approaches over the past decade, there is still a significant performance gap between WSSS and fully supervised semantic segmentation. Most current WSSS methods focus on limited single-image (pixel-wise) information while ignoring valuable inter-image (semantic-wise) information. From this perspective, a novel end-to-end WSSS framework called DSCNet is developed along with two innovations: i) pixel-wise group contrast and semantic-wise graph contrast are proposed and introduced into the WSSS framework; ii) a novel dual-stream contrastive learning (DSCL) mechanism is designed to jointly handle pixel-wise and semantic-wise context information for better WSSS performance. Specifically, the pixel-wise group contrast learning (PGCL) and semantic-wise graph contrast learning (SGCL) tasks form a more comprehensive solution. Extensive experiments on the PASCAL VOC and MS COCO benchmarks verify the superiority of DSCNet over SOTA approaches and baseline models.
https://arxiv.org/abs/2405.04913
Medical Image Synthesis (MIS) plays an important role in intelligent medicine, greatly reducing the economic and time costs of medical diagnosis. However, due to the complexity of medical images and the similar characteristics of different tissue cells, existing methods face great challenges in maintaining biological consistency. To this end, we propose the Hybrid Augmented Generative Adversarial Network (HAGAN) to preserve the authenticity of structural textures and tissue cells. HAGAN contains an Attention Mixed (AttnMix) Generator, a Hierarchical Discriminator, and a Reverse Skip Connection between the discriminator and generator. The AttnMix consistency differentiable regularization encourages the perception of structural and textural variations between real and fake images, improving the pathological integrity of synthetic images and the accuracy of features in local areas. The Hierarchical Discriminator introduces pixel-by-pixel discriminant feedback to the generator, enhancing the saliency and discriminability of global and local details simultaneously. The Reverse Skip Connection further improves the accuracy of fine details by fusing real and synthetic distribution features. Our experimental evaluations on three datasets of different scales, i.e., COVID-CT, ACDC, and BraTS2018, demonstrate that HAGAN outperforms existing methods and achieves state-of-the-art performance at both high and low resolutions.
https://arxiv.org/abs/2405.04902
Emotion recognition is an important part of affective computing. Extracting emotional cues from human gaits yields benefits such as natural interaction, a nonintrusive nature, and remote detection. Recently, self-supervised learning techniques have offered a practical solution to the scarcity of labeled data in gait-based emotion recognition. However, due to the limited diversity of gaits and the incompleteness of skeleton feature representations, existing contrastive learning methods are usually inefficient at acquiring gait emotions. In this paper, we propose a contrastive learning framework utilizing selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method for the gait emotion recognition task, comprising upper-body jitter and a random spatiotemporal mask. The goal of SSA is to generate more diverse and targeted positive samples and prompt the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that facilitates the integration of cross-domain information to acquire topological structural and global adaptive features. Finally, we implement a distributional divergence minimization loss to supervise representation learning for both the generally and the strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms state-of-the-art methods under different evaluation protocols.
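The two named augmentations are simple enough to sketch for a skeleton sequence of shape (T, J, 3); the joint indices, jitter magnitude, and mask ratio below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ssa(seq: np.ndarray, upper_joints, rng, jitter=0.02, mask_ratio=0.1):
    out = seq.copy()
    # Upper-body jitter: small Gaussian noise on the selected joints only.
    out[:, upper_joints] += rng.normal(0, jitter, out[:, upper_joints].shape)
    # Random spatio-temporal mask: zero a contiguous frame window of one joint.
    span = max(1, int(seq.shape[0] * mask_ratio))
    t0 = rng.integers(0, seq.shape[0] - span + 1)
    j0 = rng.integers(0, seq.shape[1])
    out[t0:t0 + span, j0] = 0.0
    return out

rng = np.random.default_rng(0)
aug = ssa(np.random.rand(60, 16, 3), upper_joints=[0, 1, 2, 3], rng=rng)
```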
https://arxiv.org/abs/2405.04900
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Existing approaches have shown that diffusion models can generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make real-time execution difficult, causing such approaches to struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to give high performance on image completion tasks. We present a series of experiments, spanning multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. This paper shows that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
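Conditional inpainting on a range-image representation can be sketched with a RePaint-style conditioning step, which is one standard reading of "conditional inpainting masks": known sparse beams are clamped to the condition at every reverse step while masked beams are denoised. The 4x beam-subsampling pattern below is illustrative.

```python
import torch

dense_gt = torch.rand(1, 1, 64, 1024)           # toy 64-beam range image
mask = torch.zeros_like(dense_gt)
mask[:, :, ::4, :] = 1.0                         # observed sparse 16-beam subset

def conditioned_step(denoised: torch.Tensor) -> torch.Tensor:
    # Clamp known pixels to the sparse condition; keep model output elsewhere.
    return mask * dense_gt + (1 - mask) * denoised

x = torch.randn_like(dense_gt)                   # start from pure noise
x = conditioned_step(denoised=0.9 * x)           # one illustrative reverse step
```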
https://arxiv.org/abs/2405.04889
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at this https URL.
https://arxiv.org/abs/2405.04867
Owing to the scarcity of most attributes, realistic pedestrian attribute datasets exhibit unduly skewed data distributions, which give rise to two types of model failure: (1) label imbalance: model predictions lean heavily towards the majority labels; (2) semantics imbalance: the model easily overfits under-represented attributes due to their insufficient semantic diversity. To achieve perfect label balancing, we propose a novel framework that successfully decouples label-balanced data re-sampling from the curse of attribute co-occurrence, i.e., we equalize the sampling prior of an attribute without biasing that of the co-occurring others. To diversify attribute semantics and mitigate feature noise, we propose a Bayesian feature augmentation method to introduce true in-distribution novelty. Handling both imbalances jointly, our work achieves the best accuracy on various popular benchmarks, and importantly, with a minimal computational budget.
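A toy numeric illustration of the co-occurrence curse that the framework decouples: naive inverse-frequency resampling toward one rare attribute also inflates the prior of whatever attributes co-occur with it. All numbers below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.random((10000, 2)) < np.array([0.05, 0.5])   # rare vs. common attribute
labels[:, 1] |= labels[:, 0]                    # force co-occurrence with the rare one

w = np.where(labels[:, 0], 1 / 0.05, 1 / 0.95)  # balance attribute 0 only
w /= w.sum()
resampled = labels[rng.choice(len(labels), 10000, p=w)]
print(resampled.mean(axis=0))                   # attr 0 ~0.5, but attr 1 is biased up
```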
https://arxiv.org/abs/2405.04858
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
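The weight-decomposition idea can be hedged into a sketch: one full-rank weight shared across modalities plus a cheap low-rank update per condition type, so adding a modality adds few trainable parameters. This is an assumed LoRA-style decomposition for illustration, not FlexEControl's actual scheme.

```python
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    def __init__(self, d: int, n_modalities: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(d, d)                     # shared across modalities
        self.A = nn.Parameter(torch.randn(n_modalities, d, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_modalities, rank, d))

    def forward(self, x: torch.Tensor, m: int) -> torch.Tensor:
        # Shared projection plus a modality-specific low-rank correction.
        return self.shared(x) + x @ self.A[m] @ self.B[m]

layer = DecomposedLinear(d=320, n_modalities=3)
y = layer(torch.randn(4, 320), m=1)                       # e.g., the edge-map branch
```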
https://arxiv.org/abs/2405.04834
The PD-L1 rate, the number of PD-L1-positive tumor cells over the total number of tumor cells, is an important metric for immunotherapy. This metric is recorded as diagnostic information alongside pathological images. In this paper, we propose a proportion estimation method requiring only a small amount of cell-level annotation together with proportion annotations, both of which can be easily collected. Since the PD-L1 rate is calculated from 'tumor cells' only, excluding 'non-tumor cells', we first detect tumor cells with a detection model. Then, we estimate the PD-L1 proportion by introducing a masking technique into 'learning from label proportions'. In addition, we propose a weighted focal proportion loss to address data imbalance. Experiments using clinical data demonstrate the effectiveness of our method, which achieved the best performance in our comparisons.
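One plausible form of a weighted focal proportion loss, written as a sketch: compare the mean positive probability over detected tumor cells with the annotated bag proportion, and focus the gradient on hard examples focal-style. The exact formulation in the paper may differ.

```python
import torch

def focal_proportion_loss(pred_pos: torch.Tensor, target_rate: torch.Tensor,
                          weight: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """pred_pos: (B, N) per-tumor-cell positive probs; target_rate, weight: (B,)."""
    rate = pred_pos.mean(dim=1)                            # predicted PD-L1 rate
    err = (rate - target_rate).abs().clamp(1e-6, 1 - 1e-6)
    # Focal-style: large proportion errors dominate; weight handles imbalance.
    return (weight * err ** gamma * -torch.log(1 - err)).mean()

loss = focal_proportion_loss(torch.rand(8, 100), torch.rand(8), torch.ones(8))
```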
https://arxiv.org/abs/2405.04815
In the realm of robotics, the quest for real-world autonomy, capable of executing large-scale and long-term operations, has positioned place recognition (PR) as a cornerstone technology. Despite the PR community's remarkable strides over the past two decades, garnering attention from fields like computer vision and robotics, the development of PR methods that sufficiently support real-world robotic systems remains a challenge. This paper aims to bridge this gap by highlighting the crucial role of PR within the framework of Simultaneous Localization and Mapping (SLAM) 2.0. This new phase in robotic navigation calls for scalable, adaptable, and efficient PR solutions built on advanced artificial intelligence (AI) technologies. To this end, we provide a comprehensive review of current state-of-the-art (SOTA) advancements in PR, alongside the remaining challenges, and underscore its broad applications in robotics. The paper begins with an exploration of PR's formulation and key research challenges. We extensively review the literature, focusing on methods for place representation and solutions to various PR challenges. Applications showcasing PR's potential in robotics, key PR datasets, and open-source libraries are discussed. We also highlight our open-source package, aimed at new development and benchmarking for general PR. We conclude with a discussion of PR's future directions, accompanied by a summary of the literature covered and access to our open-source library, available to the robotics community at: this https URL.
https://arxiv.org/abs/2405.04812
HiRISE (High-Resolution Imaging Science Experiment) is a camera onboard the Mars Reconnaissance Orbiter responsible for photographing vast areas of the Martian surface in unprecedented detail. It can capture millions of incredible close-up images in minutes. However, Mars suffers from frequent regional and local dust storms that hamper this data-collection process and pipeline, resulting in lost effort and crucial flight time. Removing these images manually would require a large amount of manpower. I automatically filter out images obstructed by atmospheric dust using a Dust Image Classifier fine-tuned from ResNet-50, achieving an accuracy of 94.05%. To further facilitate seamless filtering of images, I design a prediction pipeline that classifies and stores these dusty patches. I also denoise partially obstructed images using an autoencoder-based denoiser and a Pix2Pix GAN, with SSIM indices of 0.75 and 0.99, respectively.
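The classifier itself is a standard fine-tuning job; here is a minimal torchvision sketch with a toy batch, assuming a binary dusty/clear labeling (the data pipeline and hyperparameters are placeholders).

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)        # dusty vs. clear head

optim = torch.optim.Adam(model.parameters(), lr=1e-4)
x, y = torch.rand(8, 3, 224, 224), torch.randint(0, 2, (8,))   # toy batch
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optim.step()

with torch.no_grad():
    dusty = model(x).softmax(dim=1)[:, 1] > 0.5      # route to the dusty bin
```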
https://arxiv.org/abs/2405.04807
Satellite imagery has played an increasingly important role in post-disaster building damage assessment. Unfortunately, current methods still rely on manual visual interpretation, which is often time-consuming and can yield very low accuracy. To address the limitations of manual interpretation, there has been a significant increase in efforts to automate the process. We present a solution that performs the two most important tasks in building damage assessment, segmentation and classification, through deep-learning models. We show our results submitted as part of the xView2 Challenge, a competition to design better models for identifying buildings and their damage level after exposure to multiple kinds of natural disasters. Our best model couples a building-identification semantic segmentation convolutional neural network (CNN) with a building-damage classification CNN, achieving a combined F1 score of 0.66 and surpassing the xView2 challenge baseline F1 score of 0.28. We find that although our model identified buildings with relatively high accuracy, building damage classification across various disaster types remains difficult, owing to the visual similarity between different damage levels and the differing damage distributions across disaster types. This highlights that a probabilistic prior estimate of disaster damage may be important for obtaining accurate predictions.
https://arxiv.org/abs/2405.04800
Change Detection (CD) aims to identify pixels with semantic changes between images. However, annotating massive numbers of pixel-level images is labor-intensive and costly, especially for multi-temporal images, which require pixel-wise comparison by human experts. Given the excellent performance of vision-language models (VLMs) on zero-shot and open-vocabulary tasks with prompt-based reasoning, it is promising to utilize VLMs for better CD under limited labeled data. In this paper, we propose a VLM-guided semi-supervised CD method, namely DiffMatch. The insight of DiffMatch is to synthesize free change labels using VLMs to provide additional supervision signals for unlabeled data. However, almost all current VLMs are designed for single-temporal images and cannot be directly applied to bi- or multi-temporal images. Motivated by this, we first propose a VLM-based mixed change event generation (CEG) strategy to yield pseudo-labels for unlabeled CD data. Since the additional supervision signals provided by these VLM-driven pseudo-labels may conflict with the pseudo-labels from the consistency regularization paradigm (e.g., FixMatch), we propose a dual projection head to disentangle the different signal sources. Further, we explicitly decouple the bi-temporal semantic representations through two auxiliary segmentation decoders, which are also guided by the VLM. Finally, to make the model capture change representations more adequately, we introduce metric-aware supervision via a feature-level contrastive loss in the auxiliary branches. Extensive experiments show the advantage of DiffMatch: for instance, DiffMatch improves the FixMatch baseline by +5.3 IoU on WHU-CD and by +2.4 IoU on LEVIR-CD with 5% labels. In addition, our CEG strategy, used in an unsupervised manner, achieves performance far superior to state-of-the-art unsupervised CD methods.
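Because the CEG strategy builds bi-temporal pseudo-labels out of single-temporal VLM predictions, a hedged sketch is possible: run a segmenter on each date independently and declare change where the confident class maps disagree. The confidence threshold and ignore-label convention are assumptions.

```python
import numpy as np

def ceg_pseudo_label(seg_t1: np.ndarray, seg_t2: np.ndarray,
                     conf_t1: np.ndarray, conf_t2: np.ndarray,
                     tau: float = 0.8) -> np.ndarray:
    """Mark change where classes differ and both single-date predictions are sure."""
    change = seg_t1 != seg_t2
    confident = (conf_t1 > tau) & (conf_t2 > tau)
    label = np.full(seg_t1.shape, 255, dtype=np.uint8)   # 255 = ignore pixel
    label[confident] = change[confident].astype(np.uint8)
    return label

rng = np.random.default_rng(0)
lbl = ceg_pseudo_label(rng.integers(0, 5, (64, 64)), rng.integers(0, 5, (64, 64)),
                       rng.random((64, 64)), rng.random((64, 64)))
```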
https://arxiv.org/abs/2405.04788