Modeling the shape of garments has received much attention, but most existing approaches assume the garments to be worn by someone, which constrains the range of shapes they can take. In this work, we address shape recovery when garments are being manipulated instead of worn, which gives rise to an even larger range of possible shapes. To this end, we leverage the implicit sewing patterns (ISP) model for garment modeling and extend it by adding a diffusion-based deformation prior to represent these shapes. To recover 3D garment shapes from incomplete 3D point clouds acquired when the garment is folded, we map the points to UV space, in which our priors are learned, to produce partial UV maps, and then fit the priors to recover complete UV maps and 2D to 3D mappings. Experimental results demonstrate the superior reconstruction accuracy of our method compared to previous ones, especially when dealing with large non-rigid deformations arising from the manipulations.
https://arxiv.org/abs/2405.10934
Digital Subtraction Angiography (DSA) is one of the gold standards for diagnosing vascular disease. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive insights into blood flow and can be used to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. However, sparse-view DSA reconstruction, aimed at reducing radiation dosage, remains underexplored in the research community. The dynamic blood flow and the insufficient input of sparse-view DSA images present significant challenges to the 3D vessel reconstruction task. In this study, we propose a time-agnostic vessel probability field to solve this problem effectively. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the vessel probability field. Functioning as a dynamic mask, the vessel probability provides proper gradients for both the static and dynamic fields, adapting to different scene types. This mechanism facilitates a self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the disparity between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training to achieve better geometry, and (2) a temporally perturbed rendering loss to enforce temporal consistency. Experimental results demonstrate superior quality in both 3D vessel reconstruction and 2D view synthesis.
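The composition at the heart of this formulation reduces to a per-point convex combination of the two fields. A minimal sketch, assuming per-point arrays named p_vessel, mu_static, and mu_dynamic (the names and shapes are our illustration, not the authors' code):

```python
def composite_attenuation(p_vessel, mu_static, mu_dynamic):
    """Complementary weighted combination of attenuation fields.

    p_vessel:   time-agnostic vessel probability in [0, 1]
    mu_static:  static background attenuation
    mu_dynamic: time-dependent contrast-agent attenuation
    (works on scalars, NumPy arrays, or PyTorch tensors alike)
    """
    # The vessel probability acts as a soft dynamic mask: gradients flow
    # to the dynamic field inside vessels and to the static field
    # elsewhere, enabling the self-supervised decomposition.
    return p_vessel * mu_dynamic + (1.0 - p_vessel) * mu_static
```

The composited attenuation would then be integrated along X-ray paths to synthesize the projections that are compared against the captured DSA images.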
https://arxiv.org/abs/2405.10705
Object-Centric Learning (OCL) seeks to enable neural networks to identify individual objects in visual scenes, which is crucial for interpretable visual comprehension and reasoning. Most existing OCL models adopt auto-encoding structures and learn to decompose visual scenes through specially designed inductive biases, which cause the model to miss small objects during reconstruction. Reverse hierarchy theory proposes that human vision corrects perception errors through a top-down visual pathway that returns to bottom-level neurons and acquires more detailed information. Inspired by this, we propose the Reverse Hierarchy Guided Network (RHGNet), which introduces a top-down pathway that works in different ways during training and inference. This pathway guides bottom-level features with top-level object representations during training, and incorporates information from bottom-level features into perception during inference. Our model achieves SOTA performance on several commonly used datasets, including CLEVR, CLEVRTex and MOVi-C. We demonstrate experimentally that our method promotes the discovery of small objects and also generalizes well to complex real-world scenes. Code will be available at https://anonymous.4open.science/r/RHGNet-6CEF.
https://arxiv.org/abs/2405.10598
3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the limited generalizability caused by sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation; 2) utilization of a mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) adoption of context-aware self-training loss functions in the pretraining stage to complement LiDAR supervision, involving the re-rendering of 2D depth maps from 3D occupancy features and leveraging an image reconstruction loss to obtain denser depth supervision beyond sparse LiDAR ground truths. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the most lightweight image backbone among current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experimentation also demonstrates the consistent superiority of our method over baselines and alternative approaches.
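The depth re-rendering in step 3) can be sketched with standard volume rendering over per-ray occupancy samples; the sampling scheme and names below are illustrative assumptions, not the GEOcc implementation:

```python
import torch

def render_depth_from_occupancy(occ_probs, t_vals):
    """Re-render a 2D depth map from occupancy samples along camera rays.

    occ_probs: (R, S) per-ray occupancy probabilities at S samples
    t_vals:    (R, S) depths of those samples along each ray
    """
    alphas = occ_probs
    # Transmittance: probability the ray survives all previous samples.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas[:, :-1]], dim=1),
        dim=1,
    )
    weights = alphas * trans                  # first-termination probability
    depth = (weights * t_vals).sum(dim=1)     # (R,) expected depth per ray
    return depth  # supervised by sparse LiDAR plus image-reconstruction loss
```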
https://arxiv.org/abs/2405.10591
Recent advances in multi-view camera-only 3D object detection rely either on an accurate reconstruction of bird's-eye-view (BEV) 3D features or on traditional 2D perspective-view (PV) image features. While both have their own pros and cons, few works have found a way to stitch them together to benefit from "the best of both worlds". To this end, we explore a duo-space (i.e., BEV and PV) 3D perception framework, in conjunction with several useful duo-space fusion strategies that allow effective aggregation of the two feature representations. To the best of our knowledge, our proposed method, DuoSpaceNet, is the first to leverage two distinct feature spaces, and it achieves state-of-the-art 3D object detection and BEV map segmentation results on the nuScenes dataset.
https://arxiv.org/abs/2405.10577
Automated driving fundamentally requires knowledge of the surrounding geometry of the scene. Modern approaches use only captured images to predict occupancy maps that represent this geometry. Training these approaches requires accurate data that may be acquired with the help of LiDAR scanners. We show that the techniques used by current benchmarks and training datasets to convert LiDAR scans into occupancy grid maps yield maps of very low quality, and we subsequently present a novel approach using evidence theory that yields more accurate reconstructions. We demonstrate that these are superior by a large margin, both qualitatively and quantitatively, and that we additionally obtain meaningful uncertainty estimates. When converting the occupancy maps back to depth estimates and comparing them with the raw LiDAR measurements, our method yields an MAE improvement of 30% to 52% on nuScenes and 53% on Waymo over other occupancy ground-truth data. Finally, we use the improved occupancy maps to train a state-of-the-art occupancy prediction method and demonstrate that it improves the MAE by 25% on nuScenes.
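As a reference for the evidence-theoretic ingredient, a generic per-cell fusion under Dempster's rule with the frame {occupied, free} and leftover mass on "unknown" might look as follows; this is textbook evidence theory, not necessarily the paper's exact update rule:

```python
def combine_evidence(m1, m2):
    """Dempster's rule for one occupancy cell.

    Each mass function is a tuple (m_occ, m_free); the remaining mass
    1 - m_occ - m_free sits on 'unknown' and doubles as an uncertainty
    estimate after fusion. Assumes the two sources are not in total
    conflict (k > 0).
    """
    o1, f1 = m1
    o2, f2 = m2
    u1 = 1.0 - o1 - f1
    u2 = 1.0 - o2 - f2
    conflict = o1 * f2 + f1 * o2          # contradictory evidence mass
    k = 1.0 - conflict                    # normalization constant
    occ = (o1 * o2 + o1 * u2 + u1 * o2) / k
    free = (f1 * f2 + f1 * u2 + u1 * f2) / k
    return occ, free                      # unknown mass: 1 - occ - free
```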
https://arxiv.org/abs/2405.10575
Single image super-resolution (SR) is an established pixel-level vision task aimed at reconstructing a high-resolution image from its degraded low-resolution counterpart. Despite the notable advancements achieved by leveraging deep neural networks for SR, most existing deep learning architectures feature an extensive number of layers, leading to high computational complexity and substantial memory demands. These issues become particularly pronounced in the context of infrared image SR, where infrared devices often have stringent storage and computational constraints. To mitigate these challenges, we introduce a novel, efficient, and precise single infrared image SR model, termed the Lightweight Information Split Network (LISN). The LISN comprises four main components: shallow feature extraction, deep feature extraction, dense feature fusion, and high-resolution infrared image reconstruction. A key innovation within this model is the introduction of the Lightweight Information Split Block (LISB) for deep feature extraction. The LISB employs a sequential process to extract hierarchical features, which are then aggregated based on the relevance of the features under consideration. By integrating channel splitting and shift operations, the LISB successfully strikes an optimal balance between enhanced SR performance and a lightweight framework. Comprehensive experimental evaluations reveal that the proposed LISN achieves superior performance over contemporary state-of-the-art methods in terms of both SR quality and model complexity, affirming its efficacy for practical deployment in resource-constrained infrared imaging applications.
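To make the two named ingredients concrete, here is a toy block combining channel splitting with a zero-parameter spatial shift; the abstract does not specify the actual LISB layout, so the structure below is only an assumed illustration:

```python
import torch
import torch.nn as nn

class SplitShiftBlock(nn.Module):
    """Toy block: convolve half the channels, spatially shift the rest."""

    def __init__(self, dim, split_ratio=0.5):
        super().__init__()
        self.keep = int(dim * split_ratio)
        self.conv = nn.Conv2d(self.keep, self.keep, 3, padding=1)
        self.fuse = nn.Conv2d(dim, dim, 1)   # cheap channel re-mixing

    def forward(self, x):
        a, b = torch.split(x, [self.keep, x.shape[1] - self.keep], dim=1)
        a = self.conv(a)                     # compute on only half the channels
        # Zero-cost spatial shift on the bypass branch: quarters of its
        # channels move one pixel up/down/left/right, enlarging the
        # receptive field without extra parameters.
        b = b.clone()
        c = b.shape[1] // 4
        b[:, :c] = torch.roll(b[:, :c], 1, dims=2)
        b[:, c:2 * c] = torch.roll(b[:, c:2 * c], -1, dims=2)
        b[:, 2 * c:3 * c] = torch.roll(b[:, 2 * c:3 * c], 1, dims=3)
        b[:, 3 * c:4 * c] = torch.roll(b[:, 3 * c:4 * c], -1, dims=3)
        return self.fuse(torch.cat([a, b], dim=1))
```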
https://arxiv.org/abs/2405.10561
In computer vision and graphics, the accurate reconstruction of road surfaces is pivotal for various applications, especially in autonomous driving. This paper introduces a novel method leveraging a Multi-Layer Perceptron (MLP) framework to reconstruct the height, color, and semantics of road surfaces from input world coordinates x and y. Our approach, NeRO, uses MLP-based encoding techniques that significantly improve the rendering of complex details, speed up training, and reduce the network size. The effectiveness of this method is demonstrated through its superior performance, which indicates a promising direction for rendering road surfaces with semantics, particularly in applications demanding visualization of road conditions, 4D labeling, and semantic grouping.
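A minimal version of such a field can be written as an MLP over a frequency encoding of (x, y); the layer sizes and the encoding choice below are assumptions for illustration, not the NeRO architecture:

```python
import torch
import torch.nn as nn

class RoadSurfaceField(nn.Module):
    """Map (x, y) to height, RGB color, and semantic logits."""

    def __init__(self, num_classes, n_freqs=10, hidden=128):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 2 * 2 * n_freqs             # sin/cos per frequency per coord
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3 + num_classes),
        )

    def encode(self, xy):                    # xy: (N, 2) world coordinates
        freqs = 2.0 ** torch.arange(self.n_freqs, device=xy.device)
        ang = xy[..., None] * freqs          # (N, 2, F)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

    def forward(self, xy):
        out = self.mlp(self.encode(xy))
        height = out[:, :1]                  # surface elevation
        rgb = out[:, 1:4].sigmoid()          # color in [0, 1]
        sem = out[:, 4:]                     # per-class logits
        return height, rgb, sem
```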
https://arxiv.org/abs/2405.10554
Federated Learning (FL) has emerged as a leading paradigm for decentralized, privacy-preserving machine learning training. However, recent research on gradient inversion attacks (GIAs) has shown that gradient updates in FL can leak information about private training samples. While existing surveys on GIAs have focused on the honest-but-curious server threat model, there is a dearth of research categorizing attacks under the realistic and far more privacy-infringing cases of malicious servers and clients. In this paper, we present a survey and novel taxonomy of GIAs that emphasizes FL threat models, particularly those of malicious servers and clients. We first formally define GIAs and contrast conventional attacks with those of malicious attackers. We then summarize existing honest-but-curious attack strategies, corresponding defenses, and evaluation metrics. Critically, we dive into attacks with malicious servers and clients to highlight how they break existing FL defenses, focusing specifically on reconstruction methods, target model architectures, target data, and evaluation metrics. Lastly, we discuss open problems and future research directions.
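As background for the honest-but-curious setting the survey summarizes, the canonical gradient inversion attack (Deep Leakage from Gradients, Zhu et al., 2019) optimizes dummy data so that its gradients match the observed client update. A compact sketch, with illustrative hyperparameters:

```python
import torch

def gradient_inversion(model, loss_fn, target_grads, x_shape, y_shape, steps=30):
    """Recover training data from observed gradients (DLG-style).

    target_grads: gradients the server observed for one client update
    loss_fn:      the training loss, e.g. cross-entropy accepting soft labels
    """
    x = torch.randn(x_shape, requires_grad=True)   # dummy input
    y = torch.randn(y_shape, requires_grad=True)   # dummy (soft) label
    opt = torch.optim.LBFGS([x, y])

    def closure():
        opt.zero_grad()
        dummy_grads = torch.autograd.grad(
            loss_fn(model(x), y.softmax(dim=-1)),
            model.parameters(),
            create_graph=True,
        )
        # Match the dummy gradients to the observed ones.
        grad_loss = sum(((dg - tg) ** 2).sum()
                        for dg, tg in zip(dummy_grads, target_grads))
        grad_loss.backward()
        return grad_loss

    for _ in range(steps):
        opt.step(closure)
    return x.detach(), y.detach()
```

Malicious-server attacks, the survey's focus, strengthen this baseline by additionally manipulating the model or protocol rather than passively matching gradients.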
https://arxiv.org/abs/2405.10376
In this work, we recover the underlying 3D structure of non-geometrically consistent scenes. We focus our analysis on hand-drawn images from cartoons and anime. Many cartoons are created by artists without a 3D rendering engine, which means that any new image of a scene is hand-drawn. Hand-drawn images are usually faithful representations of the world, but only in a qualitative sense, since it is difficult for humans to draw multiple perspectives of an object or scene in a 3D-consistent way. Nevertheless, people can easily perceive 3D scenes from inconsistent inputs! In this work, we correct for 2D drawing inconsistencies to recover a plausible 3D structure such that the newly warped drawings are consistent with each other. Our pipeline consists of a user-friendly annotation tool, camera pose estimation, and image deformation to recover a dense structure. Our method warps images to obey a perspective camera model, enabling our aligned results to be plugged into novel-view synthesis reconstruction methods to experience cartoons from viewpoints never drawn before. Our project page is https://toon3d.studio/.
https://arxiv.org/abs/2405.10320
Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single-image and few-view 3D scene creation. See our project page for results and interactive demos at this https URL.
https://arxiv.org/abs/2405.10314
We identify an issue in multi-task learnable compression, in which a representation learned for one task does not positively contribute to the rate-distortion performance of a different task as much as expected, given the estimated amount of information available in it. We interpret this issue using the predictive $\mathcal{V}$-information framework. In learnable scalable coding, previous work increased the utilization of side-information for input reconstruction by also rewarding input reconstruction when learning this shared representation. We evaluate the impact of this idea in the context of input reconstruction more rigorously and extend it to other computer vision tasks. We perform experiments using representations trained for object detection on COCO 2017 and for depth estimation on the Cityscapes dataset, and use them to assist in image reconstruction and semantic segmentation tasks. The results show considerable improvements in the rate-distortion performance of the assisted tasks. Moreover, using the proposed representations, the performance of the base tasks is also improved. The results suggest that the proposed method induces simpler representations that are more compatible with downstream processes.
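For reference, the predictive $\mathcal{V}$-information of Xu et al. (2020), the framework invoked above, measures the information about $Y$ that a restricted predictive family $\mathcal{V}$ can actually extract from $X$:

```latex
% Predictive V-information: usable information about Y contained in X,
% relative to a predictive family V of allowed models f.
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y}\big[-\log f[x](y)\big],
\qquad
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)
```

Unlike Shannon mutual information, $I_{\mathcal{V}}$ can be small even when the information is nominally present, which is exactly the gap between estimated and realized cross-task utility that the abstract describes.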
https://arxiv.org/abs/2405.10244
Active reconstruction techniques enable robots to autonomously collect scene data for full coverage, relieving users of the tedious and time-consuming data capture process. However, because they are designed around unsuitable scene representations, existing methods produce unrealistic reconstruction results or cannot evaluate reconstruction quality online. Thanks to recent advancements in explicit radiance field technology, online active high-fidelity reconstruction has become achievable. In this paper, we propose GS-Planner, a planning framework for active high-fidelity reconstruction using 3D Gaussian Splatting. By extending 3DGS to recognize unobserved regions, we evaluate the reconstruction quality and completeness of the 3DGS map online to guide the robot. We then design a sampling-based active reconstruction strategy to explore the unobserved areas and improve the geometric and textural quality of the reconstruction. To establish a complete robotic active reconstruction system, we choose a quadrotor as the robotic platform for its high agility. We further devise a safety constraint on the 3DGS map to generate executable trajectories for quadrotor navigation. To validate the effectiveness of our method, we conduct extensive experiments and ablation studies in highly realistic simulation scenes.
https://arxiv.org/abs/2405.10142
Event Stream Super-Resolution (ESR) aims to address the challenge of insufficient spatial resolution in event streams, which holds great significance for the application of event cameras in complex scenarios. Previous works on ESR often process positive and negative events in a mixed paradigm. This paradigm limits their ability to model the unique characteristics of each event type and to let the two types mutually refine each other through their correlations. In this paper, we propose a bilateral event mining and complementary network (BMCNet) to fully leverage the potential of each event type while simultaneously capturing the shared information that lets them complement each other. Specifically, we resort to a two-stream network to accomplish comprehensive mining of each type of event individually. To facilitate the exchange of information between the two streams, we propose a bilateral information exchange (BIE) module. This module is embedded layer-wise between the two streams, enabling the effective propagation of hierarchical global information while alleviating the impact of invalid information brought by the inherent characteristics of events. The experimental results demonstrate that our approach outperforms previous state-of-the-art methods in ESR, achieving performance improvements of over 11\% on both real and synthetic datasets. Moreover, our method significantly enhances the performance of event-based downstream tasks such as object recognition and video reconstruction. Our code is available at this https URL.
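An illustrative form of such an exchange between the positive- and negative-event streams is a gated cross-injection; the gating design below is our assumption, not the actual BIE module:

```python
import torch
import torch.nn as nn

class BilateralExchange(nn.Module):
    """Toy bilateral exchange between two event-polarity streams."""

    def __init__(self, dim):
        super().__init__()
        self.gate_p = nn.Conv2d(dim, dim, 1)   # gate computed from receiver
        self.gate_n = nn.Conv2d(dim, dim, 1)
        self.proj_p = nn.Conv2d(dim, dim, 1)   # projection of the sender
        self.proj_n = nn.Conv2d(dim, dim, 1)

    def forward(self, f_pos, f_neg):
        # Each stream receives a gated projection of the other, so shared
        # scene structure propagates while stream-specific noise is damped.
        pos = f_pos + torch.sigmoid(self.gate_p(f_pos)) * self.proj_p(f_neg)
        neg = f_neg + torch.sigmoid(self.gate_n(f_neg)) * self.proj_n(f_pos)
        return pos, neg
```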
https://arxiv.org/abs/2405.10037
Previous unsupervised anomaly detection (UAD) methods often struggle with significant intra-class diversity; i.e., a class in a dataset contains multiple subclasses, which we categorize as Feature-Rich Anomaly Detection Datasets (FRADs). This is evident in applications such as the unified setting and unmanned supermarket scenarios. To address this challenge, we developed MiniMaxAD: a lightweight autoencoder designed to efficiently compress and memorize extensive information from normal images. Our model utilizes a large-kernel convolutional network equipped with a Global Response Normalization (GRN) unit and employs a multi-scale feature reconstruction strategy. The GRN unit significantly increases the upper limit of the network's capacity, while the large-kernel convolution facilitates the extraction of highly abstract patterns, leading to compact normal-feature modeling. Additionally, we introduce an Adaptive Contraction Loss (ADCLoss) tailored to FRADs to overcome the limitations of the global cosine distance loss. MiniMaxAD was comprehensively tested across six challenging UAD benchmarks, achieving state-of-the-art results in four and highly competitive outcomes in the remaining two. Notably, our model achieved a detection AUROC of up to 97.0\% in ViSA under the unified setting. Moreover, it not only achieved state-of-the-art performance in unmanned supermarket tasks but also exhibited an inference speed 37 times faster than the previous best method, demonstrating its effectiveness in complex UAD tasks.
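The GRN unit referenced here is the one introduced in ConvNeXt V2; for channels-last feature maps it can be written as follows (its exact placement inside MiniMaxAD is not shown in the abstract):

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (ConvNeXt V2) for (N, H, W, C) tensors."""

    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        # Global per-channel feature magnitude over the spatial dimensions.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)
        # Divisive normalization across channels promotes feature diversity,
        # raising the capacity ceiling the abstract refers to.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)
        return self.gamma * (x * nx) + self.beta + x   # calibrated residual
```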
https://arxiv.org/abs/2405.09933
Recent advances in generative models trained on large-scale datasets have made it possible to synthesize high-quality samples across various domains. Moreover, the emergence of strong inversion networks enables not only the reconstruction of real-world images but also the modification of their attributes through various editing methods. However, in certain domains related to privacy, e.g., human faces, advanced generative models combined with strong inversion methods can lead to potential misuse. In this paper, we propose an essential yet under-explored task called generative identity unlearning, which steers the model away from generating images of a specific identity. In generative identity unlearning, we target the following objectives: (i) preventing the generation of images with a certain identity, and (ii) preserving the overall quality of the generative model. To satisfy these goals, we propose a novel framework, Generative Unlearning for Any Identity (GUIDE), which prevents the reconstruction of a specific identity by unlearning the generator with only a single image. GUIDE consists of two parts: (i) finding a target point for optimization that un-identifies the source latent code, and (ii) novel loss functions that facilitate the unlearning procedure while minimally affecting the learned distribution. Our extensive experiments demonstrate that our proposed method achieves state-of-the-art performance in the generative machine unlearning task. The code is available at this https URL.
https://arxiv.org/abs/2405.09879
With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging available reconstruction information as much as possible, such as low-resolution (LR) images and the scale factors, is a promising way to enhance assessment performance for SR-IQA without HR images for reference. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images given LR images and scale factors. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, i.e., Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by leveraging the merits of the global modeling of Vision Transformers (ViT) and the local relations of ResNet, and incorporating the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual perception. The combination of the two branches substantially aligns with the human visual system, enabling a comprehensive SR image evaluation. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.
https://arxiv.org/abs/2405.09472
3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions on datasets employed in training and evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers further to explore the potential of open-vocabulary text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
The multi-scale receptive field and the large kernel attention (LKA) module have been shown to significantly improve performance in the lightweight image super-resolution task. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building blocks with multi-scale receptive fields for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose multi-scale blueprint separable convolutions (MBSConv) as a highly efficient building block with a multi-scale receptive field; it focuses on learning multi-scale information, a vital component of discriminative representations. As for the second issue, we revisit the key properties of LKA and find that the direct interaction between adjacent local information and long-distance dependencies is crucial for remarkable performance. Taking this into account, and in order to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module, which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1-D kernels. LCKA enables the direct interaction between adjacent local information and long-distance dependencies in both the horizontal and vertical directions. Besides, LCKA allows the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve reconstruction performance while incurring lower computational complexity and a smaller memory footprint. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
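A sketch of the decomposition: each 2D depth-wise kernel in LKA is replaced by a horizontal and a vertical 1-D depth-wise kernel, turning the quadratic kernel cost into a linear one. The kernel sizes and dilation below are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LCKA(nn.Module):
    """Large coordinate kernel attention via 1-D depth-wise kernels."""

    def __init__(self, dim, k_local=5, k_dilated=7, dilation=3):
        super().__init__()
        p = k_local // 2
        pd = dilation * (k_dilated // 2)
        # Local 1-D depth-wise convolutions (horizontal, then vertical).
        self.h = nn.Conv2d(dim, dim, (1, k_local), padding=(0, p), groups=dim)
        self.v = nn.Conv2d(dim, dim, (k_local, 1), padding=(p, 0), groups=dim)
        # Dilated 1-D depth-wise convolutions for long-range dependencies.
        self.hd = nn.Conv2d(dim, dim, (1, k_dilated), padding=(0, pd),
                            dilation=(1, dilation), groups=dim)
        self.vd = nn.Conv2d(dim, dim, (k_dilated, 1), padding=(pd, 0),
                            dilation=(dilation, 1), groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)       # point-wise channel mixing

    def forward(self, x):
        attn = self.pw(self.vd(self.hd(self.v(self.h(x)))))
        return x * attn                        # attention re-weighting, as in LKA
```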
https://arxiv.org/abs/2405.09353
Anomaly detection and localization without any manual annotations or prior knowledge is a challenging task in the unsupervised learning setting. Existing works achieve excellent performance in anomaly detection, but with complex networks or cumbersome pipelines. To address this issue, this paper explores a simple but effective architecture for anomaly detection. It consists of a well-pre-trained encoder that extracts hierarchical feature representations and a decoder that reconstructs these intermediate features from the encoder. In particular, it does not require any data augmentation or anomalous images for training. Anomalies can be detected when the decoder fails to reconstruct the features well; the errors of hierarchical feature reconstruction are then aggregated into an anomaly map to achieve anomaly localization. Comparing the features of the encoder and decoder leads to more accurate and robust localization results than the single-feature or pixel-by-pixel comparisons of conventional works. Experimental results show that the proposed method outperforms state-of-the-art methods on the MNIST, Fashion-MNIST, CIFAR-10, and MVTec Anomaly Detection datasets in both anomaly detection and localization.
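The aggregation step can be sketched as summing upsampled per-scale feature-reconstruction errors; the per-location cosine distance used below is an assumed (though common) choice of error measure:

```python
import torch
import torch.nn.functional as F

def anomaly_map(enc_feats, dec_feats, out_size):
    """Aggregate hierarchical reconstruction errors into a pixel map.

    enc_feats, dec_feats: lists of (N, C_l, H_l, W_l) feature maps from
    the pre-trained encoder and the reconstructing decoder.
    """
    amap = 0.0
    for fe, fd in zip(enc_feats, dec_feats):
        # 1 - cosine similarity per spatial location: high where the
        # decoder failed to reconstruct the encoder feature.
        err = 1.0 - F.cosine_similarity(fe, fd, dim=1, eps=1e-6)  # (N, H_l, W_l)
        amap = amap + F.interpolate(err.unsqueeze(1), size=out_size,
                                    mode="bilinear", align_corners=False)
    return amap  # (N, 1, H, W); its max over pixels gives an image-level score
```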
https://arxiv.org/abs/2405.09148