Latent Diffusion Models (LDMs) enable a wide range of applications but raise ethical concerns regarding illegal utilization. Adding watermarks to generative model outputs is a vital technique for copyright tracking and for mitigating the potential risks of AI-generated content. However, post-hoc watermarking techniques are susceptible to evasion, and existing watermarking methods for LDMs can only embed fixed messages: altering the watermark message requires retraining the model, and the watermark's stability is affected by model updates and iterations. Furthermore, current reconstruction-based watermark-removal techniques built on variational autoencoders (VAEs) and diffusion models can remove a significant portion of watermarks. We therefore propose a novel technique called DiffuseTrace, whose goal is to semantically embed invisible watermarks in all generated images for future detection. The method trains an encoder-decoder model to establish a unified representation of the initial latent variables and the watermark information: the encoder embeds the watermark into the initial latent variables, which are integrated into the sampling process, and the watermark is extracted by reversing the diffusion process and applying the decoder. DiffuseTrace does not rely on fine-tuning any diffusion-model components; the watermark is embedded semantically in image space without compromising image quality, and the encoder-decoder can be used as a plug-in for arbitrary diffusion models. We validate the effectiveness and flexibility of DiffuseTrace through experiments. DiffuseTrace holds an unprecedented advantage in combating the latest attacks based on variational autoencoders and diffusion models.
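The trained encoder-decoder pair is the paper's actual mechanism; as a purely illustrative stand-in, the following numpy sketch embeds a hypothetical 64-bit message into the sign structure of an initial latent and recovers it by block-wise averaging (the real method learns this mapping and keeps the latent approximately Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=64)  # hypothetical 64-bit watermark message

# Toy "encoder": split a 64x64 initial latent into 8x8 blocks and force the
# sign of each block's noise to match its watermark bit. The magnitudes stay
# half-normal, so the latent remains roughly standard-normal in scale.
noise = np.abs(rng.standard_normal((64, 64)))
signs = (2 * bits - 1).reshape(8, 8)                 # 0/1 bits -> -1/+1 signs
latent = (noise.reshape(8, 8, 8, 8) * signs[:, None, :, None]).reshape(64, 64)

# Toy "decoder": recover each bit from the sign of its block's mean.
block_means = latent.reshape(8, 8, 8, 8).mean(axis=(1, 3))
recovered = (block_means > 0).astype(int).reshape(-1)
```

In DiffuseTrace, decoding additionally requires inverting the diffusion sampling process to approximate the initial latent before the decoder is applied.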
https://arxiv.org/abs/2405.02696
Neuron reconstruction, one of the fundamental tasks in neuroscience, rebuilds neuronal morphology from 3D light microscope imaging data. It plays a critical role in analyzing the structure-function relationship of neurons in the nervous system. However, due to the scarcity of neuron datasets and high-quality SWC annotations, it is still challenging to develop robust segmentation methods for single neuron reconstruction. To address this limitation, we aim to distill consensus knowledge from massive natural image data to aid the segmentation model in learning the complex neuron structures. Specifically, in this work, we propose a novel training paradigm that leverages a 2D Vision Transformer model pre-trained on large-scale natural images to initialize our Transformer-based 3D neuron segmentation model with a tailored 2D-to-3D weight transferring strategy. Our method builds a knowledge-sharing connection between the abundant natural-image and the scarce neuron-image domains to improve 3D neuron segmentation in a data-efficient manner. Evaluated on a popular benchmark, BigNeuron, our method enhances neuron segmentation performance by 8.71% over the model trained from scratch with the same amount of training samples.
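The abstract does not spell out the tailored transfer strategy; one common baseline for moving 2D pre-trained weights into a 3D model is I3D-style kernel inflation, sketched here (the function name and shapes are illustrative, not taken from the paper):

```python
import numpy as np

def inflate_2d_to_3d(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a 2D conv kernel (C_out, C_in, k, k) to a 3D kernel
    (C_out, C_in, depth, k, k) by replicating it along the new depth
    axis and dividing by depth, so the response to a depth-constant
    input matches the original 2D response (I3D-style inflation)."""
    return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth

# Hypothetical pre-trained 2D kernel: 16 output channels, 3 input, 3x3.
w2d = np.random.default_rng(1).standard_normal((16, 3, 3, 3))
w3d = inflate_2d_to_3d(w2d, depth=5)
```

Summing the inflated kernel over its depth axis recovers the original 2D kernel, which is the invariant this initialization preserves.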
https://arxiv.org/abs/2405.02686
Hand-object manipulation is an important interaction motion in our daily activities. We faithfully reconstruct this motion from a single RGBD camera with a novel deep reinforcement learning method that leverages physics. Firstly, we propose object compensation control, which establishes direct object control to make network training more stable. Meanwhile, by leveraging the compensation force and torque, we seamlessly upgrade the simple point contact model to a more physically plausible surface contact model, further improving reconstruction accuracy and physical correctness. Experiments indicate that, without involving any heuristic physical rules, this work still successfully incorporates physics into the reconstruction of hand-object interactions, which are complex motions that are hard to imitate with deep reinforcement learning. Our code and data are available at this https URL.
https://arxiv.org/abs/2405.02676
Active learning in 3D scene reconstruction has been widely studied, as selecting informative training views is critical for the reconstruction. Recently, Neural Radiance Field (NeRF) variants have shown performance increases in active 3D reconstruction using image rendering or geometric uncertainty. However, simultaneously considering both uncertainties when selecting informative views remains unexplored, even though utilizing different types of uncertainty can reduce the bias that arises in the early training stage with sparse inputs. In this paper, we propose ActiveNeuS, which evaluates candidate views considering both uncertainties. ActiveNeuS provides a way to accumulate image rendering uncertainty while avoiding the bias that the estimated densities can introduce. ActiveNeuS computes the neural implicit surface uncertainty, providing the color uncertainty along with the surface information. It efficiently handles the bias by using the surface information and a grid, enabling the fast selection of diverse viewpoints. Our method outperforms previous works on the popular Blender and DTU datasets, showing that the views selected by ActiveNeuS significantly improve performance.
https://arxiv.org/abs/2405.02568
For cyber-physical systems (CPS), including robotics and autonomous vehicles, mass deployment has been hindered by fatal errors that occur when operating in rare events. To replicate rare events such as vehicle crashes, many companies have created logging systems and employed crash reconstruction experts to meticulously recreate these valuable events in simulation. However, in these methods, "what if" questions are not easily formulated and answered. We present ScenarioNL, an AI system for creating scenario programs from natural language. Specifically, we generate these programs from police crash reports. Reports normally contain uncertainty about the exact details of the incidents, which we represent through a Probabilistic Programming Language (PPL), Scenic. By using Scenic, we can clearly and concisely represent uncertainty and variation over CPS behaviors, properties, and interactions. We demonstrate how commonplace prompting techniques with the best Large Language Models (LLMs) are incapable of reasoning about probabilistic scenario programs and generating code for low-resource languages such as Scenic. Our system comprises several LLMs chained together with several kinds of prompting strategies, a compiler, and a simulator. We evaluate our system on publicly available autonomous vehicle crash reports from California over the last five years and share insights into how we generate code that is both semantically meaningful and syntactically correct.
https://arxiv.org/abs/2405.03709
Computed Tomography (CT) is pivotal in industrial quality control and medical diagnostics. Sparse-view CT, offering reduced ionizing radiation, faces challenges due to its under-sampled nature, leading to ill-posed reconstruction problems. Recent advancements in Implicit Neural Representations (INRs) have shown promise in addressing sparse-view CT reconstruction. Recognizing that CT often involves scanning similar subjects, we propose a novel approach to improve reconstruction quality through joint reconstruction of multiple objects using INRs. This approach can potentially leverage both the strengths of INRs and the statistical regularities across multiple objects. While current INR joint reconstruction techniques primarily focus on accelerating convergence via meta-initialization, they are not specifically tailored to enhance reconstruction quality. To address this gap, we introduce a novel INR-based Bayesian framework integrating latent variables to capture the inter-object relationships. These variables serve as a dynamic reference throughout the optimization, thereby enhancing individual reconstruction fidelity. Our extensive experiments, which assess various key factors such as reconstruction quality, resistance to overfitting, and generalizability, demonstrate significant improvements over baselines in common numerical metrics. This underscores a notable advancement in CT reconstruction methods.
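The abstract does not give the framework's equations; under our own hypothetical notation (not the paper's), a latent-variable joint-reconstruction objective of this kind can be written as:

```latex
% Hypothetical notation: y_i are the measurements for object i, A_i the
% forward (projection) operator, f_{\theta_i} the i-th INR, z the shared latent.
\max_{\theta_{1:N},\, z}\;
  \sum_{i=1}^{N}\Big[\log p(y_i \mid \theta_i) + \log p(\theta_i \mid z)\Big]
  + \log p(z),
\qquad
\log p(y_i \mid \theta_i)
  = -\tfrac{1}{2\sigma^2}\,\lVert A_i f_{\theta_i} - y_i\rVert_2^2 + \text{const}.
```

Here $p(\theta_i \mid z)$ plays the role of the dynamic reference: each object's INR is pulled toward a configuration shared across objects, while its own data term preserves individual reconstruction fidelity.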
https://arxiv.org/abs/2405.02509
Computing the gradients of a rendering process is paramount for diverse applications in computer vision and graphics. However, accurate computation of these gradients is challenging due to discontinuities and rendering approximations, particularly for surface-based representations and rasterization-based rendering. We present a novel method for computing gradients at visibility discontinuities for rasterization-based differentiable renderers. Our method elegantly simplifies the traditionally complex problem through a carefully designed approximation strategy, allowing for a straightforward, effective, and performant solution. We introduce a novel concept of micro-edges, which allows us to treat the rasterized images as outcomes of a differentiable, continuous process aligned with the inherently non-differentiable, discrete-pixel rasterization. This technique eliminates the necessity for rendering approximations or other modifications to the forward pass, preserving the integrity of the rendered image, which makes it applicable to rasterized masks, depth, and normals images where filtering is prohibitive. Utilizing micro-edges simplifies gradient interpretation at discontinuities and enables handling of geometry intersections, offering an advantage over the prior art. We showcase our method in dynamic human head scene reconstruction, demonstrating effective handling of camera images and segmentation masks.
https://arxiv.org/abs/2405.02508
In the field of robotics and computer vision, efficient and accurate semantic mapping remains a significant challenge due to the growing demand for intelligent machines that can comprehend and interact with complex environments. Conventional panoptic mapping methods, however, are limited by predefined semantic classes, thus making them ineffective for handling novel or unforeseen objects. In response to this limitation, we introduce the Unified Promptable Panoptic Mapping (UPPM) method. UPPM utilizes recent advances in foundation models to enable real-time, on-demand label generation using natural language prompts. By incorporating a dynamic labeling strategy into traditional panoptic mapping techniques, UPPM provides significant improvements in adaptability and versatility while maintaining high performance levels in map reconstruction. We demonstrate our approach on real-world and simulated datasets. Results show that UPPM can accurately reconstruct scenes and segment objects while generating rich semantic labels through natural language interactions. A series of ablation experiments validated the advantages of foundation model-based labeling over fixed label sets.
https://arxiv.org/abs/2405.02162
With the wide application of knowledge distillation between an ImageNet pre-trained teacher model and a learnable student model, industrial anomaly detection has witnessed significant achievements in the past few years. The success of knowledge distillation mainly relies on how to maintain the feature discrepancy between the teacher and student models, under two assumptions: (1) the teacher model can jointly represent two different distributions for the normal and abnormal patterns, while (2) the student model can only reconstruct the normal distribution. However, maintaining these ideal assumptions in practice remains a challenging issue. In this paper, we propose a simple yet effective two-stage industrial anomaly detection framework, termed AAND, which sequentially performs Anomaly Amplification and Normality Distillation to obtain robust feature discrepancy. In the first anomaly amplification stage, we propose a novel Residual Anomaly Amplification (RAA) module to advance the pre-trained teacher encoder. With the exposure of synthetic anomalies, it amplifies anomalies via residual generation while maintaining the integrity of the pre-trained model. It mainly comprises a Matching-guided Residual Gate and an Attribute-scaling Residual Generator, which determine the residuals' proportion and characteristics, respectively. In the second normality distillation stage, we further employ a reverse distillation paradigm to train a student decoder, in which a novel Hard Knowledge Distillation (HKD) loss is built to better facilitate the reconstruction of normal patterns. Comprehensive experiments on the MvTecAD, VisA, and MvTec3D-RGB datasets show that our method achieves state-of-the-art performance.
https://arxiv.org/abs/2405.02068
In the fields of photogrammetry, computer vision and computer graphics, the task of neural 3D scene reconstruction has led to the exploration of various techniques. Among these, 3D Gaussian Splatting stands out for its explicit representation of scenes using 3D Gaussians, making it appealing for tasks like 3D point cloud extraction and surface reconstruction. Motivated by its potential, we address the domain of 3D scene reconstruction, aiming to leverage the capabilities of the Microsoft HoloLens 2 for instant 3D Gaussian Splatting. We present HoloGS, a novel workflow utilizing HoloLens sensor data, which bypasses the need for pre-processing steps like Structure from Motion by instantly accessing the required input data, i.e., the images, camera poses and the point cloud from depth sensing. We provide comprehensive investigations, including the training process and the rendering quality, assessed through the Peak Signal-to-Noise Ratio, and the geometric 3D accuracy of the densified point cloud from Gaussian centers, measured by the Chamfer Distance. We evaluate our approach on two self-captured scenes: an outdoor scene of a cultural heritage statue and an indoor scene of a fine-structured plant. Our results show that the HoloLens data, including RGB images, corresponding camera poses, and depth-sensing-based point clouds to initialize the Gaussians, are suitable as input for 3D Gaussian Splatting.
https://arxiv.org/abs/2405.02005
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, an MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective poses a dilemma for visual tokens. To resolve the conflict, we propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate texts; for generation, they take on a different, non-conflicting role as complete visual tokens for image reconstruction, where the missing visual cues are recovered by the MLLM. Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously. Our project is available at this https URL.
https://arxiv.org/abs/2405.01926
Expressive voice conversion (VC) converts speaker identity for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems, to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available.
https://arxiv.org/abs/2405.01730
Deep learning has made significant progress in computer vision, specifically in image classification, object detection, and semantic segmentation. The skip connection has played an essential role in the architecture of deep neural networks, enabling easier optimization through residual learning during the training stage and improving accuracy during testing. Many neural networks have inherited the idea of residual learning with skip connections for various tasks, and it has become the standard choice for designing neural networks. This survey provides a comprehensive summary and outlook on the development of skip connections in deep neural networks. The short history of skip connections is outlined, and the development of residual learning in deep neural networks is surveyed. The effectiveness of skip connections in the training and testing stages is summarized, and future directions for using skip connections in residual learning are discussed. Finally, we summarize seminal papers, source code, models, and datasets that utilize skip connections in computer vision, including image classification, object detection, semantic segmentation, and image reconstruction. We hope this survey can inspire peer researchers in the community to further develop skip connections in various forms and tasks, as well as the theory of residual learning in deep neural networks. The project page can be found at this https URL
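The core identity the survey revisits is y = x + F(x); a minimal sketch (our own toy residual branch, not any specific network from the survey) shows why a zero-initialized residual branch makes the block an identity map, which is one reason residual networks are easy to optimize:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x): the skip connection adds the input back onto the
    learned residual F(x) = W2 @ relu(W1 @ x)."""
    h = np.maximum(w1 @ x, 0.0)   # ReLU nonlinearity
    return x + w2 @ h

d = 4
x = np.arange(d, dtype=float)
# Zero-initialized residual branch -> the block reduces to the identity,
# so gradients flow unimpeded through the skip path at initialization.
w1 = np.zeros((d, d))
w2 = np.zeros((d, d))
y = residual_block(x, w1, w2)
```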
https://arxiv.org/abs/2405.01725
The transformer structure employed in large language models (LLMs), a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for its ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication, where proper encoding of the relevant data is critical, especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information at the intended compression rates, even when only a fraction of the encoded data is transmitted.
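As a toy illustration of the masking idea (random scores stand in for the ViT attention maps, and all sizes are hypothetical), a binary top-k mask selects which patches to transmit:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, keep = 64, 16                      # hypothetical sizes

# Stand-in for transformer attention: one importance score per image
# patch (in the paper these would come from the ViT's attention maps).
scores = rng.random(num_patches)

# Binary mask keeping only the top-`keep` most important patches.
mask = np.zeros(num_patches, dtype=bool)
mask[np.argsort(scores)[-keep:]] = True

patches = rng.standard_normal((num_patches, 8, 8))
transmitted = patches[mask]                     # only 16 of 64 patches sent
```

Varying `keep` is what realizes the different compression rates evaluated in the paper.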
https://arxiv.org/abs/2405.01521
Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.
https://arxiv.org/abs/2405.01353
Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, tracking must be highly robust, with no failures. Achieving that requires efficiently tackling challenges such as: device obscuration by contrast agent or by other external devices or wires, changes in field-of-view or acquisition angle, as well as continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach that learns spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision on image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance, and in particular robustness, compared to ultra-optimized reference solutions (that use multi-stage feature fusion, multi-task learning, and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). These results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
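A toy version of the interpolation-based reconstruction objective (the sequence, mask position, and simple two-neighbor average are ours for illustration, not the paper's architecture):

```python
import numpy as np

# Hypothetical sequence of 5 frames with linear motion: frame t = t * base.
base = np.random.default_rng(0).standard_normal((32, 32))
frames = np.stack([t * base for t in range(5)])

# Mask the middle frame and reconstruct it by interpolating its neighbors;
# the reconstruction error is the self-supervised training signal.
t_masked = 2
pred = 0.5 * (frames[t_masked - 1] + frames[t_masked + 1])
loss = np.mean((pred - frames[t_masked]) ** 2)
```

For this linear-motion toy the interpolation is exact, so the loss is (numerically) zero; on real sequences the residual error is what drives the model to learn inter-frame correspondences.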
https://arxiv.org/abs/2405.01156
Although several image super-resolution solutions exist, they still face many challenges. CNN-based algorithms, despite their reduced computational complexity, still need to improve in accuracy, while Transformer-based algorithms have higher accuracy but an ultra-high computational complexity that makes them difficult to adopt in practical applications. To overcome these challenges, a novel super-resolution reconstruction algorithm is proposed in this paper. The algorithm achieves a significant increase in accuracy through a unique design while maintaining low complexity. Its core lies in the cleverly designed Global-Local Information Extraction Module and Basic Block Module. By combining global and local information, the Global-Local Information Extraction Module aims to understand the image content more comprehensively so as to recover the global structure and local details of the image more accurately, providing rich information support for the subsequent reconstruction process. Experimental results show that the comprehensive performance of the proposed algorithm is optimal, providing an efficient and practical new solution in the field of super-resolution reconstruction.
https://arxiv.org/abs/2405.01085
Reconstructing a hand mesh from a single RGB image is a challenging task because hands are often occluded by objects. Most previous works attempted to introduce additional information and adopt attention mechanisms to improve 3D reconstruction results, but this increases computational complexity. This observation prompts us to propose a new, concise architecture with improved computational efficiency. In this work, we propose a simple and effective 3D hand mesh reconstruction network, HandSSCA, which is the first to incorporate state space modeling into the field of hand pose estimation. In the network, we design a novel state space channel attention module that extends the effective receptive field, extracts hand features in the spatial dimension, and enhances hand regional features in the channel dimension. This design helps to reconstruct a complete and detailed hand mesh. Extensive experiments conducted on well-known datasets featuring challenging hand-object occlusions (such as FREIHAND, DEXYCB, and HO3D) demonstrate that our proposed HandSSCA achieves state-of-the-art performance while maintaining a minimal parameter count.
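The paper's module couples state space modeling with channel attention; only the generic channel-attention half can be sketched from the abstract, here in a squeeze-and-excitation style (weights and shapes are hypothetical):

```python
import numpy as np

def channel_attention(x, w):
    """Squeeze-and-excitation-style channel attention on a (C, H, W)
    feature map: global-average-pool each channel, gate with a sigmoid,
    and rescale the channels by the resulting weights."""
    squeeze = x.mean(axis=(1, 2))                 # (C,) per-channel summary
    gate = 1.0 / (1.0 + np.exp(-(w @ squeeze)))   # sigmoid gate in (0, 1)
    return x * gate[:, None, None], gate

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))              # hypothetical feature map
w = rng.standard_normal((8, 8)) * 0.1             # hypothetical gating weights
y, gate = channel_attention(x, w)
```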
https://arxiv.org/abs/2405.01066
Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying stronger geometric information offered by explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.
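The second insight reduces, in its simplest form, to restricting the depth loss to trustworthy Lidar pixels; a minimal sketch under assumed shapes, a random validity mask, and an L1 penalty (the paper's actual scheme also handles occlusion and point accumulation):

```python
import numpy as np

rng = np.random.default_rng(0)
rendered = rng.uniform(1.0, 10.0, size=(16, 16))       # NeRF-rendered depth
lidar = rendered + rng.normal(0, 0.05, size=(16, 16))  # noisy Lidar depth

# Hypothetical validity mask: supervise only pixels that have a Lidar
# return and are not flagged as occluded from the training camera.
valid = rng.random((16, 16)) > 0.3

# Depth loss restricted to valid pixels.
loss = np.abs(rendered - lidar)[valid].mean()
```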
https://arxiv.org/abs/2405.00900
Image Quality Assessment (IQA) is essential in various computer vision tasks such as image deblurring and super-resolution. However, most IQA methods require reference images, which are not always available. While there are some reference-free IQA metrics, they have limitations in simulating human perception and discerning subtle image quality variations. We hypothesize that the JPEG quality factor is representative of image quality, and that a well-trained neural network can learn to accurately evaluate image quality without requiring a clean reference, as it can recognize image degradation artifacts based on prior knowledge. We therefore developed a reference-free quality evaluation network, dubbed the "Quality Factor (QF) Predictor", which does not require any reference. Our QF Predictor is a lightweight, fully convolutional network comprising seven layers. The model is trained in a self-supervised manner: it receives a JPEG-compressed image patch with a random QF as input and is trained to accurately predict the corresponding QF. We demonstrate the versatility of the model by applying it to various tasks. First, our QF Predictor can generalize to measure the severity of various image artifacts, such as Gaussian blur and Gaussian noise. Second, we show that the QF Predictor can be trained to predict the undersampling rate of images reconstructed from Magnetic Resonance Imaging (MRI) data.
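The self-supervised recipe can be mimicked with a toy stand-in: instead of JPEG compression at a random QF, we degrade synthetic patches with noise of a random strength and fit a one-parameter predictor of that strength (everything below is illustrative, not the paper's seven-layer CNN):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_patch(sigma):
    """Smooth patch plus Gaussian noise of strength `sigma` -- a toy
    stand-in for JPEG compression at a given quality factor."""
    xx, yy = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32))
    clean = np.sin(4 * xx) + np.cos(3 * yy)
    return clean + rng.normal(0, sigma, clean.shape)

def feature(patch):
    """Mean absolute discrete Laplacian: grows with the noise level."""
    lap = (patch[1:-1, 2:] + patch[1:-1, :-2] + patch[2:, 1:-1]
           + patch[:-2, 1:-1] - 4 * patch[1:-1, 1:-1])
    return np.abs(lap).mean()

# Self-supervised pairs: degrade with a random strength, keep it as label.
sigmas = rng.uniform(0.05, 0.5, size=200)
feats = np.array([feature(make_patch(s)) for s in sigmas])

# "Train" a linear predictor by least squares: sigma ~ a * feat + b.
a, b = np.polyfit(feats, sigmas, deg=1)
mae = np.abs(a * feats + b - sigmas).mean()
```

The point of the toy is the data pipeline: the degradation parameter itself is the label, so no human annotation or clean reference is needed.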
https://arxiv.org/abs/2405.02208