The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, hands often also provide a valuable signal: e.g., the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
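A minimal sketch (PyTorch assumed; the module and tensor names are illustrative, not taken from the paper) of the kind of cross-frame attention the VIDM description suggests: features of the masked current frame attend to features of earlier frames so that occluded object appearance can be borrowed from the past.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Attend from masked current-frame features to earlier-frame features
    (illustrative sketch, not the paper's architecture)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current_feats, past_feats):
        # current_feats: (B, N_cur, C) features of the frame being inpainted
        # past_feats:    (B, N_past, C) features pooled from earlier frames
        out, _ = self.attn(query=current_feats, key=past_feats, value=past_feats)
        return current_feats + out  # residual: keep current content, add borrowed appearance

# toy usage
B, N, C = 2, 64, 128
cur = torch.randn(B, N, C)
past = torch.randn(B, 4 * N, C)   # e.g. features from four earlier frames
fused = CrossFrameAttention(C)(cur, past)
print(fused.shape)  # torch.Size([2, 64, 128])
```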
https://arxiv.org/abs/2305.16301
The standard approach for neural topic modeling uses a variational autoencoder (VAE) framework that jointly minimizes the KL divergence between the estimated posterior and prior, in addition to the reconstruction loss. Since neural topic models are trained by recreating individual input documents, they do not explicitly capture the coherence between topic words on the corpus level. In this work, we propose a novel diversity-aware coherence loss that encourages the model to learn corpus-level coherence scores while maintaining a high diversity between topics. Experimental results on multiple datasets show that our method significantly improves the performance of neural topic models without requiring any pretraining or additional parameters.
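A rough sketch (PyTorch; the exact form of the paper's loss is not specified in the abstract, so the coherence and diversity terms below are illustrative) of how a corpus-level coherence term and a topic-diversity term might be added to the usual VAE topic-modeling objective.

```python
import torch
import torch.nn.functional as F

def topic_model_loss(recon_logp, kl, beta_matrix, coherence_scores,
                     lambda_coh=1.0, lambda_div=1.0):
    """recon_logp:      reconstruction log-likelihood (scalar)
    kl:                KL(q(z|x) || p(z)) (scalar)
    beta_matrix:       (K, V) topic-word distributions
    coherence_scores:  (V, V) precomputed corpus-level word coherence (e.g. NPMI)
    Returns the total loss; coherence and diversity terms are illustrative."""
    # encourage each topic's word distribution to be coherent under corpus statistics
    coherence = torch.einsum('kv,vw,kw->', beta_matrix, coherence_scores, beta_matrix)
    # discourage topics from overlapping: penalize pairwise topic similarity
    beta_norm = F.normalize(beta_matrix, dim=-1)
    sim = beta_norm @ beta_norm.t()
    K = beta_matrix.size(0)
    diversity_penalty = (sim - torch.eye(K)).clamp(min=0).sum() / (K * (K - 1))
    return -recon_logp + kl - lambda_coh * coherence + lambda_div * diversity_penalty

# toy usage
K, V = 5, 100
loss = topic_model_loss(recon_logp=torch.tensor(-120.0), kl=torch.tensor(3.5),
                        beta_matrix=torch.softmax(torch.randn(K, V), dim=-1),
                        coherence_scores=torch.randn(V, V))
print(loss.item())
```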
https://arxiv.org/abs/2305.16199
Along with the recent development of deep neural networks, appearance-based gaze estimation has achieved considerable success when training and testing within the same domain. Compared to the within-domain task, the variation across domains causes cross-domain performance to drop severely, preventing the deployment of gaze estimation in real-world applications. Among all the factors, the ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, while collecting data covering large ranges is expensive. This work proposes an effective model training pipeline consisting of training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages single-image 3D reconstruction to expand the range of head poses in the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce a background augmentation consistency loss to exploit the characteristics of the synthetic source domain. Through comprehensive experiments, we show that a model using only monocular-reconstructed synthetic training data can perform comparably to one trained on real data with a large label range. Our proposed domain adaptation approach further improves performance on multiple target domains. The code and data will be available at \url{this https URL}.
https://arxiv.org/abs/2305.16140
Obtaining accurate 3D object poses is vital for numerous computer vision applications, such as 3D reconstruction and scene understanding. However, annotating real-world objects is time-consuming and challenging. While synthetically generated training data is a viable alternative, the domain shift between real and synthetic data is a significant challenge. In this work, we aim to narrow the performance gap between models trained on synthetic data plus a few real images and fully supervised models trained on large-scale data. We achieve this by approaching the problem from two perspectives: 1) We introduce SyntheticP3D, a new synthetic dataset for object pose estimation generated from CAD models and enhanced with a novel algorithm. 2) We propose a novel approach (CC3D) for training neural mesh models that perform pose estimation via inverse rendering. In particular, we exploit the spatial relationships between features on the mesh surface and a contrastive learning scheme to guide the domain adaptation process. Combined, these two approaches enable our models to perform competitively with state-of-the-art models using only 10% of the respective real training images, while outperforming the SOTA model by 10.4% at a threshold of $\pi/18$ using only 50% of the real training data. Our trained model further demonstrates robust generalization to out-of-distribution scenarios despite being trained with minimal real data.
https://arxiv.org/abs/2305.16124
Due to the unsupervised nature of anomaly detection, the key to fueling deep models is finding supervisory signals. Different from current reconstruction-guided generative models and transformation-based contrastive models, we devise novel data-driven supervision for tabular data by introducing a characteristic -- scale -- as data labels. By representing varied sub-vectors of data instances, we define scale as the relationship between the dimensionality of original sub-vectors and that of representations. Scales serve as labels attached to transformed representations, thus offering ample labeled data for neural network training. This paper further proposes a scale learning-based anomaly detection method. Supervised by the learning objective of scale distribution alignment, our approach learns the ranking of representations converted from varied subspaces of each data instance. Through this proxy task, our approach models inherent regularities and patterns within data, which well describes data "normality". Abnormal degrees of testing instances are obtained by measuring whether they fit these learned patterns. Extensive experiments show that our approach leads to significant improvement over state-of-the-art generative/contrastive anomaly detection methods.
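A simplified sketch (NumPy; names are illustrative) of the labeling idea the abstract describes: sub-vectors of varying dimensionality are drawn from each tabular instance, mapped to fixed-size representations, and the relationship between the original dimensionality and the representation dimensionality serves as the "scale" label.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scale_labeled_views(x, rep_dim=8, sub_dims=(4, 8, 16)):
    """x: (D,) one tabular instance. Returns (views, scale_labels), where each
    view is a fixed-size representation of a random sub-vector and its label
    encodes the original sub-vector dimensionality relative to rep_dim."""
    views, labels = [], []
    for d in sub_dims:
        idx = rng.choice(x.shape[0], size=d, replace=False)
        sub = x[idx]
        # toy "representation": random projection to rep_dim (a network in practice)
        W = rng.standard_normal((d, rep_dim)) / np.sqrt(d)
        views.append(sub @ W)
        labels.append(d / rep_dim)          # the "scale" label
    return np.stack(views), np.array(labels)

x = rng.standard_normal(32)
views, scales = make_scale_labeled_views(x)
print(views.shape, scales)  # (3, 8) [0.5 1.  2. ]
```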
https://arxiv.org/abs/2305.16114
Reconstruction-based methods have struggled to achieve competitive performance on anomaly detection. In this paper, we introduce Denoising Diffusion Anomaly Detection (DDAD). We propose a novel denoising process for image reconstruction conditioned on a target image. This results in a coherent restoration that closely resembles the target image. Subsequently, our anomaly detection framework leverages this conditioning, setting the target image to the input image to guide the denoising process, leading to a defect-free reconstruction while maintaining nominal patterns. We localise anomalies via a pixel-wise and feature-wise comparison of the input and reconstructed image. Finally, to enhance the effectiveness of the feature comparison, we introduce a domain adaptation method that utilises generated examples from our conditioned denoising process to fine-tune the feature extractor. The effectiveness of the approach is demonstrated on various datasets, including the MVTec and VisA benchmarks, achieving state-of-the-art results of 99.5% and 99.3% image-level AUROC respectively.
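A minimal sketch (PyTorch; the feature extractor and the weighting are placeholders, not the paper's choices) of the localization step: after the conditioned denoising process returns a defect-free reconstruction, anomalies are scored by combining a pixel-wise distance with a feature-wise distance.

```python
import torch
import torch.nn.functional as F

def anomaly_map(x, x_rec, feat_extractor, w_pixel=0.5, w_feat=0.5):
    """x, x_rec: (B, 3, H, W) input and its conditioned reconstruction.
    feat_extractor: any frozen CNN returning (B, C, h, w) features.
    Returns a (B, H, W) anomaly heatmap (weights are illustrative)."""
    pixel_dist = (x - x_rec).abs().mean(dim=1)                      # (B, H, W)
    f_x, f_rec = feat_extractor(x), feat_extractor(x_rec)
    feat_dist = 1 - F.cosine_similarity(f_x, f_rec, dim=1)          # (B, h, w)
    feat_dist = F.interpolate(feat_dist.unsqueeze(1), size=x.shape[-2:],
                              mode='bilinear', align_corners=False).squeeze(1)
    return w_pixel * pixel_dist + w_feat * feat_dist

# toy usage with a random stand-in "feature extractor"
feat = torch.nn.Conv2d(3, 16, kernel_size=3, stride=4)
x, x_rec = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(anomaly_map(x, x_rec, feat).shape)  # torch.Size([1, 64, 64])
```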
https://arxiv.org/abs/2305.15956
Recent developments in the field of non-local attention (NLA) have led to a renewed interest in self-similarity-based single image super-resolution (SISR). Researchers have typically used NLA to exploit non-local self-similarity (NSS) in SISR and achieved satisfactory reconstruction results. However, the surprising phenomenon that the standard NLA achieves reconstruction performance similar to NLA with randomly selected regions stimulated our interest in revisiting NLA. In this paper, we first analyzed the attention map of the standard NLA from different perspectives and discovered that the resulting probability distribution always has full support over every local feature, which implies a statistical waste of assigning values to irrelevant non-local features, especially for SISR, which needs to model long-range dependence with a large number of redundant non-local features. Based on these findings, we introduced a concise yet effective soft thresholding operation to obtain high-similarity-pass attention (HSPA), which is beneficial for generating a more compact and interpretable distribution. Furthermore, we derived some key properties of the soft thresholding operation that enable training our HSPA in an end-to-end manner. The HSPA can be integrated into existing deep SISR models as an efficient general building block. In addition, to demonstrate the effectiveness of the HSPA, we constructed a deep high-similarity-pass attention network (HSPAN) by integrating a few HSPAs into a simple backbone. Extensive experimental results demonstrate that HSPAN outperforms state-of-the-art approaches on both quantitative and qualitative evaluations.
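A compact sketch (PyTorch; the placement of the threshold relative to the normalization is a guess from the abstract, not the paper's exact formulation) of soft-thresholded non-local attention: similarity weights below a learned threshold are zeroed by a ReLU shrinkage before renormalization, so only high-similarity non-local features contribute, and the operation stays differentiable for end-to-end training.

```python
import torch
import torch.nn as nn

class HighSimilarityPassAttention(nn.Module):
    """Non-local attention whose weights pass through a soft threshold
    (illustrative sketch of the HSPA idea)."""
    def __init__(self, dim: int, init_threshold: float = 0.1):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.threshold = nn.Parameter(torch.tensor(init_threshold))

    def forward(self, x):
        # x: (B, N, C) flattened feature map
        q, k, v = self.q(x), self.k(x), self.v(x)
        sim = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        # soft thresholding: shrink and zero out low-similarity entries
        sparse = torch.relu(sim - self.threshold)
        attn = sparse / (sparse.sum(dim=-1, keepdim=True) + 1e-6)   # renormalize
        return x + attn @ v                                          # residual output

x = torch.randn(2, 256, 64)
print(HighSimilarityPassAttention(64)(x).shape)  # torch.Size([2, 256, 64])
```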
https://arxiv.org/abs/2305.15768
In recent years, 3D models have been utilized in many applications, such as autonomous driving, 3D reconstruction, VR, and AR. However, the available 3D model data is too scarce to meet practical demands. Thus, efficiently generating high-quality 3D models from textual descriptions is a promising but challenging way to solve this problem. In this paper, inspired by the ability of human beings to complement visual information details from ambiguous descriptions based on their own experience, we propose a novel text-3D generation model (T2TD), which introduces related shapes or textual information as prior knowledge to improve the performance of the 3D generation model. In this process, we first introduce a text-3D knowledge graph to store the relationships between 3D models and textual semantic information, which can provide related shapes to guide the target 3D model generation. Second, we integrate an effective causal inference model to select useful feature information from these related shapes, which removes unrelated shape information and only retains feature information that is strongly relevant to the textual description. Meanwhile, to effectively integrate multi-modal prior knowledge into textual information, we adopt a novel multi-layer transformer structure to progressively fuse related shape and textual information, which can effectively compensate for the lack of structural information in the text and enhance the final performance of the 3D generation model. The final experimental results demonstrate that our approach significantly improves 3D model generation quality and outperforms the SOTA methods on the text2shape datasets.
https://arxiv.org/abs/2305.15753
Millimeter-wave (MMW) imaging is emerging as a promising technique for safe security inspection. It achieves a delicate balance between imaging resolution, penetrability, and human safety, offering higher resolution than low-frequency microwave, stronger penetrability than visible light, and greater safety than X-ray. Despite recent advances over the last decades, the high cost of the requisite large-scale antenna array hinders widespread adoption of MMW imaging in practice. To tackle this challenge, we report a large-scale single-shot MMW imaging framework using a sparse antenna array, achieving low-cost but high-fidelity security inspection under an interpretable learning scheme. We first collected extensive fully sampled MMW echoes to study the statistical ranking of each element in the large-scale array. These elements are then sampled based on the ranking, building an experimentally optimal sparse sampling strategy that reduces the cost of the antenna array by up to one order of magnitude. Additionally, we derived an untrained interpretable learning scheme, which realizes robust and accurate image reconstruction from sparsely sampled echoes. Last, we developed a neural network for automatic object detection and experimentally demonstrated successful detection of concealed centimeter-sized targets using a 10% sparse array, whereas all other contemporary approaches failed at the same sampling ratio. The reported technique shows more than 50% improvement over existing MMW imaging schemes on various metrics, including precision, recall, and mAP50. With such strong detection ability and an order-of-magnitude cost reduction, we anticipate that this technique provides a practical way for large-scale single-shot MMW imaging and could promote its further practical application.
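A toy sketch (NumPy; the ranking statistic is a stand-in, since the abstract only says elements are ranked from fully sampled echoes) of building a sparse sampling mask: rank array elements by a statistic computed over the full-array echoes and keep the top 10%.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_sampling_mask(full_echoes, keep_ratio=0.10):
    """full_echoes: (num_scans, num_elements) complex echoes from the full array.
    Ranks elements by a simple energy statistic (an illustrative stand-in for the
    paper's statistical ranking) and keeps the top `keep_ratio` fraction."""
    energy = np.mean(np.abs(full_echoes) ** 2, axis=0)        # per-element statistic
    n_keep = max(1, int(keep_ratio * full_echoes.shape[1]))
    keep_idx = np.argsort(energy)[::-1][:n_keep]              # highest-ranked elements
    mask = np.zeros(full_echoes.shape[1], dtype=bool)
    mask[keep_idx] = True
    return mask

echoes = rng.standard_normal((200, 1024)) + 1j * rng.standard_normal((200, 1024))
mask = sparse_sampling_mask(echoes)
print(mask.sum(), "of", mask.size, "elements retained")       # 102 of 1024
```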
https://arxiv.org/abs/2305.15750
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.
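A small sketch (PyTorch; the dimensions and the encoding of the time/layer inputs are assumptions) of the implicit representation described above: an MLP maps the current denoising timestep and U-Net layer index to a token embedding, plus a residual "textual bypass" output added after the text encoder.

```python
import torch
import torch.nn as nn

class SpaceTimeTokenMapper(nn.Module):
    """Maps (timestep, unet_layer) to a token embedding and a bypass residual
    (illustrative sketch; not the paper's exact architecture)."""
    def __init__(self, token_dim=768, hidden=128, num_layers=16, max_t=1000):
        super().__init__()
        self.max_t, self.num_layers = max_t, num_layers
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.to_token = nn.Linear(hidden, token_dim)     # fed to the text encoder input
        self.to_bypass = nn.Linear(hidden, token_dim)    # added to the text encoder output

    def forward(self, t, layer):
        # normalize the two conditioning scalars to roughly [0, 1]
        cond = torch.stack([t.float() / self.max_t,
                            layer.float() / self.num_layers], dim=-1)
        h = self.net(cond)
        return self.to_token(h), self.to_bypass(h)

mapper = SpaceTimeTokenMapper()
tok, bypass = mapper(torch.tensor([500]), torch.tensor([7]))
print(tok.shape, bypass.shape)  # torch.Size([1, 768]) torch.Size([1, 768])
```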
https://arxiv.org/abs/2305.15391
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformer only uses the language input for attention weight calculation, which does not explicitly fuse language features in its output. Thus, its output feature is dominated by vision information, which limits the model to comprehensively understand the multi-modal information, and brings uncertainty for the subsequent mask decoder to extract the output mask. To address this issue, we propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities. Based on {$\mathrm{M^3Dec}$}, we further propose Iterative Multi-modal Interaction ($\mathrm{IMI}$) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction ($\mathrm{LFR}$) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that our proposed approach significantly improves the baseline and outperforms state-of-the-art referring image segmentation methods on RefCOCO series datasets consistently.
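An illustrative sketch (PyTorch; simplified relative to the paper's $\mathrm{M^3Att}$) of the core idea: attention is computed in both directions so that the fused output carries language features as well as vision features, rather than vision-dominated features only.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Bidirectional cross-attention between vision and language features
    (simplified sketch of the mutual-attention idea)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.vis_from_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, N_pix, C) visual tokens; lang: (B, N_word, C) language tokens
        v2l, _ = self.vis_from_lang(query=vis, key=lang, value=lang)   # language injected into vision
        l2v, _ = self.lang_from_vis(query=lang, key=vis, value=vis)    # vision injected into language
        return vis + v2l, lang + l2v

vis, lang = torch.randn(2, 400, 256), torch.randn(2, 20, 256)
fused_vis, fused_lang = MutualAttention(256)(vis, lang)
print(fused_vis.shape, fused_lang.shape)
```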
https://arxiv.org/abs/2305.15302
Audio inpainting aims to reconstruct missing segments in corrupted recordings. Previous methods produce plausible reconstructions when the gap length is shorter than about 100 ms, but the quality decreases for longer gaps. This paper explores recent advancements in deep learning and, particularly, diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting, offering high flexibility to regenerate gaps of arbitrary length. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics for the task of reconstructing short to mid-sized gaps. The results of a formal listening test show that the proposed method delivers a comparable performance against state-of-the-art for short gaps, while retaining a good audio quality and outperforming the baselines for the longest gap lengths tested, 150 ms and 200 ms. This work helps improve the restoration of sound recordings having fairly long local disturbances or dropouts, which must be reconstructed.
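A schematic sketch (Python, kept runnable with a dummy denoiser; the paper's actual sampler and CQT-based network are not reproduced) of zero-shot conditioning of an unconditionally trained diffusion model for inpainting: at each reverse step, the known samples are re-imposed so that only the gap is generated.

```python
import torch

def zero_shot_inpaint(x_known, mask, denoise_step, num_steps=50):
    """x_known: (B, 1, L) audio with a gap; mask: (B, 1, L) 1 where samples are known.
    denoise_step(x_t, t) -> x_{t-1}: one reverse step of an unconditionally trained
    diffusion model (a dummy below). At each step the known region is re-imposed,
    RePaint-style, so only the gap is synthesized (illustrative, not the paper's sampler)."""
    x = torch.randn_like(x_known)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        noise_level = t / num_steps
        x_known_t = x_known + noise_level * torch.randn_like(x_known)  # forward-noised known part
        x = mask * x_known_t + (1 - mask) * x                          # keep known, generate gap
    return x

# toy usage with a dummy "denoiser" that merely shrinks the signal
dummy_step = lambda x, t: 0.98 * x
audio = torch.randn(1, 1, 16000)
mask = torch.ones_like(audio)
mask[..., 7000:9000] = 0                                               # ~125 ms gap at 16 kHz
restored = zero_shot_inpaint(audio, mask, dummy_step)
print(restored.shape)
```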
https://arxiv.org/abs/2305.15266
Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we take a step further and propose to investigate this problem in a more practical setting. Specifically, we utilize the web-collected Coyo-700M dataset. We randomly sample varying numbers of training images from the Coyo dataset and construct a series of sub-datasets, containing 0.5M, 1M, 5M, 10M, and 100M images, for pre-training. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models. The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical. We hope these observations could provide valuable insights for future research on MIM.
https://arxiv.org/abs/2305.15248
This paper introduces Deceptive-NeRF, a new method for enhancing the quality of reconstructed NeRF models using synthetically generated pseudo-observations, capable of handling sparse input and removing floater artifacts. Our proposed method involves three key steps: 1) reconstruct a coarse NeRF model from sparse inputs; 2) generate pseudo-observations based on the coarse model; 3) refine the NeRF model using pseudo-observations to produce a high-quality reconstruction. To generate photo-realistic pseudo-observations that faithfully preserve the identity of the reconstructed scene while remaining consistent with the sparse inputs, we develop a rectification latent diffusion model that generates images conditioned on a coarse RGB image and depth map derived from the coarse NeRF, together with a latent text embedding from the input images. Extensive experiments show that our method is effective and can generate perceptually high-quality NeRF even with very sparse inputs.
https://arxiv.org/abs/2305.15171
Deep deraining networks, while successful on laboratory benchmarks, consistently encounter substantial generalization issues when deployed in real-world applications. A prevailing perspective in deep learning encourages the use of highly complex training data, with the expectation that richer image content knowledge will facilitate overcoming the generalization problem. However, through comprehensive and systematic experimentation, we discovered that this strategy does not enhance the generalization capability of these networks. On the contrary, it exacerbates the tendency of networks to overfit to specific degradations. Our experiments reveal that better generalization in a deraining network can be achieved by simplifying the complexity of the training data. This is because the networks slack off during training, that is, they learn the least complex elements in the image content and degradation to minimize the training loss. When the complexity of the background image is less than that of the rain streaks, the network will prioritize the reconstruction of the background, thereby avoiding overfitting to the rain patterns and resulting in improved generalization performance. Our research not only offers a valuable perspective and methodology for better understanding the generalization problem in low-level vision tasks, but also displays promising practical potential.
https://arxiv.org/abs/2305.15134
Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods and is favorable in large-scale outdoor scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then, to refine the recovered 3D geometry, deep features are attentively aggregated from multi-view images at potential surface locations and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate very competitive real-time reconstruction results for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.
https://arxiv.org/abs/2305.14918
Editing real facial images is a crucial task in computer vision with significant demand in various real-world applications. While GAN-based methods have shown potential in manipulating images, especially when combined with CLIP, these methods are limited in their ability to reconstruct real images due to the difficulty of GAN inversion. Despite the successful image reconstruction achieved by diffusion-based methods, there are still challenges in effectively manipulating fine-grained facial attributes with text. To address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conducts text-driven image editing in the semantic latent space of a diffusion model. By aligning the temporal features of the diffusion model with the semantic condition during the generative process, we introduce a stable manipulation strategy, which performs precise zero-shot manipulation effectively. Furthermore, we develop an interactive system named ChatFace, which combines the zero-shot reasoning ability of large language models to perform efficient manipulations in the diffusion semantic latent space. This system enables users to perform complex multi-attribute manipulations through dialogue, opening up new possibilities for interactive image editing. Extensive experiments confirmed that our approach outperforms previous methods and enables precise editing of real facial images, making it a promising candidate for real-world applications. Project page: this https URL
https://arxiv.org/abs/2305.14742
Depth cameras have found applications in diverse fields, such as computer vision, artificial intelligence, and video gaming. However, the high latency and low frame rate of existing commodity depth cameras impose limitations on their applications. We propose a fast and accurate depth map reconstruction technique to reduce latency and increase the frame rate of depth cameras. Our approach uses only a commodity depth camera and a color camera in a hybrid camera setup; our prototype is implemented using a Kinect Azure depth camera at 30 fps and a high-speed RGB iPhone 11 Pro camera capturing at 240 fps. The proposed network, AutoDepthNet, is an encoder-decoder model that captures frames from the high-speed RGB camera and combines them with previous depth frames to reconstruct a stream of high frame rate depth maps. On GPU, with a 480 x 270 output resolution, our system achieves an inference time of 8 ms, enabling real-time use at up to 200 fps with parallel processing. AutoDepthNet can estimate depth values with an average RMS error of 0.076, a 44.5% improvement compared to an optical flow-based comparison method. Our method can also improve depth map quality by estimating depth values for missing and invalidated pixels. The proposed method can be easily applied to existing depth cameras and facilitates the use of depth cameras in applications that require high-speed depth estimation. We also showcase the effectiveness of the framework in upsampling different sparse datasets, e.g., for video object segmentation. As a demonstration of our method, we integrated our framework into existing body tracking systems and demonstrated the robustness of the proposed method in such applications.
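A toy sketch (PyTorch; the real AutoDepthNet architecture is not given in the abstract, so the encoder-decoder here is a stand-in) of the hybrid-camera idea: each high-speed RGB frame is paired with the most recent 30 fps depth frame, and the network predicts depth at the RGB timestamp.

```python
import torch
import torch.nn as nn

class TinyDepthUpsampler(nn.Module):
    """Stand-in encoder-decoder taking an RGB frame plus the last depth frame
    and predicting depth at the RGB timestamp (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, rgb, last_depth):
        return self.net(torch.cat([rgb, last_depth], dim=1))

model = TinyDepthUpsampler()
H, W = 270, 480
last_depth = torch.rand(1, 1, H, W)              # most recent 30 fps depth frame
for i in range(8):                                # 8 high-speed RGB frames per depth frame (240/30)
    rgb = torch.rand(1, 3, H, W)                  # 240 fps RGB frame
    depth_pred = model(rgb, last_depth)           # depth estimated at the RGB timestamp
print(depth_pred.shape)                           # torch.Size([1, 1, 270, 480])
```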
https://arxiv.org/abs/2305.14731
Most contemporary supervised Remote Sensing (RS) image Change Detection (CD) approaches are customized for equal-resolution bitemporal images. Real-world applications raise the need for cross-resolution change detection, i.e., CD based on bitemporal images with different spatial resolutions. Current cross-resolution methods that are trained with samples of a fixed resolution difference (the resolution ratio between the high-resolution (HR) image and the low-resolution (LR) one) may fit a certain ratio but lack adaptation to other resolution differences. Toward continuous cross-resolution CD, we propose scale-invariant learning to enforce the model to consistently predict HR results given synthesized samples of varying bitemporal resolution differences. Concretely, we synthesize blurred versions of the HR image by random downsampled reconstructions to reduce the gap between HR and LR images. We introduce coordinate-based representations to decode per-pixel predictions by feeding the coordinate query and the corresponding multi-level embedding features into an MLP that implicitly learns the shape of land cover changes, thereby benefiting the recognition of blurred objects in the LR image. Moreover, considering that spatial resolution mainly affects local textures, we apply local-window self-attention to align bitemporal features during the early stages of the encoder. Extensive experiments on two synthesized and one real-world different-resolution CD datasets verify the effectiveness of the proposed method. Our method significantly outperforms several vanilla CD methods and two cross-resolution CD methods on the three datasets in both in-distribution and out-of-distribution settings. The empirical results suggest that our method can yield relatively consistent HR change predictions regardless of the resolution difference ratio. Our code will be made public.
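A small sketch (PyTorch; the range of ratios and interpolation modes are assumptions) of the synthesis step described above: the HR image is degraded by a randomly chosen downsample-then-upsample reconstruction, so the model sees varying effective resolution differences during training.

```python
import torch
import torch.nn.functional as F

def random_resolution_degradation(hr, max_ratio=8.0):
    """hr: (B, C, H, W) high-resolution image. Downsample by a random ratio in
    [1, max_ratio] and upsample back, simulating an LR image re-projected onto the
    HR grid (illustrative; ratio range and interpolation modes are assumptions)."""
    ratio = 1.0 + (max_ratio - 1.0) * torch.rand(1).item()
    h, w = hr.shape[-2:]
    lr = F.interpolate(hr, size=(max(1, int(h / ratio)), max(1, int(w / ratio))),
                       mode='area')
    return F.interpolate(lr, size=(h, w), mode='bilinear', align_corners=False)

hr = torch.rand(2, 3, 256, 256)
blurred = random_resolution_degradation(hr)
print(blurred.shape)  # torch.Size([2, 3, 256, 256])
```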
https://arxiv.org/abs/2305.14722
Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the "shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
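A small sketch (Python; the actual functions and encodings used in the paper are not specified in the abstract) of the evaluation setup this describes: define a structurally recursive function over a simple datatype and generate input-output examples from which a transformer would be asked to emulate its behavior.

```python
import random

def tree_size(node):
    """Structural recursion over a binary tree encoded as nested tuples:
    a leaf is None, an internal node is (left, right)."""
    if node is None:
        return 0
    left, right = node
    return 1 + tree_size(left) + tree_size(right)

def random_tree(depth):
    """Sample a random tree of bounded depth for generating examples."""
    if depth == 0 or random.random() < 0.3:
        return None
    return (random_tree(depth - 1), random_tree(depth - 1))

random.seed(0)
# input-output examples a sequence model would be trained/evaluated on
examples = [(t, tree_size(t)) for t in (random_tree(3) for _ in range(5))]
for tree, size in examples:
    print(tree, "->", size)
```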
https://arxiv.org/abs/2305.14699