In Magnetic Resonance Imaging (MRI), image acquisitions are often undersampled in the measurement domain to accelerate the scanning process, at the expense of image quality. However, image quality is a crucial factor that influences the accuracy of clinical diagnosis; hence, high-quality image reconstruction from undersampled measurements has been a key area of research. Recently, deep learning (DL) methods have emerged as the state-of-the-art for MRI reconstruction, typically involving deep neural networks to transform undersampled MRI images into high-quality MRI images through data-driven processes. Nevertheless, there is clear and significant room for improvement in undersampled DL MRI reconstruction to meet the high standards required for clinical diagnosis, in terms of eliminating aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure using contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to transform the MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments demonstrate improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Furthermore, our extended experiments validate the proposed framework's robustness under adversarial conditions, such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and also prove the transfer learning capabilities on MRI datasets with completely different anatomy. Additionally, we conducted experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.
https://arxiv.org/abs/2306.00530
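The mutual-information objective described above is commonly implemented with an InfoNCE-style contrastive loss between latent codes of two differently undersampled versions of the same image. A minimal NumPy sketch follows; the batch layout, temperature value, and cosine normalization are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    z1[i] and z2[i] are latent codes of two differently undersampled
    versions of the same MRI image (a positive pair); all other rows
    in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature               # (N, N) similarity matrix
    # Row i's positive is column i: softmax cross-entropy on the diagonal
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
noise = 0.05 * rng.normal(size=(8, 32))
# Aligned pairs should score a much lower loss than random pairings
aligned = info_nce_loss(z, z + noise)
random_pairs = info_nce_loss(z, rng.normal(size=(8, 32)))
assert aligned < random_pairs
```

Minimizing this loss pulls representations of different undersampled views of the same image together in the latent space, which is the sense in which the pretraining "maximizes mutual information among different undersampled representations."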
This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, Gaussian Splatting-based SLAM has yielded promising results, but it relies on RGB-D input and is weak in tracking. To address these limitations, we uniquely integrate advanced sparse visual odometry with a dense Gaussian Splatting scene representation for the first time, thereby eliminating the dependency on depth maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking robustness. Here, the sparse visual odometry tracks camera poses from an RGB stream, while Gaussian Splatting handles map reconstruction. These components are interconnected through a Multi-View Stereo (MVS) depth estimation network. We also propose a depth smoothness loss to reduce the negative effect of estimated depth maps. Furthermore, consistency in scale between the sparse visual odometry and the dense Gaussian map is preserved by a Sparse-Dense Adjustment Ring (SDAR). We have evaluated our system across various synthetic and real-world datasets. The accuracy of our pose estimation surpasses existing methods and achieves state-of-the-art performance. Additionally, it outperforms previous monocular methods in terms of novel view synthesis fidelity, matching the results of neural SLAM systems that utilize RGB-D input.
https://arxiv.org/abs/2405.06241
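The abstract does not specify the form of the depth smoothness loss. A common edge-aware formulation, which penalizes depth gradients except where the RGB image itself has strong gradients (likely true object boundaries), can be sketched as follows; this is an illustrative stand-in, not the paper's exact loss:

```python
import numpy as np

def depth_smoothness_loss(depth, image):
    """Edge-aware depth smoothness: penalize first-order depth
    gradients, downweighted where the RGB image has strong gradients."""
    dzdx = np.abs(np.diff(depth, axis=1))
    dzdy = np.abs(np.diff(depth, axis=0))
    # Mean image-gradient magnitude over color channels
    didx = np.mean(np.abs(np.diff(image, axis=1)), axis=2)
    didy = np.mean(np.abs(np.diff(image, axis=0)), axis=2)
    # exp(-|dI|) -> weight ~1 in flat regions, ~0 at image edges
    return (dzdx * np.exp(-didx)).mean() + (dzdy * np.exp(-didy)).mean()

rng = np.random.default_rng(1)
img = rng.random((16, 16, 3))
flat = np.ones((16, 16))
noisy = flat + 0.1 * rng.standard_normal((16, 16))
assert depth_smoothness_loss(flat, img) == 0.0
assert depth_smoothness_loss(noisy, img) > 0.0
```

Such a term regularizes the noisy MVS depth estimates toward piecewise-smooth surfaces without blurring across real edges.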
Transparent objects are ubiquitous in industry, pharmaceuticals, and households. Grasping and manipulating these objects is a significant challenge for robots. Existing methods have difficulty reconstructing complete depth maps for challenging transparent objects, leaving holes in the depth reconstruction. Recent work has shown neural radiance fields (NeRFs) work well for depth perception in scenes with transparent objects, and these depth maps can be used to grasp transparent objects with high accuracy. NeRF-based depth reconstruction can still struggle with especially challenging transparent objects and lighting conditions. In this work, we propose Residual-NeRF, a method to improve depth perception and training speed for transparent objects. Robots often operate in the same area, such as a kitchen. By first learning a background NeRF of the scene without transparent objects to be manipulated, we reduce the ambiguity faced by learning the changes with the new object. We propose training two additional networks: a residual NeRF learns to infer residual RGB values and densities, and a Mixnet learns how to combine background and residual NeRFs. We contribute synthetic and real experiments that suggest Residual-NeRF improves depth perception of transparent objects. The results on synthetic data suggest Residual-NeRF outperforms the baselines with a 46.1% lower RMSE and a 29.5% lower MAE. Real-world qualitative experiments suggest Residual-NeRF leads to more robust depth maps with less noise and fewer holes. Website: this https URL
https://arxiv.org/abs/2405.06181
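One plausible reading of the background/residual combination is sketched below. The actual Mixnet is a learned network; the convex mixing weight `w` here is a given array standing in for its output, and the additive density rule is an assumption for illustration:

```python
import numpy as np

def mix_fields(bg_rgb, bg_sigma, res_rgb, res_sigma, w):
    """Combine background and residual NeRF outputs per sample.

    w in [0, 1] plays the role of the learned Mixnet weight: w = 0
    leaves the background field untouched, w = 1 defers to the
    residual field for color while adding its (non-negative) density.
    """
    sigma = bg_sigma + np.maximum(res_sigma, 0.0) * w       # residual adds density
    rgb = (1 - w)[..., None] * bg_rgb + w[..., None] * res_rgb
    return rgb, sigma

rng = np.random.default_rng(5)
n = 128                                    # samples along rays
bg_rgb, res_rgb = rng.random((n, 3)), rng.random((n, 3))
bg_sigma, res_sigma = rng.random(n), rng.standard_normal(n)
w = np.zeros(n)                            # weight 0: background unchanged
rgb, sigma = mix_fields(bg_rgb, bg_sigma, res_rgb, res_sigma, w)
assert np.allclose(rgb, bg_rgb) and np.allclose(sigma, bg_sigma)
```

The design intent this illustrates: where the scene is unchanged the pretrained background NeRF is reused as-is, and the residual network only has to explain the newly placed transparent object.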
Due to the rare occurrence of anomalous events, a typical approach to anomaly detection is to train an autoencoder (AE) with normal data only so that it learns the patterns or representations of the normal training data. At test time, the trained AE is expected to well reconstruct normal but to poorly reconstruct anomalous data. However, contrary to the expectation, anomalous data is often well reconstructed as well. In order to further separate the reconstruction quality between normal and anomalous data, we propose creating pseudo anomalies from learned adaptive noise by exploiting the aforementioned weakness of AE, i.e., reconstructing anomalies too well. The generated noise is added to the normal data to create pseudo anomalies. Extensive experiments on Ped2, Avenue, ShanghaiTech, CIFAR-10, and KDDCUP datasets demonstrate the effectiveness and generic applicability of our approach in improving the discriminative capability of AEs for anomaly detection.
https://arxiv.org/abs/2405.05886
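The pseudo-anomaly construction can be sketched directly. The noise below is a fixed random array standing in for the adaptive noise the paper learns jointly with the AE, and the clipping range is an assumption for image-valued data:

```python
import numpy as np

def make_pseudo_anomalies(normal_batch, noise, clip=(0.0, 1.0)):
    """Create pseudo-anomalous samples by adding (learned) noise to
    normal data, then clipping back to the valid data range.

    The AE is subsequently trained to reconstruct the *normal* input
    from the pseudo anomaly, so that at test time real anomalies are
    reconstructed poorly, widening the normal/anomalous gap.
    """
    pseudo = normal_batch + noise
    return np.clip(pseudo, *clip)

rng = np.random.default_rng(2)
normals = rng.random((4, 8, 8))                     # batch of normal frames
noise = 0.2 * rng.standard_normal((4, 8, 8))        # stand-in for learned noise
pseudo = make_pseudo_anomalies(normals, noise)
assert pseudo.shape == normals.shape
assert pseudo.min() >= 0.0 and pseudo.max() <= 1.0
```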
We propose an approach for reconstructing a free-moving object from a monocular RGB video. Most existing methods either assume a scene prior, a hand pose prior, or an object category pose prior, or rely on local optimization with multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and that optimizes the sequence globally without any segments. We progressively optimize the object shape and pose simultaneously based on an implicit neural representation. A key aspect of our method is a virtual camera system that significantly reduces the search space of the optimization. We evaluate our method on the standard HO3D dataset and on a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach significantly outperforms most methods and is on par with recent techniques that assume prior information.
https://arxiv.org/abs/2405.05858
In text recognition, self-supervised pre-training has emerged as a good solution for reducing dependence on expensive annotated real data. Previous studies primarily focus on local visual representation by leveraging masked image modeling or sequence contrastive learning. However, they neglect to model the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image to its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the same original image and inverted image under different augmentations to model the semantic-level linguistic context and local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and a new state-of-the-art average word accuracy of 86.6% on the Union14M benchmarks.
https://arxiv.org/abs/2405.05841
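The symmetric superimposition step itself is simple enough to sketch; the 0.5 averaging weight and the horizontal flip axis are assumptions for illustration:

```python
import numpy as np

def symmetric_superimpose(image, axis=1):
    """Superimpose a text-line image with its inverted (flipped) view.

    The model receives only the mixed input and is trained to recover
    both the original and the flipped image from it (the pixel-level
    dual reconstruction target).
    """
    inverted = np.flip(image, axis=axis)
    mixed = 0.5 * (image + inverted)
    return mixed, image, inverted

rng = np.random.default_rng(3)
img = rng.random((32, 100))                # H x W text-line image
mixed, orig, inv = symmetric_superimpose(img)
assert mixed.shape == img.shape
# The superimposed input is itself symmetric under the same flip,
# so separating the two layers requires modeling character shapes
# and linguistic regularities rather than trivial pixel copying.
assert np.allclose(mixed, np.flip(mixed, axis=1))
```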
Computed Tomography (CT) technology reduces radiation hazards to the human body through sparse sampling, but fewer sampling angles pose challenges for image reconstruction. Score-based generative models are widely used in sparse-view CT reconstruction, but their performance diminishes significantly with a sharp reduction in projection angles. Therefore, we propose an ultra-sparse-view CT reconstruction method utilizing multi-scale diffusion models (MSDiff), designed to concentrate on the global distribution of information and facilitate the reconstruction of sparse views with local image characteristics. Specifically, the proposed model ingeniously integrates information from both comprehensive sampling and selectively sparse sampling techniques. Through precise adjustments to the diffusion model, it is capable of extracting diverse noise distributions, furthering the understanding of the overall structure of images, and aiding the fully sampled model in recovering image information more effectively. By leveraging the inherent correlations within the projection data, we have designed an equidistant mask, enabling the model to focus its attention more effectively. Experimental results demonstrate that the multi-scale model approach significantly improves the quality of image reconstruction under ultra-sparse angles, with good generalization across various datasets.
https://arxiv.org/abs/2405.05814
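A minimal version of an equidistant projection-angle mask (the paper's mask design may differ in detail) looks like this:

```python
import numpy as np

def equidistant_mask(n_full_views, n_kept_views):
    """Boolean mask selecting equally spaced projection angles out of
    a fully sampled sinogram, exploiting the angular correlation of
    projection data so the kept views cover the whole angular range."""
    idx = np.round(
        np.linspace(0, n_full_views, n_kept_views, endpoint=False)
    ).astype(int)
    mask = np.zeros(n_full_views, dtype=bool)
    mask[idx] = True
    return mask

mask = equidistant_mask(180, 18)           # ultra-sparse: 18 of 180 views
assert mask.sum() == 18
kept = np.flatnonzero(mask)
gaps = np.diff(kept)
assert gaps.min() == gaps.max() == 10      # exactly equidistant in this case
```

Applying such a mask to the full sinogram yields the sparse-view measurements, while the unmasked data trains the fully sampled model that guides recovery.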
Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR applications. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and higher-quality 3D scene generation, while maintaining the scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.
https://arxiv.org/abs/2405.05768
Detail features of magnetic resonance images play a crucial role in accurate medical diagnosis and treatment, as they capture subtle changes that pose challenges for doctors when making precise judgments. However, the widely utilized naive diffusion model has limitations, as it fails to accurately capture more intricate details. To enhance the quality of MRI reconstruction, we propose a comprehensive detail-preserving reconstruction method using multiple diffusion models to extract structure and detail features in the k-space domain instead of the image domain. Moreover, virtual binary modal masks are utilized to refine the range of values in k-space data through highly adaptive center windows, which allows the model to focus its attention more efficiently. Last but not least, an inverted pyramid structure is employed, where the top-down image information gradually decreases, enabling a cascade representation. The framework effectively represents multi-scale sampled data, taking into account the sparsity of the inverted pyramid architecture, and utilizes the cascade training data distribution to represent multi-scale data. Through a step-by-step refinement approach, the method refines the approximation of details. Finally, the proposed method was evaluated through experiments on clinical and public datasets. The results demonstrate that the proposed method outperforms other methods.
https://arxiv.org/abs/2405.05763
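A simplified center-window k-space mask illustrates the idea; the paper's windows are highly adaptive, whereas this sketch keeps a fixed fraction of each dimension:

```python
import numpy as np

def center_window_mask(shape, keep_fraction):
    """Binary k-space mask keeping a centered low-frequency window.

    In MRI k-space the low frequencies near the center carry most of
    the image energy and coarse structure, so restricting attention
    to a center window focuses the model on structural content.
    """
    h, w = shape
    ch, cw = int(h * keep_fraction), int(w * keep_fraction)
    mask = np.zeros(shape, dtype=bool)
    top, left = (h - ch) // 2, (w - cw) // 2
    mask[top:top + ch, left:left + cw] = True
    return mask

mask = center_window_mask((64, 64), 0.25)
assert mask.sum() == 16 * 16               # 16x16 centered window
assert mask[32, 32] and not mask[0, 0]
```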
Element segmentation is a key step in nondestructive testing of Printed Circuit Boards (PCB) based on Computed Tomography (CT) technology. In recent years, the rapid development of self-supervised pretraining technology has made it possible to obtain general image features without labeled samples and then solve downstream tasks with a small amount of labeled samples, which shows good potential for PCB element segmentation. At present, the Masked Image Modeling (MIM) pretraining model has seen initial application to PCB CT image element segmentation. However, because PCB elements such as vias, wires, and pads are small and regular in size, the global visual field is redundant for reconstructing a single element, which may degrade the performance of the model. To address this issue, we propose an efficient pretraining model based on multi-scale local visual field feature reconstruction for PCB CT image element segmentation (EMLR-seg). In this model, a teacher-guided MIM pretraining model is introduced into PCB CT image element segmentation for the first time, and a multi-scale local visual field extraction (MVE) module is proposed to reduce redundancy by focusing on local visual fields. At the same time, a simple decoder of four Transformer blocks is used. Experiments show that EMLR-seg achieves 88.6% mIoU on our proposed PCB CT image dataset, exceeding the baseline model by 1.2%, while training time is reduced by 29.6 hours, a reduction of 17.4% under the same experimental conditions, which reflects the advantage of EMLR-seg in terms of both performance and efficiency.
https://arxiv.org/abs/2405.05745
Gaussian Splatting has garnered widespread attention due to its exceptional performance. Consequently, SLAM systems based on Gaussian Splatting have emerged, leveraging its capabilities for rapid real-time rendering and high-fidelity mapping. However, current Gaussian Splatting SLAM systems usually struggle with large scene representation and lack effective loop closure adjustments and scene generalization capabilities. To address these issues, we introduce NGM-SLAM, the first GS-SLAM system that utilizes neural radiance field submaps for progressive scene expression, effectively integrating the strengths of neural radiance fields and 3D Gaussian Splatting. We develop neural implicit submaps as supervision and achieve high-quality scene expression and online loop closure adjustments through Gaussian rendering of fused submaps. Our results on multiple real-world scenes and large-scale scene datasets demonstrate that our method achieves accurate gap filling and high-quality scene expression, supports monocular, stereo, and RGB-D inputs, and attains state-of-the-art scene reconstruction and tracking performance.
https://arxiv.org/abs/2405.05702
In this paper, we tackle the problem of grasping transparent and specular objects. This issue is important, yet it remains unsolved in robotics because depth cameras fail to recover the accurate geometry of such objects. For the first time, we propose ASGrasp, a 6-DoF grasp detection network that uses an RGB-D active stereo camera. ASGrasp utilizes a two-layer learning-based stereo network for transparent object reconstruction, enabling material-agnostic object grasping in cluttered environments. In contrast to existing RGB-D based grasp detection methods, which heavily depend on depth restoration networks and the quality of depth maps generated by depth cameras, our system distinguishes itself by its ability to directly utilize raw IR and RGB images for transparent object geometry reconstruction. We create an extensive synthetic dataset through domain randomization based on GraspNet-1Billion. Our experiments demonstrate that ASGrasp achieves over a 90% success rate for generalizable transparent object grasping in both simulation and the real world via seamless sim-to-real transfer. Our method significantly outperforms SOTA networks and even surpasses the performance upper bound set by perfect visible point cloud inputs. Project page: this https URL
https://arxiv.org/abs/2405.05648
Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models together tends to introduce significant interference between tasks and reduce video clarity because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. Referring to recent work on face generation, we choose the VQ-embedding space due to its excellent editability and fidelity performance. To enhance the framework's generalization capabilities for unseen identities, we incorporate identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to elevate synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos. Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at this http URL.
https://arxiv.org/abs/2405.05636
We introduce a new family of minimal problems for reconstruction from multiple views. Our primary focus is a novel approach to autocalibration, a long-standing problem in computer vision. Traditional approaches to this problem, such as those based on Kruppa's equations or the modulus constraint, rely explicitly on the knowledge of multiple fundamental matrices or a projective reconstruction. In contrast, we consider a novel formulation involving constraints on image points, the unknown depths of 3D points, and a partially specified calibration matrix $K$. For $2$ and $3$ views, we present a comprehensive taxonomy of minimal autocalibration problems obtained by relaxing some of these constraints. These problems are organized into classes according to the number of views and any assumed prior knowledge of $K$. Within each class, we determine problems with the fewest -- or a relatively small number of -- solutions. From this zoo of problems, we devise three practical solvers. Experiments with synthetic and real data and interfacing our solvers with COLMAP demonstrate that we achieve superior accuracy compared to state-of-the-art calibration methods. The code is available at this https URL
https://arxiv.org/abs/2405.05605
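The constraint formulation described above can be written, in one standard form (the paper's exact parameterization of the depths and of the partially specified $K$ may differ), as:

```latex
% Image points x_{ij} (homogeneous) of 3-D points X_i observed in views
% j = 1, ..., m, with unknown depths \lambda_{ij}, rotations R_j,
% translations t_j, and a partially specified calibration matrix K:
\lambda_{ij}\, K^{-1} x_{ij} \;=\; R_j X_i + t_j .
% Minimal autocalibration problems arise by fixing some entries of K
% (e.g. zero skew or a known principal point) and counting the
% remaining unknowns against these equations for 2 or 3 views.
```

Note how this avoids fundamental matrices and projective reconstructions entirely: the unknowns are the depths, poses, and the free entries of $K$, which is what makes the taxonomy of minimal problems by prior knowledge of $K$ possible.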
Magnetic Resonance Imaging (MRI) is a widely used imaging technique; however, it has the limitation of long scanning time. Though previous model-based and learning-based MRI reconstruction methods have shown promising performance, most of them have not fully utilized the edge prior of MR images, and there is still much room for improvement. In this paper, we build a joint edge optimization model that not only incorporates individual regularizers specific to both the MR image and the edges, but also enforces a co-regularizer to effectively establish a stronger correlation between them. Specifically, the edge information is defined through a non-edge probability map to guide the image reconstruction during the optimization process. Meanwhile, the regularizers pertaining to images and edges are incorporated into a deep unfolding network to automatically learn their respective inherent a priori information. Numerical experiments, consisting of multi-coil and single-coil MRI data with different sampling schemes at a variety of sampling factors, demonstrate that the proposed method outperforms the compared methods.
https://arxiv.org/abs/2405.05564
Interaction intention anticipation aims to jointly predict future hand trajectories and interaction hotspots. Existing research has often treated trajectory forecasting and interaction hotspot prediction as separate tasks, or has solely considered the impact of trajectories on interaction hotspots, leading to the accumulation of prediction errors over time. However, a deeper inherent connection exists between hand trajectories and interaction hotspots, which allows for continuous mutual correction between them. Building upon this relationship, we establish a novel Bidirectional prOgressive Transformer (BOT), which introduces a bidirectional progressive mechanism into the anticipation of interaction intention. Initially, BOT maximizes the utilization of spatial information from the last observed frame through a Spatial-Temporal Reconstruction Module, mitigating conflicts arising from changes of view in first-person videos. Subsequently, based on two independent prediction branches, a Bidirectional Progressive Enhancement Module is introduced to mutually improve the prediction of hand trajectories and interaction hotspots over time and minimize error accumulation. Finally, acknowledging the intrinsic randomness of natural human behavior, we employ a Trajectory Stochastic Unit and a C-VAE to introduce appropriate uncertainty into trajectories and interaction hotspots, respectively. Our method achieves state-of-the-art results on three benchmark datasets, Epic-Kitchens-100, EGO4D, and EGTEA Gaze+, demonstrating superior performance in complex scenarios.
https://arxiv.org/abs/2405.05552
Neural Radiance Fields (NeRF) have emerged as a powerful paradigm for 3D scene representation, offering high-fidelity renderings and reconstructions from a set of sparse and unstructured sensor data. In the context of autonomous robotics, where perception and understanding of the environment are pivotal, NeRF holds immense promise for improving performance. In this paper, we present a comprehensive survey and analysis of the state-of-the-art techniques for utilizing NeRF to enhance the capabilities of autonomous robots. We especially focus on the perception, localization and navigation, and decision-making modules of autonomous robots and delve into tasks crucial for autonomous operation, including 3D reconstruction, segmentation, pose estimation, simultaneous localization and mapping (SLAM), navigation and planning, and interaction. Our survey meticulously benchmarks existing NeRF-based methods, providing insights into their strengths and limitations. Moreover, we explore promising avenues for future research and development in this domain. Notably, we discuss the integration of advanced techniques such as 3D Gaussian splatting (3DGS), large language models (LLM), and generative AIs, envisioning enhanced reconstruction efficiency, scene understanding, and decision-making capabilities. This survey serves as a roadmap for researchers seeking to leverage NeRFs to empower autonomous robots, paving the way for innovative solutions that can navigate and interact seamlessly in complex environments.
https://arxiv.org/abs/2405.05526
Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.
扩散模型是一个强大的生成框架,但其推理代价高昂。现有的加速方法在极低步数条件下运行时,往往会牺牲图像质量,或在复杂条件约束下失效。在本文中,我们提出了一个全新的蒸馏框架,专为仅用一到三步即可生成高保真、多样化样本而设计。我们的方法包括三个关键组件:(i)反向蒸馏(Backward Distillation),通过在学生模型自身的反向轨迹上对其进行校准来缓解训练与推理之间的差异;(ii)平移重构损失(Shifted Reconstruction Loss),根据当前时间步动态调整知识迁移;(iii)噪声校正(Noise Correction),一种推理时技术,通过处理噪声预测中的奇点来提升样本质量。通过大量实验,我们证明了我们的方法在定量指标和人工评估方面均优于现有竞争方法。值得注意的是,它仅使用三个去噪步骤就实现了与教师模型相当的性能,从而实现高效的高质量生成。
https://arxiv.org/abs/2405.05224
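The core idea of Backward Distillation — calibrating the student on states produced by its *own* backward trajectory, with the teacher supplying targets at those states — can be illustrated with a toy NumPy sketch. The `teacher_denoise` and `student_denoise` functions below are invented stand-ins, not the paper's models; the sketch only shows the trajectory-following structure of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoise(x, t):
    # Toy "teacher": shrinks the sample toward the origin, more strongly at low t.
    return x * (t / (t + 1.0))

def student_denoise(x, t, w):
    # Toy "student": one learnable scalar per time step.
    return x * w[t]

def backward_distillation_loss(x_T, w, num_steps=3):
    """Calibrate the student on its OWN backward trajectory: each state is
    produced by the student, and the teacher supplies the target there."""
    x = x_T
    loss = 0.0
    for t in reversed(range(num_steps)):
        target = teacher_denoise(x, t)    # teacher target at the student's state
        pred = student_denoise(x, t, w)   # student prediction at the same state
        loss += np.mean((pred - target) ** 2)
        x = pred                          # continue along the student's trajectory
    return loss / num_steps

x_T = rng.normal(size=(4, 8))             # toy "pure noise" batch
w_uncalibrated = np.ones(3)
w_calibrated = np.array([0.0, 0.5, 2 / 3])  # matches the toy teacher step-by-step
```

Because the loss is evaluated along the student's own rollout rather than the teacher's, there is no train/inference trajectory mismatch in this toy setting: `w_calibrated` drives the loss to zero, while `w_uncalibrated` does not.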
Recent progress in remote sensing image (RSI) super-resolution (SR) has exhibited remarkable performance using deep neural networks, e.g., Convolutional Neural Networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs in large-scale RSI. To alleviate these issues, we make the first attempt to integrate the Vision State Space Model (Mamba) for RSI-SR, which specializes in processing large-scale RSI by capturing long-range dependency with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR, to explore the spatial and frequency correlations. In particular, our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM) to grasp their merits for effective spatial-frequency fusion. Recognizing that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features for accurate feature fusion via learnable scaling adaptors. Extensive experiments on AID, DOTA, and DIOR benchmarks demonstrate that our FMSR outperforms the state-of-the-art Transformer-based method HAT-L in terms of PSNR by 0.11 dB on average, while consuming only 28.05% and 19.08% of its memory consumption and complexity, respectively.
近年来,借助深度神经网络(如卷积神经网络和Transformer),遥感图像(RSI)超分辨率(SR)取得了显著的性能进展。然而,现有SR方法往往存在感受野有限或计算开销呈二次增长的问题,导致全局表示欠佳,并在大规模RSI上产生不可接受的计算成本。为了缓解这些问题,我们首次尝试将视觉状态空间模型(Mamba)引入RSI-SR,它能以线性复杂度捕捉长距离依赖,专门适用于处理大规模RSI。为了实现更好的SR重建,我们在Mamba的基础上设计了一个频率辅助Mamba框架,称为FMSR,以挖掘空间与频率上的关联。特别地,我们的FMSR采用多级融合架构,配备频率选择模块(FSM)、视觉状态空间模块(VSSM)和混合门模块(HGM),以发挥各自优势实现有效的空间-频率融合。认识到全局与局部依赖互补且均有益于SR,我们进一步通过可学习的缩放适配器重新校准这些多级特征,以实现准确的特征融合。在AID、DOTA和DIOR基准上的大量实验表明,我们的FMSR在PSNR上平均比基于Transformer的最先进方法HAT-L高出0.11 dB,而内存消耗和计算复杂度分别仅为其28.05%和19.08%。
https://arxiv.org/abs/2405.04964
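The spatial-frequency fusion idea behind FMSR can be sketched in miniature: a frequency branch that selects part of the image's 2D spectrum, and a gate that blends it with a spatial branch. This toy NumPy version is an assumption-laden illustration — the real FSM/HGM are learned modules, whereas here the frequency selection is a fixed low-pass mask and the gate is a fixed scalar.

```python
import numpy as np

def frequency_select(img, keep_ratio=0.25):
    """Toy Frequency Selection: keep only the central (low-frequency) part of
    the 2D spectrum and transform back, giving a crude global-context branch."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    kh, kw = int(h * keep_ratio), int(w * keep_ratio)
    cy, cx = h // 2, w // 2
    mask = np.zeros_like(F, dtype=bool)
    mask[cy - kh:cy + kh + 1, cx - kw:cx + kw + 1] = True
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

def hybrid_gate(spatial, frequency, alpha=0.5):
    # Toy Hybrid Gate: convex combination of the two branches
    # (in FMSR the gating is learned; here alpha is a fixed scalar).
    return alpha * spatial + (1.0 - alpha) * frequency

img = np.random.default_rng(1).normal(size=(8, 8))
fused = hybrid_gate(img, frequency_select(img, keep_ratio=0.25))
```

By Parseval's theorem, masking the spectrum can only remove energy, so the frequency branch is a smoothed view of the input; the gate then trades it off against the full-detail spatial branch.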
Attack knowledge graph construction seeks to convert textual cyber threat intelligence (CTI) reports into structured representations, portraying the evolutionary traces of cyber attacks. Even though previous research has proposed various methods to construct attack knowledge graphs, they generally suffer from limited generalization capability to diverse knowledge types as well as the requirement of expertise in model design and tuning. Addressing these limitations, we seek to utilize Large Language Models (LLMs), which have achieved enormous success in a broad range of tasks given exceptional capabilities in both language understanding and zero-shot task fulfillment. Thus, we propose a fully automatic LLM-based framework to construct attack knowledge graphs, named AttacKG+. Our framework consists of four consecutive modules: rewriter, parser, identifier, and summarizer, each of which is implemented by instruction prompting and in-context learning empowered by LLMs. Furthermore, we upgrade the existing attack knowledge schema and propose a comprehensive version. We represent a cyber attack as a temporally unfolding event, each temporal step of which encapsulates three layers of representation, including behavior graph, MITRE TTP labels, and state summary. Extensive evaluation demonstrates that: 1) our formulation seamlessly satisfies the information needs in threat event analysis, 2) our construction framework is effective in faithfully and accurately extracting the information defined by AttacKG+, and 3) our attack graph directly benefits downstream security practices such as attack reconstruction. All the code and datasets will be released upon acceptance.
攻击知识图谱构建旨在将文本形式的网络威胁情报(CTI)报告转换为结构化表示,刻画网络攻击的演化轨迹。尽管已有研究提出了多种构建攻击知识图谱的方法,但它们普遍存在对多样知识类型泛化能力有限的问题,并且对模型设计与调优的专业知识要求较高。为解决这些局限,我们寻求利用大型语言模型(LLM):凭借在语言理解和零样本任务完成上的卓越能力,LLM已在广泛任务中取得巨大成功。因此,我们提出了一个基于LLM的全自动攻击知识图谱构建框架,名为AttacKG+。我们的框架由四个连续的模块组成:改写器、解析器、识别器和总结器,每个模块均通过LLM赋能的指令提示和上下文学习来实现。此外,我们升级了现有的攻击知识模式,提出了一个更全面的版本。我们将网络攻击表示为一个随时间展开的事件,其每个时间步都包含三层表示:行为图、MITRE TTP标签和状态摘要。大量评估表明:1)我们的表述无缝满足威胁事件分析中的信息需求;2)我们的构建框架能够忠实而准确地提取AttacKG+所定义的信息;3)我们的攻击图能直接助益攻击重建等下游安全实践。所有代码和数据集将在论文被接收后发布。
https://arxiv.org/abs/2405.04753
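The four-module pipeline (rewriter → parser → identifier → summarizer) can be sketched as a chain of prompted LLM calls. Everything below is hypothetical scaffolding: the prompt texts, the `llm` callable, and the output keys are placeholders, not AttacKG+'s actual prompts or schema; the abstract only specifies the module order and the three-layer output (behavior graph, TTP labels, state summary).

```python
from typing import Callable

# Placeholder prompts, one per module (not the paper's actual instructions).
REWRITER_PROMPT = "Rewrite the following CTI report into clear, ordered attack steps:\n{text}"
PARSER_PROMPT = "Extract (subject, action, object) behavior triples from:\n{text}"
IDENTIFIER_PROMPT = "Label each behavior with a MITRE ATT&CK TTP id:\n{text}"
SUMMARIZER_PROMPT = "Summarize the attacker/system state after these behaviors:\n{text}"

def attackg_plus(report: str, llm: Callable[[str], str]) -> dict:
    """Run the four modules in sequence; each consumes the previous output."""
    rewritten = llm(REWRITER_PROMPT.format(text=report))       # rewriter
    triples = llm(PARSER_PROMPT.format(text=rewritten))        # parser
    ttps = llm(IDENTIFIER_PROMPT.format(text=triples))         # identifier
    state = llm(SUMMARIZER_PROMPT.format(text=triples))        # summarizer
    return {"behavior_graph": triples, "ttp_labels": ttps, "state_summary": state}

# Demo with a stub "LLM" that just echoes its prompt.
result = attackg_plus("A phishing email delivered a malicious macro.", lambda p: "OUT: " + p)
```

In practice `llm` would wrap a real model call with in-context examples per module; the stub only demonstrates the data flow and the per-temporal-step three-layer output.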