Neural fields have been broadly investigated as scene representations for the reproduction and novel generation of diverse outdoor scenes, including those autonomous vehicles and robots must handle. While successful approaches for RGB and LiDAR data exist, neural reconstruction methods for radar as a sensing modality have been largely unexplored. Operating at millimeter wavelengths, radar sensors are robust to scattering in fog and rain, and, as such, offer a complementary modality to active and passive optical sensing techniques. Moreover, existing radar sensors are highly cost-effective and deployed broadly in robots and vehicles that operate outdoors. We introduce Radar Fields - a neural scene reconstruction method designed for active radar imagers. Our approach unites an explicit, physics-informed sensor model with an implicit neural geometry and reflectance model to directly synthesize raw radar measurements and extract scene occupancy. The proposed method does not rely on volume rendering. Instead, we learn fields in Fourier frequency space, supervised with raw radar data. We validate the effectiveness of the method across diverse outdoor scenarios, including urban scenes with dense vehicles and infrastructure, and in harsh weather scenarios, where mm-wavelength sensing is especially favorable.
https://arxiv.org/abs/2405.04662
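To make the idea of a learned implicit geometry/reflectance field concrete, here is a minimal Python (PyTorch) sketch of an MLP field with Fourier positional encoding that maps 3D query points to occupancy and reflectance. It is an illustrative stand-in under assumed layer sizes and encoding, not the architecture or the frequency-space supervision from the paper.

import torch
import torch.nn as nn

def fourier_encode(x, num_bands=6):
    # x: (N, 3) query points; returns (N, 3 + 2*3*num_bands) encoded features.
    freqs = 2.0 ** torch.arange(num_bands, dtype=x.dtype)
    angles = x[..., None] * freqs                               # (N, 3, num_bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([x, enc.flatten(start_dim=-2)], dim=-1)

class RadarField(nn.Module):
    def __init__(self, num_bands=6, hidden=128):
        super().__init__()
        self.num_bands = num_bands
        in_dim = 3 + 2 * 3 * num_bands
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                               # [occupancy, reflectance] logits
        )

    def forward(self, points):
        out = self.mlp(fourier_encode(points, self.num_bands))
        occupancy = torch.sigmoid(out[..., 0])
        reflectance = torch.sigmoid(out[..., 1])
        return occupancy, reflectance

field = RadarField()
occupancy, reflectance = field(torch.rand(1024, 3))             # query 1024 points in the unit cube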
Existing diffusion-based video editing methods have achieved impressive results in motion editing. Most existing methods focus on motion alignment between the edited video and the reference video. However, these methods do not constrain the background and object content of the video to remain unchanged, so users may obtain unexpected results. In this paper, we propose a one-shot video motion editing method called Edit-Your-Motion that requires only a single text-video pair for training. Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to decouple spatio-temporal features in space-time diffusion models. DPL separates learning of object content and motion into two training stages. In the first training stage, we focus on learning the spatial features (the features of object content) and break down the temporal relationships in the video frames by shuffling them. We further propose Recurrent-Causal Attention (RC-Attn) to learn consistent content features of the object from unordered video frames. In the second training stage, we restore the temporal relationship in the video frames to learn the temporal features (the features of the background and the object's motion). We also adopt a Noise Constraint Loss to smooth out inter-frame differences. Finally, in the inference stage, we inject the content features of the source object into the editing branch through a two-branch structure (editing branch and reconstruction branch). With Edit-Your-Motion, users can edit the motion of objects in the source video to generate more exciting and diverse videos. Comprehensive qualitative experiments, quantitative experiments, and user preference studies demonstrate that Edit-Your-Motion outperforms other methods.
https://arxiv.org/abs/2405.04496
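As a rough illustration of the first DPL training stage described above, the sketch below (PyTorch, with placeholder tensor shapes) breaks temporal relationships by shuffling frame order, so that only content, not motion, can be learned from the batch; the second stage would feed frames in their original order. This is not the authors' training code.

import torch

def shuffle_frames(video):
    # video: (T, C, H, W); return the same frames in a random temporal order.
    perm = torch.randperm(video.shape[0])
    return video[perm]

video = torch.rand(16, 3, 64, 64)        # toy 16-frame clip
stage1_batch = shuffle_frames(video)     # temporal order destroyed: only object content is learnable
stage2_batch = video                     # stage 2 keeps the original order to learn motion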
Neural Radiance Field (NeRF) achieves extremely high quality in object-scale and indoor scene reconstruction. However, several challenges arise when reconstructing large-scale scenes. MLP-based NeRFs suffer from limited network capacity, while volume-based NeRFs are heavily memory-consuming as the scene resolution increases. Recent approaches propose to geographically partition the scene and learn each sub-region with an individual NeRF. Such partitioning strategies help volume-based NeRFs exceed the single-GPU memory limit and scale to larger scenes. However, this approach requires multiple background NeRFs to handle out-of-partition rays, which leads to redundant learning. Inspired by the fact that the background of the current partition is the foreground of an adjacent partition, we propose a scalable scene reconstruction method based on joint Multi-resolution Hash Grids, named DistGrid. In this method, the scene is divided into multiple closely tiled yet non-overlapping axis-aligned bounding boxes, and a novel segmented volume rendering method is proposed to handle cross-boundary rays, thereby eliminating the need for background NeRFs. The experiments demonstrate that our method outperforms existing methods on all evaluated large-scale scenes and provides visually plausible scene reconstructions. The scalability of our method with respect to reconstruction quality is further evaluated qualitatively and quantitatively.
https://arxiv.org/abs/2405.04416
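The segmented rendering idea rests on splitting a ray into per-box segments. Below is a minimal NumPy sketch of slab-based ray/AABB intersection that sorts the segments of a cross-boundary ray so each can be rendered by its own grid and composited front to back; box layout and ray values are illustrative assumptions, not the paper's implementation.

import numpy as np

def ray_aabb(origin, direction, box_min, box_max):
    # Slab test; returns (t_near, t_far) along the ray, or None if the box is missed.
    inv = 1.0 / np.where(np.abs(direction) < 1e-12, 1e-12, direction)
    t0 = (box_min - origin) * inv
    t1 = (box_max - origin) * inv
    t_near = np.max(np.minimum(t0, t1))
    t_far = np.min(np.maximum(t0, t1))
    return (t_near, t_far) if t_far > max(t_near, 0.0) else None

def segment_ray(origin, direction, boxes):
    # boxes: list of (box_min, box_max); returns segments sorted by entry distance.
    segments = []
    for idx, (bmin, bmax) in enumerate(boxes):
        hit = ray_aabb(origin, direction, bmin, bmax)
        if hit is not None:
            segments.append((idx, max(hit[0], 0.0), hit[1]))
    return sorted(segments, key=lambda s: s[1])

boxes = [(np.zeros(3), np.ones(3)),
         (np.array([1.0, 0.0, 0.0]), np.array([2.0, 1.0, 1.0]))]
ray_origin = np.array([-0.5, 0.5, 0.5])
ray_direction = np.array([1.0, 0.0, 0.0])
print(segment_ray(ray_origin, ray_direction, boxes))    # one segment per crossed box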
Neural Radiance Fields (NeRFs) have become a rapidly growing research field with the potential to revolutionize typical photogrammetric workflows, such as those used for 3D scene reconstruction. As input, NeRFs require multi-view images with corresponding camera poses as well as the interior orientation. In the typical NeRF workflow, the camera poses and the interior orientation are estimated in advance with Structure from Motion (SfM). But the quality of the resulting novel views, which depends on different parameters such as the number and distribution of available images, as well as the accuracy of the related camera poses and interior orientation, is difficult to predict. In addition, SfM is a time-consuming pre-processing step, and its quality strongly depends on the image content. Furthermore, the undefined scaling factor of SfM hinders subsequent steps in which metric information is required. In this paper, we evaluate the potential of NeRFs for industrial robot applications. We propose an alternative to SfM pre-processing: we capture the input images with a calibrated camera that is attached to the end effector of an industrial robot and determine accurate camera poses with metric scale based on the robot kinematics. We then investigate the quality of the novel views by comparing them to ground truth, and by computing an internal quality measure based on ensemble methods. For evaluation purposes, we acquire multiple datasets that pose challenges for reconstruction typical of industrial applications, like reflective objects, poor texture, and fine structures. We show that the robot-based pose determination reaches accuracy similar to SfM in non-demanding cases, while having clear advantages in more challenging scenarios. Finally, we present the first results of applying the ensemble method to estimate the quality of synthetic novel views in the absence of ground truth.
https://arxiv.org/abs/2405.04345
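A minimal NumPy sketch of the robot-based pose determination: with a hand-eye-calibrated camera on the end effector, the metric camera pose is the composition of the forward-kinematics pose of the end effector and the fixed hand-eye transform. The matrices here are toy values; in practice they come from the robot controller and the calibration routine.

import numpy as np

def se3(rotation, translation):
    # Build a 4x4 homogeneous transform from a rotation matrix and a translation vector.
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

# Pose of the end effector in the robot base frame (from forward kinematics).
T_base_ee = se3(np.eye(3), np.array([0.4, 0.1, 0.3]))
# Fixed camera pose in the end-effector frame (from hand-eye calibration).
T_ee_cam = se3(np.eye(3), np.array([0.0, 0.0, 0.05]))

# Camera-to-base pose with metric scale, usable as NeRF input.
T_base_cam = T_base_ee @ T_ee_cam
print(T_base_cam[:3, 3])    # metric camera position in the base frame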
Automatic perception of image quality is a challenging problem that impacts billions of Internet and social media users daily. To advance research in this field, we propose a no-reference image quality assessment (NR-IQA) method termed Cross-IQA based on the vision transformer (ViT) model. The proposed Cross-IQA method can learn image quality features from unlabeled image data. We construct a pretext task of synthesized-image reconstruction to extract image quality information in an unsupervised manner using ViT blocks. The pretrained encoder of Cross-IQA is then used to fine-tune a linear regression model for score prediction. Experimental results show that Cross-IQA achieves state-of-the-art performance in assessing low-frequency degradation information (e.g., color change, blurring) of images compared with classical full-reference IQA and NR-IQA methods on the same datasets.
https://arxiv.org/abs/2405.04311
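The score-prediction step amounts to fitting a linear regression head on frozen encoder features. A minimal sketch in Python with scikit-learn, using random placeholders for the encoder features and the quality scores:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 768))        # frozen encoder outputs (placeholder)
mos = rng.uniform(1.0, 5.0, size=200)         # mean-opinion quality scores (placeholder)

head = LinearRegression().fit(features, mos)  # linear regression head for score prediction
predicted = head.predict(features[:5])
print(predicted)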
Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively studied and great progress has been made, key challenges still hinder its broad real-world application: 1) the inherent motion/rotation ambiguity requires either explicit camera motion recovery with extra constraints or complex Procrustean Alignment; 2) existing low-rank modeling of the global shape can over-penalize drastic deformations in the 3D shape sequence. This paper proposes to resolve the above issues from a spatial-temporal modeling perspective. First, we propose a novel Temporally-smooth Procrustean Alignment module that estimates 3D deforming shapes and adjusts the camera motion by aligning the 3D shape sequence consecutively. Our new alignment module removes the requirement of a complex reference 3D shape during alignment, which is more conducive to non-isotropic deformation modeling. Second, we propose a spatially-weighted approach to enforce the low-rank constraint adaptively at different locations, to better accommodate reconstruction of drastic, spatially-variant deformations. Our model outperforms existing low-rank based methods, and extensive experiments across different datasets validate the effectiveness of our method.
https://arxiv.org/abs/2405.04309
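As a generic illustration of the alignment building block (not the paper's module), the NumPy sketch below aligns each 3D shape to its temporal predecessor with orthogonal Procrustes, the classical SVD-based step underlying a temporally smooth alignment of a deforming shape sequence.

import numpy as np

def procrustes_rotation(source, target):
    # Both (P, 3); returns the rotation R minimizing ||target_centered - source_centered @ R.T||_F.
    a = source - source.mean(axis=0)
    b = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(b.T @ a)
    d = np.sign(np.linalg.det(u @ vt))           # keep a proper rotation (det = +1)
    return u @ np.diag([1.0, 1.0, d]) @ vt

rng = np.random.default_rng(0)
shapes = rng.normal(size=(10, 50, 3))            # toy sequence of 10 shapes, 50 points each
aligned = [shapes[0]]
for t in range(1, len(shapes)):
    r = procrustes_rotation(shapes[t], aligned[-1])
    centered = shapes[t] - shapes[t].mean(axis=0)
    aligned.append(centered @ r.T + aligned[-1].mean(axis=0))   # align to the previous shape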
In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and of unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability and possibility for recourse. This second part serves as a position statement on future research directions in the field and a caveat to the flourishing market of fake-content checkers.
https://arxiv.org/abs/2405.04181
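One of the post-accuracy checks the paper argues for is calibration. A minimal NumPy sketch of a binned calibration error for a binary real/fake detector, with simulated probabilities and labels standing in for a trained classifier's outputs:

import numpy as np

def calibration_error(probs, labels, n_bins=10):
    # Bin by predicted P(fake) and compare the mean prediction in each bin
    # to the empirical fake rate in that bin (a binned calibration error).
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)                               # 1 = fake, 0 = real
probs = np.clip(labels * 0.8 + rng.normal(0.1, 0.15, size=2000), 0.0, 1.0)
print(f"calibration error: {calibration_error(probs, labels):.3f}")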
Modern AI systems such as self-driving cars and game-playing agents achieve superhuman performance, but often lack human-like features such as generalization, interpretability and human inter-operability. Inspired by the rich interactions between language and decision-making in humans, we introduce Policy Learning with a Language Bottleneck (PLLB), a framework enabling AI agents to generate linguistic rules that capture the strategies underlying their most rewarding behaviors. PLLB alternates between a rule generation step guided by language models, and an update step where agents learn new policies guided by rules. In a two-player communication game, a maze solving task, and two image reconstruction tasks, we show that PLLB agents are not only able to learn more interpretable and generalizable behaviors, but can also share the learned rules with human users, enabling more effective human-AI coordination.
https://arxiv.org/abs/2405.04118
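A structural sketch of the PLLB alternation, with a toy bandit environment and stand-in functions in place of the language model; names and the environment are assumptions for illustration only.

import random

def generate_rule(episodes):
    # Stand-in for querying a language model with the most rewarding episodes.
    best_action = max(episodes, key=lambda e: e["reward"])["action"]
    return f"prefer action {best_action}"

def update_policy(policy, rule):
    # Stand-in for learning a new policy guided by the rule.
    preferred = int(rule.split()[-1])
    return [0.8 if a == preferred else 0.2 / (len(policy) - 1) for a in range(len(policy))]

rewards = [0.1, 0.9, 0.3]                       # a 3-armed bandit; arm 1 is best
policy = [1 / 3] * 3
for _ in range(5):
    episodes = [{"action": a, "reward": rewards[a]} for a in
                random.choices(range(3), weights=policy, k=20)]
    rule = generate_rule(episodes)              # language bottleneck: behavior -> rule
    policy = update_policy(policy, rule)        # rule -> new policy
print(rule, policy)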
Generating consistent multiple views for 3D reconstruction tasks is still a challenge for existing image-to-3D diffusion models. Generally, incorporating 3D representations into a diffusion model decreases the model's speed, as well as its generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one input image, our model is able to generate 3D meshes that surpass baseline methods on evaluation metrics, including PSNR, SSIM, and LPIPS.
https://arxiv.org/abs/2405.03894
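The epipolar constraint used to tie views together can be illustrated with standard two-view geometry. A minimal NumPy sketch that computes, for a pixel in the source view, the epipolar line in the target view from assumed intrinsics and relative pose (not the paper's attention mechanism):

import numpy as np

def skew(t):
    # Cross-product matrix [t]_x so that skew(t) @ v == np.cross(t, v).
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

K = np.array([[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]])   # shared intrinsics
R = np.eye(3)                        # relative rotation source -> target
t = np.array([0.1, 0.0, 0.0])        # relative translation (baseline)

E = skew(t) @ R                                      # essential matrix
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)        # fundamental matrix

x_src = np.array([100.0, 120.0, 1.0])                # homogeneous pixel in the source view
line = F @ x_src                                     # epipolar line a*x + b*y + c = 0 in the target view
print(line / np.linalg.norm(line[:2]))               # normalized line coefficients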
Because of its computational-complexity advantages over traditional localization algorithms, fingerprint-based localization is in increasing demand. Expanding the fingerprint database in the frequency domain through channel reconstruction can improve localization accuracy. However, in a mobility environment, the channel reconstruction accuracy is limited by the time-varying parameters. In this paper, we propose a system that extracts the time-varying parameters using the space-alternating generalized expectation-maximization (SAGE) algorithm and then uses a variational auto-encoder (VAE) to reconstruct the channel state information on another channel. The proposed scheme is tested on data generated by the DeepMIMO channel model. A mathematical analysis of the viability of our system is also presented.
https://arxiv.org/abs/2405.03842
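A minimal PyTorch sketch of the VAE reconstruction step, mapping features of one channel to the channel state information of another; dimensions, data, and the loss weighting are placeholders rather than the DeepMIMO setup or the paper's network.

import torch
import torch.nn as nn

class ChannelVAE(nn.Module):
    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        return self.dec(z), mu, logvar

model = ChannelVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 64)          # toy CSI features on the source channel
target = torch.randn(256, 64)     # toy CSI on the channel to reconstruct

recon, mu, logvar = model(x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, target) + 1e-3 * kl
loss.backward()
opt.step()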
We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.
https://arxiv.org/abs/2405.03689
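The core trick of turning language into tractable losses can be sketched as a contact loss between named joints. The Python (PyTorch) example below assumes a hypothetical joint indexing and a simple hinge on inter-joint distance; it is not the paper's loss formulation.

import torch

JOINTS = {"left_hand": 0, "right_shoulder": 1, "head": 2}   # hypothetical joint indexing

def contact_loss(joints_3d, pairs, margin=0.05):
    # joints_3d: (J, 3); pairs: [(joint_a, joint_b), ...] derived from an LMM description.
    loss = joints_3d.new_zeros(())
    for a, b in pairs:
        dist = torch.linalg.norm(joints_3d[JOINTS[a]] - joints_3d[JOINTS[b]])
        loss = loss + torch.clamp(dist - margin, min=0.0)    # penalize separation beyond the margin
    return loss

joints = torch.rand(3, 3, requires_grad=True)
loss = contact_loss(joints, [("left_hand", "right_shoulder")])
loss.backward()    # gradients pull the described joints into contact during pose optimization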
We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence. Since supervised data is often technically impossible to obtain, assumptions and biases have to be imposed to solve this inverse problem, and we choose to model them explicitly. Rather than initializing a latent irradiance ("template") by heuristics to estimate deformation, we select one of the images as a reference, and model the deformation in this image by the aggregation of the optical flow from it to other images, exploiting a prior imposed by Central Limit Theorem. Then with a novel flow inversion module, the model registers each image TO the template but WITHOUT the template, avoiding artifacts related to poor template initialization. To illustrate the robustness of the method, we simply (i) select the first frame as the reference and (ii) use the simplest optical flow to estimate the warpings, yet the improvement in registration is decisive in the final reconstruction, as we achieve state-of-the-art performance despite its simplicity. The method establishes a strong baseline that can be further improved by integrating it seamlessly into more sophisticated pipelines, or with domain-specific methods if so desired.
https://arxiv.org/abs/2405.03662
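A minimal NumPy sketch of the CLT-motivated aggregation: averaging the optical flow from the reference frame to all other frames so turbulent displacements cancel, leaving an estimate of the reference frame's own deformation. The negation used here is only a first-order stand-in for the paper's flow inversion module, and the flow fields are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
flows = rng.normal(scale=2.0, size=(30, 128, 128, 2))    # flow from the reference to 30 other frames

mean_flow = flows.mean(axis=0)      # aggregated flow: estimated deformation of the reference frame
inverse_flow = -mean_flow           # crude first-order inversion used to register toward the template
print(inverse_flow.shape)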
Image-based 3D reconstruction is a challenging task that involves inferring the 3D shape of an object or scene from a set of input images. Learning-based methods have gained attention for their ability to directly estimate 3D shapes. This review paper focuses on state-of-the-art techniques for 3D reconstruction, including the generation of novel, unseen views. An overview of recent developments in the Gaussian Splatting method is provided, covering input types, model structures, output representations, and training strategies. Unresolved challenges and future directions are also discussed. Given the rapid progress in this domain and the numerous opportunities for enhancing 3D reconstruction methods, a comprehensive examination of algorithms appears essential. Consequently, this study offers a thorough overview of the latest advancements in Gaussian Splatting.
https://arxiv.org/abs/2405.03417
Building accurate maps is a key building block to enable reliable localization, planning, and navigation of autonomous vehicles. We propose a novel approach for building accurate maps of dynamic environments utilizing a sequence of LiDAR scans. To this end, we propose encoding the 4D scene into a novel spatio-temporal implicit neural map representation by fitting a time-dependent truncated signed distance function to each point. Using our representation, we extract the static map by filtering the dynamic parts. Our neural representation is based on sparse feature grids, a globally shared decoder, and time-dependent basis functions, which we jointly optimize in an unsupervised fashion. To learn this representation from a sequence of LiDAR scans, we design a simple yet efficient loss function to supervise the map optimization in a piecewise way. We evaluate our approach on various scenes containing moving objects in terms of the reconstruction quality of static maps and the segmentation of dynamic point clouds. The experimental results demonstrate that our method is capable of removing the dynamic part of the input point clouds while reconstructing accurate and complete 3D maps, outperforming several state-of-the-art methods. Codes are available at: this https URL
https://arxiv.org/abs/2405.03388
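As a simplified stand-in for the static-map extraction step (not the paper's learned representation), the NumPy sketch below flags map points whose time-dependent signed distance varies strongly across the scan sequence as dynamic and keeps the rest as the static map.

import numpy as np

rng = np.random.default_rng(0)
sdf_over_time = rng.normal(scale=0.01, size=(40, 10000))        # (timesteps, map points)
sdf_over_time[:, :500] += np.linspace(0.0, 0.5, 40)[:, None]     # simulate 500 moving points

dynamic = sdf_over_time.std(axis=0) > 0.05       # simple temporal-variance threshold
static_map_points = np.flatnonzero(~dynamic)
print(f"{dynamic.sum()} points flagged dynamic, {static_map_points.size} kept static")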
In the field of computer vision, the numerical encoding of 3D surfaces is crucial. It is classical to represent surfaces with their Signed Distance Functions (SDFs) or Unsigned Distance Functions (UDFs). For tasks like representation learning, surface classification, or surface reconstruction, this function can be learned by a neural network, called Neural Distance Function. This network, and in particular its weights, may serve as a parametric and implicit representation for the surface. The network must represent the surface as accurately as possible. In this paper, we propose a method for learning UDFs that improves the fidelity of the obtained Neural UDF to the original 3D surface. The key idea of our method is to concentrate the learning effort of the Neural UDF on surface edges. More precisely, we show that sampling more training points around surface edges allows better local accuracy of the trained Neural UDF, and thus improves the global expressiveness of the Neural UDF in terms of Hausdorff distance. To detect surface edges, we propose a new statistical method based on the calculation of a $p$-value at each point on the surface. Our method is shown to detect surface edges more accurately than a commonly used local geometric descriptor.
https://arxiv.org/abs/2405.03381
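The sampling idea can be sketched directly: draw more UDF training points where the edge-test p-value is small. A minimal NumPy example with placeholder p-values standing in for the paper's statistical edge detector:

import numpy as np

rng = np.random.default_rng(0)
surface_points = rng.uniform(-1.0, 1.0, size=(5000, 3))
p_values = rng.uniform(0.0, 1.0, size=5000)       # small p-value = likely on a surface edge

weights = 1.0 / (p_values + 1e-3)                 # emphasize low p-value (edge) points
weights /= weights.sum()
idx = rng.choice(len(surface_points), size=2000, replace=True, p=weights)
training_points = surface_points[idx]             # edge regions are sampled more densely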
This study accelerates MR cholangiopancreatography (MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and 0.55T. Thirty healthy volunteers underwent conventional two-fold accelerated MRCP scans at field strengths of 3T or 0.55T. We trained a variational network (VN) using retrospectively six-fold undersampled data obtained at 3T. We then evaluated our method against standard techniques such as parallel imaging (PI) and compressed sensing (CS), using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. Furthermore, since acquiring fully sampled MRCP is impractical, we added a self-supervised DL reconstruction (SSDU) to the evaluation group. We also tested our method in a prospectively accelerated scenario to reflect real-world clinical applications and evaluated its adaptability to MRCP at 0.55T. Our method reduced the average acquisition time from 599/542 to 255/180 seconds for MRCP at 3T/0.55T. In both retrospective and prospective undersampling scenarios, the PSNR and SSIM of VN were higher than those of PI, CS, and SSDU. At the same time, VN preserved the image quality of the undersampled data, i.e., sharpness and the visibility of hepatobiliary ducts. In addition, VN produced high-quality reconstructions at 0.55T, resulting in the highest PSNR and SSIM. In summary, a VN trained for highly accelerated MRCP makes it possible to reduce the acquisition time by a factor of 2.4/3.0 at 3T/0.55T while maintaining the image quality of the conventional acquisition.
https://arxiv.org/abs/2405.03732
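For reference, the two evaluation metrics used above can be computed with scikit-image; the images below are placeholders for a reference and a reconstruction.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
reference = rng.random((256, 256)).astype(np.float64)
reconstruction = np.clip(reference + rng.normal(scale=0.05, size=(256, 256)), 0.0, 1.0)

psnr = peak_signal_noise_ratio(reference, reconstruction, data_range=1.0)
ssim = structural_similarity(reference, reconstruction, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")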
Reconstructing human dynamic vision from brain activity is a challenging task with great scientific significance. The difficulty stems from two primary issues: (1) vision-processing mechanisms in the brain are highly intricate and not fully revealed, making it challenging to directly learn a mapping between fMRI and video; (2) the temporal resolution of fMRI is significantly lower than that of natural videos. To overcome these issues, this paper proposes a two-stage model named Mind-Animator, which achieves state-of-the-art performance on three public datasets. Specifically, during the fMRI-to-feature stage, we decouple semantic, structural, and motion features from fMRI through fMRI-vision-language tri-modal contrastive learning and sparse causal attention. In the feature-to-video stage, these features are merged into videos by an inflated Stable Diffusion. We substantiate that the reconstructed video dynamics are indeed derived from fMRI, rather than hallucinations of the generative model, through permutation tests. Additionally, the visualization of voxel-wise and ROI-wise importance maps confirms the neurobiological interpretability of our model.
https://arxiv.org/abs/2405.03280
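A minimal NumPy sketch of a permutation test in the spirit of the check described above: compare the similarity of true fMRI-reconstruction pairings against a null distribution built from shuffled pairings. The score matrix is a random placeholder, not the paper's evaluation.

import numpy as np

rng = np.random.default_rng(0)
n = 50
scores = rng.normal(size=(n, n)) + 2.0 * np.eye(n)   # rows: fMRI trials, cols: reconstructions

observed = np.trace(scores) / n                       # mean score of the true pairing
null = []
for _ in range(10000):
    perm = rng.permutation(n)
    null.append(scores[np.arange(n), perm].mean())    # score under a shuffled pairing
p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
print(f"permutation-test p-value: {p_value:.4f}")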
Efficient Image Super-Resolution (SR) aims to accelerate SR network inference by minimizing computational complexity and network parameters while preserving performance. Existing state-of-the-art efficient image super-resolution methods are based on convolutional neural networks. Few attempts have been made to harness Mamba's long-range modeling capability and efficient computational complexity, which have shown impressive performance on high-level vision tasks. In this paper, we propose DVMSR, a novel lightweight image SR network that incorporates Vision Mamba and a distillation strategy. The network of DVMSR consists of three modules: a feature extraction convolution, multiple stacked Residual State Space Blocks (RSSBs), and a reconstruction module. Specifically, the deep feature extraction module is composed of several residual state space blocks (RSSBs), each of which contains several Vision Mamba Modules (ViMMs) together with a residual connection. To improve efficiency while maintaining comparable performance, we apply a distillation strategy to the Vision Mamba network. Specifically, we leverage the rich representation knowledge of the teacher network as additional supervision for the output of the lightweight student network. Extensive experiments have demonstrated that our proposed DVMSR outperforms state-of-the-art efficient SR methods in terms of model parameters while maintaining comparable PSNR and SSIM. The source code is available at this https URL
https://arxiv.org/abs/2405.03008
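A minimal PyTorch sketch of the distillation objective described above: a reconstruction loss against the high-resolution target plus a feature-matching term that uses the frozen teacher's output as additional supervision. The networks are tiny stand-ins (and omit upsampling), not the RSSB/ViMM architecture.

import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 3, padding=1))

lr_img = torch.rand(4, 3, 32, 32)
hr_img = torch.rand(4, 3, 32, 32)    # same spatial size here for simplicity (no upsampling)

with torch.no_grad():
    teacher_out = teacher(lr_img)    # rich teacher representation, used as extra supervision
student_out = student(lr_img)

lambda_distill = 0.5
loss = nn.functional.l1_loss(student_out, hr_img) \
     + lambda_distill * nn.functional.l1_loss(student_out, teacher_out)
loss.backward()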
The score matching with Langevin dynamics (SMLD) method has been successfully applied to accelerated MRI. However, the hyperparameters in the sampling process require careful tuning; otherwise the results can be severely corrupted by hallucination artifacts, particularly with out-of-distribution test data. In this study, we propose a novel workflow in which SMLD results are regarded as additional priors to guide model-driven network training. First, we adopt a pretrained score network to obtain samples as preliminary guidance images (PGIs), without the need for network retraining, parameter tuning, or in-distribution test data. Although PGIs are corrupted by hallucination artifacts, we believe that they can provide extra information through effective denoising steps to facilitate reconstruction. Therefore, in the second step we design a denoising module (DM) to improve the quality of PGIs. Its features are extracted from the components of Langevin dynamics and from the same score network with fine-tuning; hence, we can directly learn the artifact patterns. Third, we design a model-driven network whose training is guided by the denoised PGIs (DGIs). DGIs are densely connected with intermediate reconstructions in each cascade to enrich the features and are periodically updated to provide more accurate guidance. Our experiments on different sequences reveal that, despite the low average quality of PGIs, the proposed workflow can effectively extract valuable information to guide network training, even with severely reduced training data and sampling steps. Our method outperforms other cutting-edge techniques by effectively mitigating hallucination artifacts, yielding robust and high-quality reconstruction results.
https://arxiv.org/abs/2405.02958
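A heavily simplified PyTorch sketch of using a denoised guidance image as an additional supervision term when training a reconstruction network, in the spirit of the DGI guidance above; the network, data, and weighting are placeholders, and the cascaded, periodically updated structure of the paper is omitted.

import torch
import torch.nn as nn

recon_net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
undersampled = torch.rand(2, 1, 64, 64)
ground_truth = torch.rand(2, 1, 64, 64)
guidance = torch.rand(2, 1, 64, 64)      # stand-in for a denoised guidance image (DGI)

output = recon_net(undersampled)
mu = 0.1                                 # assumed weight on the guidance term
loss = nn.functional.mse_loss(output, ground_truth) + mu * nn.functional.mse_loss(output, guidance)
loss.backward()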
Inverse imaging problems (IIPs) arise in various applications, with the main objective of reconstructing an image from its compressed measurements. This problem is often ill-posed, being under-determined with multiple equally consistent solutions. The best solution inherently depends on prior knowledge or assumptions, such as the sparsity of the image. Furthermore, the reconstruction process for most IIPs relies significantly on the imaging (i.e., forward model) parameters, which might not be fully known, or the measurement device may undergo calibration drifts. These uncertainties in the forward model create substantial challenges: inaccurate reconstructions usually occur when the postulated forward-model parameters do not fully match the actual ones. In this work, we tackle accurate reconstruction when only a set of possible forward-model parameters is available. We propose a novel Moment-Aggregation (MA) framework that is compatible with the popular IIP solutions that use a neural network prior. Specifically, our method reconstructs the signal by considering all candidate parameters of the forward model simultaneously during the update of the neural network. We theoretically demonstrate the convergence of the MA framework, which has a complexity similar to reconstruction under known forward-model parameters. Proof-of-concept experiments demonstrate that the proposed MA achieves reconstruction performance comparable to using the known, precise forward-model parameters across both compressive sensing and phase retrieval applications, with a PSNR gap of 0.17 to 1.94 dB over various datasets, including MNIST, X-ray, GlaS, and MoNuSeg. This highlights our method's significant potential for reconstruction under an uncertain forward model.
https://arxiv.org/abs/2405.02944
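A minimal PyTorch sketch of the aggregation idea: update a neural-network prior while averaging the data term over several candidate forward-model parameters instead of committing to one. The forward model here is a toy linear gain and the mean is an assumed aggregation rule, not necessarily the paper's moment aggregation.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))   # neural network prior
z = torch.randn(1, 64)                      # fixed latent input
y = torch.randn(1, 64)                      # compressed/corrupted measurement

candidate_gains = [0.5, 1.0, 1.5]           # uncertain forward-model parameter (toy linear model y ~ alpha * x)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(100):
    opt.zero_grad()
    x_hat = net(z)
    losses = [((alpha * x_hat - y) ** 2).mean() for alpha in candidate_gains]   # per-candidate data terms
    loss = torch.stack(losses).mean()       # aggregate over all candidate parameters
    loss.backward()
    opt.step()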