Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its reconstruction objective and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next study the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction, but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
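To make the tokenizer idea concrete, below is a minimal, assumption-laden PyTorch sketch of a ViT-style auto-encoder (patchify, Transformer encoder, per-token bottleneck, Transformer decoder, unpatchify). The module sizes, the bottleneck width, and the omission of positional embeddings and perceptual/GAN losses are illustrative choices, not the published ViTok configuration.

import torch
import torch.nn as nn

class ViTStyleAutoencoder(nn.Module):
    # Illustrative ViT-style tokenizer: pixels -> latent tokens -> pixels.
    def __init__(self, patch=16, width=512, depth=4, nhead=8, bottleneck=16):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, width)
        enc = nn.TransformerEncoderLayer(width, nhead, batch_first=True)
        dec = nn.TransformerEncoderLayer(width, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        self.decoder = nn.TransformerEncoder(dec, depth)
        self.to_latent = nn.Linear(width, bottleneck)     # compressed code per token
        self.from_latent = nn.Linear(bottleneck, width)
        self.to_pixels = nn.Linear(width, 3 * patch * patch)

    def patchify(self, x):                                # (B,3,H,W) -> (B,N,3*p*p)
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)             # (B,C,H/p,W/p,p,p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, x):
        B, _, H, W = x.shape
        p = self.patch
        z = self.to_latent(self.encoder(self.embed(self.patchify(x))))
        out = self.to_pixels(self.decoder(self.from_latent(z)))
        out = out.reshape(B, H // p, W // p, 3, p, p)
        recon = out.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, H, W)
        return recon, z                                   # reconstruction and latent tokens

In this picture, the scaling experiments described above amount to varying the encoder/decoder depth and width and the per-token bottleneck size independently.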
https://arxiv.org/abs/2501.09755
Neural Radiance Fields (NeRF) often struggle with reconstructing and rendering highly reflective scenes. Recent advancements have developed various reflection-aware appearance models to enhance NeRF's capability to render specular reflections. However, the robust reconstruction of highly reflective scenes is still hindered by the inherent shape ambiguity on specular surfaces. Existing methods typically rely on additional geometry priors to regularize the shape prediction, but this can lead to oversmoothed geometry in complex scenes. Observing the critical role of surface normals in parameterizing reflections, we introduce a transmittance-gradient-based normal estimation technique that remains robust even under ambiguous shape conditions. Furthermore, we propose a dual activated densities module that effectively bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, our proposed method achieves robust reconstruction and high-fidelity rendering of scenes featuring both highly specular reflections and intricate geometric structures. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on various datasets.
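Since the abstract hinges on estimating surface normals from a gradient, here is a small, hedged PyTorch autograd sketch of the generic pattern: the normal is taken as the negative normalised gradient of a scalar field with respect to position. The paper derives this gradient from the ray transmittance rather than the raw density; field_fn below is a hypothetical placeholder for whichever differentiable scalar field is queried.

import torch

def gradient_normals(field_fn, points):
    # points: (N, 3) sample positions; field_fn maps (N, 3) -> (N,) scalar values.
    points = points.clone().requires_grad_(True)
    values = field_fn(points)
    grad, = torch.autograd.grad(values.sum(), points, create_graph=True)
    return -grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)   # unit normals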
https://arxiv.org/abs/2501.09460
The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce \textbf{CaPa}, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.
https://arxiv.org/abs/2501.09433
Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) the cross-domain gap, i.e., significant variations between source- and target-domain pose distributions; and 2) the structural fidelity gap, i.e., predicted skeletal poses exhibit distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation tasks.
https://arxiv.org/abs/2501.09411
Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby the available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_\mathrm{PISCO}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors (R$\geq$54), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO's potential as a versatile self-supervised k-space loss function for further applications and architectures. Code is available at: this https URL
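As a rough illustration of "a consistent global k-space neighborhood relationship": in GRAPPA-like parallel imaging, every k-space sample should be predictable from its neighbours with one global linear kernel. The hedged PyTorch sketch below penalises disagreement between kernels estimated on two disjoint coordinate subsets; the neighbourhood pattern, the ridge term, and the NIK interface nik(coords) -> complex values are assumptions for illustration, not the exact $\mathcal{L}_\mathrm{PISCO}$ formulation.

import torch

def neighborhood_consistency_loss(nik, centers, offsets, ridge=1e-3):
    # centers: (N, d) sampled k-space coordinates; offsets: (K, d) neighbour pattern.
    neigh = centers[:, None, :] + offsets[None, :, :]                 # (N, K, d)
    A = nik(neigh.reshape(-1, centers.shape[-1])).reshape(len(centers), -1)  # neighbour values
    b = nik(centers)                                                  # centre values
    half = len(centers) // 2

    def solve(Ah, bh):                                                # ridge-regularised LS kernel
        gram = Ah.conj().T @ Ah + ridge * torch.eye(Ah.shape[1], dtype=Ah.dtype, device=Ah.device)
        return torch.linalg.solve(gram, Ah.conj().T @ bh)

    w1, w2 = solve(A[:half], b[:half]), solve(A[half:], b[half:])
    return (w1 - w2).abs().pow(2).sum()                               # kernels should agree globally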
https://arxiv.org/abs/2501.09403
Deep learning-based joint source-channel coding (JSCC) is emerging as a promising technology for effective image transmission. However, most existing approaches focus on transmitting clear images, overlooking real-world challenges such as motion blur caused by camera shaking or fast-moving objects. Motion blur often degrades image quality, making transmission and reconstruction more challenging. Event cameras, which asynchronously record pixel intensity changes with extremely low latency, have shown great potential for motion deblurring tasks. However, the efficient transmission of the abundant data generated by event cameras remains a significant challenge. In this work, we propose a novel JSCC framework for the joint transmission of blurry images and events, aimed at achieving high-quality reconstructions under limited channel bandwidth. This approach is designed as a deblurring task-oriented JSCC system. Since RGB cameras and event cameras capture the same scene through different modalities, their outputs contain both shared and domain-specific information. To avoid repeatedly transmitting the shared information, we extract and transmit their shared information and domain-specific information, respectively. At the receiver, the received signals are processed by a deblurring decoder to generate clear images. Additionally, we introduce a multi-stage training strategy to train the proposed model. Simulation results demonstrate that our method significantly outperforms existing JSCC-based image transmission schemes, addressing motion blur effectively.
https://arxiv.org/abs/2501.09396
As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike plain 2D images, which are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs therefore requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account the geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous deep-learning-based methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion-modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that GDGT-OSR outperforms existing methods in the literature.
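For intuition about "the variability of distortion across latitudes": under equirectangular projection, a row at latitude phi is horizontally stretched by roughly 1/cos(phi). The hedged NumPy sketch below builds such a per-row distortion map; the actual guidance generator in GDGT-OSR is learned, so this only captures the geometric prior it presumably exploits.

import numpy as np

def erp_distortion_map(height, width, eps=1e-3):
    # Row centres mapped to latitudes in (-pi/2, pi/2); rows near the poles are the most stretched.
    phi = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    stretch = 1.0 / np.maximum(np.cos(phi), eps)           # horizontal stretch factor per row
    return np.tile(stretch[:, None], (1, width))           # (H, W) distortion guidance map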
https://arxiv.org/abs/2406.10869
Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.
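Score distillation sampling (SDS), which the abstract says UVRM combines with analysis-by-synthesis, can be summarised by a short, hedged PyTorch sketch. The denoiser(x_t, t, cond) call is a stand-in for a frozen pretrained diffusion model's noise predictor, and the timestep range and weighting are common conventions rather than values from the paper.

import torch
import torch.nn.functional as F

def sds_loss(rendered, denoiser, alphas_cumprod, cond):
    # rendered: (B, 3, H, W) differentiable renders of the current 3D representation.
    b = rendered.shape[0]
    t = torch.randint(20, 980, (b,), device=rendered.device)
    eps = torch.randn_like(rendered)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * rendered + (1 - a).sqrt() * eps        # forward-diffused render
    with torch.no_grad():
        eps_pred = denoiser(x_t, t, cond)                   # frozen prior's noise prediction
    grad = (1 - a) * (eps_pred - eps)                       # SDS gradient w.r.t. the render
    target = (rendered - grad).detach()
    return 0.5 * F.mse_loss(rendered, target, reduction="sum")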
https://arxiv.org/abs/2501.09347
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
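The "distance between student and teacher features in kernel space" has a compact closed form for an RBF kernel: since k(x, x) = 1, the squared distance after the implicit feature map is 2 - 2k(s, t). A hedged PyTorch sketch, with bandwidth and reduction chosen for illustration rather than taken from the paper:

import torch

def rbf_kernel_space_distance(f_student, f_teacher, sigma=1.0):
    # Per-sample squared Euclidean distance between flattened feature maps.
    sq = (f_student - f_teacher).pow(2).flatten(1).sum(dim=1)
    k = torch.exp(-sq / (2 * sigma ** 2))                   # RBF similarity in [0, 1]
    return (2.0 - 2.0 * k).mean()                           # ||phi(s) - phi(t)||^2, averaged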
https://arxiv.org/abs/2501.09321
Purpose: To propose a domain-conditioned and temporal-guided diffusion modeling method, termed dynamic Diffusion Modeling (dDiMo), for accelerated dynamic MRI reconstruction, enabling the diffusion process to characterize spatiotemporal information for time-resolved multi-coil Cartesian and non-Cartesian data. Methods: The dDiMo framework integrates temporal information from time-resolved dimensions, allowing for the concurrent capture of intra-frame spatial features and inter-frame temporal dynamics in diffusion modeling. It employs additional spatiotemporal ($x$-$t$) and self-consistent frequency-temporal ($k$-$t$) priors to guide the diffusion process. This approach ensures precise temporal alignment and enhances the recovery of fine image details. To facilitate a smooth diffusion process, the nonlinear conjugate gradient algorithm is utilized during the reverse diffusion steps. The proposed model was tested on two types of MRI data: Cartesian-acquired multi-coil cardiac MRI and Golden-Angle-Radial-acquired multi-coil free-breathing lung MRI, across various undersampling rates. Results: dDiMo achieved high-quality reconstructions at various acceleration factors, demonstrating improved temporal alignment and structural recovery compared to other competitive reconstruction methods, both qualitatively and quantitatively. The proposed diffusion framework exhibited robust performance in handling both Cartesian and non-Cartesian acquisitions, effectively reconstructing dynamic datasets in cardiac and lung MRI under different imaging conditions. Conclusion: This study introduces a novel diffusion modeling method for dynamic MRI reconstruction.
https://arxiv.org/abs/2501.09305
Vision-based tactile sensors have drawn increasing interest in the robotics community. However, traditional lens-based designs impose minimum thickness constraints on these sensors, limiting their applicability in space-restricted settings. In this paper, we propose ThinTact, a novel lensless vision-based tactile sensor with a sensing field of over 200 mm² and a thickness of less than 10 mm. It utilizes the mask-based lensless imaging technique to map the contact information to CMOS signals. To ensure real-time tactile sensing, we propose a real-time lensless reconstruction algorithm that leverages a frequency-spatial-domain joint filter based on the discrete cosine transform (DCT). This algorithm achieves computation significantly faster than existing optimization-based methods. Additionally, to improve the sensing quality, we develop a mask optimization method based on the generic algorithm and the corresponding system matrix calibration method. We evaluate the performance of our proposed lensless reconstruction and tactile sensing through qualitative and quantitative experiments. Furthermore, we demonstrate ThinTact's practical applicability in diverse applications, including texture recognition and contact-rich object manipulation. The paper will appear in the IEEE Transactions on Robotics: this https URL. Video: this https URL
https://arxiv.org/abs/2501.09273
Model compression through knowledge distillation has seen extensive application in classification and segmentation tasks. However, its potential in image-to-image translation, particularly in image restoration, remains underexplored. To address this gap, we propose a Simultaneous Learning Knowledge Distillation (SLKD) framework tailored for model compression in image restoration tasks. SLKD employs a dual-teacher, single-student architecture with two distinct learning strategies applied simultaneously: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL). In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, with the assistance of a proposed PIQE extractor. These strategies enable the student to learn from degraded and clean images simultaneously, ensuring high-quality compression of image restoration models. Experimental results across five datasets and three tasks demonstrate that SLKD achieves substantial reductions in FLOPs and parameters, exceeding 80\%, while maintaining strong image restoration performance.
https://arxiv.org/abs/2501.09268
White Light Interferometry (WLI) is a precise optical tool for measuring the 3D topography of microstructures. However, conventional WLI cannot capture the natural color of a sample's surface, which is essential for many microscale research applications that require both 3D geometry and color information. Previous methods have attempted to overcome this limitation by modifying WLI hardware and analysis software, but these solutions are often costly. In this work, we address this challenge from a computer vision multi-modal reconstruction perspective for the first time. We introduce OpticFusion, a novel approach that uses an additional digital optical microscope (OM) to achieve 3D reconstruction with natural color textures using multi-view WLI and OM images. Our method employs a two-step data association process to obtain the poses of WLI and OM data. By leveraging the neural implicit representation, we fuse multi-modal data and apply color decomposition technology to extract the sample's natural color. Tested on our multi-modal dataset of various microscale samples, OpticFusion achieves detailed 3D reconstructions with color textures. Our method provides an effective tool for practical applications across numerous microscale research fields. The source code and our real-world dataset are available at this https URL.
https://arxiv.org/abs/2501.09259
Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement by integrating computer vision technologies and multi-modal Simultaneous Localization and Mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements utilizing the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization to unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, crack geometric attributes were measured automatically and directly within the 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.
https://arxiv.org/abs/2501.09203
Targeting the notorious cumulative drift errors in NeRF SLAM, we propose a Semantic-guided Loop Closure with Shared Latent Code, dubbed SLC$^2$-SLAM. In particular, we argue that the latent codes stored in many NeRF SLAM systems are not fully exploited, as they are only used for better reconstruction. In this paper, we propose a simple yet effective way to detect potential loops using the same latent codes as local features. To further improve loop detection performance, we use semantic information, which is also decoded from the same latent codes, to guide the aggregation of local features. Finally, with the potential loops detected, we close them with a graph optimization followed by bundle adjustment to refine both the estimated poses and the reconstructed scene. To evaluate the performance of our SLC$^2$-SLAM, we conduct extensive experiments on the Replica and ScanNet datasets. Our proposed semantic-guided loop closure significantly outperforms the pre-trained NetVLAD and ORB combined with Bag-of-Words, which are used in all other NeRF SLAM systems with loop closure. As a result, our SLC$^2$-SLAM also demonstrates better tracking and reconstruction performance, especially in larger scenes with more loops, such as ScanNet.
https://arxiv.org/abs/2501.08880
Dynamic MRI reconstruction, an inverse problem, has seen a surge of progress through the use of deep learning techniques. In particular, the practical difficulty of obtaining ground-truth data has led to the emergence of unsupervised learning approaches. A recent promising method among them is implicit neural representation (INR), which defines the data as a continuous function that maps coordinate values to the corresponding signal values. This allows missing information to be filled in from incomplete measurements alone and solves the inverse problem effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization times and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.
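A minimal picture of the INR idea the abstract builds on: a small network maps a spatio-temporal coordinate directly to the signal value, so the undersampled measurements are the only supervision. The hedged PyTorch sketch below uses random Fourier features and an MLP; DA-INR's actual architecture and its explicit temporal-redundancy modelling are not reproduced here.

import torch
import torch.nn as nn

class DynamicMRIINR(nn.Module):
    # Maps (x, y, t) in [-1, 1]^3 to a two-channel (real, imaginary) image value.
    def __init__(self, n_freq=64, width=256, scale=10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(3, n_freq) * scale)   # random Fourier features
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freq, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 2),
        )

    def forward(self, coords):                                       # coords: (N, 3)
        proj = coords @ self.B
        return self.mlp(torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1))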
https://arxiv.org/abs/2501.09049
Reconstructing speech envelopes from EEG signals is essential for exploring neural mechanisms underlying speech perception. Yet, EEG variability across subjects and physiological artifacts complicate accurate reconstruction. To address this problem, we introduce Subject Disentangling Neural Network (SDN-Net), which disentangles subject identity information from reconstructed speech envelopes to enhance cross-subject reconstruction accuracy. SDN-Net integrates three key components: MLA-Codec, MPN-MI, and CTA-MTDNN. The MLA-Codec, a fully convolutional neural network, decodes EEG signals into speech envelopes. The CTA-MTDNN module, a multi-scale time-delay neural network with channel and temporal attention, extracts subject identity features from EEG signals. Lastly, the MPN-MI module, a mutual information estimator with a multi-layer perceptron, supervises the removal of subject identity information from the reconstructed speech envelope. Experiments on the Auditory EEG Decoding Dataset demonstrate that SDN-Net achieves superior performance in inner- and cross-subject speech envelope reconstruction compared to recent state-of-the-art methods.
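The MPN-MI module is described only as an MLP-based mutual-information estimator, so the hedged PyTorch sketch below uses the generic MINE-style Donsker-Varadhan lower bound as a stand-in; the architecture and how the bound is wired into the removal of subject identity are assumptions.

import math
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    # Statistics network T(a, b); I(a; b) >= E_joint[T] - log E_marginal[exp(T)].
    def __init__(self, dim_a, dim_b, width=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_a + dim_b, width), nn.ReLU(),
                                 nn.Linear(width, 1))

    def forward(self, a, b):
        joint = self.net(torch.cat([a, b], dim=-1)).mean()
        b_shuf = b[torch.randperm(b.shape[0])]                  # break the pairing -> marginals
        marg = torch.logsumexp(self.net(torch.cat([a, b_shuf], dim=-1)), dim=0) - math.log(b.shape[0])
        return joint - marg.squeeze()                           # lower bound on I(a; b)

Minimising such a bound between the reconstructed envelope and subject-identity features, while the estimator itself is trained to maximise it, is one standard way to supervise the removal of identity information.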
https://arxiv.org/abs/2501.08693
Coronary artery disease, caused by the narrowing of coronary vessels due to atherosclerosis, is the leading cause of death worldwide. The diagnostic gold standard, fractional flow reserve (FFR), measures the trans-stenotic pressure ratio during maximal vasodilation but is invasive and costly. This has driven the development of virtual FFR (vFFR) using computational fluid dynamics (CFD) to simulate coronary flow. Geometric deep learning algorithms have shown promise for learning features on meshes, including in cardiovascular research applications. This study empirically analyzes various backends for predicting vFFR fields in coronary arteries as CFD surrogates, comparing six backends for learning hemodynamics on meshes with CFD solutions as ground truth. The study has two parts: i) Using 1,500 synthetic left coronary artery bifurcations, models were trained to predict pressure-related fields for vFFR reconstruction, comparing different learning variables. ii) Using 427 patient-specific CFD simulations, experiments were repeated focusing on the best-performing learning variable from the synthetic dataset. Most backends performed well on the synthetic dataset, especially when predicting the pressure drop over the manifold. Transformer-based backends outperformed the others when predicting pressure and vFFR fields and were the only models achieving strong performance on patient-specific data, excelling in both average per-point error and vFFR accuracy in stenotic lesions. These results suggest that geometric deep learning backends can effectively replace CFD for simple geometries, while transformer-based networks are superior for complex, heterogeneous datasets. Pressure drop was identified as the optimal network output for learning pressure-related fields.
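The link between the preferred network output (pressure drop) and vFFR is just the FFR definition, FFR = P_d / P_a: if the model predicts the drop dP from the aortic inlet to each mesh point, the vFFR field follows pointwise. A hedged NumPy sketch under the assumption of consistent pressure units:

import numpy as np

def vffr_from_pressure_drop(p_aortic, dp_field):
    # p_aortic: scalar aortic pressure; dp_field: (N,) predicted pressure drop at each mesh point.
    return (p_aortic - np.asarray(dp_field)) / p_aortic     # vFFR = P_distal / P_aortic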
https://arxiv.org/abs/2501.09046
Diffusion models have recently shown remarkable results in magnetic resonance imaging reconstruction. However, the networks employed are typically black-box estimators of the (smoothed) prior score with tens of millions of parameters, which restricts interpretability and increases reconstruction time. Furthermore, parallel imaging reconstruction algorithms either rely on off-line coil sensitivity estimation, which is prone to misalignment and restricts the sampling trajectories, or perform per-coil reconstruction, making the computational cost proportional to the number of coils. To overcome this, we jointly reconstruct the image and the coil sensitivities using the lightweight, parameter-efficient, and interpretable product of Gaussian mixture diffusion model as an image prior and classical smoothness priors on the coil sensitivities. The proposed method delivers promising results while allowing for fast inference and demonstrating robustness to out-of-distribution contrast and sampling trajectories, comparable to classical variational penalties such as total variation. Finally, the probabilistic formulation allows the calculation of the posterior expectation and pixel-wise variance.
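The closing point about the probabilistic formulation can be made concrete with a tiny, hedged sketch: draw several posterior samples from the reconstruction procedure and report their pixel-wise mean and variance. sample_posterior() is a hypothetical placeholder for one run of the reverse diffusion conditioned on the measured k-space.

import torch

def posterior_statistics(sample_posterior, n_samples=16):
    samples = torch.stack([sample_posterior() for _ in range(n_samples)])  # (S, H, W)
    return samples.mean(dim=0), samples.var(dim=0)          # posterior expectation, pixel-wise variance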
https://arxiv.org/abs/2501.08662
This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT's ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.
https://arxiv.org/abs/2501.08604