The NIR-to-RGB spectral domain translation is a formidable task due to the inherent spectral mapping ambiguities between NIR inputs and RGB outputs. As a result, existing methods fail to reconcile the tension between maintaining texture detail fidelity and achieving diverse color variations. In this paper, we propose a Multi-scale HSV Color Feature Embedding Network (MCFNet) that decomposes the mapping process into three sub-tasks: NIR texture maintenance, coarse geometry reconstruction, and RGB color prediction. Accordingly, we propose a key module for each sub-task: the Texture Preserving Block (TPB), the HSV Color Feature Embedding Module (HSV-CFEM), and the Geometry Reconstruction Module (GRM). These modules allow MCFNet to tackle spectral translation methodically through a series of escalating resolutions, progressively enriching images with color and texture fidelity in a scale-coherent fashion. The proposed MCFNet demonstrates substantial performance gains on the NIR image colorization task. Code is released at: this https URL.
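One reason an HSV embedding helps colorization is that HSV separates chromatic attributes (hue, saturation) from intensity (value), which is the component most directly constrained by the NIR input. The minimal sketch below, using only the standard-library `colorsys` module, illustrates that split; the function names and the idea of predicting only the chromatic part are illustrative assumptions, not the paper's actual network design.

```python
import colorsys

def split_color_and_texture(r, g, b):
    """Return ((h, s), v): the chromatic part a model would predict,
    and the intensity part tied to the NIR-like texture."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (h, s), v

def recombine(hs, v):
    """Rebuild RGB from predicted chroma (h, s) and preserved intensity v."""
    h, s = hs
    return colorsys.hsv_to_rgb(h, s, v)

# Round trip: decompose a color, then reassemble it losslessly.
chroma, intensity = split_color_and_texture(0.8, 0.4, 0.2)
rgb = recombine(chroma, intensity)
```

Because V carries the luminance structure, a colorization network can anchor V to the NIR texture and focus its capacity on the two chromatic channels.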
https://arxiv.org/abs/2404.16685
While neural implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, thereby limiting their applications in physics-demanding domains like embodied AI and robotics. The lack of plausibility originates from both the absence of physics modeling in the existing pipeline and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, which stands as the first approach to harness both differentiable rendering and differentiable physics simulation to learn implicit surface representations. Our framework proposes a novel differentiable particle-based physical simulator seamlessly integrated with the neural implicit representation. At its core is an efficient transformation between SDF-based implicit representation and explicit surface points by our proposed algorithm, Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Moreover, we model both rendering and physical uncertainty to identify and compensate for the inconsistent and inaccurate monocular geometric priors. The physical uncertainty additionally enables a physics-guided pixel sampling to enhance the learning of slender structures. By amalgamating these techniques, our model facilitates efficient joint modeling with appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods in terms of reconstruction quality. Our reconstruction results also yield superior physical stability, verified by Isaac Gym, with at least a 40% improvement across all datasets, opening broader avenues for future physics-based applications.
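The abstract's SP-MC algorithm converts an SDF-based implicit representation into explicit surface points. A heavily simplified one-dimensional stand-in for that idea is shown below: surface points are recovered by linear interpolation wherever the sampled signed distance changes sign along a grid edge. The real SP-MC operates on 3D cubes and is differentiable; this toy is only for intuition.

```python
def zero_crossings(xs, sdf):
    """Return surface points where the sampled signed distance changes sign,
    located by linear interpolation along each grid edge."""
    pts = []
    for (x0, d0), (x1, d1) in zip(zip(xs, sdf), zip(xs[1:], sdf[1:])):
        if d0 == 0.0:
            pts.append(x0)                 # sample sits exactly on the surface
        elif d0 * d1 < 0.0:
            t = d0 / (d0 - d1)             # solve d0 + t * (d1 - d0) = 0
            pts.append(x0 + t * (x1 - x0))
    return pts

xs = [i * 0.5 for i in range(9)]           # grid samples at 0.0 .. 4.0
sdf = [abs(x - 2.0) - 1.0 for x in xs]     # "sphere" of radius 1 centered at 2
surface = zero_crossings(xs, sdf)          # surface points near x = 1 and x = 3
```

The interpolation weight `t` is a differentiable function of the SDF values, which is what allows gradients from a physical or rendering loss on the explicit points to flow back into the implicit field.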
https://arxiv.org/abs/2404.16666
Efficient visual perception using mobile systems is crucial, particularly in unknown environments such as search and rescue operations, where swift and comprehensive perception of objects of interest is essential. In such real-world applications, objects of interest are often situated in complex environments, making the selection of the 'Next Best' view based solely on maximizing visibility gain suboptimal. Semantics, providing a higher-level interpretation of perception, should significantly contribute to the selection of the next viewpoint for various perception tasks. In this study, we formulate a novel information gain that integrates both visibility gain and semantic gain in a unified form to select the semantic-aware Next-Best-View. Additionally, we design an adaptive strategy with a termination criterion to support a two-stage search-and-acquisition manoeuvre on multiple objects of interest aided by a multi-degree-of-freedom (Multi-DoF) mobile system. Several semantically relevant reconstruction metrics, including perspective directivity and region of interest (ROI)-to-full reconstruction volume ratio, are introduced to evaluate the performance of the proposed approach. Simulation experiments demonstrate the advantages of the proposed approach over existing methods, achieving improvements of up to 27.13% for the ROI-to-full reconstruction volume ratio and a 0.88234 average perspective directivity. Furthermore, the planned motion trajectory exhibits better perception coverage of the target.
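At the level of the abstract, the unified information gain mixes a visibility term and a semantic term, and the next view maximizes the mixture. A minimal sketch of that selection rule follows; the linear weighting `lam` and the candidate scores are illustrative assumptions, not the paper's actual gain formulation.

```python
def combined_gain(visibility_gain, semantic_gain, lam=0.5):
    """Unified score for one candidate viewpoint (lam trades off the two terms)."""
    return (1.0 - lam) * visibility_gain + lam * semantic_gain

def next_best_view(candidates, lam=0.5):
    """Pick the (name, vis_gain, sem_gain) candidate with maximal unified gain."""
    return max(candidates, key=lambda c: combined_gain(c[1], c[2], lam))[0]

# Toy candidates: semantics can override raw visibility.
views = [("front", 0.9, 0.1), ("side", 0.6, 0.8), ("top", 0.4, 0.5)]
best = next_best_view(views)
```

With `lam = 0.5` the "side" view wins despite "front" having the highest raw visibility gain, which is exactly the behaviour a semantic-aware NBV criterion is meant to produce.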
https://arxiv.org/abs/2404.16507
Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. In order to address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud tokens so that token proximity can be efficiently computed and utilized from their indices during target and context selection. The sequencer also allows the token-proximity computation to be shared between context and target selection, further improving efficiency. Experimentally, our method achieves results competitive with state-of-the-art methods while avoiding reconstruction in the input space and additional modalities.
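The abstract describes the sequencer only at a high level: it orders tokens so that spatial proximity can be read off index distance. One simple way to get such an ordering is a greedy nearest-neighbour chain over token centres, sketched below; the greedy rule is an illustrative assumption, not necessarily the paper's algorithm.

```python
import math

def sequence_tokens(centers):
    """Order token centres so that spatially close tokens receive close indices."""
    remaining = list(range(len(centers)))
    order = [remaining.pop(0)]                     # start from the first token
    while remaining:
        last = centers[order[-1]]
        nxt = min(remaining, key=lambda i: math.dist(last, centers[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

# Two clusters of token centres; after sequencing, each cluster is contiguous,
# so "tokens within k indices" approximates "tokens within some radius".
centers = [(0, 0), (5, 5), (0, 1), (5, 6), (0, 2)]
order = sequence_tokens(centers)
```

Once tokens are sequenced this way, both context and target selection can use cheap index ranges instead of repeated pairwise distance computations, which is the shared-computation benefit the abstract mentions.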
https://arxiv.org/abs/2404.16432
While originally developed for novel view synthesis, Neural Radiance Fields (NeRFs) have recently emerged as an alternative to multi-view stereo (MVS). Spurred by a wealth of research activity, promising results have been obtained especially for texture-less, transparent, and reflective surfaces, while such scenarios remain challenging for traditional MVS-based approaches. However, most of these investigations focus on close-range scenarios, with studies for airborne scenarios still missing. For this task, NeRFs face potential difficulties in areas of low image redundancy and weak data evidence, as often found in street canyons, facades, or building shadows. Furthermore, training such networks is computationally expensive. Thus, the aim of our work is twofold: First, we investigate the applicability of NeRFs for aerial image blocks representing different characteristics like nadir-only, oblique, and high-resolution imagery. Second, during these investigations we demonstrate the benefit of integrating depth priors from tie-point measures, which are provided by the presupposed Bundle Block Adjustment. Our work is based on the state-of-the-art framework VolSDF, which models 3D scenes by signed distance functions (SDFs), since this is more applicable for surface reconstruction compared to the standard volumetric representation in vanilla NeRFs. For evaluation, the NeRF-based reconstructions are compared to results of a publicly available benchmark dataset for airborne images.
https://arxiv.org/abs/2404.16429
In this paper, we study the problem of 3D reconstruction from a single-view RGB image and propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis. Our method utilizes an encoder-decoder framework in which the decoder generates 3D Gaussians guided by depth-aware image features from the encoder. In particular, we introduce the use of a deformable transformer, allowing efficient and effective decoding through 3D reference points and multi-layer refinement adaptations. By harnessing the benefits of 3D Gaussians, our approach offers an efficient and accurate solution for 3D reconstruction from single-view images. We evaluate our method on the ShapeNet SRN dataset, achieving PSNRs of 24.21 and 24.98 on the car and chair subsets, respectively. These results outperform recent methods by around 2.25%, demonstrating the effectiveness of our method in achieving superior results.
https://arxiv.org/abs/2404.16323
Procedural noise is a fundamental component of computer graphics pipelines, offering a flexible way to generate textures that exhibit "natural" random variation. Many different types of noise exist, each produced by a separate algorithm. In this paper, we present a single generative model which can learn to generate multiple types of noise as well as blend between them. In addition, it is capable of producing spatially-varying noise blends despite not having access to such data for training. These features are enabled by training a denoising diffusion model using a novel combination of data augmentation and network conditioning techniques. Like procedural noise generators, the model's behavior is controllable via interpretable parameters and a source of randomness. We use our model to produce a variety of visually compelling noise textures. We also present an application of our model to improving inverse procedural material design; using our model in place of fixed-type noise nodes in a procedural material graph results in higher-fidelity material reconstructions without needing to know the type of noise in advance.
https://arxiv.org/abs/2404.16292
Complex single-objective bounded problems are often difficult to solve. Among evolutionary computation methods, the differential evolution (DE) algorithm, proposed in 1997, has been widely studied and developed due to its simplicity and efficiency. These developments include various adaptive strategies, operator improvements, and the introduction of other search methods. After 2014, improvements based on LSHADE have also been widely studied. However, although recently proposed improvement strategies have shown superiority over their predecessors, combining all new strategies does not necessarily yield the strongest performance. Therefore, we recombine effective advances from recent advanced differential evolution variants and determine an effective combination scheme to further promote the performance of differential evolution. In this paper, we propose a strategy recombination and reconstruction differential evolution algorithm called reconstructed differential evolution (RDE) to solve single-objective bounded optimization problems. Based on the benchmark suite of the 2024 IEEE Congress on Evolutionary Computation (CEC2024), we tested RDE and several other advanced differential evolution variants. The experimental results show that RDE has superior performance in solving complex optimization problems.
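For readers unfamiliar with the baseline that LSHADE-style variants (and RDE) build on, here is a compact sketch of classic DE/rand/1/bin on a bounded problem. The control parameters `F` and `CR` are common textbook defaults, not the paper's tuned settings.

```python
import random

def de_rand_1_bin(f, bounds, pop_size=20, F=0.5, CR=0.9, gens=200, seed=0):
    """Minimize f over a box; classic DE/rand/1 mutation + binomial crossover."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jrand = rng.randrange(dim)          # guarantee one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == jrand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)     # clamp mutant to the bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            ft = f(trial)
            if ft <= fit[i]:                    # greedy one-to-one selection
                pop[i], fit[i] = trial, ft
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

sphere = lambda x: sum(v * v for v in x)
x_best, f_best = de_rand_1_bin(sphere, [(-5.0, 5.0)] * 3)
```

Adaptive variants such as LSHADE replace the fixed `F` and `CR` with success-history-based sampling and shrink `pop_size` over time; RDE, per the abstract, recombines several such advances into one scheme.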
https://arxiv.org/abs/2404.16280
We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs, thus enabling the training and rendering of NeRFs with an arbitrarily large capacity. We begin by revisiting existing multi-GPU approaches, which decompose large scenes into multiple independently trained NeRFs, and identify several fundamental issues with these methods that hinder improvements in reconstruction quality as additional computational resources (GPUs) are used in training. NeRF-XL remedies these issues and enables the training and rendering of NeRFs with an arbitrary number of parameters by simply using more hardware. At the core of our method lies a novel distributed training and rendering formulation, which is mathematically equivalent to the classic single-GPU case and minimizes communication between GPUs. By unlocking NeRFs with arbitrarily large parameter counts, our approach is the first to reveal multi-GPU scaling laws for NeRFs, showing improvements in reconstruction quality with larger parameter counts and speed improvements with more GPUs. We demonstrate the effectiveness of NeRF-XL on a wide variety of datasets, including the largest open-source dataset to date, MatrixCity, containing 258K images covering a 25km^2 city area.
https://arxiv.org/abs/2404.16221
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
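The key mechanism the abstract describes, separating which features fire from how strongly they fire, with the L1 penalty applied only to the gate path, can be sketched in a few lines. The tiny dense algebra below is illustrative (real SAEs tie weights and use learned scales); only the gate/magnitude split and the placement of the sparsity penalty follow the description above.

```python
def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def gated_sae_encode(x, W_gate, b_gate, W_mag, b_mag):
    """Gated SAE encoder sketch: binary gate decides WHICH features are active,
    a separate magnitude path decides HOW MUCH; L1 is taken on the gate path
    only, so magnitudes are not shrunk."""
    pi_gate = [p + b for p, b in zip(matvec(W_gate, x), b_gate)]
    gate = [1.0 if p > 0 else 0.0 for p in pi_gate]
    mag = [max(0.0, m + b) for m, b in zip(matvec(W_mag, x), b_mag)]
    feats = [g * m for g, m in zip(gate, mag)]
    l1 = sum(max(0.0, p) for p in pi_gate)      # sparsity penalty: gate path only
    return feats, l1

x = [1.0, -2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gate = [-0.5, -0.5, -0.5]
W_mag = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
b_mag = [0.0, 0.0, 0.0]
feats, sparsity = gated_sae_encode(x, W_gate, b_gate, W_mag, b_mag)
```

Because the binary gate multiplies an unpenalized magnitude, an active feature can take its full value, which is how this design avoids the systematic shrinkage that an L1 penalty on the feature activations themselves would cause.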
https://arxiv.org/abs/2404.16014
Recent advancements in 3D reconstruction technologies have paved the way for high-quality and real-time rendering of complex 3D scenes. Despite these achievements, a notable challenge persists: it is difficult to precisely reconstruct specific objects from large scenes. Current scene reconstruction techniques frequently result in the loss of object detail textures and are unable to reconstruct object portions that are occluded or unseen in views. To address this challenge, we delve into the meticulous 3D reconstruction of specific objects within large scenes and propose a framework termed OMEGAS: Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation. OMEGAS employs a multi-step approach, grounded in several excellent off-the-shelf methodologies. Specifically, we first utilize the Segment Anything Model (SAM) to guide the segmentation of 3D Gaussian Splatting (3DGS), thereby creating a basic 3DGS model of the target object. Then, we leverage large-scale diffusion priors to further refine the details of the 3DGS model, especially aimed at addressing invisible or occluded object portions in the original scene views. Subsequently, by re-rendering the 3DGS model onto the scene views, we achieve accurate object segmentation and effectively remove the background. Finally, these target-only images are used to improve the 3DGS model further and to extract the definitive 3D object mesh with the SuGaR model. In various scenarios, our experiments demonstrate that OMEGAS significantly surpasses existing scene reconstruction methods. Our project page is at: this https URL
https://arxiv.org/abs/2404.15891
Patellofemoral joint (PFJ) issues affect one in four people, with 20% experiencing chronic knee pain despite treatment. Poor outcomes and pain after knee replacement surgery are often linked to patellar mal-tracking. Traditional imaging methods like CT and MRI face challenges, including cost and metal artefacts, and there's currently no ideal way to observe joint motion without issues such as soft tissue artefacts or radiation exposure. A new system to monitor joint motion could significantly improve understanding of PFJ dynamics, aiding in better patient care and outcomes. Combining 2D ultrasound with motion tracking for 3D reconstruction of the joint using semantic segmentation and position registration can be a solution. However, the need for expensive external infrastructure to estimate the trajectories of the scanner remains the main limitation to implementing 3D bone reconstruction from handheld ultrasound scanning clinically. We proposed the Visual-Inertial Odometry (VIO) and the deep learning-based inertial-only odometry methods as alternatives to motion capture for tracking a handheld ultrasound scanner. The 3D reconstruction generated by these methods has demonstrated potential for assessing the PFJ and for further measurements from free-hand ultrasound scans. The results show that the VIO method performs as well as the motion capture method, with average reconstruction errors of 1.25 mm and 1.21 mm, respectively. The VIO method is the first infrastructure-free method for 3D reconstruction of bone from wireless handheld ultrasound scanning with an accuracy comparable to methods that require external infrastructure.
https://arxiv.org/abs/2404.15847
This paper addresses the challenges associated with hyperspectral image (HSI) reconstruction from miniaturized satellites, which often suffer from stripe effects and are computationally resource-limited. We propose a Real-Time Compressed Sensing (RTCS) network designed to be lightweight and require only relatively few training samples for efficient and robust HSI reconstruction in the presence of the stripe effect and under noisy transmission conditions. The RTCS network features a simplified architecture that reduces the required training samples and allows for easy implementation on integer-8-based encoders, facilitating rapid compressed sensing of stripe-like HSI, which matches the push-broom scanning mechanism and modest compute budget of miniaturized satellites. This contrasts with optimization-based models that demand high-precision floating-point operations, making them difficult to deploy on edge devices. Our encoder employs an integer-8-compatible linear projection for stripe-like HSI data transmission, ensuring real-time compressed sensing. Furthermore, based on the novel two-streamed architecture, an efficient HSI restoration decoder is proposed for the receiver side, allowing for edge-device reconstruction without needing a sophisticated central server. This is particularly crucial as an increasing number of miniaturized satellites necessitates significant computing resources on the ground station. Extensive experiments validate the superior performance of our approach, offering new and vital capabilities for existing miniaturized satellite systems.
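An integer-8-compatible linear projection, as the abstract describes for the encoder, can be as simple as a random sign matrix applied with integer arithmetic to each push-broom scan line. The +/-1 measurement matrix and the dimensions below are illustrative assumptions; the paper's learned projection is not specified in the abstract.

```python
import random

def make_int8_matrix(m, n, seed=0):
    """Random +/-1 sensing matrix: every entry is exactly representable in int8."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]

def compress_line(A, line):
    """y = A @ line using integer arithmetic only, as an int8 encoder could."""
    return [sum(a * v for a, v in zip(row, line)) for row in A]

A = make_int8_matrix(4, 16)          # 4 measurements from a 16-pixel scan line
line = list(range(16))               # one toy push-broom line of pixel values
y = compress_line(A, line)           # the compressed measurements to transmit
```

Keeping the projection in small integers avoids the high-precision floating-point operations that make optimization-based models hard to deploy on the satellite side; the heavier decoding then happens at the receiver.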
https://arxiv.org/abs/2404.15781
Existing NeRF-based inverse rendering methods suppose that scenes are exclusively illuminated by distant light sources, neglecting the potential influence of emissive sources within a scene. In this work, we confront this limitation using LDR multi-view images captured with emissive sources turned on and off. Two key issues must be addressed: 1) ambiguity arising from the limited dynamic range along with unknown lighting details, and 2) the expensive computational cost in volume rendering to backtrace the paths leading to final object colors. We present a novel approach, ESR-NeRF, leveraging neural networks as learnable functions to represent ray-traced fields. By training networks to satisfy light transport segments, we regulate outgoing radiances, progressively identifying emissive sources while being aware of reflection areas. The results on scenes encompassing emissive sources with various properties demonstrate the superiority of ESR-NeRF in qualitative and quantitative ways. Our approach also extends its applicability to the scenes devoid of emissive sources, achieving lower CD metrics on the DTU dataset.
https://arxiv.org/abs/2404.15707
Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNN and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods, including data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neural machine translation model for the reconstruction task. We find that with the additional VAE structure, the Transformer model has a better performance on the WikiHan dataset, and the data augmentation step stabilizes the training.
https://arxiv.org/abs/2404.15690
Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Recently, numerous rumor detection models which utilize textual information and the propagation structure of events have been proposed. However, these methods overlook the importance of the semantic evolvement information of events during propagation, which is often challenging to truly learn in supervised training paradigms and traditional rumor detection methods. To address this issue, we propose a novel semantic evolvement enhanced Graph Autoencoder for Rumor Detection (GARD) model in this paper. The model learns semantic evolvement information of events by capturing local semantic changes and global semantic evolvement information through specific graph autoencoder and reconstruction strategies. By combining semantic evolvement information and propagation structure information, the model achieves a comprehensive understanding of event propagation and performs accurate and robust detection, while also detecting rumors earlier by capturing semantic evolvement information in the early stages. Moreover, in order to enhance the model's ability to learn the distinct patterns of rumors and non-rumors, we introduce a uniformity regularizer to further improve the model's performance. Experimental results on three public benchmark datasets confirm the superiority of our GARD method over the state-of-the-art approaches in both overall performance and early rumor detection.
https://arxiv.org/abs/2404.16076
Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at this https URL.
https://arxiv.org/abs/2404.15276
Recent advancements in machine learning have led to novel imaging systems and algorithms that address ill-posed problems. Assessing their trustworthiness and understanding how to deploy them safely at test time remains an important and open problem. We propose a method that leverages conformal prediction to retrieve upper/lower bounds and statistical inliers/outliers of reconstructions based on the prediction intervals of downstream metrics. We apply our method to sparse-view CT for downstream radiotherapy planning and show 1) that metric-guided bounds have valid coverage for downstream metrics while conventional pixel-wise bounds do not and 2) anatomical differences of upper/lower bounds between metric-guided and pixel-wise methods. Our work paves the way for more meaningful reconstruction bounds. Code available at this https URL
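The core of the metric-guided bounds is split conformal prediction: calibrate on held-out residuals of the downstream metric, then report an interval with marginal coverage of at least 1 - alpha. A minimal sketch follows; the toy numbers stand in for the radiotherapy-planning metrics, which are not detailed in the abstract.

```python
import math

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction: return (lower, upper) bounds on the
    downstream metric for a new reconstruction, with >= 1 - alpha coverage."""
    scores = sorted(abs(p - t) for p, t in zip(cal_pred, cal_true))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))   # conformal quantile index
    q = scores[min(k, n) - 1]
    return test_pred - q, test_pred + q

# Toy calibration set: predicted vs. true values of a downstream metric.
cal_pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cal_true = [1.1, 1.8, 3.3, 4.2, 4.9, 6.4, 6.8, 8.1, 9.5]
lo, hi = conformal_interval(cal_pred, cal_true, test_pred=5.0, alpha=0.1)
```

Calibrating on the metric itself is what gives the bounds valid coverage for the downstream quantity, whereas per-pixel intervals carry no such guarantee once pixels are aggregated into a planning metric.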
https://arxiv.org/abs/2404.15274
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. In industrial scenes, there are often a variety of unpredictable anomalies, and VAD methods can play a significant role in these scenarios. However, there is a lack of applicable datasets and methods specifically tailored for industrial production scenarios due to concerns regarding privacy and security. To bridge this gap, we propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios. The industrial processes in our dataset are chosen through on-site factory research and discussions with engineers. This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of the industrial process, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively exploit the periodic information in a basic reconstruction model. Our framework leverages a LoRA adapter to explore the effective migration of pretrained models, which are initially trained using synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in the field of industrial video anomaly detection and drive the process of video understanding tasks as well as smart factory deployment.
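The sliding-window inspection idea, comparing each frame's reconstruction error against the same phase one period earlier, can be sketched in a few lines. In the paper the period comes from a learned period memory module; here a fixed known period and a toy error signal stand in for it.

```python
def periodic_anomalies(errors, period, ratio=2.0):
    """Flag frame indices whose reconstruction error exceeds `ratio` times the
    error at the same phase one period earlier."""
    flagged = []
    for i in range(period, len(errors)):
        baseline = errors[i - period]
        if baseline > 0 and errors[i] > ratio * baseline:
            flagged.append(i)
    return flagged

# Toy per-frame reconstruction errors for a process with period 4:
errors = [0.1, 0.2, 0.1, 0.2] * 3
errors[9] = 0.9                      # an off-cycle event at frame 9
hits = periodic_anomalies(errors, period=4)
```

Comparing against the matching phase of the previous cycle, rather than a global threshold, keeps the normal high-error phases of a repetitive industrial process from being flagged as anomalies.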
https://arxiv.org/abs/2404.15033
We propose a novel pipeline for unknown object grasping in shared robotic autonomy scenarios. State-of-the-art methods for fully autonomous scenarios are typically learning-based approaches optimised for a specific end-effector, that generate grasp poses directly from sensor input. In the domain of assistive robotics, we seek instead to utilise the user's cognitive abilities for enhanced satisfaction, grasping performance, and alignment with their high level task-specific goals. Given a pair of stereo images, we perform unknown object instance segmentation and generate a 3D reconstruction of the object of interest. In shared control, the user then guides the robot end-effector across a virtual hemisphere centered around the object to their desired approach direction. A physics-based grasp planner finds the most stable local grasp on the reconstruction, and finally the user is guided by shared control to this grasp. In experiments on the DLR EDAN platform, we report a grasp success rate of 87% for 10 unknown objects, and demonstrate the method's capability to grasp objects in structured clutter and from shelves.
https://arxiv.org/abs/2404.15001