Digital Subtraction Angiography (DSA) is one of the gold standards for diagnosing vascular disease. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive insight into blood flow and can be utilized to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. However, sparse-view DSA reconstruction, aimed at reducing radiation dosage, remains underexplored in the research community. The dynamic blood flow and the insufficient input of sparse-view DSA images present significant challenges to the 3D vessel reconstruction task. In this study, we propose to use a time-agnostic vessel probability field to solve this problem effectively. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the vessel probability field. Functioning as a dynamic mask, the vessel probability provides proper gradients for both static and dynamic fields, adaptive to different scene types. This mechanism facilitates a self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the disparity between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training to achieve better geometry, and (2) a temporally perturbed rendering loss to enforce temporal consistency. Experimental results demonstrate superior quality in both 3D vessel reconstruction and 2D view synthesis.
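As an illustrative sketch (not the authors' implementation; array names and shapes are assumptions), the complementary weighted combination can be written as a per-voxel blend of the two attenuation fields, with the vessel probability acting as a soft mask:

```python
import numpy as np

def composite_attenuation(static_mu, dynamic_mu, vessel_prob):
    """Blend static and dynamic attenuation fields using a time-agnostic
    vessel probability as a soft mask. All arrays share the same spatial
    shape; vessel_prob lies in [0, 1]."""
    return vessel_prob * dynamic_mu + (1.0 - vessel_prob) * static_mu

# Toy example: a voxel that is certainly vessel takes the dynamic value,
# a voxel that is certainly background takes the static value.
static_mu = np.array([0.2, 0.2])
dynamic_mu = np.array([0.9, 0.9])
p = np.array([1.0, 0.0])
print(composite_attenuation(static_mu, dynamic_mu, p))  # [0.9 0.2]
```

Because the blend is differentiable in all three fields, gradients from the projection loss flow into the static field where the probability is low and into the dynamic field where it is high, which is the intuition behind the self-supervised decomposition described above.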
https://arxiv.org/abs/2405.10705
The infant brain undergoes rapid development in the first few years after birth. Compared to cross-sectional studies, longitudinal studies can depict the trajectories of infant brain development with higher accuracy, statistical power, and flexibility. However, the collection of infant longitudinal magnetic resonance (MR) data suffers from a notorious dropout problem, resulting in incomplete datasets with missing time points. This limitation significantly impedes subsequent neuroscience and clinical modeling. Yet, existing deep generative models face difficulties in missing brain image completion, due to sparse data and the nonlinear, dramatic contrast/geometric variations in the developing brain. We propose LoCI-DiffCom, a novel Longitudinal Consistency-Informed Diffusion model for infant brain image Completion, which integrates the images from preceding and subsequent time points to guide a diffusion model in generating high-fidelity missing data. Our designed LoCI module can work on highly sparse sequences, relying solely on data from two temporal points. Despite wide separation and diversity between age time points, our approach can extract individualized developmental features while ensuring context-aware consistency. Our experiments on a large infant brain MR dataset demonstrate its effectiveness, with consistent performance on missing infant brain MR completion even in large-gap scenarios, aiding in better delineation of early developmental trajectories.
https://arxiv.org/abs/2405.10691
3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems, converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the limited generalizability caused by sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation; 2) utilization of a mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) adoption of context-aware self-training loss functions in the pretraining stage to complement LiDAR supervision, involving the re-rendering of 2D depth maps from 3D occupancy features and leveraging an image reconstruction loss to obtain denser depth supervision beyond sparse LiDAR ground truths. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the lightest image backbone among current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experimentation also demonstrates the consistent superiority of our method over baselines and alternative approaches.
https://arxiv.org/abs/2405.10591
The development of open benchmarking platforms could greatly accelerate the adoption of AI agents in retail. This paper presents comprehensive simulations of customer shopping behaviors for the purpose of benchmarking reinforcement learning (RL) agents that optimize coupon targeting. The difficulty of this learning problem is largely driven by the sparsity of customer purchase events. We trained agents using offline batch data comprising summarized customer purchase histories to help mitigate this effect. Our experiments revealed that contextual bandit and deep RL methods that are less prone to over-fitting the sparse reward distributions significantly outperform static policies. This study offers a practical framework for simulating AI agents that optimize the entire retail customer journey. It aims to inspire the further development of simulation tools for retail AI systems.
https://arxiv.org/abs/2405.10469
Integrating an RGB camera into a ToF imaging system has become a significant technique for perceiving the real world. The RGB guided ToF imaging system is crucial to several applications, including face anti-spoofing, saliency detection, and trajectory prediction. Depending on the distance of the working range, the implementation schemes of the RGB guided ToF imaging systems are different. Specifically, ToF sensors with a uniform field of illumination, which can output dense depth but have low resolution, are typically used for close-range measurements. In contrast, LiDARs, which emit laser pulses and can only capture sparse depth, are usually employed for long-range detection. In the two cases, depth quality improvement for RGB guided ToF imaging corresponds to two sub-tasks: guided depth super-resolution and guided depth completion. In light of the recent significant boost to the field provided by deep learning, this paper comprehensively reviews the works related to RGB guided ToF imaging, including network structures, learning strategies, evaluation metrics, benchmark datasets, and objective functions. Besides, we present quantitative comparisons of state-of-the-art methods on widely used benchmark datasets. Finally, we discuss future trends and the challenges in real applications for further research.
https://arxiv.org/abs/2405.10357
Infrared (IR) image super-resolution faces challenges from homogeneous background pixel distributions and sparse target regions, requiring models that effectively handle long-range dependencies and capture detailed local-global information. Recent advancements in Mamba-based (Selective Structured State Space) models have shown significant potential in visual tasks, suggesting their applicability to IR enhancement. In this work, we introduce IRSRMamba: Infrared Image Super-Resolution via Mamba-based Wavelet Transform Feature Modulation Model, a novel Mamba-based model designed specifically for IR image super-resolution. This model enhances the restoration of context-sparse target details through its advanced dependency modeling capabilities. Additionally, a new wavelet transform feature modulation block improves multi-scale receptive field representation, capturing both global and local information efficiently. Comprehensive evaluations confirm that IRSRMamba outperforms existing models on multiple benchmarks. This research advances IR super-resolution and demonstrates the potential of Mamba-based models in IR image processing. Code is available at \url{this https URL}.
https://arxiv.org/abs/2405.09873
Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Extracting rich multi-scale features is crucial for such detectors because different types of objects vary significantly in size. However, due to real-time requirements, large convolution kernels are rarely used to extract large-scale features in the backbone. Current 3D detectors commonly use feature pyramid networks to obtain large-scale features; however, objects containing fewer points are further lost during downsampling, resulting in degraded performance. Since pillar-based schemes require much less computation than voxel-based schemes, they are more suitable for constructing real-time 3D detectors. Hence, we propose PillarNeXt, a pillar-based scheme in which we redesign the feature encoding, the backbone, and the neck of the 3D detector. We propose Voxel2Pillar feature encoding, which uses a sparse convolution constructor to build pillars with richer point cloud features, especially height features; additional learnable parameters enable the initial pillars to achieve higher performance. We extract multi-scale and large-scale features in the proposed fully sparse backbone, which does not rely on large convolutional kernels and consists of the proposed multi-scale feature extraction modules. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves performance. The effectiveness of PillarNeXt is validated on the Waymo Open Dataset, where object detection accuracy for vehicles, pedestrians, and cyclists is improved; we also verify the effectiveness of each proposed module in detail.
https://arxiv.org/abs/2405.09828
Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually fed into self-attention (SA) to reduce the overall number of tokens in the calculation, avoiding the high computational cost of Vision Transformers. However, such methods usually obtain sparse tokens through hand-crafted or parallel-unfriendly designs, posing a challenge to reaching a better balance between efficiency and performance. Unlike these methods, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information while improving inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they serve alternately as query and key (value) tokens in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT in various sizes. Experimental results on classification and dense prediction tasks show that LeMeViT achieves a significant $1.7 \times$ speedup, fewer parameters, and competitive performance compared to the baseline models, reaching a better trade-off between efficiency and performance.
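A minimal single-head sketch of the dual cross-attention idea (no learned projections, numpy only; all names and shapes are assumptions, not the LeMeViT code) illustrates why the cost drops from O(N^2) to O(NM) when the M meta tokens satisfy M << N:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """Single-head cross-attention: q attends to kv.
    Projections are omitted to keep the sketch minimal."""
    d = q.shape[-1]
    attn = softmax(q @ kv.T / np.sqrt(d))
    return attn @ kv

def dual_cross_attention(image_tokens, meta_tokens):
    """One DCA-style round: meta tokens gather information from image
    tokens, then image tokens are refined by the updated meta tokens.
    Each attention map is N x M (or M x N), not N x N."""
    meta_tokens = cross_attention(meta_tokens, image_tokens)
    image_tokens = cross_attention(image_tokens, meta_tokens)
    return image_tokens, meta_tokens

img = np.random.randn(196, 32)   # N dense image tokens
meta = np.random.randn(16, 32)   # M learnable meta tokens, M << N
img2, meta2 = dual_cross_attention(img, meta)
print(img2.shape, meta2.shape)   # (196, 32) (16, 32)
```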
https://arxiv.org/abs/2405.09789
Accurate detection of vulvovaginal candidiasis is critical for women's health, yet its sparse distribution and visually ambiguous characteristics pose significant challenges for accurate identification by pathologists and neural networks alike. Our eye-tracking data reveals that areas garnering sustained attention, yet not marked by experts after deliberation, often align with false positives of neural networks. Leveraging this finding, we introduce Gaze-DETR, a pioneering method that integrates gaze data to enhance neural network precision by diminishing false positives. Gaze-DETR incorporates a universal gaze-guided warm-up protocol applicable across various detection methods and a gaze-guided rectification strategy specifically designed for DETR-based models. Our comprehensive tests confirm that Gaze-DETR surpasses existing leading methods, showcasing remarkable improvements in detection accuracy and generalizability.
https://arxiv.org/abs/2405.09463
Image-guided depth completion aims to generate a dense depth map from sparse LiDAR data and an RGB image. Recent methods have shown promising performance by reformulating it as a classification problem with two sub-tasks: depth discretization and probability prediction. They divide the depth range into several discrete depth values as depth categories, serving as priors for scene depth distributions. However, previous depth discretization methods are easily affected by depth distribution variations across different scenes, resulting in suboptimal scene depth distribution priors. To address this problem, we propose a progressive depth decoupling and modulating network, which incrementally decouples the depth range into bins and adaptively generates multi-scale dense depth maps in multiple stages. Specifically, we first design a Bins Initializing Module (BIM) to construct the seed bins by exploring the depth distribution information within a sparse depth map, adapting to variations of depth distribution. Then, we devise an incremental depth decoupling branch to progressively refine the depth distribution information from global to local. Meanwhile, an adaptive depth modulating branch is developed to progressively improve the probability representation from coarse-grained to fine-grained. Bi-directional information interactions are further proposed to strengthen the information exchange between these two branches (sub-tasks), promoting information complementation in each branch. Furthermore, we introduce a multi-scale supervision mechanism to learn the depth distribution information in latent features and enhance the adaptation capability across different scenes. Experimental results on public datasets demonstrate that our method outperforms the state-of-the-art methods. The code will be open-sourced at [this https URL](this https URL).
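The classification-style decoding shared by these bin-based methods can be sketched generically (an illustration of the common formulation, not this paper's network): each pixel predicts a probability distribution over adaptive depth bins, and the output depth is the probability-weighted mean of the bin centers.

```python
import numpy as np

def expected_depth(bin_centers, probs):
    """Decode a per-pixel probability distribution over depth bins into a
    continuous depth value: the probability-weighted mean of bin centers."""
    return (probs * bin_centers).sum(axis=-1)

centers = np.array([1.0, 2.0, 4.0, 8.0])   # adaptive bin centers (metres)
p = np.array([[0.1, 0.7, 0.2, 0.0]])       # one pixel's predicted distribution
print(expected_depth(centers, p))          # [2.3]
```

Because the decoding is a soft average, the bin placement directly shapes the attainable depth values, which is why scene-adaptive bins (as in the BIM above) matter.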
https://arxiv.org/abs/2405.09342
Recent strides in model predictive control (MPC) underscore a dependence on numerical advancements to efficiently and accurately solve large-scale problems. Given the substantial number of variables characterizing typical whole-body optimal control (OC) problems, often numbering in the thousands, exploiting the sparse structure of the numerical problem becomes crucial to meet computational demands, typically in the range of a few milliseconds. A fundamental building block for computing Newton or Sequential Quadratic Programming (SQP) steps in direct optimal control methods involves addressing the linear quadratic regulator (LQR) problem. This paper concentrates on equality-constrained problems featuring implicit system dynamics and dual regularization, a characteristic found in advanced interior-point or augmented Lagrangian solvers. Here, we introduce a parallel algorithm designed for solving an LQR problem with dual regularization. Leveraging a rewriting of the LQR recursion through block elimination, we first enhance the efficiency of the serial algorithm, then generalize it to handle parametric problems. This extension enables us to split decision variables and solve multiple subproblems concurrently. Our algorithm is implemented in our nonlinear numerical optimal control library ALIGATOR. It showcases improved performance over previous serial formulations, and we validate its efficacy by deploying it in the model predictive control of a real quadruped robot. This paper follows up on our prior work on augmented Lagrangian methods for numerical optimal control with implicit dynamics and constraints.
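For context, the serial LQR building block that the paper parallelizes can be sketched as the classical backward Riccati recursion (this sketch omits the implicit dynamics, equality constraints, dual regularization, and block elimination that are the paper's actual contributions; all names are assumptions):

```python
import numpy as np

def lqr_backward(A, B, Q, R, Qf, N):
    """Serial backward Riccati recursion for the finite-horizon LQR with
    dynamics x_{t+1} = A x_t + B u_t and cost sum x'Qx + u'Ru + x_N'Qf x_N.
    Returns the feedback gains K_t, so that u_t = -K_t x_t."""
    P = Qf
    gains = []
    for _ in range(N):
        S = R + B.T @ P @ B                      # input-space Hessian
        K = np.linalg.solve(S, B.T @ P @ A)      # feedback gain
        P = Q + A.T @ P @ (A - B @ K)            # cost-to-go update
        gains.append(K)
    return gains[::-1]                           # gains in forward time order

# Double-integrator example
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.array([[1.0]]); Qf = 10 * np.eye(2)
K = lqr_backward(A, B, Q, R, Qf, N=50)
print(len(K), K[0].shape)  # 50 (1, 2)
```

The recursion is inherently sequential in the horizon length, which is precisely what motivates the block-elimination rewriting and parallelization described above.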
https://arxiv.org/abs/2405.09197
For the shape control of deformable free-form surfaces, simulation plays a crucial role in establishing the mapping between the actuation parameters and the deformed shapes. The differentiation of this forward kinematic mapping is usually employed to solve the inverse kinematic problem of determining the actuation parameters that can realize a target shape. However, the free-form surfaces obtained from simulators always differ from the physically deformed shapes due to errors introduced by hardware and the simplifications adopted in physical simulation. To fill this gap, we propose a novel deformation-function-based sim-to-real learning method that can map the geometric shape of a simulated model into the corresponding shape of the physical model. Unlike existing sim-to-real learning methods that rely on completely acquired dense markers, our method accommodates sparsely distributed markers and can resiliently use all captured frames, even those with missing markers. To demonstrate its effectiveness, our sim-to-real method has been integrated into a neural network-based computational pipeline designed to tackle the inverse kinematic problem on a pneumatically actuated deformable mannequin.
https://arxiv.org/abs/2405.08935
In this work, we introduce a novel method for calculating the 6DoF pose of an object using a single RGB-D image. Unlike existing methods that either directly predict objects' poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence, i.e., we regress the object coordinates for each visible pixel. Our method leverages existing object detection methods. We incorporate a re-projection mechanism to adjust the camera's intrinsic matrix to accommodate cropping in RGB-D images. Moreover, we transform the 3D object coordinates into a residual representation, which can effectively reduce the output space and yield superior performance. We conducted extensive experiments to validate the efficacy of our approach for 6D pose estimation. Our approach outperforms most previous methods, especially in occlusion scenarios, and demonstrates notable improvements over the state-of-the-art methods. Our code is available on this https URL.
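The re-projection mechanism for cropped RGB-D inputs follows standard pinhole geometry: cropping an image shifts the principal point by the crop offset while leaving the focal lengths unchanged. A sketch of this adjustment (not the authors' code; names are assumptions):

```python
import numpy as np

def crop_intrinsics(K, crop_x, crop_y):
    """Adjust a pinhole intrinsic matrix after cropping the image at
    pixel offset (crop_x, crop_y): the principal point (cx, cy) shifts
    by the offset; focal lengths fx, fy are unchanged."""
    K2 = K.copy()
    K2[0, 2] -= crop_x
    K2[1, 2] -= crop_y
    return K2

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(crop_intrinsics(K, 100, 50))  # cx -> 220, cy -> 190
```

Keeping the intrinsics consistent with the crop ensures that per-pixel object coordinates regressed inside the crop still back-project to the correct 3D rays.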
https://arxiv.org/abs/2405.08483
Safe maneuvering capability is critical for mobile robots in complex environments. However, robotic system dynamics are often time-varying, uncertain, or even unknown during the motion planning and control process. Therefore, many existing model-based reinforcement learning (RL) methods cannot achieve satisfactory reliability in guaranteeing safety. To address this challenge, we propose a two-level Vector Field-guided Learning Predictive Control (VF-LPC) approach that guarantees safe maneuverability. The first level, the guiding level, generates safe desired trajectories using the designed kinodynamic guiding vector field, enabling safe motion in obstacle-dense environments. The second level, the Integrated Motion Planning and Control (IMPC) level, first uses the deep Koopman operator to learn a nominal dynamics model offline and then updates the model uncertainties online using sparse Gaussian processes (GPs). The learned dynamics and a game-based safe barrier function are then incorporated into the learning predictive control framework to generate near-optimal control sequences. We conducted tests to compare the performance of VF-LPC with existing advanced planning methods in an obstacle-dense environment. The simulation results show that it can generate feasible trajectories quickly. VF-LPC was then evaluated against motion planning methods that employ model predictive control (MPC) and RL in the high-fidelity CarSim software. The results show that VF-LPC outperforms them in terms of completion time, route length, and average solution time. We also carried out path-tracking control tests on a racing road to validate the model-uncertainty learning capability. Finally, we conducted real-world experiments on a Hongqi E-HS3 vehicle, further validating the effectiveness of the VF-LPC approach.
https://arxiv.org/abs/2405.08283
Tensors serve as a crucial tool in the representation and analysis of complex, multi-dimensional data. As data volumes continue to expand, there is an increasing demand for developing optimization algorithms that can directly operate on tensors to deliver fast and effective computations. Many problems in real-world applications can be formulated as the task of recovering high-order tensors characterized by sparse and/or low-rank structures. In this work, we propose novel Kaczmarz algorithms with a power of the $\ell_1$-norm regularization for reconstructing high-order tensors by exploiting sparsity and/or low-rankness of tensor data. In addition, we develop both a block and an accelerated variant, along with a thorough convergence analysis of these algorithms. A variety of numerical experiments on both synthetic and real-world datasets demonstrate the effectiveness and significant potential of the proposed methods in image and video processing tasks, such as image sequence destriping and video deconvolution.
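To give a flavor of this family of methods, here is a classical sparse (l1-regularized) randomized Kaczmarz iteration for the matrix case; the paper's algorithms generalize this style of iteration, with a power of the l1-norm, to high-order tensors. This sketch is illustrative only, not the proposed tensor algorithm, and all names are assumptions:

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_1 (element-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_kaczmarz(A, b, lam=0.1, sweeps=200, seed=0):
    """Randomized sparse Kaczmarz: a Kaczmarz projection step on an
    auxiliary variable z, followed by soft-thresholding to promote
    sparsity in the iterate x."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    z = np.zeros(n)
    x = np.zeros(n)
    for _ in range(sweeps * m):
        i = rng.integers(m)
        a = A[i]
        z -= (a @ x - b[i]) / (a @ a) * a   # Kaczmarz step on row i
        x = soft_threshold(z, lam)          # l1 shrinkage
    return x

# Recover a sparse vector from a few random linear measurements
rng = np.random.default_rng(1)
x_true = np.zeros(20); x_true[[3, 11]] = [2.0, -1.5]
A = rng.standard_normal((15, 20))
x_hat = sparse_kaczmarz(A, A @ x_true, lam=0.05, sweeps=300)
print(np.round(x_hat[[3, 11]], 1))
```

Each iteration touches only one row of the system, which is what makes Kaczmarz-type methods attractive for the large-scale tensor recovery problems discussed above.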
https://arxiv.org/abs/2405.08275
The multi-plane representation has been highlighted for its fast training and inference across static and dynamic neural radiance fields. This approach constructs relevant features via projection onto learnable grids and interpolation of adjacent vertices. However, it has limitations in capturing low-frequency details and tends to overuse parameters for low-frequency features due to its bias toward fine details, despite its multi-resolution concept. This phenomenon leads to instability and inefficiency when training poses are sparse. In this work, we propose a method that synergistically integrates the multi-plane representation with a coordinate-based network known for its strong bias toward low-frequency signals. The coordinate-based network is responsible for capturing low-frequency details, while the multi-plane representation focuses on capturing fine-grained details. We demonstrate that using residual connections between them seamlessly preserves their inherent properties. Additionally, the proposed progressive training scheme accelerates the disentanglement of these two features. We empirically show that the proposed method achieves results comparable to explicit encoding with fewer parameters, and in particular, it outperforms others for static and dynamic NeRFs under sparse inputs.
https://arxiv.org/abs/2405.07857
High-resolution road representations are a key factor in the success of (highly) automated driving functions. These representations, for example high-definition (HD) maps, contain accurate information on a multitude of factors, among others road geometry, lane information, and traffic signs. With the growing complexity and functionality of automated driving functions, the requirements for testing and evaluation also grow continuously. This leads to an increasing interest in virtual test drives for evaluation purposes. As roads play a crucial role in traffic flow, accurate real-world representations are needed, especially when deriving realistic driving behavior data. This paper proposes a novel approach to generate realistic road representations based solely on point cloud information, independent of the LiDAR sensor and mounting position, and without the need for odometry data, multi-sensor fusion, machine learning, or highly accurate calibration. As the primary use case is simulation, we use the OpenDRIVE format for evaluation.
https://arxiv.org/abs/2405.07544
Transportation of samples across different domains is a central task in several machine learning problems. A sensible requirement for domain transfer tasks in computer vision and language domains is the sparsity of the transportation map, i.e., the transfer algorithm aims to modify the fewest input features while transporting samples across the source and target domains. In this work, we propose Elastic Net Optimal Transport (ENOT) to address the sparse distribution transfer problem. The ENOT framework utilizes $L_1$-norm and $L_2$-norm regularization mechanisms to find a sparse and stable transportation map between the source and target domains. To compute the ENOT transport map, we consider the dual formulation of the ENOT optimization task and prove that the sparsified gradient of the optimal potential function in ENOT's dual representation provides the ENOT transport map. Furthermore, we demonstrate the application of the ENOT framework to feature selection for sparse domain transfer. We present numerical results of applying ENOT to several domain transfer problems for synthetic Gaussian mixtures and real image and text data. Our empirical results indicate the success of the ENOT framework in identifying a sparse domain transport map.
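The elastic-net penalty at the heart of ENOT combines $L_1$ and $L_2$ terms, and its well-known proximal operator (soft-thresholding followed by shrinkage) illustrates how the two norms interact: the $L_1$ term zeroes out small feature displacements, while the $L_2$ term stabilizes the rest. This is a generic sketch of the penalty, not ENOT's dual-potential computation; names are assumptions:

```python
import numpy as np

def elastic_net_prox(v, lam1, lam2):
    """Proximal operator of lam1*||x||_1 + (lam2/2)*||x||_2^2:
    soft-thresholding by lam1, then shrinkage by 1/(1 + lam2)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0) / (1.0 + lam2)

# Per-feature displacement of a transported sample: small changes are
# zeroed (sparsity), large ones are kept but damped (stability).
displacement = np.array([0.05, -2.0, 0.8, 0.0])
print(elastic_net_prox(displacement, lam1=0.1, lam2=0.5))
```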
https://arxiv.org/abs/2405.07489
In the character animation field, modern supervised keyframe interpolation models have demonstrated exceptional performance in constructing natural human motions from sparse pose definitions. As supervised models, they require large motion datasets to facilitate the learning process; however, since motion is represented with fixed hierarchical skeletons, such datasets are incompatible with skeletons outside their native configurations. Consequently, the limited availability of motion datasets for desired skeletons severely hinders the feasibility of learned interpolation in practice. To combat this limitation, we propose Point Cloud-based Motion Representation Learning (PC-MRL), an unsupervised approach to enabling cross-compatibility between skeletons for motion interpolation learning. PC-MRL consists of a skeleton obfuscation strategy using temporal point cloud sampling and an unsupervised method for reconstructing skeletal motion from point clouds. We devise a temporal point-wise K-nearest neighbors loss for unsupervised learning. Moreover, we propose First-frame Offset Quaternion (FOQ) and Rest Pose Augmentation (RPA) strategies to overcome necessary limitations of our unsupervised point cloud-to-skeletal motion process. Comprehensive experiments demonstrate the effectiveness of PC-MRL in motion interpolation for desired skeletons without supervision from native datasets.
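A spatial-only sketch of a point-wise K-nearest-neighbors loss conveys the basic idea behind such an objective (the paper's temporal formulation over sampled point cloud sequences is more involved; names are assumptions):

```python
import numpy as np

def knn_loss(pred_points, target_points, k=3):
    """Point-wise KNN loss sketch: for each predicted point, average the
    distances to its k nearest target points, then average over points.
    The loss is zero when the two point sets coincide."""
    d = np.linalg.norm(pred_points[:, None, :] - target_points[None, :, :],
                       axis=-1)            # pairwise distance matrix
    knn = np.sort(d, axis=1)[:, :k]        # k smallest distances per row
    return knn.mean()

pred = np.random.randn(64, 3)
print(knn_loss(pred, pred, k=1))  # 0.0 when prediction matches the target
```

Because the loss needs no point-to-joint correspondence, it can supervise reconstruction across skeletons with different hierarchies, which is the cross-compatibility goal described above.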
https://arxiv.org/abs/2405.07444
In the field of transportation, it is of paramount importance to address and mitigate illegal actions committed by both motor and non-motor vehicles. Among these actions, wrong-way cycling (i.e., riding a bicycle or e-bike in the direction opposite to the designated traffic flow) poses significant risks to both cyclists and other road users. To this end, this paper formulates the problem of estimating the wrong-way cycling ratio in CCTV videos. Specifically, we propose a sparse sampling method called WWC-Predictor to solve this problem efficiently, addressing the inefficiencies of direct tracking methods. Our approach leverages both detection-based information, which utilizes bounding boxes, and orientation-based information, which provides insights into the image itself, to enhance instantaneous information capture. On our proposed benchmark dataset, consisting of 35 minutes of video sequences with minute-level annotations, our method achieves an average error rate of a mere 1.475% while taking only 19.12% of the GPU time of straightforward tracking methods under the same detection model. This remarkable performance demonstrates the effectiveness of our approach in identifying and predicting instances of wrong-way cycling.
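A toy sketch of the ratio-estimation idea, assuming per-instance motion vectors are available from sparsely sampled detections (this is illustrative only, not the WWC-Predictor pipeline; names are assumptions):

```python
import numpy as np

def wrong_way_ratio(headings, lane_direction):
    """Classify each rider's motion vector against the designated flow:
    a negative dot product with the lane direction means wrong-way.
    Returns the fraction of wrong-way instances."""
    lane = np.asarray(lane_direction) / np.linalg.norm(lane_direction)
    dots = np.asarray(headings) @ lane
    return float((dots < 0).mean())

# Four riders; one moves against the designated flow along +x
headings = [(1.0, 0.1), (0.9, -0.2), (-1.0, 0.0), (0.8, 0.3)]
print(wrong_way_ratio(headings, (1.0, 0.0)))  # 0.25
```

Estimating the ratio from sampled frames rather than full trajectories is what lets such an approach avoid the cost of continuous tracking.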
https://arxiv.org/abs/2405.07293