Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions among widely varying numbers of both people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at this https URL.
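As a rough illustration of the architecture described above, the sketch below combines a temporal-convolutional encoder-decoder with a Transformer bottleneck that mixes per-person motion tokens and per-object scene tokens. All module names, dimensions, and the token layout are assumptions for illustration; the diffusion noising and conditioning machinery is omitted, and this is not the authors' code.

```python
# Minimal sketch (not the authors' code): temporal-convolutional encoder/decoder with
# a Transformer bottleneck mixing per-person motion tokens and scene-object tokens.
import torch
import torch.nn as nn

class SASTLikeBackbone(nn.Module):
    def __init__(self, joint_dim=57, obj_dim=9, d_model=128, n_layers=4):
        super().__init__()
        # 1D convolutions over time encode each person's past motion into tokens.
        self.motion_enc = nn.Sequential(
            nn.Conv1d(joint_dim, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
        )
        self.obj_proj = nn.Linear(obj_dim, d_model)  # one token per scene object
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.motion_dec = nn.Sequential(
            nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(d_model, joint_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, motion, objects):
        # motion:  (batch, persons, time, joint_dim) - past poses, variable person count
        # objects: (batch, num_objects, obj_dim)     - scene objects, variable count
        b, p, t, j = motion.shape
        tokens = self.motion_enc(motion.reshape(b * p, t, j).transpose(1, 2))  # (b*p, d, t')
        t_lat = tokens.shape[-1]
        person_tokens = tokens.transpose(1, 2).reshape(b, p * t_lat, -1)
        scene_tokens = self.obj_proj(objects)
        mixed = self.bottleneck(torch.cat([person_tokens, scene_tokens], dim=1))
        person_mixed = mixed[:, : p * t_lat].reshape(b * p, t_lat, -1).transpose(1, 2)
        out = self.motion_dec(person_mixed)                 # (b*p, joint_dim, ~t)
        return out.transpose(1, 2).reshape(b, p, -1, j)     # per-person future-motion features
```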
https://arxiv.org/abs/2409.12189
Advances in microscopy imaging enable researchers to visualize structures at the nanoscale, thereby unraveling intricate details of biological organization. However, challenges such as image noise, photobleaching of fluorophores, and the low tolerance of biological samples to high light doses remain, restricting temporal resolution and experiment duration. Reduced laser doses enable longer measurements at the cost of lower resolution and increased noise, which hinders accurate downstream analyses. Here we train a denoising diffusion probabilistic model (DDPM) to predict high-resolution images by conditioning the model on low-resolution information. Additionally, the probabilistic nature of the DDPM allows for repeated generation of images, which tends to further increase the signal-to-noise ratio. We show that our model achieves performance better than or similar to the previously best-performing methods across four highly diverse datasets. Importantly, while each of the previous methods shows competitive performance on some, but not all, of the datasets, our method consistently achieves high performance across all four datasets, suggesting high generalizability.
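The "repeated generation" idea lends itself to a very small sketch: draw several conditional reverse-diffusion samples for the same low-resolution input and average them. The `sampler` callable below is a placeholder for a trained conditional DDPM, not the paper's actual API.

```python
# Minimal sketch: exploit DDPM stochasticity by drawing several denoised predictions
# for the same low-resolution input and averaging them to raise the SNR.
import torch

@torch.no_grad()
def averaged_prediction(sampler, low_res, n_samples=8):
    """sampler(noise, cond) -> denoised image; placeholder for a trained conditional DDPM."""
    samples = []
    for _ in range(n_samples):
        noise = torch.randn_like(low_res)        # independent Gaussian init for each draw
        samples.append(sampler(noise, low_res))  # full reverse diffusion conditioned on low_res
    stack = torch.stack(samples, dim=0)          # (n_samples, C, H, W)
    return stack.mean(dim=0), stack.std(dim=0)   # averaged prediction + per-pixel spread
```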
https://arxiv.org/abs/2409.12078
Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the sources of skeleton images, recent works have investigated the generation of pose skeletons from natural language. These methods are based on GANs. However, it remains challenging to generate human pose skeletons that are diverse, structurally correct, and aesthetically pleasing from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned from a stable diffusion model. PoseDiffusion demonstrates several desirable properties that outperform existing methods. 1) Correct skeletons. GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton, characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
https://arxiv.org/abs/2409.11689
Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and region-of-interest (ROI) localization can assist pathologists in diagnosis. The gigapixel resolution of WSIs and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention causes quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from the agent-aggregated value, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL's superiority over state-of-the-art methods.
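A minimal sketch of the agent-token idea follows: a small set of learned agent tokens mediates attention between a classification query and the bag of instance features, so the cost stays linear in the number of instances. The mask and denoising matrices described above are omitted, and all names and dimensions are illustrative assumptions rather than the AMD-MIL implementation.

```python
# Minimal sketch of agent-mediated attention for MIL (assumptions, not the AMD-MIL code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentAggregator(nn.Module):
    def __init__(self, dim=512, n_agents=64, n_classes=2):
        super().__init__()
        self.agents = nn.Parameter(torch.randn(n_agents, dim) * 0.02)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.cls_query = nn.Parameter(torch.randn(1, dim) * 0.02)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, instances):
        # instances: (N, dim) patch features of one whole-slide image (N can be huge)
        scale = instances.shape[-1] ** -0.5
        # Step 1: agents aggregate the instances (n_agents x N attention, linear in N).
        attn = F.softmax(self.q(self.agents) @ self.k(instances).T * scale, dim=-1)
        agent_feats = attn @ self.v(instances)                  # (n_agents, dim)
        # Step 2: the classification query reads from the agents (constant cost).
        cls_attn = F.softmax(self.q(self.cls_query) @ self.k(agent_feats).T * scale, dim=-1)
        bag_feat = cls_attn @ self.v(agent_feats)               # (1, dim)
        return self.head(bag_feat)                              # slide-level logits
```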
https://arxiv.org/abs/2409.11664
Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: this https URL.
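A compact sketch of the variance-imaging step, under the assumption that the fine-tuned diffusion model is available as a `denoise` callable: several denoised reconstructions of the same beamformed frame are drawn and their per-pixel variance is log-compressed for display. The constants are illustrative, not the paper's settings.

```python
# Minimal sketch: variance across repeated diffusion-denoised reconstructions of one
# EBMV-beamformed frame; `denoise` stands in for the fine-tuned diffusion model.
import torch

@torch.no_grad()
def diffusion_variance_image(denoise, beamformed, n_samples=16, dynamic_range_db=60.0):
    # beamformed: (H, W) envelope image from a single plane-wave acquisition
    recons = torch.stack([denoise(beamformed) for _ in range(n_samples)])  # (n, H, W)
    var_img = recons.var(dim=0)                                 # speckle averages out, structure remains
    log_img = 20.0 * torch.log10(var_img.sqrt() + 1e-8)         # log-compress for display
    log_img = log_img - log_img.max()                           # 0 dB at the maximum
    return log_img.clamp(min=-dynamic_range_db)
```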
https://arxiv.org/abs/2409.11380
Recent advancements in deep learning have shown impressive results in image and video denoising, leveraging extensive pairs of noisy and noise-free data for supervision. However, the challenge of acquiring paired videos for dynamic scenes hampers the practical deployment of deep video denoising techniques. In contrast, this obstacle is less pronounced in image denoising, where paired data is more readily available. Thus, a well-trained image denoiser could serve as a reliable spatial prior for video denoising. In this paper, we propose a novel unsupervised video denoising framework, named "Temporal As a Plugin" (TAP), which integrates tunable temporal modules into a pre-trained image denoiser. By incorporating temporal modules, our method can harness temporal information across noisy frames, complementing its spatial denoising capability. Furthermore, we introduce a progressive fine-tuning strategy that refines each temporal module using generated pseudo-clean video frames, progressively enhancing the network's denoising performance. Compared to other unsupervised video denoising methods, our framework demonstrates superior performance on both sRGB and raw video denoising datasets.
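The plugin idea can be sketched as follows, assuming channel- and resolution-preserving spatial blocks taken from a pre-trained image denoiser: the spatial blocks stay frozen while small residual temporal blocks, inserted between them, are the only trainable parts. This is an illustrative reading of the abstract, not the TAP implementation.

```python
# Minimal sketch of "temporal module as a plugin" (shapes and placement are assumptions):
# frozen 2D spatial blocks from a pre-trained image denoiser, trainable temporal mixers in between.
import torch
import torch.nn as nn

class TemporalPlugin(nn.Module):
    """Trainable 1D temporal mixing applied per pixel across neighboring frames."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):          # x: (B, C, T, H, W)
        return x + self.mix(x)     # residual, so an untrained plugin starts near identity

class VideoDenoiser(nn.Module):
    def __init__(self, spatial_blocks, channels_per_block):
        super().__init__()
        # spatial_blocks: 2D modules from the pre-trained image denoiser, assumed to
        # preserve spatial resolution; channels_per_block: their output channel counts.
        self.spatial = nn.ModuleList(spatial_blocks)
        self.temporal = nn.ModuleList(TemporalPlugin(c) for c in channels_per_block)
        for p in self.spatial.parameters():
            p.requires_grad_(False)                           # only the plugins are fine-tuned

    def forward(self, frames):     # frames: (B, C, T, H, W)
        b, c, t, h, w = frames.shape
        x = frames
        for blk, plug in zip(self.spatial, self.temporal):
            # apply the frozen 2D block frame by frame, then mix information across time
            y = blk(x.permute(0, 2, 1, 3, 4).reshape(b * t, -1, h, w))
            x = plug(y.reshape(b, t, -1, h, w).permute(0, 2, 1, 3, 4))
        return x
```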
https://arxiv.org/abs/2409.11256
Purpose: Bone metastases have a major impact on the quality of life of patients, and they are diverse in size and location, making their segmentation complex. Manual segmentation is time-consuming, and expert segmentations are subject to operator variability, which makes obtaining accurate and reproducible segmentations of bone metastases on CT scans a challenging yet important task. Materials and Methods: Deep learning methods tackle segmentation tasks efficiently but require large datasets along with expert manual segmentations to generalize to new images. We propose an automated data synthesis pipeline using 3D Denoising Diffusion Probabilistic Models (DDPM) to enhance the segmentation of femoral metastases from CT-scan volumes of patients. We used 29 existing lesions along with 26 healthy femurs to create new realistic synthetic metastatic images, and trained a DDPM to improve the diversity and realism of the simulated volumes. We also investigated operator variability in manual segmentation. Results: We created 5675 new volumes, then trained 3D U-Net segmentation models on real and synthetic data to compare segmentation performance, and we evaluated the performance of the models depending on the amount of synthetic data used in training. Conclusion: Our results showed that segmentation models trained with synthetic data outperformed those trained on real volumes only, and that those models perform especially well when operator variability is taken into account.
https://arxiv.org/abs/2409.11011
In recent years, deep learning-based image compression, particularly through generative models, has emerged as a pivotal area of research. Despite significant advancements, challenges such as diminished sharpness and quality in reconstructed images, learning inefficiencies due to mode collapse, and data loss during transmission persist. To address these issues, we propose a novel compression model that incorporates a denoising step with diffusion models, significantly enhancing image reconstruction fidelity by leveraging sub-information (e.g., edge and depth) from the latent space. Empirical experiments demonstrate that our model achieves superior or comparable results in terms of image quality and compression efficiency when measured against existing models. Notably, our model excels in scenarios of partial image loss or excessive noise by introducing an edge estimation network to preserve the integrity of reconstructed images, offering a robust solution to the current limitations of image compression.
https://arxiv.org/abs/2409.10978
Robot exploration aims at constructing maps of unknown environments, and it is important to achieve this with shorter paths. Traditional methods focus on optimizing the visiting order based on current observations, which may lead to locally minimal results. Recently, by predicting the structure of the unseen environment, exploration efficiency can be further improved. However, in a cluttered environment, the randomness of obstacles limits this predictive ability. To solve this problem, we propose a map prediction algorithm that efficiently predicts the layout of noisy indoor environments. We focus on the scenario of 2D exploration. First, we perform floor plan extraction by denoising the cluttered map using deep learning. Then, we use a floor plan-based algorithm to improve the prediction accuracy. Additionally, we extract the segmentation of rooms and construct their connectivity based on the predicted map, which can be used for downstream tasks. To validate the effectiveness of the proposed method, we apply it to exploration tasks. Extensive experiments show that even in cluttered scenes, our proposed method can improve exploration efficiency.
https://arxiv.org/abs/2409.10878
LiDAR is one of the most commonly adopted sensors for simultaneous localization and mapping (SLAM) and map-based global localization. SLAM and map-based localization are crucial for the independent operation of autonomous systems, especially when external signals such as GNSS are unavailable or unreliable. While state-of-the-art (SOTA) LiDAR SLAM systems could achieve 0.5% (i.e., 0.5m per 100m) of errors and map-based localization could achieve centimeter-level global localization, it is still unclear how robust they are under various common LiDAR data corruptions. In this work, we extensively evaluated five SOTA LiDAR-based localization systems under 18 common scene-level LiDAR point cloud data (PCD) corruptions. We found that the robustness of LiDAR-based localization varies significantly depending on the category. For SLAM, hand-crafted methods are in general robust against most types of corruption, while being extremely vulnerable (up to +80% errors) to a specific corruption. Learning-based methods are vulnerable to most types of corruptions. For map-based global localization, we found that the SOTA is resistant to all applied corruptions. Finally, we found that simple Bilateral Filter denoising effectively eliminates noise-based corruption but is not helpful in density-based corruption. Re-training is more effective in defending learning-based SLAM against all types of corruption.
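For reference, the bilateral-filter baseline mentioned above can be sketched on a LiDAR range image as below; the range-image representation and the filter parameters are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of bilateral-filter denoising on a LiDAR range image (illustrative parameters).
import numpy as np

def bilateral_filter_range(range_img, radius=2, sigma_space=1.5, sigma_range=0.3):
    h, w = range_img.shape
    out = np.zeros_like(range_img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial_w = np.exp(-(ys**2 + xs**2) / (2 * sigma_space**2))    # fixed spatial kernel
    padded = np.pad(range_img, radius, mode="edge")
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            range_w = np.exp(-((patch - range_img[i, j]) ** 2) / (2 * sigma_range**2))
            weights = spatial_w * range_w
            out[i, j] = np.sum(weights * patch) / np.sum(weights)  # edge-preserving average
    return out
```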
https://arxiv.org/abs/2409.10824
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech for various downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework can influence performance on downstream tasks. For example, targets that encode prosody are beneficial for speaker-related tasks, while targets that encode phonetics are more suited for content-related tasks. Additionally, prediction targets can vary in the level of detail they encode; targets that encode fine-grained acoustic details are beneficial for denoising tasks, while targets that encode higher-level abstractions are more suited for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores the design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose novel approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
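A minimal sketch of the masked-prediction objective follows: spans of input frames are masked and the model is trained to predict discrete targets (e.g. k-means cluster ids of acoustic features) only at the masked positions. The choice of those targets is exactly the design knob studied above; `encoder` and `proj` are placeholder callables and the masking scheme is simplified.

```python
# Minimal sketch of a HuBERT-style masked-prediction objective (illustrative simplification).
import torch
import torch.nn.functional as F

def masked_prediction_loss(encoder, proj, frames, targets, mask_prob=0.08, span=10):
    # frames:  (B, T, D) input features;  targets: (B, T) integer cluster ids (long dtype)
    B, T, _ = frames.shape
    mask = torch.zeros(B, T, dtype=torch.bool)
    starts = torch.rand(B, T) < mask_prob                # span starts
    for b, t in starts.nonzero().tolist():
        mask[b, t:t + span] = True                       # mask a contiguous span
    corrupted = frames.clone()
    corrupted[mask] = 0.0                                # replace masked frames (learned embedding in practice)
    logits = proj(encoder(corrupted))                    # (B, T, n_clusters)
    return F.cross_entropy(logits[mask], targets[mask])  # loss only where frames were masked
```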
https://arxiv.org/abs/2409.10788
Diffusion models have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring, dehazing, etc. In this review paper, we introduce key constructions in diffusion models and survey contemporary techniques that make use of diffusion models in solving general IR tasks. Furthermore, we point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.
https://arxiv.org/abs/2409.10353
Implicit feedback, often used to build recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to alleviate this by identifying noisy samples based on their divergent patterns, such as higher loss values, and mitigating the noise through sample dropping or reweighting. Despite this progress, we observe that existing approaches struggle to distinguish hard samples from noisy samples, as they often exhibit similar patterns, thereby limiting their effectiveness in denoising recommendations. To address this challenge, we propose a Large Language Model Enhanced Hard Sample Denoising (LLMHD) framework. Specifically, we construct an LLM-based scorer to evaluate the semantic consistency of items with the user preference, which is quantified based on summarized historical user interactions. The resulting scores are used to assess the hardness of samples for the pointwise or pairwise training objectives. To ensure efficiency, we introduce a variance-based sample pruning strategy to filter potential hard samples before scoring. Besides, we propose an iterative preference update module designed to continuously refine the summarized user preference, which may be biased by false-positive user-item interactions. Extensive experiments on three real-world datasets and four backbone recommenders demonstrate the effectiveness of our approach.
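One plausible reading of the variance-based pruning step is sketched below: each interaction's training loss is tracked over recent epochs and only the most unstable samples are forwarded to the expensive LLM scorer. The criterion and the fraction kept are assumptions, not the LLMHD implementation.

```python
# Minimal sketch of a variance-based pruning step (an assumed criterion, not the paper's code).
import numpy as np

def select_candidates(loss_history, top_fraction=0.1):
    # loss_history: (n_epochs, n_samples) per-sample training losses from recent epochs
    variance = loss_history.var(axis=0)                  # unstable samples: hard or noisy
    k = max(1, int(top_fraction * variance.shape[0]))
    candidates = np.argsort(variance)[-k:]               # indices sent to the LLM scorer
    return candidates

# Example: 5 epochs of losses for 8 user-item interactions
rng = np.random.default_rng(0)
history = rng.random((5, 8))
print(select_candidates(history, top_fraction=0.25))
```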
https://arxiv.org/abs/2409.10343
Even though the depth maps captured by RGB-D sensors deployed in real environments are often characterized by large areas missing valid depth measurements, the vast majority of depth completion methods still assume depth values covering all areas of the scene. To address this limitation, we introduce SteeredMarigold, a training-free, zero-shot depth completion method capable of producing metric dense depth even for largely incomplete depth maps. SteeredMarigold achieves this by using the available sparse depth points as conditions to steer a denoising diffusion probabilistic model. Our method outperforms relevant top-performing methods on the NYUv2 dataset in tests where no depth was provided for a large area, achieving state-of-the-art performance and exhibiting remarkable robustness against depth map incompleteness. Our code will be publicly available.
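A minimal sketch of steering a depth-diffusion sampler with sparse measurements is given below; the guidance rule (re-imposing the known metric depth on the current estimate after each reverse step) and the `denoise_step` interface are illustrative assumptions rather than the exact SteeredMarigold procedure.

```python
# Minimal sketch: sparse metric depth steers each reverse-diffusion step (illustrative guidance rule).
import torch

@torch.no_grad()
def steered_sampling(denoise_step, timesteps, sparse_depth, valid_mask, guidance=1.0):
    # sparse_depth: (1, 1, H, W) metric depth, defined only where valid_mask is True
    x = torch.randn_like(sparse_depth)                    # start from pure noise
    for t in timesteps:                                   # e.g. reversed(range(T))
        x = denoise_step(x, t)                            # one reverse-diffusion update
        # steer: pull the estimate toward the sparse observations where they exist
        x = torch.where(valid_mask, (1 - guidance) * x + guidance * sparse_depth, x)
    return x                                              # dense metric depth estimate
```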
https://arxiv.org/abs/2409.10202
Kernel image regression methods have been shown to provide excellent efficiency in many image processing tasks, such as image and light-field compression, Gaussian Splatting, denoising, and super-resolution. Parameter estimation for these methods frequently employs gradient-descent iterative optimization, which poses a significant computational burden for many applications. In this paper, we introduce a novel adaptive segmentation-based initialization method targeted at optimizing Steered-Mixture-of-Experts (SMoE) gating networks and Radial-Basis-Function (RBF) networks with steering kernels. The novel initialization method allocates kernels into pre-calculated image segments. The optimal number of kernels, kernel positions, and steering parameters are derived per segment in an iterative optimization and kernel sparsification procedure. The kernel information from "local" segments is then transferred into a "global" initialization, ready for use in iterative optimization of SMoE, RBF, and related kernel image regression methods. Results show that drastic objective and subjective quality improvements are achievable compared to widely used regular grid initialization, "state-of-the-art" K-Means initialization, and previously introduced segmentation-based initialization methods, while the sparsity of the regression models is also drastically improved. For the same quality, the novel initialization yields models with around 50% fewer kernels. In addition, a significant reduction in convergence time is achieved, with overall run-time savings of up to 50%. The segmentation-based initialization strategy itself admits heavy parallel computation; in theory, it may be divided into as many tasks as there are segments in the image. With access to only four parallel GPUs, run-time savings of 50% for the initialization are already achievable.
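An illustrative variant of the segmentation-based initialization can be sketched as follows: within each pre-computed segment, kernel centers are placed by k-means on the pixel coordinates with a budget proportional to the segment area, and the per-segment kernels are then pooled into one global initialization. The budgeting rule is an assumption, not the paper's iterative optimization and sparsification procedure.

```python
# Minimal sketch of a segmentation-based kernel initialization (illustrative variant).
import numpy as np
from sklearn.cluster import KMeans

def init_kernels_from_segments(segmentation, kernels_per_pixel=1e-3):
    # segmentation: (H, W) integer label map from any off-the-shelf segmenter
    centers = []
    for label in np.unique(segmentation):
        ys, xs = np.nonzero(segmentation == label)
        coords = np.stack([ys, xs], axis=1).astype(np.float64)
        k = max(1, int(len(coords) * kernels_per_pixel))     # budget proportional to segment area
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(coords)
        centers.append(km.cluster_centers_)                  # "local" kernels of this segment
    return np.concatenate(centers, axis=0)                   # "global" initialization
```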
https://arxiv.org/abs/2409.10101
Artificial intelligence models have shown great potential in structure-based drug design, generating ligands with high binding affinities. However, existing models have often overlooked a crucial physical constraint: atoms must maintain a minimum pairwise distance to avoid separation violation, a phenomenon governed by the balance of attractive and repulsive forces. To mitigate such separation violations, we propose NucleusDiff. It models the interactions between atomic nuclei and their surrounding electron clouds by enforcing the distance constraint between the nuclei and manifolds. We quantitatively evaluate NucleusDiff using the CrossDocked2020 dataset and a COVID-19 therapeutic target, demonstrating that NucleusDiff reduces violation rate by up to 100.00% and enhances binding affinity by up to 22.16%, surpassing state-of-the-art models for structure-based drug design. We also provide qualitative analysis through manifold sampling, visually confirming the effectiveness of NucleusDiff in reducing separation violations and improving binding affinities.
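The physical constraint motivating NucleusDiff can be written as a simple penalty, sketched below: atom pairs closer than a lower bound d_min are penalized quadratically. This is only an illustration of the constraint; the paper's nucleus/electron-cloud manifold formulation is more involved, and the d_min value here is an arbitrary example.

```python
# Minimal sketch of a minimum pairwise-distance penalty (illustration of the constraint only).
import torch

def separation_violation_penalty(coords, d_min=1.2):
    # coords: (n_atoms, 3) generated atom positions; d_min is an illustrative lower bound
    dist = torch.cdist(coords, coords)                         # (n, n) pairwise distances
    mask = ~torch.eye(coords.shape[0], dtype=torch.bool)       # ignore self-distances
    violation = torch.clamp(d_min - dist[mask], min=0.0)       # positive only when atoms are too close
    return (violation ** 2).mean()

coords = (torch.randn(8, 3) * 2.0).requires_grad_()
penalty = separation_violation_penalty(coords)
penalty.backward()                                             # gradients push violating atoms apart
```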
https://arxiv.org/abs/2409.10584
Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g. molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, Energy-Based Denoising Energy Matching (EnDEM), which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to EnDEM to balance between bias and variance, yielding BEnDEM. We evaluate EnDEM and BEnDEM on a 2-dimensional 40-mode Gaussian Mixture Model (GMM) and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BEnDEM can achieve state-of-the-art performance while being more robust.
https://arxiv.org/abs/2409.09787
Current end-to-end autonomous driving methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized in a planning-oriented spirit with a fully differentiable framework, existing end-to-end driving systems without ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, owing to the rasterized scene representation learning and redundant information transmission. In this paper, we revisit the human driving behavior and propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection, tracking and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. Besides, both position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thus facilitating the training stability and convergence of the whole framework. Extensive experiments conducted on nuScenes dataset demonstrate the superior planning performance and great efficiency of DiFSD, which significantly reduces the average L2 error by 66% and collision rate by 77% than UniAD while achieving 8.2× faster running efficiency.
https://arxiv.org/abs/2409.09777
This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. The training of E1 TTS is straightforward; it does not require explicit monotonic alignment between the text and audio pairs. The inference of E1 TTS is efficient, requiring only one neural network evaluation for each utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at this http URL.
https://arxiv.org/abs/2409.09351
Dynamic and dexterous manipulation of objects presents a complex challenge, requiring the synchronization of hand motions with the trajectories of objects to achieve seamless and physically plausible interactions. In this work, we introduce ManiDext, a unified hierarchical diffusion-based framework for generating hand manipulation and grasp poses based on 3D object trajectories. Our key insight is that accurately modeling the contact correspondences between objects and hands during interactions is crucial. Therefore, we propose a continuous correspondence embedding representation that specifies detailed hand correspondences at the vertex level between the object and the hand. This embedding is optimized directly on the hand mesh in a self-supervised manner, with the distance between embeddings reflecting the geodesic distance. Our framework first generates contact maps and correspondence embeddings on the object's surface. Based on these fine-grained correspondences, we introduce a novel approach that integrates the iterative refinement process into the diffusion process during the second stage of hand pose generation. At each step of the denoising process, we incorporate the current hand pose residual as a refinement target into the network, guiding the network to correct inaccurate hand poses. Introducing residuals into each denoising step inherently aligns with the traditional optimization process, effectively merging generation and refinement into a single unified framework. Extensive experiments demonstrate that our approach can generate physically plausible and highly realistic motions for various tasks, including single and bimanual hand grasping as well as manipulating both rigid and articulated objects. Code will be available for research purposes.
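The residual-in-the-loop idea can be sketched as follows, with `denoise_step` standing in for a hypothetical conditional network interface: at every denoising step the change of the pose estimate since the previous step is fed back as a refinement target, so generation and iterative refinement share one loop. Everything here, including the pose dimensionality, is an illustrative assumption rather than the ManiDext implementation.

```python
# Minimal sketch: feeding the pose residual back into every denoising step (assumed interface).
import torch

@torch.no_grad()
def residual_guided_sampling(denoise_step, timesteps, object_traj, correspondence, pose_dim=99):
    # denoise_step(pose, t, object_traj, correspondence, residual) -> refined pose estimate
    pose = torch.randn(1, pose_dim)              # start from a noisy hand-pose vector
    prev = torch.zeros_like(pose)
    for t in timesteps:                          # e.g. reversed(range(T))
        residual = pose - prev                   # refinement target from the previous step
        prev = pose
        pose = denoise_step(pose, t, object_traj, correspondence, residual)
    return pose                                  # final hand pose for this frame
```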
https://arxiv.org/abs/2409.09300