Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at this https URL.
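To make the sparse-cell idea concrete, here is a minimal PyTorch sketch of pulling image features only at a sparse set of BeV cells; the function name, the precomputed camera projection, and the toy shapes are illustrative assumptions, not the paper's Sparse Feature Pulling module.

```python
import torch
import torch.nn.functional as F

def pull_sparse_bev_features(img_feats, cam_uv, valid):
    """img_feats: (1, C, H, W) image feature map.
    cam_uv: (N, 2) normalized image coords in [-1, 1] for N sparse BeV cells.
    valid: (N,) bool mask for cells that project inside the image.
    Returns (N, C) features, zeros where the projection is invalid."""
    n, c = cam_uv.shape[0], img_feats.shape[1]
    out = torch.zeros(n, c)
    if valid.any():
        grid = cam_uv[valid].view(1, 1, -1, 2)             # (1, 1, M, 2)
        sampled = F.grid_sample(img_feats, grid, align_corners=False)
        out[valid] = sampled.view(c, -1).t()               # (M, C)
    return out

# Only 3 cells of a 200x200 grid (40k cells) are queried here.
feats = torch.randn(1, 64, 28, 60)
uv = torch.tensor([[-0.2, 0.1], [0.5, -0.7], [0.9, 0.9]])
mask = torch.tensor([True, True, False])
print(pull_sparse_bev_features(feats, uv, mask).shape)     # torch.Size([3, 64])
```

Because compute scales with the number of queried cells rather than the grid size, memory can be capped by choosing how many cells to evaluate.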
https://arxiv.org/abs/2312.00703
Accurate image reconstruction is at the heart of diagnostics in medical imaging. Supervised deep learning-based approaches have been investigated for solving inverse problems including image reconstruction. However, these trained models encounter unseen data distributions during deployment that are widely shifted from the training data. Therefore, it is essential to assess whether a given input falls within the training data distribution for diagnostic purposes. Uncertainty estimation approaches exist but focus on providing an uncertainty map to radiologists, rather than assessing fit to the training distribution. In this work, we propose a method based on a local Lipschitz metric to distinguish out-of-distribution images from in-distribution images, achieving an area under the curve of 99.94%. Empirically, we demonstrate a very strong relationship between the local Lipschitz value and the mean absolute error (MAE), supported by a high Spearman's rank correlation coefficient of 0.8475, which determines the uncertainty estimation threshold for optimal model performance. Through the identification of false positives, the local Lipschitz and MAE relationship was used to guide data augmentation and reduce model uncertainty. Our study was validated using the AUTOMAP architecture for sensor-to-image Magnetic Resonance Imaging (MRI) reconstruction. We compare our proposed approach with the baseline methods Monte-Carlo dropout and deep ensembles, and further analyses include MRI denoising and Computed Tomography (CT) sparse-to-full view reconstruction using UNET architectures. We show that our approach is applicable to various architectures and learned functions, especially in the realm of medical image reconstruction, where preserving the diagnostic accuracy of reconstructed images remains paramount.
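As a rough illustration of the core quantity, the following sketch probes a local Lipschitz value around an input by Monte-Carlo perturbation; the sampling scheme and hyperparameters are assumptions for illustration, not the paper's exact estimator.

```python
import torch

def local_lipschitz(model, x, eps=1e-3, n_probes=32):
    """Largest observed output-change / input-change ratio near x."""
    model.eval()
    with torch.no_grad():
        y = model(x)
        ratios = []
        for _ in range(n_probes):
            delta = eps * torch.randn_like(x)      # small local perturbation
            y_pert = model(x + delta)
            ratios.append((y_pert - y).norm() / delta.norm())
        return max(ratios).item()

# Usage with any reconstruction network mapping sensor data to images:
net = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.Tanh(),
                          torch.nn.Linear(256, 128))
x_in = torch.randn(1, 128)
print(local_lipschitz(net, x_in))  # larger values suggest out-of-distribution inputs
```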
https://arxiv.org/abs/2305.07618
Despite the prevalence of pretrained language models in natural language understanding tasks, understanding lengthy text such as documents is still challenging due to the data sparseness problem. Inspired by the fact that humans develop their ability to understand lengthy text by reading shorter text, we propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification. We first obtain easy-to-learn examples for the target document classification task by summarizing the input of the original training examples, while optionally merging the original labels to conform to the summarized input. We then use the generated pseudo examples to perform curriculum learning. Experimental results on two datasets confirm the advantage of our method over existing baseline methods in terms of robustness and accuracy. We release our code and data at this https URL.
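A minimal sketch of the two-stage recipe, assuming a placeholder summarizer and a user-supplied `train_epoch` routine (both hypothetical stand-ins for the paper's components):

```python
def summarize(text: str) -> str:
    # hypothetical stand-in for an abstractive summarizer (e.g., a seq2seq model)
    return " ".join(text.split()[:50])

def curriculum_train(model, train_set, train_epoch, stages=(1, 3)):
    """train_set: list of (document, label); train_epoch(model, data) trains one
    epoch in place. Labels are kept as-is; the optional label merging for
    summarized inputs is omitted here."""
    pseudo = [(summarize(doc), label) for doc, label in train_set]
    for _ in range(stages[0]):      # stage 1: easy, summarized pseudo examples
        train_epoch(model, pseudo)
    for _ in range(stages[1]):      # stage 2: original long documents
        train_epoch(model, train_set)
    return model
```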
https://arxiv.org/abs/2312.00513
Re-localizing a camera from a single image in a previously mapped area is vital for many computer vision applications in robotics and augmented/virtual reality. In this work, we address the problem of estimating the 6 DoF camera pose relative to a global frame from a single image. We propose to leverage a novel network of relative spatial and temporal geometric constraints to guide the training of a deep network for localization. We simultaneously employ spatial and temporal relative pose constraints, obtained not only from adjacent camera frames but also from camera frames that are distant in the spatio-temporal space of the scene. We show that our method, through these constraints, is capable of learning to localize when little or very sparse ground-truth 3D coordinate data is available; in our experiments, this is less than 1% of the available ground-truth data. We evaluate our method on 3 common visual localization datasets and show that it outperforms other direct pose estimation methods.
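For intuition, a minimal sketch of one such relative constraint, shown for translation only; the rotation term would be handled analogously (e.g., with quaternion differences), and the function name is illustrative:

```python
import torch

def relative_translation_loss(t_i, t_j, t_rel_ij):
    """t_i, t_j: (3,) predicted global translations of frames i and j.
    t_rel_ij: (3,) relative translation between the frames, obtained from
    odometry/SfM rather than dense ground-truth 3D coordinates."""
    return torch.norm((t_j - t_i) - t_rel_ij)

# Frames i and j may be adjacent or far apart in the spatio-temporal space:
loss = relative_translation_loss(torch.zeros(3), torch.ones(3),
                                 torch.tensor([1.0, 1.0, 0.9]))
print(loss)  # penalizes disagreement with the known relative motion
```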
https://arxiv.org/abs/2312.00500
Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website: this https URL.
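A minimal sketch of the geometric growth step behind Gaussian Unpooling, under the simplifying assumption that new centers are seeded at midpoints toward nearest neighbors; the actual method also initializes scales/opacities and selects the most representative locations:

```python
import torch

def unpool_centers(centers):
    """centers: (N, 3) Gaussian means. Returns (2N, 3) densified means."""
    d = torch.cdist(centers, centers)               # (N, N) pairwise distances
    d.fill_diagonal_(float("inf"))                  # exclude self-matches
    nn_idx = d.argmin(dim=1)                        # nearest neighbor per point
    midpoints = 0.5 * (centers + centers[nn_idx])   # grow toward neighbors
    return torch.cat([centers, midpoints], dim=0)

# Starting from sparse SfM points, each unpooling round doubles the count:
print(unpool_centers(torch.randn(100, 3)).shape)    # torch.Size([200, 3])
```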
https://arxiv.org/abs/2312.00451
Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth, even though it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower-FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
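A block-diagonal MLP over channels is equivalent to a grouped 1x1 convolution, which the sketch below uses; the parallel branch is a simplified stand-in for CCA (the exact covariance attention is not reproduced), and `alpha` models the contribution that decays to zero so the branch can be dropped at inference:

```python
import torch
import torch.nn as nn

class BlockDiagonalFFN(nn.Module):
    def __init__(self, dim=192, expansion=8, groups=4):
        super().__init__()
        hidden = dim * expansion
        # groups > 1 makes the weight block-diagonal, allowing a larger
        # expansion ratio at the same parameter budget as a dense MLP.
        self.up = nn.Conv1d(dim, hidden, 1, groups=groups)
        self.down = nn.Conv1d(hidden, dim, 1, groups=groups)
        self.act = nn.GELU()
        self.alpha = 1.0  # decayed toward 0 in training, branch dropped at inference

    def mix_branch(self, y):
        # simplified stand-in for CCA: lets information cross channel groups
        return y.mean(dim=1, keepdim=True).expand_as(y)

    def forward(self, x):                  # x: (B, C, N), tokens channels-first
        y = self.down(self.act(self.up(x)))
        return y + self.alpha * self.mix_branch(y)

ffn = BlockDiagonalFFN()
print(ffn(torch.randn(2, 192, 196)).shape)  # torch.Size([2, 192, 196])
```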
https://arxiv.org/abs/2312.00412
3D Morphable Models (3DMMs) provide promising 3D face reconstructions in various applications. However, existing methods struggle to reconstruct faces with extreme expressions due to deficiencies in supervisory signals, such as sparse or inaccurate landmarks. Segmentation information contains effective geometric contexts for face reconstruction. Certain attempts intuitively depend on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation, which is prone to issues like local optima and gradient instability. In this paper, we fully utilize the facial part segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). Specifically, PRDL transforms facial part segmentation into 2D points and re-projects the reconstruction onto the image plane. Subsequently, by introducing grid anchors and computing different statistical distances from these anchors to the point sets, PRDL establishes geometry descriptors to optimize the distribution of the point sets for face reconstruction. PRDL exhibits a clear gradient compared to the renderer-based methods and presents state-of-the-art reconstruction performance in extensive quantitative and qualitative experiments. The project will be publicly available.
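A minimal sketch of the descriptor idea, assuming the minimum distance as the statistical distance and a uniform anchor grid; PRDL combines several such statistics, which are omitted here:

```python
import torch

def anchor_descriptor(points, anchors):
    """points: (N, 2) 2D point set; anchors: (A, 2) grid anchors.
    Returns (A,): the min distance from each anchor to the point set."""
    return torch.cdist(anchors, points).min(dim=1).values

def prdl_sketch(proj_points, seg_points, anchors):
    # compare descriptors of the re-projected reconstruction vs. segmentation
    return (anchor_descriptor(proj_points, anchors)
            - anchor_descriptor(seg_points, anchors)).abs().mean()

gx, gy = torch.meshgrid(torch.linspace(0, 1, 8), torch.linspace(0, 1, 8),
                        indexing="ij")
anchors = torch.stack([gx.flatten(), gy.flatten()], dim=1)   # 64 grid anchors
loss = prdl_sketch(torch.rand(500, 2), torch.rand(480, 2), anchors)
print(loss)  # differentiable without rendering silhouettes
```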
https://arxiv.org/abs/2312.00311
The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few-shot settings, similar to NeRF, 3DGS tends to overfit to the training views, causing background collapse and excessive floaters, especially as the number of training views is reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360 scenes from sparse training views. We find that using naive depth priors is not sufficient and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.
https://arxiv.org/abs/2312.00206
This paper presents our work for the Violence Inciting Text Detection shared task in the First Workshop on Bangla Language Processing. Social media has accelerated the propagation of hate and violence-inciting speech in society. It is essential to develop efficient mechanisms to detect and curb the propagation of such texts. The problem of detecting violence-inciting texts is further exacerbated in low-resource settings due to sparse research and limited data. The data provided in the shared task consists of texts in the Bangla language, where each example is classified into one of three categories defined based on the types of violence-inciting texts. We evaluate several BERT-based models and then use an ensemble of the models as our final submission. Our submission is ranked 10th in the final leaderboard of the shared task with a macro F1 score of 0.737.
https://arxiv.org/abs/2311.18778
Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark.
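A minimal sketch of a masked low-rank adapter in this spirit, using a soft sigmoid mask produced from low-rank factors; STAMINA's hard-attention masking (straight-through) and learned MLP tokens are omitted, and all names are illustrative. The key property shown is that the update folds back into a single weight after training:

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    def __init__(self, d_in=768, d_out=768, rank=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.weight.requires_grad = False                 # frozen base weight
        self.A = nn.Parameter(torch.zeros(d_out, rank))   # low-rank update
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.02)
        self.mA = nn.Parameter(torch.randn(d_out, rank) * 0.02)  # low-rank mask
        self.mB = nn.Parameter(torch.randn(rank, d_in) * 0.02)

    def folded_weight(self):
        mask = torch.sigmoid(self.mA @ self.mB)   # in (0, 1); gates the update
        # after training this is a single dense matrix: no extra inference cost
        return self.weight + mask * (self.A @ self.B)

    def forward(self, x):
        return x @ self.folded_weight().t()

layer = MaskedLoRALinear()
print(layer(torch.randn(2, 768)).shape)           # torch.Size([2, 768])
```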
https://arxiv.org/abs/2311.18763
Modeling dynamic, large-scale urban scenes is challenging due to their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors, separating static and dynamic elements, resulting in suboptimal capture of their synergistic interactions. To address this challenge, we present a unified representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon the efficient 3D Gaussian splatting technique, originally designed for static scene representation, by introducing periodic vibration-based temporal dynamics. This innovation enables PVG to elegantly and uniformly represent the characteristics of various objects and elements in dynamic urban scenes. To enhance temporally coherent representation learning with sparse training data, we introduce a novel flow-based temporal smoothing mechanism and a position-aware adaptive control strategy. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes. Notably, PVG achieves this without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG exhibits 50/6000-fold acceleration in training/rendering over the best alternative.
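A minimal sketch of a periodically vibrating Gaussian center: the mean becomes a periodic function of time, with static content as the near-zero-amplitude special case. Symbols and shapes are illustrative assumptions:

```python
import math
import torch

def vibrating_mean(mu0, amplitude, t, t0, period):
    """mu0, amplitude: (N, 3) rest position and vibration amplitude;
    t: scalar time; t0, period: (N, 1) per-Gaussian phase and period."""
    phase = 2.0 * math.pi * (t - t0) / period
    return mu0 + amplitude * torch.sin(phase)     # periodic motion of each center

n = 1024
mu_t = vibrating_mean(torch.randn(n, 3), 0.1 * torch.randn(n, 3),
                      t=0.5, t0=torch.rand(n, 1), period=torch.rand(n, 1) + 0.5)
print(mu_t.shape)  # torch.Size([1024, 3])
```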
https://arxiv.org/abs/2311.18561
We propose SparseDC, a model for Depth Completion of Sparse and non-uniform depth inputs. Unlike previous methods focusing on completing fixed distributions on benchmark datasets (e.g., NYU with 500 points, KITTI with 64 lines), SparseDC is specifically designed to handle depth maps with poor quality in real usage. The key contributions of SparseDC are two-fold. First, we design a simple strategy, called SFFM, to improve the robustness under sparse input by explicitly filling the unstable depth features with stable image features. Second, we propose a two-branch feature embedder to predict both the precise local geometry of regions with available depth values and accurate structures in regions with no depth. The key of the embedder is an uncertainty-based fusion module called UFFM to balance the local and long-term information extracted by CNNs and ViTs. Extensive indoor and outdoor experiments demonstrate the robustness of our framework when facing sparse and non-uniform input depths. The pre-trained model and code are available at this https URL.
https://arxiv.org/abs/2312.00097
Recent approaches for semantic correspondence have focused on obtaining high-quality correspondences using a complicated network, refining the ambiguous or noisy matching points. Despite their performance improvements, they remain constrained by the limited training pairs due to costly point-level annotations. This paper proposes a simple yet effective method that performs training with unlabeled pairs to complement both limited image pairs and sparse point pairs, requiring neither extra labeled keypoints nor trainable modules. We fundamentally extend the data quantity and variety by augmenting new unannotated pairs not primitively provided as training pairs in benchmarks. Using a simple teacher-student framework, we offer reliable pseudo correspondences to the student network via machine supervision. Finally, the performance of our network is steadily improved by the proposed iterative training, putting back the student as a teacher to generate refined labels and train a new student repeatedly. Our models outperform the milestone baselines, including state-of-the-art methods on semantic correspondence benchmarks.
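A minimal sketch of the iterative loop, with `predict` and `train` as hypothetical stand-ins for pseudo-correspondence generation (including reliability filtering) and student training:

```python
def iterative_training(teacher, unlabeled_pairs, predict, train, rounds=3):
    """predict(model, image_pair) -> reliable pseudo correspondences;
    train(labeled_data) -> a newly trained student model."""
    for _ in range(rounds):
        pseudo = [(pair, predict(teacher, pair)) for pair in unlabeled_pairs]
        student = train(pseudo)   # machine supervision only, no extra keypoints
        teacher = student         # put the student back as the teacher
    return teacher
```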
https://arxiv.org/abs/2311.18540
Two-tower models are a prevalent matching framework for recommendation and have been widely deployed in industrial applications. The success of two-tower matching is attributable to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers from two main challenges: limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. Besides, we design a discrete code-based sparse inverted index, jointly trained with the model, to achieve effective and efficient model inference. Extensive experiments have been conducted on open benchmark datasets to demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency as two-tower models. Our source code will be available at MindSpore/models.
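A minimal sketch of serving with a discrete-code sparse inverted index: a query activates a few codes, and only items posted under those codes are scored by the expensive interaction model. The quantizer and scorer here are stand-ins, not SparCode's modules:

```python
from collections import defaultdict

def build_inverted_index(item_codes):
    """item_codes: dict item_id -> list of discrete codes from a quantizer."""
    index = defaultdict(set)
    for item, codes in item_codes.items():
        for code in codes:
            index[code].add(item)   # posting list per discrete code
    return index

def retrieve(query_codes, index, score_fn, top_k=2):
    # gather only the items sharing a code with the query, then rank them
    candidates = set().union(*(index[c] for c in query_codes if c in index))
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]

index = build_inverted_index({"a": [1, 7], "b": [7], "c": [3]})
scores = {"a": 0.9, "b": 0.5, "c": 0.1}   # stand-in for the interaction model
print(retrieve([7], index, score_fn=scores.get))  # ['a', 'b']
```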
https://arxiv.org/abs/2311.18213
Sparse LiDAR point clouds cause severe loss of detail of static structures and reduce the density of static points available for navigation. Reduced density can be detrimental to navigation under several scenarios. We observe that despite high sparsity, in most cases, the global topology of LiDAR outlining the static structures can be inferred. We utilize this property to obtain a backbone skeleton of a static LiDAR scan in the form of a single connected component that is a proxy to its global topology. We utilize the backbone to augment new points along static structures to overcome sparsity. Newly introduced points could correspond to existing static structures or to static points that were earlier obstructed by dynamic objects. To the best of our knowledge, we are the first to use this strategy for sparse LiDAR point clouds. Existing solutions close to our approach fail to identify and preserve the global static LiDAR topology and generate sub-optimal points. We propose GLiDR, a Graph Generative network that is topologically regularized using 0-dimensional Persistent Homology (PH) constraints. This enables GLiDR to introduce newer static points along a topologically consistent global static LiDAR backbone. GLiDR generates precise static points using 32x sparser dynamic scans and performs better than the baselines across three datasets. The newly introduced static points allow GLiDR to outperform LiDAR-based navigation using SLAM in several settings. GLiDR generates a valuable byproduct - an accurate binary segmentation mask of static and dynamic objects that is helpful for navigation planning and safety in constrained environments.
https://arxiv.org/abs/2312.00068
Volumetric phenomena, such as clouds and fog, present a significant challenge for 3D reconstruction systems due to their translucent nature and their complex interactions with light. Conventional techniques for reconstructing scattering volumes rely on controlled setups, limiting practical applications. This paper introduces an approach to reconstructing volumes from a few input stereo pairs. We propose a novel deep learning framework that integrates a deep stereo model with a 3D Convolutional Neural Network (3D CNN) and an advection module, capable of capturing the shape and dynamics of volumes. The stereo depths are used to carve empty space around volumes, providing the 3D CNN with a prior for coping with the lack of input views. To refine our output, the advection module leverages the temporal evolution of the medium, providing a mechanism to infer motion and improve temporal consistency. The efficacy of our system is demonstrated through its ability to estimate the density and velocity fields of large-scale volumes, in this case clouds, from a sparse set of stereo image pairs.
https://arxiv.org/abs/2311.17657
Embedding graphs in continuous spaces is a key factor in designing and developing algorithms for automatic information extraction to be applied in diverse tasks (e.g., learning, inferring, predicting). The reliability of graph embeddings directly depends on how well the geometry of the continuous space matches the graph structure. Manifolds are mathematical structures whose topological spaces can incorporate graph characteristics, in particular node distances. State-of-the-art manifold-based graph embedding algorithms take advantage of the assumption that the projection onto the tangent space of each point in the manifold (corresponding to a node in the graph) locally resembles a Euclidean space. Although this condition helps in achieving efficient analytical solutions to the embedding problem, it does not represent an adequate set-up for working with modern real-life graphs, which are characterized by weighted connections across nodes often computed over sparse datasets with missing records. In this work, we introduce a new class of manifolds, named soft manifolds, that can address this situation. In particular, soft manifolds are mathematical structures with spherical symmetry where the tangent spaces to each point are hypocycloids whose shape is defined according to the velocity of information propagation across the data points. Using soft manifolds for graph embedding, we can provide continuous spaces to pursue any task in data analysis over complex datasets. Experimental results on reconstruction tasks on synthetic and real datasets show how the proposed approach enables more accurate and reliable characterization of graphs in continuous spaces with respect to the state of the art.
https://arxiv.org/abs/2311.17598
In goal-conditioned reinforcement learning (GCRL), sparse rewards present significant challenges, often obstructing efficient learning. Although multi-step GCRL can boost this efficiency, it can also lead to off-policy biases in target values. This paper dives deep into these biases, categorizing them into two distinct categories: "shooting" and "shifting". Recognizing that certain behavior policies can hasten policy refinement, we present solutions designed to capitalize on the positive aspects of these biases while minimizing their drawbacks, enabling the use of larger step sizes to speed up GCRL. An empirical study demonstrates that our approach ensures a resilient and robust improvement, even in ten-step learning scenarios, leading to superior learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL benchmarks.
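For reference, a minimal sketch of the n-step target at issue: with larger n, more of the target depends on the behavior policy's actions, which is where the off-policy biases the paper categorizes can enter:

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """rewards: r_t .. r_{t+n-1} from the behavior policy's rollout;
    bootstrap_value: the critic's V(s_{t+n}, goal)."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g       # G = sum_i gamma^i * r_{t+i} + gamma^n * V
    return g

# A sparse goal reward arriving at the 10th step still propagates in one update:
print(n_step_target([0.0] * 9 + [1.0], bootstrap_value=0.0))  # ~0.914
```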
https://arxiv.org/abs/2311.17565
Convolutional Neural Networks (CNNs) are hard to deploy on edge devices due to their high computation and storage complexity. As a common practice for model compression, network pruning falls into two major categories, unstructured and structured pruning, where unstructured pruning consistently performs better. However, unstructured pruning presents a structured pattern at high pruning rates, which limits its performance. To this end, we propose a Rank-based PruninG (RPG) method to maintain the ranks of sparse weights in an adversarial manner. In each step, we minimize the low-rank approximation error for the weight matrices using singular value decomposition and maximize their distance by pushing the weight matrices away from their low-rank approximations. This rank-based optimization objective guides sparse weights towards a high-rank topology. The proposed method is applied in a gradual pruning fashion to stabilize the change of rank during training. Experimental results on various datasets and different tasks demonstrate the effectiveness of our algorithm at high sparsity. The proposed RPG outperforms the state of the art by 1.13% top-1 accuracy on ImageNet with ResNet-50 at 98% sparsity. The codes are available at this https URL and this https URL.
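A minimal sketch of the rank-based penalty, assuming `torch.svd_lowrank` for the best rank-k approximation; the adversarial schedule and the gradual magnitude pruning are elided:

```python
import torch

def high_rank_penalty(weight, k=8):
    """Negative distance to the best rank-k approximation of `weight`;
    adding this to the task loss pushes the weight away from low-rank
    structure, i.e. toward a high-rank topology."""
    u, s, v = torch.svd_lowrank(weight, q=k)          # truncated SVD
    low_rank = u @ torch.diag(s) @ v.t()              # best rank-k reconstruction
    return -torch.norm(weight - low_rank)

w = torch.randn(256, 512)
print(high_rank_penalty(w).item())  # more negative = already higher-rank
```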
https://arxiv.org/abs/2311.17493
LiDAR point cloud semantic segmentation enables the robots to obtain fine-grained semantic information of the surrounding environment. Recently, many works project the point cloud onto the 2D image and adopt the 2D Convolutional Neural Networks (CNNs) or vision transformer for LiDAR point cloud semantic segmentation. However, since more than one point can be projected onto the same 2D position but only one point can be preserved, the previous 2D image-based segmentation methods suffer from inevitable quantized information loss. To avoid quantized information loss, in this paper, we propose a novel spherical frustum structure. The points projected onto the same 2D position are preserved in the spherical frustums. Moreover, we propose a memory-efficient hash-based representation of spherical frustums. Through the hash-based representation, we propose the Spherical Frustum sparse Convolution (SFC) and Frustum Fast Point Sampling (F2PS) to convolve and sample the points stored in spherical frustums respectively. Finally, we present the Spherical Frustum sparse Convolution Network (SFCNet) to adopt 2D CNNs for LiDAR point cloud semantic segmentation without quantized information loss. Extensive experiments on the SemanticKITTI and nuScenes datasets demonstrate that our SFCNet outperforms the 2D image-based semantic segmentation methods based on conventional spherical projection. The source code will be released later.
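A minimal sketch of the frustum bookkeeping: points are keyed by their spherical-projection pixel, and every point sharing a pixel is kept (in a hash map of lists) rather than only one, avoiding the quantization loss. The projection constants are toy values:

```python
from collections import defaultdict
import math

def spherical_frustums(points, width=2048, height=64, fov_up=3.0, fov_down=-25.0):
    """points: iterable of (x, y, z). Returns dict (u, v) pixel -> all points
    projecting to it; nothing is discarded when pixels collide."""
    frustums = defaultdict(list)
    fu, fd = math.radians(fov_up), math.radians(fov_down)
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        u = int((0.5 * (1.0 - math.atan2(y, x) / math.pi)) * width) % width
        v = int((1.0 - (math.asin(z / r) - fd) / (fu - fd)) * height)
        frustums[(u, min(max(v, 0), height - 1))].append((x, y, z))
    return frustums

f = spherical_frustums([(5.0, 1.0, -0.5), (10.0, 2.0, -1.0), (10.1, 2.0, -1.0)])
print(sum(len(pts) for pts in f.values()))   # 3: no point was dropped
```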
https://arxiv.org/abs/2311.17491