DEtection TRansformer (DETR) started a trend of using a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in sparse point clouds are relatively small compared to the whole scene and often have similar geometry while lacking distinctive appearance for segmentation, characteristics that are rare in the image domain. Considering that 3D instances are characterized more by their positional information, we emphasize their role during modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at this https URL.
https://arxiv.org/abs/2303.13509
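To make the MPE idea above concrete, here is a minimal sketch of a positional embedding that mixes a fixed Fourier encoding of point coordinates with a learned projection, fused into backbone features before query-based mask prediction. The module names, shapes, and the exact fixed/learned split are illustrative assumptions, not the P3Former implementation.

```python
# Hedged sketch: a mixed-parameterized positional embedding (MPE) combining a fixed,
# parameter-free Fourier encoding of point coordinates with a learned projection,
# added to backbone features before query-based mask prediction. Shapes are assumed.
import torch
import torch.nn as nn

class MixedPositionalEmbedding(nn.Module):
    def __init__(self, dim=128, num_freqs=8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))  # fixed part
        self.proj = nn.Linear(3 * 2 * num_freqs, dim)                  # learned part

    def forward(self, xyz):                  # xyz: (N, 3) point coordinates
        x = xyz[..., None] * self.freqs      # (N, 3, F)
        enc = torch.cat([x.sin(), x.cos()], dim=-1).flatten(1)  # (N, 6F)
        return self.proj(enc)                # (N, dim)

def predict_masks(queries, feats, pos_emb):
    # Position-aware mask prediction: queries attend to position-embedded features.
    # queries: (Q, dim), feats: (N, dim), pos_emb: (N, dim) -> (Q, N) mask logits
    return queries @ (feats + pos_emb).T

mpe = MixedPositionalEmbedding()
xyz, feats = torch.rand(1000, 3), torch.rand(1000, 128)
queries = torch.rand(32, 128)
print(predict_masks(queries, feats, mpe(xyz)).shape)  # torch.Size([32, 1000])
```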
Diffusion models are gaining increasing popularity for their generative capabilities. Recently, there has been a surge of demand for generating customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior": real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior. 2) The relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
https://arxiv.org/abs/2303.13495
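As one reading of the relation-focal importance sampling above, the sketch below skews the diffusion timestep distribution so that training emphasizes steps that shape high-level layout and interaction rather than low-level texture. The specific weighting function is an assumption for illustration.

```python
# Hedged sketch: relation-focal importance sampling over diffusion timesteps.
# Timesteps are drawn non-uniformly so that steps governing high-level structure
# receive more training signal; the skew function below is assumed, not the paper's.
import numpy as np

def sample_timesteps(batch_size, T=1000, gamma=2.0, rng=np.random.default_rng(0)):
    t = np.arange(1, T + 1)
    weights = (t / T) ** gamma          # assumed: weight later (noisier) steps more
    probs = weights / weights.sum()
    return rng.choice(t, size=batch_size, p=probs)

print(sample_timesteps(8))  # mostly large timestep values
```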
We study the problem of object retrieval in scenarios where visual sensing is absent, object shapes are unknown beforehand and objects can move freely, like grabbing objects out of a drawer. Successful solutions require localizing free objects, identifying specific object instances, and then grasping the identified objects, only using touch feedback. Unlike vision, where cameras can observe the entire scene, touch sensors are local and only observe parts of the scene that are in contact with the manipulator. Moreover, information gathering via touch sensors necessitates applying forces on the touched surface which may disturb the scene itself. Reasoning with touch, therefore, requires careful exploration and integration of information over time -- a challenge we tackle. We present a system capable of using sparse tactile feedback from fingertip touch sensors on a dexterous hand to localize, identify and grasp novel objects without any visual feedback. Videos are available at this https URL.
https://arxiv.org/abs/2303.13482
This work presents a novel RGB-D-inertial dynamic SLAM method that can enable accurate localisation when the majority of the camera view is occluded by multiple dynamic objects over a long period of time. Most dynamic SLAM approaches either remove dynamic objects as outliers when they account for a minor proportion of the visual input, or detect dynamic objects using semantic segmentation before camera tracking. Therefore, dynamic objects that cause large occlusions are difficult to detect without prior information. The remaining visual information from the static background is also not enough to support localisation when the large occlusion lasts for a long period. To overcome these problems, our framework presents a robust visual-inertial bundle adjustment that simultaneously tracks the camera, estimates cluster-wise dense segmentation of dynamic objects, and maintains a static sparse map by combining dense and sparse features. The experimental results demonstrate that our method achieves promising localisation and object segmentation performance compared to other state-of-the-art methods in the scenario of long-term large occlusion.
https://arxiv.org/abs/2303.13316
Parameter-efficient transfer learning with adapters has been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottleneck layers added between transformer layers while keeping the large pretrained language models (PLMs) frozen. Despite showing promising results in NLP, these methods are under-explored in Information Retrieval. While previous studies have only experimented with dense retrievers or in a cross-lingual retrieval scenario, in this paper we aim to complete the picture on the use of adapters in IR. First, we study adapters for SPLADE, a sparse retriever, for which adapters not only retain the efficiency and effectiveness otherwise achieved by fine-tuning, but are also memory-efficient and orders of magnitude lighter to train. We observe that Adapters-SPLADE not only optimizes just 2% of training parameters, but also outperforms its fully fine-tuned counterpart and existing parameter-efficient dense IR models on IR benchmark datasets. Secondly, we address domain adaptation of neural retrieval thanks to adapters on cross-domain BEIR datasets and TripClick. Finally, we also consider knowledge sharing between rerankers and first-stage rankers. Overall, our study completes the examination of adapters for neural IR.
https://arxiv.org/abs/2303.13220
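A minimal sketch of the adapter idea discussed above: a small bottleneck layer with a residual connection, trained while the pretrained language model stays frozen. The Houlsby-style placement and dimensions are common defaults, not necessarily the exact Adapters-SPLADE configuration.

```python
# Hedged sketch of a standard bottleneck adapter inserted after a frozen transformer
# sub-layer; dimensions and activation follow common practice, assumed for illustration.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):                                 # x: (batch, seq, hidden)
        return x + self.up(self.act(self.down(x)))        # residual bottleneck

# Only adapter parameters are trained; the pretrained language model stays frozen.
plm_layer_out = torch.rand(2, 16, 768)
adapter = Adapter()
out = adapter(plm_layer_out)
trainable = sum(p.numel() for p in adapter.parameters())
print(out.shape, trainable)   # a small fraction of the PLM's parameter count
```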
Deep Neural Networks use thousands of mostly incomprehensible features to identify a single class, a decision no human can follow. We propose an interpretable, sparse, and low-dimensional final decision layer in a deep neural network, with measurable aspects of interpretability, and demonstrate it on fine-grained image classification. We argue that a human can only understand the decision of a machine learning model if the features are interpretable and only very few of them are used for a single decision. To this end, the final layer has to be sparse and, to make interpreting the features feasible, low-dimensional. We call a model with a Sparse Low-Dimensional Decision layer an SLDD-Model. We show that an SLDD-Model is easier to interpret locally and globally than a dense high-dimensional decision layer while being able to maintain competitive accuracy. Additionally, we propose a loss function that improves a model's feature diversity and accuracy. Our more interpretable SLDD-Model uses only 5 out of just 50 features per class, while maintaining 97% to 100% of the accuracy on four common benchmark datasets compared to the baseline model with 2048 features.
https://arxiv.org/abs/2303.13166
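A minimal sketch of a sparse, low-dimensional decision head in the spirit of the SLDD-Model above: the backbone output is reduced to a small feature space (e.g., 50 features) and each class keeps only its top-k weights (e.g., 5). The magnitude-based pruning used here is an illustrative assumption; the paper's sparsification may differ.

```python
# Hedged sketch: a low-dimensional final decision layer whose per-class weights are
# pruned to the k largest magnitudes, so each class decision uses only a few features.
import torch
import torch.nn as nn

class SparseLowDimHead(nn.Module):
    def __init__(self, backbone_dim=2048, num_features=50, num_classes=10, k=5):
        super().__init__()
        self.reduce = nn.Linear(backbone_dim, num_features)   # low-dimensional features
        self.classify = nn.Linear(num_features, num_classes)
        self.k = k

    @torch.no_grad()
    def prune(self):
        w = self.classify.weight                               # (classes, features)
        thresh = w.abs().topk(self.k, dim=1).values[:, -1:]    # k-th largest magnitude per class
        w.mul_((w.abs() >= thresh).float())                    # zero out the rest

    def forward(self, feats):
        return self.classify(torch.relu(self.reduce(feats)))

head = SparseLowDimHead()
head.prune()
print((head.classify.weight != 0).sum(dim=1))  # roughly k non-zero weights per class (exactly k barring ties)
```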
Recent advances in semi-supervised learning have significantly boosted the performance of 3D semi-supervised medical image segmentation. Compared with 2D images, 3D medical volumes contain information from different directions, e.g., the transverse, sagittal, and coronal planes, and thus naturally provide complementary views. These complementary views and the intrinsic similarity among adjacent 3D slices inspire us to develop a novel annotation scheme and a corresponding semi-supervised model for effective segmentation. Specifically, we first propose orthogonal annotation, which labels only two orthogonal slices in a labeled volume and significantly relieves the annotation burden. Then, we perform registration to obtain the initial pseudo labels for the sparsely labeled volumes. Subsequently, by introducing unlabeled volumes, we propose a dual-network paradigm named Dense-Sparse Co-training (DeSCO) that exploits dense pseudo labels in the early stage and sparse labels in the later stage while enforcing consistent outputs from the two networks. Experimental results on three benchmark datasets validate the effectiveness of our method in both segmentation performance and annotation efficiency. For example, with only 10 annotated slices, our method achieves a Dice score of up to 86.93% on the KiTS19 dataset.
https://arxiv.org/abs/2303.13090
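A hedged sketch of the dense-to-sparse supervision schedule with cross-network consistency described above: registration-based dense pseudo labels dominate early, the sparse orthogonal annotations dominate later, and the two networks are kept consistent throughout. The weighting schedule and loss terms are assumptions for illustration only.

```python
# Hedged sketch of a DeSCO-style objective: dense pseudo labels early, sparse labels
# later, plus a consistency term between the two co-trained networks. The schedule
# and exact loss terms are assumed for illustration.
import torch
import torch.nn.functional as F

def desco_loss(pred_a, pred_b, dense_pseudo, sparse_label, sparse_mask, step, total_steps):
    alpha = max(0.0, 1.0 - step / (0.5 * total_steps))            # dense supervision decays
    loss_dense = F.cross_entropy(pred_a, dense_pseudo) + F.cross_entropy(pred_b, dense_pseudo)
    # Sparse labels exist only on the two orthogonal slices (sparse_mask == 1).
    ce_a = F.cross_entropy(pred_a, sparse_label, reduction="none")
    ce_b = F.cross_entropy(pred_b, sparse_label, reduction="none")
    loss_sparse = ((ce_a + ce_b) * sparse_mask).sum() / sparse_mask.sum().clamp(min=1)
    # Consistency between the two networks on all voxels.
    loss_cons = F.mse_loss(pred_a.softmax(dim=1), pred_b.softmax(dim=1))
    return alpha * loss_dense + (1 - alpha) * loss_sparse + loss_cons

B, C, D, H, W = 1, 2, 8, 32, 32
pred_a, pred_b = torch.randn(B, C, D, H, W), torch.randn(B, C, D, H, W)
dense_pseudo = torch.randint(0, C, (B, D, H, W))
sparse_label = torch.randint(0, C, (B, D, H, W))
sparse_mask = torch.zeros(B, D, H, W); sparse_mask[:, 0] = 1      # stand-in for labeled slices
print(desco_loss(pred_a, pred_b, dense_pseudo, sparse_label, sparse_mask, step=100, total_steps=1000))
```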
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
https://arxiv.org/abs/2303.13043
In this paper, we aim to learn a semantic radiance field from multiple scenes that is accurate, efficient, and generalizable. While most existing NeRFs target the tasks of neural scene rendering, image synthesis, and multi-view reconstruction, a few attempts such as Semantic-NeRF explore learning high-level semantic understanding with the NeRF structure. However, Semantic-NeRF simultaneously learns color and semantic labels from a single ray with multiple heads, and a single ray fails to provide rich semantic information. As a result, Semantic-NeRF relies on positional encoding and needs to train one specific model for each scene. To address this, we propose Semantic Ray (S-Ray) to fully exploit semantic information along the ray direction from its multi-view reprojections. As directly performing dense attention over multi-view reprojected rays would incur a heavy computational cost, we design a Cross-Reprojection Attention module with consecutive intra-view radial and cross-view sparse attentions, which decomposes contextual information along reprojected rays and across multiple views and then collects dense connections by stacking the modules. Experiments show that our S-Ray is able to learn from multiple scenes, and it presents strong generalization ability to adapt to unseen scenes.
https://arxiv.org/abs/2303.13014
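A minimal sketch of the decomposition behind Cross-Reprojection Attention as described above: attention is first applied within each view along the reprojected ray, then across views at each sample position. The tensor layout and module sizes are illustrative assumptions rather than the S-Ray implementation.

```python
# Hedged sketch: decompose attention over multi-view reprojected rays into intra-view
# attention along each ray and cross-view attention per sample position.
import torch
import torch.nn as nn

class CrossReprojectionAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: (views, samples_along_ray, dim) features of one query ray's reprojections
        f, _ = self.intra_view(feats, feats, feats)   # attend along each reprojected ray
        f = f.permute(1, 0, 2)                        # (samples, views, dim)
        f, _ = self.cross_view(f, f, f)               # attend across views per sample
        return f.permute(1, 0, 2)                     # back to (views, samples, dim)

attn = CrossReprojectionAttention()
print(attn(torch.rand(8, 32, 64)).shape)  # torch.Size([8, 32, 64])
```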
One-to-one label assignment in object detection has successfully obviated the need for non-maximum suppression (NMS) as postprocessing and makes the pipeline end-to-end. However, it triggers a new dilemma as the widely used sparse queries cannot guarantee a high recall, while dense queries inevitably bring more similar queries and encounter optimization difficulties. As both sparse and dense queries are problematic, then what are the expected queries in end-to-end object detection? This paper shows that the solution should be Dense Distinct Queries (DDQ). Concretely, we first lay dense queries like traditional detectors and then select distinct ones for one-to-one assignments. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code can be found at \url{this https URL}.
https://arxiv.org/abs/2303.12776
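A minimal sketch of the "dense then distinct" selection step described above: dense proposals are filtered with class-agnostic IoU-based suppression so that only distinct queries enter the one-to-one assignment. The threshold, query budget, and box format are illustrative assumptions.

```python
# Hedged sketch: select distinct queries from dense proposals via class-agnostic
# IoU-based suppression before one-to-one assignment. Thresholds and the (x1, y1, x2, y2)
# box format are assumptions for illustration.
import torch
from torchvision.ops import nms

def select_distinct_queries(boxes, scores, iou_thresh=0.7, num_queries=300):
    # boxes: (N, 4) dense proposals, scores: (N,) objectness scores
    keep = nms(boxes, scores, iou_thresh)   # drop near-duplicate queries
    return keep[:num_queries]               # indices of distinct queries

boxes = torch.rand(1000, 4) * 100
boxes[:, 2:] += boxes[:, :2]                # ensure x2 > x1 and y2 > y1
scores = torch.rand(1000)
print(select_distinct_queries(boxes, scores).shape)
```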
LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and a limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance on sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding, and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both the nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve 3rd place on the nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at this https URL.
https://arxiv.org/abs/2303.12766
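A hedged sketch of the two geometric ingredients above: assigning each point to a narrow, long radial window by binning azimuth and inclination, and exponentially splitting the radial distance so near ranges get finer position-encoding bins than far ranges. Bin counts and the splitting formula are assumptions for illustration.

```python
# Hedged sketch: radial window assignment by (azimuth, inclination) bins, plus an
# exponential split of radial distance for fine-grained position encoding near the
# sensor. Bin counts and the splitting formula are assumed, not SphereFormer's exact values.
import numpy as np

def radial_windows(xyz, num_azimuth=120, num_incl=16):
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1) + 1e-6
    azimuth = np.arctan2(y, x)
    inclination = np.arcsin(np.clip(z / r, -1, 1))
    a_bin = ((azimuth + np.pi) / (2 * np.pi) * num_azimuth).astype(int) % num_azimuth
    i_bin = np.clip(((inclination + np.pi / 2) / np.pi * num_incl).astype(int), 0, num_incl - 1)
    return a_bin * num_incl + i_bin                   # window index per point

def exponential_split(r, num_bins=32, r_max=80.0):
    # finer bins near the sensor, coarser bins far away
    idx = np.log2(1.0 + np.clip(r, 0, r_max)) / np.log2(1.0 + r_max) * num_bins
    return np.clip(idx.astype(int), 0, num_bins - 1)

xyz = np.random.randn(2048, 3) * 20
print(radial_windows(xyz)[:5], exponential_split(np.linalg.norm(xyz, axis=1))[:5])
```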
This paper describes our submission to ICASSP 2023 MUG Challenge Track 4, Keyphrase Extraction, which aims to extract the keyphrases most relevant to the conference theme from conference materials. We model the challenge as a single-class Named Entity Recognition task and develop several techniques to improve performance: For data preprocessing, we encode the split keyphrases after word segmentation. In addition, we increase the amount of input information that the model can accept at one time by fusing multiple preprocessed sentences into one segment. We replace the loss function with a multi-class focal loss to address the sparseness of keyphrases. Besides, we score each appearance of a keyphrase and add an extra output layer to fit the score, which is used to rank keyphrases. Exhaustive evaluations are performed to find the best combination of word segmentation tool, pre-trained embedding model, and the corresponding hyperparameters. With these proposals, we scored 45.04 on the final test set.
https://arxiv.org/abs/2303.13463
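A minimal sketch of the multi-class focal loss mentioned above for token-level keyphrase tagging, which down-weights the abundant easy non-keyphrase tokens. The gamma value and label set are illustrative defaults, not the tuned challenge settings.

```python
# Hedged sketch of a multi-class focal loss for token classification; gamma and the
# three-way label set are assumed defaults for illustration.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (N, C) token logits, targets: (N,) class indices
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()   # hard tokens dominate the loss

logits = torch.randn(64, 3)            # e.g. O / B-KEY / I-KEY (assumed tag set)
targets = torch.randint(0, 3, (64,))
print(focal_loss(logits, targets))
```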
Conventionally, since the natural language action space is astronomical, approximate dynamic programming applied to dialogue generation involves policy improvement with action sampling. However, such a practice is inefficient for reinforcement learning (RL) because the eligible (high action value) responses are very sparse, and the greedy policy sustained by random sampling is weak. This paper shows, both theoretically and experimentally, that the performance of the dialogue policy is positively correlated with the sampling size. We introduce a novel dual-granularity Q-function to alleviate this limitation by exploring the most promising response category to intervene in the sampling. It extracts actions following the granularity hierarchy, which can achieve the optimum with fewer policy iterations. Our approach learns in an offline RL manner from multiple reward functions designed to recognize human emotional details. Empirical studies demonstrate that our algorithm outperforms the baseline methods. Further verification shows that our method can generate responses with higher expected rewards and better controllability.
https://arxiv.org/abs/2303.13465
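One plausible reading of the dual-granularity Q-function above, sketched below: a coarse head scores response categories to steer sampling toward promising regions, and a fine head scores individual candidate responses. The two-head structure and sizes are assumptions for illustration.

```python
# Hedged sketch: a coarse Q-head over response categories guides where to sample,
# and a fine Q-head scores the sampled candidates. Architecture and category count
# are assumed, not the paper's implementation.
import torch
import torch.nn as nn

class DualGranularityQ(nn.Module):
    def __init__(self, state_dim=256, resp_dim=256, num_categories=8):
        super().__init__()
        self.coarse = nn.Linear(state_dim, num_categories)   # Q over response categories
        self.fine = nn.Bilinear(state_dim, resp_dim, 1)      # Q over individual responses

    def forward(self, state, candidate_embs):
        # state: (state_dim,), candidate_embs: (num_candidates, resp_dim)
        category_q = self.coarse(state)
        expanded = state.unsqueeze(0).expand(candidate_embs.size(0), -1).contiguous()
        fine_q = self.fine(expanded, candidate_embs).squeeze(-1)
        return category_q, fine_q

model = DualGranularityQ()
cat_q, fine_q = model(torch.rand(256), torch.rand(16, 256))
print(cat_q.argmax().item(), fine_q.argmax().item())   # chosen category and best candidate
```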
Reliable localization is crucial for autonomous robots to navigate efficiently and safely. Some navigation methods can plan paths with high localizability (which describes the capability of acquiring reliable localization). By following these paths, the robot can access sensor streams that help the localization algorithms produce more accurate location estimates. However, most of these methods require prior knowledge and struggle to adapt to unseen scenarios or dynamic changes. To overcome these limitations, we propose a novel approach for localizability-enhanced navigation via deep reinforcement learning in dynamic human environments. Our proposed planner automatically extracts geometric features from 2D laser data that are helpful for localization. The planner learns to assign different importance to the geometric features and encourages the robot to navigate through areas that are helpful for laser localization. To facilitate the learning of the planner, we suggest two techniques: (1) an augmented state representation that considers the dynamic changes and the confidence of the localization results, which provides more information and allows the robot to make better decisions; (2) a reward metric that is capable of offering both sparse and dense feedback on behaviors that affect localization accuracy. Our method exhibits significant improvements in lost rate and arrival rate when tested in previously unseen environments.
https://arxiv.org/abs/2303.12354
We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework was open. We provide a decisive negative resolution to this problem, both from a computational and statistical perspective. We show that:
- Under the widely believed assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant.
- When the game is unknown, no algorithm, regardless of computational efficiency, can achieve no-regret without observing a number of episodes that is exponential in the number of players.
Perhaps surprisingly, our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute a coarse correlated equilibrium that is sparse in the sense that it can be represented as a mixture of a small number of product policies. The crux of our approach is a novel application of aggregation techniques from online learning, whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero-sum normal-form games.
https://arxiv.org/abs/2303.12287
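To make "sparse in the sense of a mixture of a small number of product policies" concrete, a k-sparse epsilon-approximate CCE can be written as below; the notation is ours and serves only to illustrate the object the lower bounds are about.

```latex
% A k-sparse epsilon-CCE (notation ours): a joint policy sigma that is a mixture of at
% most k product policies over m players, from which no player i can gain more than
% epsilon by a unilateral deviation, where V_i denotes player i's value.
\[
\sigma \;=\; \sum_{j=1}^{k} \lambda_j \, \pi^{(j)}_1 \otimes \cdots \otimes \pi^{(j)}_m,
\qquad \lambda_j \ge 0, \quad \sum_{j=1}^{k} \lambda_j = 1,
\]
\[
\max_{i \in [m]} \; \max_{\pi_i'} \; \Big( V_i\big(\pi_i', \sigma_{-i}\big) - V_i(\sigma) \Big) \;\le\; \epsilon .
\]
```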
Accurate and robust trajectory prediction of neighboring agents is critical for autonomous vehicles traversing complex scenes. Most methods proposed in recent years are deep learning-based due to their strength in encoding complex interactions. However, implausible predictions are often generated since they rely heavily on past observations and cannot effectively capture the transient and contingent interactions from sparse samples. In this paper, we propose a hierarchical hybrid framework of deep learning (DL) and reinforcement learning (RL) for multi-agent trajectory prediction, to cope with the challenge of predicting motions shaped by multi-scale interactions. In the DL stage, the traffic scene is divided into multiple intermediate-scale heterogeneous graphs, based on which Transformer-style GNNs are adopted to encode heterogeneous interactions at intermediate and global levels. In the RL stage, we divide the traffic scene into local sub-scenes utilizing the key future points predicted in the DL stage. To emulate the motion planning procedure so as to produce trajectory predictions, a Transformer-based Proximal Policy Optimization (PPO) combined with a vehicle kinematics model is devised to plan motions under the dominant influence of microscopic interactions. A multi-objective reward is designed to balance agent-centric accuracy and scene-wise compatibility. Experimental results show that our proposal matches the state of the art on the Argoverse forecasting benchmark. The visualized results also reveal that the hierarchical learning framework captures the multi-scale interactions and improves the feasibility and compliance of the predicted trajectories.
https://arxiv.org/abs/2303.12274
The two-stage object pose estimation paradigm first detects semantic keypoints on the image and then estimates the 6D pose by minimizing reprojection errors. Despite performing well on standard benchmarks, existing techniques offer no provable guarantees on the quality and uncertainty of the estimation. In this paper, we inject two fundamental changes, namely conformal keypoint detection and geometric uncertainty propagation, into the two-stage paradigm and propose the first pose estimator that endows an estimation with provable and computable worst-case error bounds. On one hand, conformal keypoint detection applies the statistical machinery of inductive conformal prediction to convert heuristic keypoint detections into circular or elliptical prediction sets that cover the groundtruth keypoints with a user-specified marginal probability (e.g., 90%). Geometric uncertainty propagation, on the other, propagates the geometric constraints on the keypoints to the 6D object pose, leading to a Pose UnceRtainty SEt (PURSE) that guarantees coverage of the groundtruth pose with the same probability. The PURSE, however, is a nonconvex set that does not directly lead to estimated poses and uncertainties. Therefore, we develop RANdom SAmple averaGing (RANSAG) to compute an average pose and apply semidefinite relaxation to upper bound the worst-case errors between the average pose and the groundtruth. On the LineMOD Occlusion dataset we demonstrate: (i) the PURSE covers the groundtruth with valid probabilities; (ii) the worst-case error bounds provide correct uncertainty quantification; and (iii) the average pose achieves better or similar accuracy as representative methods based on sparse keypoints.
https://arxiv.org/abs/2303.12246
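A minimal sketch of the inductive conformal prediction step described above: a radius is calibrated from nonconformity scores (pixel errors of the heuristic detector) on held-out data, and circular prediction sets at that radius then cover the ground-truth keypoints with the chosen marginal probability. The Euclidean score and circular (rather than elliptical) sets are simplifying assumptions.

```python
# Hedged sketch of inductive conformal prediction for keypoints: calibrate a radius
# from nonconformity scores on a calibration split, then report circular prediction
# sets at a user-chosen coverage level (e.g. 90%).
import numpy as np

def calibrate_radius(pred_kpts, true_kpts, alpha=0.1):
    # nonconformity score = Euclidean pixel error on the calibration split
    scores = np.linalg.norm(pred_kpts - true_kpts, axis=-1).ravel()
    n = len(scores)
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)   # conformal quantile

def prediction_set(pred_kpt, radius):
    # circular set: all pixels within `radius` of the detected keypoint
    return {"center": pred_kpt, "radius": radius}

rng = np.random.default_rng(0)
true_kpts = rng.uniform(0, 640, size=(500, 8, 2))
pred_kpts = true_kpts + rng.normal(0, 3, size=true_kpts.shape)
r = calibrate_radius(pred_kpts, true_kpts, alpha=0.1)   # ~90% marginal coverage
print(prediction_set(pred_kpts[0, 0], r))
```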
There is a recent trend in the LiDAR perception field towards unifying multiple tasks in a single strong network with improved performance, as opposed to using separate networks for each task. In this paper, we introduce a new LiDAR multi-task learning paradigm based on the transformer. The proposed LiDARFormer utilizes cross-space global contextual feature information and exploits cross-task synergy to boost the performance of LiDAR perception tasks across multiple large-scale datasets and benchmarks. Our novel transformer-based framework includes a cross-space transformer module that learns attentive features between the 2D dense Bird's Eye View (BEV) and 3D sparse voxel feature maps. Additionally, we propose a transformer decoder for the segmentation task to dynamically adjust the learned features by leveraging the categorical feature representations. Furthermore, we combine the segmentation and detection features in a shared transformer decoder with cross-task attention layers to enhance and integrate the object-level and class-level features. LiDARFormer is evaluated on the large-scale nuScenes and the Waymo Open datasets for both 3D detection and semantic segmentation tasks, and it outperforms all previously published methods on both tasks. Notably, LiDARFormer achieves the state-of-the-art performance of 76.4% L2 mAPH and 74.3% NDS on the challenging Waymo and nuScenes detection benchmarks for a single model LiDAR-only method.
https://arxiv.org/abs/2303.12194
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry to produce dense depth estimates with metric scale. Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment. We evaluate on the TartanAir and VOID datasets, observing up to a 30% reduction in inverse RMSE with dense scale alignment relative to performing global alignment alone. Our approach is especially competitive at low density; with just 150 sparse metric depth points, our dense-to-dense depth alignment method achieves over 50% lower iRMSE than sparse-to-dense depth completion by KBNet, currently the state of the art on VOID. We demonstrate successful zero-shot transfer from synthetic TartanAir to real-world VOID data and perform generalization tests on NYUv2 and VCU-RVI. Our approach is modular and is compatible with a variety of monocular depth estimation models. Video: this https URL Code: this https URL
https://arxiv.org/abs/2303.12134
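A minimal sketch of the global scale-and-shift step above: a least-squares fit of s and t so that s * d_mono + t matches the sparse metric depths from visual-inertial odometry (the subsequent learning-based dense alignment is not sketched).

```python
# Hedged sketch: fit a global scale s and shift t by least squares so that the
# monocular (up-to-scale) depths match the sparse metric depths from VIO.
import numpy as np

def global_scale_shift(mono_depth, metric_depth):
    # mono_depth, metric_depth: (K,) values at the K sparse points with metric depth
    A = np.stack([mono_depth, np.ones_like(mono_depth)], axis=1)   # (K, 2)
    (s, t), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return s, t

mono = np.array([0.2, 0.5, 0.9, 1.4])     # relative (up-to-scale) depths
metric = 3.0 * mono + 0.1                 # pretend VIO gives metric depth at these pixels
s, t = global_scale_shift(mono, metric)
print(round(s, 3), round(t, 3))           # ~3.0, ~0.1
```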
Previous works on video object segmentation (VOS) train models on densely annotated videos. Nevertheless, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: we merely require two labeled frames per training video while the performance is sustained. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we use the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. By using only 7.3% and 2.9% of the labeled data on the YouTube-VOS and DAVIS benchmarks, our approach achieves results comparable to counterparts trained on the fully labeled sets. Code and models are available at this https URL.
https://arxiv.org/abs/2303.12078
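A hedged, purely procedural sketch of the three-phase recipe described above: semi-supervised pre-training on the labeled frames, filling a pseudo-label bank for unlabeled frames, then retraining on labeled plus pseudo-labeled data. The data layout and helper names are placeholders, not the released code.

```python
# Hedged sketch of the two-shot VOS training recipe as read from the abstract;
# `train_step` and `predict_mask` are placeholder callables, not the authors' API.
def two_shot_vos_pipeline(videos, train_step, predict_mask):
    # Phase 1: pre-train with only the two labeled frames per video (first frame labeled).
    for video in videos:
        for frame in video["frames"]:
            if frame["label"] is not None:
                train_step(frame["image"], frame["label"])

    # Phase 2: build a pseudo-label bank for every unlabeled frame.
    pseudo_bank = {}
    for video in videos:
        for i, frame in enumerate(video["frames"]):
            if frame["label"] is None:
                pseudo_bank[(video["id"], i)] = predict_mask(frame["image"])

    # Phase 3: retrain on labeled and pseudo-labeled frames, no first-frame restriction.
    for video in videos:
        for i, frame in enumerate(video["frames"]):
            label = frame["label"] if frame["label"] is not None else pseudo_bank[(video["id"], i)]
            train_step(frame["image"], label)

# Toy usage with stand-in callables and data:
videos = [{"id": 0, "frames": [{"image": "f0", "label": "m0"},
                               {"image": "f1", "label": None},
                               {"image": "f2", "label": "m2"}]}]
two_shot_vos_pipeline(videos, train_step=lambda img, lbl: None,
                      predict_mask=lambda img: "pseudo_mask")
```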