Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA frameworks, such as GTSAM, g$^2$o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ compared to GTSAM, g$^2$o, and Ceres, respectively.
https://arxiv.org/abs/2409.12190
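The gap the paper targets is that classical BA libraries cannot be dropped into a PyTorch training loop. As a flavor of what eager-mode, differentiable BA means, here is a toy reprojection-error adjustment in plain PyTorch (identity rotations, a first-order optimizer, and invented variable names; the actual framework uses Lie-group types and second-order solvers on GPU, not this code):

```python
import torch

torch.manual_seed(0)
n_cams, n_pts, f = 4, 50, 500.0  # f: focal length in pixels

# synthetic scene: points in front of identity-rotation cameras
pts_gt = torch.randn(n_pts, 3) + torch.tensor([0.0, 0.0, 5.0])
ts_gt = 0.1 * torch.randn(n_cams, 3)

def project(points, t):
    p = points + t                      # camera frame (rotation omitted)
    return f * p[:, :2] / p[:, 2:3]     # pinhole projection

obs = torch.stack([project(pts_gt, t) for t in ts_gt])

# noisy initial estimates are the variables being adjusted
ts = (ts_gt + 0.05 * torch.randn_like(ts_gt)).requires_grad_()
pts = (pts_gt + 0.05 * torch.randn_like(pts_gt)).requires_grad_()

opt = torch.optim.Adam([ts, pts], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    pred = torch.stack([project(pts, ts[i]) for i in range(n_cams)])
    loss = (pred - obs).pow(2).mean()   # mean squared reprojection error
    loss.backward()                     # autograd differentiates through BA
    opt.step()
print(f"final reprojection MSE: {loss.item():.4f}")
```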
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16$\times$3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
https://arxiv.org/abs/2409.12136
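GRIN's core idea is replacing the non-differentiable discrete routing decision with a sparse gradient estimator. The sketch below shows the generic straight-through flavor of this trick, not GRIN's actual estimator: the forward pass uses hard top-2 weights, while the backward pass receives dense softmax gradients.

```python
import torch
import torch.nn.functional as F

def top2_route(logits):
    probs = F.softmax(logits, dim=-1)
    vals, idx = probs.topk(2, dim=-1)
    hard = torch.zeros_like(probs).scatter(-1, idx, vals)
    # forward value equals the hard sparse weights; the (probs -
    # probs.detach()) term routes dense gradients to all expert logits
    return hard + probs - probs.detach()

logits = torch.randn(8, 16, requires_grad=True)   # 8 tokens, 16 experts
weights = top2_route(logits)
weights.sum().backward()                          # gradients reach the router
print((weights[0] > 0).sum().item(), "experts active for token 0")
```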
We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Hereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only from the robot's stereo rig, but we further probabilistically fuse motion stereo that provides depth information across a range of baselines, therefore drastically increasing mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic nonlinear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated in two benchmark datasets, resulting in localization and mapping accuracy that exceeds the state of the art, while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real-time.
https://arxiv.org/abs/2409.12051
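The "fully probabilistic" fusion amounts to treating each depth source as a Gaussian and combining them by precision, then propagating into log-odds occupancy. A minimal illustration of both steps, with scalar stand-in values (the paper operates on full maps and also propagates uncertainty into submap alignment factors):

```python
import torch

def fuse_depth(d1, var1, d2, var2):
    # product of two Gaussian depth estimates: precision-weighted mean
    w1, w2 = 1.0 / var1, 1.0 / var2
    var = 1.0 / (w1 + w2)
    return (w1 * d1 + w2 * d2) * var, var

def update_log_odds(l_prior, p_occ):
    # recursive Bayesian occupancy update in log-odds form
    return l_prior + torch.log(p_occ / (1.0 - p_occ))

d, var = fuse_depth(torch.tensor(2.00), torch.tensor(0.04),
                    torch.tensor(2.20), torch.tensor(0.09))
l = update_log_odds(torch.tensor(0.0), torch.tensor(0.7))
print(f"fused depth {d:.3f} m (var {var:.4f}), log-odds {l:.3f}")
```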
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.
https://arxiv.org/abs/2409.12001
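To make the data-awareness point concrete, here is a hypothetical shape for a standardized per-episode record, plus the kind of summary statistic (episode returns) the authors argue datasets should report. Field names are illustrative, not the repository's actual API:

```python
import numpy as np

def make_episode(n_agents, T):
    rng = np.random.default_rng(0)
    return {
        "observations": rng.normal(size=(T, n_agents, 8)).astype(np.float32),
        "actions": rng.integers(0, 5, size=(T, n_agents)),
        "rewards": rng.normal(size=(T,)).astype(np.float32),  # team reward
        "terminals": np.zeros(T, dtype=bool),
    }

episodes = [make_episode(n_agents=3, T=50) for _ in range(10)]
returns = np.array([ep["rewards"].sum() for ep in episodes])
# the kind of dataset characteristic the paper argues should be reported
print(f"episodes: {len(episodes)}, mean return {returns.mean():.2f} "
      f"(std {returns.std():.2f})")
```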
Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, due to the differences between roadside and vehicle-side scenarios, a multi-camera BEV solution for the roadside is currently lacking. This paper systematically analyzes the key challenges in multi-camera BEV perception for roadside scenarios compared to vehicle-side ones. These challenges include the diversity in camera poses, the uncertainty in camera numbers, the sparsity in perception regions, and the ambiguity in orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset that covers more than 50 intersections and 600 cameras.
https://arxiv.org/abs/2409.11706
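Supporting a variable number of cameras per scene can be handled by padding to a fixed maximum and masking, which is the spirit of CamMask; a minimal sketch in which the shapes and the aggregation rule are assumptions:

```python
import torch

max_cams, C, H, W = 8, 32, 16, 16
feats = torch.randn(max_cams, C, H, W)            # per-camera features
cam_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0, 0], dtype=torch.bool)

masked = feats * cam_mask[:, None, None, None]    # zero absent cameras
bev = masked.sum(dim=0) / cam_mask.sum()          # average over live cameras
print(bev.shape)                                  # torch.Size([32, 16, 16])
```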
Rapid growth in the popularity of AR/VR/MR applications and cloud-based visual localization systems has given rise to an increased focus on the privacy of user content in the localization process. This privacy concern has been further escalated by the ability of deep neural networks to recover detailed images of a scene from a sparse set of 3D or 2D points and their descriptors - the so-called inversion attacks. Research on privacy-preserving localization has therefore focused on preventing these inversion attacks on both the query image keypoints and the 3D points of the scene map. To this end, several geometry obfuscation techniques have been proposed that lift points to higher-dimensional spaces, i.e., lines or planes, or that swap coordinates between points. In this paper, we point to a common weakness of these obfuscations that allows one to recover approximations of the original point positions under the assumption of known neighborhoods. We further show that these neighborhoods can be computed by learning to identify descriptors that co-occur in neighborhoods. Extensive experiments show that our approach for point recovery is practically applicable to all existing geometric obfuscation schemes. Our results show that these schemes should not be considered privacy-preserving, even though they are claimed to be. Code will be available at \url{this https URL}.
https://arxiv.org/abs/2409.11536
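The weakness is easy to state geometrically: if a point is obfuscated as a line through it, and approximate positions of its neighbors are known, the point is well approximated by the line's closest approach to those neighbors. A toy version of this recovery on synthetic data (the paper additionally learns which descriptors co-occur to obtain the neighborhoods):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = rng.normal(size=3)                           # the hidden 3D point
neighbors = p_true + 0.05 * rng.normal(size=(8, 3))   # assumed-known neighbors

d = rng.normal(size=3); d /= np.linalg.norm(d)        # obfuscating line o + t*d
o = p_true - 2.7 * d

t = np.mean((neighbors - o) @ d)   # least squares: t* = mean_j d.(p_j - o)
p_est = o + t * d
print(f"recovery error: {np.linalg.norm(p_est - p_true):.4f}")
```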
3D Gaussian Splatting (3DGS) integrates the strengths of primitive-based representations and volumetric rendering techniques, enabling real-time, high-quality rendering. However, 3DGS models typically overfit to single-scene training and are highly sensitive to the initialization of Gaussian ellipsoids, heuristically derived from Structure from Motion (SfM) point clouds, which limits both generalization and practicality. To address these limitations, we propose GS-Net, a generalizable, plug-and-play 3DGS module that densifies Gaussian ellipsoids from sparse SfM point clouds, enhancing geometric structure representation. To the best of our knowledge, GS-Net is the first plug-and-play 3DGS module with cross-scene generalization capabilities. Additionally, we introduce the CARLA-NVS dataset, which incorporates additional camera viewpoints to thoroughly evaluate reconstruction and rendering quality. Extensive experiments demonstrate that applying GS-Net to 3DGS yields a PSNR improvement of 2.08 dB for conventional viewpoints and 1.86 dB for novel viewpoints, confirming the method's effectiveness and robustness.
https://arxiv.org/abs/2409.11307
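A plug-and-play densifier of the kind described can be pictured as a small network that emits several candidate Gaussian centers around each sparse SfM point. The sketch below is a hypothetical reading of that idea; the layer sizes, K, and the offset scale are invented:

```python
import torch
import torch.nn as nn

K = 4  # candidate Gaussians emitted per sparse point
head = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, K * 3))

sfm_pts = torch.randn(2000, 3)                  # sparse SfM point cloud
offsets = head(sfm_pts).view(-1, K, 3) * 0.05   # K local offsets per point
dense = (sfm_pts[:, None] + offsets).reshape(-1, 3)
print(sfm_pts.shape[0], "->", dense.shape[0], "Gaussian centers")
```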
This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
https://arxiv.org/abs/2409.11272
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods - Magnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
https://arxiv.org/abs/2409.11233
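JS divergence compares the full next-token distributions of the base and compressed models rather than a single scalar like perplexity. A self-contained sketch of the metric, with random logits standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

def js_divergence(logp, logq):
    # logp, logq: log-probabilities over the vocabulary, shape (..., V)
    p, q = logp.exp(), logq.exp()
    m = 0.5 * (p + q)
    kl_pm = (p * (logp - m.log())).sum(-1)
    kl_qm = (q * (logq - m.log())).sum(-1)
    return 0.5 * (kl_pm + kl_qm)

logits_base = torch.randn(4, 1000)       # stand-ins for model outputs
logits_comp = logits_base + 0.5 * torch.randn(4, 1000)
jsd = js_divergence(F.log_softmax(logits_base, -1),
                    F.log_softmax(logits_comp, -1))
print("per-position JSD:", jsd)
```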
Digitizing 3D static scenes and 4D dynamic events from multi-view images has long been a challenge in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a practical and scalable reconstruction method, gaining popularity due to its impressive reconstruction quality, real-time rendering capabilities, and compatibility with widely used visualization tools. However, the method requires a substantial number of input views to achieve high-quality scene reconstruction, introducing a significant practical bottleneck. This challenge is especially severe in capturing dynamic scenes, where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation of splat features as one of the factors contributing to the suboptimal performance of the 3DGS technique in sparse reconstruction settings. To address the issue, we propose an optimization strategy that effectively regularizes splat features by modeling them as the outputs of a corresponding implicit neural field. This results in a consistent enhancement of reconstruction quality across various scenarios. Our approach effectively handles static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.
https://arxiv.org/abs/2409.11211
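The proposed regularization ties each splat's feature vector to the output of an implicit neural field at the splat's position, restoring spatial autocorrelation. A minimal sketch of that coupling (in practice the penalty would be added to the rendering loss; the sizes and weight are invented):

```python
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 16))
centers = torch.randn(1000, 3)                      # splat positions (fixed here)
feats = torch.randn(1000, 16, requires_grad=True)   # free per-splat features

opt = torch.optim.Adam([feats, *field.parameters()], lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    reg = (feats - field(centers)).pow(2).mean()    # tie features to the field
    loss = reg   # in practice: rendering loss + lambda * reg
    loss.backward()
    opt.step()
```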
The LiDAR-based 3D object detector that strikes a balance between accuracy and speed is crucial for achieving real-time perception in autonomous driving and robotic navigation systems. To enhance the accuracy of point cloud detection, integrating global context for visual understanding improves the point cloud's ability to grasp overall spatial information. However, many existing LiDAR detection models depend on intricate feature transformation and extraction processes, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively uniform cross-model voxel features. We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer. Specifically, Dynamic Voxel Group and Adaptive Attention strategies are integrated into the sparse backbone, creating a robust teacher model with scale-adaptive attention for effective global visual context modeling. Following feature alignment with the Adapter, we transfer knowledge from the Transformer to the Mamba through latent space feature supervision and span-head distillation, resulting in improved performance and an efficient student model. We evaluated the framework on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over the current SoTA methods.
https://arxiv.org/abs/2409.11018
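The cross-architecture transfer reduces to aligning student features to the teacher's latent space before a supervision loss. A generic sketch with stand-in modules (a real student would be a Mamba block, and the paper adds span-head distillation on top):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_feat = torch.randn(2, 256, 100)       # e.g. transformer features
student = nn.Conv1d(128, 128, 3, padding=1)   # stand-in for a Mamba block
adapter = nn.Conv1d(128, 256, 1)              # aligns channel dimensions

x = torch.randn(2, 128, 100)
s = adapter(student(x))
distill_loss = F.mse_loss(s, teacher_feat.detach())
distill_loss.backward()   # supervises the student in the teacher's latent space
```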
Autonomous robot navigation in off-road environments requires a comprehensive understanding of the terrain geometry and traversability. The degraded perceptual conditions and sparse geometric information at longer ranges make the problem challenging especially when driving at high speeds. Furthermore, the sensing-to-mapping latency and the look-ahead map range can limit the maximum speed of the vehicle. Building on top of the recent work RoadRunner, in this work, we address the challenge of long-range (100 m) traversability estimation. Our RoadRunner (M&M) is an end-to-end learning-based framework that directly predicts the traversability and elevation maps at multiple ranges (50 m, 100 m) and resolutions (0.2 m, 0.8 m) taking as input multiple images and a LiDAR voxel map. Our method is trained in a self-supervised manner by leveraging the dense supervision signal generated by fusing predictions from an existing traversability estimation stack (X-Racer) in hindsight and satellite Digital Elevation Maps. RoadRunner M&M achieves a significant improvement of up to 50% for elevation mapping and 30% for traversability estimation over RoadRunner, and is able to predict in 30% more regions compared to X-Racer while achieving real-time performance. Experiments on various out-of-distribution datasets also demonstrate that our data-driven approach starts to generalize to novel unstructured environments. We integrate our proposed framework in closed-loop with the path planner to demonstrate autonomous high-speed off-road robotic navigation in challenging real-world environments. Project Page: this https URL
https://arxiv.org/abs/2409.10940
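Predicting at multiple ranges and resolutions can be realized as separate decoder heads over one fused feature map. A hypothetical sketch: the grid sizes follow from 50 m / 0.2 m = 250 and 100 m / 0.8 m = 125 cells, while the channel counts and layers are invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 64, 32, 32)      # fused image + LiDAR BEV features
head_short = nn.Conv2d(64, 2, 1)       # 2 channels: traversability, elevation
head_long = nn.Conv2d(64, 2, 1)

short = F.interpolate(head_short(feat), size=(250, 250), mode="bilinear")
long_ = F.interpolate(head_long(feat), size=(125, 125), mode="bilinear")
print(short.shape, long_.shape)
```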
We focus on agile, continuous, and terrain-adaptive jumping of quadrupedal robots in discontinuous terrains such as stairs and stepping stones. Unlike single-step jumping, continuous jumping requires accurately executing highly dynamic motions over long horizons, which is challenging for existing approaches. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Our framework enables a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and completes a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities. Experiment videos can be found at \url{this https URL\_cod/}.
https://arxiv.org/abs/2409.10923
Transformer-based large language models (LLMs) have become increasingly important in various domains. However, the quadratic time complexity of the attention operation poses a significant challenge for scaling to longer contexts due to the extremely high inference latency and GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to accelerate attention computation. To leverage the dynamic sparse property of attention, RetrievalAttention builds approximate nearest neighbor search (ANNS) indexes upon KV vectors in CPU memory and retrieves the most relevant ones via vector search during generation. Due to the out-of-distribution (OOD) gap between query vectors and key vectors, off-the-shelf ANNS indexes still need to scan O(N) (usually 30% of all keys) data for accurate retrieval, which fails to exploit the high sparsity. RetrievalAttention first identifies the OOD challenge of ANNS-based attention, and addresses it via an attention-aware vector search algorithm that can adapt to queries and only access 1--3% of data, thus achieving a sub-linear time complexity. RetrievalAttention greatly reduces the inference cost of long-context LLMs with much lower GPU memory requirements while maintaining the model accuracy. Notably, RetrievalAttention needs only 16GB of GPU memory to serve 128K tokens in LLMs with 8B parameters, and can generate one token in 0.188 seconds on a single NVIDIA RTX4090 (24GB).
https://arxiv.org/abs/2409.10516
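The essence is that the attention output is dominated by a small set of high-scoring keys, so retrieving only those keys approximates the full result. In the sketch below, exact top-k stands in for the CPU-side ANNS index (the paper's contribution is making that index robust to the query/key distribution gap):

```python
import torch
import torch.nn.functional as F

def retrieval_attention(q, K, V, k=32):
    # q: (d,), K/V: (N, d); score all keys, keep top-k, softmax over them
    scores = K @ q / K.shape[-1] ** 0.5
    top_scores, idx = scores.topk(k)
    w = F.softmax(top_scores, dim=-1)
    return w @ V[idx]                    # touches only k << N value vectors

N, d = 100_000, 64                       # long cached context
K, V = torch.randn(N, d), torch.randn(N, d)
q = torch.randn(d)
out = retrieval_attention(q, K, V)
full = F.softmax(K @ q / d ** 0.5, -1) @ V
print("approx vs full cosine:", F.cosine_similarity(out, full, dim=0).item())
```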
Predicting high-dimensional or extreme multilabels, such as in medical coding, requires both accuracy and interpretability. Existing works often rely on local interpretability methods, failing to provide comprehensive explanations of the overall mechanism behind each label prediction within a multilabel set. We propose a mechanistic interpretability module called DIctionary Label Attention (DILA) that disentangles uninterpretable dense embeddings into a sparse embedding space, where each nonzero element (a dictionary feature) represents a globally learned medical concept. Through human evaluations, we show that our sparse embeddings are more human understandable than their dense counterparts by at least 50 percent. Our automated dictionary feature identification pipeline, leveraging large language models (LLMs), uncovers thousands of learned medical concepts by examining and summarizing the highest activating tokens for each dictionary feature. We represent the relationships between dictionary features and medical codes through a sparse interpretable matrix, enhancing the mechanistic and global understanding of the model's predictions while maintaining competitive performance and scalability without extensive human annotation.
https://arxiv.org/abs/2409.10504
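Disentangling dense embeddings into a sparse dictionary space is, at its core, an overcomplete sparse autoencoder. A minimal sketch of that idea (the dimensions and L1 weight are invented, and the paper additionally maps dictionary features to medical codes):

```python
import torch
import torch.nn as nn

enc, dec = nn.Linear(256, 4096), nn.Linear(4096, 256)  # overcomplete dictionary
x = torch.randn(512, 256)               # stand-in dense label embeddings

opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    z = torch.relu(enc(x))              # nonnegative codes
    loss = (dec(z) - x).pow(2).mean() + 1e-2 * z.abs().mean()  # L1 sparsity
    loss.backward()
    opt.step()
print(f"active dictionary features per example: "
      f"{(z > 0).float().sum(-1).mean():.0f}")
```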
This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a new innovation in diffusion models as a whole. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.
https://arxiv.org/abs/2409.10494
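Classifier-free guidance itself is model-agnostic: the sampler blends a conditional and an unconditional denoiser pass. A minimal sketch with a toy denoiser (in the recommendation setting the condition might be a user-history embedding; the weight w and all shapes are illustrative):

```python
import torch

def cfg_noise(model, x_t, t, cond, w):
    eps_uncond = model(x_t, t, None)    # condition dropped
    eps_cond = model(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

# toy denoiser so the sketch runs end to end (ignores the timestep t)
model = lambda x, t, c: x * 0.1 + (0.0 if c is None else c * 0.01)

x_t = torch.randn(4, 16)                # noisy interaction vectors
cond = torch.randn(4, 16)               # e.g. user-history embedding
eps = cfg_noise(model, x_t, t=10, cond=cond, w=3.0)
print(eps.shape)
```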
DETR introduces a simplified one-stage framework for scene graph generation (SGG). However, DETR-based SGG models face two challenges: i) Sparse supervision, as each image typically contains fewer than 10 relation annotations, while the models employ over 100 relation queries. This sparsity arises because each ground truth relation is assigned to only one single query during training. ii) False negative samples, since one ground truth relation may have multiple queries with similar matching scores. These suboptimally matched queries are simply treated as negative samples, causing the loss of valuable supervisory signals. In response, we devise Hydra-SGG, a one-stage SGG method that adopts a new Hybrid Relation Assignment. This assignment combines a One-to-One Relation Assignment with a newly introduced IoU-based One-to-Many Relation Assignment. Specifically, each ground truth is assigned to multiple relation queries with high-IoU subject-object boxes. This Hybrid Relation Assignment increases the number of positive training samples, alleviating sparse supervision. Moreover, we, for the first time, empirically show that self-attention over relation queries helps reduce duplicated relation predictions. We, therefore, propose Hydra Branch, a parameter-sharing auxiliary decoder without a self-attention layer. This design promotes One-to-Many Relation Assignment by enabling different queries to predict the same relation. Hydra-SGG achieves state-of-the-art performance with 10.6 mR@20 and 16.0 mR@50 on VG150, while only requiring 12 training epochs. It also sets a new state-of-the-art on Open Images V6 and GQA.
https://arxiv.org/abs/2409.10262
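The Hybrid Relation Assignment hinges on letting one ground truth supervise several queries when their boxes overlap enough. A toy version of the IoU-based one-to-many matching (the threshold and boxes are invented, and the real criterion checks both subject and object boxes):

```python
import torch

def box_iou(a, b):
    # a: (N,4), b: (M,4) in (x1,y1,x2,y2); returns an (N,M) IoU matrix
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None] - inter)

gt = torch.tensor([[0., 0., 10., 10.]])                # one GT subject box
queries = torch.tensor([[1., 1., 10., 10.],
                        [0., 0., 9., 11.],
                        [50., 50., 60., 60.]])
matches = box_iou(gt, queries) > 0.5   # one GT -> several positive queries
print(matches)
```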
This paper proposes SOLVR, a unified pipeline for learning-based LiDAR-Visual re-localisation which performs place recognition and 6-DoF registration across sensor modalities. We propose a strategy to align the input sensor modalities by leveraging stereo image streams to produce metric depth predictions with pose information, followed by fusing multiple scene views from a local window using a probabilistic occupancy framework to expand the limited field-of-view of the camera. Additionally, SOLVR adopts a flexible definition of what constitutes positive examples for different training losses, allowing us to simultaneously optimise place recognition and registration performance. Furthermore, we replace RANSAC with a registration function that weights a simple least-squares fitting with the estimated inlier likelihood of sparse keypoint correspondences, improving performance in scenarios with a low inlier ratio between the query and retrieved place. Our experiments on the KITTI and KITTI360 datasets show that SOLVR achieves state-of-the-art performance for LiDAR-Visual place recognition and registration, particularly improving registration accuracy over larger distances between the query and retrieved place.
https://arxiv.org/abs/2409.10247
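Replacing RANSAC with a weighted least-squares fit means each putative correspondence contributes in proportion to its estimated inlier likelihood. A sketch using a weighted Kabsch solve, with synthetic correspondences and made-up likelihoods (the paper predicts the weights from keypoint features):

```python
import numpy as np

def weighted_kabsch(P, Q, w):
    # minimize sum_i w_i ||R @ P_i + t - Q_i||^2 over rotation R, translation t
    w = w / w.sum()
    mu_p, mu_q = w @ P, w @ Q
    H = (P - mu_p).T @ np.diag(w) @ (Q - mu_q)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    return R, mu_q - R @ mu_p

rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
R_true *= np.sign(np.linalg.det(R_true))   # ensure a proper rotation
Q = P @ R_true.T + np.array([1.0, 2.0, 0.0])
Q[:20] += rng.normal(size=(20, 3))         # corrupt 20% of correspondences
w = np.ones(100); w[:20] = 0.01            # estimated inlier likelihoods
R, t = weighted_kabsch(P, Q, w)
print("rotation error:", np.linalg.norm(R - R_true))
```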
Even though the depth maps captured by RGB-D sensors deployed in real environments often have large areas missing valid depth measurements, the vast majority of depth completion methods still assumes depth values covering all areas of the scene. To address this limitation, we introduce SteeredMarigold, a training-free, zero-shot depth completion method capable of producing metric dense depth even for largely incomplete depth maps. SteeredMarigold achieves this by using the available sparse depth points as conditions to steer a denoising diffusion probabilistic model. Our method outperforms relevant top-performing methods on the NYUv2 dataset in tests where no depth was provided for a large area, achieving state-of-the-art performance and exhibiting remarkable robustness against depth map incompleteness. Our code will be publicly available.
https://arxiv.org/abs/2409.10202
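Using sparse depth points as conditions to steer the denoiser can be pictured as re-imposing the known measurements, re-noised to the current timestep, at each sampling step, in the spirit of diffusion in-painting. This is a generic sketch of that mechanism, not the paper's exact steering rule:

```python
import torch

def steer(x_t, sparse_depth, mask, alpha_bar_t):
    # re-noise the sparse observations to the current timestep and
    # overwrite the corresponding pixels of the diffusion sample
    noise = torch.randn_like(x_t)
    known_t = alpha_bar_t.sqrt() * sparse_depth \
        + (1 - alpha_bar_t).sqrt() * noise
    return torch.where(mask, known_t, x_t)

x_t = torch.randn(1, 1, 64, 64)            # current diffusion sample
depth = torch.rand(1, 1, 64, 64) * 5.0     # metric depth (only sparse valid)
mask = torch.rand(1, 1, 64, 64) < 0.02     # ~2% of pixels have measurements
x_t = steer(x_t, depth, mask, torch.tensor(0.5))
```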
Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space by relaxing the structural assumption of the VQ formulation. Specifically, we assume that the latent space can be approximated by a union-of-subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned/updated during the training process. We apply this approach to two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our latent space is more expressive and leads to better representations than the VQ approach in terms of reconstruction quality, at the expense of a small computational overhead for the latent space computation. Our results thus suggest that the true benefit of the VQ approach might not be the discretization of the latent space, but rather the lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue commonly found in VQ-family models.
https://arxiv.org/abs/2409.11184
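The union-of-subspaces latent model replaces a codebook lookup with sparse coding against a learned dictionary. A minimal sketch of the inference half using a few ISTA iterations (the dictionary and sizes are invented; in the paper the dictionary is learned jointly during training):

```python
import torch
import torch.nn.functional as F

def ista(z, D, lam=0.1, steps=50, lr=0.05):
    # solve min_c 0.5 ||z - c @ D||^2 + lam ||c||_1 for sparse codes c
    c = torch.zeros(z.shape[0], D.shape[0])
    for _ in range(steps):
        grad = (c @ D - z) @ D.T
        c = F.softshrink(c - lr * grad, lam * lr)  # soft-thresholded step
    return c

D = torch.randn(128, 16)                 # 128 atoms in a 16-dim latent space
D = D / D.norm(dim=1, keepdim=True)
z = torch.randn(32, 16)                  # stand-in encoder outputs
c = ista(z, D)
print(f"avg nonzeros per latent: {(c != 0).float().sum(-1).mean():.1f}")
```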