NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving

2025-06-17 14:52:50
Ren Xin, Hongji Liu, Xiaodong Mei, Wenru Liu, Maosheng Ye, Zhili Chen, Jun Ma

Abstract

Integrating General Models (GMs), such as Large Language Models (LLMs), with Specialized Models (SMs) in autonomous driving tasks is a promising way to mitigate the data-diversity and model-capacity limitations of existing specialized driving models. However, this integration creates an asynchronous system, because GMs and SMs have inherently different characteristics. To tackle this challenge, we propose NetRoller, an adapter that incorporates a set of novel mechanisms to facilitate the seamless integration of GMs and specialized driving models. Specifically, our mechanisms for interfacing the asynchronous GMs and SMs are organized into three key stages. NetRoller first harvests semantically rich and computationally efficient representations from the reasoning processes of LLMs using an early stopping mechanism, which preserves critical insights on driving context while maintaining low overhead. It then applies learnable query embeddings, nonsensical embeddings, and positional layer embeddings to facilitate robust and efficient cross-modality translation. Finally, it employs computationally efficient Query Shift and Feature Shift mechanisms to enhance the performance of SMs through few-epoch fine-tuning. Based on the mechanisms formalized in these three stages, NetRoller enables specialized driving models to operate at their native frequencies while maintaining situational awareness of the GM. Experiments conducted on the nuScenes dataset demonstrate that integrating a GM through NetRoller significantly improves human similarity and safety in planning tasks, and that it also achieves noticeable precision improvements in detection and mapping tasks for end-to-end autonomous driving. The code and models are available at this https URL.
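
The asynchronous setting described in the abstract, where a fast SM runs every cycle while a slow GM refreshes its context only occasionally, can be illustrated with a toy simulation. This is a minimal sketch and not the NetRoller implementation: `LatestFeatureCache`, `slow_gm`, `fast_sm`, and `run` are hypothetical names, the features are toy numbers, and the additive "shift" is only a loose stand-in for conditioning an SM on GM output. It also assumes the GM result is available in the same tick it is requested, which a real system would not guarantee.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LatestFeatureCache:
    """Holds the most recent GM feature and the tick at which it was produced."""
    feature: Optional[List[float]] = None
    produced_at: Optional[int] = None

    def update(self, feature: List[float], tick: int) -> None:
        self.feature = feature
        self.produced_at = tick


def slow_gm(tick: int) -> List[float]:
    """Hypothetical stand-in for a slow general model: returns a toy context feature."""
    return [float(tick), 1.0]


def fast_sm(observation: float, gm_feature: Optional[List[float]]) -> float:
    """Hypothetical stand-in for a fast specialized model.

    Adds the sum of the cached GM feature to its output (assumption: a simple
    additive shift, used here only for illustration).
    """
    shift = sum(gm_feature) if gm_feature is not None else 0.0
    return observation + shift


def run(num_ticks: int, gm_period: int) -> List[Tuple[int, float, Optional[int]]]:
    """Run the SM on every tick; refresh the GM feature only every `gm_period` ticks."""
    cache = LatestFeatureCache()
    log = []
    for tick in range(num_ticks):
        if tick % gm_period == 0:
            cache.update(slow_gm(tick), tick)
        output = fast_sm(observation=0.5, gm_feature=cache.feature)
        # Record which GM tick the SM output was conditioned on.
        log.append((tick, output, cache.produced_at))
    return log


if __name__ == "__main__":
    for entry in run(num_ticks=6, gm_period=3):
        print(entry)
```

Each log entry records the tick, the SM output, and the tick of the GM feature it used, so the staleness of the GM context is visible at every step. The SM never waits on the GM; it simply reads whatever feature is cached.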

URL

https://arxiv.org/abs/2506.14589

PDF

https://arxiv.org/pdf/2506.14589.pdf

