Paper Reading AI Learner

When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis

2025-01-17 23:35:34
Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, Kaan Ozbay

Abstract

The increasing availability of traffic videos recorded around the clock (24/7/365) has great potential to expand the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras operating around the clock remains an extremely challenging task, as current vision-based approaches focus primarily on extracting raw information, such as vehicle trajectories or individual object detections, and require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow into a more interactive, conversational approach. This shift significantly increases processing throughput by automating complex tasks such as video classification and visual grounding, and improves adaptability by allowing seamless adjustment to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths, together with a novel multimodal prompt that generates structured responses for review and evaluation and enables fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with the ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at this https URL.
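To make the described pipeline concrete, the sketch below illustrates, under stated assumptions, how a severity-based aggregation over MLLM-generated structured responses could look. It is not SeeUnsafe's actual implementation: the `query_mllm` callable, the 0-3 severity scale, the response schema, and the aggregation threshold are hypothetical placeholders for whatever off-the-shelf MLLM and prompt format the framework uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ClipAssessment:
    """Structured per-clip result parsed from an MLLM response."""
    clip_index: int
    severity: int      # assumed scale: 0 (no incident) .. 3 (severe collision)
    description: str   # free-text explanation kept for review and grounding


def assess_clips(
    clip_paths: List[str],
    query_mllm: Callable[[str, str], Dict],
    prompt: str = "Describe any traffic accident in this clip and rate its severity from 0 to 3.",
) -> List[ClipAssessment]:
    """Query the MLLM once per clip and parse its structured (dict-like) response."""
    results: List[ClipAssessment] = []
    for i, path in enumerate(clip_paths):
        response = query_mllm(path, prompt)  # e.g. {"severity": 2, "description": "..."}
        results.append(
            ClipAssessment(
                clip_index=i,
                severity=int(response.get("severity", 0)),
                description=str(response.get("description", "")),
            )
        )
    return results


def aggregate_by_severity(assessments: List[ClipAssessment], threshold: int = 2) -> Dict:
    """Video-level decision: accident-positive if any clip reaches the severity threshold."""
    peak = max(assessments, key=lambda a: a.severity)
    return {
        "accident": peak.severity >= threshold,
        "peak_clip": peak.clip_index,
        "peak_severity": peak.severity,
        "evidence": peak.description,
    }


if __name__ == "__main__":
    # Stubbed MLLM call for demonstration; a real system would invoke an
    # off-the-shelf multimodal model here.
    def fake_mllm(clip_path: str, prompt: str) -> Dict:
        return {"severity": 2, "description": f"Rear-end collision visible in {clip_path}"}

    report = aggregate_by_severity(assess_clips(["clip_000.mp4", "clip_001.mp4"], fake_mllm))
    print(report)
```

In this reading, a video is flagged as accident-positive as soon as any clip reaches the severity threshold, and the peak-severity clip is kept as evidence for downstream review and visual grounding.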

Abstract (translated)

The increasing availability of traffic video recorded around the clock (24/7/365) brings great potential to expand the spatio-temporal coverage of traffic accidents and thereby help improve traffic safety. However, analyzing footage from hundreds or even thousands of cameras operating around the clock remains an extremely challenging task: current vision-based methods focus mainly on extracting raw information (such as vehicle trajectories or individual object detections) and still require laborious manual post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to shift video-based traffic accident analysis from the traditional extract-then-explain workflow to a more interactive, conversational mode. This shift significantly improves processing efficiency by automating complex tasks such as video classification and visual grounding, and increases flexibility by allowing the system to adapt seamlessly to diverse traffic scenarios and user-defined questions. Our framework adopts a severity-based aggregation strategy to handle videos of different lengths and uses a novel multimodal prompt to generate structured responses for review and evaluation, which helps enable fine-grained visual grounding. We also introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with the ground truth. Extensive experiments on the Toyota Woven Traffic Safety dataset show that SeeUnsafe can effectively perform accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. The source code will be available at this https URL.

URL

https://arxiv.org/abs/2501.10604

PDF

https://arxiv.org/pdf/2501.10604.pdf

