First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Atomic Activity Recognition 2024

2024-10-30 15:06:58
Ruyang Li, Tengfei Zhang, Heng Zhang, Tiejun Liu, Yanwei Wang, Xuelei Li

Abstract

This report presents our team's technical solution for Track 3 of the 2024 ECCV ROAD++ Challenge. The Track 3 task is atomic activity recognition: identifying 64 types of atomic activities in road scenes from video content. Our approach primarily addresses three challenges in this task: small objects, discriminating between a single object and a group of objects, and model overfitting. First, we construct a multi-branch activity recognition framework that separates not only the different object categories but also the tasks of single-object and object-group recognition, thereby improving recognition accuracy. Second, we develop several model ensembling strategies, integrating multiple frame sampling sequences, different frame sampling sequence lengths, multiple training epochs, and different backbone networks. Third, we propose a data augmentation method for atomic activity recognition that greatly expands the sample space by flipping video frames and the road topology, effectively mitigating model overfitting. Our method ranks first on the Track 3 test set of the ROAD++ Challenge 2024, achieving 69% mAP.
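The flip-based augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes clips are `(T, H, W, C)` arrays and that mirroring a frame horizontally also mirrors the road topology, so any direction-dependent activity label must be swapped with its left/right counterpart. The label names in `FLIP_LABEL_MAP` are hypothetical stand-ins for the challenge's actual 64 atomic activity classes.

```python
import numpy as np

# Hypothetical left/right-sensitive label pairs; the real challenge
# taxonomy defines its own direction-dependent activity classes.
FLIP_LABEL_MAP = {
    "vehicle_turn_left": "vehicle_turn_right",
    "vehicle_turn_right": "vehicle_turn_left",
    "pedestrian_cross_left_to_right": "pedestrian_cross_right_to_left",
    "pedestrian_cross_right_to_left": "pedestrian_cross_left_to_right",
}

def flip_clip(frames, labels):
    """Horizontally flip a video clip and remap direction-dependent labels.

    frames: array of shape (T, H, W, C)
    labels: list of activity-label strings for the clip
    """
    # Mirror each frame left-to-right along the width axis; this also
    # mirrors the road topology seen by the model.
    flipped = frames[:, :, ::-1, :]
    # Swap left/right-dependent labels; direction-neutral labels pass through.
    remapped = [FLIP_LABEL_MAP.get(label, label) for label in labels]
    return flipped, remapped
```

Applied to every training clip, this doubles the effective sample space, which is one plausible reason the abstract reports reduced overfitting.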

URL

https://arxiv.org/abs/2410.23092

PDF

https://arxiv.org/pdf/2410.23092.pdf
