Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

2025-10-09 17:20:44
Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, shifting from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and mitigating cross-modal hallucination, and validating the method's robustness and generalization.
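The abstract only names the three terms of the hierarchical reward (tool-usage efficiency, sub-motion relevance, structural coherence) without giving a formula, so the following is a minimal hypothetical sketch of how such a scalar reward could be combined per rollout. The `RolloutScores` fields, the `hierarchical_reward` function, and the weights are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of a hierarchical reward for tool-augmented RL.
# Assumes each term has already been scored as a scalar in [0, 1] per rollout;
# weights and names are illustrative placeholders, not taken from the paper.
from dataclasses import dataclass

@dataclass
class RolloutScores:
    tool_efficiency: float       # fewer redundant tool calls -> closer to 1
    submotion_relevance: float   # agreement between cited sub-motions and the matched category
    structural_coherence: float  # well-formed reasoning trace (e.g., valid step ordering)

def hierarchical_reward(s: RolloutScores,
                        w_tool: float = 0.3,
                        w_rel: float = 0.5,
                        w_struct: float = 0.2) -> float:
    """Weighted combination of the three reward terms (illustrative weights)."""
    return (w_tool * s.tool_efficiency
            + w_rel * s.submotion_relevance
            + w_struct * s.structural_coherence)

# Example: a rollout that invokes tools sparingly, matches sub-motions well,
# and produces a coherent trace receives a high scalar reward (~0.89 here).
print(hierarchical_reward(RolloutScores(0.8, 0.9, 1.0)))
```

Under this reading, the policy is free to trade off extra tool calls against better sub-motion evidence, which is consistent with the abstract's claim that tool use is prioritized without explicit supervision.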

URL

https://arxiv.org/abs/2510.08480

PDF

https://arxiv.org/pdf/2510.08480.pdf

