Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence of the reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, shifting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 demonstrate state-of-the-art performance: our method outperforms existing approaches in distinguishing fine-grained actions and mitigating cross-modal hallucination, validating its robustness and generalization.
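To make the hierarchical reward concrete, the following is a minimal sketch of how a scalar reward balancing tool-usage efficiency, sub-motion relevance, and structural coherence might be composed. The term names, weights, and scoring inputs are hypothetical illustrations, not the paper's actual formulation.

```python
# Hypothetical sketch of a hierarchical reward combining three weighted terms.
# Weights, names, and input scores are assumptions for illustration only;
# the paper's actual reward design may differ.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    tool_efficiency: float = 0.3      # penalize unnecessary tool invocations
    submotion_relevance: float = 0.5  # reward matching discriminative sub-motions
    structural_coherence: float = 0.2 # reward well-structured reasoning traces


def hierarchical_reward(
    num_tool_calls: int,
    max_tool_calls: int,
    submotion_match_score: float,  # in [0, 1], e.g. similarity to reference sub-motions
    coherence_score: float,        # in [0, 1], e.g. structural validity of the trace
    w: RewardWeights = RewardWeights(),
) -> float:
    """Combine the three reward components into one scalar."""
    # Fewer tool calls (relative to a budget) yield a higher efficiency term.
    tool_efficiency = 1.0 - min(num_tool_calls, max_tool_calls) / max(max_tool_calls, 1)
    return (
        w.tool_efficiency * tool_efficiency
        + w.submotion_relevance * submotion_match_score
        + w.structural_coherence * coherence_score
    )
```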
URL
https://arxiv.org/abs/2510.08480