Abstract
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained, object-centric video captioning that produces detailed descriptions of user-selected objects over time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally aware descriptions of an object's attributes, actions, statuses, interactions, and environmental context without requiring any additional training data. CAT-V supports flexible user interaction through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Existing video captioning methods either produce overly abstract descriptions or lack object-level precision; our approach addresses both limitations, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at this https URL.
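To make the three-stage design concrete, the sketch below outlines how a Segmenter, Temporal Analyzer, and Captioner could be composed in a training-free pipeline of the kind the abstract describes. All class and function names, and the toy stub logic, are illustrative assumptions, not the released CAT-V code or the actual SAMURAI, TRACE-Uni, or InternVL-2.5 APIs.

```python
# Minimal, hypothetical sketch of a three-stage object-centric captioning pipeline.
# Names are placeholders; the real system wires SAMURAI (segmentation),
# TRACE-Uni (event boundaries), and InternVL-2.5 (captioning) into these roles.
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class VisualPrompt:
    """User selection of an object: a point, a bounding box, or an irregular region."""
    kind: str                  # "point" | "box" | "region"
    coords: Tuple[float, ...]  # e.g. (x, y) for a point, (x1, y1, x2, y2) for a box


@dataclass
class EventSegment:
    """One temporal event with detected boundaries (frame indices)."""
    start: int
    end: int


def segment_object(frames: List[Any], prompt: VisualPrompt) -> List[Any]:
    """Stage 1, Segmenter: propagate the prompted object's mask to every frame
    (SAMURAI in the paper). A dummy per-frame mask stands in here."""
    return [f"mask@{i}" for i in range(len(frames))]


def detect_events(frames: List[Any]) -> List[EventSegment]:
    """Stage 2, Temporal Analyzer: detect event boundaries (TRACE-Uni in the paper).
    This stub simply splits the video in half."""
    mid = len(frames) // 2
    return [EventSegment(0, mid), EventSegment(mid, len(frames))]


def caption_segment(frames: List[Any], masks: List[Any], seg: EventSegment) -> str:
    """Stage 3, Captioner: describe the masked object within one event, using
    spatiotemporal prompts and chain-of-thought (InternVL-2.5 in the paper)."""
    return f"Object description for frames {seg.start}-{seg.end} (placeholder)."


def caption_object_in_video(frames: List[Any], prompt: VisualPrompt):
    """Training-free flow: segment the object, split into events, caption each event."""
    masks = segment_object(frames, prompt)
    return [(seg, caption_segment(frames, masks, seg)) for seg in detect_events(frames)]


if __name__ == "__main__":
    video = [f"frame_{i}" for i in range(8)]               # toy stand-in for decoded frames
    click = VisualPrompt(kind="point", coords=(120.0, 80.0))
    for seg, caption in caption_object_in_video(video, click):
        print(f"[{seg.start}-{seg.end}] {caption}")
```

The per-event captioning loop is what keeps descriptions temporally sensitive: each event segment gets its own description of the tracked object's state and interactions, rather than one caption for the whole clip.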
URL
https://arxiv.org/abs/2504.05541