Abstract
The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored indefinitely in a fraction of the storage space required by the videos themselves, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations show 80% accuracy in the temporal quality of the pipeline and 70% in its spatial quality and consistency.
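The query-driven summarization described above can be sketched in a few lines. This is a minimal illustration only: the paper does not specify a particular model or API, so `describe_frame` is a hypothetical stub standing in for an actual Vision Language Model call, and the sampling interval and keyword filter are assumed for demonstration.

```python
# Sketch of a query-driven video-summarization pipeline, assuming:
# - frames are sampled from CCTV footage at a fixed interval,
# - a Vision Language Model (stubbed here as describe_frame) captions each frame,
# - captions relevant to the user-defined query are kept as the textual summary.

def describe_frame(frame):
    """Hypothetical VLM call; returns a caption for one frame."""
    # A real pipeline would invoke a Vision Language Model here.
    return f"frame at t={frame['t']}s: {frame['content']}"

def summarize(frames, query, interval=5):
    """Caption sampled frames and keep only captions matching the query."""
    sampled = frames[::interval]                      # temporal subsampling
    captions = [describe_frame(f) for f in sampled]   # per-frame descriptions
    return [c for c in captions if query.lower() in c.lower()]

# Toy footage: one timestamped "frame" per second of video.
footage = [{"t": t, "content": "person enters" if t == 10 else "empty hallway"}
           for t in range(60)]

summary = summarize(footage, query="person", interval=5)
print(summary)  # only the captions mentioning "person" survive
```

The resulting summary is plain text, a tiny fraction of the size of the source video, which is what makes indefinite retention and fast event lookup feasible.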
URL
https://arxiv.org/abs/2501.02850