VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Abstract

Text spotting, the task of extracting textual information from images or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) built by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and data. The code and datasets will be made available at this https URL.
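
For readers unfamiliar with the MOTA metric mentioned above, the following is a minimal sketch of the standard multi-object tracking accuracy definition, MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames. The abstract does not describe the evaluation protocol, so the function name, argument names, and counts below are illustrative assumptions and not the authors' evaluation code.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Standard MOTA: 1 - (FN + FP + IDSW) / GT.

    All counts are totals accumulated over every frame of the sequence.
    The value can be negative when the errors outnumber the ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Hypothetical counts, chosen only to illustrate the arithmetic.
score = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
print(f"MOTA = {score:.3f}")
```

Because the three error types are summed before normalizing, a gain in MOTA can come from fewer missed text instances, fewer false detections, fewer identity switches, or any combination of these.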

URL

https://arxiv.org/abs/2404.19652

PDF

https://arxiv.org/pdf/2404.19652.pdf
