VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

2024-04-30 15:49:03
Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

Abstract

Text spotting, a task involving the extraction of textual information from images or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and data. The code and datasets will be made available at this https URL.
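
The abstract reports video-level gains on the MOTA metric without defining it. As a reading aid, here is a minimal sketch of the standard Multiple Object Tracking Accuracy formula, MOTA = 1 - (FN + FP + IDSW) / GT, with error counts summed over all frames. The function name `mota` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper, and the benchmarks' official evaluation scripts may differ in matching details.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Standard MOTA: 1 - (FN + FP + IDSW) / GT, counts summed over all frames.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Illustrative counts only (not results from the paper).
score = mota(false_negatives=30, false_positives=15, id_switches=5, num_gt=200)
print(f"MOTA = {score:.3f}")
```

Because the three error types are summed before dividing by the number of ground-truth objects, MOTA can be negative when a tracker makes more errors than there are ground-truth text instances.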

URL

https://arxiv.org/abs/2404.19652

PDF

https://arxiv.org/pdf/2404.19652.pdf

