CPN: Complementary Proposal Network for Unconstrained Text Detection

2024-02-18 10:43:53
Longhuang Wu, Shangxuan Tian, Youxin Wang, Pengfei Xiong

Abstract

Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well suited to irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel at complex layouts but struggle with irregular shapes. To combine their strengths and overcome their respective weaknesses, we propose a Complementary Proposal Network (CPN) that integrates semantic and geometric information in parallel for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals using an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals from pre-defined anchors. To further enhance the complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches by significant margins at comparable computational cost. Specifically, our approach achieves improvements of 3.6%, 1.3%, and 1.0% on the challenging benchmarks ICDAR19-ArT, IC15, and MSRA-TD500, respectively. Code for our method will be released.
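
The idea of pooling proposals from two parallel branches can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how CPN combines its proposals, so the greedy IoU-based suppression below is a swapped-in, generic technique, and the axis-aligned boxes, the names `iou` and `merge_proposals`, and the 0.5 threshold are all assumptions made for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), axis-aligned (assumption)
Proposal = Tuple[Box, float]              # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_proposals(
    semantic: List[Proposal],
    geometric: List[Proposal],
    iou_threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[Proposal]:
    """Pool both proposal sets and keep the highest-scoring non-overlapping ones.

    Generic greedy non-maximum suppression: visit proposals in descending
    score order and drop any that overlap an already kept proposal too much.
    """
    pooled = sorted(semantic + geometric, key=lambda p: p[1], reverse=True)
    kept: List[Proposal] = []
    for box, score in pooled:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    # Hypothetical proposals: one region found by both branches, one found
    # only by the anchor-based branch.
    semantic = [((0.0, 0.0, 10.0, 10.0), 0.9)]
    geometric = [((1.0, 1.0, 10.0, 10.0), 0.8), ((20.0, 20.0, 30.0, 30.0), 0.7)]
    for box, score in merge_proposals(semantic, geometric):
        print(box, score)
```

In this toy input the two overlapping boxes collapse into the higher-scoring one, while the region seen by only one branch survives, which is the sense in which the two proposal sources complement each other.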

URL

https://arxiv.org/abs/2402.11540

PDF

https://arxiv.org/pdf/2402.11540.pdf

