Paper Reading AI Learner

GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

2025-05-28 06:37:06
Zhihong Tang, Yang Li

Abstract

Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present the Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images that ensures both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations. First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with a parametric generation mechanism that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping; this improves local consistency as well as model generalization. Third, a modified NestUNet architecture incorporating dense blocks to effectively fuse low-level pixel features with high-level semantic features, specifically adapted to document image characteristics. In addition, to improve generalization, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, which achieves state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits strong cross-domain adaptability and maintains computational efficiency on high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.
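
The abstract describes the parametric-generation idea only at a high level. As a rough illustration, and not the authors' implementation, the PyTorch sketch below predicts per-pixel affine parameters (a gain and a bias) from a degraded color patch and applies them to the input, instead of regressing enhanced pixels directly; the affine parameterization, the ParametricEnhanceHead module name, and all layer widths are assumptions made for this example.

```python
# Hypothetical sketch of a parametric-generation enhancement head (PyTorch).
# The affine (gain/bias) parameterization and all names are illustrative
# assumptions; the paper's actual Local-Refine design may differ.
import torch
import torch.nn as nn


class ParametricEnhanceHead(nn.Module):
    """Predicts intermediate parameters (per-pixel gain and bias) and applies
    them to the input, rather than predicting enhanced pixels directly."""

    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        # Stand-in feature extractor for the local-refinement branch.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_gain = nn.Conv2d(feat_ch, in_ch, 3, padding=1)
        self.to_bias = nn.Conv2d(feat_ch, in_ch, 3, padding=1)

    def forward(self, x):
        f = self.backbone(x)
        gain = torch.sigmoid(self.to_gain(f)) * 2.0   # per-pixel gains in (0, 2)
        bias = torch.tanh(self.to_bias(f))            # per-pixel biases in (-1, 1)
        # Enhanced output is a learned transform of the input, not a direct prediction.
        return torch.clamp(gain * x + bias, 0.0, 1.0)


if __name__ == "__main__":
    model = ParametricEnhanceHead()
    degraded = torch.rand(1, 3, 256, 256)  # dummy color document patch in [0, 1]
    enhanced = model(degraded)
    print(enhanced.shape)                  # torch.Size([1, 3, 256, 256])
```

Predicting parameters that transform the input, rather than the output pixels themselves, is the property the abstract credits for better local consistency and generalization; in a coarse-to-fine pipeline, such a head would refine the output of a low-resolution global appearance-correction stage.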

URL

https://arxiv.org/abs/2505.22021

PDF

https://arxiv.org/pdf/2505.22021.pdf

