NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement

2024-04-08 16:52:21
Giordano Cicchetti, Danilo Comminiello

Abstract

Real-world documents often suffer from various forms of degradation, which lowers the accuracy of optical character recognition (OCR) systems. A preprocessing step is therefore essential to eliminate noise while preserving the text and key features of documents. In this paper, we propose NAF-DPM, a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents. While DPMs are recognized for their high-quality generated images, they are also known for their long inference time. To mitigate this problem, we equip the DPM with an efficient nonlinear activation-free (NAF) network and use a fast ordinary differential equation (ODE) solver as the sampler, which can converge in a few iterations. To better preserve text characters, we introduce an additional differentiable module based on convolutional recurrent neural networks, which simulates the behavior of an OCR system during training. Experiments conducted on various datasets showcase the superiority of our approach, which achieves state-of-the-art performance in terms of pixel-level and perceptual similarity metrics. Furthermore, the results demonstrate a notable reduction in the character errors made by OCR systems when transcribing real-world document images enhanced by our framework. Code and pre-trained models are available via the link on the arXiv page listed below.
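
The abstract reports a reduction in OCR character errors but does not say how they are counted. As an illustration only, the sketch below computes a character error rate (CER) in the usual way: the Levenshtein edit distance between a reference transcription and the OCR output, divided by the reference length. The function names and the sample strings are hypothetical and are not taken from the paper or its code release.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: OCR on a degraded scan reads "m" as "rn".
reference = "document"
ocr_on_degraded = "docurnent"
ocr_on_enhanced = "document"

cer_before = character_error_rate(reference, ocr_on_degraded)  # 2 / 8 = 0.25
cer_after = character_error_rate(reference, ocr_on_enhanced)   # 0 / 8 = 0.0
```

A lower CER after enhancement is the kind of improvement the abstract describes; the actual evaluation protocol and OCR engine used by the authors may differ.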

URL

https://arxiv.org/abs/2404.05669

PDF

https://arxiv.org/pdf/2404.05669.pdf
