Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation

2025-09-16 16:43:34
Fitsum Sileshi Beyene, Christopher L. Dancy

Abstract

Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We use three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.
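
The abstract names the annotation-free metrics but does not define them, so the following is only a rough sketch of how entropy-style and redundancy-style scores over OCR regions might be computed. Everything here is an assumption for illustration: the function names (`toy_region_entropy`, `toy_redundancy`), whitespace tokenization, the use of mean per-region Shannon entropy, and the use of mean pairwise Jaccard overlap are not taken from the paper and may differ from the authors' RE and TRS. A coherence score such as SCS would likely need a language model and is not sketched.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def shannon_entropy(tokens):
    # Shannon entropy (in bits) of the token frequency distribution.
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def toy_region_entropy(regions):
    # Hypothetical stand-in for an entropy-style metric:
    # mean token entropy over the OCR text of each region.
    if not regions:
        return 0.0
    return sum(shannon_entropy(tokenize(r)) for r in regions) / len(regions)


def toy_redundancy(regions):
    # Hypothetical stand-in for a redundancy-style metric:
    # mean pairwise Jaccard overlap between the token sets of regions.
    token_sets = [set(tokenize(r)) for r in regions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up region texts, purely for demonstration.
    regions = ["city council meets tonight", "council meets tonight downtown"]
    print(toy_region_entropy(regions), toy_redundancy(regions))
```

Under these assumed definitions, a full-page baseline that returns one large block of text yields a single region, while a layout-aware pipeline yields many, so the two kinds of output can be compared without ground-truth transcriptions.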

URL

https://arxiv.org/abs/2509.13236

PDF

https://arxiv.org/pdf/2509.13236.pdf

