Abstract
Despite their cultural and historical significance, Black digital archives remain structurally underrepresented in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and scarce annotated layout data hinder accurate transcription, even though many systems claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored to Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We use three annotation-free evaluation metrics: the Semantic Coherence Score (SCS), Region Entropy (RE), and the Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions, respectively. Our evaluation on a 400-page dataset drawn from ten Black newspaper titles shows that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.
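The abstract names the three metrics but does not give their formulas. To make the idea of annotation-free scoring concrete, the following is a minimal sketch under assumed definitions, not the paper's implementation: `region_entropy` is taken here to be the mean Shannon entropy of each region's token distribution, and `textual_redundancy` the fraction of word n-grams that repeat one already seen in the page's regions. The function names, the whitespace tokenizer, and the n-gram size are all hypothetical. SCS is left out because it would require a language model.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokens stand in for real OCR text normalization.
    return text.lower().split()


def region_entropy(regions):
    """Assumed RE: mean Shannon entropy (in bits) of each region's token distribution."""
    entropies = []
    for text in regions:
        tokens = tokenize(text)
        if not tokens:
            continue  # skip empty regions rather than counting them as zero entropy
        counts = Counter(tokens)
        total = len(tokens)
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return sum(entropies) / len(entropies) if entropies else 0.0


def textual_redundancy(regions, n=3):
    """Assumed TRS: share of word n-grams that repeat an n-gram seen earlier on the page."""
    seen = set()
    total = 0
    repeated = 0
    for text in regions:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            total += 1
            if gram in seen:
                repeated += 1
            else:
                seen.add(gram)
    return repeated / total if total else 0.0
```

Under these assumptions, a full-page baseline would be scored by passing its output as a single-element list, and a layout-aware run by passing one string per detected region, so the two can be compared without any ground-truth transcription.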
URL
https://arxiv.org/abs/2509.13236