Paper Reading AI Learner

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

2024-03-26 16:48:13
Bhawna Piryani, Jamshid Mozafari, Adam Jatowt

Abstract

Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale dataset with 485K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from a cleaner, corrected version of the content, and answering questions from scanned images of newspaper pages. This, together with the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets, makes it a unique and useful resource.

URL

https://arxiv.org/abs/2403.17859

PDF

https://arxiv.org/pdf/2403.17859.pdf

