Abstract
This paper presents the Coswara dataset, which contains a diverse set of respiratory sounds and rich metadata recorded between April 2020 and February 2022 from 2635 individuals (1819 SARS-CoV-2 negative, 674 positive, and 142 recovered subjects). The respiratory sounds comprise nine sound categories associated with variants of breathing, cough, and speech. The metadata contains demographic information on age, gender, and geographic location, as well as health information on symptoms, pre-existing respiratory ailments, comorbidity, and SARS-CoV-2 test status. Our study is the first of its kind to annotate the audio quality of the entire dataset (amounting to 65 hours) through manual listening. The paper summarizes the data collection procedure and the demographic, symptom, and audio data information. A COVID-19 classifier based on a bi-directional long short-term memory (BLSTM) architecture is trained and evaluated on the different population sub-groups contained in the dataset to understand the bias and fairness of the model. This enables an analysis of the impact of gender, geographic location, date of recording, and language proficiency on COVID-19 detection performance.
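The sub-group evaluation described above can be illustrated with a minimal sketch. It assumes each test recording has a binary label, a classifier score, and a metadata field to group by, and it assumes the area under the ROC curve (AUC) as the metric, which the abstract does not name. The function names, record layout, and toy values are hypothetical and are not taken from the Coswara dataset or its code.

```python
from collections import defaultdict

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auc(records, key):
    """Group records by a metadata field and compute AUC within each group."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = groups[r[key]]
        labels.append(r["label"])
        scores.append(r["score"])
    return {g: auc(l, s) for g, (l, s) in groups.items()}

# Hypothetical toy records (label 1 = positive), for illustration only.
records = [
    {"gender": "female", "label": 1, "score": 0.9},
    {"gender": "female", "label": 0, "score": 0.2},
    {"gender": "female", "label": 0, "score": 0.4},
    {"gender": "male", "label": 1, "score": 0.3},
    {"gender": "male", "label": 0, "score": 0.6},
    {"gender": "male", "label": 1, "score": 0.8},
]
result = subgroup_auc(records, "gender")
```

A large gap between the per-group values would suggest that the model performs unevenly across that metadata field. The same call could be repeated with other keys, such as location or recording date, if those fields are present in the records.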
Abstract (translated)
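This paper introduces the Coswara dataset, a collection of varied respiratory sounds and rich metadata recorded between April 2020 and February 2022 from 2635 individuals (1819 SARS-CoV-2 negative, 674 positive, and 142 recovered). The respiratory sounds cover nine sound categories related to variants of breathing, cough, and speech. The metadata includes demographic data on age, gender, and geographic location, as well as health information on symptoms, pre-existing respiratory ailments, comorbidity, and SARS-CoV-2 test status. Our study is the first of its kind to annotate the audio quality of the entire dataset (65 hours in total) through manual listening. The paper summarizes the data collection procedure and the demographic, symptom, and audio data information. A COVID-19 classifier based on a bi-directional long short-term memory (BLSTM) architecture is trained and evaluated on the different population sub-groups in the dataset to understand the bias and fairness of the model. This makes it possible to analyze the impact of gender, geographic location, date of recording, and language proficiency on COVID-19 detection performance.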
URL
https://arxiv.org/abs/2305.12741