SkIn: Skimming-Intensive Long-Text Classification Based on BERT and Application to Medical Corpus

Abstract
Abstract (translated)
URL
PDF

Abstract

BERT is a widely used pre-trained model in natural language processing. However, because its time and space requirements increase with a quadratic level of the text length, the BERT model is difficult to use directly on the long-text corpus. The collected text data is usually quite long in some fields, such as health care. Therefore, to apply the pre-trained language knowledge of BERT to long text, in this paper, imitating the skimming-intensive reading method used by humans when reading a long paragraph, the Skimming-Intensive Model (SkIn) is proposed. It can dynamically select the critical information in the text so that the length of the input into the BERT-Base model is significantly reduced, which can effectively save the cost of the classification algorithm. Experiments show that the SkIn method has achieved better results than the baselines on long-text classification datasets in the medical field, while its time and space requirements increase linearly with the text length, alleviating the time and space overflow problem of BERT on long-text data.

Abstract (translated)

URL

https://arxiv.org/abs/2209.05741

PDF

https://arxiv.org/pdf/2209.05741.pdf