Document Provenance and Authentication through Authorship Classification

Abstract

Style analysis, a relatively unexplored topic, enables several interesting applications. For instance, it allows co-authors to adjust their writing styles to produce a more coherent collaborative document. Style analysis can also serve as a first step in document provenance and authentication. In this paper, we propose an ensemble-based text-processing framework for classifying documents as single-authored or multi-authored, which is one of the key tasks in style analysis. The proposed framework incorporates several state-of-the-art text classification algorithms, including classical Machine Learning (ML) algorithms, transformers, and deep learning algorithms, both individually and in merit-based late fusion. For the merit-based late fusion, we employ several weight optimization and selection methods to assign merit-based weights to the individual text classification algorithms. We also analyze the impact of characters that are usually removed during pre-processing in NLP applications by conducting experiments on both clean and unclean data. The proposed framework is evaluated on a large-scale benchmark dataset, where it significantly improves performance over existing solutions.
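
To make the idea of merit-based late fusion concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each classifier outputs a probability that a document is multi-authored, and it derives the weights by normalizing validation scores, which is only one simple stand-in for the weight optimization and selection methods mentioned above. The model names (`svm`, `bert`, `lstm`), the scores, and the probabilities are all hypothetical.

```python
from typing import Dict, List


def merit_weights(val_scores: Dict[str, float]) -> Dict[str, float]:
    """Turn per-model validation scores into weights that sum to 1.

    Assumption: normalizing validation scores is a simple illustrative
    choice, not the optimization procedure used in the paper.
    """
    total = sum(val_scores.values())
    if total <= 0:
        raise ValueError("validation scores must sum to a positive value")
    return {name: score / total for name, score in val_scores.items()}


def late_fusion(probs: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of each model's P(multi-authored) per document."""
    n_docs = len(next(iter(probs.values())))
    fused = [0.0] * n_docs
    for name, model_probs in probs.items():
        if len(model_probs) != n_docs:
            raise ValueError(f"model {name!r} has a different number of documents")
        for i, p in enumerate(model_probs):
            fused[i] += weights[name] * p
    return fused


# Hypothetical validation scores and per-document probabilities.
val_scores = {"svm": 0.70, "bert": 0.80, "lstm": 0.50}
probs = {
    "svm": [0.9, 0.2, 0.6],
    "bert": [0.8, 0.1, 0.4],
    "lstm": [0.6, 0.5, 0.3],
}

weights = merit_weights(val_scores)
fused = late_fusion(probs, weights)

# 1 = multi-authored, 0 = single-authored, using a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in fused]
print(weights)
print(fused)
print(labels)
```

Because fusion happens on the model outputs rather than on the features, each classifier can be trained and tuned independently, and only the weights need to be re-estimated when a model is added or dropped.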

URL

https://arxiv.org/abs/2303.01197

PDF

https://arxiv.org/pdf/2303.01197.pdf