Abstract
Transformer-based neural networks have achieved remarkable performance in natural language processing tasks such as sentiment analysis. However, ensuring the reliability of these complex architectures through comprehensive testing remains an open problem. This paper presents a set of coverage criteria specifically designed to assess test suites for transformer-based sentiment analysis networks. Our approach applies input space partitioning, a black-box method, over emotionally relevant linguistic features such as verbs, adjectives, adverbs, and nouns. To generate test cases that cover a wide range of emotional elements efficiently, we employ the k-projection coverage metric, which reduces the dimensionality of the problem by examining only subsets of k features at a time. Large language models are used to generate sentences that exhibit specific combinations of emotional features. Experiments on a sentiment analysis dataset show that our criteria and generated tests yield an average increase of 16\% in test coverage and a corresponding average decrease of 6.5\% in model accuracy, demonstrating their ability to expose vulnerabilities. Our work provides a foundation for improving the reliability of transformer-based sentiment analysis systems through comprehensive test evaluation.
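To make the k-projection coverage idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes each test sentence is encoded as a binary vector indicating which emotionally relevant feature categories it contains, and it measures the fraction of all value combinations over every subset of k features that the test suite exercises. The feature list, encoding, and function name are illustrative assumptions.

```python
# Hypothetical sketch of k-projection coverage over binary linguistic features.
from itertools import combinations

# Assumed emotionally relevant feature categories (binary presence per sentence).
FEATURES = ["verb", "adjective", "adverb", "noun"]

def k_projection_coverage(test_suite, k=2):
    """Fraction of all k-feature value combinations exercised by the suite.

    `test_suite` is a list of dicts mapping each feature name to 0/1
    (absent/present in the sentence) -- an assumed encoding, not the
    paper's exact formulation.
    """
    covered = 0
    total = 0
    for proj in combinations(FEATURES, k):           # every subset of k features
        seen = {tuple(case[f] for f in proj) for case in test_suite}
        total += 2 ** k                               # all 0/1 assignments for this projection
        covered += len(seen)
    return covered / total

# Example: two test sentences encoded by which feature categories they contain.
suite = [
    {"verb": 1, "adjective": 1, "adverb": 0, "noun": 1},
    {"verb": 0, "adjective": 1, "adverb": 1, "noun": 0},
]
print(f"2-projection coverage: {k_projection_coverage(suite, k=2):.2f}")
```

Restricting attention to k-wise projections keeps the number of combinations polynomial in the number of features, which is what makes the coverage target tractable compared with enumerating the full feature space.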
URL
https://arxiv.org/abs/2407.20884