Abstract
Writing assistance is an application closely tied to everyday life and a fundamental research area in Natural Language Processing (NLP). Its aim is to improve the correctness and quality of input texts, and character checking, which detects and corrects wrong characters, is a crucial part of it. In real-world settings, where handwriting predominates, the characters that humans get wrong include faked characters (i.e., nonexistent characters created by writing errors) and misspelled characters (i.e., real characters used incorrectly due to spelling errors). However, existing datasets and related studies focus only on misspelled characters, which are mainly caused by phonological or visual confusion, and thereby ignore faked characters, which are more common and more difficult. To address this gap, we present Visual-C$^3$, a human-annotated Visual Chinese Character Checking dataset containing both faked and misspelled Chinese characters. To the best of our knowledge, Visual-C$^3$ is the first real-world visual dataset and the largest human-crafted dataset for the Chinese character checking scenario. We also propose and evaluate novel baseline methods on Visual-C$^3$. Extensive empirical results and analyses show that Visual-C$^3$ is high-quality yet challenging. The Visual-C$^3$ dataset and the baseline methods will be made publicly available to facilitate further research in the community.
URL
https://arxiv.org/abs/2311.11268