Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition


Abstract

Chinese has more than 80,000 character categories, most of which are rarely used. Building a high-performance handwritten Chinese character recognition (HCCR) system that supports the full character set with a traditional approach requires collecting many training samples for each character category, which is both time-consuming and expensive. In this paper, we propose a novel approach that transforms Chinese character glyph images generated from font libraries into handwritten ones with a denoising diffusion probabilistic model (DDPM). Trained on handwritten samples of a small character set, the DDPM is capable of mapping printed strokes to handwritten ones, which makes it possible to generate photo-realistic handwritten samples in diverse styles for unseen character categories. By combining DDPM-synthesized samples of unseen categories with real samples of other categories, we can build an HCCR system that supports the full character set. Experimental results on the CASIA-HWDB dataset with 3,755 character categories show that HCCR systems trained with synthetic samples achieve recognition accuracy similar to that of the system trained with real samples. The proposed method has the potential to address HCCR with a larger vocabulary.
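The abstract does not give implementation details, so the following is only a minimal sketch of how one training example for a glyph-conditioned DDPM might be assembled. It uses the standard DDPM forward (noising) process with a linear noise schedule. The function names, the 64x64 image size, the schedule endpoints, and the choice to condition on the printed glyph by channel concatenation are all assumptions made for illustration, not the authors' confirmed design. Random arrays stand in for a real handwritten sample and a font-rendered glyph.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (endpoint values are assumed defaults)."""
    return np.linspace(beta_start, beta_end, num_steps)

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, noise):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def make_training_example(handwritten, glyph, t, alpha_bar, rng):
    """Build one (input, target) pair for a noise-prediction network.

    ASSUMPTION: the printed glyph conditions the model by being stacked
    with the noisy handwritten image as a second channel.
    """
    noise = rng.standard_normal(handwritten.shape)
    noisy = q_sample(handwritten, t, alpha_bar, noise)
    model_input = np.stack([noisy, glyph], axis=0)  # shape (2, H, W)
    return model_input, noise  # the network would be trained to predict `noise`

rng = np.random.default_rng(0)
alpha_bar = cumulative_alphas(linear_beta_schedule(1000))

# Placeholders scaled to [-1, 1]; real data would come from a handwriting
# dataset and a font renderer.
handwritten = rng.uniform(-1.0, 1.0, size=(64, 64))
glyph = rng.uniform(-1.0, 1.0, size=(64, 64))

model_input, target = make_training_example(
    handwritten, glyph, t=500, alpha_bar=alpha_bar, rng=rng
)
```

At generation time, a model trained this way would start from pure noise and denoise step by step while being shown the glyph of a character category that has no handwritten samples, which is the zero-shot setting the abstract describes.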

Abstract (translated)

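There are more than 80,000 categories of Chinese characters, but most of them are rarely used. Building a high-performance handwritten Chinese character recognition system that supports the full character set with traditional methods requires collecting many training samples for each character category, which is both time-consuming and expensive. In this paper, we propose a novel method that uses a denoising diffusion probabilistic model (DDPM) to convert Chinese character glyph images generated from font libraries into handwritten images. By training on handwritten samples of a small character set, the DDPM is able to map printed strokes to handwritten ones, and thus to generate realistic and diverse handwritten samples of categories it has never seen. Combining DDPM-synthesized samples of unseen categories with real samples of other categories makes it possible to build an HCCR system that supports the full character set. Experimental results on the CASIA-HWDB dataset, which contains 3,755 character categories, show that HCCR systems trained with synthetic samples are similar in recognition accuracy to the system trained with real samples. The method has the potential to address HCCR with a larger vocabulary.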

URL

https://arxiv.org/abs/2305.15660

PDF

https://arxiv.org/pdf/2305.15660.pdf