Representing Online Handwriting for Recognition in Large Vision-Language Models

Abstract

The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that represents the ink both as a time-ordered sequence of strokes encoded as text and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
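To make the idea of "strokes as text" concrete, the following is a minimal illustrative sketch, not the paper's actual tokenization. The function name `ink_to_text`, the 100-unit quantization grid, and the ` | ` stroke separator are all assumptions chosen for illustration; the abstract does not specify these details.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # points in the order they were drawn


def ink_to_text(strokes: List[Stroke], grid: int = 100) -> str:
    """Serialize time-ordered strokes as text made of quantized coordinates.

    Hypothetical format (an assumption, not taken from the paper):
    points are written as "x,y", points within a stroke are separated
    by spaces, and strokes are separated by " | ".
    """
    points = [p for stroke in strokes for p in stroke]
    if not points:
        return ""

    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    max_x = max(x for x, _ in points)
    max_y = max(y for _, y in points)

    # A single shared scale preserves the aspect ratio of the ink.
    # Fall back to 1.0 when all points coincide to avoid dividing by zero.
    scale = max(max_x - min_x, max_y - min_y) or 1.0

    parts = []
    for stroke in strokes:
        coords = []
        for x, y in stroke:
            qx = round((x - min_x) / scale * grid)
            qy = round((y - min_y) / scale * grid)
            coords.append(f"{qx},{qy}")
        parts.append(" ".join(coords))
    return " | ".join(parts)


# Two strokes: a horizontal line, then a short vertical line.
example = [[(0.0, 0.0), (10.0, 0.0)], [(5.0, 0.0), (5.0, 5.0)]]
print(ink_to_text(example))  # prints "0,0 100,0 | 50,0 50,50"
```

In a setup like the one the abstract describes, a string of this kind would be supplied to the VLM alongside an image of the same ink; how the two inputs are actually combined is not stated here.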

URL

https://arxiv.org/abs/2402.15307

PDF

https://arxiv.org/pdf/2402.15307.pdf
