Abstract
Text detection is widely used in vision-based mobile robots that must interpret text in their surroundings to perform a given task. For instance, delivery robots in multilingual cities must be capable of multilingual text detection so that they can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train models to recognize novel languages. However, collecting and labeling training data for a novel language is cumbersome, and the effort required to re-train an existing text detector is considerable. Even worse, this routine repeats every time a new language appears. This motivates us to propose a new problem setting that tackles these challenges more efficiently: "We ask for a generalizable multilingual text detection framework that detects and identifies both seen and unseen language regions in scene images, without requiring the collection of supervised training data for unseen languages or model re-training". To this end, we propose "MENTOR", the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
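The problem setting above can be summarized as an inference-time contract: a detector trained once on seen languages is asked to localize and identify text regions of an unseen language, conditioned only on a few reference samples of that language and with no re-training. The following is a minimal illustrative sketch of that interface; every name here (`MultilingualTextDetector`, `detect`, `few_shot_refs`) is a hypothetical placeholder, not MENTOR's actual API, and the detection logic is a stub.

```python
# Hypothetical sketch of the paper's problem setting: a frozen detector
# that accepts few-shot references for unseen languages at inference time.
# All class/function names are illustrative assumptions, not MENTOR's API.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) of a text region


@dataclass
class Detection:
    box: Box
    language: str  # identified language label


class MultilingualTextDetector:
    """Trained once on seen languages; its weights stay frozen afterwards."""

    def __init__(self, seen_languages: List[str]):
        self.seen_languages = list(seen_languages)

    def detect(self, image, few_shot_refs: Dict[str, list]) -> List[Detection]:
        # few_shot_refs maps an unseen-language name to a handful of
        # reference glyph crops; the model conditions on them instead of
        # being re-trained. The body below is placeholder logic only.
        candidate_languages = self.seen_languages + list(few_shot_refs)
        return [Detection(box=(0, 0, 10, 10), language=candidate_languages[0])]


detector = MultilingualTextDetector(seen_languages=["English", "Chinese"])
# An unseen language ("Thai") is added at inference time via references,
# with no change to the trained model itself.
dets = detector.detect(image=None, few_shot_refs={"Thai": ["crop1", "crop2"]})
```

The key design point the sketch highlights is that unseen languages enter only through the `few_shot_refs` argument at inference time, never through the training set, matching the "between zero-shot and few-shot" framing of the abstract.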
URL
https://arxiv.org/abs/2403.07286