Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets -- with low-resolution images obtained by degrading and downsampling high-resolution ones -- they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network, text recognition via a convolutional recurrent neural network, keypoint localization using this http URL, and hue consistency. To balance these diverse objectives, we employ a dynamic weight averaging mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. We validate our approach on the SRResNet architecture, a well-established model for single-image super-resolution. Experimental evaluations on both simulated and real-world scanned document datasets demonstrate that the proposed approach improves text detection, measured with intersection over union, while preserving overall image fidelity. These findings underscore the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.
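As a concrete reference, dynamic weight averaging is usually implemented with the softmax-over-descent-rate rule of Liu et al.; the sketch below follows that common formulation, with the temperature and loss names as illustrative assumptions rather than the exact settings used in this paper.

```python
import numpy as np

def dwa_weights(loss_history, temperature=2.0):
    """Dynamic weight averaging: weight each task by its recent descent rate,
    softmax-normalized with a temperature.

    loss_history: list of per-epoch loss vectors, one value per task,
                  e.g. [[L_sr, L_det, L_rec, L_kpt, L_hue], ...]
    """
    K = len(loss_history[-1])
    if len(loss_history) < 2:
        return np.ones(K)                       # equal weights for the first epochs
    prev = np.asarray(loss_history[-1], dtype=float)
    prev2 = np.asarray(loss_history[-2], dtype=float)
    r = prev / np.maximum(prev2, 1e-8)          # relative convergence rate per task
    exp_r = np.exp(r / temperature)
    return K * exp_r / exp_r.sum()              # weights sum to K, as in DWA

# usage: total_loss = sum(w * L for w, L in zip(dwa_weights(history), current_losses))
```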
https://arxiv.org/abs/2506.06953
Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic-range and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multimodal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low-light environments where RGB cameras struggle while using up to 2400 times less bandwidth than a wearable RGB camera.
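A minimal sketch of what gaze-driven foveation of an event stream can look like: only events inside a window around the current gaze point are kept. The (x, y, t, p) array layout and window size are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def foveate_events(events, gaze_xy, radius=64):
    """Keep only events inside a square window centred on the gaze point.

    events : (N, 4) array with columns (x, y, timestamp, polarity) -- assumed layout
    gaze_xy: (x, y) gaze position in pixel coordinates
    """
    gx, gy = gaze_xy
    keep = (np.abs(events[:, 0] - gx) <= radius) & (np.abs(events[:, 1] - gy) <= radius)
    return events[keep]

# e.g. a 128x128 crop of a 1280x720 sensor keeps roughly 2% of the pixel area,
# which is the kind of reduction the ~98% bandwidth figure refers to.
events = np.random.rand(10000, 4) * [1280, 720, 1.0, 1]
print(foveate_events(events, gaze_xy=(640, 360)).shape)
```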
https://arxiv.org/abs/2506.06918
Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show "Who Wants to Be a Millionaire?" (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.
https://arxiv.org/abs/2506.05991
Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but also in the broader domain of pattern recognition. The dataset is available at this https URL.
https://arxiv.org/abs/2506.04807
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, and then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at this https URL.
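A rough sketch of a task-chaining compositor in the image-removal-segmentation order described above: the demonstration row shows an image with its removal and segmentation results, and the query row leaves those two cells masked for the model to fill in. The grid layout and gray masking are assumptions for illustration, not the exact ConText composition.

```python
import numpy as np

def task_chain_compositor(demo_img, demo_removal, demo_seg, query_img):
    """Compose a single prompt canvas in the image-removal-segmentation order.

    Top row:    demonstration image, its text-removal result, its text mask.
    Bottom row: query image followed by two masked (gray) cells to be predicted.
    All inputs are HxWx3 uint8 arrays of the same size.
    """
    h, w, _ = query_img.shape
    masked = np.full((h, w, 3), 127, dtype=np.uint8)   # masked query labels
    top = np.hstack([demo_img, demo_removal, demo_seg])
    bottom = np.hstack([query_img, masked, masked])
    return np.vstack([top, bottom])
```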
https://arxiv.org/abs/2506.03799
Visually impaired individuals face significant challenges navigating and interacting with unknown situations, particularly in tasks requiring spatial awareness and semantic scene understanding. To accelerate the development and evaluate the state of technologies that enable visually impaired people to solve these tasks, the Vision Assistance Race (VIS) at the Cybathlon 2024 competition was organized. In this work, we present Sight Guide, a wearable assistive system designed for the VIS. The system processes data from multiple RGB and depth cameras on an embedded computer that guides the user through complex, real-world-inspired tasks using vibration signals and audio commands. Our software architecture integrates classical robotics algorithms with learning-based approaches to enable capabilities such as obstacle avoidance, object detection, optical character recognition, and touchscreen interaction. In a testing environment, Sight Guide achieved a 95.7% task success rate, and further demonstrated its effectiveness during the Cybathlon competition. This work provides detailed insights into the system design, evaluation results, and lessons learned, and outlines directions towards a broader real-world applicability.
https://arxiv.org/abs/2506.02676
The inherent complexities of Arabic script, including its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
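For readers comparing the reported numbers, WER and CER are normalized Levenshtein distances computed over words and characters respectively; a minimal reference implementation (not the authors' evaluation code):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```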
https://arxiv.org/abs/2506.02295
Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.
https://arxiv.org/abs/2506.01775
Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset's composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.
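A hedged sketch of the kind of clean, book-style page rendering such a synthetic dataset relies on, using Pillow; the font path is a placeholder for one of the ten fonts, and proper Arabic output would additionally need right-to-left shaping, which is omitted to keep the sketch short.

```python
from PIL import Image, ImageDraw, ImageFont

def render_page(text, font_path, font_size=28, size=(1240, 1754), margin=80):
    """Render a block of text onto a clean white page image (no scan noise)."""
    page = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, font_size)
    max_width = size[0] - 2 * margin

    lines, line = [], ""
    for word in text.split():                          # naive greedy line wrapping
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) > max_width and line:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)

    for i, ln in enumerate(lines):
        draw.text((margin, margin + i * int(font_size * 1.6)), ln, font=font, fill="black")
    return page
```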
https://arxiv.org/abs/2505.24600
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: this https URL
https://arxiv.org/abs/2505.23566
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
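A sketch of how a UTF-8 based text encoder for cross-attention might look: the extracted characters are encoded as byte tokens over a 256-entry vocabulary, so every script shares one small embedding table. Dimensions, sequence length, and padding are assumptions, not the actual TextSR architecture.

```python
import torch
import torch.nn as nn

class Utf8TextEncoder(nn.Module):
    """Embed OCR'd text as a sequence of UTF-8 byte tokens; the resulting
    sequence can be attended to by a diffusion backbone via cross-attention."""

    def __init__(self, dim=256, max_bytes=64):
        super().__init__()
        self.max_bytes = max_bytes
        self.byte_emb = nn.Embedding(257, dim, padding_idx=256)  # 256 byte values + pad
        self.pos_emb = nn.Embedding(max_bytes, dim)

    def forward(self, texts):
        batch = []
        for t in texts:
            b = list(t.encode("utf-8"))[: self.max_bytes]
            b += [256] * (self.max_bytes - len(b))               # pad to fixed length
            batch.append(b)
        ids = torch.tensor(batch)
        pos = torch.arange(self.max_bytes).unsqueeze(0)
        return self.byte_emb(ids) + self.pos_emb(pos)

# enc = Utf8TextEncoder(); enc(["Café", "北京"]).shape  -> (2, 64, 256)
```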
https://arxiv.org/abs/2505.23119
Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.
https://arxiv.org/abs/2505.21333
Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
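To make the idea of fine-grained, non-binary feedback concrete, the sketch below scores the recognition and translation sub-skills with normalized string similarity and mixes them with fixed weights. The components and weights are illustrative assumptions, not the paper's actual reward formulation.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Non-binary similarity in [0, 1] based on matching character ratio."""
    return SequenceMatcher(None, a, b).ratio()

def timt_reward(pred_ocr, gold_ocr, pred_translation, gold_translation,
                w_rec=0.4, w_trans=0.6):
    """Illustrative multi-mixed reward: partial credit for both the recognition
    and translation sub-skills instead of a single 0/1 success signal."""
    r_rec = text_similarity(pred_ocr, gold_ocr)
    r_trans = text_similarity(pred_translation, gold_translation)
    return w_rec * r_rec + w_trans * r_trans

print(timt_reward("Happy birthda", "Happy birthday", "生日快乐", "生日快乐!"))
```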
https://arxiv.org/abs/2505.19714
Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. However, these approaches often require access to the original document images, which may not always be available due to privacy, storage, or transmission constraints. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation. By utilizing the spatial positions and textual content of OCR-detected words, our method enables document alignment without relying on pixel-level image data. This technique is particularly valuable in scenarios where only OCR outputs are accessible. Furthermore, the method is robust to OCR noise, incorporating RANSAC to handle outliers and inaccuracies in the OCR data. On a set of test documents, we demonstrate that our OCR-based approach even performs more accurately than traditional image-based methods, offering a more efficient and scalable solution for document registration tasks. The proposed method facilitates applications in document processing, all while reducing reliance on high-dimensional image data.
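A hedged sketch of the core idea: match OCR'd words between the two documents by their text content, then estimate a homography from the matched word centres, with RANSAC rejecting bad matches and noisy box positions. The OCR output format and the naive string-matching rule are assumptions; OpenCV's findHomography does the robust fitting.

```python
import cv2
import numpy as np

def homography_from_ocr(words_a, words_b, ransac_thresh=5.0):
    """Estimate the homography mapping document A onto document B.

    words_a / words_b: lists of (text, (x, y)) pairs, where (x, y) is the
    centre of the OCR-detected word -- an assumed output format.
    """
    index_b = {}
    for text, xy in words_b:
        index_b.setdefault(text, []).append(xy)

    pts_a, pts_b = [], []
    for text, xy in words_a:
        if index_b.get(text):                      # naive matching: identical strings
            pts_a.append(xy)
            pts_b.append(index_b[text].pop(0))

    if len(pts_a) < 4:
        raise ValueError("need at least 4 matched words to fit a homography")
    H, inlier_mask = cv2.findHomography(np.float32(pts_a), np.float32(pts_b),
                                        cv2.RANSAC, ransac_thresh)
    return H, inlier_mask
```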
https://arxiv.org/abs/2505.18925
Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.
https://arxiv.org/abs/2505.15865
This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. In our approach, we address the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. Our process decomposes the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts. Our model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. To do our column segmentation, we use YOLOv11x to separate columns in text to further enhance performance - this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. The lowest WER of 0.133 is achieved by Gemini-2.5-Pro.
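A sketch of the article-segmentation stage using the ultralytics API; the checkpoint name stands in for the fine-tuned YOLOv11x weights and is a placeholder, not a released file.

```python
from ultralytics import YOLO
from PIL import Image

def segment_articles(page_path, weights="yolo11x-articles.pt", conf=0.5):
    """Detect article regions on a newspaper page and return cropped images."""
    model = YOLO(weights)
    result = model(page_path, conf=conf)[0]
    page = Image.open(page_path)
    crops = []
    for box in result.boxes.xyxy.tolist():         # [x1, y1, x2, y2] per detection
        crops.append(page.crop(tuple(map(int, box))))
    return crops

# each crop would then pass through super-resolution, column segmentation,
# and finally an LLM-based recognizer, as described above.
```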
https://arxiv.org/abs/2505.13943
Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities in relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent need to improve reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at this https URL.
https://arxiv.org/abs/2505.12766
Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at this https URL.
https://arxiv.org/abs/2505.12307
This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then passes it through a pipeline of large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (regex) for better document comprehension. Made available through an accessible Gradio interface, the system demonstrates a real-world application of libraries, models, and APIs that closes the language gap and enhances access to information in image media across different linguistic environments.
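A minimal sketch of the OCR front end and Gradio wrapper under stated assumptions: the Gemini translation and summarization calls are replaced by a placeholder function, and Tesseract is assumed to have the eng, hin, and tam language packs installed.

```python
import pytesseract
import gradio as gr

def summarize_and_translate(text, target_lang):
    # placeholder for the Gemini API calls (translation, abstractive summary,
    # re-translation into the target language); swap in the real client here
    return f"[{target_lang} summary of {len(text.split())} extracted words]"

def process(image, target_lang):
    # Tesseract with the English + Hindi + Tamil traineddata files installed
    text = pytesseract.image_to_string(image, lang="eng+hin+tam")
    return text, summarize_and_translate(text, target_lang)

demo = gr.Interface(
    fn=process,
    inputs=[gr.Image(type="pil"),
            gr.Dropdown(["English", "Hindi", "Tamil"], label="Target language")],
    outputs=[gr.Textbox(label="Extracted text"), gr.Textbox(label="Summary")],
)

if __name__ == "__main__":
    demo.launch()
```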
https://arxiv.org/abs/2505.11177
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at this https URL.
https://arxiv.org/abs/2505.10055