LMV-RPA: Large Model Voting-based Robotic Process Automation

2024-12-23 20:28:22
Osama Abdellatif, Ahmed Ayman, Ali Hamdi

Abstract

Automating high-volume unstructured data processing is essential for operational efficiency. Optical Character Recognition (OCR) is critical to this work but often struggles with accuracy and efficiency on complex layouts and ambiguous text. These challenges are especially pronounced in large-scale tasks that require both speed and precision. This paper introduces LMV-RPA, a Large Model Voting-based Robotic Process Automation system that enhances OCR workflows. LMV-RPA integrates outputs from OCR engines such as Paddle OCR, Tesseract OCR, Easy OCR, and DocTR with Large Language Models (LLMs) like LLaMA 3 and Gemini-1.5-pro. Using a majority voting mechanism, it converts OCR outputs into structured JSON, improving accuracy, particularly on complex layouts. The multi-phase pipeline passes the text extracted by the OCR engines through the LLMs and combines their results to select the most accurate output. LMV-RPA achieves 99 percent accuracy in OCR tasks, compared with 94 percent for baseline models, while reducing processing time by 80 percent. Benchmark evaluations confirm its scalability and demonstrate that LMV-RPA offers a faster, more reliable, and more efficient solution for automating large-scale document processing tasks.
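
The majority voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how votes are counted, so the code below assumes a field-by-field vote over JSON objects that several OCR-plus-LLM runs have already produced. The function name majority_vote, the sample field names, and the tie-breaking rule (the earliest candidate wins) are all hypothetical choices made for illustration, not the authors' implementation.

```python
import json
from collections import Counter


def majority_vote(candidates):
    """Merge several JSON-like dicts by voting on each field separately.

    Assumption: each candidate is one structured result produced from one
    OCR engine's text. Ties go to the value that appears first in the list.
    """
    # Collect every field name seen in any candidate, in first-seen order.
    fields = []
    for cand in candidates:
        for key in cand:
            if key not in fields:
                fields.append(key)

    merged = {}
    for field in fields:
        # Serialize values so lists and nested dicts can be counted too.
        votes = Counter(
            json.dumps(cand[field], sort_keys=True, ensure_ascii=False)
            for cand in candidates
            if field in cand
        )
        winner, _count = votes.most_common(1)[0]
        merged[field] = json.loads(winner)
    return merged


# Hypothetical outputs for the same invoice from three pipelines.
# Two of them each contain a different character-level OCR error.
candidates = [
    {"invoice_no": "INV-1001", "total": "250.00"},
    {"invoice_no": "INV-1001", "total": "25O.00"},
    {"invoice_no": "INV-1O01", "total": "250.00"},
]

result = majority_vote(candidates)
print(json.dumps(result))
```

In this sketch no single candidate needs to be fully correct: each field is settled independently, so errors that affect different fields in different engines cancel out as long as a majority agrees on each field.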

URL

https://arxiv.org/abs/2412.17965

PDF

https://arxiv.org/pdf/2412.17965.pdf

