Paper Reading AI Learner

STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers

2023-11-28 21:53:08
Daqian Shao, Lukas Fesser, Marta Kwiatkowska

Abstract

Robustness certification, which aims to formally certify the predictions of neural networks against adversarial inputs, has become an integral part of important tool for safety-critical applications. Despite considerable progress, existing certification methods are limited to elementary architectures, such as convolutional networks, recurrent networks and recently Transformers, on benchmark datasets such as MNIST. In this paper, we focus on the robustness certification of scene text recognition (STR), which is a complex and extensively deployed image-based sequence prediction problem. We tackle three types of STR model architectures, including the standard STR pipelines and the Vision Transformer. We propose STR-Cert, the first certification method for STR models, by significantly extending the DeepPoly polyhedral verification framework via deriving novel polyhedral bounds and algorithms for key STR model components. Finally, we certify and compare STR models on six datasets, demonstrating the efficiency and scalability of robustness certification, particularly for the Vision Transformer.

Abstract (translated)

安全性认证(旨在正式认证神经网络对抗性输入的预测)已成为安全关键应用的重要工具。尽管取得了显著的进展,但现有的认证方法仅限于基本的架构,如卷积神经网络、循环神经网络和最近 Transformer,在基准数据集如 MNIST 上。在本文中,我们重点关注场景文本识别(STR)模型的安全性认证,这是一种复杂且广泛部署的图像序列预测问题。我们解决了三种 STR 模型架构,包括标准的 STR 管道和 Vision Transformer。我们通过通过扩展 DeepP 凸多面体验证框架来引入新的凸多面体界来提出 STR-Cert,这是第一个为 STR 模型设计的认证方法。最后,我们在六个数据集上进行了 STR 模型的认证和比较,证明了安全性认证的有效性和可扩展性,特别是对于 Vision Transformer。

URL

https://arxiv.org/abs/2401.05338

PDF

https://arxiv.org/pdf/2401.05338.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot