Compressed Image Captioning using CNN-based Encoder-Decoder Framework

2024-04-28 03:47:48
Md Alif Rahman Ridoy, M Mahmud Hasan, Shovon Bhowmick

Abstract

Image processing plays a crucial role across many fields, from scientific research to industrial applications, and one particularly promising application is image captioning. Effective image captioning can significantly improve the accuracy of search engines, making relevant information easier to find, and can greatly enhance accessibility for visually impaired users by providing a richer experience of digital content. Despite this promise, image captioning remains challenging: extracting meaningful visual information from an image and transforming it into coherent language requires bridging the gap between the visual and linguistic domains, a task that demands sophisticated algorithms and models. Our project addresses these challenges with an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and an encoder-decoder model. A CNN extracts visual features from images, and the encoder-decoder framework then generates captions from those features. We also compared several pre-trained CNN architectures to understand how the choice of backbone affects captioning performance. In our quest for optimization, we further explored frequency regularization techniques to compress the AlexNet and EfficientNetB0 models, evaluating whether the compressed models could remain effective at generating image captions while being more resource-efficient.
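
As a concrete illustration of the pipeline described above, below is a minimal sketch of a CNN encoder paired with a recurrent decoder for caption generation. The framework (PyTorch), the torchvision EfficientNetB0 backbone, the LSTM decoder, and all layer sizes and class names (CNNEncoder, CaptionDecoder) are illustrative assumptions rather than the paper's exact implementation; a compressed (e.g., frequency-regularized) backbone would simply replace the feature extractor here.

# Minimal sketch (assumptions noted above): pretrained CNN backbone as the visual
# encoder, LSTM as the caption decoder. Not the paper's exact implementation.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Extracts a fixed-size visual feature vector with a pretrained CNN backbone."""

    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")  # pretrained on ImageNet
        self.features = backbone.features      # convolutional feature extractor (1280 channels)
        self.pool = nn.AdaptiveAvgPool2d(1)    # global average pooling
        self.fc = nn.Linear(1280, embed_size)  # project features to the decoder's embedding size

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                  # backbone kept frozen in this sketch
            x = self.pool(self.features(images)).flatten(1)
        return self.fc(x)


class CaptionDecoder(nn.Module):
    """Generates caption logits, token by token, conditioned on the image feature."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image feature as the first step of the input sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(embeddings)
        return self.fc(hidden)                 # per-step vocabulary logits


encoder = CNNEncoder(embed_size=256)
decoder = CaptionDecoder(embed_size=256, hidden_size=512, vocab_size=5000)
images = torch.randn(2, 3, 224, 224)           # dummy image batch
captions = torch.randint(0, 5000, (2, 20))     # dummy tokenized captions
logits = decoder(encoder(images), captions)    # shape: (2, 21, 5000)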

URL

https://arxiv.org/abs/2404.18062

PDF

https://arxiv.org/pdf/2404.18062.pdf

