
MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection

2024-04-29 16:42:58
Heitor R. Medeiros, David Latortue, Fidel Guerrero Pena, Eric Granger, Marco Pedersoli

Abstract

In this paper, we present a different way to use two modalities: a single model sees either one modality or the other, never both at once. This is useful when adapting a unimodal model to leverage more information while respecting a limited computational budget, and it yields a single model that can handle either modality. We coin the term anymodal learning to describe this setting. For example, surveillance in a room with the lights off is far more reliable with the infrared modality, whereas the visible modality provides more discriminative information when the lights are on. This work investigates how to efficiently leverage the visible and infrared/thermal modalities in a transformer-based object detection backbone to create an anymodal architecture. Our approach adds no inference overhead at test time while exploiting both modalities during training. To accomplish this, we introduce a novel anymodal training technique, Mixed Patches (MiPa), in conjunction with a patch-wise domain-agnostic module responsible for learning a common representation of both modalities. This approach balances the modalities, reaching results on individual-modality benchmarks that are competitive with unimodal architectures across three different visible-infrared object detection datasets. Finally, when used as a regularizer for the stronger modality, our method can surpass multimodal fusion methods while requiring only a single modality at inference. Notably, MiPa sets the state of the art on the LLVIP visible/infrared benchmark. Code: this https URL
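The abstract gives only a high-level description of MiPa, but the core idea — assembling each training input from patches drawn at random from aligned visible and infrared images of the same scene — can be sketched as follows. This is a minimal illustration under assumptions not stated above (aligned RGB/IR image pairs, a ViT-style patch grid, a per-patch Bernoulli choice of modality); the function `mix_patches` and its parameters are hypothetical, not taken from the paper's code.

```python
import torch

def mix_patches(rgb: torch.Tensor, ir: torch.Tensor,
                patch: int = 16, p_ir: float = 0.5) -> torch.Tensor:
    """Compose one anymodal training input by sampling each patch from RGB or IR.

    rgb, ir: aligned images of shape (B, C, H, W), with H and W divisible
             by `patch` (assumption: IR replicated to C channels so both
             modalities share the same stem).
    p_ir:    probability that a given patch cell is taken from the IR image.
    """
    b, c, h, w = rgb.shape
    gh, gw = h // patch, w // patch
    # One Bernoulli draw per patch cell, then upsample the mask to pixel resolution.
    mask = (torch.rand(b, 1, gh, gw, device=rgb.device) < p_ir).float()
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    # Each pixel comes from exactly one modality, decided per patch.
    return mask * ir + (1.0 - mask) * rgb

# Usage sketch: the detector backbone consumes a single mixed image per step,
# so training cost matches the unimodal case.
rgb = torch.rand(2, 3, 224, 224)
ir = torch.rand(2, 3, 224, 224)
mixed = mix_patches(rgb, ir)
```

Because each step feeds the backbone one mixed image, and at test time a plain RGB or IR image is passed in directly, this setup is consistent with the abstract's claim of zero inference overhead.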


URL

https://arxiv.org/abs/2404.18849

PDF

https://arxiv.org/pdf/2404.18849.pdf

