Paper Reading AI Learner

A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance

2024-04-15 14:55:43
Eran Bamani, Eden Nissinman, Lisa Koenigsberg, Inbar Meir, Avishai Sintov

Abstract

Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction, where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize barely visible objects at ultra-range requires the exhaustive collection of a large number of labeled samples. Generating synthetic training datasets is a recent remedy for the lack of real-world data, but existing generators fail to faithfully replicate the visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework, based on a diffusion model, to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model on directive gestures in which fine details of the gesturing hand are difficult to distinguish. Compared to other types of generative models, DUR is superior both in fidelity and in the recognition success rate of the resulting URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms training the URGR model directly on the real data. The synthetic-data-trained URGR model is also demonstrated in gesture-based guidance of a ground robot.
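The abstract describes the DUR generator as a conditional diffusion model: it takes a desired distance and a gesture class and samples a matching synthetic image. The paper's actual architecture is not given here, but the idea can be sketched as a standard conditional reverse-diffusion loop. Everything below is illustrative: the step count, noise schedule, and the `noise_predictor` stand-in (which in DUR would be a trained network) are assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of class- and distance-conditioned diffusion
# sampling in the spirit of the DUR generator. NOT the paper's model.

T = 50                                  # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def noise_predictor(x_t, t, distance, gesture_class):
    """Placeholder for the conditional noise estimator eps(x_t, t, d, c).
    In DUR this would be a trained network; here it is a deterministic
    stand-in that merely mixes the conditioning signals for illustration."""
    cond = np.tanh(distance / 25.0 + gesture_class)
    return 0.1 * x_t + 0.01 * cond

def sample(distance, gesture_class, shape=(8, 8), seed=0):
    """Reverse diffusion: start from Gaussian noise and iteratively
    denoise, conditioned on the desired distance (metres) and class."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps = noise_predictor(x, t, distance, gesture_class)
        # DDPM-style posterior mean, then add noise except at the last step.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Request a sample at 20 m for (hypothetical) gesture class 3.
img = sample(distance=20.0, gesture_class=3)
print(img.shape)
```

Labeled training pairs for a URGR model would then follow by sweeping `sample` over the desired distances and classes, with each (distance, class) pair doubling as the label for the generated image.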


URL

https://arxiv.org/abs/2404.09846

PDF

https://arxiv.org/pdf/2404.09846.pdf

