Paper Reading AI Learner

CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models

2024-03-20 10:17:39
Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract

This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations from natural language. The algorithm begins by enriching a provided word to compose a text prompt, which serves as input to an iterative search for the formation that best matches that word. The algorithm iteratively refines the robot formation to align with the textual description, employing distinct "exploration" and "exploitation" steps. Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is visually represented through alpha-shape contours, and the most representative color for the input word is selected automatically. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], which encodes text and images into vectors whose similarity can then be assessed. The algorithm then rearranges the formation to represent the word more effectively, within the constraints of the available drones. Finally, control actions are assigned to the drones, ensuring feasible, collision-free motion. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows in photorealistic simulation with varying shapes. We refer the reader to the supplementary video for a visual reference of the results.
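The iterative refinement described above can be sketched as a simple hill-climbing loop over drone positions, alternating "exploration" (re-sampling a drone anywhere in the arena) and "exploitation" (locally perturbing a drone), keeping whichever candidate scores higher under a CLIP-style cosine similarity. This is a minimal illustration, not the paper's implementation: the encoders and renderer below (`encode_text`, `render_and_encode`) are mocked stand-ins for CLIP's text/image encoders and the alpha-shape rendering step, and all parameter names are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def encode_text(prompt):
    # Stand-in for CLIP's text encoder: deterministic pseudo-embedding.
    rng = random.Random(hash(prompt) % (2**32))
    return [rng.gauss(0, 1) for _ in range(8)]

def render_and_encode(formation):
    # Stand-in for rendering the formation's alpha-shape contour and
    # passing the image through CLIP's image encoder.
    rng = random.Random(hash(tuple(formation)) % (2**32))
    return [rng.gauss(0, 1) for _ in range(8)]

def clipswarm_search(prompt, n_drones=8, iters=200, explore_prob=0.3, step=0.5):
    """Refine drone positions to maximize similarity to the text prompt."""
    target = encode_text(prompt)
    best = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(n_drones)]
    best_score = cosine_similarity(render_and_encode(best), target)
    for _ in range(iters):
        cand = list(best)
        i = random.randrange(n_drones)
        if random.random() < explore_prob:
            # Exploration: re-sample one drone anywhere in the arena.
            cand[i] = (random.uniform(0, 10), random.uniform(0, 10))
        else:
            # Exploitation: small local perturbation of one drone.
            x, y = cand[i]
            cand[i] = (x + random.gauss(0, step), y + random.gauss(0, step))
        score = cosine_similarity(render_and_encode(cand), target)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Because only improving candidates are accepted, the similarity score is non-decreasing over iterations; the real system additionally assigns collision-free control actions to move the drones between successive formations.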


URL

https://arxiv.org/abs/2403.13467

PDF

https://arxiv.org/pdf/2403.13467.pdf
