Paper Reading AI Learner

CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models

2024-03-20 10:17:39
Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract

This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations from natural language. The algorithm begins by enriching a provided word to compose a text prompt, which serves as input to an iterative search for the formation that best matches that word. The algorithm iteratively refines the robot formation to align with the textual description, employing distinct "exploration" and "exploitation" steps. Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is visually represented through alpha-shape contours, and the most representative color for the input word is selected automatically. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], which encodes text and images into vectors whose similarity can then be assessed. The algorithm then rearranges the formation to represent the word more effectively, within the constraints of the available drones. Finally, control actions are assigned to the drones, ensuring feasible, collision-free motion. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows in photorealistic simulation with varying shapes. We refer the reader to the supplementary video for a visual reference of the results.
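The iterative refinement described above can be sketched as a simple hill-climbing loop over drone positions, alternating "exploration" (re-sampling a drone anywhere in the arena) and "exploitation" (locally perturbing a drone), keeping whichever candidate scores higher under a CLIP-style cosine similarity. This is a minimal illustration, not the paper's implementation: the encoders and renderer below (`encode_text`, `render_and_encode`) are mocked stand-ins for CLIP's text/image encoders and the alpha-shape rendering step, and all parameter names are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def encode_text(prompt):
    # Stand-in for CLIP's text encoder: deterministic pseudo-embedding.
    rng = random.Random(hash(prompt) % (2**32))
    return [rng.gauss(0, 1) for _ in range(8)]

def render_and_encode(formation):
    # Stand-in for rendering the formation's alpha-shape contour and
    # passing the image through CLIP's image encoder.
    rng = random.Random(hash(tuple(formation)) % (2**32))
    return [rng.gauss(0, 1) for _ in range(8)]

def clipswarm_search(prompt, n_drones=8, iters=200, explore_prob=0.3, step=0.5):
    """Refine drone positions to maximize similarity to the text prompt."""
    target = encode_text(prompt)
    best = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(n_drones)]
    best_score = cosine_similarity(render_and_encode(best), target)
    for _ in range(iters):
        cand = list(best)
        i = random.randrange(n_drones)
        if random.random() < explore_prob:
            # Exploration: re-sample one drone anywhere in the arena.
            cand[i] = (random.uniform(0, 10), random.uniform(0, 10))
        else:
            # Exploitation: small local perturbation of one drone.
            x, y = cand[i]
            cand[i] = (x + random.gauss(0, step), y + random.gauss(0, step))
        score = cosine_similarity(render_and_encode(cand), target)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Because only improving candidates are accepted, the similarity score is non-decreasing over iterations; the real system additionally assigns collision-free control actions to move the drones between successive formations.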


URL

https://arxiv.org/abs/2403.13467

PDF

https://arxiv.org/pdf/2403.13467.pdf
