Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection

2025-07-17 15:35:27
Hongyang Zhao, Tianyu Liang, Sina Davari, Daeho Kim

Abstract

While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in the construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3,000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the same two thresholds. These findings reveal both the potential and the weaknesses of generative AI in addressing DNN training data scarcity.
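
The IoU thresholds quoted in the abstract can be made concrete with a minimal sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form and reads "0.5 to 0.95" as the common COCO-style sweep in steps of 0.05; the abstract does not state the step size, so that is an assumption. The function names are hypothetical and are not taken from the paper. The sketch covers only the box-matching criterion, not the full AP computation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Thresholds from start to stop inclusive (step of 0.05 is an assumption)."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which a predicted box would count as matching the ground truth."""
    score = iou(pred_box, gt_box)
    return [t for t in iou_thresholds() if score >= t]
```

A prediction that counts as a match at IoU 0.5 can fail the stricter thresholds, which is one reason an AP averaged over 0.5 to 0.95 (0.642 on real images) sits below the AP at 0.5 alone (0.937).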

URL

https://arxiv.org/abs/2507.13221

PDF

https://arxiv.org/pdf/2507.13221.pdf
