ExpressEdit: Video Editing with Natural Language and Sketching

2024-03-26 13:34:21
Bekzat Tilekbay, Saelyne Yang, Michal Lewkowicz, Alex Suryapranata, Juho Kim

Abstract

Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors edit videos by overlaying text/images or trimming footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, specifically natural language (NL) and sketching, which are natural modalities humans use for expression, can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on the user's multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.

URL

https://arxiv.org/abs/2403.17693

PDF

https://arxiv.org/pdf/2403.17693.pdf

