ExpressEdit: Video Editing with Natural Language and Sketching

Abstract

Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors overlay text or images and trim footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors, who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, specifically natural language (NL) and sketching, both natural modalities humans use for expression, can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on the user's multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.

URL

https://arxiv.org/abs/2403.17693

PDF

https://arxiv.org/pdf/2403.17693.pdf
