Existing video captioning approaches typically require first sampling video frames from a decoded video and then conducting a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency in the inference of video captioning. To address this, we study video captioning from a different perspective in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference because smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain that learns to caption directly from the compressed video. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at this https URL.
https://arxiv.org/abs/2309.12867
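To make the compressed-domain captioning idea above concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' architecture): the three bitstream-derived streams, I-frame features, motion vectors, and residuals, are embedded separately, concatenated into one token sequence, and fed to a standard transformer that decodes caption tokens. All layer sizes, feature dimensions, and the omission of a causal mask are simplifications for illustration.

```python
# Minimal sketch: tokenize the three compressed-video streams and caption them
# with one encoder-decoder transformer. Sizes are illustrative only.
import torch
import torch.nn as nn

class CompressedDomainCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Separate linear embeddings for the three compressed-video streams.
        self.iframe_proj = nn.Linear(2048, d_model)   # I-frame CNN features
        self.mv_proj = nn.Linear(2, d_model)          # (dx, dy) motion vectors
        self.res_proj = nn.Linear(192, d_model)       # flattened residual blocks
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=3,
            num_decoder_layers=3, batch_first=True)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, iframes, motion_vectors, residuals, caption_in):
        # Concatenate the three streams along the token dimension so the
        # encoder attends across all of them jointly.
        memory_tokens = torch.cat([
            self.iframe_proj(iframes),
            self.mv_proj(motion_vectors),
            self.res_proj(residuals)], dim=1)
        tgt = self.token_emb(caption_in)
        # A causal mask on the caption tokens is omitted for brevity.
        out = self.transformer(memory_tokens, tgt)
        return self.lm_head(out)

# Toy forward pass with random tensors standing in for parsed bitstream data.
model = CompressedDomainCaptioner()
logits = model(torch.randn(2, 8, 2048),               # 8 I-frame feature vectors
               torch.randn(2, 64, 2),                  # 64 motion vectors
               torch.randn(2, 64, 192),                # 64 residual blocks
               torch.randint(0, 10000, (2, 12)))       # caption prefix tokens
print(logits.shape)  # (2, 12, 10000)
```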
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
https://arxiv.org/abs/2309.11569
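The sampling approach above rests on Kernel Temporal Segmentation. As a rough illustration, the snippet below implements a simplified KTS-style dynamic program over a frame-feature kernel matrix and then takes one representative frame per segment; the paper's exact cost, kernel choice, and rule for selecting the number of segments may differ.

```python
import numpy as np

def segment_cost(K_cumsum, diag_cumsum, a, b):
    """Within-segment scatter of frames [a, b), computed from kernel prefix sums."""
    block = K_cumsum[b, b] - K_cumsum[a, b] - K_cumsum[b, a] + K_cumsum[a, a]
    diag = diag_cumsum[b] - diag_cumsum[a]
    return diag - block / (b - a)

def kts(features, n_segments):
    """Dynamic-programming change-point detection in the spirit of KTS.
    Returns segment boundaries [0, c1, ..., T]."""
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    K = X @ X.T
    T = len(X)
    # 2D prefix sums of K and 1D prefix sums of its diagonal.
    K_cumsum = np.zeros((T + 1, T + 1))
    K_cumsum[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)
    diag_cumsum = np.concatenate([[0.0], np.cumsum(np.diag(K))])

    dp = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, n_segments + 1):
        for t in range(m, T + 1):
            for s in range(m - 1, t):
                c = dp[m - 1, s] + segment_cost(K_cumsum, diag_cumsum, s, t)
                if c < dp[m, t]:
                    dp[m, t] = c
                    back[m, t] = s
    # Backtrack the optimal boundaries.
    bounds = [T]
    t = T
    for m in range(n_segments, 0, -1):
        t = back[m, t]
        bounds.append(t)
    return bounds[::-1]

# Sample one representative frame index per detected segment.
feats = np.random.randn(120, 64)          # toy per-frame features
bounds = kts(feats, n_segments=8)
samples = [(a + b) // 2 for a, b in zip(bounds[:-1], bounds[1:])]
print(bounds, samples)
```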
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for ``sight'' or ``hearing'' and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
https://arxiv.org/abs/2309.10783
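The zero-shot classification recipe above amounts to assembling the modality descriptions into a prompt and letting the LLM pick a label in-context. A hedged sketch follows; `caption`, `transcript`, `audio_tags`, and `llm` are placeholders standing in for BLIP-2, Whisper, ImageBind-derived text, and any LLM client, not the paper's actual interface.

```python
# Hypothetical sketch of in-context zero-shot video classification from
# textual proxies of each modality; all names are placeholders.
from typing import Callable, List

def classify_video(caption: str, transcript: str, audio_tags: List[str],
                   labels: List[str], llm: Callable[[str], str]) -> str:
    prompt = (
        "You are given textual descriptions of a video.\n"
        f"Visual description: {caption}\n"
        f"Speech transcript: {transcript}\n"
        f"Audio events: {', '.join(audio_tags)}\n"
        f"Choose the single best action label from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )
    answer = llm(prompt).strip().lower()
    # Fall back to the first label if the LLM output matches no class name.
    matches = [l for l in labels if l.lower() in answer]
    return matches[0] if matches else labels[0]

# Example with a stub LLM that always answers "playing guitar".
print(classify_video("a man strums a guitar on stage",
                     "thank you everyone, this next song...",
                     ["acoustic guitar", "applause"],
                     ["playing guitar", "surfing", "baking"],
                     llm=lambda p: "playing guitar"))
```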
Dynamic magnetic resonance imaging (DMRI) is an effective imaging tool for diagnosis tasks that require motion tracking of a certain anatomy. To speed up DMRI acquisition, k-space measurements are commonly undersampled along spatial or spatial-temporal domains. The difficulty of recovering useful information increases with increasing undersampling ratios. Compressed sensing was invented for this purpose and remained the most popular method until deep learning (DL) based DMRI reconstruction methods emerged in the past decade. Nevertheless, existing DL networks are still limited in long-range sequential dependency understanding and computational efficiency and are not fully automated. Considering the success of the Transformer's positional embedding and "swin window" self-attention mechanism in the vision community, especially in natural video understanding, we hereby propose a novel architecture named Reconstruction Swin Transformer (RST) for 4D MRI. RST inherits the backbone design of the Video Swin Transformer with a novel reconstruction head introduced to restore pixel-wise intensity. A convolution network called SADXNet is used for rapid initialization of 2D MR frames before RST learning to effectively reduce the model complexity, GPU hardware demand, and training time. Experimental results on the cardiac 4D MR dataset further substantiate the superiority of RST, achieving the lowest RMSE of 0.0286 +/- 0.0199 and 1 - SSIM of 0.0872 +/- 0.0783 on 9-times accelerated validation sequences.
https://arxiv.org/abs/2309.10227
As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.
https://arxiv.org/abs/2309.09611
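To illustrate how three granularity-specific token streams can be aligned, here is a minimal cross-attention sketch in PyTorch. It is a generic stand-in for the paper's cross-granularity attention module, with illustrative dimensions and a simple residual-plus-norm update rather than the authors' exact design.

```python
# Minimal sketch: each branch's tokens attend to the concatenation of the
# other two branches so video-, object-, and action-level features can align.
import torch
import torch.nn as nn

class CrossGranularityAttention(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, nhead, batch_first=True) for _ in range(3))
        self.norm = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, video_tok, object_tok, action_tok):
        streams = [video_tok, object_tok, action_tok]
        out = []
        for i, x in enumerate(streams):
            # Keys/values come from the other two granularities.
            others = torch.cat([s for j, s in enumerate(streams) if j != i], dim=1)
            attended, _ = self.attn[i](x, others, others)
            out.append(self.norm[i](x + attended))   # residual + layer norm
        return out

module = CrossGranularityAttention()
v, o, a = (torch.randn(2, n, 256) for n in (32, 10, 6))
v2, o2, a2 = module(v, o, a)
print(v2.shape, o2.shape, a2.shape)
```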
Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. To this end, we introduce a multi-style contrastive training procedure that improves the generalizability over several datasets simultaneously. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval and improve state-of-the-art performance on zero-shot text-video retrieval.
https://arxiv.org/abs/2309.08928
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available on this https URL. Baselines and development kits can be found on this https URL.
https://arxiv.org/abs/2309.06006
We use large language models to help learners enhance their proficiency in a foreign language. This is accomplished by identifying content on topics that the user is interested in and that closely aligns with the learner's proficiency level in that foreign language. Our work centers on French content, but our approach is readily transferable to other languages. Our solution offers several distinctive characteristics that differentiate it from existing language-learning solutions, such as: a) the discovery of content across topics that the learner cares about, thus increasing motivation, b) a more precise estimation of the linguistic difficulty of the content than traditional readability measures, and c) the availability of both textual and video-based content. The linguistic complexity of video content is derived from the video captions. It is our aspiration that such technology will enable learners to remain engaged in the language-learning process by continuously adapting the topics and the difficulty of the content to align with the learners' evolving interests and learning objectives.
https://arxiv.org/abs/2309.05142
The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.
https://arxiv.org/abs/2309.03884
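As a rough illustration of how the three desiderata above can be combined, the sketch below re-ranks finished candidate captions with weighted fluency, audio-text matching, and audibility scores. Note that the paper guides the generation process itself rather than performing post-hoc ranking, and all scoring callables here are placeholders.

```python
# Hypothetical re-ranking sketch of the three desiderata: fluency, faithfulness
# to the audio, and audibility. All scorers are stand-ins, not the paper's models.
import numpy as np
from typing import Callable, List

def rank_captions(candidates: List[str],
                  fluency: Callable[[str], float],      # e.g., mean LM log-prob
                  audio_match: Callable[[str], float],  # e.g., audio-text similarity
                  audibility: Callable[[str], float],   # classifier score in [0, 1]
                  weights=(1.0, 1.0, 1.0)) -> List[str]:
    def z(x):  # z-normalise each score type so the weights are comparable
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-8)
    f = z([fluency(c) for c in candidates])
    m = z([audio_match(c) for c in candidates])
    a = z([audibility(c) for c in candidates])
    total = weights[0] * f + weights[1] * m + weights[2] * a
    return [candidates[i] for i in np.argsort(-total)]

# Toy usage with stub scorers.
caps = ["a dog barks twice", "a red car is parked", "rain falls on a tin roof"]
best = rank_captions(caps,
                     fluency=lambda c: -len(c) * 0.01,
                     audio_match=lambda c: 1.0 if "bark" in c or "rain" in c else 0.1,
                     audibility=lambda c: 0.0 if "parked" in c else 0.9)
print(best[0])
```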
The challenge of long-term video understanding remains constrained by the efficient extraction of object semantics and the modelling of their relationships for downstream tasks. Although the CLIP visual features exhibit discriminative properties for various vision tasks, particularly in object encoding, they are suboptimal for long-term video understanding. To address this issue, we present the Attributes-Aware Network (AAN), which consists of two key components: the Attributes Extractor and a Graph Reasoning block. These components facilitate the extraction of object-centric attributes and the modelling of their relationships within the video. By leveraging CLIP features, AAN outperforms state-of-the-art approaches on two popular action detection datasets: Charades and Toyota Smarthome Untrimmed datasets.
https://arxiv.org/abs/2309.00696
Human-robot interaction (HRI) is a rapidly growing field that encompasses social and industrial applications. Machine learning plays a vital role in industrial HRI by enhancing the adaptability and autonomy of robots in complex environments. However, data privacy is a crucial concern in the interaction between humans and robots, as companies need to protect sensitive data while machine learning algorithms require access to large datasets. Federated Learning (FL) offers a solution by enabling the distributed training of models without sharing raw data. Despite extensive research on FL for tasks such as natural language processing (NLP) and image classification, the question of how to use FL for HRI remains an open research problem. The traditional FL approach involves transmitting large neural network parameter matrices between the server and clients, which can lead to high communication costs and often becomes a bottleneck in FL. This paper proposes a communication-efficient FL framework for human-robot interaction (CEFHRI) to address the challenges of data heterogeneity and communication costs. The framework leverages pre-trained models and introduces a trainable spatiotemporal adapter for video understanding tasks in HRI. Experimental results on three human-robot interaction benchmark datasets (HRI30, InHARD, and COIN) demonstrate the superiority of CEFHRI over full fine-tuning in terms of communication costs. The proposed methodology provides a secure and efficient approach to HRI federated learning, particularly in industrial environments with data privacy concerns and limited communication bandwidth. Our code is available at this https URL.
https://arxiv.org/abs/2308.14965
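The communication saving above comes from keeping the pre-trained backbone frozen and exchanging only the small adapter (and head) parameters. Below is a minimal FedAvg-style sketch under that assumption; the toy backbone, adapter design, and optimizer settings are illustrative, not the CEFHRI architecture.

```python
# FedAvg sketch where only non-backbone (adapter + head) parameters travel
# between server and clients; sizes and modules are toy stand-ins.
import copy
import torch
import torch.nn as nn

class AdapterModel(nn.Module):
    def __init__(self, feat_dim=512, adapter_dim=32, n_classes=30):
        super().__init__()
        self.backbone = nn.Linear(feat_dim, feat_dim)      # stands in for a frozen video encoder
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(                       # small trainable bottleneck
            nn.Linear(feat_dim, adapter_dim), nn.ReLU(), nn.Linear(adapter_dim, feat_dim))
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        h = self.backbone(x)
        return self.head(h + self.adapter(h))               # residual adapter

def client_update(model, data, steps=5):
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
    for _ in range(steps):
        x, y = data
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Only adapter + head weights are sent back to the server.
    return {k: v for k, v in model.state_dict().items() if not k.startswith("backbone")}

def federated_round(server_model, client_datasets):
    updates = []
    for data in client_datasets:
        local = copy.deepcopy(server_model)
        updates.append(client_update(local, data))
    # Average the communicated (non-backbone) parameters only.
    avg = {k: torch.stack([u[k] for u in updates]).mean(0) for k in updates[0]}
    server_model.load_state_dict(avg, strict=False)
    return server_model

server = AdapterModel()
clients = [(torch.randn(16, 512), torch.randint(0, 30, (16,))) for _ in range(3)]
server = federated_round(server, clients)
```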
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at this https URL.
https://arxiv.org/abs/2308.14746
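The dataset construction above can be pictured as two steps: mine video pairs whose captions are nearly identical, then ask an LLM to verbalize the difference as a modification text. A toy sketch follows, with a bag-of-words "embedder" and a stub LLM standing in for the real caption-similarity model and language model.

```python
# Hypothetical triplet-mining sketch: caption-similar video pairs plus an
# LLM-generated modification text. `embed` and `llm` are placeholder callables.
import numpy as np
from typing import Callable, List, Tuple

def mine_covr_triplets(captions: List[str],
                       embed: Callable[[List[str]], np.ndarray],
                       llm: Callable[[str], str],
                       sim_threshold: float = 0.8) -> List[Tuple[int, str, int]]:
    E = embed(captions)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    triplets = []
    for i in range(len(captions)):
        for j in range(i + 1, len(captions)):
            if sims[i, j] >= sim_threshold and captions[i] != captions[j]:
                prompt = (f"Caption A: {captions[i]}\nCaption B: {captions[j]}\n"
                          "In one short imperative sentence, say how to change "
                          "the content of A so it matches B.")
                triplets.append((i, llm(prompt), j))   # (query video, modification, target video)
    return triplets

# Toy run with a deterministic bag-of-words embedder and a stub LLM.
caps = ["a dog runs on the beach", "a dog runs on the grass", "a chef slices onions"]
vocab = sorted({w for c in caps for w in c.split()})
bow = lambda xs: np.array([[1.0 if w in x.split() else 0.0 for w in vocab] for x in xs])
print(mine_covr_triplets(caps, bow, llm=lambda p: "change the beach to grass"))
```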
Supervised visual captioning models typically require a large number of images or videos paired with descriptions in a specific language (i.e., vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired language. In the testing stage, MultiCapCLIP instead takes visual data as input directly to retrieve the concept prompts to generate the final visual descriptions. Extensive experiments on image and video captioning across four benchmarks and four languages (i.e., English, Chinese, German, and French) confirm the effectiveness of our approach. Compared with state-of-the-art zero-shot and weakly-supervised methods, our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at this https URL.
https://arxiv.org/abs/2308.13218
The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has increased the importance of 360-degree saliency prediction in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectional videos named SalViT360 that leverages tangent image representations. We introduce a spherical geometry-aware spatiotemporal self-attention mechanism that is capable of effective omnidirectional video understanding. Furthermore, we present a consistency-based unsupervised regularization term for projection-based 360-degree dense-prediction models to reduce artefacts in the predictions that occur after inverse projection. Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.
https://arxiv.org/abs/2308.13004
Several recent works have directly extended the image masked autoencoder (MAE) with random masking into video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE. This motivates the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +$1.3\%$ improvement compared to previous state-of-the-art methods. Additionally, our MGM achieves equivalent performance to previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +$4.9\%$ improvement compared to baseline methods.
https://arxiv.org/abs/2308.12962
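A simplified sketch of the central idea above: pool the codec's motion vectors into per-patch motion magnitudes and let them decide which patches get masked. The actual MGM propagates mask positions over time; the pooling, ratio, and top-k rule below are illustrative.

```python
# Toy motion-guided masking: the most-moving patches (by pooled motion-vector
# magnitude) are masked so the autoencoder must reconstruct motion-salient content.
import torch
import torch.nn.functional as F

def motion_guided_mask(motion_vectors, patch=16, mask_ratio=0.75):
    """motion_vectors: (T, 2, H, W) per-pixel (dx, dy) fields from the bitstream.
    Returns a boolean mask of shape (T, H//patch, W//patch), True = masked."""
    T, _, H, W = motion_vectors.shape
    magnitude = motion_vectors.norm(dim=1, keepdim=True)          # (T, 1, H, W)
    per_patch = F.avg_pool2d(magnitude, patch).squeeze(1)         # (T, h, w)
    h, w = per_patch.shape[1:]
    flat = per_patch.flatten(1)                                   # (T, h*w)
    k = int(mask_ratio * flat.shape[1])
    idx = flat.topk(k, dim=1).indices                             # most-moving patches
    mask = torch.zeros_like(flat, dtype=torch.bool).scatter_(1, idx, True)
    return mask.view(T, h, w)

mv = torch.randn(8, 2, 224, 224)               # toy motion field for 8 frames
mask = motion_guided_mask(mv)
print(mask.shape, mask.float().mean().item())  # (8, 14, 14), ~0.75
```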
Self-supervised learning (SSL) techniques have recently produced outstanding results in learning visual representations from unlabeled videos. Despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos. To address this issue, we propose MOFO (MOtion FOcused), a novel SSL method for focusing representation learning on the motion area of a video, for action recognition. MOFO automatically detects motion areas in videos and uses these to guide the self-supervision task. We use a masked autoencoder which randomly masks out a high proportion of the input sequence; we force a specified percentage of the masked tokens to fall inside the motion area and the remainder to fall outside it. We further incorporate motion information into the finetuning step to emphasise motion in the downstream task. We demonstrate that our motion-focused innovations can significantly boost the performance of the currently leading SSL method (VideoMAE) for action recognition. Our method improves the recent self-supervised Vision Transformer (ViT), VideoMAE, by achieving +2.6%, +2.1%, +1.3% accuracy on Epic-Kitchens verb, noun and action classification, respectively, and +4.7% accuracy on Something-Something V2 action classification. Our proposed approach significantly improves the performance of the current SSL method for action recognition, indicating the importance of explicitly encoding motion in SSL.
https://arxiv.org/abs/2308.12447
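The masking rule above can be summarized as: of all tokens chosen for masking, a fixed fraction must come from inside the detected motion area and the rest from outside. A toy sampler under that reading is given below; the token grid, ratios, and motion map are placeholders, not the MOFO settings.

```python
# Toy MOFO-style masking split between inside and outside the motion area.
import torch

def mofo_mask(motion_map, mask_ratio=0.9, inside_fraction=0.75):
    """motion_map: boolean (N,) over tokens, True = token lies in the motion area.
    Returns a boolean mask (N,), True = masked."""
    N = motion_map.numel()
    n_mask = int(mask_ratio * N)
    # Fall back gracefully if the motion area is smaller than the requested quota.
    n_inside = min(int(inside_fraction * n_mask), int(motion_map.sum()))
    n_outside = min(n_mask - n_inside, int((~motion_map).sum()))
    inside_idx = motion_map.nonzero(as_tuple=True)[0]
    outside_idx = (~motion_map).nonzero(as_tuple=True)[0]
    pick_in = inside_idx[torch.randperm(len(inside_idx))[:n_inside]]
    pick_out = outside_idx[torch.randperm(len(outside_idx))[:n_outside]]
    mask = torch.zeros(N, dtype=torch.bool)
    mask[pick_in] = True
    mask[pick_out] = True
    return mask

motion_map = torch.rand(1568) < 0.3            # toy motion area over 8x14x14 tokens
mask = mofo_mask(motion_map)
print(mask.float().mean().item(), (mask & motion_map).sum().item())
```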
Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption used often in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions. Finally, we establish guidelines for human evaluation of video stories, and reaffirm the need of better automatic metrics for video generation. StoryBench aims at encouraging future research efforts in this exciting new area.
https://arxiv.org/abs/2308.11606
Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate if such models actually learn and reason over long-term information. In this work, we propose a method to evaluate how suitable a video dataset is to evaluate models for long-term action recognition. To this end, we define a long-term action as excluding all the videos that can be correctly recognized using solely short-term information. We test this definition on existing long-term classification tasks on three popular real-world datasets, namely Breakfast, CrossTask and LVU, to determine if these datasets are truly evaluating long-term recognition. Our study reveals that these datasets can be effectively solved using shortcuts based on short-term information. Following this finding, we encourage long-term action recognition researchers to make use of datasets that need long-term information to be solved.
https://arxiv.org/abs/2308.11244
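The definition above translates directly into a filtering procedure: keep only the videos that a short-term model gets wrong. A minimal sketch, with `short_term_predict` as a placeholder for any clip-level classifier:

```python
# Keep only videos that genuinely require long-term information, i.e. those
# a short-term (single-clip) classifier fails on.
from typing import Callable, List, Tuple

def long_term_subset(videos: List[str], labels: List[int],
                     short_term_predict: Callable[[str], int]) -> List[Tuple[str, int]]:
    kept = []
    for video, label in zip(videos, labels):
        if short_term_predict(video) != label:   # short-term shortcut fails -> keep
            kept.append((video, label))
    return kept

# Toy usage: a stub predictor that "solves" some of the videos.
vids = [f"video_{i}.mp4" for i in range(6)]
labs = [i % 3 for i in range(6)]
subset = long_term_subset(vids, labs, short_term_predict=lambda v: 0)
print(len(subset), "of", len(vids), "videos require long-term information")
```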
Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN first divides the video into snippets of paired image and audio clips and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that produces the coordinates of the important patches. This approach enables us to focus only on the most important spatio-temporal parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN. By combining these strategies, our AVGN sets new state-of-the-art performance on multiple video recognition benchmarks while achieving faster processing speed.
https://arxiv.org/abs/2308.09322
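A condensed sketch of the temporal "glance" above: score every frame with a lightweight audio-visual saliency module and pass only the top-k frames on for heavy processing. The scorer, feature dimensions, and k below are illustrative, and the spatial patch-selection stage of AVGN is omitted.

```python
# Toy temporal saliency scoring and top-k frame selection.
import torch
import torch.nn as nn

class TemporalSaliencyScorer(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, d_model=128):
        super().__init__()
        self.proj = nn.Linear(vis_dim + aud_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)

    def forward(self, vis_feats, aud_feats):
        x = self.proj(torch.cat([vis_feats, aud_feats], dim=-1))  # (B, T, d)
        return self.score(self.encoder(x)).squeeze(-1)            # (B, T) saliency

def select_salient_frames(frames, vis_feats, aud_feats, scorer, k=8):
    scores = scorer(vis_feats, aud_feats)                 # (B, T)
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values
    # Gather the chosen frames, keeping temporal order.
    return torch.stack([frames[b, idx[b]] for b in range(frames.shape[0])])

scorer = TemporalSaliencyScorer()
frames = torch.randn(2, 32, 3, 224, 224)                  # toy decoded frames
selected = select_salient_frames(frames, torch.randn(2, 32, 512),
                                 torch.randn(2, 32, 128), scorer)
print(selected.shape)                                     # (2, 8, 3, 224, 224)
```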
Recently, the community has made tremendous progress in developing effective methods for point cloud video understanding that learn from massive amounts of labeled data. However, annotating point cloud videos is usually notoriously expensive. Moreover, training via one or only a few traditional tasks (e.g., classification) may be insufficient to learn subtle details of the spatio-temporal structure existing in point cloud videos. In this paper, we propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to capture the structure of point cloud videos without human annotations. MaST-Pre is based on spatio-temporal point-tube masking and consists of two self-supervised learning tasks. First, by reconstructing masked point tubes, our method is able to capture the appearance information of point cloud videos. Second, to learn motion, we propose a temporal cardinality difference prediction task that estimates the change in the number of points within a point tube. In this way, MaST-Pre is forced to model the spatial and temporal structure in point cloud videos. Extensive experiments on MSRAction-3D, NTU-RGBD, NvGesture, and SHREC'17 demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2308.09245
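The temporal cardinality difference target above can be illustrated in a few lines of NumPy: count the points of each frame that fall inside a fixed tube and take the frame-to-frame change in that count. Tube placement and radius here are arbitrary toy values, not the MaST-Pre settings.

```python
# Toy temporal cardinality difference for one point tube.
import numpy as np

def cardinality_difference(point_frames, centre, radius):
    """point_frames: list of (N_t, 3) arrays, one point set per frame.
    Returns the per-step change in the number of points inside the tube."""
    counts = []
    for pts in point_frames:
        inside = np.linalg.norm(pts - centre[None, :], axis=1) <= radius
        counts.append(int(inside.sum()))
    return np.diff(np.array(counts))   # positive = points flowing into the tube

# Toy usage: a drifting random point cloud over 5 frames.
rng = np.random.default_rng(0)
frames = [rng.uniform(-1, 1, size=(1024, 3)) + np.array([0.1 * t, 0, 0]) for t in range(5)]
print(cardinality_difference(frames, centre=np.array([0.5, 0.0, 0.0]), radius=0.3))
```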