Action recognition in egocentric videos has emerged as a vital research topic with numerous practical applications. Because egocentric data collection is limited in scale, learning robust deep-learning-based action recognition models remains difficult, and transferring knowledge learned from large-scale exocentric data to the egocentric domain is challenging because videos differ substantially across views. Our work introduces a novel cross-view learning approach to action recognition (CVAR) that effectively transfers knowledge from the exocentric to the egocentric view. First, we introduce a novel geometric constraint into the Transformer self-attention mechanism, derived from an analysis of the camera positions of the two views. Then, we propose a new cross-view self-attention loss, learned on unpaired cross-view data, that enforces the self-attention mechanism to learn to transfer knowledge across views. Finally, to further improve the performance of our cross-view learning approach, we present metrics that effectively measure the correlations between videos and attention maps. Experimental results on standard egocentric action recognition benchmarks, i.e., Charades-Ego, EPIC-Kitchens-55, and EPIC-Kitchens-100, show the effectiveness and state-of-the-art performance of our approach.
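As a rough illustration of the cross-view self-attention alignment idea, the PyTorch sketch below matches only batch-level attention statistics between the two branches, since the data are unpaired; the shapes, the symmetric-KL choice, and the function name are assumptions for illustration, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def cross_view_attention_loss(attn_exo: torch.Tensor, attn_ego: torch.Tensor) -> torch.Tensor:
    """Align the distribution of self-attention over tokens between views.

    attn_exo / attn_ego: (batch, heads, tokens) attention summaries from the
    exocentric and egocentric branches. The batches are unpaired, so only
    batch-level statistics are matched here (an illustrative simplification).
    """
    p = attn_exo.mean(dim=(0, 1))          # average exo attention over batch and heads
    q = attn_ego.mean(dim=(0, 1))          # average ego attention over batch and heads
    p = p / p.sum()
    q = q / q.sum()
    # Symmetric KL divergence between the two average attention distributions.
    kl_pq = F.kl_div(q.log(), p, reduction='sum')
    kl_qp = F.kl_div(p.log(), q, reduction='sum')
    return 0.5 * (kl_pq + kl_qp)
```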
https://arxiv.org/abs/2305.15699
Video behavior recognition is currently one of the most fundamental tasks in computer vision. 2D deep neural networks were built to recognize pixel-level information such as images in RGB, RGB-D, or optical-flow formats; with the increasingly wide use of surveillance video and the growing number of tasks related to human action recognition, more and more tasks require temporal information for analyzing dependencies between frames. Researchers have therefore widely studied video-based rather than purely image-based (pixel-based) recognition in order to extract more informative elements from geometry-related tasks. Our survey covers multiple recently proposed research works and compares the advantages and disadvantages of the resulting deep learning frameworks, rather than classical machine learning frameworks. The comparison is conducted across existing frameworks and datasets that contain video data only. Given the specific properties of human actions and the increasingly wide use of deep neural networks, we collected research works from the last three years, 2020 to 2022. In the works surveyed in our article, the performance of deep neural networks surpassed most other techniques in feature learning and extraction tasks, especially video action recognition.
https://arxiv.org/abs/2305.15692
The recognition of human actions in videos is one of the most active research fields in computer vision. The canonical approach consists of more or less complex preprocessing stages applied to the raw video data, followed by a relatively simple classification algorithm. Here we address recognition of human actions using the reservoir computing algorithm, which allows us to focus on the classifier stage. We introduce a new training method for the reservoir computer, based on "Timesteps Of Interest", which combines short and long time scales in a simple way. We study the performance of this algorithm on the well-known KTH dataset using both numerical simulations and a photonic implementation based on a single non-linear node and a delay line. We solve the task with high accuracy and speed, to the point of allowing multiple video streams to be processed in real time. The present work is thus an important step towards developing efficient dedicated hardware for video processing.
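A minimal NumPy sketch of the reservoir-computing setup described above: a fixed random recurrent reservoir driven by per-frame features, with a ridge-regression readout fitted only on a caller-supplied set of "timesteps of interest". The reservoir size, leak rate, and the choice of timestep indices are illustrative assumptions, not the photonic implementation.

```python
import numpy as np

class Reservoir:
    """Fixed random recurrent reservoir; only the linear readout is trained."""
    def __init__(self, n_in, n_res=500, rho=0.9, leak=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.normal(0.0, 1.0, (n_res, n_res))
        self.W = W * (rho / np.max(np.abs(np.linalg.eigvals(W))))  # set spectral radius
        self.leak = leak

    def run(self, inputs):                      # inputs: (T, n_in) per-frame features
        x = np.zeros(self.W.shape[0])
        states = np.empty((len(inputs), len(x)))
        for t, u in enumerate(inputs):
            x = (1 - self.leak) * x + self.leak * np.tanh(self.W_in @ u + self.W @ x)
            states[t] = x
        return states

def train_readout(state_seqs, onehot_labels, toi, ridge=1e-3):
    """Ridge-regression readout fitted only on the 'timesteps of interest' (toi)."""
    X = np.concatenate([s[toi] for s in state_seqs])                    # keep chosen timesteps
    Y = np.concatenate([np.tile(y, (len(toi), 1)) for y in onehot_labels])
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
```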
https://arxiv.org/abs/2305.15283
We present a new self-supervised paradigm for point cloud sequence understanding. Inspired by discriminative and generative self-supervised methods, we design two tasks, namely point cloud sequence based Contrastive Prediction and Reconstruction (CPR), to collaboratively learn more comprehensive spatiotemporal representations. Specifically, dense point cloud segments are first fed into an encoder to extract embeddings. All but the last segment are then aggregated by a context-aware autoregressor to make predictions for the last target segment. Towards the goal of modeling multi-granularity structures, local and global contrastive learning are performed between predictions and targets. To further improve the generalization of the representations, the predictions are also used by a decoder to reconstruct the raw point cloud sequences, where point cloud colorization is employed to discriminate between different frames. Combining the classic contrastive and reconstruction paradigms endows the learned representations with both global discrimination and local perception. We conduct experiments on four point cloud sequence benchmarks and report results on action recognition and gesture recognition under multiple experimental settings. The performance is comparable with supervised methods and shows strong transferability.
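The contrastive-prediction half of CPR could look roughly like the PyTorch sketch below, where a GRU stands in for the context-aware autoregressor and an InfoNCE loss implements the global contrast between the predicted and the actual last-segment embedding; the dimensions and module choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastivePredictor(nn.Module):
    """Aggregate all but the last segment embedding and predict the last one."""
    def __init__(self, dim=256):
        super().__init__()
        self.autoregressor = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, seg_emb):                    # seg_emb: (B, S, dim) segment embeddings
        context, _ = self.autoregressor(seg_emb[:, :-1])
        return self.head(context[:, -1])           # prediction for the last segment

def info_nce(pred, target, tau=0.07):
    """Global contrast: the matching target segment is the positive,
    other segments in the batch serve as negatives."""
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)
```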
https://arxiv.org/abs/2305.12959
This paper studies the computational offloading of video action recognition in edge computing. To achieve effective semantic information extraction and compression, following semantic communication we propose a novel spatiotemporal attention-based autoencoder (STAE) architecture, including a frame attention module and a spatial attention module, to evaluate the importance of frames and of pixels within each frame. Additionally, we use entropy encoding to remove statistical redundancy in the compressed data and further reduce communication overhead. At the receiver, we develop a lightweight decoder that leverages a combined 3D-2D CNN architecture to reconstruct missing information by simultaneously learning temporal and spatial information from the received data, improving accuracy. To speed up convergence, we use a step-by-step approach to train the resulting STAE-based vision transformer (ViT_STAE) models. Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only a 5% accuracy loss, outperforming the state-of-the-art baseline DeepISC. The proposed ViT_STAE achieves faster inference and higher accuracy than the DeepISC-based ViT model under a time-varying wireless channel, which highlights the effectiveness of STAE in guaranteeing higher accuracy under time constraints.
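A hedged sketch of the frame-attention idea: score each frame token and reweight it so uninformative frames contribute little to the code that is compressed and transmitted. The layer sizes and softmax weighting below are illustrative, not the exact STAE module.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Score frame embeddings and reweight them before compression,
    so unimportant frames contribute little to the transmitted code."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                   nn.Linear(dim // 4, 1))

    def forward(self, frames):                      # frames: (B, T, dim) frame tokens
        w = torch.softmax(self.score(frames).squeeze(-1), dim=1)   # (B, T) frame weights
        return frames * w.unsqueeze(-1), w          # reweighted frames + importance weights
```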
https://arxiv.org/abs/2305.12796
We present a new general learning approach for action recognition, Prompt Learning for Action Recognition (PLAR), which leverages the strengths of prompt learning to guide the learning process. Our approach is designed to predict the action label by helping the model focus on the descriptions or instructions associated with actions in the input videos. Our formulation uses various prompts, including optical flow, large vision models, and learnable prompts, to improve recognition performance. Moreover, we propose a learnable prompt method that learns to dynamically generate prompts from a pool of prompt experts under different inputs. By sharing the same objective, our proposed PLAR can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt expert pool) and input-specific (data-dependent) prompt knowledge. We evaluate our approach on datasets consisting of both ground-camera and aerial videos, and scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset Okutama, and a 0.8-2.6% improvement on the ground-camera single-agent dataset Something Something V2. We plan to release our code on the WWW.
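The learnable prompt-expert pool might be sketched as below: a set of input-invariant prompt experts is mixed by an input-specific gate into a data-dependent prompt. The pool size, prompt length, and gating mechanism are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PromptExpertPool(nn.Module):
    """Pool of learnable prompt experts; a data-dependent gate mixes them
    into a prompt for the current input."""
    def __init__(self, n_experts=8, prompt_len=4, dim=512):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, dim) * 0.02)
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, video_feat):                  # video_feat: (B, dim) pooled video feature
        w = torch.softmax(self.gate(video_feat), dim=-1)            # (B, n_experts)
        # Input-specific prompt = weighted mixture of input-invariant experts.
        return torch.einsum('be,eld->bld', w, self.experts)         # (B, prompt_len, dim)
```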
https://arxiv.org/abs/2305.12437
How humans understand and recognize the actions of others is a complex neuroscientific problem that involves a combination of cognitive mechanisms and neural networks. Research has shown that humans have brain areas for action recognition that process top-down attentional information, such as the temporoparietal association area, as well as brain regions dedicated to understanding the minds of others and analyzing their intentions, such as the medial prefrontal cortex of the temporal lobe. Skeleton-based action recognition builds mappings between complex human skeleton movement patterns and behaviors. Although existing studies encode meaningful node relationships and synthesize action representations for classification with good results, few of them consider incorporating a priori knowledge to aid representation learning for better performance. LA-GCN proposes a graph convolutional network assisted by knowledge from large-scale language models (LLMs). First, the LLM knowledge is mapped into an a priori global relationship (GPR) topology and an a priori category relationship (CPR) topology between nodes. The GPR guides the generation of new "bone" representations, aiming to emphasize essential node information at the data level. The CPR mapping simulates category prior knowledge in human brain regions; it is encoded by the PC-AC module and used to add extra supervision, forcing the model to learn class-distinguishable features. In addition, to improve the efficiency of information transfer in topology modeling, we propose multi-hop attention graph convolution, which aggregates each node's k-order neighbors simultaneously to speed up model convergence. LA-GCN achieves state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
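A compact sketch of the multi-hop aggregation idea: 1..k-hop neighborhoods (approximated here by clamped powers of the adjacency matrix) are aggregated in a single layer with a learnable weight per hop. The hop construction and attention form are simplifications, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiHopGC(nn.Module):
    """Aggregate 1..k-hop neighbours in one layer, with a learnable weight per hop."""
    def __init__(self, in_dim, out_dim, adj, k=3):
        super().__init__()                          # adj: (V, V) float skeleton adjacency
        hops = [torch.eye(adj.size(0))]
        for _ in range(k):
            hops.append((hops[-1] @ adj).clamp(max=1.0))   # rough k-hop reachability
        self.register_buffer('hops', torch.stack(hops))    # (k+1, V, V)
        self.hop_attn = nn.Parameter(torch.zeros(k + 1))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                           # x: (B, V, in_dim) joint features
        w = torch.softmax(self.hop_attn, dim=0)     # attention over hop orders
        agg = torch.einsum('k,kuv,bvc->buc', w, self.hops, x)
        return self.proj(agg)
```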
https://arxiv.org/abs/2305.12398
Graph Convolutional Networks (GCNs) have long defined the state of the art in skeleton-based action recognition, leveraging their ability to unravel the complex dynamics of human joint topology through the graph's adjacency matrix. However, an inherent flaw has come to light in these cutting-edge models: they tend to optimize the adjacency matrix jointly with the model weights. This process, while seemingly efficient, causes a gradual decay of bone connectivity data, culminating in a model indifferent to the very topology it sought to map. As a remedy, we propose a threefold strategy: (1) we forge an innovative pathway that encodes bone connectivity by harnessing the power of graph distances, preserving the vital topological nuances often lost in conventional GCNs; (2) we highlight an oft-overlooked feature, the temporal mean of a skeletal sequence, which, despite its modest guise, carries highly action-specific information; (3) our investigation reveals strong variations in joint-to-joint relationships across different actions. This finding exposes the limitations of a single adjacency matrix in capturing the varying relational configurations emblematic of human movement, which we remedy by proposing an efficient refinement to Graph Convolutions (GC), the BlockGC. This refinement cuts parameters by a substantial margin (above 40%) while elevating performance beyond the original GCNs. Our full model, the BlockGCN, establishes new standards in skeleton-based action recognition for small model sizes. Its high accuracy, notably on the large-scale NTU RGB+D 120 dataset, stands as compelling proof of the efficacy of BlockGCN. The source code and model can be found at this https URL.
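Two of the ingredients above are easy to make concrete: a static all-pairs hop-distance encoding of the skeleton (which, unlike a learned adjacency, cannot decay during training) and the temporal mean of a sequence. The sketch below is illustrative; the edge-list input and tensor layout are assumptions.

```python
import torch
from collections import deque

def hop_distances(edges, n_joints):
    """All-pairs hop distance on the skeleton graph (BFS from each joint),
    usable as a static topology encoding."""
    adj = [[] for _ in range(n_joints)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    dist = torch.full((n_joints, n_joints), float('inf'))
    for s in range(n_joints):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s, v] == float('inf'):
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def temporal_mean_feature(x):       # x: (B, C, T, V) skeleton sequence
    """Mean pose over time: a cheap but highly action-specific cue."""
    return x.mean(dim=2)            # (B, C, V)
```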
https://arxiv.org/abs/2305.11468
Action recognition is an important problem that requires identifying actions in video by learning complex interactions across scene actors and objects. However, modern deep-learning-based networks often require significant computation and may capture scene context using various modalities that further increase compute costs. Efficient methods such as those used for AR/VR often use only human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object-based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover any loss in context caused by using only keypoint information, on the AVA action and Kinetics datasets.
https://arxiv.org/abs/2305.09539
End-to-end learning has taken hold of many computer vision tasks, in particular, related to still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines, and only individual components are replaced by neural networks that typically operate on individual frames. As a testbed to study the relevance of such pipelines, we present a new fully annotated video dataset of fitness activities. Any recognition capabilities in this domain are almost exclusively a function of human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.
https://arxiv.org/abs/2305.08191
Ensuring traffic safety and preventing accidents is a critical goal of daily driving, and advances in computer vision technology can be leveraged to achieve it. In this paper, we present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos, namely M$^2$DAR, with a particular focus on detecting distracted driving behaviors. Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations. Furthermore, we propose a new election algorithm consisting of aggregation, filtering, merging, and selection steps to refine the preliminary results from the action recognition module across multiple views. Extensive experiments conducted on the 7th AI City Challenge Track 3 dataset demonstrate the effectiveness of our approach, where we achieved an overlap score of 0.5921 on the A2 test set. Our source code is available at \url{this https URL}.
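A toy, plain-Python version of an aggregate/filter/merge/select election pass over per-view candidates; the tuple format, thresholds, and the requirement that a segment be supported by several views are assumptions rather than the paper's exact algorithm.

```python
def elect_segments(preds, min_score=0.5, min_views=2, gap=1.0):
    """preds: list of (view, action, start, end, score) candidates from all views.
    Returns refined (action, start, end, score) segments."""
    # Filter low-confidence candidates.
    cands = [p for p in preds if p[4] >= min_score]
    results = []
    for action in {p[1] for p in cands}:
        # Aggregate candidates of one action class and sort by start time.
        segs = sorted([p for p in cands if p[1] == action], key=lambda p: p[2])
        merged = []                       # entries: (start, end, supporting_views, score)
        for view, _, s, e, score in segs:
            if merged and s <= merged[-1][1] + gap:
                prev = merged[-1]         # merge overlapping / nearby intervals
                merged[-1] = (min(prev[0], s), max(prev[1], e),
                              prev[2] | {view}, max(prev[3], score))
            else:
                merged.append((s, e, {view}, score))
        # Select segments supported by enough camera views.
        results += [(action, s, e, score) for s, e, views, score in merged
                    if len(views) >= min_views]
    return results
```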
https://arxiv.org/abs/2305.08877
Despite recent advances in video-based action recognition and robust spatio-temporal modeling, most proposed approaches rely on abundant computational resources to afford running huge, computation-intensive convolutional or transformer-based neural networks and obtain satisfactory results. This limits the deployment of such models on edge devices with limited power and computing resources. In this work we investigate an important smart-home application, video-based delivery detection, and present a simple and lightweight pipeline for this task that can run on resource-constrained doorbell cameras. Our proposed pipeline relies on motion cues to generate a set of coarse activity proposals, followed by their classification with a mobile-friendly 3DCNN network. For training, we design a novel semi-supervised attention module that helps the network learn robust spatio-temporal features, and we adopt an evidence-based optimization objective that allows quantifying the uncertainty of the network's predictions. Experimental results on our curated delivery dataset show that our pipeline is significantly more effective than the alternatives and highlight how the novelties in our training phase yield considerable inference-time performance gains at no extra cost.
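The motion-cue proposal stage could be as simple as the NumPy sketch below, which thresholds mean frame differences to produce coarse temporal segments that are later classified by the 3DCNN; the threshold and minimum-length values are illustrative assumptions.

```python
import numpy as np

def motion_proposals(frames, thresh=12.0, min_len=16):
    """Coarse activity proposals from frame differencing.

    frames: (T, H, W) grayscale array; returns [start, end) index ranges whose
    mean absolute frame difference exceeds a threshold."""
    diff = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    active = diff > thresh
    proposals, start = [], None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t                              # motion segment begins
        elif not is_active and start is not None:
            if t - start >= min_len:
                proposals.append((start, t))       # keep segments long enough to classify
            start = None
    if start is not None and len(active) - start >= min_len:
        proposals.append((start, len(active)))
    return proposals
```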
https://arxiv.org/abs/2305.07812
4D human perception plays an essential role in a myriad of applications, such as home automation and metaverse avatar simulation. However, existing solutions which mainly rely on cameras and wearable devices are either privacy intrusive or inconvenient to use. To address these issues, wireless sensing has emerged as a promising alternative, leveraging LiDAR, mmWave radar, and WiFi signals for device-free human sensing. In this paper, we propose MM-Fi, the first multi-modal non-intrusive 4D human dataset with 27 daily or rehabilitation action categories, to bridge the gap between wireless sensing and high-level human perception tasks. MM-Fi consists of over 320k synchronized frames of five modalities from 40 human subjects. Various annotations are provided to support potential sensing tasks, e.g., human pose estimation and action recognition. Extensive experiments have been conducted to compare the sensing capacity of each or several modalities in terms of multiple tasks. We envision that MM-Fi can contribute to wireless sensing research with respect to action recognition, human pose estimation, multi-modal learning, cross-modal supervision, and interdisciplinary healthcare research.
https://arxiv.org/abs/2305.10345
In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing-modality generalization, where some modalities that were present at training time are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at this https URL.
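A minimal PyTorch sketch of a fusion module with modality dropout training: whole modalities are randomly zeroed during training so the fused representation remains usable when modalities go missing at inference. The concatenation-MLP fusion and the drop probability are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class ModalityDropoutFusion(nn.Module):
    """Fuse video/audio/IMU embeddings; during training, whole modalities are
    randomly zeroed so the fused representation tolerates missing inputs."""
    def __init__(self, dim=512, n_mod=3, p_drop=0.5):
        super().__init__()
        self.p_drop = p_drop
        self.fuse = nn.Sequential(nn.Linear(n_mod * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, feats):                   # feats: list of (B, dim) per-modality features
        if self.training:
            keep = torch.rand(len(feats)) > self.p_drop
            if not keep.any():                  # always keep at least one modality
                keep[torch.randint(len(feats), (1,))] = True
            feats = [f if k else torch.zeros_like(f) for f, k in zip(feats, keep)]
        return self.fuse(torch.cat(feats, dim=-1))
```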
https://arxiv.org/abs/2305.07214
Self-supervised video representation learning has aimed at maximizing the similarity between different temporal segments of one video in order to enforce feature persistence over time. This leads to a loss of pertinent information related to temporal relationships, rendering actions such as `enter' and `leave' indistinguishable. To mitigate this limitation, we propose Latent Time Navigation (LTN), a time-parameterized contrastive learning strategy that is streamlined to capture fine-grained motions. Specifically, we maximize the representation similarity between different segments from one video while keeping their representations time-aware along a subspace of the latent representation code that includes an orthogonal basis to represent temporal changes. Our extensive experimental analysis suggests that learning video representations with LTN consistently improves action classification performance on fine-grained and human-oriented tasks (e.g., on the Toyota Smarthome dataset). In addition, we demonstrate that our proposed model, when pre-trained on Kinetics-400, generalizes well to the unseen real-world video benchmark datasets UCF101 and HMDB51, achieving state-of-the-art performance in action recognition.
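One way to picture the latent time navigation idea is the sketch below: a clip representation is shifted along a small orthogonalized basis as a function of its temporal position, so segment representations stay similar yet remain time-aware. The basis size, QR orthogonalization, and linear time-to-coefficient map are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentTimeNavigation(nn.Module):
    """Shift a clip representation along an orthogonalized basis as a function
    of the clip's temporal position within the video."""
    def __init__(self, dim=512, n_basis=8):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(n_basis, dim) * 0.02)
        self.coef = nn.Linear(1, n_basis)       # time offset -> basis coefficients

    def forward(self, z, t):                    # z: (B, dim) clip features, t: (B,) timestamps
        # Orthonormalize the basis so temporal shifts live in a dedicated subspace.
        q, _ = torch.linalg.qr(self.basis.t())          # (dim, n_basis), orthonormal columns
        shift = self.coef(t.unsqueeze(-1)) @ q.t()      # (B, dim) time-dependent shift
        return z + shift
```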
https://arxiv.org/abs/2305.06437
The ability of a non-expert user to specify robot commands is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is with a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challenging setup with demonstrations recorded in natural and diverse human environments. We propose Video-conditioned Policy learning (ViP), a data-driven approach that maps human demonstrations of previously unseen tasks to robot manipulation skills. To this end, we learn our policy to generate appropriate actions given current scene observations and a video of the target task. To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos. Both robot and human videos in our framework are represented by video embeddings pre-trained for human action recognition. At test time we first translate human videos to robot videos in the common video embedding space, and then use the resulting embeddings to condition our policies. Notably, our approach enables robot control by human demonstrations in a zero-shot manner, i.e., without using robot trajectories paired with human instructions during training. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art. Our method also demonstrates excellent performance in a new, challenging zero-shot setup where no paired data is used during training.
https://arxiv.org/abs/2305.06289
Current few-shot action recognition involves two primary sources of information for classification: (1) intra-video information, determined by the frame content within a single video clip, and (2) inter-video information, measured by relationships (e.g., feature similarity) among videos. However, existing methods inadequately exploit these two information sources. In terms of intra-video information, current sampling operations for input videos may omit critical action information, reducing the utilization efficiency of video data. For inter-video information, action misalignment among videos makes it challenging to calculate precise relationships. Moreover, how to jointly consider both inter- and intra-video information remains under-explored for few-shot action recognition. To this end, we propose a novel framework, Video Information Maximization (VIM), for few-shot video action recognition. VIM is equipped with an adaptive spatial-temporal video sampler and a spatiotemporal action alignment model to maximize intra- and inter-video information, respectively. The video sampler adaptively selects important frames and amplifies critical spatial regions for each input video based on the task at hand, preserving and emphasizing the informative parts of video clips while eliminating interference at the data level. The alignment model performs temporal and spatial action alignment sequentially at the feature level, leading to more precise measurements of inter-video similarity. Finally, these goals are facilitated by incorporating additional loss terms based on mutual information measurement. Consequently, VIM acts to maximize the distinctiveness of video information from limited video data. Extensive experimental results on public datasets for few-shot action recognition demonstrate the effectiveness and benefits of our framework.
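The adaptive video sampler can be caricatured as a learned frame scorer followed by top-k selection, as in the PyTorch sketch below; a real implementation would need a differentiable relaxation of the hard top-k, and the feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFrameSampler(nn.Module):
    """Score densely sampled frames and keep the top-k for the backbone,
    so informative frames are preserved at the data level."""
    def __init__(self, dim=512, k=8):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frame_feats):             # frame_feats: (B, T, dim) cheap per-frame features
        scores = self.scorer(frame_feats).squeeze(-1)                 # (B, T) importance
        # Hard top-k shown for clarity; training would use a soft/differentiable variant.
        idx = scores.topk(self.k, dim=1).indices.sort(dim=1).values   # keep temporal order
        return torch.gather(frame_feats, 1,
                            idx.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1)))
```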
https://arxiv.org/abs/2305.06114
Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed. Humans can abstract away the action from the appearance of the objects and their context, which is referred to as the compositionality of actions. Compositional action recognition deals with imparting human-like compositional generalization abilities to action-recognition models. In this regard, extracting the interactions between humans and objects forms the basis of compositional understanding. These interactions are not affected by the appearance biases of the objects or the context, but the context provides additional cues about the interactions between things and stuff. Hence we need to infuse context into the human-object interactions for compositional action recognition. To this end, we first design a spatio-temporal interaction encoder that captures the human-object (things) interactions. The encoder learns spatio-temporal interaction tokens disentangled from the background context. The interaction tokens are then infused with contextual information from the video tokens to model the interactions between things and stuff. The final context-infused spatio-temporal interaction tokens are used for compositional action recognition. We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset, where we obtain a new state-of-the-art result of 83.8% top-1 accuracy, outperforming recent important object-centric methods by a significant margin. Our approach of explicit human-object-stuff interaction modeling is effective even on standard action recognition datasets such as Something-Something-V2 and Epic-Kitchens-100, where we obtain comparable or better performance than the state of the art.
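Context infusion of the interaction tokens can be sketched as cross-attention from the interaction ("things") tokens to the video ("stuff"/context) tokens; the use of nn.MultiheadAttention with a residual connection here is an illustrative stand-in for the paper's module.

```python
import torch
import torch.nn as nn

class ContextInfusion(nn.Module):
    """Let human-object interaction tokens attend to video (context) tokens,
    injecting 'stuff' context into the 'things' interactions."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, interaction_tokens, video_tokens):
        # interaction_tokens: (B, N, dim), video_tokens: (B, M, dim)
        ctx, _ = self.attn(query=interaction_tokens,
                           key=video_tokens, value=video_tokens)
        return self.norm(interaction_tokens + ctx)   # context-infused interaction tokens
```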
https://arxiv.org/abs/2305.02673
Self-supervised skeleton-based action recognition has grown rapidly along with the development of contrastive learning. Existing methods rely on imposing invariance to augmentations of the 3D skeleton within a single data stream, which merely leverages easy positive pairs and limits the ability to explore complicated movement patterns. In this paper, we argue that the defects of single-stream contrast and the lack of necessary feature transformation are responsible for these easy positives, and we therefore propose a Cross-Stream Contrastive Learning framework for skeleton-based action Representation learning (CSCLR). Specifically, the proposed CSCLR not only utilizes intra-stream contrast pairs but also introduces inter-stream contrast pairs as hard samples to formulate better representation learning. Besides, to further exploit the potential of positive pairs and increase the robustness of self-supervised representation learning, we propose a Positive Feature Transformation (PFT) strategy that adopts feature-level manipulation to increase the variance of positive pairs. To validate the effectiveness of our method, we conduct extensive experiments on three benchmark datasets: NTU-RGB+D 60, NTU-RGB+D 120, and PKU-MMD. Experimental results show that our proposed CSCLR exceeds the state-of-the-art methods under a diverse range of evaluation protocols.
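A hedged sketch of inter-stream contrast plus a feature-level positive transformation: the joint-stream embedding of a clip is contrasted against bone-stream embeddings, with the positive perturbed at the feature level (here by mixing with another sample, an illustrative choice rather than the paper's PFT).

```python
import torch
import torch.nn.functional as F

def positive_feature_transform(z, alpha=0.2):
    """Perturb positives at the feature level (illustrative: mix with a shuffled
    sample) to increase the variance of positive pairs."""
    perm = torch.randperm(z.size(0), device=z.device)
    return F.normalize((1 - alpha) * z + alpha * z[perm], dim=-1)

def cross_stream_nce(z_joint, z_bone, tau=0.1):
    """Inter-stream contrast: a clip's joint-stream embedding should match its
    own bone-stream embedding against the other clips in the batch."""
    z_joint = F.normalize(z_joint, dim=-1)
    z_bone = positive_feature_transform(z_bone)
    logits = z_joint @ z_bone.t() / tau
    labels = torch.arange(z_joint.size(0), device=z_joint.device)
    return F.cross_entropy(logits, labels)
```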
https://arxiv.org/abs/2305.02324
Cross-view action recognition (CVAR) seeks to recognize a human action when it is observed from a previously unseen viewpoint. This is a challenging problem since the appearance of an action changes significantly with the viewpoint. Applications of CVAR include surveillance and monitoring of assisted-living facilities, where it is not practical or feasible to collect large amounts of training data when adding a new camera. We present a simple yet efficient CVAR framework to learn invariant features from RGB videos, 3D skeleton data, or both. The proposed approach outperforms the current state of the art, achieving similar levels of performance across input modalities: 99.4% (RGB) and 99.9% (3D skeletons), 99.4% (RGB) and 99.9% (3D skeletons), 97.3% (RGB) and 99.2% (3D skeletons), and 84.4% (RGB) on the N-UCLA, NTU-RGB+D 60, NTU-RGB+D 120, and UWA3DII datasets, respectively.
https://arxiv.org/abs/2305.01733