Video analysis is a major computer vision task that has received a lot of attention in recent years. The current state-of-the-art performance for video analysis is achieved with Deep Neural Networks (DNNs) that have high computational costs and need large amounts of labeled data for training. Spiking Neural Networks (SNNs) have significantly lower computational costs (thousands of times) than regular non-spiking networks when implemented on neuromorphic hardware. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have a significantly larger number of parameters compared with spiking 2D CSNNs. This not only increases the computational costs, but also makes these networks more difficult to implement on neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) in order to reduce the number of parameters required for video analysis. This unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We test our network with the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos, while increasing the output spiking activity and outperforming spiking 3D convolutions.
https://arxiv.org/abs/2309.12761
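The parameter saving behind the separated convolutions described above can be seen with a minimal, non-spiking PyTorch sketch: a single k×k×k spatio-temporal convolution is replaced by a 1×k×k spatial convolution followed by a k×1×1 temporal one. The spiking/STDP machinery of S3TC is not reproduced here, and the channel and input sizes are illustrative assumptions.

```python
# Minimal, non-spiking sketch of the parameter saving from factorizing a
# spatio-temporal convolution into a spatial and a temporal convolution.
import torch
import torch.nn as nn

c_in, c_out, k = 64, 64, 3

# Single spatio-temporal 3D convolution: k*k*k kernels.
conv3d = nn.Conv3d(c_in, c_out, kernel_size=(k, k, k), padding=1, bias=False)

# Factorized version: a spatial 1xkxk convolution followed by a temporal kx1x1 one.
factorized = nn.Sequential(
    nn.Conv3d(c_in, c_out, kernel_size=(1, k, k), padding=(0, 1, 1), bias=False),
    nn.Conv3d(c_out, c_out, kernel_size=(k, 1, 1), padding=(1, 0, 0), bias=False),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, c_in, 16, 56, 56)      # (batch, channels, frames, height, width)
assert conv3d(x).shape == factorized(x).shape
print(n_params(conv3d), n_params(factorized))   # 110592 vs. 36864 + 12288 = 49152
```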
Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews both traditional methods and recent state-of-the-art methods that leverage deep learning techniques. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of these methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.
https://arxiv.org/abs/2309.12067
To integrate action recognition methods into autonomous robotic systems, it is crucial to consider adverse situations involving target occlusions. Such a scenario, despite its practical relevance, is rarely addressed in existing self-supervised skeleton-based action recognition methods. To empower robots with the capacity to address occlusion, we propose a simple and effective method. We first pre-train using occluded skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings to group semantically similar samples. Next, we employ K-nearest-neighbor (KNN) to fill in missing skeleton data based on the closest sample neighbors. Imputing incomplete skeleton sequences to create relatively complete sequences as input provides significant benefits to existing skeleton-based self-supervised models. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning (OPSTL) framework. This enhancement utilizes Adaptive Spatial Masking (ASM) to make better use of high-quality, intact skeletons. The effectiveness of our imputation methods is verified on the challenging occluded versions of the NTURGB+D 60 and NTURGB+D 120 datasets. The source code will be made publicly available at this https URL.
https://arxiv.org/abs/2309.12029
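A rough sketch of the cluster-then-impute idea described above: sequence embeddings are grouped with k-means, and occluded joints are filled in from each sample's nearest neighbours within its cluster. The array shapes, the NaN convention for occluded joints, and the neighbour-averaging step are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch: KMeans on sequence embeddings, then KNN-based filling of
# occluded (NaN) joints from neighbouring samples in the same cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def impute_skeletons(embeddings, skeletons, n_clusters=20, k=5):
    """embeddings: (N, D); skeletons: (N, T, J, 3) with NaN where occluded."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    filled = skeletons.copy()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue                                      # no neighbours to borrow from
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(idx))).fit(embeddings[idx])
        _, nbrs = nn.kneighbors(embeddings[idx])
        for row, neigh in zip(idx, nbrs):
            donors = skeletons[idx[neigh[1:]]]            # exclude the sample itself
            donor_mean = np.nanmean(donors, axis=0)       # positions missing in all donors stay NaN
            mask = np.isnan(filled[row])
            filled[row][mask] = donor_mean[mask]
    return filled

# Toy example: 100 sequences, 50 frames, 25 joints, 20% simulated occlusion.
emb = np.random.randn(100, 256)
skel = np.random.randn(100, 50, 25, 3)
skel[np.random.rand(*skel.shape) < 0.2] = np.nan
print(np.isnan(impute_skeletons(emb, skel)).mean())       # remaining missing fraction
```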
Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most of the existing works are based on skeleton data while using a multi-modality setup. These works overlook the differences in performance among modalities, which leads to the propagation of erroneous knowledge between modalities; moreover, only three fundamental modalities, i.e., joints, bones, and motions, are used, so no additional modalities are explored. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM) which alleviates the propagation of erroneous knowledge between low-performance modalities. Then, we further propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework, named relational cross-modality knowledge distillation, to distill the knowledge from the secondary modalities into the mandatory modalities, considering the relationship constrained by anchors, positives, and negatives. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at this https URL.
https://arxiv.org/abs/2309.12009
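One plausible form of such a relational distillation objective, shown purely as an illustration (the paper's exact loss may differ): the student (mandatory) modality is trained to match the teacher (secondary) modality's anchor-to-positive/negative similarity distribution.

```python
# Illustrative relational cross-modality distillation loss: match the
# teacher's similarity structure over positives and negatives per anchor.
import torch
import torch.nn.functional as F

def relational_kd_loss(student_anchor, student_others, teacher_anchor, teacher_others, tau=0.1):
    """anchor: (B, D); others: (B, M, D) holding positives and negatives per anchor."""
    s_sim = F.cosine_similarity(student_anchor.unsqueeze(1), student_others, dim=-1) / tau
    t_sim = F.cosine_similarity(teacher_anchor.unsqueeze(1), teacher_others, dim=-1) / tau
    return F.kl_div(F.log_softmax(s_sim, dim=-1), F.softmax(t_sim, dim=-1), reduction="batchmean")

# Toy usage with random embeddings: 8 anchors, 16 positives/negatives each.
sa, so = torch.randn(8, 128), torch.randn(8, 16, 128)
ta, to = torch.randn(8, 128), torch.randn(8, 16, 128)
print(relational_kd_loss(sa, so, ta, to).item())
```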
The fine-grained medical action analysis task has received considerable attention from pattern recognition communities recently, but it faces the problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is an essential skill in emergency treatment. Currently, the assessment of CPR skills mainly depends on dummies and trainers, leading to high training costs and low efficiency. For the first time, this paper constructs a vision-based system to complete error action recognition and skill assessment in CPR. Specifically, we define 13 types of single-error actions and 74 types of composite error actions that occur during external cardiac compression, and then develop a video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark, this paper thoroughly investigates and compares the performance of existing action recognition models based on different data modalities. To solve the unavoidable Single-class Training & Multi-class Testing problem, we propose a human-cognition-inspired framework named ImagineNet to improve the model's multi-error recognition performance under restricted supervision. Extensive experiments verify the effectiveness of the framework. We hope this work could advance research toward fine-grained medical action analysis and skill assessment. The CPR-Coach dataset and the code of ImagineNet are publicly available on Github.
https://arxiv.org/abs/2309.11718
We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which result in performance improvements. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance.
https://arxiv.org/abs/2309.11445
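A deliberately simplified sketch of the two-stage paradigm (not the actual SkeleTR architecture): a graph convolution encodes each person's skeleton sequence into one token, and stacked Transformer encoders then model interactions across person instances. The placeholder adjacency matrix, dimensions, and pooling choices are illustrative assumptions.

```python
# Simplified two-stage sketch: per-person GCN encoding, then a Transformer
# over person tokens to capture interactions.
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)                 # (J, J) normalized adjacency
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                # x: (N, T, J, C)
        return torch.relu(self.proj(torch.einsum("jk,ntkc->ntjc", self.adj, x)))

class SkeleTRSketch(nn.Module):
    def __init__(self, joints=17, d_model=128, heads=4, layers=2, classes=60):
        super().__init__()
        adj = torch.eye(joints)                          # placeholder skeleton graph
        self.gcn = TinyGCN(3, d_model, adj)
        enc = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(d_model, classes)

    def forward(self, x):                                # x: (N_persons, T, J, 3)
        tokens = self.gcn(x).mean(dim=(1, 2))            # one token per person sequence
        ctx = self.transformer(tokens.unsqueeze(0))      # model person interactions
        return self.head(ctx.mean(dim=1))                # video-level logits

print(SkeleTRSketch()(torch.randn(5, 48, 17, 3)).shape)  # torch.Size([1, 60])
```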
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for ``sight'' or ``hearing'' and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
https://arxiv.org/abs/2309.10783
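As an illustration of the idea above, the sketch below assembles per-frame visual captions and an audio transcript into a single zero-shot classification prompt for an LLM. The caption and transcript strings are placeholders, and the exact prompt wording used in the paper is not reproduced.

```python
# Sketch of turning multimodal textual descriptions into one zero-shot
# classification prompt; the resulting string would be sent to an LLM.
frame_captions = [
    "a person stands on a diving board",
    "the person jumps into a swimming pool",
]                                                    # e.g. obtained from BLIP-2
audio_transcript = "splashing water and cheering"    # e.g. obtained from Whisper
classes = ["diving", "surfing", "playing guitar"]

prompt = (
    "You are given textual descriptions of a video.\n"
    "Visual descriptions (one per sampled frame):\n- "
    + "\n- ".join(frame_captions)
    + f"\nAudio description: {audio_transcript}\n"
    + f"Which of these actions is shown: {', '.join(classes)}?\n"
    + "Answer with exactly one class name."
)
print(prompt)    # the prompt would then be passed to GPT-3.5, Llama 2, etc.
```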
Generalized zero-shot skeleton-based action recognition (GZSSAR) is a new challenging problem in the computer vision community, which requires models to recognize actions without any training samples. Previous studies only utilize the action labels of verb phrases as the semantic prototypes for learning the mapping from skeleton-based actions to a shared semantic space. However, the limited semantic information of action labels restricts the generalization ability of skeleton features for recognizing unseen actions. In order to solve this dilemma, we propose a multi-semantic fusion (MSF) model for improving the performance of GZSSAR, where two kinds of class-level textual descriptions (i.e., action descriptions and motion descriptions) are collected as auxiliary semantic information to enhance the learning efficacy of generalizable skeleton features. Specifically, a pre-trained language encoder takes the action descriptions, motion descriptions and original class labels as inputs to obtain rich semantic features for each action class, while a skeleton encoder is implemented to extract skeleton features. Then, a variational autoencoder (VAE) based generative module is employed to learn a cross-modal alignment between skeleton and semantic features. Finally, a classification module is built to recognize the action categories of input samples, where a seen-unseen classification gate is adopted to predict whether a sample comes from seen action classes or not in GZSSAR. The superior performance in comparisons with previous models validates the effectiveness of the proposed MSF model on GZSSAR.
https://arxiv.org/abs/2309.09592
The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from the overfitting effect on small-scale datasets with a limited number of training videos. A common solution is to exploit existing image augmentation strategies for each frame individually, including Mixup, Cutmix, and RandAugment, which are not particularly optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes the volumes up to achieve a new training video. Technically, we propose two new modules, i.e., a spatial selective module to select the local patches for each spatial position, and a temporal selective module to mix the entire frames for each timestamp and maintain the spatial pattern. At each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boost the performance of both CNN-based and transformer-based models.
https://arxiv.org/abs/2309.09534
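A deliberately simplified sketch of the temporal selective mixing mechanism: a small learnable module scores each timestamp and the two clips are mixed frame-wise accordingly. The real SV-Mix selects informative volumes with dedicated spatial and temporal modules; this only illustrates the mixing idea, and the tiny scoring module is an assumption.

```python
# Simplified frame-wise selective mixing of two clips with learned weights.
import torch
import torch.nn as nn

class TemporalSelectiveMix(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.score = nn.Linear(2 * channels, 1)                    # tiny learnable frame selector

    def forward(self, vid_a, vid_b):                               # each: (B, C, T, H, W)
        pooled = torch.cat([vid_a.mean(dim=(3, 4)),                # per-frame global statistics
                            vid_b.mean(dim=(3, 4))], dim=1)        # (B, 2C, T)
        w = torch.sigmoid(self.score(pooled.transpose(1, 2)))      # (B, T, 1) mixing weights
        w = w.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)          # (B, 1, T, 1, 1)
        return w * vid_a + (1 - w) * vid_b                         # frame-wise mixture keeps spatial layout

mix = TemporalSelectiveMix()
a, b = torch.randn(2, 3, 8, 112, 112), torch.randn(2, 3, 8, 112, 112)
print(mix(a, b).shape)                                             # torch.Size([2, 3, 8, 112, 112])
```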
Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features in action recognition or temporal object co-occurrences in video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias due to the frame length discrepancy between the training and test sets of trimmed video clips, which is the first such attempt for a text-video retrieval task, to the best of our knowledge. We first hypothesise and verify how the bias affects the model, illustrated with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model surpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric, which proves the bias is mitigated, as well as on the other conventional metrics.
https://arxiv.org/abs/2309.09311
Skeletal action recognition from an egocentric view is important for applications such as interfaces in AR/VR glasses and human-robot interaction, where the device has limited resources. Most of the existing skeletal action recognition approaches use 3D coordinates of hand joints and 8-corner rectangular bounding boxes of objects as inputs, but they do not capture how the hands and objects interact with each other within the spatial context. In this paper, we present a new framework called Contact-aware Skeletal Action Recognition (CaSAR). It uses novel representations of hand-object interaction that encompass spatial information: 1) contact points where the hand joints meet the objects, and 2) distant points where the hand joints are far away from the object and barely involved in the current action. Our framework is able to learn how the hands touch or stay away from the objects in each frame of the action sequence, and to use this information to predict the action class. We demonstrate that our approach achieves state-of-the-art accuracy of 91.3% and 98.4% on two public datasets, H2O and FPHA, respectively.
https://arxiv.org/abs/2309.10001
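A hedged sketch of how contact and distant points could be derived from 3D hand joints and the eight corners of an object bounding box; the distance thresholds and the use of corner (rather than surface) distances are assumptions, not the paper's exact definition.

```python
# Label hand joints as contact or distant points by their distance to the
# nearest corner of the object bounding box (illustrative thresholds).
import numpy as np

def contact_and_distant(hand_joints, box_corners, contact_thr=0.02, distant_thr=0.15):
    """hand_joints: (21, 3) in metres; box_corners: (8, 3)."""
    # Distance from every joint to its nearest box corner.
    d = np.linalg.norm(hand_joints[:, None, :] - box_corners[None, :, :], axis=-1).min(axis=1)
    return hand_joints[d < contact_thr], hand_joints[d > distant_thr]

joints = np.random.rand(21, 3) * 0.2
corners = np.random.rand(8, 3) * 0.2
contact, distant = contact_and_distant(joints, corners)
print(contact.shape, distant.shape)
```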
Action recognition is a key technology for many industrial applications. Methods using visual information such as images are very popular. However, privacy issues prevent widespread usage due to the inclusion of private information, such as visible faces and scene backgrounds, which are not necessary for recognizing user actions. In this paper, we propose privacy-preserving action recognition by ultrasound active sensing. As non-invasive action recognition from ultrasound active sensing has not been well investigated, we create a new dataset for action recognition and conduct a comparison of features for classification. We calculated feature values by focusing on the temporal variation of the amplitude of reflected ultrasound waves and performed classification using a support vector machine and VGG for eight fundamental action classes. We confirmed that our method achieved an accuracy of 97.9% when trained and evaluated on the same person and in the same environment. Additionally, our method achieved an accuracy of 89.5% even when trained and evaluated on different people. We also report analyses of the accuracy under various conditions and discuss limitations.
https://arxiv.org/abs/2309.08087
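An illustrative sketch (not the authors' exact feature set) of classifying actions from the temporal variation of reflected-wave amplitude: per-frame amplitudes are summarized with simple statistics and fed to a support vector machine. The window size, chosen statistics, and synthetic data are assumptions.

```python
# Summarize amplitude variation of a reflected ultrasound signal and classify
# with an SVM (toy synthetic data in place of real recordings).
import numpy as np
from sklearn.svm import SVC

def amplitude_features(echo, frame_len=256):
    """echo: 1-D reflected-wave signal; returns a small feature vector."""
    n = len(echo) // frame_len
    amp = np.abs(echo[: n * frame_len]).reshape(n, frame_len).mean(axis=1)   # per-frame amplitude
    damp = np.diff(amp)                                                      # temporal variation
    return np.array([amp.mean(), amp.std(), amp.max(),
                     damp.mean(), damp.std(), np.abs(damp).max()])

# Toy example: 80 synthetic recordings spread over 8 action classes.
rng = np.random.default_rng(0)
X = np.stack([amplitude_features(rng.normal(size=16384) * (1 + c))
              for c in range(8) for _ in range(10)])
y = np.repeat(np.arange(8), 10)
print(SVC(kernel="rbf").fit(X, y).score(X, y))
```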
Human action recognition (HAR) is a high-level and significant research area in computer vision due to its ubiquitous applications. The main limitations of current HAR models are their complex structures and lengthy training time. In this paper, we propose a simple yet versatile and effective end-to-end deep learning architecture, coined TransNet, for HAR. TransNet decomposes the complex 3D-CNNs into 2D- and 1D-CNNs, where the 2D- and 1D-CNN components extract spatial features and temporal patterns in videos, respectively. Benefiting from its concise architecture, TransNet is ideally compatible with any pretrained state-of-the-art 2D-CNN model from other fields, which can be transferred to serve the HAR task. In other words, it naturally leverages the power and success of transfer learning for HAR, bringing huge advantages in terms of efficiency and effectiveness. Extensive experimental results and the comparison with state-of-the-art models demonstrate the superior performance of the proposed TransNet in HAR in terms of flexibility, model complexity, training speed and classification accuracy.
https://arxiv.org/abs/2309.06951
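A minimal sketch of the 2D + 1D decomposition described above: a 2D CNN is applied per frame for spatial features, then a 1D temporal convolution runs over the sequence of per-frame features. The tiny backbone below stands in for any pretrained 2D model; the feature size, head, and pooling are assumptions.

```python
# Per-frame 2D CNN followed by a 1D temporal convolution over frame features.
import torch
import torch.nn as nn

class TransNetSketch(nn.Module):
    def __init__(self, backbone_2d, feat_dim=64, num_classes=10):
        super().__init__()
        self.backbone_2d = backbone_2d                                # spatial features per frame
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)  # temporal patterns
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                                             # x: (B, T, C, H, W)
        b, t = x.shape[:2]
        f = self.backbone_2d(x.flatten(0, 1))                         # (B*T, feat_dim)
        f = f.view(b, t, -1).transpose(1, 2)                          # (B, feat_dim, T)
        f = torch.relu(self.temporal(f)).mean(dim=2)                  # temporal conv + average pooling
        return self.head(f)

# A stand-in 2D backbone; in practice this could be any pretrained 2D CNN.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = TransNetSketch(backbone)
print(model(torch.randn(2, 16, 3, 112, 112)).shape)                   # torch.Size([2, 10])
```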
Deep learning approaches to natural language processing have made great strides in recent years. While these models produce symbols that convey vast amounts of diverse knowledge, it is unclear how such symbols are grounded in data from the world. In this paper, we explore the development of a private language for visual data representation by training emergent language (EL) encoders/decoders in both i) a traditional referential game environment and ii) a contrastive learning environment utilizing a within-class matching training paradigm. An additional classification layer utilizing neural machine translation and random forest classification was used to transform symbolic representations (sequences of integer symbols) to class labels. These methods were applied in two experiments focusing on object recognition and action recognition. For object recognition, a set of sketches produced by human participants from real imagery was used (Sketchy dataset) and for action recognition, 2D trajectories were generated from 3D motion capture systems (MOVI dataset). In order to interpret the symbols produced for data in each experiment, gradient-weighted class activation mapping (Grad-CAM) methods were used to identify pixel regions indicating semantic features which contribute evidence towards symbols in learned languages. Additionally, a t-distributed stochastic neighbor embedding (t-SNE) method was used to investigate embeddings learned by CNN feature extractors.
https://arxiv.org/abs/2309.06335
Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely the Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate a decoupling module with a feature extractor to derive explicit clues from the spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and the extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, bringing the latest developments in masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly.
https://arxiv.org/abs/2309.05834
Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer) designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information fusion. Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins.
https://arxiv.org/abs/2309.05032
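A rough sketch of what factorized time-modality self-attention can look like: attention is applied along the time axis within each modality, then along the modality axis at each timestep, instead of jointly over all (modality, time) pairs. This illustrates the factorization only; it is not UCFFormer's actual module, and the dimensions are assumptions.

```python
# Factorized self-attention over (modality, time) tokens: time first, then modality.
import torch
import torch.nn as nn

class FactorizedTimeModalityAttention(nn.Module):
    def __init__(self, d_model=128, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.modality_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, x):                          # x: (B, M, T, D) modality/time tokens
        b, m, t, d = x.shape
        xt = x.reshape(b * m, t, d)                # attend over time within each modality
        xt, _ = self.time_attn(xt, xt, xt)
        xm = xt.reshape(b, m, t, d).transpose(1, 2).reshape(b * t, m, d)
        xm, _ = self.modality_attn(xm, xm, xm)     # attend across modalities per timestep
        return xm.reshape(b, t, m, d).transpose(1, 2)

attn = FactorizedTimeModalityAttention()
print(attn(torch.randn(2, 3, 16, 128)).shape)      # torch.Size([2, 3, 16, 128])
```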
Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. In particular, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data in a self-supervised manner. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class-discriminative features from the source data. As the training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class-discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques. Our code is available at \hyperlink{this https URL}{this https URL}
https://arxiv.org/abs/2309.03989
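A toy sketch of one way such a progressive curriculum could shift emphasis from supervised source-domain learning to target-domain self-supervision over training; the schedule shape, warm-up fraction, and loss names are assumptions, not the paper's exact recipe.

```python
# Schedule that moves the loss weight from supervised source learning to
# self-supervised target-domain learning as training progresses.
def curriculum_weight(epoch, total_epochs, warmup_frac=0.3):
    """Returns the weight on the supervised source loss (1 -> 0)."""
    if epoch < warmup_frac * total_epochs:
        return 1.0                                   # learn class-discriminative source features first
    progress = (epoch - warmup_frac * total_epochs) / ((1 - warmup_frac) * total_epochs)
    return max(0.0, 1.0 - progress)                  # gradually hand over to target-domain learning

def total_loss(loss_source_sup, loss_target_ssl, epoch, total_epochs):
    w = curriculum_weight(epoch, total_epochs)
    return w * loss_source_sup + (1.0 - w) * loss_target_ssl

for e in (0, 30, 60, 99):
    print(e, round(curriculum_weight(e, 100), 2))    # 1.0, 1.0, 0.57, 0.01
```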
Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDMs for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually only capable of generating a very limited number of video frames. Some existing works focus on separate prediction models for generating more video frames, which however suffer from additional training cost and frame-level jittering. In this paper, we propose a framework called "Reuse and Diffuse", dubbed $\textit{VidRD}$, to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. Besides, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets, including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available $\href{this https URL}{here}$.
https://arxiv.org/abs/2309.03549
With the surge in attention to Egocentric Hand-Object Interaction (Ego-HOI), large-scale datasets such as Ego4D and EPIC-KITCHENS have been proposed. However, most current research is built on resources derived from third-person video action recognition. This inherent domain gap between first- and third-person action videos, which has not been adequately addressed before, makes current Ego-HOI suboptimal. This paper rethinks and proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA). We contribute comprehensive pre-train sets, balanced test sets and a new baseline, which are complete with a training-finetuning strategy. With our new framework, we not only achieve state-of-the-art performance on Ego-HOI benchmarks but also build several new and effective mechanisms and settings to advance further research. We believe our data and findings will pave a new way for Ego-HOI understanding. Code and data are available at this https URL
https://arxiv.org/abs/2309.02423
Deep learning models have a risk of utilizing spurious clues to make predictions, such as recognizing actions based on the background scene. This issue can severely degrade the open-set action recognition performance when the testing samples have different scene distributions from the training samples. To mitigate this problem, we propose a novel method, called Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module. The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results indicate that the current open-set action recognizers are biased toward the scene, and our proposed SOAR method better mitigates such bias. Furthermore, our extensive experiments demonstrate that our method outperforms state-of-the-art methods, and the ablation studies confirm the effectiveness of our proposed modules.
https://arxiv.org/abs/2309.01265
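One common way to make video features uninformative about scene type is to place a gradient reversal layer in front of a scene classifier. The sketch below illustrates that generic mechanism, not necessarily SOAR's exact adversarial formulation; the feature and class sizes are assumptions.

```python
# Gradient reversal in front of a scene classifier: the classifier learns to
# predict the scene while the feature extractor is pushed to hide it.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None          # flip the gradient for the feature extractor

class SceneDebiasHead(nn.Module):
    def __init__(self, feat_dim=512, num_scenes=10, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.scene_classifier = nn.Linear(feat_dim, num_scenes)

    def forward(self, video_features):
        reversed_feats = GradReverse.apply(video_features, self.lamb)
        return self.scene_classifier(reversed_feats)

feats = torch.randn(4, 512, requires_grad=True)
scene_logits = SceneDebiasHead()(feats)
scene_logits.sum().backward()                          # gradients reaching `feats` are reversed
print(scene_logits.shape, feats.grad.shape)
```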