Most existing one-shot skeleton-based action recognition methods focus on raw low-level information (e.g., joint locations) and may suffer from local information loss and low generalization ability. To alleviate these issues, we propose to leverage text descriptions generated by large language models (LLMs), which contain high-level human knowledge, to guide feature learning in a global-local-global way. Specifically, during training, we design two prompts to obtain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder to focus on informative joints (i.e., global-to-local). Then we build non-local interaction between local text and joint features to form the final global representation (i.e., local-to-global). To mitigate the asymmetry between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel-class inference without any text input, making the additional inference cost negligible compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms existing SOTA methods by large margins, while the inference cost (model size) is only $2.8\%$ of the previous SOTA. CrossGLG can also serve as a plug-and-play module that substantially enhances different SOTA skeleton encoders at negligible inference cost. The source code will be released soon.
https://arxiv.org/abs/2403.10082
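To make the local-to-global step above concrete, here is a minimal PyTorch sketch of cross-attention from per-joint skeleton features to per-joint text embeddings, pooled into a global representation. The module name, tensor shapes, and head count are assumptions for illustration, not the actual CrossGLG implementation.

    import torch
    import torch.nn as nn

    class LocalTextJointFusion(nn.Module):
        """Hedged sketch: joint features (queries) attend to per-joint text
        embeddings (keys/values); the result is pooled into one global vector."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, joint_feats, text_feats):
            # joint_feats: (B, J, C) skeleton-encoder output, one token per joint
            # text_feats:  (B, J, C) LLM-derived local (per-joint) text embeddings
            fused, _ = self.attn(joint_feats, text_feats, text_feats)
            fused = fused + joint_feats           # residual connection
            return fused.mean(dim=1)              # (B, C) global representation

    m = LocalTextJointFusion()
    g = m(torch.randn(2, 25, 256), torch.randn(2, 25, 256))
    print(g.shape)  # torch.Size([2, 256])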
Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at this https URL.
https://arxiv.org/abs/2403.09975
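The co-teaching ingredient mentioned above can be illustrated with the classic small-loss trick: two networks each select their lowest-loss (likely clean) samples and train their peer on them. This is a generic co-teaching step under assumed names, not the full NoiseEraSAR pipeline (which also adds global sample selection and CM-MoE).

    import torch
    import torch.nn.functional as F

    def co_teaching_step(net_a, net_b, opt_a, opt_b, x, y, keep_ratio=0.8):
        """One generic co-teaching update; keep_ratio is an assumed hyperparameter
        controlling how many (small-loss) samples survive the selection."""
        with torch.no_grad():
            loss_a = F.cross_entropy(net_a(x), y, reduction="none")
            loss_b = F.cross_entropy(net_b(x), y, reduction="none")
        k = max(1, int(keep_ratio * len(y)))
        idx_a = torch.argsort(loss_a)[:k]    # clean set proposed by net A
        idx_b = torch.argsort(loss_b)[:k]    # clean set proposed by net B
        # cross-update: A learns from B's selection and vice versa
        opt_a.zero_grad()
        F.cross_entropy(net_a(x[idx_b]), y[idx_b]).backward()
        opt_a.step()
        opt_b.zero_grad()
        F.cross_entropy(net_b(x[idx_a]), y[idx_a]).backward()
        opt_b.step()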
3D hand poses are an under-explored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture objects and environments with which humans interact. To efficiently model hand-object interactions, we propose HandFormer, a novel multimodal transformer. HandFormer combines 3D hand poses at a high temporal resolution for fine-grained motion modeling with sparsely sampled RGB frames for encoding scene semantics. Observing the unique characteristics of hand poses, we temporally factorize hand modeling and represent each joint by its short-term trajectories. This factorized pose representation combined with sparse RGB samples is remarkably efficient and achieves high accuracy. Unimodal HandFormer with only hand poses outperforms existing skeleton-based methods at 5x fewer FLOPs. With RGB, we achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.
https://arxiv.org/abs/2403.09805
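The temporal factorization described above — representing each joint by its short-term trajectory — can be sketched as follows; the window length and the plain reshape are assumptions, and the real HandFormer adds learned encoding on top of such tokens.

    import torch

    def short_term_trajectory_tokens(poses, window=8):
        """poses: (T, J, 3) hand-pose sequence -> (T // window, J, window * 3):
        one token per joint per window, holding that joint's short trajectory."""
        T, J, C = poses.shape
        T = (T // window) * window                      # drop the ragged tail
        segs = poses[:T].reshape(T // window, window, J, C)
        return segs.permute(0, 2, 1, 3).reshape(T // window, J, window * C)

    tokens = short_term_trajectory_tokens(torch.randn(64, 21, 3))
    print(tokens.shape)  # torch.Size([8, 21, 24])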
Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
https://arxiv.org/abs/2403.09508
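One way to picture the four Skate-Types is as the cross product of local/strided groupings along the joint axis and the frame axis; the toy partition below illustrates that reading with assumed group sizes and is not the authors' exact scheme.

    import torch

    def axis_groups(x, dim, size, distant):
        """Split axis `dim` (length N) into N//size groups of `size`:
        contiguous blocks when distant=False, stride-(N//size) picks when True."""
        n = x.shape[dim]
        if distant:
            return x.unflatten(dim, (size, n // size)).transpose(dim, dim + 1)
        return x.unflatten(dim, (n // size, size))

    def skate_partition(x, fg=8, jg=5, frame_distant=False, joint_distant=False):
        """x: (B, T, J, C) -> (B, T//fg, J//jg, fg*jg, C); Skate-MSA would then
        attend over the fg*jg tokens inside each block."""
        x = axis_groups(x, 1, fg, frame_distant)   # (B, T//fg, fg, J, C)
        x = axis_groups(x, 3, jg, joint_distant)   # (B, T//fg, fg, J//jg, jg, C)
        x = x.permute(0, 1, 3, 2, 4, 5)            # (B, T//fg, J//jg, fg, jg, C)
        return x.flatten(3, 4)

    blocks = skate_partition(torch.randn(2, 64, 25, 128), frame_distant=True)
    print(blocks.shape)  # torch.Size([2, 8, 5, 40, 128])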
Event camera, a novel bio-inspired vision sensor, has drawn a lot of attention for its low latency, low power consumption, and high dynamic range. Currently, overfitting remains a critical problem in event-based classification tasks for Spiking Neural Network (SNN) due to its relatively weak spatial representation capability. Data augmentation is a simple but efficient method to alleviate overfitting and improve the generalization ability of neural networks, and saliency-based augmentation methods are proven to be effective in the image processing field. However, there is no approach available for extracting saliency maps from SNNs. Therefore, for the first time, we present Spiking Layer-Time-wise Relevance Propagation rule (SLTRP) and Spiking Layer-wise Relevance Propagation rule (SLRP) in order for SNN to generate stable and accurate CAMs and saliency maps. Based on this, we propose EventRPG, which leverages relevance propagation on the spiking neural network for more efficient augmentation. Our proposed method has been evaluated on several SNN structures, achieving state-of-the-art performance in object recognition tasks including N-Caltech101, CIFAR10-DVS, with accuracies of 85.62% and 85.55%, as well as action recognition task SL-Animals with an accuracy of 91.59%. Our code is available at this https URL.
https://arxiv.org/abs/2403.09274
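For readers unfamiliar with relevance propagation, the classic LRP-ε rule for a single linear layer is sketched below; the paper's SLRP/SLTRP rules extend this idea to spiking layers and the time dimension, which this generic snippet does not cover.

    import torch

    def lrp_epsilon_linear(a, weight, bias, relevance_out, eps=1e-6):
        """Standard LRP-epsilon backward step for one linear layer.
        a: (N, d_in) layer inputs, weight: (d_out, d_in), bias: (d_out,),
        relevance_out: (N, d_out) -> returns relevance_in: (N, d_in)."""
        z = a @ weight.t() + bias + eps    # stabilized pre-activations
        s = relevance_out / z              # (N, d_out)
        c = s @ weight                     # (N, d_in)
        return a * c                       # redistribute relevance to inputs

    r_in = lrp_epsilon_linear(torch.rand(4, 10), torch.rand(3, 10),
                              torch.rand(3), torch.rand(4, 3))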
On modern industrial assembly lines, many intelligent algorithms have been developed to replace or supervise workers. However, we found bottlenecks in both training datasets and real-time performance when deploying algorithms on an actual assembly line. We therefore developed a promising strategy for expanding industrial datasets, which uses large models with strong generalization abilities to achieve efficient, high-quality, and large-scale dataset expansion, addressing the problem of insufficient and low-quality industrial datasets. We also applied this strategy to video action recognition, proposing a method that converts hand action recognition into hand skeletal trajectory classification, which solves the real-time performance problem of industrial algorithms. In the "hand movements during wire insertion" scenario on the actual assembly line, hand action recognition reaches an accuracy of 98.8\%. We conducted detailed experimental analysis to demonstrate the effectiveness and superiority of the method, and deployed the entire pipeline on Midea's actual assembly line.
https://arxiv.org/abs/2403.09056
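As a rough illustration of the conversion described above, a clip of per-frame hand keypoints can be flattened into one normalized trajectory vector and fed to a lightweight classifier; the keypoint source, normalization, and classifier choice here are all assumptions rather than the deployed pipeline.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def trajectory_feature(keypoints):
        """keypoints: (T, 21, 2) hand landmarks per frame (e.g. from an
        off-the-shelf hand detector). Returns one fixed-length vector."""
        kp = keypoints - keypoints.mean(axis=(0, 1), keepdims=True)  # center
        kp = kp / (np.abs(kp).max() + 1e-6)                          # scale
        return kp.reshape(-1)

    # toy usage with random data standing in for real clips
    X = np.stack([trajectory_feature(np.random.rand(32, 21, 2)) for _ in range(64)])
    y = np.random.randint(0, 3, size=64)
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(X, y)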
Industrial management, including quality control and cost and safety optimization, relies heavily on high-quality industrial human action recognition (IHAR), which has been hard to implement in large-scale industrial scenes due to high cost and poor real-time performance. In this paper, we propose a large-scale foundation model (LSFM)-based IHAR method in which various LSFMs and lightweight methods are jointly used, for the first time, to achieve low-cost dataset establishment and real-time IHAR. Comprehensive tests on in-situ large-scale industrial manufacturing lines show that the proposed method greatly reduces labor costs, delivers superior real-time performance, and achieves satisfactory accuracy and generalization, indicating its great potential as a backbone IHAR method, especially for large-scale industrial applications.
https://arxiv.org/abs/2403.08420
In this paper, we introduce Attention Prompt Tuning (APT) - a computationally efficient variant of prompt tuning for video-based applications such as action recognition. Prompt tuning approaches involve injecting a set of learnable prompts along with data tokens during fine-tuning while keeping the backbone frozen. This approach greatly reduces the number of learnable parameters compared to full tuning. For image-based downstream tasks, normally a couple of learnable prompts achieve results close to those of full tuning. However, videos, which contain more complex spatiotemporal information, require hundreds of tunable prompts to achieve reasonably good results. This reduces the parameter efficiency observed in images and significantly increases latency and the number of floating-point operations (FLOPs) during inference. To tackle these issues, we directly inject the prompts into the keys and values of the non-local attention mechanism within the transformer block. Additionally, we introduce a novel prompt reparameterization technique to make APT more robust against hyperparameter selection. The proposed APT approach greatly reduces the number of FLOPs and latency while achieving a significant performance boost over the existing parameter-efficient tuning methods on UCF101, HMDB51, and SSv2 datasets for action recognition. The code and pre-trained models are available at this https URL
https://arxiv.org/abs/2403.06978
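A minimal sketch of the key/value injection described above: learnable prompt tokens are concatenated to the keys and values only, so the output keeps the original token count and the FLOP overhead stays small. Dimensions and prompt length are assumptions, and the prompt reparameterization trick is omitted.

    import torch
    import torch.nn as nn

    class KVPromptAttention(nn.Module):
        def __init__(self, dim=768, heads=12, n_prompts=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

        def forward(self, x):
            # x: (B, N, C) tokens from the frozen backbone block
            p = self.prompts.expand(x.size(0), -1, -1)
            kv = torch.cat([p, x], dim=1)          # prompts join keys/values only
            out, _ = self.attn(x, kv, kv)          # queries stay the N data tokens
            return out

    y = KVPromptAttention()(torch.randn(2, 197, 768))
    print(y.shape)  # torch.Size([2, 197, 768])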
Human action recognition in videos is a critical task with significant implications for numerous applications, including surveillance, sports analytics, and healthcare. The challenge lies in creating models that are both precise in their recognition capabilities and efficient enough for practical use. This study conducts an in-depth analysis of various deep learning models to address this challenge. Utilizing a subset of the UCF101 Videos dataset, we focus on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Two-Stream ConvNets. The research reveals that while CNNs effectively capture spatial features and RNNs encode temporal sequences, Two-Stream ConvNets exhibit superior performance by integrating spatial and temporal dimensions. These insights are distilled from the evaluation metrics of accuracy, precision, recall, and F1-score. The results of this study underscore the potential of composite models in achieving robust human action recognition and suggest avenues for future research in optimizing these models for real-world deployment.
https://arxiv.org/abs/2403.06810
Emergency Medical Services (EMS) responders often operate under time-sensitive conditions, facing cognitive overload and inherent risks, requiring essential skills in critical thinking and rapid decision-making. This paper presents CognitiveEMS, an end-to-end wearable cognitive assistant system that can act as a collaborative virtual partner engaging in the real-time acquisition and analysis of multimodal data from an emergency scene and interacting with EMS responders through Augmented Reality (AR) smart glasses. CognitiveEMS processes the continuous streams of data in real-time and leverages edge computing to provide assistance in EMS protocol selection and intervention recognition. We address key technical challenges in real-time cognitive assistance by introducing three novel components: (i) a Speech Recognition model that is fine-tuned for real-world medical emergency conversations using simulated EMS audio recordings, augmented with synthetic data generated by large language models (LLMs); (ii) an EMS Protocol Prediction model that combines state-of-the-art (SOTA) tiny language models with EMS domain knowledge using graph-based attention mechanisms; (iii) an EMS Action Recognition module which leverages multimodal audio and video data and protocol predictions to infer the intervention/treatment actions taken by the responders at the incident scene. Our results show that for speech recognition we achieve superior performance compared to SOTA (WER of 0.290 vs. 0.618) on conversational data. Our protocol prediction component also significantly outperforms SOTA (top-3 accuracy of 0.800 vs. 0.200) and the action recognition achieves an accuracy of 0.727, while maintaining an end-to-end latency of 3.78s for protocol prediction on the edge and 0.31s on the server.
https://arxiv.org/abs/2403.06734
Temporal localization of driving actions plays a crucial role in advanced driver-assistance systems and naturalistic driving studies. However, this is a challenging task due to strict requirements for robustness, reliability and accurate localization. In this work, we focus on improving the overall performance by efficiently utilizing video action recognition networks and adapting these to the problem of action localization. To this end, we first develop a density-guided label smoothing technique based on label probability distributions to facilitate better learning from boundary video-segments that typically include multiple labels. Second, we design a post-processing step to efficiently fuse information from video-segments and multiple camera views into scene-level predictions, which facilitates elimination of false positives. Our methodology yields a competitive performance on the A2 test set of the naturalistic driving action recognition track of the 2022 NVIDIA AI City Challenge with an F1 score of 0.271.
https://arxiv.org/abs/2403.06616
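One plausible reading of density-guided label smoothing for boundary segments is to build the soft target from the empirical distribution of frame labels inside the segment instead of a uniform smoothing term; the sketch below follows that reading, with the mixing weight as an assumed hyperparameter.

    import torch

    def density_guided_targets(frame_labels, num_classes, alpha=0.2):
        """frame_labels: (T,) per-frame labels of one video segment.
        Returns a soft target mixing the majority one-hot with the
        in-segment label density (a hedged reading of the technique)."""
        density = torch.bincount(frame_labels, minlength=num_classes).float()
        density = density / density.sum()
        one_hot = torch.zeros(num_classes)
        one_hot[frame_labels.mode().values] = 1.0
        return (1 - alpha) * one_hot + alpha * density

    t = density_guided_targets(torch.tensor([3, 3, 3, 7, 7]), num_classes=16)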
Classification and temporal localization of driving actions are important for advanced driver-assistance systems and naturalistic driving studies. Temporal localization is challenging because it requires robustness, reliability, and accuracy. In this study, we aim to improve temporal localization and classification accuracy by combining video action recognition and 2D human-pose estimation networks into one model. To this end, we design a transformer-based fusion architecture that effectively combines 2D-pose features and spatio-temporal features: the model uses 2D-pose features as the positional embedding of the transformer and spatio-temporal features as the main input to the transformer encoder. The proposed solution is generic and independent of camera number and placement, producing frame-based class probabilities as output. Finally, a post-processing step combines information from different camera views to obtain final predictions and eliminate false positives. The model performs well on the A2 test set of the 2023 NVIDIA AI City Challenge for naturalistic driving action recognition, achieving an overlap score of 0.5079 on the organizer-defined distracted-driver-behaviour metric.
https://arxiv.org/abs/2403.06577
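The fusion described above — 2D-pose features serving as the positional embedding added to spatio-temporal features before a transformer encoder — might look roughly like this; feature sizes, depth, and the per-frame head are assumptions.

    import torch
    import torch.nn as nn

    class PosePositionalFusion(nn.Module):
        def __init__(self, video_dim=1024, pose_dim=128, dim=256, num_classes=16):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, dim)
            self.pose_proj = nn.Linear(pose_dim, dim)   # pose acts as the positional embedding
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, video_feats, pose_feats):
            # video_feats: (B, T, video_dim), pose_feats: (B, T, pose_dim)
            x = self.video_proj(video_feats) + self.pose_proj(pose_feats)
            return self.head(self.encoder(x))           # (B, T, num_classes) per-frame scores

    logits = PosePositionalFusion()(torch.randn(2, 30, 1024), torch.randn(2, 30, 128))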
Data replay is a successful incremental learning technique for images. It prevents catastrophic forgetting by keeping a reservoir of previous data, original or synthesized, to ensure the model retains past knowledge while adapting to novel concepts. However, its application in the video domain is rudimentary, as it simply stores frame exemplars for action recognition. This paper presents the first exploration of video data replay techniques for incremental action segmentation, focusing on action temporal modeling. We propose a Temporally Coherent Action (TCA) model, which represents actions using a generative model instead of storing individual frames. The integration of a conditioning variable that captures temporal coherence allows our model to understand the evolution of action features over time. Therefore, action segments generated by TCA for replay are diverse and temporally coherent. In a 10-task incremental setup on the Breakfast dataset, our approach achieves accuracy gains of up to 22% over the baselines.
https://arxiv.org/abs/2403.06102
Micro-action is an imperceptible non-verbal behaviour characterised by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, the identification, differentiation, and understanding of micro-actions pose challenges due to the imperceptible and inaccessible nature of these subtle human behaviours in everyday life. In this study, we collect a new micro-action dataset designated as Micro-action-52 (MA-52) and propose a benchmark named micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides a whole-body perspective including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels and encompasses a full array of realistic and natural micro-actions, covering 205 participants and 22,422 video instances collated from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and temporal shift module (TSM) blocks into the ResNet architecture to model the spatio-temporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between video and action labels; the loss is used to better distinguish visually similar yet distinct micro-action categories. An extended application to emotion recognition demonstrates one of the important values of the proposed dataset and method. In the future, human behaviour, emotion, and psychological assessment will be explored in greater depth. The dataset and source code are released at this https URL.
https://arxiv.org/abs/2403.05234
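The joint-embedding loss mentioned above can be read as a matching objective between clip embeddings and label-text embeddings; below is a generic InfoNCE-flavoured instantiation under assumed shapes, not necessarily the exact loss used in MANet.

    import torch
    import torch.nn.functional as F

    def joint_embedding_loss(video_emb, label_emb, labels, temperature=0.07):
        """video_emb: (B, D) clip embeddings, label_emb: (K, D) one embedding per
        micro-action category, labels: (B,) class indices."""
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(label_emb, dim=-1)
        logits = v @ t.t() / temperature           # (B, K) similarities
        return F.cross_entropy(logits, labels)     # pull each clip toward its label text

    loss = joint_embedding_loss(torch.randn(8, 512), torch.randn(52, 512),
                                torch.randint(0, 52, (8,)))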
Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of < subject, predicate, object> tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
https://arxiv.org/abs/2403.03994
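Since the abstract leans on a sparsely-gated mixture of experts, here is a compact generic top-k MoE layer showing why conditional computation adds capacity without a matching FLOP increase; expert count, width, and k are assumptions, and the real MoE-VRD experts are full relationship-detection models rather than small MLPs.

    import torch
    import torch.nn as nn

    class SparseMoE(nn.Module):
        """Minimal top-k sparsely-gated MoE: only k experts run per input."""
        def __init__(self, dim=256, n_experts=8, k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                for _ in range(n_experts))
            self.gate = nn.Linear(dim, n_experts)
            self.k = k

        def forward(self, x):                          # x: (B, dim)
            topv, topi = self.gate(x).topk(self.k, dim=-1)
            weights = torch.softmax(topv, dim=-1)      # renormalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    sel = topi[:, slot] == e
                    if sel.any():
                        out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
            return out

    y = SparseMoE()(torch.randn(4, 256))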
Motion segmentation is a fundamental problem in computer vision and is crucial in various applications such as robotics, autonomous driving, and action recognition. Recently, spectral clustering based methods have shown impressive results on motion segmentation in dynamic environments. These methods perform spectral clustering on motion affinity matrices to cluster objects or point trajectories in the scene into different motion groups. However, existing methods often need the number of motions present in the scene to be known, which significantly reduces their practicality. In this paper, we propose a unified model selection technique that automatically infers the number of motion groups for spectral clustering based motion segmentation methods by combining different existing model selection techniques. We evaluate our method on the KT3DMoSeg dataset and achieve competitive results compared to a baseline where the number of clusters is given as ground-truth information.
https://arxiv.org/abs/2403.01606
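As background for the model-selection problem above, the classical eigen-gap heuristic is one of the criteria such a unified scheme can combine; the snippet below is that generic heuristic, not the paper's combined method.

    import numpy as np

    def eigengap_num_clusters(affinity, max_k=10):
        """affinity: (N, N) symmetric motion affinity matrix.
        Returns the cluster count suggested by the largest gap among the
        smallest eigenvalues of the normalized graph Laplacian."""
        d = affinity.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
        eigvals = np.sort(np.linalg.eigvalsh(lap))        # ascending
        gaps = np.diff(eigvals[: max_k + 1])
        return int(np.argmax(gaps)) + 1

    A = np.random.rand(50, 50); A = (A + A.T) / 2          # toy symmetric affinity
    print(eigengap_num_clusters(A))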
Contrastive Language-Image Pretraining (CLIP) has shown remarkable open-vocabulary abilities across various image understanding tasks. Building upon this impressive success, recent pioneer works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. Our evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. To address this task, our work focuses on a critical challenge, namely scene bias, and we accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to distinguish video representations apart from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experimental results demonstrate the effectiveness of our method. The benchmark and code will be available at this https URL.
https://arxiv.org/abs/2403.01560
Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called \textit{Structured Point Cloud Videos} (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences. Code will be released at this https URL.
https://arxiv.org/abs/2403.01129
Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers significantly suffer from a quadratic complexity of the multi-head attention with the input sequence length, especially as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to each attention head, LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.
https://arxiv.org/abs/2402.15096
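A hedged illustration of head-specific multimodal attention patterns: each head is restricted by its own mask (for example, some heads attend within a modality and others across modalities), which caps the quadratic attention cost. The mask layout and token split are assumptions, not LoCoMT's actual patterns.

    import torch

    def per_head_masked_attention(q, k, v, head_masks):
        """q, k, v: (B, H, N, D); head_masks: (H, N, N) boolean, True = may attend."""
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)       # (B, H, N, N)
        scores = scores.masked_fill(~head_masks.unsqueeze(0), float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    # toy setup: audio tokens occupy [0:64), video tokens occupy [64:128)
    N, H = 128, 4
    masks = torch.zeros(H, N, N, dtype=torch.bool)
    masks[:2, :64, :64] = True; masks[:2, 64:, 64:] = True    # heads 0-1: intra-modality
    masks[2:, :64, 64:] = True; masks[2:, 64:, :64] = True    # heads 2-3: cross-modality
    q = k = v = torch.randn(2, H, N, 32)
    out = per_head_masked_attention(q, k, v, masks)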
Few-shot action recognition aims at quickly adapting a pre-trained model to the novel data with a distribution shift using only a limited number of samples. Key challenges include how to identify and leverage the transferable knowledge learned by the pre-trained model. Our central hypothesis is that temporal invariance in the dynamic system between latent variables lends itself to transferability (domain-invariance). We therefore propose DITeD, or Domain-Invariant Temporal Dynamics for knowledge transfer. To detect the temporal invariance part, we propose a generative framework with a two-stage training strategy during pre-training. Specifically, we explicitly model invariant dynamics including temporal dynamic generation and transitions, and the variant visual and domain encoders. Then we pre-train the model with the self-supervised signals to learn the representation. After that, we fix the whole representation model and tune the classifier. During adaptation, we fix the transferable temporal dynamics and update the image encoder. The efficacy of our approach is revealed by the superior accuracy of DITeD over leading alternatives across standard few-shot action recognition datasets. Moreover, we validate that the learned temporal dynamic transition and temporal dynamic generation modules possess transferable qualities.
https://arxiv.org/abs/2402.12706