Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The study results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance, and that multimodal integration using cross-modal and early fusion techniques outperforms baselines that use unimodal features or late fusion. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding of co-speech gestures and improved methods for detecting them, facilitating the analysis of multimodal communication.
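The paper's implementation is not reproduced here, but the early-fusion variant can be sketched minimally as below; all module names, feature dimensions, and the frame-level binary head are illustrative assumptions rather than the authors' architecture. Per-frame speech and skeletal features from separate backbones are projected, concatenated, and jointly encoded by a Transformer:

```python
import torch
import torch.nn as nn

class EarlyFusionGestureDetector(nn.Module):
    """Illustrative early-fusion detector: speech and skeleton features are
    projected to a shared width, concatenated per frame, and encoded jointly."""
    def __init__(self, speech_dim=768, pose_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.pose_proj = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(2 * d_model, 2)  # gesture vs. no-gesture per frame

    def forward(self, speech_feats, pose_feats):
        # speech_feats: (B, T, speech_dim), resampled to the visual frame rate,
        # possibly drawn from a wider speech buffer than the visual segment.
        # pose_feats:   (B, T, pose_dim), skeletal keypoints per video frame.
        fused = torch.cat([self.speech_proj(speech_feats),
                           self.pose_proj(pose_feats)], dim=-1)
        return self.head(self.encoder(fused))  # (B, T, 2) frame-level logits

logits = EarlyFusionGestureDetector()(torch.randn(2, 50, 768),
                                      torch.randn(2, 50, 128))
```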
https://arxiv.org/abs/2404.14952
Millimeter wave radar is gaining traction recently as a promising modality for enabling pervasive and privacy-preserving gesture recognition. However, the lack of rich and fine-grained radar datasets hinders progress in developing generalized deep learning models for gesture recognition across various user postures (e.g., standing, sitting), positions, and scenes. To remedy this, we design a software pipeline that exploits abundant 2D videos to generate realistic radar data, which must address the challenge of simulating the diverse and fine-grained reflection properties of user gestures. To this end, we design G3R with three key components: (i) a gesture reflection point generator that expands the arm's skeleton points to form human reflection points; (ii) a signal simulation model that simulates the multipath reflection and attenuation of radar signals to output the human intensity map; (iii) an encoder-decoder model that combines a sampling module and a fitting module to address the differences in the number and distribution of points between generated and real-world radar data. We implement and evaluate G3R using 2D videos from public data sources and self-collected real-world radar data, demonstrating its superiority over other state-of-the-art approaches for gesture recognition.
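As a rough illustration of the signal-simulation idea, the toy sketch below renders a human intensity map under the standard monostatic radar-equation assumption that returned power decays with the fourth power of range, and densifies an "arm" by interpolating reflection points along a bone, loosely in the spirit of the reflection point generator. The geometry, projection, and constants are invented for illustration and do not reflect G3R's actual model:

```python
import numpy as np

def intensity_map(points, reflectivity, grid=(64, 64), extent=2.0):
    """Toy human intensity map: each reflection point contributes power that
    decays with the fourth power of range (monostatic radar equation).
    points: (N, 3) in metres; reflectivity: (N,) per-point RCS-like weights."""
    img = np.zeros(grid)
    for (x, y, z), sigma in zip(points, reflectivity):
        r = np.sqrt(x**2 + y**2 + z**2) + 1e-6
        u = int((x / extent + 0.5) * (grid[0] - 1))  # crude image-plane projection
        v = int((z / extent + 0.5) * (grid[1] - 1))
        if 0 <= u < grid[0] and 0 <= v < grid[1]:
            img[v, u] += sigma / r**4                # path-loss weighted return
    return img / (img.max() + 1e-12)

# Densify an "arm": interpolate extra reflection points between two joints.
elbow, wrist = np.array([0.2, 1.0, 0.9]), np.array([0.4, 0.8, 1.1])
arm = np.linspace(elbow, wrist, 20)                  # 20 points along the bone
print(intensity_map(arm, np.ones(len(arm))).shape)
```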
https://arxiv.org/abs/2404.14934
Current electromyography (EMG) pattern recognition (PR) models have been shown to generalize poorly in unconstrained environments, setting back their adoption in applications such as hand gesture control. This problem is often due to limited training data, exacerbated by the use of supervised classification frameworks that are known to be suboptimal in such settings. In this work, we propose a shift to deep metric-based meta-learning in EMG PR to supervise the creation of meaningful and interpretable representations. We use a Siamese Deep Convolutional Neural Network (SDCNN) and contrastive triplet loss to learn an EMG feature embedding space that captures the distribution of the different classes. A nearest-centroid approach is subsequently employed for inference, relying on how closely a test sample aligns with the established data distributions. We derive a robust class proximity-based confidence estimator that leads to a better rejection of incorrect decisions, i.e. false positives, especially when operating beyond the training data domain. We show our approach's efficacy by testing the trained SDCNN's predictions and confidence estimations on unseen data, both in and out of the training domain. The evaluation metrics include the accuracy-rejection curve and the Kullback-Leibler divergence between the confidence distributions of accurate and inaccurate predictions. Outperforming comparable models on both metrics, our results demonstrate that the proposed meta-learning approach improves the classifier's precision in active decisions (after rejection), thus leading to better generalization and applicability.
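A minimal sketch of the nearest-centroid inference stage is shown below; the softmax-over-negative-distances confidence is an illustrative stand-in for the paper's class proximity-based estimator, and the embedding dimensions are assumed:

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Class centroids in the learned EMG embedding space."""
    classes = np.unique(labels)
    return classes, np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

def predict_with_confidence(x, classes, centroids):
    """Nearest-centroid prediction plus a proximity-based confidence score."""
    d = np.linalg.norm(centroids - x, axis=1)   # distance to each class centroid
    conf = np.exp(-d) / np.exp(-d).sum()        # closer centroid -> higher confidence
    i = d.argmin()
    return classes[i], conf[i]

# Reject low-confidence decisions instead of acting on likely false positives.
emb = np.random.randn(100, 16)
y = np.random.randint(0, 5, 100)
cls, cen = fit_centroids(emb, y)
label, conf = predict_with_confidence(np.random.randn(16), cls, cen)
if conf < 0.5:
    label = None  # rejected decision
```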
https://arxiv.org/abs/2404.15360
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D interactive scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios, with learnable denoising that uses sEMG's intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM), which can be easily integrated with various models. STEM offers several benefits: 1) learnable denoising, enabling noise reduction without manual data augmentation; 2) scalability, adaptable to various models; and 3) cost-effectiveness, achieving short-term enhancement through minimal weight-sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short Term Enhanced Transformer (STET). Compared with the best-competing approaches, the impact of noise on STET is reduced by more than 20%. We also report promising results on both classification and regression datasets and demonstrate that STEM generalizes across different gesture recognition tasks.
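The sliding-window attention idea can be pictured as ordinary self-attention with a banded mask, so each timestep is reconstructed only from its short-term context. This toy single-head numpy version is an assumption-laden illustration, not the STEM implementation:

```python
import numpy as np

def sliding_window_attention(x, w=8):
    """Toy single-head self-attention restricted to a local window of w frames
    on each side, so each sEMG timestep is denoised from short-term context."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # (T, T) similarity
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) > w  # True outside the window
    scores[mask] = -np.inf                          # attend only locally
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                              # short-term enhanced signal

out = sliding_window_attention(np.random.randn(200, 32), w=8)
```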
https://arxiv.org/abs/2404.11213
The Indian classical dance-drama Kathakali has a set of hand gestures called Mudras, which form the fundamental units of all its dance moves and postures. Recognizing the depicted mudra thus becomes one of the first steps in its digital processing. This work treats the problem as a 24-class classification task and proposes a vector-similarity-based approach using pose estimation, eliminating the need for further training or fine-tuning. This approach overcomes the challenge of data scarcity that limits the application of AI in similar domains. The method attains 92% accuracy, similar to or better than other model-training-based works in the domain, with the added advantage that it can still work with as few as 1 or 5 samples at slightly reduced performance. It can operate on images, videos, and even real-time streams, and handles hand-cropped and full-body images alike. We have developed and made public a dataset for Kathakali Mudra Recognition as part of this work.
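A minimal sketch of such a vector-similarity classifier is given below, assuming 21 2D hand keypoints from an off-the-shelf pose estimator and one stored template vector per class; the normalization and the tiny template set are illustrative choices, not the paper's exact pipeline:

```python
import numpy as np

def pose_vector(keypoints):
    """Flatten and normalise 2D hand keypoints (e.g., 21 x 2 from a pose
    estimator) so the vector is translation- and scale-invariant."""
    k = keypoints - keypoints.mean(axis=0)       # centre on the hand
    k = k / (np.linalg.norm(k) + 1e-8)           # unit scale
    return k.ravel()

def classify_mudra(query, templates):
    """Nearest template by cosine similarity; `templates` maps each of the
    24 mudra classes to one (or a few) reference pose vectors."""
    best, best_sim = None, -1.0
    for name, refs in templates.items():
        for ref in refs:
            sim = float(query @ ref)             # vectors are unit-norm
            if sim > best_sim:
                best, best_sim = name, sim
    return best, best_sim

templates = {"pataka": [pose_vector(np.random.rand(21, 2))],
             "mudrakhya": [pose_vector(np.random.rand(21, 2))]}
print(classify_mudra(pose_vector(np.random.rand(21, 2)), templates))
```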
https://arxiv.org/abs/2404.11205
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as our source code publicly.
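The multi-modal contrastive objective can be illustrated with a standard symmetric InfoNCE loss over paired modality embeddings; whether ConCluGen uses exactly this formulation is an assumption, since the abstract does not specify it:

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(za, zv, temperature=0.1):
    """InfoNCE-style loss pulling audio and visual embeddings of the same
    clip together and pushing apart embeddings from different clips.
    za, zv: (B, D) batches of paired modality embeddings."""
    za, zv = F.normalize(za, dim=1), F.normalize(zv, dim=1)
    logits = za @ zv.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(za.size(0))                # positives on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2 # symmetric in modalities

loss = multimodal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```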
https://arxiv.org/abs/2404.10904
Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction, where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located at ultra-range requires exhaustive collection of a significant amount of labeled samples. The generation of synthetic training datasets is a recent solution to the lack of real-world data, yet it fails to properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework, based on a diffusion model, to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model with directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models, showcasing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-based URGR model is also demonstrated in gesture-based directing of a ground robot.
https://arxiv.org/abs/2404.09846
Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs) and present a visual language model (VLM) that can fulfill diverse tasks on mobile devices. Our model functions by interacting solely with the user interface (UI). It uses the visual input from the device screen and mimics human-like interactions, encompassing gestures such as tapping and swiping. This generality in the input and output space allows our agent to interact with any application on the device. Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots along with corresponding actions. Evaluating our method on the challenging Android in the Wild benchmark demonstrates its promising efficacy and potential.
https://arxiv.org/abs/2404.08755
Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, such as hand gestures, body poses, and facial expressions, and combining them with environment states and user verbal cues captured by an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.
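As a sketch of how such cues might reach an LLM, the snippet below assembles a single prompt from non-verbal cue detections, the environment state, and the ASR transcript; the schema and wording are invented for illustration and are not the authors' actual prompts:

```python
def build_intention_prompt(cues, scene, utterance):
    """Assemble an LLM prompt combining detected non-verbal cues, environment
    state, and the ASR transcript (field names are hypothetical)."""
    cue_lines = "\n".join(f"- {modality}: {value}" for modality, value in cues.items())
    return (
        "You are assisting a robot in a collaborative object-categorization task.\n"
        f"Scene state: {scene}\n"
        f"Observed non-verbal cues:\n{cue_lines}\n"
        f"User said (ASR): \"{utterance}\"\n"
        "Infer the user's most likely intention and answer with one action."
    )

prompt = build_intention_prompt(
    cues={"hand gesture": "pointing at red block", "gaze": "toward robot"},
    scene="red block and blue cup on table; robot gripper free",
    utterance="put that one with the others",
)
```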
https://arxiv.org/abs/2404.08424
Large-scale face recognition datasets are collected by crawling the Internet without individuals' consent, raising legal, ethical, and privacy concerns. With recent advances in generative models, several works have proposed generating synthetic face recognition datasets to mitigate the concerns in web-crawled face recognition datasets. This paper presents a summary of the Synthetic Data for Face Recognition (SDFR) Competition, held in conjunction with the 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024) and established to investigate the use of synthetic data for training face recognition models. The SDFR competition was split into two tasks, allowing participants to train face recognition systems using new synthetic datasets and/or existing ones. In the first task, the face recognition backbone was fixed and the dataset size was limited, while the second task provided almost complete freedom regarding the model backbone, the dataset, and the training pipeline. The submitted models were trained on existing as well as new synthetic datasets and used clever methods to improve training with synthetic data. The submissions were evaluated and ranked on a diverse set of seven benchmarking datasets. The paper gives an overview of the submitted face recognition models and reports the achieved performance compared to baseline models trained on real and synthetic datasets. Furthermore, the evaluation of submissions is extended to bias assessment across different demographic groups. Lastly, an outlook on the current state of research in training face recognition models using synthetic data is presented, and existing problems as well as potential future directions are discussed.
https://arxiv.org/abs/2404.04580
Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. As with CNNs, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform-, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single- and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations in the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting the ARM and RISC-V ISAs, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to the SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention reduces the runtime by 1.53x and the number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14 ms and an energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform.
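The fused-weight idea can be read as a simple algebraic identity: the attention scores (X·Wq)(X·Wk)^T equal X·(Wq·Wk^T)·X^T, so the two projection matrices can be multiplied offline into one. The single-head numpy sketch below only verifies that identity; the per-head dimensions and the parameter accounting of the actual library are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, dk = 16, 64, 32                  # sequence length, model width, head width
X = rng.standard_normal((T, d))
Wq = rng.standard_normal((d, dk))
Wk = rng.standard_normal((d, dk))

# Standard MHSA scores: two projections performed at inference time.
scores = (X @ Wq) @ (X @ Wk).T

# Fused-weight variant: multiply the projections offline, once.
W_fused = Wq @ Wk.T                    # (d, d), computed ahead of deployment
scores_fused = X @ W_fused @ X.T       # one projection fewer per inference

assert np.allclose(scores, scores_fused)
```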
https://arxiv.org/abs/2404.02945
Skeleton-based gesture recognition methods have achieved high success using Graph Convolutional Networks (GCNs). In addition, context-dependent adaptive topology as neighborhood vertex information, together with an attention mechanism, enables a model to better represent actions. In this paper, we propose a self-attention GCN hybrid model, Multi-Scale Spatial-Temporal self-attention GCN (MSST-GCN), to effectively improve modeling ability and achieve state-of-the-art results on several datasets. We utilize a spatial self-attention module with adaptive topology to understand intra-frame interactions among different body parts, and a temporal self-attention module to examine correlations between frames of a node. These two modules are followed by a multi-scale convolution network with dilations, which captures not only the long-range temporal dependencies of joints but also the long-range spatial dependencies (i.e., long-distance dependencies) of node temporal behaviors. They are combined into high-level spatial-temporal representations, and the predicted action is output with a softmax classifier.
https://arxiv.org/abs/2404.02624
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements with crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech and perform generation in the latent motion space, followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion- and video-related evaluations. Our code, demos, and more resources are available at this https URL.
https://arxiv.org/abs/2404.01862
As human-robot collaboration is becoming more widespread, there is a need for a more natural way of communicating with the robot. This includes combining data from several modalities together with the context of the situation and background knowledge. Current approaches to communication typically rely only on a single modality or are often very rigid and not robust to missing, misaligned, or noisy data. In this paper, we propose a novel method that takes inspiration from sensor fusion approaches to combine uncertain information from multiple modalities and enhance it with situational awareness (e.g., considering object properties or the scene setup). We first evaluate the proposed solution on simulated bimodal datasets (gestures and language) and show through several ablation experiments the importance of various components of the system and its robustness to noisy, missing, or misaligned observations. Then we implement and evaluate the model on a real setup. In human-robot interaction, we must also consider whether the selected action is probable enough to be executed or whether it is better to query the human for clarification. For these purposes, we enhance our model with adaptive entropy-based thresholding that detects the appropriate thresholds for different types of interaction, showing performance similar to fine-tuned fixed thresholds.
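The entropy-based decision rule can be sketched as follows: compute the entropy of the fused action distribution and execute only when it falls below a threshold, otherwise query the human. The threshold here is a fixed placeholder; how the paper adapts it per interaction type is not reproduced:

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def act_or_ask(action_probs, threshold):
    """Execute the most probable action when the fused distribution is
    confident (low entropy); otherwise query the human for clarification."""
    if entropy(action_probs) <= threshold:
        return ("execute", int(np.argmax(action_probs)))
    return ("ask_human", None)

fused = np.array([0.70, 0.15, 0.10, 0.05])   # e.g., gestures + language merged
print(act_or_ask(fused, threshold=1.0))       # low entropy  -> execute
print(act_or_ask(np.full(4, 0.25), 1.0))      # max entropy  -> ask
```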
https://arxiv.org/abs/2404.01702
For decades, robotics researchers have pursued various tasks for multi-robot systems, from cooperative manipulation to search and rescue. These tasks are multi-robot extensions of classical robotic tasks and are often optimized along dimensions such as speed or efficiency. As robots transition from commercial and research settings into everyday environments, social task aims such as engagement or entertainment become increasingly relevant. This work presents a compelling multi-robot task whose main aim is to enthrall and interest: the goal is for a human to be drawn to move alongside and participate in a dynamic, expressive robot flock. Towards this aim, the research team created algorithms for robot movements and engaging interaction modes such as gestures and sound. The contributions are as follows: (1) a novel group navigation algorithm involving human and robot agents, (2) a gesture-responsive algorithm for real-time, human-robot flocking interaction, (3) a weight mode characterization system for modifying flocking behavior, and (4) a method of encoding a choreographer's preferences inside a dynamic, adaptive, learned system. An experiment was performed to understand individual human behavior while interacting with the flock under three conditions: weight modes selected by a human choreographer, a learned model, or a subset list. Results from the experiment showed that the perception of the experience was not influenced by the weight mode selection. This work elucidates how differing task aims such as engagement manifest in multi-robot system design and execution, and broadens the domain of multi-robot tasks.
https://arxiv.org/abs/2404.00442
Worker-robot cooperation is a new industrial trend that aims to combine the advantages of the human and the industrial robot to afford new intelligent manufacturing techniques. Cooperative manufacturing between the worker and the robot involves other elements, such as the product parts and the manufacturing tools. All these production elements must cooperate in one manufacturing workcell to fulfill the production requirements. The manufacturing control system is the means of connecting all these cooperative elements in one body. This control system is distributed and autonomous due to the nature of the cooperative workcell. Accordingly, this article proposes a holonic control architecture as the manufacturing concept for the cooperative workcell. Furthermore, the article focuses on the feasibility of this manufacturing concept by applying it to a case study involving cooperation between a dual-arm robot and a worker. In this case study, the worker uses a variety of hand gestures to cooperate with the robot to achieve the highest production flexibility.
https://arxiv.org/abs/2404.00369
This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars, focusing on two key aspects: variability and coordination. Variability allows the avatar to exhibit a wide range of motions even with similar speech content, while coordination ensures a harmonious alignment among facial expressions, hand gestures, and body poses. We aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements in speech. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs. First, we introduce product quantization (PQ) to the VAE, which enriches the representation of complex holistic motion. Second, we devise a novel non-autoregressive model that embeds 2D positional encoding into the product-quantized representation, thereby preserving essential structure information of the PQ codes. Last, we employ a secondary stage to refine the preliminary prediction, further sharpening the high-frequency details. Coupling these three designs enables ProbTalk to generate natural and diverse holistic co-speech motions, outperforming several state-of-the-art methods in qualitative and quantitative evaluations, particularly in terms of realism. Our code and model will be released for research purposes at this https URL.
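Product quantization itself is standard and can be sketched compactly: the latent vector is split into sub-vectors, each snapped to its nearest entry in a small per-subspace codebook, so a few indices jointly address a combinatorially large set of motion codes. The codebook sizes below are illustrative, not ProbTalk's configuration:

```python
import numpy as np

def product_quantize(z, codebooks):
    """Split a latent vector into sub-vectors and snap each to its nearest
    codebook entry; returns the discrete codes and the quantized vector."""
    n_sub = len(codebooks)
    subs = z.reshape(n_sub, -1)                         # (n_sub, d_sub)
    codes, quantized = [], []
    for sub, book in zip(subs, codebooks):              # book: (K, d_sub)
        idx = np.linalg.norm(book - sub, axis=1).argmin()
        codes.append(idx)
        quantized.append(book[idx])
    return np.array(codes), np.concatenate(quantized)

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 16)) for _ in range(4)]  # 4 sub-spaces
codes, z_q = product_quantize(rng.standard_normal(64), codebooks)
# 4 indices jointly address 256**4 possible codes using tiny codebooks.
```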
https://arxiv.org/abs/2404.00368
Purpose: Surgical video is an important data stream for gesture recognition; thus, robust visual encoders for these data streams are similarly important. Methods: Leveraging the Bridge-Prompt framework, we fine-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This can utilize extensive outside data such as text, but also makes use of label meta-data and weakly supervised contrastive losses. Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature-extractor training scheme. Conclusion: Bridge-Prompt and similar pre-trained and fine-tuned video encoder models provide significant visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without the need for any task- (gesture-) specific retraining, makes them invaluable.
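The zero-shot mechanism can be pictured in a CLIP-like joint embedding space: a clip embedding is scored against text embeddings of gesture descriptions, including gestures absent from fine-tuning. The prompt template, gesture names, and stand-in encoders below are assumptions so the sketch runs standalone; real use would call the fine-tuned CLIP towers:

```python
import numpy as np

def zero_shot_gesture(video_emb, gesture_names, encode_text):
    """Zero-shot prediction: score a clip embedding against text embeddings
    of gesture descriptions. `encode_text` stands in for the text tower."""
    texts = [f"a surgeon performing {g}" for g in gesture_names]  # prompt template
    T = np.stack([encode_text(t) for t in texts])
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb)
    return gesture_names[int((T @ v).argmax())]

# Stand-in encoders so the sketch runs without model weights.
rng = np.random.default_rng(0)
fake_text_encoder = lambda t: rng.standard_normal(512)
print(zero_shot_gesture(rng.standard_normal(512),
                        ["needle passing", "suture pulling", "knot tying"],
                        fake_text_encoder))
```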
https://arxiv.org/abs/2403.19786
Hand gesture recognition (HGR) based on multimodal data has attracted considerable attention owing to its great potential in applications. Various manually designed multimodal deep networks have performed well in multimodal HGR (MHGR), but most existing algorithms require a great deal of expert experience and time-consuming manual trials. To address these issues, we propose an evolutionary network architecture search framework with adaptive multimodal fusion (AMF-ENAS). Specifically, we design an encoding space that simultaneously considers fusion positions and ratios of the multimodal data, allowing for the automatic construction of multimodal networks with different architectures through decoding. Additionally, we consider three input streams corresponding to intra-modal surface electromyography (sEMG), intra-modal accelerometer (ACC), and inter-modal sEMG-ACC data. To automatically adapt to various datasets, the ENAS framework is designed to automatically search for an MHGR network with appropriate fusion positions and ratios. To the best of our knowledge, this is the first time that ENAS has been utilized in MHGR to tackle issues related to the fusion position and ratio of multimodal data. Experimental results demonstrate that AMF-ENAS achieves state-of-the-art performance on the Ninapro DB2, DB3, and DB7 datasets.
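One way to picture the encoding space is as a genome that decodes into a fusion position and per-stream mixing ratios; the field layout below is a guess at the kind of encoding the abstract describes, not the actual AMF-ENAS scheme:

```python
import random

def decode_genome(genome, n_layers=6):
    """Decode one candidate architecture: where to fuse the sEMG, ACC, and
    sEMG-ACC streams and with what mixing ratios (hypothetical layout)."""
    fusion_layer = genome[0] % n_layers               # fusion position
    ratios = genome[1:4]
    s = max(sum(ratios), 1e-8)
    ratios = [r / s for r in ratios]                  # per-stream fusion ratios
    return {"fuse_at_layer": fusion_layer,
            "stream_weights": dict(zip(["sEMG", "ACC", "sEMG-ACC"], ratios))}

population = [[random.randrange(64)] + [random.random() for _ in range(3)]
              for _ in range(8)]                      # random initial population
configs = [decode_genome(g) for g in population]      # then evaluate, select, mutate
```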
https://arxiv.org/abs/2403.18208
Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures, which align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (e.g., audio vs. text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained either for generating monologue gestures or even conversational gestures. To further advance research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate the effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video on our website.
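One plausible reading of the two guidance objectives is a classifier-free-guidance-style combination with separate weights per conditioning modality; the sketch below is an assumption in that spirit, since the abstract does not give the exact form:

```python
import numpy as np

def guided_denoise(eps_uncond, eps_audio, eps_text, w_audio=1.5, w_text=1.5):
    """Combine per-modality noise predictions: each weight scales how strongly
    that conditioning modality steers the denoising step (hypothetical form)."""
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)   # push toward audio-consistent motion
            + w_text * (eps_text - eps_uncond))    # push toward emphasized words

e = [np.random.randn(75, 64) for _ in range(3)]    # (frames, motion dims)
eps = guided_denoise(*e, w_audio=2.0, w_text=0.5)  # emphasize beat alignment
```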
https://arxiv.org/abs/2403.17936