The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate progress in a specific area. Nonetheless, we point out that existing action recognition protocols can yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), covering a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained with both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark for gaining insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: this https URL
https://arxiv.org/abs/2303.13505
Gesture synthesis has gained significant attention as a critical research area, focusing on producing contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of Large Language Models (LLMs), such as GPT. By capitalizing on the strengths of LLMs for text analysis, we design prompts to extract gesture-related information from textual input. Our method entails developing prompt principles that transform gesture generation into an intention classification problem based on GPT, and utilizing a curated gesture library and integration module to produce semantically rich co-speech gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures, offering a new perspective on semantic co-speech gesture generation.
https://arxiv.org/abs/2303.13013
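To make the prompt-based pipeline concrete, here is a minimal sketch of how gesture generation can be recast as LLM intention classification followed by a lookup in a curated gesture library. The prompt wording, intention labels, library entries, and the text_to_gesture helper are illustrative assumptions, not the authors' actual prompts or code.

```python
# Hypothetical sketch: recast gesture generation as intention classification
# with an LLM, then map the predicted intention to a curated gesture clip.
GESTURE_LIBRARY = {
    "emphasis": "raise_right_hand_beat",
    "enumeration": "count_on_fingers",
    "negation": "shake_head_with_palm_out",
    "neutral": "rest_pose",
}

PROMPT_TEMPLATE = (
    "Classify the communicative intention of the sentence below as one of "
    "{labels}. Answer with a single label.\n\nSentence: {sentence}"
)

def classify_intention(sentence, llm):
    """llm is any callable str -> str, e.g. a wrapper around a GPT API."""
    prompt = PROMPT_TEMPLATE.format(labels=", ".join(GESTURE_LIBRARY),
                                    sentence=sentence)
    label = llm(prompt).strip().lower()
    return label if label in GESTURE_LIBRARY else "neutral"

def text_to_gesture(sentence, llm):
    # Integration step: look up a library clip for the predicted intention.
    return GESTURE_LIBRARY[classify_intention(sentence, llm)]

if __name__ == "__main__":
    dummy_llm = lambda prompt: "emphasis"   # stand-in for a real GPT call
    print(text_to_gesture("This point is absolutely crucial.", dummy_llm))
```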
Event-based cameras are inspired by the sparse and asynchronous spike representation of the biological visual system. However, processing the event data requires either using expensive feature descriptors to transform spikes into frames, or using spiking neural networks that are difficult to train. In this work, we propose a neural network architecture based on simple convolution layers integrated with dynamic temporal encoding reservoirs, with low hardware and training costs. The Reservoir-enabled Time Integrated Attention Network (RetinaNet) allows the network to efficiently process asynchronous temporal features, and achieves the highest accuracy reported to date on DVS128 Gesture (99.2%) as well as one of the highest accuracies on the DVS Lip dataset (67.5%) at a much smaller network size. By leveraging the internal dynamics of memristors, asynchronous temporal feature encoding can be implemented at very low hardware cost without preprocessing or dedicated memory and arithmetic units. The use of simple DNN blocks and backpropagation-based training rules further reduces its implementation cost. Code will be publicly available.
https://arxiv.org/abs/2303.10770
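As a rough illustration of reservoir-style temporal encoding, the sketch below integrates an asynchronous event stream into a 2-D state that decays exponentially between events, loosely emulating what memristor dynamics provide in hardware. The resolution, decay constant, and event format are assumptions; the paper's actual encoding and network are different and hardware-oriented.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): integrate an
# asynchronous event stream into a 2-D feature map with a leaky "reservoir"
# whose state decays exponentially between events.

H, W = 128, 128          # DVS128-like sensor resolution (assumed)
TAU = 25e-3              # assumed decay time constant in seconds

def encode_events(events, t_end):
    """events: iterable of (t, x, y, polarity) with t in seconds, sorted by t."""
    state = np.zeros((2, H, W), dtype=np.float32)   # one channel per polarity
    last_t = 0.0
    for t, x, y, p in events:
        state *= np.exp(-(t - last_t) / TAU)        # leak since previous event
        state[int(p > 0), y, x] += 1.0              # integrate the new spike
        last_t = t
    state *= np.exp(-(t_end - last_t) / TAU)        # leak up to the read-out time
    return state                                    # this map would feed a small CNN

if __name__ == "__main__":
    demo = [(0.001, 10, 20, 1), (0.004, 10, 21, 0), (0.030, 64, 64, 1)]
    print(encode_events(demo, t_end=0.05).sum())
```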
Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real-world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts.
https://arxiv.org/abs/2303.10280
Despite their potential, markerless hand tracking technologies are not yet applied in practice to the diagnosis or monitoring of the activity in inflammatory musculoskeletal diseases. One reason is that the focus of most methods lies in the reconstruction of coarse, plausible poses for gesture recognition or AR/VR applications, whereas in the clinical context, accurate, interpretable, and reliable results are required. Therefore, we propose ShaRPy, the first RGB-D Shape Reconstruction and hand Pose tracking system, which provides uncertainty estimates of the computed pose to guide clinical decision-making. Our method requires only a light-weight setup with a single consumer-level RGB-D camera, yet it is able to distinguish similar poses with only small joint angle deviations. This is achieved by combining a data-driven dense correspondence predictor with traditional energy minimization, optimizing for both pose and hand shape parameters. We evaluate ShaRPy on a keypoint detection benchmark and show qualitative results on recordings of a patient.
https://arxiv.org/abs/2303.10042
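The following toy sketch shows one way to combine a forward model with energy minimization and report a per-parameter uncertainty proxy, in the spirit of the pose-plus-uncertainty output described above. The planar two-joint finger, bone lengths, optimizer settings, and the Hessian-based uncertainty estimate are all assumptions for illustration; they are not ShaRPy's model.

```python
import torch

# Toy sketch: fit two joint angles of a planar finger model to observed
# keypoints by minimizing a least-squares energy, then report a crude
# per-parameter uncertainty from the curvature (Hessian diagonal) at the optimum.

BONE1, BONE2 = 1.0, 0.8                   # assumed bone lengths

def forward_kinematics(theta):
    j1 = torch.stack([BONE1 * torch.cos(theta[0]), BONE1 * torch.sin(theta[0])])
    j2 = j1 + torch.stack([BONE2 * torch.cos(theta[0] + theta[1]),
                           BONE2 * torch.sin(theta[0] + theta[1])])
    return torch.stack([j1, j2])          # (2 joints, xy)

def fit(observed, iters=500, lr=0.05):
    theta = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        energy = ((forward_kinematics(theta) - observed) ** 2).sum()
        energy.backward()
        opt.step()
    energy_fn = lambda t: ((forward_kinematics(t) - observed) ** 2).sum()
    hess = torch.autograd.functional.hessian(energy_fn, theta.detach())
    sigma = torch.sqrt(1.0 / torch.clamp(torch.diag(hess), min=1e-6))
    return theta.detach(), sigma          # pose estimate and an uncertainty proxy

if __name__ == "__main__":
    target = forward_kinematics(torch.tensor([0.4, 0.6])) + 0.01 * torch.randn(2, 2)
    angles, sigma = fit(target)
    print(angles, sigma)
```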
Visual information is central to conversation: body gestures and facial expressions, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code at this https URL.
https://arxiv.org/abs/2303.09713
Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations. Code is available at this https URL.
https://arxiv.org/abs/2303.09119
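The classifier-free guidance mentioned above can be summarized by the small sketch below, which blends conditional and unconditional noise predictions at each denoising step; the denoiser interface, guidance scale, and dummy model are assumptions rather than the paper's implementation.

```python
import torch

# Minimal illustration of classifier-free guidance for audio-conditioned
# gesture denoising; everything here is a placeholder for the real Transformer.

def cfg_denoise(denoiser, x_t, t, audio_cond, guidance_scale=2.0):
    """Combine conditional and unconditional noise predictions."""
    eps_cond = denoiser(x_t, t, audio_cond)        # conditioned on audio
    eps_uncond = denoiser(x_t, t, None)            # condition dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

if __name__ == "__main__":
    # Dummy denoiser so the sketch runs: zeros unconditionally, a small
    # constant when conditioned.
    def dummy_denoiser(x, t, cond):
        return torch.zeros_like(x) if cond is None else torch.full_like(x, 0.1)

    x = torch.randn(1, 34, 48)                     # (batch, frames, pose dims) assumed
    print(cfg_denoise(dummy_denoiser, x, t=10, audio_cond=torch.randn(1, 64)).mean())
```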
Air-writing refers to virtually writing linguistic characters through hand gestures in three-dimensional space with six degrees of freedom. This paper proposes a generic video camera-aided convolutional neural network (CNN) based air-writing framework. Gestures are performed using a marker of fixed color in front of a generic video camera, followed by color-based segmentation to identify the marker and track the trajectory of the marker tip. A pre-trained CNN is then used to classify the gesture. The recognition accuracy is further improved using transfer learning with the newly acquired data. Because of the color-based segmentation, the performance of the system varies significantly with the illumination conditions. Under less fluctuating illumination, the system is able to recognize isolated unistroke numerals of multiple languages. The proposed framework has achieved 97.7%, 95.4% and 93.7% recognition rates in person-independent evaluations on English, Bengali and Devanagari numerals, respectively.
https://arxiv.org/abs/2303.07989
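A simplified sketch of the marker-tracking front end is given below: per-channel colour thresholds stand in for the paper's colour-based segmentation, the topmost marker pixel approximates the tip, and tip positions accumulated over frames form the air-written trajectory that would then be rendered and fed to the CNN. The colour bounds and tip heuristic are assumptions.

```python
import numpy as np

# Sketch of the trajectory-building step under simplified assumptions.

LOWER = np.array([0, 120, 0])      # assumed RGB bounds for a green marker
UPPER = np.array([80, 255, 80])

def marker_tip(frame_rgb):
    mask = np.all((frame_rgb >= LOWER) & (frame_rgb <= UPPER), axis=-1)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None                 # marker not visible in this frame
    i = np.argmin(ys)               # topmost pixel approximates the marker tip
    return (int(xs[i]), int(ys[i]))

def build_trajectory(frames):
    return [p for p in (marker_tip(f) for f in frames) if p is not None]

if __name__ == "__main__":
    frame = np.zeros((120, 160, 3), dtype=np.uint8)
    frame[30:40, 50:60] = [30, 200, 30]   # synthetic green marker blob
    print(build_trajectory([frame]))      # the rendered trajectory would go to the CNN
```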
Action segmentation is a challenging task in high-level process analysis, typically performed on video or kinematic data obtained from various sensors. In the context of surgical procedures, action segmentation is critical for workflow analysis algorithms. This work presents two contributions related to action segmentation on kinematic data. Firstly, we introduce two multi-stage architectures, MS-TCN-BiLSTM and MS-TCN-BiGRU, specifically designed for kinematic data. The architectures consist of a prediction generator with intra-stage regularization and Bidirectional LSTM or GRU-based refinement stages. Secondly, we propose two new data augmentation techniques, World Frame Rotation and Horizontal-Flip, which utilize the strong geometric structure of kinematic data to improve algorithm performance and robustness. We evaluate our models on three datasets of surgical suturing tasks: the Variable Tissue Simulation (VTS) Dataset and the newly introduced Bowel Repair Simulation (BRS) Dataset, both of which are open surgery simulation datasets collected by us, as well as the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS), a well-known benchmark in robotic surgery. Our methods achieve state-of-the-art performance on all benchmark datasets and establish a strong baseline for the BRS dataset.
https://arxiv.org/abs/2303.07814
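The two geometric augmentations can be pictured with the short sketch below, which rotates all 3-D positions of a trial about the vertical axis (world frame rotation) and mirrors one horizontal axis (horizontal flip). The axis conventions and angle range are assumptions, not the paper's exact settings.

```python
import numpy as np

# Illustrative world-frame rotation and horizontal-flip augmentations for
# kinematic data; the geometric structure of the motion is preserved.

def world_frame_rotation(positions, max_angle_deg=30.0, rng=None):
    """positions: (T, J, 3) array of T frames with J tracked points."""
    rng = rng or np.random.default_rng()
    a = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rot_z = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0,        0.0,       1.0]])
    return positions @ rot_z.T          # same rotation applied to every frame

def horizontal_flip(positions, axis=0):
    flipped = positions.copy()
    flipped[..., axis] *= -1.0          # mirror the chosen horizontal axis
    return flipped

if __name__ == "__main__":
    trial = np.random.randn(100, 4, 3)  # e.g. 100 frames, 4 instrument points (assumed)
    print(world_frame_rotation(trial).shape, horizontal_flip(trial).shape)
```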
Touch is an important channel for human-robot interaction, while it is challenging for robots to recognize human touch accurately and make appropriate responses. In this paper, we design and implement a set of large-format distributed flexible pressure sensors on a robot dog to enable natural human-robot tactile interaction. Through a heuristic study, we sorted out 81 tactile gestures commonly used when humans interact with real dogs, together with 44 dog reactions. A gesture classification algorithm based on ResNet is proposed to recognize these 81 human gestures, and the classification accuracy reaches 98.7%. In addition, an action prediction algorithm based on Transformer is proposed to predict dog actions from human gestures, reaching a 1-gram BLEU score of 0.87. Finally, we compare tactile interaction with voice interaction during a free-form human-robot-dog interactive play study. The results show that tactile interaction plays a more significant role in alleviating user anxiety, stimulating user excitement, and improving the acceptability of robot dogs.
https://arxiv.org/abs/2303.07595
Human-Robot collaboration in home and industrial workspaces is on the rise. However, the communication between robots and humans is a bottleneck. Although people use a combination of different types of gestures to complement speech, only a few robotic systems utilize gestures for communication. In this paper, we propose a gesture pseudo-language and show how multiple types of gestures can be combined to express human intent to a robot (i.e., expressing both the desired action and its parameters - e.g., pointing to an object and showing that the object should be emptied into a bowl). The demonstrated gestures and the perceived table-top scene (object poses detected by CosyPose) are processed in real time to extract the human's intent. We utilize behavior trees to generate reactive robot behavior that handles various possible states of the world (e.g., a drawer has to be opened before an object is placed into it) and recovers from errors (e.g., when the scene changes). Furthermore, our system enables switching between direct teleoperation of the end-effector and high-level operation using the proposed gesture sentences. The system is evaluated on increasingly complex tasks using a real 7-DoF Franka Emika Panda manipulator. Controlling the robot via action gestures lowered the execution time by up to 60%, compared to direct teleoperation.
https://arxiv.org/abs/2303.04451
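To illustrate the kind of reactive behavior-tree logic described above, here is a minimal sequence/fallback tree that opens a drawer before placing an object into it, unless the drawer is already open. The node classes, condition names, and world-state dictionary are invented for illustration; the real system reacts to CosyPose detections and gesture sentences.

```python
# Minimal behavior-tree sketch of reactive task logic.

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, world):
        return all(child.tick(world) for child in self.children)

class Fallback:
    def __init__(self, *children): self.children = children
    def tick(self, world):
        return any(child.tick(world) for child in self.children)

class Condition:
    def __init__(self, key): self.key = key
    def tick(self, world): return world.get(self.key, False)

class Action:
    def __init__(self, name, effect): self.name, self.effect = name, effect
    def tick(self, world):
        print("executing:", self.name)
        world[self.effect] = True
        return True

# "Place object into drawer": open the drawer first unless it is already open.
place_into_drawer = Sequence(
    Fallback(Condition("drawer_open"), Action("open_drawer", "drawer_open")),
    Action("place_object_in_drawer", "object_placed"),
)

if __name__ == "__main__":
    world_state = {"drawer_open": False}
    place_into_drawer.tick(world_state)   # opens the drawer, then places the object
```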
Hand gesture-based human-computer interaction is an important problem that is well explored using color camera data. In this work, we propose a hand gesture detection system using thermal images. Our system is capable of handling multiple hand regions in a frame and processing them fast enough for real-time applications. Our system performs a series of steps, including background subtraction-based hand mask generation, k-means-based hand region identification, hand segmentation to remove the forearm region, and Convolutional Neural Network (CNN) based gesture classification. Our work introduces two novel algorithms, bubble growth and bubble search, for faster hand segmentation. We collected a new thermal image dataset with 10 gestures and report an end-to-end hand gesture recognition accuracy of 97%.
https://arxiv.org/abs/2303.02321
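A simplified sketch of the front end of such a pipeline is shown below: a static background frame is subtracted from the thermal image, the difference is thresholded into a hand mask, and k-means on the foreground pixel coordinates separates multiple hand regions. The threshold, cluster count, and use of scikit-learn's KMeans are assumptions; the paper's bubble growth and bubble search steps are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Simplified thermal front end: background subtraction + k-means region split.

def hand_regions(thermal, background, diff_thresh=8.0, n_hands=2):
    mask = (thermal.astype(np.float32) - background.astype(np.float32)) > diff_thresh
    ys, xs = np.nonzero(mask)
    if len(ys) < n_hands:
        return []
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    labels = KMeans(n_clusters=n_hands, n_init=10).fit_predict(coords)
    # One bounding box per cluster; each crop would go on to forearm removal
    # and then to the CNN gesture classifier.
    boxes = []
    for k in range(n_hands):
        pts = coords[labels == k]
        boxes.append((pts[:, 0].min(), pts[:, 1].min(),
                      pts[:, 0].max(), pts[:, 1].max()))
    return boxes

if __name__ == "__main__":
    bg = np.zeros((120, 160), dtype=np.float32)
    frame = bg.copy()
    frame[20:50, 30:60] = 20.0      # two synthetic warm hand blobs
    frame[70:100, 100:130] = 20.0
    print(hand_regions(frame, bg))
```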
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions that can drive a humanoid robot to interact and communicate with human users. Such a capability will improve human users' impressions of the robots and will find applications in education, training, and medical services. One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance. Deterministic regression methods cannot resolve the conflicting samples and may produce over-smoothed or damped motions. We propose a two-stage model to address this uncertainty issue in gesture synthesis by modeling gesture segments as discrete latent codes. Our method utilizes RQ-VAE in the first stage to learn a discrete codebook consisting of gesture tokens from training data. In the second stage, a two-level autoregressive transformer model is used to learn the prior distribution of residual codes conditioned on the input speech context. Since inference is formulated as token sampling, multiple gesture sequences can be generated for the same speech input using top-k sampling. The quantitative results and the user study show that the proposed method outperforms previous methods and is able to generate realistic and diverse gesture motions.
https://arxiv.org/abs/2303.12822
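The diversity mechanism, top-k token sampling from the autoregressive prior, can be illustrated with the few lines below; the codebook size and the mocked prior output are assumptions.

```python
import torch

# Bare-bones top-k sampling: keep only the k most likely gesture tokens and
# sample among them, so repeated runs on the same speech input differ.

def sample_top_k(logits, k=5, temperature=1.0):
    topk_vals, topk_idx = torch.topk(logits / temperature, k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx.gather(-1, choice).squeeze(-1)

if __name__ == "__main__":
    vocab = 512                                   # assumed codebook size
    logits = torch.randn(1, vocab)                # mock prior output for one step
    tokens = [sample_top_k(logits).item() for _ in range(3)]
    print(tokens)                                 # typically differs run to run
```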
Predicting natural and diverse 3D hand gestures from the upper body dynamics is a practical yet challenging task in virtual avatar creation. Previous works usually overlook the asymmetric motions between two hands and generate two hands in a holistic manner, leading to unnatural results. In this work, we introduce a novel bilateral hand disentanglement based two-stage 3D hand generation method to achieve natural and diverse 3D hand prediction from body dynamics. In the first stage, we intend to generate natural hand gestures via two hand-disentanglement branches. Considering the asymmetric gestures and motions of the two hands, we introduce a Spatial-Residual Memory (SRM) module to model the spatial interaction between the body and each hand by residual learning. To holistically enhance the coordination of the two hand motions with respect to body dynamics, we then present a Temporal-Motion Memory (TMM) module. TMM can effectively model the temporal association between body dynamics and the two hand motions. The second stage is built upon the insight that 3D hand predictions should be non-deterministic given the sequential body postures. Thus, we further diversify our 3D hand predictions based on the initial output from stage one. Concretely, we propose a Prototypical-Memory Sampling Strategy (PSS) to generate the non-deterministic hand gestures by gradient-based Markov Chain Monte Carlo (MCMC) sampling. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on the B2H dataset and our newly collected TED Hands dataset.
https://arxiv.org/abs/2303.01765
Hand gesture detection is a well-explored area in computer vision with applications in various forms of human-computer interaction. In this work, we propose a technique for simultaneous hand gesture classification, handedness detection, and hand keypoint localization using thermal data captured by an infrared camera. Our method uses a novel deep multi-task learning architecture that includes shared encoder-decoder layers followed by three branches, one dedicated to each of these tasks. We performed extensive experimental validation of our model on an in-house dataset consisting of data from 24 users. The results confirm higher than 98 percent accuracy for gesture classification, handedness detection, and fingertip localization, and more than 91 percent accuracy for wrist point localization.
https://arxiv.org/abs/2303.01547
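A hypothetical sketch of a shared encoder-decoder with three task-specific heads (gesture class, handedness, keypoint heatmaps) is given below; the layer sizes, keypoint count, and input resolution are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical shared encoder-decoder with three task heads.

class MultiTaskHandNet(nn.Module):
    def __init__(self, n_gestures=10, n_keypoints=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.gesture_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(16, n_gestures))
        self.handedness_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                             nn.Linear(16, 2))
        self.keypoint_head = nn.Conv2d(16, n_keypoints, 1)   # per-keypoint heatmaps

    def forward(self, thermal):
        shared = self.decoder(self.encoder(thermal))
        return (self.gesture_head(shared), self.handedness_head(shared),
                self.keypoint_head(shared))

if __name__ == "__main__":
    g, h, kp = MultiTaskHandNet()(torch.randn(2, 1, 64, 64))
    print(g.shape, h.shape, kp.shape)   # (2,10) (2,2) (2,6,32,32)
```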
Manual labeling of gestures in robot-assisted surgery is labor intensive, prone to errors, and requires expertise or training. We propose a method for automated and explainable generation of gesture transcripts that leverages the abundance of data for image segmentation to train a surgical scene segmentation model that provides surgical tool and object masks. Surgical context is detected using segmentation masks by examining the distances and intersections between the tools and objects. Next, context labels are translated into gesture transcripts using knowledge-based Finite State Machine (FSM) and data-driven Long Short Term Memory (LSTM) models. We evaluate the performance of each stage of our method by comparing the results with the ground truth segmentation masks, the consensus context labels, and the gesture labels in the JIGSAWS dataset. Our results show that our segmentation models achieve state-of-the-art performance in recognizing needle and thread in Suturing and we can automatically detect important surgical states with high agreement with crowd-sourced labels (e.g., contact between graspers and objects in Suturing). We also find that the FSM models are more robust to poor segmentation and labeling performance than LSTMs. Our proposed method can significantly shorten the gesture labeling process (~2.8 times).
https://arxiv.org/abs/2302.14237
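The knowledge-based FSM stage can be pictured with the toy transition table below, which maps per-frame context labels derived from tool/object masks onto gesture states. The states, context labels, and transitions are invented for illustration; the paper's FSM encodes surgical domain knowledge for the JIGSAWS gesture vocabulary.

```python
# Toy finite-state machine translating context labels into a gesture transcript.

TRANSITIONS = {
    # (current_gesture, observed_context) -> next_gesture
    ("idle", "needle_touching_tissue"): "G2_positioning_needle",
    ("G2_positioning_needle", "needle_through_tissue"): "G3_pushing_needle",
    ("G3_pushing_needle", "grasper_holding_needle"): "G6_pulling_suture",
    ("G6_pulling_suture", "no_contact"): "idle",
}

def contexts_to_gestures(context_sequence, start="idle"):
    state, transcript = start, []
    for ctx in context_sequence:
        state = TRANSITIONS.get((state, ctx), state)   # stay put on unknown input
        transcript.append(state)
    return transcript

if __name__ == "__main__":
    contexts = ["needle_touching_tissue", "needle_through_tissue",
                "grasper_holding_needle", "no_contact"]
    print(contexts_to_gestures(contexts))
```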
Automatic dubbing (AD) is the task of translating the original speech in a video into target language speech. The new target language speech should satisfy isochrony; that is, the new speech should be time aligned with the original video, including mouth movements, pauses, hand gestures, etc. In this paper, we propose training a model that directly optimizes both the translation as well as the speech duration of the generated translations. We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
https://arxiv.org/abs/2302.12979
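One hedged way to picture "directly optimizing both the translation and the speech duration" is a combined training loss like the sketch below, where a duration term penalizes deviation from the source-speech length (the isochrony constraint). The duration predictor interface and the weighting factor are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative joint objective: translation cross-entropy plus an
# isochrony (duration-mismatch) penalty.

def dubbing_loss(token_logits, target_tokens, predicted_duration,
                 source_duration, alpha=0.1, pad_id=0):
    # Standard token-level cross-entropy for translation quality.
    ce = F.cross_entropy(token_logits.transpose(1, 2), target_tokens,
                         ignore_index=pad_id)
    # Penalize relative deviation of the synthesized-speech duration from the
    # original speech duration.
    dur = ((predicted_duration - source_duration).abs() / source_duration).mean()
    return ce + alpha * dur

if __name__ == "__main__":
    B, T, V = 2, 7, 100
    loss = dubbing_loss(torch.randn(B, T, V), torch.randint(1, V, (B, T)),
                        predicted_duration=torch.tensor([2.4, 3.1]),
                        source_duration=torch.tensor([2.5, 3.0]))
    print(loss.item())
```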
This work is about extracting the motion of the fingers of a keyboard player, in their three articulations, from a video sequence. The problem is relevant in several respects: the extracted finger movements can be used to compute keystroke efficiency and individual joint contributions, as shown by Werner Goebl and Caroline Palmer in the paper 'Temporal Control and Hand Movement Efficiency in Skilled Music Performance'. These measures are directly related to precision in timing and force. A strong approach to the hand gesture recognition problem was presented in the paper 'Real-Time Hand Gesture Recognition Using Finger Segmentation'. Detecting which keyboard keys are pressed can be complex because shadows can degrade the quality of the result and possibly cause keys that are not pressed to be detected. Among the many existing approaches, a large number are based on subtracting consecutive frames to detect the key movements caused by their being pressed. Detecting the pressed keys could be useful for automatically evaluating a pianist's performance or for automatically transcribing the melody being played into sheet music.
https://arxiv.org/abs/2303.12697
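The frame-subtraction idea for detecting pressed keys can be sketched as below: difference consecutive grayscale frames, sum the change inside each key's region of interest, and flag keys whose change exceeds a threshold. The key ROIs and threshold are assumptions, and shadows would need additional handling in practice, as the text notes.

```python
import numpy as np

# Frame-differencing sketch for pressed-key detection.

def pressed_keys(prev_frame, curr_frame, key_rois, thresh=500.0):
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    pressed = []
    for key_name, (x0, y0, x1, y1) in key_rois.items():
        if diff[y0:y1, x0:x1].sum() > thresh:     # enough change inside this key
            pressed.append(key_name)
    return pressed

if __name__ == "__main__":
    prev = np.zeros((100, 300), dtype=np.uint8)
    curr = prev.copy()
    curr[60:90, 40:60] = 80                        # simulated key movement
    rois = {"C4": (40, 60, 60, 90), "D4": (65, 60, 85, 90)}   # hypothetical ROIs
    print(pressed_keys(prev, curr, rois))          # -> ['C4']
```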
In this work, we propose a new Dual Min-Max Games (DMMG) based self-supervised skeleton action recognition method by augmenting unlabeled data in a contrastive learning framework. Our DMMG consists of a viewpoint variation min-max game and an edge perturbation min-max game. These two min-max games adopt an adversarial paradigm to perform data augmentation on the skeleton sequences and graph-structured body joints, respectively. Our viewpoint variation min-max game focuses on constructing various hard contrastive pairs by generating skeleton sequences from various viewpoints. These hard contrastive pairs help our model learn representative action features, thus facilitating model transfer to downstream tasks. Moreover, our edge perturbation min-max game specializes in building diverse hard contrastive samples through perturbing connectivity strength among graph-based body joints. The connectivity-strength varying contrastive pairs enable the model to capture minimal sufficient information of different actions, such as representative gestures for an action while preventing the model from overfitting. By fully exploiting the proposed DMMG, we can generate sufficient challenging contrastive pairs and thus achieve discriminative action feature representations from unlabeled skeleton data in a self-supervised manner. Extensive experiments demonstrate that our method achieves superior results under various evaluation protocols on widely-used NTU-RGB+D and NTU120-RGB+D datasets.
https://arxiv.org/abs/2302.12007
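As a rough, non-adversarial stand-in for the edge perturbation game, the sketch below jitters the connectivity strengths of a skeleton graph's adjacency matrix to produce a harder contrastive view. The noise scale and sampling scheme are assumptions; the paper learns the perturbation through a min-max objective rather than random sampling.

```python
import numpy as np

# Connectivity-strength perturbation of a skeleton adjacency matrix.

def perturb_adjacency(adj, noise_scale=0.2, rng=None):
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-noise_scale, noise_scale, size=adj.shape)
    noise = np.triu(noise, k=1)
    noise = noise + noise.T                                   # keep it symmetric
    return np.clip(adj + noise * (adj > 0), 0.0, None)        # only existing edges

if __name__ == "__main__":
    # 4-joint toy skeleton: joints 0-1, 1-2, 2-3 connected with unit strength.
    A = np.zeros((4, 4))
    A[[0, 1, 2], [1, 2, 3]] = 1.0
    A = A + A.T
    print(perturb_adjacency(A))
```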
We present a joint camera and radar approach to enable autonomous vehicles to understand and react to human gestures in everyday traffic. Initially, we process the radar data with a PointNet followed by a spatio-temporal multilayer perceptron (stMLP). Independently, the human body pose is extracted from the camera frame and processed with a separate stMLP network. We propose a fusion neural network for both modalities, including an auxiliary loss for each modality. In our experiments with a collected dataset, we show the advantages of gesture recognition with two modalities. Motivated by adverse weather conditions, we also demonstrate promising performance when one of the sensors lacks functionality.
https://arxiv.org/abs/2302.09998
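The fusion-with-auxiliary-losses design can be summarized by the schematic sketch below: each modality branch gets its own auxiliary classification head, and the fused features feed a main head. The feature dimensions, concatenation-based fusion, and loss weighting are assumptions standing in for the PointNet and stMLP branches described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic two-branch fusion classifier with per-modality auxiliary losses.

class GestureFusionNet(nn.Module):
    def __init__(self, radar_dim=64, pose_dim=51, hidden=128, n_classes=8):
        super().__init__()
        self.radar_branch = nn.Sequential(nn.Linear(radar_dim, hidden), nn.ReLU())
        self.pose_branch = nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU())
        self.radar_aux = nn.Linear(hidden, n_classes)   # auxiliary head, radar only
        self.pose_aux = nn.Linear(hidden, n_classes)    # auxiliary head, camera only
        self.fusion_head = nn.Linear(2 * hidden, n_classes)

    def forward(self, radar_feat, pose_feat):
        r = self.radar_branch(radar_feat)
        p = self.pose_branch(pose_feat)
        fused = self.fusion_head(torch.cat([r, p], dim=-1))
        return fused, self.radar_aux(r), self.pose_aux(p)

def total_loss(outputs, labels, aux_weight=0.3):
    fused, radar_logits, pose_logits = outputs
    return (F.cross_entropy(fused, labels)
            + aux_weight * F.cross_entropy(radar_logits, labels)
            + aux_weight * F.cross_entropy(pose_logits, labels))

if __name__ == "__main__":
    model = GestureFusionNet()
    out = model(torch.randn(4, 64), torch.randn(4, 51))
    print(total_loss(out, torch.randint(0, 8, (4,))).item())
```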