Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) biosignals have been extensively investigated for myoelectric control of prosthetic devices, neurorobotics, and more recently human-computer interfaces because of their capability for hand gesture recognition/prediction in a wearable and non-invasive manner. High intraday (same-day) performance has been reported. However, interday performance (with training and testing on separate days) degrades substantially because conventional approaches generalize poorly over time, hindering the application of such techniques in real-life practice. Recent studies on the feasibility of multi-day hand gesture recognition are limited, and the existing ones face a major challenge: their reliance on long sEMG epochs makes the corresponding neural interfaces impractical because of the delay induced in myoelectric control. This paper proposes a compact ViT-based network for multi-day dynamic hand gesture prediction. We tackle this challenge by relying only on very short HD-sEMG signal windows (50 ms, one-sixth of the conventional window length for real-time myoelectric implementation), boosting agility and responsiveness. The proposed model predicts 11 dynamic gestures for 20 subjects with an average accuracy of over 71% on the testing day, 3-25 days after training. Moreover, when calibrated on just a small portion of data from the testing day, it achieves over 92% accuracy while retraining less than 10% of the parameters, keeping calibration computationally efficient.
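The two key design points here, 50 ms input windows and calibration that retrains only a small fraction of the parameters, can be illustrated with a brief sketch. The following PyTorch outline is a hypothetical approximation (token layout, layer sizes, and all hyper-parameters are illustrative assumptions, not the paper's exact architecture):

```python
# Minimal sketch (assumed layout): 50 ms HD-sEMG windows fed to a compact ViT-style encoder,
# with same-day calibration that retrains only the classification head.
import torch
import torch.nn as nn

def segment_windows(emg, fs, win_ms=50):
    """Split a (channels, samples) recording into non-overlapping win_ms windows."""
    win = int(fs * win_ms / 1000)
    n = emg.shape[1] // win
    return emg[:, : n * win].reshape(emg.shape[0], n, win).permute(1, 0, 2)   # (n_win, C, T)

class CompactViT(nn.Module):
    def __init__(self, win_len=50, d_model=64, n_heads=4, n_layers=2, n_classes=11):
        super().__init__()
        self.embed = nn.Linear(win_len, d_model)                 # one token per electrode channel
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                        # x: (B, C, T)
        tok = torch.cat([self.cls.expand(x.size(0), -1, -1), self.embed(x)], dim=1)
        return self.head(self.encoder(tok)[:, 0])

def calibrate(model, loader, epochs=5, lr=1e-3):
    """Calibration: freeze the backbone, retrain only the small classification head."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```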
https://arxiv.org/abs/2309.12602
Gloss-free Sign Language Production (SLP) offers a direct translation of spoken language sentences into sign language, bypassing the need for gloss intermediaries. This paper presents the Sign language Vector Quantization Network, a novel approach to SLP that leverages Vector Quantization to derive discrete representations from sign pose sequences. Our method, rooted in both manual and non-manual elements of signing, supports advanced decoding methods and integrates latent-level alignment for enhanced linguistic coherence. Through comprehensive evaluations, we demonstrate superior performance of our method over prior SLP methods and highlight the reliability of Back-Translation and Fréchet Gesture Distance as evaluation metrics.
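The vector-quantization step lends itself to a short sketch. Below is a minimal, hypothetical PyTorch version assuming a VQ-VAE-style codebook with a straight-through estimator (codebook size, feature dimension, and the commitment weight are illustrative assumptions, not the paper's exact settings):

```python
# Minimal sketch: quantize continuous pose-sequence features against a learned codebook.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=512, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / n_codes, 1.0 / n_codes)
        self.beta = beta

    def forward(self, z):                                   # z: (B, T, dim) pose-sequence features
        flat = z.reshape(-1, z.size(-1))                    # (B*T, dim)
        d = torch.cdist(flat, self.codebook.weight)         # distance to every codebook entry
        idx = d.argmin(dim=-1).reshape(z.shape[:-1])        # (B, T) discrete sign tokens
        z_q = self.codebook(idx)                            # nearest codebook vectors
        # codebook + commitment losses, straight-through gradient back to the encoder
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```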
https://arxiv.org/abs/2309.12179
Human-Computer Interaction (HCI) has been the subject of research for many years, and recent studies have focused on improving its performance through various techniques. In the past decade, deep learning studies have shown high performance in various research areas, leading researchers to explore their application to HCI. Convolutional neural networks can be used to recognize hand gestures from images using deep architectures. In this study, we evaluated pre-trained high-performance deep architectures on the HG14 dataset, which consists of 14 different hand gesture classes. Among 22 different models, versions of the VGGNet and MobileNet models attained the highest accuracy rates. Specifically, the VGG16 and VGG19 models achieved accuracy rates of 94.64% and 94.36%, respectively, while the MobileNet and MobileNetV2 models achieved accuracy rates of 96.79% and 94.43%, respectively. We performed hand gesture recognition on the dataset using an ensemble learning technique, which combined the four most successful models. By utilizing these models as base learners and applying the Dirichlet ensemble technique, we achieved an accuracy rate of 98.88%. These results demonstrate the effectiveness of the deep ensemble learning technique for HCI and its potential applications in areas such as augmented reality, virtual reality, and game technologies.
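The abstract does not spell out the Dirichlet ensemble; one common reading, sketched below under that assumption, is a random search over Dirichlet-distributed convex weights applied to the four base learners' softmax outputs (the trial count and concentration parameter are illustrative):

```python
# Minimal sketch: select Dirichlet-sampled convex weights for combining base-model softmax outputs.
import numpy as np

def dirichlet_ensemble(val_probs, val_labels, test_probs, n_trials=2000, alpha=1.0, seed=0):
    """val_probs/test_probs: lists of (N, n_classes) softmax arrays, one per base model."""
    rng = np.random.default_rng(seed)
    val = np.stack(val_probs)                                # (M, N, C)
    best_w, best_acc = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(alpha * np.ones(val.shape[0]))     # random convex combination weights
        acc = (np.tensordot(w, val, axes=1).argmax(-1) == val_labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    preds = np.tensordot(best_w, np.stack(test_probs), axes=1).argmax(-1)
    return preds, best_w
```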
https://arxiv.org/abs/2309.11610
Human-robot collaboration enables users to perform interactive tasks more efficiently. Nevertheless, most collaborative schemes rely on complicated human-machine interfaces, which might lack the requisite intuitiveness compared with natural limb control, and human intent should ideally be understood with low training data requirements. In response to these challenges, this paper introduces an innovative human-robot collaborative framework that seamlessly integrates hand gesture and dynamic movement recognition, voice recognition, and a switchable control adaptation strategy. These modules provide a user-friendly approach that enables the robot to deliver tools as the user needs them, especially when the user is working with both hands. Users can therefore focus on task execution, without additional training in the use of human-machine interfaces, while the robot interprets their intuitive gestures. The proposed multimodal interaction framework is deployed on a UR5e robot platform equipped with a RealSense D435i camera, and its effectiveness is assessed through a circuit-board soldering task. The experimental results demonstrate superior hand gesture recognition performance: the static hand gesture recognition module achieves an accuracy of 94.3%, while the dynamic motion recognition module reaches 97.6% accuracy. Compared with solo human manipulation, the proposed approach delivers tools more efficiently without significantly distracting users from their intended task.
https://arxiv.org/abs/2309.11368
In the past decade, there has been significant advancement in designing wearable neural interfaces for controlling neurorobotic systems, particularly bionic limbs. These interfaces function by decoding signals captured non-invasively from the skin's surface. Portable high-density surface electromyography (HD-sEMG) modules combined with deep learning decoding have attracted interest by achieving excellent gesture prediction and myoelectric control of prosthetic systems and neurorobots. However, factors like pixel-shaped electrode size and unstable skin contact make HD-sEMG susceptible to pixel electrode drops. Sparse electrode-skin disconnections, rooted in issues such as low adhesion, sweating, hair blockage, and skin stretch, challenge the reliability and scalability of these modules as the perception unit for neurorobotic systems. This paper proposes a novel deep-learning model that provides resiliency for HD-sEMG modules, which can be used in the wearable interfaces of neurorobots. The proposed 3D Dilated Efficient CapsNet model trains on an augmented input space to computationally 'force' the network to learn channel-dropout variations and thus become robust to channel dropout. The proposed framework maintained high performance in a sensor-dropout reliability study. Results show that the performance of conventional models degrades significantly under dropout and is recovered by the proposed architecture and training paradigm.
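The training paradigm of 'forcing' robustness through channel-dropout augmentation can be illustrated with a short sketch; the dropout range and tensor layout below are illustrative assumptions:

```python
# Minimal sketch: randomly zero out electrode channels in each HD-sEMG training batch.
import torch

def channel_dropout(batch, max_drop=0.3, generator=None):
    """batch: (B, C, T) HD-sEMG windows; zero a random subset of the C channels per sample."""
    B, C, _ = batch.shape
    drop_prob = torch.rand(B, 1, generator=generator) * max_drop   # per-sample rate in [0, max_drop)
    keep = (torch.rand(B, C, generator=generator) >= drop_prob).float()
    return batch * keep.unsqueeze(-1)                              # broadcast the mask over time

# During training: loss = criterion(model(channel_dropout(x)), y)
```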
https://arxiv.org/abs/2309.11086
Most existing hand gesture recognition (HGR) systems are limited to a predefined set of gestures. However, users and developers often want to recognize new, unseen gestures. This is challenging due to the vast diversity of all plausible hand shapes, e.g. it is impossible for developers to include all hand gestures in a predefined list. In this paper, we present a user-friendly framework that lets users easily customize and deploy their own gesture recognition pipeline. Our framework provides a pre-trained single-hand embedding model that can be fine-tuned for custom gesture recognition. Users can perform gestures in front of a webcam to collect a small amount of images per gesture. We also offer a low-code solution to train and deploy the custom gesture recognition model. This makes it easy for users with limited ML expertise to use our framework. We further provide a no-code web front-end for users without any ML expertise. This makes it even easier to build and test the end-to-end pipeline. The resulting custom HGR is then ready to be run on-device for real-time scenarios. This can be done by calling a simple function in our open-sourced model inference API, MediaPipe Tasks. This entire process only takes a few minutes.
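As a rough illustration of the customization flow, the sketch below trains a small classification head on top of a frozen, pre-trained hand-embedding backbone; it is a generic stand-in, not the MediaPipe Model Maker API, and `embedding_model`, dimensions, and hyper-parameters are assumptions:

```python
# Minimal sketch: custom gesture recognition via a frozen embedding backbone plus a trainable head.
import torch
import torch.nn as nn

class CustomGestureClassifier(nn.Module):
    def __init__(self, embedding_model, emb_dim=128, n_gestures=5):
        super().__init__()
        self.backbone = embedding_model             # pre-trained single-hand embedding network
        self.backbone.eval()
        for p in self.backbone.parameters():        # keep the embedding frozen
            p.requires_grad = False
        self.head = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, n_gestures))

    def forward(self, hand_crop):
        with torch.no_grad():
            emb = self.backbone(hand_crop)
        return self.head(emb)

def finetune(model, loader, epochs=10, lr=1e-3):
    """Train only the small head on the few webcam images collected per gesture."""
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```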
https://arxiv.org/abs/2309.10858
Objective: Multimodal hand gesture recognition (HGR) systems can achieve higher recognition accuracy than unimodal HGR systems. However, acquiring multimodal gesture recognition data typically requires users to wear additional sensors, thereby increasing hardware costs. Methods: This paper proposes a novel generative approach to improve Surface Electromyography (sEMG)-based HGR accuracy via virtual Inertial Measurement Unit (IMU) signals. Specifically, we first trained a deep generative model, based on the intrinsic correlation between forearm sEMG signals and forearm IMU signals, to generate virtual forearm IMU signals from the input forearm sEMG signals. Subsequently, the sEMG signals and virtual IMU signals were fed into a multimodal Convolutional Neural Network (CNN) model for gesture recognition. Results: We conducted evaluations on six databases, including five publicly available databases and our collected database comprising 28 subjects performing 38 gestures, containing both sEMG and IMU data. The results show that our proposed approach significantly outperforms the sEMG-based unimodal HGR approach (with increases of 2.15%-13.10%). Moreover, it achieves accuracy levels closely matching those of multimodal HGR when using virtual Acceleration (ACC) signals. Conclusion: This demonstrates that incorporating virtual IMU signals generated by deep generative models can significantly improve the accuracy of sEMG-based HGR. Significance: The proposed approach represents a successful attempt to bridge the gap between unimodal HGR and multimodal HGR without additional sensor hardware, which can help promote the further development of natural and cost-effective myoelectric interfaces in the biomedical engineering field.
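The generate-then-fuse pipeline can be sketched compactly. The layer sizes and channel counts below are illustrative assumptions rather than the paper's actual generative model and multimodal CNN:

```python
# Minimal sketch: generate virtual IMU signals from sEMG, then fuse both streams in a two-branch CNN.
import torch
import torch.nn as nn

class VirtualIMUGenerator(nn.Module):
    """Illustrative sEMG -> IMU regressor; the paper's actual generative model may differ."""
    def __init__(self, emg_ch=8, imu_ch=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(emg_ch, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, imu_ch, 5, padding=2),
        )

    def forward(self, emg):                       # (B, emg_ch, T) -> (B, imu_ch, T)
        return self.net(emg)

class MultimodalCNN(nn.Module):
    def __init__(self, emg_ch=8, imu_ch=6, n_classes=38):
        super().__init__()
        self.emg_branch = nn.Sequential(nn.Conv1d(emg_ch, 32, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.imu_branch = nn.Sequential(nn.Conv1d(imu_ch, 32, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, emg, imu):
        f = torch.cat([self.emg_branch(emg).flatten(1), self.imu_branch(imu).flatten(1)], dim=1)
        return self.classifier(f)

# Inference without a physical IMU: sEMG -> virtual IMU -> multimodal classifier.
# logits = MultimodalCNN()(emg, VirtualIMUGenerator()(emg))
```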
https://arxiv.org/abs/2308.04091
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread. In human-human interactions, humans use nonverbal behaviours to convey their attitudes, feelings, and intentions. Therefore, this capability is also required for embodied agents in order to enhance the quality and effectiveness of their interactions with humans. In this paper, we propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances. Based on a conditional Generative Adversarial Network (GAN), our proposed neural network model learns the relationships between the co-speech gestures and both semantic and acoustic features from the speech input. In order to train our neural network model, we employ a public dataset containing co-speech gestures with corresponding speech audio utterances, which were captured from a single male native English speaker. The results from both objective and subjective evaluations demonstrate the efficacy of our gesture-generation framework for Robots and Embodied Agents.
https://arxiv.org/abs/2309.09346
Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate gestures synchronized with speech rhythm, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages pre-trained CLIP text embeddings as guidance for generating gestures that are highly semantically aligned with the script. We then devise a simple but effective diffusion-based gesture generation backbone built from pure MLPs, which is conditioned only on audio signals and learns to gesticulate with realistic motions. We utilize this powerful prior to synchronize the script-guided gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style, editing the co-speech gestures via textual prompting, and controlling the semantic awareness and rhythm alignment with guided diffusion. Extensive experiments demonstrate the advantages of the proposed framework over competing methods. In addition, our core diffusion-based generative model achieves state-of-the-art performance on two benchmarks. The code and model will be released to facilitate future research.
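The audio-conditioned, pure-MLP diffusion backbone of the second stage can be outlined briefly; the dimensions, the crude scalar timestep encoding, and the noise-schedule handling below are simplified assumptions, not the paper's implementation:

```python
# Minimal sketch: a pure-MLP denoiser for audio-conditioned gesture diffusion (one DDPM training step).
import torch
import torch.nn as nn

class MLPDenoiser(nn.Module):
    def __init__(self, pose_dim=135, audio_dim=64, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + audio_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_pose, audio_feat, t):
        # noisy_pose: (B, T, pose_dim), audio_feat: (B, T, audio_dim); crude scalar timestep feature
        t_emb = t.float().view(-1, 1, 1).expand(noisy_pose.size(0), noisy_pose.size(1), 1)
        return self.net(torch.cat([noisy_pose, audio_feat, t_emb], dim=-1))

def diffusion_step(model, pose, audio_feat, alphas_cumprod, opt):
    """Standard DDPM noise-prediction objective, conditioned only on audio features."""
    B = pose.size(0)
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(pose)
    noisy = a.sqrt() * pose + (1 - a).sqrt() * noise
    loss = nn.functional.mse_loss(model(noisy, audio_feat, t), noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```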
https://arxiv.org/abs/2309.09294
This work introduces a new multispectral database and novel approaches for eyeblink detection in RGB and Near-Infrared (NIR) individual images. Our contributed dataset (mEBAL2, multimodal Eye Blink and Attention Level estimation, Version 2) is the largest existing eyeblink database, representing a great opportunity to improve data-driven multispectral approaches for blink detection and related applications (e.g., attention level estimation and presentation attack detection in face biometrics). mEBAL2 includes 21,100 image sequences from 180 different students (more than 2 million labeled images in total), captured while they conducted a number of e-learning tasks of varying difficulty or took a real course on HTML initiation through the edX MOOC platform. mEBAL2 uses multiple sensors, including two Near-Infrared (NIR) cameras and one RGB camera to capture facial gestures during the execution of the tasks, as well as an Electroencephalogram (EEG) band to record the user's cognitive activity and blinking events. Furthermore, this work proposes a Convolutional Neural Network architecture as a benchmark for blink detection on mEBAL2, with performance of up to 97%. Different training methodologies are implemented using the RGB spectrum, the NIR spectrum, and the combination of both to enhance the performance of existing eyeblink detectors. We demonstrate that combining NIR and RGB images during training improves the performance of RGB eyeblink detectors (i.e., detection based only on an RGB image). Finally, the generalization capacity of the proposed eyeblink detectors is validated in wilder and more challenging environments, such as the HUST-LEBW dataset, to show the usefulness of mEBAL2 for training a new generation of data-driven approaches to eyeblink detection.
https://arxiv.org/abs/2309.07880
The automatic co-speech gesture generation draws much attention in computer animation. Previous works designed network structures on individual datasets, which resulted in a lack of data volume and generalizability across different motion capture standards. In addition, it is a challenging task due to the weak correlation between speech and gestures. To address these problems, we present UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons. Specifically, we first present a retargeting network to learn latent homeomorphic graphs for different motion capture standards, unifying the representations of various gestures while extending the dataset. We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention to generate better speech-matched and realistic gestures. To further align speech and gesture and increase diversity, we incorporate reinforcement learning on the discrete gesture units with a learned reward function. Extensive experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness. All code, pre-trained models, databases, and demos are available to the public at this https URL.
https://arxiv.org/abs/2309.07051
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach to achieving human-like co-speech gestures in agents that carry semantic meaning.
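The contrastive speech-motion pretraining objective is conceptually close to a CLIP-style loss over paired speech and gesture clips; below is a minimal sketch under that assumption (the temperature and symmetric cross-entropy are illustrative choices, not necessarily the paper's exact formulation):

```python
# Minimal sketch: CLIP-style contrastive loss for joint speech/gesture embeddings (CSMP-like).
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(speech_emb, motion_emb, temperature=0.07):
    """speech_emb, motion_emb: (B, D) embeddings of time-aligned speech and gesture clips."""
    s = F.normalize(speech_emb, dim=-1)
    m = F.normalize(motion_emb, dim=-1)
    logits = s @ m.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```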
https://arxiv.org/abs/2309.05455
The complex and unique neural network topology of the human brain, formed through natural evolution, enables it to perform multiple cognitive functions simultaneously. The automated evolutionary mechanisms of biological network structure inspire us to explore efficient architectural optimization for Spiking Neural Networks (SNNs). Instead of manually designed fixed architectures or hierarchical Network Architecture Search (NAS), this paper evolves SNN architectures by incorporating brain-inspired local modular structure and global cross-module connectivity. Locally, the brain-region-inspired module consists of multiple neural motifs with excitatory and inhibitory connections; globally, we evolve free connections among modules, including long-term cross-module feedforward and feedback connections. We further introduce an efficient multi-objective evolutionary algorithm based on a few-shot performance predictor, endowing SNNs with high performance, efficiency, and low energy consumption. Extensive experiments on static datasets (CIFAR10, CIFAR100) and neuromorphic datasets (CIFAR10-DVS, DVS128-Gesture) demonstrate that our proposed model boosts energy efficiency, achieving consistent and remarkable performance. This work explores brain-inspired neural architectures suitable for SNNs and also provides preliminary insights into the evolutionary mechanisms of biological neural networks in the human brain.
https://arxiv.org/abs/2309.05263
Creating a diverse and comprehensive dataset of hand gestures for dynamic human-machine interfaces in the automotive domain can be challenging and time-consuming. To overcome this challenge, we propose using synthetic gesture datasets generated by virtual 3D models. Our framework utilizes Unreal Engine to synthesize realistic hand gestures, offering customization options and reducing the risk of overfitting. Multiple variants, including gesture speed, performance, and hand shape, are generated to improve generalizability. In addition, we simulate different camera locations and types, such as RGB, infrared, and depth cameras, without incurring additional time and cost to obtain these cameras. Experimental results demonstrate that our proposed framework, SynthoGestures (this https URL), improves gesture recognition accuracy and can replace or augment real-hand datasets. By saving time and effort in the creation of the data set, our tool accelerates the development of gesture recognition systems for automotive applications.
https://arxiv.org/abs/2309.04421
The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing forms such as text, music, and speech. This endows the task with the capability to adapt across multiple scenarios, ranging from text-to-motion to music-to-dance, among others. While existing research has primarily focused on single conditions, multi-condition human motion generation remains underexplored. In this paper, we address these challenges by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework is able to integrate with any DDPM-like diffusion model to accommodate multi-conditional information input while preserving its generative capabilities. Specifically, MCM employs a two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the parameters of the main branch, effectively maintaining the generation ability of the main branch while supporting multi-condition input. We also introduce a Transformer-based diffusion model, MWNet (DDPM-like), as our main branch, which can capture the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves state-of-the-art results in text-to-motion and competitive results in music-to-dance, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methodologies originally designed for text-to-motion to domains like music-to-dance and speech-to-gesture, eliminating the need for extensive network re-configuration, but also enables effective multi-condition modal control, realizing "once trained is motion need".
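The two-branch design, with a control branch cloned from the main branch, can be sketched in a ControlNet-like fashion; the exact injection points, conditioning interface, and zero-initialized projection below are assumptions for illustration, not MCM's published implementation:

```python
# Minimal sketch: control branch initialized as a copy of the (frozen) main branch, with a
# zero-initialized projection so training starts from the main branch's behaviour.
import copy
import torch.nn as nn

class TwoBranchDenoiser(nn.Module):
    def __init__(self, main_branch: nn.Module, cond_dim: int, motion_dim: int):
        super().__init__()
        self.main = main_branch                             # pretrained DDPM-like denoiser, kept frozen
        self.control = copy.deepcopy(main_branch)           # same structure, same initial parameters
        self.cond_proj = nn.Linear(cond_dim, motion_dim)    # inject the extra condition into the input
        self.zero_proj = nn.Linear(motion_dim, motion_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)
        for p in self.main.parameters():
            p.requires_grad = False

    def forward(self, x_t, t, condition):
        # x_t: (B, T, motion_dim) noisy motion; condition: (B, T, cond_dim) per-frame condition features
        ctrl = self.control(x_t + self.cond_proj(condition), t)
        return self.main(x_t, t) + self.zero_proj(ctrl)     # control output added as a residual
```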
https://arxiv.org/abs/2309.03031
This paper explores the critical but often overlooked role of non-verbal cues, including co-speech gestures and facial expressions, in human communication and their implications for Natural Language Processing (NLP). We argue that understanding human communication requires a more holistic approach that goes beyond textual or spoken words to include non-verbal elements. Borrowing from advances in sign language processing, we propose the development of universal automatic gesture segmentation and transcription models to transcribe these non-verbal cues into textual form. Such a methodology aims to bridge the blind spots in spoken language understanding, enhancing the scope and applicability of NLP models. Through motivating examples, we demonstrate the limitations of relying solely on text-based models. We propose a computationally efficient and flexible approach for incorporating non-verbal cues, which can seamlessly integrate with existing NLP pipelines. We conclude by calling upon the research community to contribute to the development of universal transcription methods and to validate their effectiveness in capturing the complexities of real-world, multi-modal interactions.
https://arxiv.org/abs/2309.06572
Gestures serve as a fundamental and significant mode of non-verbal communication among humans. Deictic gestures (such as pointing towards an object), in particular, offer a valuable means of efficiently expressing intent in situations where language is inaccessible, restricted, or highly specialized. As a result, it is essential for robots to comprehend gestures in order to infer human intentions and establish more effective coordination with them. Prior work often relies on a rigid, hand-coded library of gestures along with their meanings. However, the interpretation of gestures is often context-dependent, requiring more flexibility and common-sense reasoning. In this work, we propose a framework, GIRAF, for more flexibly interpreting gesture and language instructions by leveraging the power of large language models. Our framework is able to accurately infer human intent and contextualize the meaning of gestures for more effective human-robot collaboration. We instantiate the framework for interpreting deictic gestures in table-top manipulation tasks and demonstrate that it is both effective and preferred by users, achieving 70% higher success rates than the baseline. We further demonstrate GIRAF's ability to reason about diverse types of gestures by curating a GestureInstruct dataset consisting of 36 different task scenarios. GIRAF achieves an 81% success rate in finding the correct plan for tasks in GestureInstruct. Website: this https URL
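One plausible way to combine a verbalized gesture with the spoken instruction in an LLM query is sketched below; the prompt wording, the scene description, and the `call_llm` helper are hypothetical illustrations, not GIRAF's actual prompts or interface:

```python
# Minimal sketch: compose an LLM prompt from a verbalized deictic gesture and the spoken instruction.
def build_gesture_prompt(instruction, gesture_description, scene_objects):
    return (
        "You are a robot assistant on a table-top manipulation task.\n"
        f"Objects in the scene: {', '.join(scene_objects)}.\n"
        f"The person said: \"{instruction}\"\n"
        f"The person's gesture: {gesture_description}\n"
        "Infer which object the person refers to and propose a step-by-step plan."
    )

prompt = build_gesture_prompt(
    instruction="Hand me that one",
    gesture_description="pointing toward the left side of the table, near the mug",
    scene_objects=["mug", "screwdriver", "tape"],
)
# response = call_llm(prompt)   # hypothetical LLM call
```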
https://arxiv.org/abs/2309.02721
Embodied Reference Understanding studies the reference understanding in an embodied fashion, where a receiver is required to locate a target object referred to by both language and gesture of the sender in a shared physical environment. Its main challenge lies in how to make the receiver with the egocentric view access spatial and visual information relative to the sender to judge how objects are oriented around and seen from the sender, i.e., spatial and visual perspective-taking. In this paper, we propose a REasoning from your Perspective (REP) method to tackle the challenge by modeling relations between the receiver and the sender and the sender and the objects via the proposed novel view rotation and relation reasoning. Specifically, view rotation first rotates the receiver to the position of the sender by constructing an embodied 3D coordinate system with the position of the sender as the origin. Then, it changes the orientation of the receiver to the orientation of the sender by encoding the body orientation and gesture of the sender. Relation reasoning models the nonverbal and verbal relations between the sender and the objects by multi-modal cooperative reasoning in gesture, language, visual content, and spatial position. Experiment results demonstrate the effectiveness of REP, which consistently surpasses all existing state-of-the-art algorithms by a large margin, i.e., +5.22% absolute accuracy in terms of Prec0.5 on YouRefIt.
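The view-rotation step amounts to re-expressing scene coordinates in a sender-centred frame. The sketch below is a simplified, yaw-only version of that transform (the full method also encodes the sender's body orientation and gesture):

```python
# Minimal sketch: map world-frame points into an embodied frame with the sender at the origin
# and axes aligned with the sender's heading (yaw-only rotation for illustration).
import numpy as np

def to_sender_frame(points, sender_pos, sender_yaw):
    """points: (N, 3) world coordinates; sender_pos: (3,);
    sender_yaw: heading in radians, with forward direction (cos yaw, sin yaw, 0) in the world frame."""
    c, s = np.cos(sender_yaw), np.sin(sender_yaw)
    R = np.array([[c, s, 0.0],          # rotation from the world frame into the sender's frame
                  [-s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (np.asarray(points) - np.asarray(sender_pos)) @ R.T

# A point one metre straight ahead of the sender maps to (1, 0, 0): one metre along the
# sender's forward axis, regardless of where the sender stands or faces in the world frame.
```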
https://arxiv.org/abs/2309.01073
Intelligent vehicle anticipation of the movement intentions of other drivers can reduce collisions. Typically, when the human driver of another vehicle (referred to as the target vehicle) engages in specific behaviors, such as checking the rearview mirror prior to a lane change, a valuable clue is provided about the intentions of the target vehicle's driver. Furthermore, the target driver's intentions can be influenced and shaped by the driving environment. For example, if the target vehicle is too close to a leading vehicle, it may abandon the lane-change decision; likewise, a following vehicle in the target lane that is too close to the target vehicle could lead it to reverse the decision to change lanes. Knowledge of such intentions for all vehicles in a traffic stream can help enhance traffic safety. Unfortunately, such information is often captured in the form of images/videos, and utilizing personally identifiable data to train a general model could violate user privacy. Federated Learning (FL) is a promising tool to resolve this conundrum, as it efficiently trains models without exposing the underlying data. This paper introduces a Personalized Federated Learning (PFL) model that embeds a long short-term transformer (LSTR) framework. The framework predicts drivers' intentions by leveraging in-vehicle videos (of driver movement, gestures, and expressions) and out-of-vehicle videos (of the vehicle's surroundings - frontal/rear areas). The proposed PFL-LSTR framework is trained and tested on real-world driving data collected from human drivers on Interstate 65 in Indiana. The results suggest that PFL-LSTR exhibits high adaptability and high precision, and that out-of-vehicle information (particularly the driver's rear-mirror viewing actions) is important because it helps reduce false positives and thereby enhances the precision of driver intention inference.
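A common personalized-FL pattern, sketched below under the assumption that shared feature layers are federated while each client keeps a personalized head local, gives a rough idea of one training round (the `local_train` callback and parameter partitioning are hypothetical, not the paper's exact scheme):

```python
# Minimal sketch: one round of personalized federated averaging.
# Shared layers (keys in `global_shared`) are averaged across clients; all other
# parameters (e.g., a personalized head) stay on the client and are never uploaded.
import copy
import torch

def personalized_fedavg_round(global_shared, client_states, local_train):
    """global_shared: dict of shared weights; client_states: list of full per-client state_dicts."""
    updated = []
    for state in client_states:
        state = copy.deepcopy(state)
        state.update(global_shared)          # pull the latest shared layers from the server
        state = local_train(state)           # client-side training on private driving videos
        updated.append(state)
    new_shared = {                           # aggregate only the shared parameters
        k: torch.stack([s[k].float() for s in updated]).mean(dim=0) for k in global_shared
    }
    return new_shared, updated               # personalized parameters remain inside `updated`
```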
https://arxiv.org/abs/2309.00790
We propose augmenting the empathetic capacities of social robots by integrating non-verbal cues. Our primary contribution is the design and labeling of four types of empathetic non-verbal cues, abbreviated as SAFE: Speech, Action (gesture), Facial expression, and Emotion, in a social robot. These cues are generated using a Large Language Model (LLM). We developed an LLM-based conversational system for the robot and assessed its alignment with social cues as defined by human counselors. Preliminary results show distinct patterns in the robot's responses, such as a preference for calm and positive social emotions like 'joy' and 'lively', and frequent nodding gestures. Despite these tendencies, our approach has led to the development of a social robot capable of context-aware and more authentic interactions. Our work lays the groundwork for future studies on human-robot interactions, emphasizing the essential role of both verbal and non-verbal cues in creating social and empathetic robots.
https://arxiv.org/abs/2308.16529