Compared with images, skeleton-based motion representations are robust for action localization and understanding, owing to their invariance to perspective, lighting, and occlusion. Yet they are often ambiguous and incomplete when taken out of context, even for human annotators. Just as infants discern gestures before associating them with words, actions can be conceptualized before being grounded with labels. We therefore propose the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), which partitions a skeleton-based motion sequence into discovered, semantically meaningful pre-action segments. By fine-tuning our pre-trained network on a small amount of annotated data, we show results outperforming SOTA methods by a large margin.
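The abstract does not spell out BID's training objective or architecture; purely to illustrate boundary-based partitioning of a skeleton sequence, the following PyTorch sketch (module names, dimensions, and the thresholding rule are all hypothetical) scores each frame as a potential segment boundary and cuts the sequence at the predicted boundaries.

```python
# Hypothetical sketch of boundary-based partitioning of a skeleton sequence.
# The real BID pre-training objective is not specified in the abstract; this
# only illustrates scoring frames as segment boundaries and cutting at them.
import torch
import torch.nn as nn

class BoundaryScorer(nn.Module):
    def __init__(self, num_joints=25, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(num_joints * 3, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # per-frame boundary logit

    def forward(self, skel):                   # skel: (B, T, J, 3)
        b, t, j, c = skel.shape
        feats, _ = self.encoder(skel.view(b, t, j * c))
        return self.head(feats).squeeze(-1)    # (B, T) boundary logits

def partition(boundary_logits, threshold=0.0):
    """Split frame indices into contiguous segments at predicted boundaries."""
    cuts = (boundary_logits > threshold).nonzero(as_tuple=True)[0].tolist()
    edges = [0] + cuts + [boundary_logits.numel()]
    return [(s, e) for s, e in zip(edges, edges[1:]) if e > s]

skel = torch.randn(1, 300, 25, 3)              # 300 frames, 25 joints
logits = BoundaryScorer()(skel)
print(partition(logits[0]))
```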
https://arxiv.org/abs/2403.07354
Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories based on short segments of kinematic and video data. We conduct an ablation study to evaluate the impact of fusing different input modalities and their representations on gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture using the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state-of-the-art (SOTA) with 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. By relying on a computationally efficient model, it achieves real-time performance of 1.1-1.3 ms for processing a 1-second input window.
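As a rough sketch of the kind of kinematic-video fusion described, the PyTorch snippet below projects both modalities into a shared token space and lets a transformer encoder attend over them jointly; the dimensions, layer counts, and the mean-pooled classification head are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class GestureFusionTransformer(nn.Module):
    """Hypothetical fusion of per-frame kinematic and video features."""
    def __init__(self, kin_dim=38, vid_dim=512, d_model=128, n_gestures=15):
        super().__init__()
        self.kin_proj = nn.Linear(kin_dim, d_model)
        self.vid_proj = nn.Linear(vid_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, n_gestures)

    def forward(self, kin, vid):               # (B, T, kin_dim), (B, T, vid_dim)
        tokens = torch.cat([self.kin_proj(kin), self.vid_proj(vid)], dim=1)
        fused = self.encoder(tokens)           # joint attention over both modalities
        return self.cls(fused.mean(dim=1))     # gesture logits for the window

kin = torch.randn(2, 30, 38)                   # e.g., a 1-second kinematic window
vid = torch.randn(2, 30, 512)                  # e.g., per-frame video embeddings
print(GestureFusionTransformer()(kin, vid).shape)   # torch.Size([2, 15])
```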
https://arxiv.org/abs/2403.06705
As technological advancements continue to expand the capabilities of multi-unmanned-aerial-vehicle (mUAV) systems, human operators face challenges in scalability and efficiency due to the complex cognitive load and operations associated with motion adjustments and team coordination. Such cognitive demands limit the feasible size of mUAV teams and necessitate extensive operator training, impeding broader adoption. This paper develops Hand Gesture Based Interactive Control (HGIC), a novel interface system that utilizes computer vision techniques to intuitively translate hand gestures into modular commands for robot teaming. Through learned control models, these commands enable efficient and scalable mUAV motion control and adjustments. HGIC eliminates the need for specialized hardware and offers two key benefits: 1) minimal training requirements through natural gestures; and 2) enhanced scalability and efficiency via adaptable commands. By reducing the cognitive burden on operators, HGIC opens the door for more effective large-scale mUAV applications in complex, dynamic, and uncertain scenarios. HGIC will be open-sourced for the research community after the paper is published online, aiming to drive forward innovations in human-mUAV interactions.
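A minimal sketch of the gesture-to-command idea, assuming a MediaPipe Hands landmark detector and a toy finger-count rule; HGIC's actual gesture vocabulary and learned control models are not described in the abstract.

```python
# Hypothetical sketch: translate webcam hand landmarks into discrete team
# commands. HGIC's gesture set and models are not public yet, so the
# finger-count rule below (which assumes an upright hand) is a stand-in.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1)
COMMANDS = {1: "FORM_LINE", 2: "FORM_WEDGE", 4: "HOLD_POSITION"}  # made-up names

def count_extended_fingers(lm):
    tips, pips = [8, 12, 16, 20], [6, 10, 14, 18]   # landmark indices, no thumb
    return sum(lm[t].y < lm[p].y for t, p in zip(tips, pips))

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    res = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if res.multi_hand_landmarks:
        n = count_extended_fingers(res.multi_hand_landmarks[0].landmark)
        print(COMMANDS.get(n, "NO_OP"))              # would be sent to the mUAV team
```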
https://arxiv.org/abs/2403.05478
Sonomyography (SMG) is a non-invasive technique that uses ultrasound imaging to detect the dynamic activity of muscles. Wearable SMG systems have recently gained popularity due to their potential as human-computer interfaces, offering superior performance compared to conventional methods. This paper demonstrates real-time positional proportional control of multiple gestures using a multiplexed 8-channel wearable SMG system. The amplitude-mode ultrasound signals from the SMG system were used to detect muscle activity from the forearms of 8 healthy individuals. The derived signals controlled the on-screen movement of a cursor. A target achievement task was performed to analyze the performance of our SMG-based human-machine interface. Our wearable SMG system provided accurate, stable, and intuitive control in real time, achieving an average success rate greater than 80% with all gestures. Furthermore, we evaluated the wearable SMG system's ability to detect volitional movement and to decode movement kinematics from SMG trajectories using standard performance metrics. Our results provide insights to validate SMG as an intuitive human-machine interface.
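Positional proportional control maps the decoded muscle activity directly onto cursor position. A minimal sketch, assuming a single scalar SMG activity feature calibrated between rest and full contraction (the calibration routine and feature extraction are not restated here):

```python
import numpy as np

# Hypothetical sketch of positional proportional control: the cursor position
# is a linear function of normalized muscle activity, here one A-mode channel
# summarized by its deviation from a resting baseline.
def normalize(activity, rest_level, max_level):
    """Map raw SMG activity into [0, 1] between rest and full contraction."""
    return np.clip((activity - rest_level) / (max_level - rest_level), 0.0, 1.0)

def cursor_position(activity, rest_level, max_level, screen_width=1920):
    return int(normalize(activity, rest_level, max_level) * (screen_width - 1))

# Calibration values would come from a per-user rest/contraction routine.
rest, full = 0.12, 0.85
for sample in [0.12, 0.40, 0.85]:              # simulated per-frame SMG features
    print(cursor_position(sample, rest, full))
```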
https://arxiv.org/abs/2403.05308
Micro-actions are imperceptible non-verbal behaviours characterised by low-intensity movement. They offer insights into the feelings and intentions of individuals and are important for human-oriented applications such as emotion recognition and psychological assessment. However, the identification, differentiation, and understanding of micro-actions pose challenges due to the imperceptible and inaccessible nature of these subtle human behaviors in everyday life. In this study, we collect a new micro-action dataset designated Micro-action-52 (MA-52) and propose a benchmark named Micro-Action Network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides a whole-body perspective including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels, and encompasses a full array of realistic and natural micro-actions, covering 205 participants and 22,422 video instances collated from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and temporal shift module (TSM) blocks into the ResNet architecture to model the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between video and action labels; the loss is used to better distinguish visually similar yet distinct micro-action categories. An extended application to emotion recognition demonstrates one of the important values of our proposed dataset and method. In the future, human behaviour, emotion, and psychological assessment will be explored further in depth. The dataset and source code are released at this https URL.
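How MANet wires these modules into ResNet is not detailed in the abstract; below is a minimal PyTorch sketch of the two named ingredients, a squeeze-and-excitation block and the temporal shift operation, as they are commonly implemented.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: channel-wise reweighting (Hu et al.)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze, then excite
        return x * w[:, :, None, None]

def temporal_shift(x, n_segments, fold_div=8):
    """TSM (Lin et al.): shift 1/fold_div of channels one step in time."""
    nt, c, h, w = x.shape
    x = x.view(nt // n_segments, n_segments, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                    # shift toward the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # shift toward the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # untouched channels
    return out.view(nt, c, h, w)

x = torch.randn(2 * 8, 64, 14, 14)                # batch of 2 clips x 8 frames
print(SEBlock(64)(temporal_shift(x, n_segments=8)).shape)
```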
https://arxiv.org/abs/2403.05234
This research aims to further understanding in the field of continuous authentication using behavioral biometrics. We contribute a novel dataset encompassing the gesture data of 15 users playing Minecraft on a Samsung tablet, each for a duration of 15 minutes. Utilizing this dataset, we employed machine learning (ML) binary classifiers, namely Random Forest (RF), K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC), to determine the authenticity of specific user actions. Our most robust model was the SVC, which achieved an average accuracy of approximately 90%, demonstrating that touch dynamics can effectively distinguish users. However, further studies are needed to make this a viable option for authentication systems.
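A minimal scikit-learn sketch of the evaluation setup described, with placeholder touch-dynamics features standing in for the dataset's actual feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Placeholder touch-dynamics features (e.g., swipe duration, velocity,
# pressure); the paper's exact feature set is not restated here.
X = np.random.rand(600, 12)
y = np.random.randint(0, 2, 600)                  # 1 = genuine user, 0 = impostor

for name, clf in [("RF", RandomForestClassifier()),
                  ("KNN", KNeighborsClassifier()),
                  ("SVC", SVC())]:
    pipe = make_pipeline(StandardScaler(), clf)   # scaling matters for KNN/SVC
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```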
https://arxiv.org/abs/2403.03832
Recent advancements in multimodal Human-Robot Interaction (HRI) datasets have highlighted the fusion of speech and gesture, expanding robots' capabilities to absorb explicit and implicit HRI insights. However, existing speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing, revealing limitations in scaling to intricate domains and prioritizing human command data over robot behavior records. To bridge these gaps, we introduce NatSGD, a multimodal HRI dataset encompassing natural human commands through speech and gestures, synchronized with robot behavior demonstrations. NatSGD serves as a foundational resource at the intersection of machine learning and HRI research, and we demonstrate its effectiveness in training robots to understand tasks through multimodal human commands, emphasizing the significance of jointly considering speech and gestures. We have released our dataset, simulator, and code to facilitate future research in human-robot interaction system learning; access these resources at this https URL.
https://arxiv.org/abs/2403.02274
3D hand pose estimation has found broad application in areas such as gesture recognition and human-machine interaction. As performance improves, the complexity of the systems also increases, which can limit the comparative analysis and practical implementation of these methods. In this paper, we propose a simple yet effective baseline that not only surpasses state-of-the-art (SOTA) methods but also demonstrates computational efficiency. To establish this baseline, we abstract existing work into two components, a token generator and a mesh regressor, and then examine their core structures. A core structure, in this context, is one that fulfills intrinsic functions, brings about significant improvements, and achieves excellent performance without unnecessary complexity. Our proposed approach is decoupled from any modifications to the backbone, making it adaptable to any modern model. Our method outperforms existing solutions, achieving SOTA results across multiple datasets. On the FreiHAND dataset, our approach produced a PA-MPJPE of 5.7 mm and a PA-MPVPE of 6.0 mm. Similarly, on the DexYCB dataset, we observed a PA-MPJPE of 5.5 mm and a PA-MPVPE of 5.0 mm. As for speed, our method reaches up to 33 frames per second (fps) when using HRNet and up to 70 fps when employing FastViT-MA36.
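The abstract names the two components without their concrete core structures; the sketch below is only a hypothetical instance of the token-generator-plus-mesh-regressor decomposition (cross-attention queries as tokens, an MLP regressing 778 MANO-style vertices), decoupled from any particular backbone.

```python
import torch
import torch.nn as nn

class TokenGenerator(nn.Module):
    """Turn backbone feature maps into a small set of hand tokens."""
    def __init__(self, feat_dim=256, n_tokens=21, d_model=128):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, d_model, 1)
        self.queries = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, fmap):                        # fmap: (B, feat_dim, H, W)
        kv = self.proj(fmap).flatten(2).transpose(1, 2)   # (B, HW, d_model)
        q = self.queries.expand(fmap.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)
        return tokens                               # (B, n_tokens, d_model)

class MeshRegressor(nn.Module):
    """Regress mesh vertices from the tokens."""
    def __init__(self, d_model=128, n_tokens=21, n_verts=778):    # 778 = MANO
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(nn.Flatten(), nn.Linear(n_tokens * d_model, 512),
                                 nn.ReLU(), nn.Linear(512, n_verts * 3))

    def forward(self, tokens):
        return self.mlp(tokens).view(-1, self.n_verts, 3)

fmap = torch.randn(2, 256, 7, 7)                    # any backbone's output
print(MeshRegressor()(TokenGenerator()(fmap)).shape)    # torch.Size([2, 778, 3])
```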
https://arxiv.org/abs/2403.01813
Recognizing when humans speak is a central task towards understanding social interactions. Ideally, speaking would be detected from individual voice recordings, as done previously for meeting scenarios. However, individual voice recordings are hard to obtain in the wild, especially in crowded mingling scenarios, due to cost, logistics, and privacy concerns. As an alternative, machine learning models trained on video and wearable sensor data make it possible to recognize speech by detecting its related gestures in an unobtrusive, privacy-preserving way. These models themselves should ideally be trained using labels obtained from the speech signal. However, existing mingling datasets do not contain high-quality audio recordings. Instead, speaking status annotations have often been inferred by human annotators from video, without validation of this approach against audio-based ground truth. In this paper, we revisit no-audio speaking status estimation by presenting REWIND, the first publicly available multimodal dataset with high-quality individual speech recordings of 33 subjects at a professional networking event. We present three baselines for no-audio speaking status segmentation: a) from video, b) from body acceleration (chest-worn accelerometer), and c) from body pose tracks. In all cases we predict a 20 Hz binary speaking status signal extracted from the audio, a time resolution not available in previous datasets. In addition to providing the signals and ground truth necessary to evaluate a wide range of speaking status detection methods, the availability of audio in REWIND makes it suitable for cross-modality studies not feasible with previous mingling datasets. Finally, our flexible data consent setup creates new challenges for multimodal systems under missing modalities.
https://arxiv.org/abs/2403.01229
Changes in facial expression, head movement, body movement, and gesture are remarkable cues in sign language recognition, yet most current continuous sign language recognition (CSLR) methods focus on static images of video sequences at the frame-level feature extraction stage while ignoring the dynamic changes between images. In this paper, we propose a novel motor attention mechanism to capture the distorted changes in local motion regions during sign language expression and obtain a dynamic representation of image changes. In addition, we apply the self-distillation method to frame-level feature extraction for continuous sign language for the first time: by distilling the features of adjacent stages and using the higher-order features as teachers to guide the lower-order features, feature expression improves without additional computational resources. Combining the two yields our proposed holistic CSLR model based on the motor attention mechanism and frame-level self-distillation (MAM-FSD), which improves the inference ability and robustness of the model. We conduct experiments on three publicly available datasets, and the results show that our proposed method can effectively extract sign language motion information in videos, improve the accuracy of CSLR, and reach the state-of-the-art level.
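As a hedged sketch of the frame-level self-distillation idea, the snippet below pulls each stage's per-frame features toward the next (higher-order) stage's detached features; the paper's exact distillation loss and any feature adapters are not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def frame_level_self_distillation(stage_feats):
    """Each stage's per-frame features are pulled toward the next (deeper)
    stage's features, which serve as a detached teacher, adding no extra
    network at inference time. Assumes the features have already been
    projected to a common shape (B, C, T) by small adapters."""
    loss = 0.0
    for student, teacher in zip(stage_feats[:-1], stage_feats[1:]):
        loss = loss + F.mse_loss(student, teacher.detach())
    return loss / (len(stage_feats) - 1)

# Per-frame features from three adjacent stages of the frame-level extractor
feats = [torch.randn(2, 64, 100, requires_grad=True) for _ in range(3)]
print(frame_level_self_distillation(feats))
```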
https://arxiv.org/abs/2402.19118
Effective communication between humans and collaborative robots is essential for seamless Human-Robot Collaboration (HRC). In noisy industrial settings, nonverbal communication, such as gestures, plays a key role in conveying commands and information to robots efficiently. While existing literature has thoroughly examined gesture recognition and robots' responses to these gestures, there is a notable gap in exploring the design of these gestures. The criteria for creating efficient HRC gestures are scattered across numerous studies. This paper surveys the design principles of HRC gestures, as contained in the literature, aiming to consolidate a set of criteria for HRC gesture design. It also examines the methods used for designing and evaluating HRC gestures to highlight research gaps and present directions for future research in this area.
https://arxiv.org/abs/2402.19058
Reliable methods for the neurodevelopmental assessment of infants are essential for early detection of medical issues that may need prompt intervention. Spontaneous motor activity, or 'kinetics', is shown to provide a powerful surrogate measure of upcoming neurodevelopment. However, its assessment is by and large qualitative and subjective, focusing on visually identified, age-specific gestures. Here, we follow an alternative approach, predicting infants' neurodevelopmental maturation based on data-driven evaluation of individual motor patterns. We utilize 3D video recordings of infants, processed with pose estimation, to extract spatio-temporal series of anatomical landmarks, and apply adaptive graph convolutional networks to predict the actual age. We show that our data-driven approach achieves improvement over traditional machine learning baselines based on manually engineered features.
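A minimal sketch of an adaptive graph-convolution layer over landmark series, in the spirit of AGCN-style models where the fixed anatomical adjacency is augmented by a learned residual graph; the layer sizes and the age-regression head are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """One adaptive graph-convolution layer: the skeleton adjacency is the sum
    of a fixed anatomical graph and a freely learned residual graph."""
    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)                  # (V, V), fixed
        self.B = nn.Parameter(torch.zeros_like(adjacency))    # learned residual
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):                   # x: (B, T, V, C) landmark series
        agg = torch.einsum("uv,btvc->btuc", self.A + self.B, x)
        return torch.relu(self.proj(agg))

V = 18                                      # number of anatomical landmarks
A = torch.eye(V)                            # stand-in for the true skeleton graph
x = torch.randn(4, 120, V, 3)               # 4 clips, 120 frames, 3-D landmarks
h = AdaptiveGraphConv(3, 64, A)(x)
age = nn.Linear(64, 1)(h.mean(dim=(1, 2)))  # regress maturation/age from pooled features
print(age.shape)
```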
https://arxiv.org/abs/2402.14400
In the fast-paced field of human-computer interaction (HCI) and virtual reality (VR), automatic gesture recognition has become increasingly essential. This is particularly true for the recognition of hand signs, which provide an intuitive way to effortlessly navigate and control VR and HCI applications. Given increasing privacy requirements, radar sensors emerge as a compelling alternative to cameras: they operate effectively in low-light conditions without capturing identifiable human details, thanks to their lower resolution and distinct wavelength compared to visible light. While previous works predominantly deploy radar sensors for dynamic hand gesture recognition based on Doppler information, our approach prioritizes classification using an imaging radar that operates on spatial information, i.e., image-like data. However, generating the large training datasets required for neural networks (NNs) is a time-consuming and challenging process, often falling short of covering all potential scenarios. Acknowledging these challenges, this study explores the efficacy of synthetic data generated by an advanced radar ray-tracing simulator. The simulator employs an intuitive material model that can be adjusted to introduce data diversity. Although the NN is trained exclusively on synthetic data, it demonstrates promising performance when put to the test with real measurement data. This emphasizes the practicality of our methodology in overcoming data scarcity and advancing automatic gesture recognition in VR and HCI applications.
https://arxiv.org/abs/2402.12800
Gesture recognition from low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, data variability between inter-session and inter-subject scenarios presents a great challenge. Existing approaches employ very large and complex deep ConvNet- or 2SRNN-based domain adaptation methods to approximate the distribution shift caused by this inter-session and inter-subject data variability. These methods therefore require learning millions of training parameters and large pre-training and target-domain datasets in both the pre-training and adaptation stages, which makes them resource-bound and computationally very expensive to deploy in real-time applications. To overcome this problem, we propose a lightweight All-ConvNet+TL model that leverages a lightweight all-convolutional network and transfer learning (TL) to enhance inter-session and inter-subject gesture recognition performance. The All-ConvNet+TL model consists solely of convolutional layers, a simple yet efficient framework for learning invariant and discriminative representations that address the distribution shifts caused by inter-session and inter-subject data variability. Experiments on four datasets demonstrate that our proposed methods outperform the most complex existing approaches by a large margin, achieve state-of-the-art results in inter-session and inter-subject scenarios, and perform on par or competitively on intra-session gesture recognition. These performance gaps increase even more when only a tiny amount of data (e.g., a single trial) is available in the target domain for adaptation. These experimental results provide evidence that current state-of-the-art models may be overparameterized for sEMG-based inter-session and inter-subject gesture recognition tasks.
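An illustrative PyTorch sketch of the two named ingredients, an all-convolutional classifier and a simple transfer-learning recipe (freeze early layers, fine-tune the rest on a little target-domain data); the actual All-ConvNet+TL configuration is not reproduced here.

```python
import torch
import torch.nn as nn

class AllConvNet(nn.Module):
    """Sketch of an all-convolutional classifier for instantaneous HD-sEMG
    images: convolutions only, with striding instead of pooling and a 1x1
    conv plus global averaging instead of dense layers."""
    def __init__(self, n_gestures=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_gestures, 1))            # 1x1 conv as classifier

    def forward(self, x):                             # x: (B, 1, H, W) sEMG image
        return self.features(x).mean(dim=(2, 3))     # global average -> logits

model = AllConvNet()
# Transfer learning for a new session/subject: freeze the first two conv
# layers, fine-tune the rest on a tiny target-domain set (e.g., one trial).
for p in list(model.features.parameters())[:4]:      # weights+biases of 2 convs
    p.requires_grad = False
opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
print(model(torch.randn(2, 1, 16, 8)).shape)         # low-resolution sEMG grid
```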
https://arxiv.org/abs/2305.08014
The development of advanced surgical systems embedding the master-slave control strategy introduced the possibility of remote interaction between the surgeon and the patient, also known as teleoperation. The present paper aims to integrate innovative technologies into the teleoperation process to enhance workflow during surgeries. The proposed system incorporates a collaborative robot, the Kuka IIWA LBR, and the HoloLens 2 (an augmented reality device), allowing the user to control the robot in an expansive environment that integrates the actual scene (real data) with additional digital information imported via the HoloLens 2. Experimental data demonstrate the user's ability to control the Kuka IIWA using various gestures to position it with respect to real or digital objects. Thus, this system offers a novel solution for manipulating surgical robots in a more intuitive manner, contributing to a reduction in the learning curve for surgeons. Calibration and testing in multiple scenarios demonstrate the efficiency of the system in providing seamless movements.
https://arxiv.org/abs/2402.12002
While myoelectric control has recently become a focus of increased research as a possible flexible hands-free input modality, current control approaches are prone to inadvertent false activations in real-world conditions. In this work, a novel myoelectric control paradigm -- on-demand myoelectric control -- is proposed, designed, and evaluated to reduce the number of unrelated muscle movements that are incorrectly interpreted as input gestures. By leveraging the concept of wake gestures, users were able to switch between a dedicated control mode and a sleep mode, effectively eliminating inadvertent activations during activities of daily living (ADLs). The feasibility of wake gestures was demonstrated through two online ubiquitous EMG control tasks with varying difficulty levels: dismissing an alarm and controlling a robot. The proposed control scheme appropriately ignored almost all non-targeted muscular inputs during ADLs (>99.9%) while maintaining sufficient sensitivity for reliable mode switching during intentional wake gesture elicitation. These results highlight the potential of wake gestures as a critical step towards enabling ubiquitous, on-demand myoelectric input for a wide range of applications.
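The wake-gesture logic amounts to a small mode state machine wrapped around the gesture classifier. A minimal sketch, with a hypothetical gesture vocabulary standing in for the paper's actual gestures:

```python
import enum

class Mode(enum.Enum):
    SLEEP = 0       # all gesture predictions ignored
    ACTIVE = 1      # predictions forwarded as control inputs

# Hypothetical on-demand control loop: only a dedicated wake/sleep gesture
# toggles the mode, so everyday muscle activity cannot trigger commands.
WAKE_GESTURE = "double_flex"                  # stand-in name, not the paper's

def on_prediction(gesture, state):
    if gesture == WAKE_GESTURE:
        state["mode"] = Mode.ACTIVE if state["mode"] is Mode.SLEEP else Mode.SLEEP
        return None                           # the mode switch itself is not a command
    if state["mode"] is Mode.ACTIVE:
        return gesture                        # e.g., dismiss alarm, drive robot
    return None                               # ADL movement safely ignored

state = {"mode": Mode.SLEEP}
for g in ["wrist_rotation", "double_flex", "hand_open", "double_flex", "hand_open"]:
    print(g, "->", on_prediction(g, state), state["mode"].name)
```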
https://arxiv.org/abs/2402.10050
This paper presents a hand shape classification approach employing multiscale template matching. Background subtraction is used to derive a binary image of the hand object, enabling the extraction of key features such as the centroid and bounding box. The methodology, while simple, demonstrates effectiveness in basic hand shape classification tasks, laying the foundation for potential applications in straightforward human-computer interaction scenarios. Experimental results highlight the system's capability in controlled environments.
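A sketch of the described pipeline using standard OpenCV calls (background subtraction via MOG2, image moments for the centroid, bounding-box extraction, and normalized cross-correlation template matching at several scales); the thresholds, scales, and templates are placeholders, not the paper's values.

```python
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2()

def classify(frame, templates):               # templates: {label: binary image}
    mask = subtractor.apply(frame)            # background subtraction
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    m = cv2.moments(mask)
    if m["m00"] == 0:
        return None                           # no hand object found
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]       # centroid
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))    # bounding box
    best, best_score = None, -1.0
    for label, tmpl in templates.items():
        for scale in (0.75, 1.0, 1.25):                     # multiscale matching
            t = cv2.resize(tmpl, None, fx=scale, fy=scale)
            if t.shape[0] > mask.shape[0] or t.shape[1] > mask.shape[1]:
                continue
            score = cv2.matchTemplate(mask, t, cv2.TM_CCOEFF_NORMED).max()
            if score > best_score:
                best, best_score = label, score
    return best, (cx, cy), (x, y, w, h)

templates = {"open_hand": np.random.randint(0, 2, (40, 40), np.uint8) * 255}
for _ in range(3):                            # stand-in for camera frames
    frame = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
    print(classify(frame, templates))
```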
https://arxiv.org/abs/2402.09663
Sign language discourse is an essential mode of daily communication for deaf and hard-of-hearing people. However, research on Bangla Sign Language (BdSL) faces notable limitations, primarily due to the lack of datasets. Recognizing word-level signs in BdSL (WL-BdSL) presents a multitude of challenges, including the need for well-annotated datasets, capturing the dynamic nature of sign gestures from facial or hand landmarks, developing suitable machine learning or deep learning models with substantial video samples, and so on. In this paper, we address these challenges by creating a comprehensive BdSL word-level dataset named BdSLW60 in an unconstrained and natural setting, allowing positional and temporal variations and allowing sign users to change hand dominance freely. The dataset encompasses 60 Bangla sign words, with a significant scale of 9,307 video trials provided by 18 signers under the supervision of a sign language professional. The dataset was rigorously annotated and cross-checked by 60 annotators. We also introduce a unique relative quantization-based key frame encoding technique for landmark-based sign gesture recognition. We report benchmarks of our BdSLW60 dataset using a Support Vector Machine (SVM) with testing accuracy up to 67.6% and an attention-based bi-LSTM with testing accuracy up to 75.1%. The dataset is available at this https URL and the code base is accessible from this https URL.
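As a hedged sketch of the attention-based bi-LSTM baseline, assuming per-frame hand-landmark features as input (the relative quantization-based key frame encoding is not reproduced here, and the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    """Landmark frames are encoded by a bidirectional LSTM, a learned
    attention vector pools the frames, and a linear layer predicts one of
    the 60 sign words."""
    def __init__(self, feat_dim=2 * 21, hidden=128, n_classes=60):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                     # x: (B, T, feat_dim) landmarks
        h, _ = self.lstm(x)                   # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)
        return self.cls((w * h).sum(dim=1))   # attention-pooled -> word logits

x = torch.randn(4, 80, 42)                    # 80 frames of 21 2-D hand landmarks
print(AttnBiLSTM()(x).shape)                  # torch.Size([4, 60])
```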
https://arxiv.org/abs/2402.08635
We present a compact spiking convolutional neural network (SCNN) and spiking multilayer perceptron (SMLP) that recognize ten different gestures in dark and bright environments, using a 9.6 single-photon avalanche diode (SPAD) array. In our hand gesture recognition (HGR) system, photon intensity data was leveraged to train and test the networks. A vanilla convolutional neural network (CNN) was also implemented to compare the performance of the SCNN under the same network topology and training strategy. Our SCNN was trained from scratch rather than converted from the CNN. We tested the three models in dark and ambient-light (AL)-corrupted environments. The results indicate that the SCNN achieves accuracy (90.8%) comparable to the CNN (92.9%) while requiring fewer floating-point operations, using only 8 timesteps. The SMLP likewise offers a trade-off between computational workload and accuracy. The code and collected datasets of this work are available at this https URL.
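The SCNN layers themselves are not detailed in the abstract; the sketch below shows the standard ingredient that makes training a spiking network from scratch possible, a leaky integrate-and-fire neuron unrolled over timesteps with a surrogate gradient for the non-differentiable spike.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a surrogate gradient, so the network can be
    trained from scratch with backprop rather than converted from a CNN."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad):
        (v,) = ctx.saved_tensors
        return grad / (1 + torch.abs(v)) ** 2   # fast-sigmoid surrogate

def lif_layer(currents, beta=0.9, threshold=1.0):
    """Leaky integrate-and-fire over T timesteps; currents: (T, B, C)."""
    v = torch.zeros_like(currents[0])
    spikes = []
    for t in range(currents.shape[0]):          # e.g., the paper's 8 timesteps
        v = beta * v + currents[t]               # leaky integration
        s = SpikeFn.apply(v - threshold)
        v = v - s * threshold                    # soft reset after a spike
        spikes.append(s)
    return torch.stack(spikes)

out = lif_layer(torch.randn(8, 2, 64, requires_grad=True))
print(out.shape, out.sum().item())               # spike train and total spikes
```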
https://arxiv.org/abs/2402.05441
We seek to enable classic processing of continuous, ultra-sparse spatiotemporal data generated by event-based sensors with dense machine learning models. We propose a novel hybrid pipeline composed of asynchronous sensing and synchronous processing that combines several ideas: (1) an embedding based on PointNet models -- the ALERT module -- that can continuously integrate new events and dismiss old ones thanks to a leakage mechanism; (2) a flexible readout of the embedded data that allows any downstream model to be fed with always up-to-date features at any sampling rate; (3) exploiting the input sparsity in a patch-based approach inspired by the Vision Transformer to optimize the efficiency of the method. These embeddings are then processed by a transformer model trained for object and gesture recognition. Using this approach, we achieve state-of-the-art performance with lower latency than competitors. We also demonstrate that our asynchronous model can operate at any desired sampling rate.
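Only the leakage-and-readout idea is sketched here; the real ALERT module is PointNet-based, whereas this toy stand-in accumulates events into a decaying spatial grid whose state can be read out at any moment, hence at any sampling rate.

```python
import torch

def leaky_event_embedding(events, grid=(16, 16), leak=0.99, state=None):
    """Toy sketch of a continuously updated, leaky event embedding: each new
    batch of events is accumulated into a spatial grid whose previous
    contents decay, so old events are gradually dismissed."""
    if state is None:
        state = torch.zeros(grid)
    state = state * leak                        # dismiss old events
    if events.numel():
        x = (events[:, 0] * grid[1]).long().clamp(0, grid[1] - 1)
        y = (events[:, 1] * grid[0]).long().clamp(0, grid[0] - 1)
        state.index_put_((y, x), events[:, 2], accumulate=True)   # add new ones
    return state

state = None
for _ in range(5):                              # asynchronous event batches
    ev = torch.rand(100, 3)                     # (x, y, polarity) in [0, 1]
    state = leaky_event_embedding(ev, state=state)
print(state.flatten()[:5])                      # readout, valid at any time
```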
https://arxiv.org/abs/2402.01393