Abstract
Teleoperation of robotic hands is limited by the high cost of the depth cameras and sensor gloves commonly used to estimate relative joint positions (XYZ). We present THETA, a novel, cost-effective approach that uses three webcams for triangulation-based tracking to approximate the relative joint angles (theta) of human fingers. We also introduce a modified DexHand, a low-cost robotic hand from TheRobotStudio, to demonstrate THETA's real-time application. Data collection involved 40 distinct hand gestures captured by three 640x480 webcams arranged at 120-degree intervals, generating over 48,000 RGB images. Ground-truth joint angles were determined manually by measuring the midpoints of the MCP, PIP, and DIP finger joints. Captured RGB frames were processed by a DeepLabV3 segmentation model with a ResNet-50 backbone for multi-scale hand segmentation. The segmented images were then HSV-filtered and fed into THETA's architecture, which consists of a MobileNetV2-based CNN classifier optimized for hierarchical spatial feature extraction and a 9-channel input tensor encoding multi-perspective hand representations. The classification model maps segmented hand views to discrete joint angles, achieving 97.18% accuracy, 98.72% recall, 89.06% precision, and an F1 score of 0.9274. In real-time inference, THETA captures simultaneous frames, segments the hand regions, filters them, and compiles a 9-channel tensor for classification. Joint-angle predictions are relayed over a serial connection to an Arduino, enabling the DexHand to replicate the operator's hand movements. Future research will increase dataset diversity, integrate wrist tracking, and apply computer vision techniques such as OpenAI-Vision. THETA offers cost-effective, user-friendly teleoperation for medical, linguistic, and manufacturing applications.
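The multi-view encoding described above can be illustrated with a minimal sketch: three segmented RGB frames (one per webcam) are stacked channel-wise into the 9-channel input tensor for the classifier. The function name and the use of NumPy are assumptions for illustration; the actual preprocessing (DeepLabV3 segmentation and HSV filtering) is omitted here.

```python
import numpy as np

def build_nine_channel_tensor(frames):
    """Stack three RGB views of shape (H, W, 3) into one (H, W, 9) tensor.

    A hypothetical sketch of the multi-perspective encoding described in
    the abstract; segmentation and HSV filtering happen upstream.
    """
    assert len(frames) == 3, "expected one frame per webcam"
    return np.concatenate(frames, axis=-1)

# Dummy 640x480 frames standing in for the three camera views.
views = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
tensor = build_nine_channel_tensor(views)
print(tensor.shape)
```

In this sketch the resulting tensor has shape (480, 640, 9), one channel triple per camera view, which the MobileNetV2-based classifier would then map to discrete joint angles.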
URL
https://arxiv.org/abs/2601.07768