Current sign language machine translation systems rely on recognizing hand movements, facial expressions, and body postures, combined with natural language processing, to convert signs into text. Recent approaches use Transformer architectures to model long-range dependencies via positional encoding. However, they lack accuracy in recognizing fine-grained, short-range temporal dependencies between gestures captured at high frame rates. Moreover, their high computational complexity leads to inefficient training. To mitigate these issues, we propose an Adaptive Transformer (ADAT), which incorporates components for enhanced feature extraction and adaptive feature weighting through a gating mechanism to emphasize contextually relevant features while reducing training overhead and maintaining translation accuracy. To evaluate ADAT, we introduce MedASL, the first public medical American Sign Language dataset. In sign-to-gloss-to-text experiments, ADAT outperforms the encoder-decoder transformer, improving BLEU-4 accuracy by 0.1% while reducing training time by 14.33% on PHOENIX14T and 3.24% on MedASL. In sign-to-text experiments, it improves accuracy by 8.7% and reduces training time by 2.8% on PHOENIX14T, and achieves 4.7% higher accuracy and 7.17% faster training on MedASL. Compared to encoder-only and decoder-only baselines in sign-to-text, ADAT is at least 6.8% more accurate despite being up to 12.1% slower due to its dual-stream structure.
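As a rough illustration of the adaptive feature-weighting idea, the sketch below shows a minimal gating module in PyTorch; the module name, dimensions, and placement before a Transformer encoder are assumptions for illustration, not the authors' ADAT implementation.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureGate(nn.Module):
    """Hypothetical adaptive feature-weighting gate (a sketch, not the ADAT release)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # A small projection predicts one gate value per feature channel.
        self.gate = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) visual features extracted from a sign video.
        g = self.gate(x)        # gate in (0, 1) per channel and frame
        return x * g            # emphasize contextually relevant features

if __name__ == "__main__":
    frames = torch.randn(2, 64, 512)          # 2 clips, 64 frames, 512-d features
    gated = AdaptiveFeatureGate(512)(frames)  # would feed a Transformer encoder next
    print(gated.shape)                        # torch.Size([2, 64, 512])
```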
https://arxiv.org/abs/2504.11942
In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets covering a range of formal and informal TGs, such as 'Stop', 'Reverse', 'Hail', etc. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)". They are annotated with natural language describing the pedestrian's body position and gesture. We evaluate models using three methods that rely on expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret zero-shot human traffic gestures, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.
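The first two evaluation protocols can be approximated with off-the-shelf tooling. The sketch below assumes sentence-transformers (with an arbitrary embedding model) for caption similarity and scikit-learn for the classification F1; the captions, labels, and model choice are placeholders rather than the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import f1_score

# Placeholder data: expert captions vs. VLM-generated captions, plus gesture labels.
expert_captions = ["The officer raises a flat palm toward the car, signalling stop."]
vlm_captions    = ["A person holds up one hand in front of the vehicle."]
true_labels = ["stop"]
pred_labels = ["hail"]

# (1) Caption similarity: cosine similarity of sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
emb_a = encoder.encode(expert_captions, convert_to_tensor=True)
emb_b = encoder.encode(vlm_captions, convert_to_tensor=True)
print("caption similarity:", float(util.cos_sim(emb_a, emb_b)[0, 0]))

# (2) Gesture classification: macro-averaged F1 against expert labels.
print("macro F1:", f1_score(true_labels, pred_labels, average="macro", zero_division=0))
```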
https://arxiv.org/abs/2504.10873
Sign languages are dynamic visual languages that involve hand gestures, in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
https://arxiv.org/abs/2504.10822
Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life involves automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) framework, termed ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including environmental robustness, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures. Additionally, a self-collected dataset using smartphone recordings in more adaptable experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures. The results not only confirm the feasibility of utilizing skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.
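A heavily simplified sketch of the skeleton-based pipeline is given below: one graph aggregation step over 2D joints per frame followed by a bidirectional LSTM over time. The joint count, adjacency (an identity placeholder), and class set are assumptions; the authors' ST-GCN-BiLSTM uses dilated spatial-temporal graph convolutions rather than this single linear aggregation.

```python
import torch
import torch.nn as nn

class SkeletonGCNBiLSTM(nn.Module):
    """Simplified sketch of a skeleton-graph + BiLSTM intake-gesture detector (assumed layout)."""
    def __init__(self, num_joints=18, in_dim=2, hidden=64, num_classes=3):
        super().__init__()
        # Normalized adjacency of the skeleton graph (identity used as a placeholder here).
        self.register_buffer("adj", torch.eye(num_joints))
        self.gcn = nn.Linear(in_dim, hidden)                  # per-joint feature lift
        self.lstm = nn.LSTM(hidden * num_joints, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)        # e.g. eat / drink / background

    def forward(self, x):
        # x: (batch, time, joints, in_dim) 2D keypoints per frame.
        b, t, j, d = x.shape
        h = torch.einsum("ij,btjd->btid", self.adj, x)        # aggregate neighbouring joints
        h = torch.relu(self.gcn(h)).reshape(b, t, -1)         # flatten joints per frame
        h, _ = self.lstm(h)                                   # temporal context in both directions
        return self.head(h)                                   # per-frame class scores

if __name__ == "__main__":
    clip = torch.randn(4, 120, 18, 2)          # 4 clips, 120 frames, 18 joints, (x, y)
    print(SkeletonGCNBiLSTM()(clip).shape)     # torch.Size([4, 120, 3])
```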
https://arxiv.org/abs/2504.10635
Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric, hand-specific features while adaptively enhancing inter-hand coordination during denoising. The system operates hierarchically: it first predicts 3D hand positions from audio features and then generates joint angles through position-aware diffusion models, where parallel denoising streams interact via HCAA. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics.
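The dual-noise initialization is easy to picture in code. The sketch below samples independent latent noise for each hand while both denoising streams share one positional condition; shapes and the conditioning format are assumptions, not the paper's implementation.

```python
import torch

def init_dual_stream_noise(batch, frames, joint_dim, shared_positions):
    """Sketch of dual-noise initialization: independent latents per hand,
    one shared positional condition (assumed shapes, not the paper's code)."""
    noise_left  = torch.randn(batch, frames, joint_dim)   # latent noise for the left hand
    noise_right = torch.randn(batch, frames, joint_dim)   # independent latent for the right hand
    # Both denoising streams see the same predicted 3D hand positions as conditioning.
    cond = shared_positions.expand(batch, frames, -1)
    return (noise_left, cond), (noise_right, cond)

if __name__ == "__main__":
    positions = torch.zeros(1, 240, 6)                    # e.g. 3D position of each wrist
    (l, c1), (r, c2) = init_dual_stream_noise(2, 240, 48, positions)
    print(l.shape, r.shape, torch.equal(c1, c2))          # independent noise, shared condition
```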
https://arxiv.org/abs/2504.09885
This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study, we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without inducing confusion. We formulate a data-driven predictor of human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on the predictor, to adapt the explanation level according to observed human behavior. The promising results from this evaluation indicate the potential of this research in adapting a robot's explanations for failures to enhance the collaborative experience.
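A minimal sketch of such a data-driven confusion predictor is shown below, assuming hand-crafted behavioral features and a scikit-learn classifier on synthetic stand-in data; the feature choices and the threshold used to switch explanation levels are illustrative assumptions, not the study's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical per-interaction features: [negative-emotion score, gaze-on-robot ratio,
# gesture count]; the label marks whether the participant appeared confused.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + 0.5 * rng.random(200) > 0.8).astype(int)   # synthetic stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# The predicted confusion probability could then gate the explanation level,
# e.g. switch to a more detailed explanation when p(confused) exceeds a threshold.
p_confused = clf.predict_proba(X_te[:1])[0, 1]
explanation_level = "detailed" if p_confused > 0.5 else "brief"
print(p_confused, explanation_level)
```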
https://arxiv.org/abs/2504.09717
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset, ImputeRefer, for 3D-ERU by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% higher accuracy than other 3D-ERU models and ~9% higher than purely language-based 3D grounding models. Our code and dataset are available at this https URL.
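To illustrate how a pointing cue can complement language grounding, the sketch below ranks candidate 3D objects by their agreement with an elbow-to-wrist pointing ray. This is a generic geometric heuristic for intuition only, not the Imputer pipeline or the Ges3ViG model.

```python
import numpy as np

def rank_objects_by_pointing(elbow, wrist, object_centers):
    """Illustrative pointing cue: rank 3D objects by how well the direction from the
    wrist to each object aligns with the elbow-to-wrist ray (not Ges3ViG code)."""
    direction = wrist - elbow
    direction /= np.linalg.norm(direction)
    to_objects = object_centers - wrist
    to_objects /= np.linalg.norm(to_objects, axis=1, keepdims=True)
    cosines = to_objects @ direction            # 1.0 = exactly along the pointing ray
    return np.argsort(-cosines), cosines

if __name__ == "__main__":
    elbow, wrist = np.array([0.0, 1.3, 0.0]), np.array([0.3, 1.2, 0.3])
    centers = np.array([[2.0, 0.8, 2.0], [-1.5, 0.5, 0.5], [0.5, 0.0, -2.0]])
    order, scores = rank_objects_by_pointing(elbow, wrist, centers)
    print(order, scores)   # candidate objects sorted by agreement with the pointing ray
```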
https://arxiv.org/abs/2504.09623
We present an intuitive human-drone interaction system that utilizes a gesture-based motion controller to enhance the drone operation experience in real and simulated environments. The handheld motion controller enables natural control of the drone through the movements of the operator's hand, thumb, and index finger: the trigger press manages the throttle, the tilt of the hand adjusts pitch and roll, and the thumbstick controls yaw rotation. Communication with drones is facilitated via the ExpressLRS radio protocol, ensuring robust connectivity across various frequencies. The user evaluation of the flight experience with the designed drone controller using the UEQ-S survey showed high scores for both Pragmatic (mean=2.2, SD = 0.8) and Hedonic (mean=2.3, SD = 0.9) Qualities. This versatile control interface supports applications such as research, drone racing, and training programs in real and simulated environments, thereby contributing to advances in the field of human-drone interaction.
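The control mapping described above is essentially a small function from controller state to normalized command channels, which would then be sent to the drone over ExpressLRS. The sketch below uses assumed tilt ranges and output scaling for illustration; it is not the actual controller firmware.

```python
def controller_to_channels(trigger, hand_pitch_deg, hand_roll_deg, thumb_x):
    """Sketch of the gesture-to-channel mapping described above (assumed ranges):
    trigger -> throttle, hand tilt -> pitch/roll, thumbstick x-axis -> yaw.
    Outputs are normalized to [-1, 1] (throttle to [0, 1])."""
    clamp = lambda v, lo, hi: max(lo, min(hi, v))
    throttle = clamp(trigger, 0.0, 1.0)                 # trigger press in [0, 1]
    pitch = clamp(hand_pitch_deg / 45.0, -1.0, 1.0)     # +/- 45 deg tilt = full deflection
    roll = clamp(hand_roll_deg / 45.0, -1.0, 1.0)
    yaw = clamp(thumb_x, -1.0, 1.0)                     # thumbstick already normalized
    return {"throttle": throttle, "pitch": pitch, "roll": roll, "yaw": yaw}

if __name__ == "__main__":
    # Example: half throttle, nose slightly down, level roll, gentle right yaw.
    print(controller_to_channels(trigger=0.5, hand_pitch_deg=-10, hand_roll_deg=0, thumb_x=0.2))
```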
https://arxiv.org/abs/2504.09510
Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from quadratic complexity in self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time compared to state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.
https://arxiv.org/abs/2504.08718
Mental models and expectations underlying human-human interaction (HHI) inform human-robot interaction (HRI) with domestic robots. To ease collaborative home tasks by improving domestic robot speech and behaviours for human-robot communication, we designed a study to understand how people communicate when failures occur. To identify patterns of natural communication, particularly in response to robotic failures, participants instructed Laundrobot to move laundry into baskets using natural language and gestures. Laundrobot either worked error-free or operated in one of two error modes. Participants were not told that Laundrobot would be played by a human actor, nor were they given information about the error modes. Video analysis of 42 participants found speech patterns that included laughter, verbal expressions, and filler words such as "oh" and "ok", as well as sequences of body movements, including touching one's own face, increased pointing with a static finger, and expressions of surprise. Common strategies deployed when errors occurred included correcting and teaching, taking responsibility, and displays of frustration. The strength of reaction to errors diminished with exposure, possibly indicating acceptance or resignation. Some participants used strategies similar to those used to communicate with other technologies, such as smart assistants. An anthropomorphic robot may not be ideally suited to this kind of task. Laundrobot's appearance, morphology, voice, capabilities, and recovery strategies may have impacted how it was perceived. Some participants indicated that Laundrobot's actual skills were not aligned with their expectations, which made it difficult to know what to expect and how much Laundrobot understood. Expertise, personality, and cultural differences may affect responses; however, these were not assessed.
https://arxiv.org/abs/2504.08395
Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. To improve generation quality, previous works adopted complex input and training strategies and required large datasets for pre-training, which limits their practicality. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without the need for additional training of temporal modules. The entire model makes use of existing pre-trained weights, and only a few thousand frames of data are needed for each character at a time to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using a 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
https://arxiv.org/abs/2504.08344
With the rising interest from the community in digital avatars coupled with the importance of expressions and gestures in communication, modeling natural avatar behavior remains an important challenge across many industries such as teleconferencing, gaming, and AR/VR. Human hands are the primary tool for interacting with the environment and essential for realistic human behavior modeling, yet existing 3D hand and head avatar models often overlook the crucial aspect of hand-body interactions, such as between hand and face. We present InteracttAvatar, the first model to faithfully capture the photorealistic appearance of dynamic hand and non-rigid hand-face interactions. Our novel Dynamic Gaussian Hand model, combining a template model with 3D Gaussian Splatting and a dynamic refinement module, captures pose-dependent changes, e.g., the fine wrinkles and complex shadows that occur during articulation. Importantly, our hand-face interaction module models the subtle geometry and appearance dynamics that underlie common gestures. Through experiments on novel view synthesis, self-reenactment, and cross-identity reenactment, we demonstrate that InteracttAvatar can reconstruct hand and hand-face interactions from monocular or multiview videos with high-fidelity details and be animated with novel poses.
https://arxiv.org/abs/2504.07949
Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, enabling nuanced expression through gestures, facial expressions, and body movements. Despite its critical role in facilitating interaction within the DHH population, significant barriers persist due to the limited fluency in sign language among the hearing population. Overcoming this communication gap through automatic sign language recognition (SLR) remains a challenge, particularly at a dynamic word-level, where temporal and spatial dependencies must be effectively recognized. While Convolutional Neural Networks have shown potential in SLR, they are computationally intensive and have difficulties in capturing global temporal dependencies between video sequences. To address these limitations, we propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition. Transformer models make use of self-attention mechanisms to effectively capture global relationships across spatial and temporal dimensions, which makes them suitable for complex gesture recognition tasks. The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, highlighting its strong performance compared to traditional CNNs with 65.89%. Our study demonstrates that transformer-based architectures have great potential to advance SLR, overcome communication barriers and promote the inclusion of DHH individuals.
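As a sketch of how such a video-transformer classifier can be set up, the snippet below loads a pretrained VideoMAE backbone from Hugging Face Transformers and attaches a 100-class gloss head for WLASL100. The checkpoint name and input shape are assumptions (downloading weights is required), and the authors' training recipe is not reproduced here.

```python
import torch
from transformers import VideoMAEForVideoClassification

# Sketch of a word-level ASL classifier on top of a pretrained video transformer.
# The checkpoint identifier and the 100-gloss label count are assumptions, not the
# paper's exact configuration; the classification head is freshly initialized.
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=100
)

clip = torch.randn(1, 16, 3, 224, 224)         # (batch, frames, channels, height, width)
logits = model(pixel_values=clip).logits       # (1, 100) scores over gloss classes
print("predicted gloss id:", logits.argmax(-1).item())
```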
https://arxiv.org/abs/2504.07792
The growing presence of service robots in human-centric environments, such as warehouses, demands seamless and intuitive human-robot collaboration. In this paper, we propose a collaborative shelf-picking framework that combines multimodal interaction, physics-based reasoning, and task division for enhanced human-robot teamwork. The framework enables the robot to recognize human pointing gestures, interpret verbal cues and voice commands, and communicate through visual and auditory feedback. Moreover, it is powered by a Large Language Model (LLM) that uses Chain-of-Thought (CoT) reasoning and a physics-based simulation engine for safely retrieving boxes from cluttered stacks on shelves, together with a relationship graph for sub-task generation, extraction sequence planning, and decision making. Furthermore, we validate the framework through real-world shelf-picking experiments: 1) Gesture-Guided Box Extraction, 2) Collaborative Shelf Clearing, and 3) Collaborative Stability Assistance.
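As an illustration of how the relationship graph and CoT reasoning might be fed to the LLM, the sketch below builds a hypothetical extraction-planning prompt; the graph format and wording are invented for illustration and are not the paper's actual prompts.

```python
def build_extraction_prompt(target_box, relationship_graph):
    """Hypothetical Chain-of-Thought prompt for extraction-sequence planning; the
    graph encoding and phrasing are illustrative, not the paper's implementation."""
    edges = "\n".join(f"- box {a} rests on box {b}" for a, b in relationship_graph)
    return (
        "You are planning a safe shelf-picking sequence for a robot.\n"
        f"Support relations between boxes:\n{edges}\n"
        f"The human pointed at box {target_box}.\n"
        "Think step by step about which boxes must be moved first so that no stack "
        "collapses, then output the ordered list of sub-tasks."
    )

if __name__ == "__main__":
    # Box 3 sits on box 1; box 1 sits on box 2. Extracting box 2 requires clearing above it.
    print(build_extraction_prompt(target_box=2, relationship_graph=[(3, 1), (1, 2)]))
```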
https://arxiv.org/abs/2504.06593
With the increasing demand for human-computer interaction (HCI), flexible wearable gloves have emerged as a promising solution in virtual reality, medical rehabilitation, and industrial automation. However, the current technology still has problems like insufficient sensitivity and limited durability, which hinder its wide application. This paper presents a highly sensitive, modular, and flexible capacitive sensor based on line-shaped electrodes and liquid metal (EGaIn), integrated into a sensor module tailored to the human hand's anatomy. The proposed system independently captures bending information from each finger joint, while additional measurements between adjacent fingers enable the recording of subtle variations in inter-finger spacing. This design enables accurate gesture recognition and dynamic hand morphological reconstruction of complex movements using point clouds. Experimental results demonstrate that our classifier based on a Convolutional Neural Network (CNN) and Multilayer Perceptron (MLP) achieves an accuracy of 99.15% across 30 gestures. Meanwhile, a transformer-based Deep Neural Network (DNN) accurately reconstructs dynamic hand shapes with an Average Distance (AD) of 2.076 ± 3.231 mm, with the reconstruction accuracy at individual key points surpassing SOTA benchmarks by 9.7% to 64.9%. The proposed glove shows excellent accuracy, robustness and scalability in gesture recognition and hand reconstruction, making it a promising solution for next-generation HCI systems.
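A minimal sketch of a CNN + MLP gesture classifier over windowed capacitance readings is shown below; the channel count, window length, and layer sizes are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GloveGestureNet(nn.Module):
    """Sketch of a CNN + MLP classifier for glove capacitance sequences
    (channel count, window length, and layer sizes are assumptions)."""
    def __init__(self, channels=14, window=100, num_gestures=30):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # collapse the time axis
        )
        self.mlp = nn.Sequential(nn.Flatten(), nn.Linear(64, 128), nn.ReLU(),
                                 nn.Linear(128, num_gestures))

    def forward(self, x):
        # x: (batch, channels, window) raw capacitance readings per sensor channel.
        return self.mlp(self.cnn(x))

if __name__ == "__main__":
    readings = torch.randn(8, 14, 100)
    print(GloveGestureNet()(readings).shape)      # torch.Size([8, 30])
```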
https://arxiv.org/abs/2504.05983
Dyadic social relationships, which refer to relationships between two individuals who know each other through repeated interactions (or not), are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges: (1) the failure to model asymmetric relationships, e.g., one individual may perceive the other as a friend while the other perceives them as an acquaintance, (2) the disruption of continuous interactions by discrete frame sampling, which segments the temporal continuity of interaction in real-world scenarios, and (3) the limited consideration of periodic behavioral cues, such as rhythmic vocalizations or recurrent gestures, which are crucial for inferring the evolution of dyadic relationships. To address these challenges, we propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification, with three core innovations: (i) a triplet graph neural network with node-edge dual attention that dynamically weights multimodal cues to capture interaction asymmetries (addressing challenge 1); (ii) a clip-level relationship learning architecture that preserves temporal continuity, enabling fine-grained modeling of real-world interaction dynamics (addressing challenge 2); and (iii) a periodic temporal encoder that projects time indices onto sine/cosine waveforms to model recurrent behavioral patterns (addressing challenge 3). Extensive experiments on two public datasets demonstrate state-of-the-art performance, while ablation studies validate the critical role of asymmetric interaction modeling and periodic temporal encoding in improving the robustness of dyadic relationship classification in real-world scenarios. Our code is publicly available at: this https URL.
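The periodic temporal encoder is the most self-contained of the three components: time indices are projected onto sine/cosine waveforms at several frequencies. The sketch below assumes geometrically spaced frequencies and an arbitrary output dimensionality; the paper's exact frequency schedule may differ.

```python
import torch

def periodic_time_encoding(time_idx, dim=16):
    """Sketch of a periodic temporal encoder: project integer time indices onto
    sine/cosine waveforms of geometrically spaced frequencies (dimensions assumed)."""
    t = time_idx.float().unsqueeze(-1)                        # (..., 1)
    freqs = torch.exp(torch.arange(dim // 2) *
                      (-torch.log(torch.tensor(10000.0)) / (dim // 2)))
    angles = t * freqs                                        # (..., dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

if __name__ == "__main__":
    clip_frames = torch.arange(8)                  # time indices of sampled frames
    enc = periodic_time_encoding(clip_frames)      # (8, 16) periodic features
    print(enc.shape)
```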
https://arxiv.org/abs/2504.05030
A fundamental challenge in the cognitive sciences is discovering the dynamics that govern behaviour. Take the example of spoken language, which is characterised by a highly variable and complex set of physical movements that map onto the small set of cognitive units that comprise language. What are the fundamental dynamical principles behind the movements that structure speech production? In this study, we discover models in the form of symbolic equations that govern articulatory gestures during speech. A sparse symbolic regression algorithm is used to discover models from kinematic data on the tongue and lips. We explore these candidate models using analytical techniques and numerical simulations, and find that a second-order linear model achieves high levels of accuracy, but a nonlinear force is required to properly model articulatory dynamics in approximately one third of cases. This supports the proposal that an autonomous, nonlinear, second-order differential equation is a viable dynamical law for articulatory gestures in speech. We conclude by identifying future opportunities and obstacles in data-driven model discovery and outline prospects for discovering the dynamical principles that govern language, brain and behaviour.
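A sparse symbolic regression of this kind can be sketched with a candidate-term library and thresholded least squares (an STLSQ-style loop). The example below fits acceleration as a sparse combination of linear and nonlinear terms on a synthetic damped oscillator standing in for an articulator trajectory; the term library and threshold are assumptions, not the study's configuration.

```python
import numpy as np

def discover_dynamics(x, v, a, threshold=0.05, iters=10):
    """Sketch of sparse symbolic regression: regress acceleration on candidate
    terms in (x, v), then iteratively prune small coefficients and refit."""
    library = np.column_stack([x, v, x**3, v**3, x * v])      # candidate force terms
    names = ["x", "v", "x^3", "v^3", "x*v"]
    coef, *_ = np.linalg.lstsq(library, a, rcond=None)
    for _ in range(iters):                                    # threshold, then refit
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        active = ~small
        if active.any():
            coef[active], *_ = np.linalg.lstsq(library[:, active], a, rcond=None)
    return {n: c for n, c in zip(names, coef) if c != 0.0}

if __name__ == "__main__":
    # Synthetic damped oscillator (a is roughly -9*x - 0.4*v) standing in for a tongue trace.
    t = np.linspace(0, 10, 2000)
    x = np.exp(-0.2 * t) * np.cos(3 * t)
    v, a = np.gradient(x, t), np.gradient(np.gradient(x, t), t)
    print(discover_dynamics(x, v, a))    # expected: dominant linear terms in x and v
```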
https://arxiv.org/abs/2504.04849
Point cloud video representation learning is primarily built upon the masking strategy in a self-supervised manner. However, the progress is slow due to several significant challenges: (1) existing methods learn the motion particularly with hand-crafted designs, leading to unsatisfactory motion patterns during pre-training which are non-transferable to fine-tuning scenarios. (2) previous Masked AutoEncoder (MAE) frameworks are limited in resolving the huge representation gap inherent in 4D data. In this study, we introduce the first self-disentangled MAE for learning discriminative 4D representations in the pre-training stage. To address the first challenge, we propose to model the motion representation in a latent space. The second issue is resolved by introducing the latent tokens along with the typical geometry tokens to disentangle high-level and low-level features during decoding. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17 verify this self-disentangled learning framework. We demonstrate that it boosts fine-tuning performance on all 4D tasks; we term this framework Uni4D. Our pre-trained model presents discriminative and meaningful 4D representations and particularly benefits the processing of long videos: Uni4D gains +3.8% segmentation accuracy on HOI4D, significantly outperforming either self-supervised or fully-supervised methods after end-to-end fine-tuning.
https://arxiv.org/abs/2504.04837
Hand gesture recognition using multichannel surface electromyography (sEMG) is challenging due to unstable predictions and inefficient time-varying feature enhancement. To address the lack of effective signal-based time-varying features, we propose a lightweight, squeeze-and-excitation-based, multi-stream spatial-temporal deep learning approach for time-varying feature extraction to build an effective sEMG-based hand gesture recognition system. Each branch of the proposed model was designed to extract hierarchical features, capturing both global and detailed spatial-temporal relationships to ensure feature effectiveness. The first branch, utilizing a Bidirectional-TCN (Bi-TCN), focuses on capturing long-term temporal dependencies by modelling past and future temporal contexts, providing a holistic view of gesture dynamics. The second branch, incorporating a 1D convolutional layer, a separable CNN, and a Squeeze-and-Excitation (SE) block, efficiently extracts spatial-temporal features while emphasizing critical feature channels, enhancing feature relevance. The third branch, combining a Temporal Convolutional Network (TCN) and Bidirectional LSTM (BiLSTM), captures bidirectional temporal relationships and time-varying patterns. Outputs from all branches are fused using concatenation to capture subtle variations in the data and then refined with a channel attention module, selectively focusing on the most informative features while improving computational efficiency. The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively. These results demonstrate the capability of the system to handle complex sEMG dynamics, offering advancements in prosthetic limb control and human-machine interface technologies with significant implications for assistive technologies.
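As a small illustration of the squeeze-and-excitation idea used in the second branch, the sketch below pools each feature channel over time and re-weights channels with a learned gate; the channel count and reduction ratio are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class SqueezeExcite1D(nn.Module):
    """Sketch of a squeeze-and-excitation block for sEMG feature maps: globally pool
    each channel over time, then re-weight channels by a learned gate (sizes assumed)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, time) features from the separable convolution stack.
        squeeze = x.mean(dim=-1)                 # global average pool over time
        excite = self.fc(squeeze).unsqueeze(-1)  # per-channel weights in (0, 1)
        return x * excite                        # emphasize informative channels

if __name__ == "__main__":
    feats = torch.randn(8, 64, 200)              # 8 windows, 64 channels, 200 time steps
    print(SqueezeExcite1D(64)(feats).shape)      # torch.Size([8, 64, 200])
```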
https://arxiv.org/abs/2504.03221
Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision-language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at this http URL.
https://arxiv.org/abs/2504.02244