In this work, we aim to enable legged robots to learn how to interpret human social cues and produce appropriate behaviors through physical human guidance. However, learning through physical engagement can place a heavy burden on users when the process requires large amounts of human-provided data. To address this, we propose a human-in-the-loop framework that enables robots to acquire navigational behaviors in a data-efficient manner and to be controlled via multimodal natural human inputs, specifically gestural and verbal commands. We reconstruct interaction scenes using a physics-based simulation and aggregate data to mitigate distributional shifts arising from limited demonstration data. Our progressive goal cueing strategy adaptively feeds appropriate commands and navigation goals during training, leading to more accurate navigation and stronger alignment between human input and robot behavior. We evaluate our framework across six real-world agile navigation scenarios, including jumping over or avoiding obstacles. Our experimental results show that our proposed method succeeds in almost all trials across these scenarios, achieving a 97.15% task success rate with less than 1 hour of demonstration data in total.
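The data-aggregation step the abstract mentions (mitigating distributional shift from limited demonstrations) is not spelled out; a minimal toy sketch of that general idea — roll out the learner's policy but relabel every visited state with the expert's action, in the spirit of DAgger — is shown below. All interfaces here are hypothetical, not taken from the paper.

```python
def aggregate_rollout(dataset, policy, expert, step, state, horizon=20):
    """Hedged sketch of demonstration-data aggregation: execute the
    *learner's* policy so that states reached via its own mistakes are
    visited, but label each visited state with the expert action, so the
    training distribution covers the learner's induced state distribution."""
    for _ in range(horizon):
        dataset.append((state, expert(state)))   # relabel with expert action
        state = step(state, policy(state))       # but follow the learner
    return dataset

# Toy 1-D navigation: goal at 0; the expert steers toward it, while the
# imperfect learner always drifts right.
expert = lambda s: -1 if s > 0 else 1
learner = lambda s: 1
step = lambda s, a: s + a
data = aggregate_rollout([], learner, expert, step, state=0)
```

Because the learner drifts away from the goal, the aggregated dataset contains expert labels for exactly the off-distribution states a pure behavior-cloning dataset would miss.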
https://arxiv.org/abs/2601.08422
Hand gesture recognition is an important aspect of human-computer interaction. It forms the basis of sign language for visually impaired people. This work proposes a novel hand-gesture recognition system for differently-abled persons. The model uses a convolutional neural network, VGG-16, trained on a widely used image dataset using the Python Keras library. The result is then validated on the NUS dataset, consisting of 10 classes of hand gestures, which is fed to the model as the validation set. Afterwards, a 10-class test dataset is built using Google's open-source Application Programming Interface (API) to capture different human hand gestures, and efficacy is measured experimentally. The experimental results show that, by combining transfer learning with image data augmentation, the VGG-16 net achieves around 98% accuracy.
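The image data augmentation the abstract credits for the accuracy gain is not specified; a purely illustrative numpy sketch of two typical augmentations (random horizontal flip and a small random translation) is below. The function name and parameters are assumptions, not the paper's pipeline.

```python
import numpy as np

def augment(image, rng):
    """Illustrative augmentations of the kind used to enlarge a
    hand-gesture training set: random horizontal flip plus a small
    random sideways translation. `image` is (H, W, C); shape and dtype
    are preserved."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]            # mirror left/right
    dx = int(rng.integers(-10, 11))      # shift up to 10 px sideways
    out = np.roll(out, dx, axis=1)       # wrap-around translation
    return out

rng = np.random.default_rng(42)
img = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)
aug = augment(img, rng)
```

In practice such transforms are applied on the fly during training so the network rarely sees the exact same image twice.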
https://arxiv.org/abs/2601.08262
The teleoperation of robotic hands is limited by the high cost of the depth cameras and sensor gloves commonly used to estimate relative hand joint positions (XYZ). We present THETA, a novel, cost-effective approach that uses three webcams for triangulation-based tracking to approximate the relative joint angles (theta) of human fingers. We also introduce a modified DexHand, a low-cost robotic hand from TheRobotStudio, to demonstrate THETA's real-time application. Data collection involved 40 distinct hand gestures captured by three 640x480p webcams arranged at 120-degree intervals, generating over 48,000 RGB images. Joint angles were determined manually by measuring the midpoints of the MCP, PIP, and DIP finger joints. Captured RGB frames were processed with a DeepLabV3 segmentation model with a ResNet-50 backbone for multi-scale hand segmentation. The segmented images were then HSV-filtered and fed into THETA's architecture, consisting of a MobileNetV2-based CNN classifier optimized for hierarchical spatial feature extraction and a 9-channel input tensor encoding multi-perspective hand representations. The classification model maps segmented hand views to discrete joint angles, achieving 97.18% accuracy, 98.72% recall, an F1 score of 0.9274, and a precision of 0.8906. In real-time inference, THETA captures simultaneous frames, segments the hand regions, filters them, and compiles a 9-channel tensor for classification. Joint-angle predictions are relayed over serial to an Arduino, enabling the DexHand to replicate hand movements. Future research will increase dataset diversity, integrate wrist tracking, and apply computer vision techniques such as OpenAI-Vision. THETA potentially enables cost-effective, user-friendly teleoperation for medical, linguistic, and manufacturing applications.
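The 9-channel input tensor encoding the three 120-degree webcam perspectives is straightforward to assemble: stack the three views' RGB channels. A minimal sketch (function name and exact tensor layout are assumptions, not taken from the paper):

```python
import numpy as np

def build_nine_channel_tensor(views):
    """Stack three RGB frames of shape (H, W, 3) into one (H, W, 9)
    tensor. `views` is a list of three segmented, HSV-filtered frames,
    one per camera; channel-wise stacking preserves each 120-degree
    perspective for the downstream classifier."""
    assert len(views) == 3, "THETA-style input assumes three camera views"
    h, w, _ = views[0].shape
    for v in views:
        assert v.shape == (h, w, 3), "all views must share one resolution"
    return np.concatenate(views, axis=-1)    # (H, W, 9)

# Three dummy 480x640 RGB frames standing in for the webcam captures.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
tensor = build_nine_channel_tensor(frames)
```

A CNN front end (MobileNetV2-style in the paper) would then take this 9-channel tensor as its input instead of the usual 3-channel image.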
https://arxiv.org/abs/2601.07768
Despite advances in upper-limb (UL) prosthetic design, achieving intuitive control of intermediate joints - such as the wrist and elbow - remains challenging, particularly for continuous and velocity-modulated movements. We introduce a novel movement-based control paradigm, Compensation Effect Amplification Control (CEAC), that leverages users' trunk flexion and extension as input for controlling prosthetic elbow velocity. Considering that the trunk can be both a functional and a compensatory joint when performing upper-limb actions, CEAC amplifies the natural coupling between trunk and prosthesis while introducing a controlled delay that allows users to modulate both the position and the velocity of the prosthetic joint. We evaluated CEAC in a generic drawing task performed by twelve able-bodied participants using a supernumerary prosthesis with an active elbow. Additionally, a multiple-target-reaching task was performed by a subset of ten participants. Results demonstrate task performance comparable to that obtained with natural arm movements, even when gesture velocity or drawing size were varied, while maintaining ergonomic trunk postures. Analysis revealed that CEAC effectively restores coordinated joint action and distributes movement effort between the trunk and elbow, enabling intuitive trajectory control without requiring extreme compensatory movements. Overall, CEAC offers a promising control strategy for the intermediate joints of UL prostheses, particularly in tasks requiring continuous and precise coordination.
https://arxiv.org/abs/2601.05074
Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. While some datasets capture both, they are limited in scale and diversity. Conversely, large-scale datasets typically focus either on body motion without hands or on hand motions without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences that capture both hand and body. We then propose the first diffusion-based unconditional full-body motion prior, FUSION, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task in the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION can go beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion including fingers during interaction given the motion of an object, and (2) generating Self-Interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be public.
https://arxiv.org/abs/2601.03959
Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present **EASLT** (**E**motion-**A**ware **S**ign **L**anguage **T**ranslation), a framework that treats facial affect not as auxiliary information but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel *Emotion-Aware Fusion* (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT attains leading performance among gloss-free methods, with BLEU-4 scores of 26.15 and 22.80 and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at this https URL.
https://arxiv.org/abs/2601.03549
Assistive electric-powered wheelchairs (EPWs) have become essential mobility aids for people with disabilities such as amyotrophic lateral sclerosis (ALS), post-stroke hemiplegia, and dementia-related mobility impairment. This work presents a novel multi-modal EPW control system designed to prioritize patient needs while allowing seamless switching between control modes. Four complementary interfaces, namely joystick, speech, hand gesture, and electrooculography (EOG), are integrated with a continuous vital-sign monitoring framework measuring heart rate variability, oxygen saturation (SpO2), and skin temperature. This combination enables greater patient independence while allowing caregivers to maintain real-time supervision and early intervention capability. Two-point calibration of the biophysical sensors against clinical reference devices resulted in root mean square errors of at most 2 bpm for heart rate, 0.5 °C for skin temperature, and 1% for SpO2. Experimental evaluation involved twenty participants with mobility impairments executing a total of 500 indoor navigation commands. The achieved command recognition accuracies were 99% for joystick control, 97% ± 2% for speech, and 95% ± 3% for hand gesture, with an average closed-loop latency of 20 ± 0.5 ms. Caregivers receive real-time alerts through an Android application following encrypted cloud transmission of physiological data. By integrating multi-modal mobility control with cloud-enabled health monitoring and reporting latency and energy budgets, the proposed prototype addresses key challenges in assistive robotics, contributes toward compliance with the ISO 7176-31 and IEC 80601-2-78 safety standards, and establishes a foundation for future adaptive machine learning enhancements.
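Two-point calibration against a clinical reference device amounts to fitting a gain and an offset so that two known raw readings map exactly onto the reference values; residual error then appears only between or outside the calibration points. A minimal sketch with hypothetical heart-rate numbers (the paper reports only the resulting RMSE, not these values):

```python
def two_point_calibration(raw_lo, raw_hi, ref_lo, ref_hi):
    """Return a linear correction f(raw) = gain * raw + offset fitted so
    that the two raw readings map exactly onto the clinical references."""
    gain = (ref_hi - ref_lo) / (raw_hi - raw_lo)
    offset = ref_lo - gain * raw_lo
    return lambda raw: gain * raw + offset

# Hypothetical heart-rate sensor: reads 58 and 118 bpm where the clinical
# reference device reports 60 and 120 bpm.
calibrate = two_point_calibration(58.0, 118.0, 60.0, 120.0)

readings = [58.0, 88.0, 118.0]
corrected = [calibrate(r) for r in readings]

# RMSE against the references; zero here because the toy sensor error is
# itself linear, so the two-point fit removes it entirely.
refs = [60.0, 90.0, 120.0]
rmse = (sum((c, r) == () or (c - r) ** 2 for c, r in zip(corrected, refs))
        / len(corrected)) ** 0.5
```

The same gain/offset scheme applies to the temperature and SpO2 channels, each calibrated against its own reference device.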
https://arxiv.org/abs/2601.02766
Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized, human-like gestures. Existing methods often suffer from issues such as rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a novel dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric that is less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync's superiority: on BEAT2 it outperforms state-of-the-art methods by 30.6% in FGD, 10.3% in Smooth-BC, and 8.4% in Diversity, while reducing jitter and foot sliding by 62.9% and 17.1%, respectively. The code will be released to facilitate future research.
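The abstract does not define the jitter-suppression loss; one common formulation — shown here as an assumption, not the paper's exact loss — penalizes the mean squared second temporal difference (i.e., the discrete acceleration) of the generated motion, which punishes frame-to-frame jitter while leaving smooth trajectories untouched:

```python
import numpy as np

def jitter_loss(motion):
    """Mean squared second temporal difference of a motion sequence.

    `motion` has shape (T, J): T frames of J joint values. The second
    difference m[t+1] - 2*m[t] + m[t-1] approximates acceleration, so
    minimizing it suppresses high-frequency jitter."""
    accel = motion[2:] - 2.0 * motion[1:-1] + motion[:-2]
    return float(np.mean(accel ** 2))

t = np.linspace(0.0, 1.0, 30)[:, None]        # 30 frames, 1 value each
smooth = np.repeat(t, 4, axis=1)              # linear motion of 4 joints
rng = np.random.default_rng(0)
noisy = smooth + 0.05 * rng.standard_normal(smooth.shape)
```

A perfectly linear trajectory has zero loss; additive frame noise raises it, which is the behavior a jitter term added to the diffusion training objective would exploit.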
https://arxiv.org/abs/2601.04236
Real-world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision-language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real-time use. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language-defined safety semantics into a lightweight latent classifier, LSRE enables real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments on six semantic-failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen, semantically similar test cases, indicating that language-guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.
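A "lightweight latent classifier" over world-model latents could be as simple as logistic regression trained on sparse VLM-provided labels; the toy sketch below (all shapes, names, and the training scheme are assumptions, not the paper's method) illustrates why per-step evaluation is cheap enough for 10 Hz operation once the VLM is out of the loop:

```python
import numpy as np

def train_latent_classifier(latents, labels, lr=0.5, steps=500):
    """Fit a logistic-regression decision boundary in latent space.

    `latents` (N, D) are world-model states sparsely labeled by VLM
    judgments (`labels` in {0, 1}); the fitted (w, b) is then evaluated
    per step without any further VLM queries."""
    n, d = latents.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(latents @ w + b)))   # sigmoid
        grad = p - labels                              # dLoss/dlogits
        w -= lr * latents.T @ grad / n
        b -= lr * float(grad.mean())
    return w, b

def semantic_risk(latent, w, b):
    """Probability that the current latent state violates a semantic rule."""
    return 1.0 / (1.0 + np.exp(-(latent @ w + b)))

# Toy separable latents: "safe" states cluster near -1, "risky" near +1.
rng = np.random.default_rng(1)
safe = rng.normal(-1.0, 0.3, size=(50, 2))
risky = rng.normal(+1.0, 0.3, size=(50, 2))
X = np.vstack([safe, risky])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = train_latent_classifier(X, y)
```

Inference is a single dot product and sigmoid per latent state, orders of magnitude cheaper than a per-frame VLM query.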
https://arxiv.org/abs/2512.24712
Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity of automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object-tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging a self-similarity matrix for action-boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
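A self-similarity matrix over per-frame features, paired with a simple novelty score for boundary detection, can be sketched as follows. The paper's exact formulation is not given in the abstract, so the cosine similarity and the before/after-window novelty score here are generic choices, not the authors':

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine self-similarity matrix of per-frame feature vectors (T, D)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    return unit @ unit.T                          # (T, T), symmetric

def boundary_scores(ssm, radius=2):
    """Novelty score per frame: 1 minus the mean similarity between the
    window just before and the window just after the frame. Peaks suggest
    action boundaries."""
    t = ssm.shape[0]
    scores = np.zeros(t)
    for i in range(radius, t - radius):
        cross = ssm[i - radius:i, i:i + radius]   # before-vs-after block
        scores[i] = 1.0 - float(cross.mean())
    return scores

# Toy sequence: 10 frames of one "action" followed by 10 of another.
feats = np.vstack([np.tile([1.0, 0.0], (10, 1)),
                   np.tile([0.0, 1.0], (10, 1))])
ssm = self_similarity_matrix(feats)
scores = boundary_scores(ssm)
```

The novelty peak lands exactly where the feature statistics change, which is the cue an unsupervised clustering stage can then use to group frames into actions.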
https://arxiv.org/abs/2512.23942
Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive performers capable of reacting to audio.
https://arxiv.org/abs/2512.23650
Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To represent gestures comprehensively, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. They are then integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using the SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.
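Cross-modal token fusion is not detailed in the abstract; a deliberately naive sketch — concatenating per-timestep video and pose embeddings and projecting back to the model width with an assumed learned matrix — shows the basic shape of the operation. The paper's module is surely more elaborate; everything below is illustrative:

```python
import numpy as np

def cross_modal_token_fusion(rgb_tokens, pose_tokens, w):
    """Naive fusion sketch: concatenate per-timestep RGB and pose
    embeddings (each (T, D)) and project back to width D. The projection
    `w` stands in for learned fusion weights."""
    fused = np.concatenate([rgb_tokens, pose_tokens], axis=-1)  # (T, 2D)
    return fused @ w                                            # (T, D)

T, D = 8, 16
rng = np.random.default_rng(0)
rgb = rng.standard_normal((T, D))      # e.g. MViTv2-S video embeddings
pose = rng.standard_normal((T, D))     # e.g. 2s-AGCN skeletal embeddings
w = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)
tokens = cross_modal_token_fusion(rgb, pose, w)
```

Real token-fusion modules typically replace the plain projection with cross-attention between the two token streams, but the input/output contract is the same.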
https://arxiv.org/abs/2512.23291
Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically confined to the head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: this https URL.
https://arxiv.org/abs/2512.22065
Creating physically realistic content in VR often requires complex modeling tools or predefined 3D models, textures, and animations, which present significant barriers for non-expert users. In this paper, we propose SketchPlay, a novel VR interaction framework that transforms humans' air-drawn sketches and gestures into dynamic, physically realistic scenes, making content creation as intuitive and playful as drawing. Specifically, sketches capture the structure and spatial arrangement of objects and scenes, while gestures convey physical cues such as velocity, direction, and force that define movement and behavior. By combining these complementary forms of input, SketchPlay captures both the structure and the dynamics of user-created content, enabling the generation of a wide range of complex physical phenomena, such as rigid-body motion, elastic deformation, and cloth dynamics. Experimental results demonstrate that, compared to traditional text-driven methods, SketchPlay offers significant advantages in expressiveness and user experience. By providing an intuitive and engaging creation process, SketchPlay lowers the entry barrier for non-expert users and shows strong potential for applications in education, art, and immersive storytelling.
https://arxiv.org/abs/2512.22016
We open-source \textbf{MiMo-VL-Miloco-7B} and its quantized variant \textbf{MiMo-VL-Miloco-7B-GGUF}, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at \href{this https URL}{this https URL} to support research and deployment in real-world smart-home applications.
https://arxiv.org/abs/2512.17436
Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental to human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, in which human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured, ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse facets of theory of mind, such as mental-state attribution, false-belief reasoning, non-literal communication, social-norm violations, perspective coordination, and multi-agent reasoning.
https://arxiv.org/abs/2512.17394
The growing use of service robots in hospitality highlights the need to understand how to communicate effectively with pre-occupied customers. This study investigates the efficacy of communication modalities commonly used by service robots, namely acoustic/speech, visual display, and micromotion gestures, in capturing attention and communicating intention to a user in a simulated restaurant scenario. We conducted a two-part user study (N=24) using a Temi robot to simulate delivery tasks, with participants engaged in a typing game (MonkeyType) to emulate a state of busyness. The participants' engagement in the typing game is measured by words per minute (WPM) and typing accuracy. In Part 1, we compared a non-verbal acoustic cue against baseline conditions to assess attention capture during a single-cup delivery task. In Part 2, we evaluated the effectiveness of speech, visual display, micromotion, and their multimodal combination in conveying specific intentions (correct cup selection) during a two-cup delivery task. The results indicate that, while speech is highly effective in capturing attention, it is less successful in clearly communicating intention. Participants rated the visual display as the most effective modality for intention clarity, followed by speech, with micromotion rated lowest. These findings provide insights into optimizing communication strategies for service robots, highlighting the distinct roles of attention capture and intention communication in enhancing user experience in dynamic hospitality settings.
https://arxiv.org/abs/2512.17241
We present an innovative end-to-end framework for synthesizing semantically meaningful co-speech gestures and deploying them in real-time on a humanoid robot. This system addresses the challenge of creating natural, expressive non-verbal communication for robots by integrating advanced gesture generation techniques with robust physical control. Our core innovation lies in the meticulous integration of a semantics-aware gesture synthesis module, which derives expressive reference motions from speech input by leveraging a generative retrieval mechanism based on large language models (LLMs) and an autoregressive Motion-GPT model. This is coupled with a high-fidelity imitation learning control policy, the MotionTracker, which enables the Unitree G1 humanoid robot to execute these complex motions dynamically and maintain balance. To ensure feasibility, we employ a robust General Motion Retargeting (GMR) method to bridge the embodiment gap between human motion data and the robot platform. Through comprehensive evaluation, we demonstrate that our combined system produces semantically appropriate and rhythmically coherent gestures that are accurately tracked and executed by the physical robot. To our knowledge, this work represents a significant step toward general real-world use by providing a complete pipeline for automatic, semantic-aware, co-speech gesture generation and synchronized real-time physical deployment on a humanoid robot.
https://arxiv.org/abs/2512.17183
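The retrieval step of the pipeline above can be illustrated with a toy sketch. The abstract does not describe the actual mechanism beyond naming an LLM-based generative retrieval and a Motion-GPT model, so the embeddings, the motion library, and all names below are hypothetical; only the cosine-similarity lookup over a motion library is shown.

```python
import numpy as np

# Hypothetical motion library: gesture name -> embedding of its description.
# In the actual system, embeddings would come from an LLM and the retrieved
# reference motion would seed an autoregressive Motion-GPT model.
MOTION_LIBRARY = {
    "wave":  np.array([0.9, 0.1, 0.0]),
    "point": np.array([0.1, 0.9, 0.1]),
    "shrug": np.array([0.0, 0.2, 0.9]),
}

def retrieve_motion(query_emb, library):
    """Return the library motion whose embedding is most similar (cosine)."""
    best, best_score = None, -1.0
    for name, emb in library.items():
        score = float(np.dot(query_emb, emb)
                      / (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        if score > best_score:
            best, best_score = name, score
    return best, best_score
```

A speech utterance embedded near the "wave" description would retrieve the waving reference motion, which the downstream controller then tracks on the robot.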
This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.
https://arxiv.org/abs/2512.22177
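The real-time webcam pipeline above implies some clip-windowing logic between the video stream and the 3D CNN + LSTM model. The abstract does not give the clip length, stride, or smoothing scheme, so the values below (16-frame clips, stride 8, majority vote over the last 3 predictions) are illustrative assumptions; the model call itself is omitted.

```python
from collections import Counter, deque

class ClipWindow:
    """Sliding window that turns a frame stream into fixed-length clips.

    Assumes a 16-frame clip and a stride of 8 frames; the actual values
    used by the system are not stated in the abstract.
    """
    def __init__(self, clip_len=16, stride=8):
        self.clip_len = clip_len
        self.stride = stride
        self.frames = deque(maxlen=clip_len)
        self._since_last = 0

    def push(self, frame):
        """Add one frame; return a full clip every `stride` frames, else None."""
        self.frames.append(frame)
        self._since_last += 1
        if len(self.frames) == self.clip_len and self._since_last >= self.stride:
            self._since_last = 0
            return list(self.frames)  # would be fed to the 3D CNN + LSTM
        return None

def smooth(predictions, k=3):
    """Majority vote over the last k clip-level sign predictions."""
    return Counter(predictions[-k:]).most_common(1)[0][0]
```

Overlapping clips keep latency low while the majority vote suppresses one-off misclassifications between consecutive clips.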
As the global population ages, many seniors face loneliness. Companion robots offer a potential solution. However, current companion robots often lack advanced functionality, while task-oriented robots are not designed for social interaction, limiting their suitability and acceptance by seniors. Our work introduces a senior-oriented system for quadruped robots that allows for more intuitive user input and provides more socially expressive output. For user input, we implemented a MediaPipe-based module for hand gesture and head movement recognition, enabling control without a remote. For output, we designed and trained robotic dog gestures using curriculum-based reinforcement learning in Isaac Gym, progressing from simple standing to three-legged balancing, leg extensions, and more. The final tests achieved over 95% success on average in simulation, and we validated a key social gesture (the paw-lift) on a Unitree robot. Real-world tests demonstrated the feasibility and social expressiveness of this framework, while also revealing sim-to-real challenges in joint compliance, load distribution, and balance control. These contributions advance the development of practical quadruped robots as social companions for seniors, outline pathways for sim-to-real adaptation, and inform future user studies.
https://arxiv.org/abs/2512.17136
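The gesture-input side of the system above can be sketched with a minimal rule over hand landmarks. The landmark indices follow MediaPipe's documented 21-point hand model (wrist at 0, fingertips at 4/8/12/16/20, PIP joints at 6/10/14/18), but the extension test, the upright-hand assumption, and the gesture-to-command mapping are all simplifications not taken from the paper.

```python
# Indices per MediaPipe's 21-point hand model; thumb is ignored here
# because its extension axis differs from the other fingers.
FINGER_TIPS = [8, 12, 16, 20]   # index, middle, ring, pinky
FINGER_PIPS = [6, 10, 14, 18]

def count_extended_fingers(landmarks):
    """Count extended fingers from 21 (x, y) landmarks.

    Coordinates are normalized image coordinates with y increasing
    downward. A finger counts as extended when its tip lies above its
    PIP joint, which assumes an upright hand; a production module would
    normalize for hand orientation first.
    """
    return sum(1 for tip, pip in zip(FINGER_TIPS, FINGER_PIPS)
               if landmarks[tip][1] < landmarks[pip][1])

def gesture_from_count(n):
    # Hypothetical mapping from finger count to a quadruped command.
    return {0: "stop", 2: "sit", 4: "come"}.get(n, "idle")
```

A fist (no tips above their PIP joints) maps to "stop", while a two-finger pose maps to "sit"; in practice such rules would be replaced or augmented by a learned classifier.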