The field of affective computing has seen significant advancements in exploring the relationship between emotions and emerging technologies. This paper presents a novel and valuable contribution to this field with the introduction of a comprehensive French multimodal dataset designed specifically for emotion recognition. The dataset encompasses three primary modalities: facial expressions, speech, and gestures, providing a holistic perspective on emotions. Moreover, the dataset has the potential to incorporate additional modalities, such as Natural Language Processing (NLP) to expand the scope of emotion recognition research. The dataset was curated through engaging participants in card game sessions, where they were prompted to express a range of emotions while responding to diverse questions. The study included 10 sessions with 20 participants (9 females and 11 males). The dataset serves as a valuable resource for furthering research in emotion recognition and provides an avenue for exploring the intricate connections between human emotions and digital technologies.
https://arxiv.org/abs/2501.08182
Spiking Neural Networks (SNNs) are promising for low-power computation due to their event-driven mechanism but often suffer from lower accuracy compared to Artificial Neural Networks (ANNs). ANN-to-SNN knowledge distillation can improve SNN performance, but previous methods either focus solely on label information, missing valuable intermediate layer features, or use a layer-wise approach that neglects spatial and temporal semantic inconsistencies, leading to performance degradation. To address these limitations, we propose a novel method called self-attentive spatio-temporal calibration (SASTC). SASTC uses self-attention to identify semantically aligned layer pairs between ANN and SNN, both spatially and temporally. This enables the autonomous transfer of relevant semantic information. Extensive experiments show that SASTC outperforms existing methods, effectively solving the mismatching problem. Superior accuracy results include 95.12% on CIFAR-10, 79.40% on CIFAR-100 with 2 time steps, and 68.69% on ImageNet with 4 time steps for static datasets, and 97.92% on DVS-Gesture and 83.60% on DVS-CIFAR10 for neuromorphic datasets. This marks the first time SNNs have outperformed ANNs on both CIFAR-10 and CIFAR-100, shedding new light on the potential applications of SNNs.
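The abstract does not spell out how the spatio-temporal alignment is computed; below is a minimal sketch of the underlying idea, assuming layer features have already been pooled and projected to a common dimension. The shapes, the cosine-style attention scores, and the squared-distance distillation term are all illustrative assumptions, not SASTC's actual formulation.

```python
import torch
import torch.nn.functional as F

def attention_aligned_distill(snn_desc, ann_desc, temperature=0.1):
    """Softly align SNN (student) layers with ANN (teacher) layers.

    snn_desc: [B, L_s, D] pooled, projected, time-averaged SNN layer descriptors.
    ann_desc: [B, L_a, D] pooled, projected ANN layer descriptors.
    Each student layer attends over all teacher layers and is pulled towards
    the teacher descriptors it is most semantically aligned with.
    """
    q = F.normalize(snn_desc, dim=-1)
    k = F.normalize(ann_desc, dim=-1)
    attn = torch.softmax(q @ k.transpose(1, 2) / temperature, dim=-1)   # [B, L_s, L_a]
    diff = (q.unsqueeze(2) - k.unsqueeze(1)).pow(2).sum(-1)             # [B, L_s, L_a]
    return (attn * diff).mean()

# loss = attention_aligned_distill(torch.randn(8, 4, 128), torch.randn(8, 6, 128))
```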
https://arxiv.org/abs/2501.08049
This report introduces Make-A-Character 2, an advanced system for generating high-quality 3D characters from single portrait photographs, ideal for game development and digital human applications. Make-A-Character 2 builds upon its predecessor by incorporating several significant improvements for image-based head generation. We utilize the IC-Light method to correct non-ideal illumination in input photos and apply neural network-based color correction to harmonize skin tones between the photos and game engine renders. We also employ the Hierarchical Representation Network to capture high-frequency facial structures and conduct adaptive skeleton calibration for accurate and expressive facial animations. The entire image-to-3D-character generation process takes less than 2 minutes. Furthermore, we leverage transformer architecture to generate co-speech facial and gesture actions, enabling real-time conversation with the generated character. These technologies have been integrated into our conversational AI avatar products.
https://arxiv.org/abs/2501.07870
Reliable detection and segmentation of human hands are critical for enhancing safety and facilitating advanced interactions in human-robot collaboration. Current research predominantly evaluates hand segmentation under in-distribution (ID) data, which reflects the training data of deep learning (DL) models. However, this approach fails to address out-of-distribution (OOD) scenarios that often arise in real-world human-robot interactions. In this study, we present a novel approach by evaluating the performance of pre-trained DL models under both ID data and more challenging OOD scenarios. To mimic realistic industrial scenarios, we designed a diverse dataset featuring simple and cluttered backgrounds with industrial tools, varying numbers of hands (0 to 4), and hands with and without gloves. For OOD scenarios, we incorporated unique and rare conditions such as finger-crossing gestures and motion blur from fast-moving hands, addressing both epistemic and aleatoric uncertainties. To ensure multiple points of view (PoVs), we utilized both egocentric cameras, mounted on the operator's head, and static cameras to capture RGB images of human-robot interactions. This approach allowed us to account for multiple camera perspectives while also evaluating the performance of models trained on existing egocentric datasets as well as static-camera datasets. For segmentation, we used a deep ensemble model composed of UNet and RefineNet as base learners. Performance evaluation was conducted using segmentation metrics and uncertainty quantification via predictive entropy. Results revealed that models trained on industrial datasets outperformed those trained on non-industrial datasets, highlighting the importance of context-specific training. Although all models struggled with OOD scenarios, those trained on industrial datasets demonstrated significantly better generalization.
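As a rough illustration of the uncertainty quantification described above, predictive entropy over a two-member deep ensemble could be computed as follows; the member models `unet` and `refinenet` and the threshold in the usage comment are placeholders, not the paper's code.

```python
import torch

def ensemble_predictive_entropy(logits_list, eps=1e-8):
    """logits_list: per-member pixel logits, each of shape [B, C, H, W].
    Averages the members' softmax outputs and returns the mean distribution
    together with its per-pixel predictive entropy (higher = more uncertain)."""
    probs = torch.stack([torch.softmax(l, dim=1) for l in logits_list]).mean(dim=0)
    entropy = -(probs * (probs + eps).log()).sum(dim=1)        # [B, H, W]
    return probs, entropy

# Hypothetical usage with the two base learners named in the abstract:
# probs, unc = ensemble_predictive_entropy([unet(x), refinenet(x)])
# mask = probs.argmax(dim=1)            # hand / background segmentation
# ood_like = unc.mean() > threshold     # flag frames the ensemble is unsure about
```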
https://arxiv.org/abs/2501.07713
This paper introduces GestLLM, an advanced system for human-robot interaction that enables intuitive robot control through hand gestures. Unlike conventional systems, which rely on a limited set of predefined gestures, GestLLM leverages large language models and feature extraction via MediaPipe to interpret a diverse range of gestures. This integration addresses key limitations in existing systems, such as restricted gesture flexibility and the inability to recognize complex or unconventional gestures commonly used in human communication. By combining state-of-the-art feature extraction and language model capabilities, GestLLM achieves performance comparable to leading vision-language models while supporting gestures underrepresented in traditional datasets. For example, this includes gestures from popular culture, such as the "Vulcan salute" from Star Trek, without any additional pretraining, prompt engineering, etc. This flexibility enhances the naturalness and inclusivity of robot control, making interactions more intuitive and user-friendly. GestLLM provides a significant step forward in gesture-based interaction, enabling robots to understand and respond to a wide variety of hand gestures effectively. This paper outlines its design, implementation, and evaluation, demonstrating its potential applications in advanced human-robot collaboration, assistive robotics, and interactive entertainment.
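A minimal sketch of how MediaPipe hand landmarks might be serialised into a prompt for an LLM-based gesture interpreter; the prompt wording and the `some_llm_client` call are assumptions, since the abstract does not describe GestLLM's actual prompting scheme.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def hand_landmarks_to_prompt(image_bgr):
    """Extract 21 hand landmarks with MediaPipe and serialise them into a
    text prompt an LLM could interpret as a gesture."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    coords = ", ".join(f"({p.x:.3f}, {p.y:.3f}, {p.z:.3f})" for p in lm)
    return (
        "The 21 normalised hand landmarks (wrist to fingertips) are: "
        f"{coords}. Name the hand gesture being shown (e.g., 'thumbs up', "
        "'Vulcan salute') and a robot command it could map to."
    )

# prompt = hand_landmarks_to_prompt(cv2.imread("frame.jpg"))
# reply = some_llm_client.complete(prompt)   # hypothetical LLM client
```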
https://arxiv.org/abs/2501.07295
We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robot UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object selection with a magnetic snapping effect and robot control via eye gestures. Experimental evaluation involving 13 participants demonstrated that the magnetic snapping effect significantly reduced gaze alignment time, improving task efficiency by 31%. GazeGrasp provides a robust, hands-free interface for assistive robotics, enhancing accessibility and autonomy for users.
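One plausible way to realise the magnetic snapping effect is to pull the estimated gaze point onto the nearest YOLOv8 detection within a fixed pixel radius. The sketch below uses the real `ultralytics` API, but the snapping rule, radius, and checkpoint name are assumptions rather than GazeGrasp's implementation.

```python
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any YOLOv8 detection checkpoint

def snap_gaze_to_object(frame_bgr, gaze_xy, snap_radius=80):
    """Return the centre of the detected object nearest to the estimated gaze
    point if it lies within snap_radius pixels, else the raw gaze point."""
    boxes = model(frame_bgr, verbose=False)[0].boxes.xyxy.cpu().numpy()  # [N, 4]
    if len(boxes) == 0:
        return gaze_xy, None
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    d = np.linalg.norm(centers - np.asarray(gaze_xy), axis=1)
    i = int(d.argmin())
    if d[i] <= snap_radius:
        return tuple(centers[i]), i   # snapped: gaze "locks on" to object i
    return gaze_xy, None
```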
https://arxiv.org/abs/2501.07255
Touch is a fundamental aspect of emotion-rich communication, playing a vital role in human interaction and offering significant potential in human-robot interaction. Previous research has demonstrated that a sparse representation of human touch can effectively convey social tactile signals. However, advances in human-robot tactile interaction remain limited, as many humanoid robots possess simplistic capabilities, such as only opening and closing their hands, restricting nuanced tactile expressions. In this study, we explore how a robot can use sparse representations of tactile vibrations to convey emotions to a person. To achieve this, we developed a wearable sleeve integrated with a 5x5 grid of vibration motors, enabling the robot to communicate diverse tactile emotions and gestures. Using chain prompts within a Large Language Model (LLM), we generated distinct 10-second vibration patterns corresponding to 10 emotions (e.g., happiness, sadness, fear) and 6 touch gestures (e.g., pat, rub, tap). Participants (N = 32) then rated each vibration stimulus based on perceived valence and arousal. People are accurate at recognising intended emotions, a result which aligns with earlier findings. These results highlight the LLM's ability to generate emotional haptic data and effectively convey emotions through tactile signals. By translating complex emotional and tactile expressions into vibratory patterns, this research demonstrates how LLMs can enhance physical interaction between humans and robots.
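The abstract does not specify how the LLM-generated vibration patterns are encoded; one simple assumed format is a JSON array of 5x5 intensity frames played back at a fixed frame rate, as sketched below. The `set_motor` driver callback is hypothetical.

```python
import json
import time

GRID = 5  # 5x5 motor array in the sleeve

def play_pattern(pattern_json, set_motor, frame_rate=10):
    """Play a vibration pattern encoded as JSON of shape [frames][5][5] with
    intensities in 0..1, one frame per 1/frame_rate seconds (100 frames = 10 s).
    set_motor(row, col, intensity) is a hypothetical driver callback, e.g. a
    PWM duty-cycle writer; the JSON layout is an assumed output format for the
    LLM chain prompts described above."""
    frames = json.loads(pattern_json)
    for frame in frames:
        for r in range(GRID):
            for c in range(GRID):
                set_motor(r, c, float(frame[r][c]))
        time.sleep(1.0 / frame_rate)
    for r in range(GRID):               # switch everything off at the end
        for c in range(GRID):
            set_motor(r, c, 0.0)

# Example: ask the LLM for '100 frames (10 s at 10 fps) of a gentle "pat"',
# validate the JSON, then call play_pattern(reply, set_motor=my_pwm_writer).
```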
https://arxiv.org/abs/2501.07224
The fundamental role of personality in shaping interactions is increasingly being exploited in robotics. A carefully designed robotic personality has been shown to improve several key aspects of Human-Robot Interaction (HRI). However, the fragmentation and rigidity of existing approaches reveal even greater challenges when applied to non-humanoid robots. On one hand, the state of the art is very dispersed; on the other hand, Industry 4.0 is moving towards a future where humans and industrial robots are going to coexist. In this context, the proper design of a robotic personality can lead to more successful interactions. This research takes a first step in that direction by integrating a comprehensive cognitive architecture built upon the definition of robotic personality - validated on humanoid robots - into a robotic Kinova Jaco2 arm. The robot personality is defined through the cognitive architecture as a vector in the three-dimensional space encompassing Conscientiousness, Extroversion, and Agreeableness, affecting how actions are executed, the action selection process, and the internal reaction to environmental stimuli. Our main objective is to determine whether users perceive distinct personalities in the robot, regardless of its shape, and to understand the role language plays in shaping these perceptions. To achieve this, we conducted a user study comprising 144 sessions of a collaborative game between a Kinova Jaco2 arm and participants, where the robot's behavior was influenced by its assigned personality. Furthermore, we compared two conditions: in the first, the robot communicated solely through gestures and action choices, while in the second, it also utilized verbal interaction.
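As a loose illustration of how a (Conscientiousness, Extroversion, Agreeableness) vector might modulate how actions are executed, consider the sketch below; the parameter mappings are invented for illustration and are not the paper's validated cognitive architecture.

```python
from dataclasses import dataclass

@dataclass
class RobotPersonality:
    """Personality as a point in (Conscientiousness, Extroversion,
    Agreeableness) space, each trait in [0, 1]."""
    conscientiousness: float
    extroversion: float
    agreeableness: float

    def execution_params(self):
        # Illustrative mappings from traits to execution-style parameters.
        return {
            "speed_scale": 0.5 + 0.5 * self.extroversion,        # livelier motion
            "gesture_amplitude": 0.4 + 0.6 * self.extroversion,
            "path_precision": 0.3 + 0.7 * self.conscientiousness,
            "yield_to_human_prob": 0.2 + 0.8 * self.agreeableness,
        }

# params = RobotPersonality(0.9, 0.3, 0.7).execution_params()
```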
https://arxiv.org/abs/2501.06867
In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoints estimation.
https://arxiv.org/abs/2501.05098
Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. Previous methods have achieved success in generating diverse digital avatars; however, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, the first feed-forward diffusion-based method for generating component-disentangled clothed avatars. To achieve this, we first propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation supports high-resolution and real-time rendering, as well as expressive animation including controllable gestures and facial expressions. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to address the severe occlusion problem of the innermost human body layer. Extensive experiments demonstrate the impressive performance of our method in generating disentangled clothed avatars, and we further explore its applications in component transfer. The project page is available at: this https URL
https://arxiv.org/abs/2501.04631
Gesture recognition is a perceptual user interface based on computer vision (CV) technology that allows a computer to interpret human motions as commands, letting users communicate with a computer without physical input devices and thus making the mouse and keyboard superfluous. Its main weakness is lighting: because gesture control is based on computer vision, it relies heavily on cameras, which interpret gestures in 2D and 3D, so the extracted information can vary with the light source, and such systems cannot work in a dark environment. A simple night-vision camera can be used for motion capture, since it also emits infrared light that is invisible to humans but clearly visible to a camera without an infrared filter; this largely overcomes the limitation of systems that cannot operate in the dark. The video stream from the camera is fed into a Raspberry Pi running a Python program that uses the OpenCV module to detect, isolate, and track the path of a dynamic gesture; a machine learning algorithm then recognizes the drawn pattern and controls the GPIOs of the Raspberry Pi accordingly to perform some activities.
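A minimal sketch of the described pipeline on a Raspberry Pi, using OpenCV background subtraction to isolate and track the moving hand and RPi.GPIO to act on the recognised pattern; `classify_path` stands in for the unspecified machine-learning classifier and the pin assignment is arbitrary.

```python
import cv2
import RPi.GPIO as GPIO  # available on the Raspberry Pi

LED_PIN = 18
GPIO.setmode(GPIO.BCM)
GPIO.setup(LED_PIN, GPIO.OUT)

cap = cv2.VideoCapture(0)                       # night-vision camera stream
back_sub = cv2.createBackgroundSubtractorMOG2()
path = []                                       # centroid trail of the moving hand

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = back_sub.apply(frame)                # isolate the moving hand
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        c = max(contours, key=cv2.contourArea)
        m = cv2.moments(c)
        if m["m00"] > 0:
            path.append((int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])))
    if len(path) > 60:                          # roughly 2 s of trajectory collected
        gesture = classify_path(path)           # hypothetical trained classifier
        GPIO.output(LED_PIN, GPIO.HIGH if gesture == "circle" else GPIO.LOW)
        path.clear()
```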
https://arxiv.org/abs/2501.04002
The vision of adaptive architecture proposes that robotic technologies could enable interior spaces to physically transform in a bidirectional interaction with occupants. Yet, it is still unknown how this interaction could unfold in an understandable way. Inspired by HRI studies where robotic furniture gestured intents to occupants by deliberately positioning or moving in space, we hypothesise that adaptive architecture could also convey intents through gestures performed by a mobile robotic partition. To explore this design space, we invited 15 multidisciplinary experts to join co-design improvisation sessions, where they manually manoeuvred a deactivated robotic partition to design gestures conveying six architectural intents that varied in purpose and urgency. Using a gesture elicitation method alongside motion-tracking data, a Laban-based questionnaire, and thematic analysis, we identified 20 unique gestural strategies. Through categorisation, we introduced architectonic gestures as a novel strategy for robotic furniture to convey intent by indexically leveraging its spatial impact, complementing the established deictic and emblematic gestures. Our study thus represents an exploratory step toward making the autonomous gestures of adaptive architecture more legible. By understanding how robotic gestures are interpreted based not only on their motion but also on their spatial impact, we contribute to bridging HRI with Human-Building Interaction research.
https://arxiv.org/abs/2501.01813
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models' general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
https://arxiv.org/abs/2501.01282
Dynamic gesture recognition is one of the challenging research areas due to variations in pose, size, and shape of the signer's hand. In this letter, a Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer multiscaled head attention model. The proposed model employs different attention dimensions for each head of the transformer, which enables it to provide attention at the multiscale level. Further, in addition to a single modality, recognition performance using multiple modalities is examined. Extensive experiments demonstrate the superior performance of the proposed MsMHA-VTN, with an overall accuracy of 88.22% and 99.10% on the NVGesture and Briareo datasets, respectively.
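The abstract only states that each transformer head uses a different attention dimension; a toy PyTorch block in that spirit might look like the following, where the head dimensions and layer sizes are assumptions rather than the MsMHA-VTN architecture.

```python
import torch
import torch.nn as nn

class MultiScaleHeadAttention(nn.Module):
    """Attention where each head projects to a different dimension
    (e.g., 16/32/64), so heads attend at different scales."""
    def __init__(self, d_model=128, head_dims=(16, 32, 64)):
        super().__init__()
        self.heads = nn.ModuleList()
        for d in head_dims:
            self.heads.append(nn.ModuleDict({
                "q": nn.Linear(d_model, d),
                "k": nn.Linear(d_model, d),
                "v": nn.Linear(d_model, d),
            }))
        self.out = nn.Linear(sum(head_dims), d_model)

    def forward(self, x):                      # x: [B, T, d_model]
        outs = []
        for h in self.heads:
            q, k, v = h["q"](x), h["k"](x), h["v"](x)
            attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
            outs.append(attn @ v)              # [B, T, d_head]
        return self.out(torch.cat(outs, dim=-1))

# y = MultiScaleHeadAttention()(torch.randn(2, 50, 128))   # -> [2, 50, 128]
```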
https://arxiv.org/abs/2501.00935
Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address the challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by the object detection model to gain a global understanding of the environment, and then bounding boxes are estimated based on depth information. By using a large language model (LLM) with voice-to-text commands and temporally aligned selected bounding boxes, robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks with varying levels of complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better performance in HRI in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.
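A rough sketch of the prompt construction and the key control-syntax constraint described above; the action grammar, object encoding, and `llm.complete` client are assumptions rather than the paper's implementation.

```python
import re

ACTION_GRAMMAR = re.compile(
    r"^(PICK|PLACE|POINT)\(obj_\d+\)(;(PICK|PLACE|POINT)\(obj_\d+\))*$"
)  # assumed control syntax: a semicolon-separated list of allowed primitives

def build_prompt(transcript, boxes):
    """transcript: voice-to-text command; boxes: {obj_id: (x, y, z, label)}
    selected around the time the user pointed (temporal alignment is assumed
    to have happened upstream)."""
    scene = "\n".join(f"obj_{i}: {lbl} at {xyz}" for i, (*xyz, lbl) in boxes.items())
    return (
        f"Scene objects:\n{scene}\n"
        f"User said: \"{transcript}\" while pointing at the listed objects.\n"
        "Reply ONLY with actions of the form PICK(obj_k), PLACE(obj_k) or "
        "POINT(obj_k), separated by ';'."
    )

def parse_actions(llm_reply):
    """Reject anything outside the grammar instead of executing it."""
    reply = llm_reply.strip()
    if not ACTION_GRAMMAR.match(reply):
        raise ValueError(f"LLM output violates control syntax: {reply!r}")
    return reply.split(";")

# plan = parse_actions(llm.complete(build_prompt(text, boxes)))  # hypothetical LLM client
```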
https://arxiv.org/abs/2501.00785
Although sign language recognition helps non-hearing-impaired people understand sign language, many hearing-impaired individuals still rely on sign language alone due to limited literacy, underscoring the need for advanced sign language production and translation (SLP and SLT) systems. In the field of sign language production, the lack of adequate models and datasets restricts practical applications. Existing models face challenges in production accuracy and pose control, making it difficult to provide fluent sign language expressions across diverse scenarios. Additionally, data resources are scarce, particularly high-quality datasets with complete sign vocabulary and pose annotations. To address these issues, we introduce CNText2Sign and CNSign, comprehensive datasets to benchmark SLP and SLT, respectively, with CNText2Sign covering gloss and landmark mappings for SLP, and CNSign providing extensive video-to-text data for SLT. To improve the accuracy and applicability of sign language systems, we propose the AuraLLM and SignMST-C models. AuraLLM, incorporating LoRA and RAG techniques, achieves a BLEU-4 score of 50.41 on the CNText2Sign dataset, enabling precise control over gesture semantics and motion. SignMST-C employs self-supervised rapid motion video pretraining, achieving a BLEU-4 score of 31.03/32.08 on the PHOENIX2014-T benchmark, setting a new state-of-the-art. These models establish robust baselines for the datasets released for their respective tasks.
https://arxiv.org/abs/2501.00765
Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors like speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach by focusing on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis to the more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Our research leverages the CREMA-D and MSP-IMPROV corpora as benchmarks and reveals valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight the potential of mouth articulatory gestures as a better constraint for improving emotion recognition across different settings or domains.
https://arxiv.org/abs/2412.19909
This study mainly explores the application of natural gesture recognition based on computer vision in human-computer interaction, aiming to improve the fluency and naturalness of human-computer interaction through gesture recognition technology. In the fields of virtual reality, augmented reality and smart home, traditional input methods have gradually failed to meet the needs of users for interactive experience. As an intuitive and convenient interaction method, gestures have received more and more attention. This paper proposes a gesture recognition method based on a three-dimensional hand skeleton model. By simulating the three-dimensional spatial distribution of hand joints, a simplified hand skeleton structure is constructed. By connecting the palm and each finger joint, a dynamic and static gesture model of the hand is formed, which further improves the accuracy and efficiency of gesture recognition. Experimental results show that this method can effectively recognize various gestures and maintain high recognition accuracy and real-time response capabilities in different environments. In addition, combined with multimodal technologies such as eye tracking, the intelligence level of the gesture recognition system can be further improved, bringing a richer and more intuitive user experience. In the future, with the continuous development of computer vision, deep learning and multimodal interaction technology, natural interaction based on gestures will play an important role in a wider range of application scenarios and promote revolutionary progress in human-computer interaction.
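As an illustration of a simplified 3D hand skeleton, the sketch below assumes a common 21-joint layout (wrist plus four joints per finger), connects the palm to each finger chain, and derives per-joint bending angles as a basic static-gesture feature; the joint indexing and the feature choice are assumptions, not the paper's model.

```python
import numpy as np

# Assumed 21-joint layout: wrist = 0, then four joints per finger.
FINGERS = {name: list(range(1 + 4 * i, 5 + 4 * i))
           for i, name in enumerate(["thumb", "index", "middle", "ring", "pinky"])}
EDGES = [(0, chain[0]) for chain in FINGERS.values()] + \
        [(a, b) for chain in FINGERS.values() for a, b in zip(chain, chain[1:])]

def joint_angles(joints_xyz):
    """joints_xyz: [21, 3] array of 3D joint positions.
    Returns one bending angle (radians) per interior joint of each finger,
    a simple static-gesture feature; a dynamic gesture would stack these
    over time."""
    angles = []
    for chain in FINGERS.values():
        pts = joints_xyz[[0] + chain]               # wrist + finger chain
        for a, b, c in zip(pts, pts[1:], pts[2:]):  # consecutive joint triples
            u, v = a - b, c - b
            cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
            angles.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return np.array(angles)                         # 5 fingers x 3 angles = 15 values
```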
https://arxiv.org/abs/2412.18321
Emotion recognition and touch gesture decoding are crucial for advancing human-robot interaction (HRI), especially in social environments where emotional cues and tactile perception play important roles. However, many humanoid robots, such as Pepper, Nao, and Furhat, lack full-body tactile skin, limiting their ability to engage in touch-based emotional and gesture interactions. In addition, vision-based emotion recognition methods usually face strict GDPR compliance challenges due to the need to collect personal facial data. To address these limitations and avoid privacy issues, this paper studies the potential of using the sounds produced by touching during HRI to recognise tactile gestures and classify emotions along the arousal and valence dimensions. Using a dataset of tactile gestures and emotional interactions from 28 participants with the humanoid robot Pepper, we design an audio-only lightweight touch gesture and emotion recognition model with only 0.24M parameters, 0.94MB model size, and 0.7G FLOPs. Experimental results show that the proposed sound-based touch gesture and emotion recognition model effectively recognises the arousal and valence states of different emotions, as well as various tactile gestures, when the input audio length varies. The proposed model is low-latency and achieves similar results as well-known pretrained audio neural networks (PANNs), but with much smaller FLOPs, parameters, and model size.
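For intuition, a lightweight audio classifier in this spirit could be a small CNN over log-mel spectrograms with two output heads (touch gesture and arousal/valence); the layer sizes below are illustrative and do not reproduce the paper's 0.24M-parameter model.

```python
import torch
import torch.nn as nn

class TinyTouchSoundNet(nn.Module):
    """Small CNN over log-mel spectrograms with two heads: touch-gesture
    class and arousal/valence logits. Global pooling makes it agnostic to
    the input audio length, as the abstract's varying-length setting requires."""
    def __init__(self, n_gestures=6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.gesture_head = nn.Linear(64, n_gestures)
        self.affect_head = nn.Linear(64, 2)   # arousal, valence

    def forward(self, logmel):                # logmel: [B, 1, n_mels, frames]
        z = self.backbone(logmel)
        return self.gesture_head(z), self.affect_head(z)

# g, av = TinyTouchSoundNet()(torch.randn(4, 1, 64, 300))
```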
https://arxiv.org/abs/2501.00038
Good co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn general motions and sparse motions, and then adaptively fuse them. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
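A minimal sketch of the final adaptive-synthesis step, where a learned per-frame semantic score gates how much sparse semantic motion is added to the rhythmic base motion; the sigmoid gating and tensor shapes are assumptions about one way to realise it, not SemTalk's actual module.

```python
import torch
import torch.nn as nn

class AdaptiveSemanticFusion(nn.Module):
    """Fuse rhythm-related base motion with semantic-aware sparse motion
    using a learned per-frame semantic score ([B, T, D] pose features)."""
    def __init__(self, d_motion=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * d_motion, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

    def forward(self, base_motion, sparse_motion):
        s = torch.sigmoid(self.score(torch.cat([base_motion, sparse_motion], -1)))
        # Frames with a high semantic score lean on the sparse (semantic) motion.
        return base_motion + s * sparse_motion, s.squeeze(-1)

# fused, score = AdaptiveSemanticFusion()(torch.randn(2, 120, 256), torch.randn(2, 120, 256))
```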
https://arxiv.org/abs/2412.16563