Conversational scenarios are common in real-world settings, yet existing co-speech motion synthesis approaches often fall short in these contexts, where one person's audio and gestures influence the other's responses. Moreover, most existing methods rely on offline sequence-to-sequence frameworks, which are unsuitable for online applications. In this work, we introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. At the core of our approach is a diffusion-based full-body motion synthesis model conditioned on the past states of both characters, speech audio, and a task-oriented motion trajectory input, allowing for flexible spatial control. To enhance the model's ability to learn diverse interactions, we enrich existing two-person conversational motion datasets with more dynamic and interactive motions. We evaluate our system through multiple experiments and show that it outperforms competing methods across a variety of tasks, including single-person and two-person co-speech motion generation as well as interactive motion generation. To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner.
https://arxiv.org/abs/2412.02419
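A hypothetical sketch of the online autoregressive loop such a system implies: at each step a conditional diffusion model denoises the next motion window for each character, conditioned on both characters' past poses, the current speech features, and an optional trajectory. All module names, shapes, and the noise schedule are illustrative assumptions, not the paper's implementation.

```python
import torch

def sample_next_window(denoiser, past_a, past_b, audio_feat, traj, steps=50, dim=256, win=30):
    """Reverse-diffusion sampling of one motion window (win frames, dim features)."""
    x = torch.randn(1, win, dim)                      # start from Gaussian noise
    for t in reversed(range(steps)):
        t_emb = torch.full((1,), t, dtype=torch.long)
        eps = denoiser(x, t_emb, past_a, past_b, audio_feat, traj)  # predicted noise
        alpha = 1.0 - t / steps                       # toy schedule, not the paper's
        x = (x - (1 - alpha) * eps) / alpha**0.5      # simplified DDPM-style update
    return x

def run_online(denoiser, audio_stream, traj_stream, win=30, dim=256):
    past_a = torch.zeros(1, win, dim)                 # seed past states of character A
    past_b = torch.zeros(1, win, dim)                 # ... and character B
    for audio_feat, traj in zip(audio_stream, traj_stream):
        next_a = sample_next_window(denoiser, past_a, past_b, audio_feat, traj, dim=dim, win=win)
        next_b = sample_next_window(denoiser, past_b, past_a, audio_feat, traj, dim=dim, win=win)
        yield next_a, next_b                          # stream motions out immediately
        past_a, past_b = next_a, next_b               # autoregressive rollover
```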
This review explores the evolution of human-machine interfaces (HMIs) for subsea telerobotics, tracing the transition from traditional first-person "soda-straw" consoles (with their narrow field-of-view camera feeds) to advanced interfaces powered by gesture recognition, virtual reality, and natural language models. First, we discuss various subsea telerobotics applications, current state-of-the-art (SOTA) interface systems, and the challenges they face in robust underwater sensing, real-time estimation, and low-latency communication. Through this analysis, we highlight how advanced HMIs facilitate intuitive interactions between human operators and robots to overcome these challenges. A detailed review then categorizes and evaluates cutting-edge HMI systems based on the features they offer, from both human perspectives (e.g., enhancing operator control and situational awareness) and machine perspectives (e.g., improving safety, mission accuracy, and task efficiency). Moreover, we examine the literature on bidirectional interaction and intelligent collaboration in terms of sensory feedback and intuitive control mechanisms for both physical and virtual interfaces. The paper concludes by identifying critical challenges, open research questions, and future directions, emphasizing the need for multidisciplinary collaboration in subsea telerobotics.
https://arxiv.org/abs/2412.01753
This paper proposes HaGRIDv2, the second version of the widespread Hand Gesture Recognition dataset HaGRID. We add 15 new gestures with conversational and control functions, including two-handed ones. Building on the concepts proposed by HaGRID's authors, we implement a dynamic gesture recognition algorithm and further extend it with three new groups of manipulation gestures. The "no gesture" class is diversified with samples of natural hand movements, which reduces false positives by a factor of six. Combined with the extra samples, the resulting dataset outperforms the original HaGRID for pre-training models on gesture-related tasks. We also achieve the best generalization ability among gesture and hand detection datasets, and the second version improves the quality of gestures generated by a diffusion model. HaGRIDv2, pre-trained models, and the dynamic gesture recognition algorithm are publicly available.
https://arxiv.org/abs/2412.01508
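As an illustration of what a dynamic-gesture layer on top of per-frame static detections can look like, here is a toy sketch that fires a swipe when a tracked hand holds a given static gesture while its center moves far enough horizontally. The thresholds, class names, and detector interface are assumptions, not the released HaGRIDv2 algorithm.

```python
def detect_swipe(frames, detector, hold_gesture="palm", min_dx=0.25):
    """frames: iterable of images; detector(img) -> (gesture_label, (cx, cy))."""
    start_x = None
    for img in frames:
        label, (cx, _) = detector(img)
        if label != hold_gesture:
            start_x = None                      # gesture broken, reset the track
            continue
        if start_x is None:
            start_x = cx                        # start a new horizontal track
        elif cx - start_x > min_dx:
            return "swipe_right"
        elif start_x - cx > min_dx:
            return "swipe_left"
    return "no_gesture"
```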
Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
https://arxiv.org/abs/2412.01106
Dynamic hand gestures play a crucial role in conveying nonverbal information for Human-Robot Interaction (HRI), eliminating the need for complex interfaces. Current models for dynamic gesture recognition suffer from a limited effective recognition range, restricting their application to close-proximity scenarios. In this letter, we present a novel approach to recognizing dynamic gestures at ultra-range distances of up to 28 meters, enabling natural, directive communication for guiding robots in both indoor and outdoor environments. Our proposed SlowFast-Transformer (SFT) model integrates the SlowFast architecture with Transformer layers to efficiently process and classify gesture sequences captured at ultra-range distances, overcoming the challenges of low resolution and environmental noise. We further introduce a distance-weighted loss function shown to enhance learning and improve model robustness at varying distances. Our model demonstrates significant performance improvements over state-of-the-art gesture recognition frameworks, achieving a recognition accuracy of 95.1% on a diverse dataset with challenging ultra-range gestures. This enables robots to react appropriately to human commands from a distance, providing an essential enhancement to HRI, especially in scenarios requiring seamless and natural interaction.
https://arxiv.org/abs/2411.18413
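A minimal sketch of one plausible form of the distance-weighted loss described above: samples recorded farther from the camera receive a larger weight so the classifier does not neglect low-resolution, long-range gestures. The exact weighting function used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def distance_weighted_loss(logits, targets, distances, d_max=28.0, alpha=1.0):
    """logits: (B, C), targets: (B,), distances: (B,) in meters."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = 1.0 + alpha * (distances / d_max)   # farther samples weigh more
    return (weights * per_sample).mean()
```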
Advancing human-robot communication is crucial for autonomous systems operating in dynamic environments, where accurate real-time interpretation of human signals is essential. RoboCup provides a compelling scenario for testing these capabilities, requiring robots to understand referee gestures and whistle signals with minimal reliance on the network. Using the NAO robot platform, this study implements a two-stage pipeline for gesture recognition, based on keypoint extraction and classification, alongside continuous convolutional neural networks (CCNNs) for efficient whistle detection. The proposed approach enhances real-time human-robot interaction in a competitive setting such as RoboCup, offering tools to advance the development of autonomous systems capable of cooperating with humans.
https://arxiv.org/abs/2411.17347
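The two-stage gesture pipeline can be pictured as in the sketch below: stage one extracts body keypoints per frame (any pose estimator can stand in), and stage two classifies the keypoint sequence into a referee gesture. Layer sizes, class counts, and the pose-estimator interface are placeholders, not the NAO implementation.

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    def __init__(self, n_keypoints=17, seq_len=30, n_classes=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # (B, T, K, 2) -> (B, T*K*2)
            nn.Linear(seq_len * n_keypoints * 2, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, keypoint_seq):                       # keypoint_seq: (B, T, K, 2)
        return self.net(keypoint_seq)

def classify_gesture(pose_estimator, classifier, frames):
    """Stage 1: keypoints per frame; stage 2: sequence classification."""
    keypoints = torch.stack([pose_estimator(f) for f in frames])   # (T, K, 2)
    logits = classifier(keypoints.unsqueeze(0))                    # add batch dim
    return logits.argmax(dim=-1)
```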
We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model's ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.
https://arxiv.org/abs/2411.16498
EMG-based hand gesture recognition uses electromyographic (EMG) signals to interpret and classify hand movements by analyzing the electrical activity generated by muscle contractions. It has wide applications in prosthesis control, rehabilitation training, and human-computer interaction. Using electrodes placed on the skin, an EMG sensor captures muscle signals, which are processed and filtered to reduce noise. Numerous feature extraction and machine learning algorithms have been proposed to extract and classify muscle signals and distinguish between various hand gestures. This paper benchmarks the performance of EMG-based hand gesture recognition using novel feature extraction methods, namely fused time-domain descriptors, temporal-spatial descriptors, and wavelet transform-based features, combined with state-of-the-art machine learning and deep learning models. Experimental investigations on the Grabmyo dataset show that a 1D dilated CNN performs best, with an accuracy of 97%, using fused time-domain descriptors such as power spectral moments, sparsity, irregularity factor, and waveform length ratio. Similarly, on the FORS-EMG dataset, a random forest performs best, with an accuracy of 94.95%, using temporal-spatial descriptors (which include time-domain features along with additional features such as the coefficient of variation (COV) and the Teager-Kaiser energy operator (TKEO)).
https://arxiv.org/abs/2411.15655
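For context, the sketch below computes a few classic time-domain EMG features (mean absolute value, waveform length, zero crossings, slope-sign changes) over one analysis window. The fused time-domain and temporal-spatial descriptors benchmarked in the paper are richer, but these illustrate the kind of per-window statistics fed to the classifiers.

```python
import numpy as np

def emg_time_features(window, zc_thresh=1e-4):
    """window: 1-D EMG segment for one channel."""
    mav = np.mean(np.abs(window))                               # mean absolute value
    wl = np.sum(np.abs(np.diff(window)))                        # waveform length
    zc = np.sum((window[:-1] * window[1:] < 0) &
                (np.abs(np.diff(window)) > zc_thresh))          # zero crossings
    d = np.diff(window)
    ssc = np.sum((d[:-1] * d[1:] < 0) & (np.abs(d[:-1]) > zc_thresh))  # slope sign changes
    return np.array([mav, wl, zc, ssc])

# Example: one 200-sample window of synthetic EMG
feats = emg_time_features(np.random.randn(200) * 0.1)
```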
Electromyography (EMG) is a measure of muscular electrical activity and is used in many clinical and biomedical disciplines as well as modern human-computer interaction. Myoelectric prosthetics analyze and classify the electrical signals recorded from the residual limb; the classified output is then used to control the position of motors in a robotic hand, producing movement. The aim of this project is to develop a low-cost and effective myoelectric prosthetic hand that meets the needs of amputees in developing countries. The proposed prosthetic hand should accurately classify five different patterns (gestures) using EMG recordings from three muscles and control a robotic hand accordingly. The robotic hand is composed of two servo motors, allowing two degrees of freedom. After establishing an efficient signal acquisition and amplification system, EMG signals were thoroughly analyzed in the frequency and time domains. Features were extracted from both domains, and a shallow neural network was trained on the two sets of data. Results yielded average classification accuracies of 97.25% and 95.85% for the time and frequency domains, respectively. The time-domain analysis also showed faster computation and response, and was therefore adopted for the classification system. A wrist rotation mechanism was designed and tested to add significant functionality to the prosthetic. The mechanism is controlled by two of the five gestures, one for each direction, adding a third degree of freedom to the overall design. Finally, a tactile sensory feedback system using force sensors and vibration motors was developed to let the user sense the force applied to the hand.
https://arxiv.org/abs/2411.15533
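A hypothetical sketch of the control side: the classified gesture is mapped either to target angles for the two hand servos or, for the two wrist gestures, to an increment of the wrist-rotation servo. Gesture names, angles, and the set_servo driver call are placeholders.

```python
GESTURE_TO_HAND = {          # degrees for (finger_servo, thumb_servo)
    "rest":        (0, 0),
    "power_grip":  (160, 120),
    "pinch":       (90, 100),
}
WRIST_STEP = {"wrist_cw": +10, "wrist_ccw": -10}   # the two wrist-control gestures

def actuate(gesture, wrist_angle, set_servo):
    """set_servo(name, angle) is whatever PWM/driver call the hardware exposes."""
    if gesture in WRIST_STEP:
        wrist_angle = max(0, min(180, wrist_angle + WRIST_STEP[gesture]))
        set_servo("wrist", wrist_angle)
    elif gesture in GESTURE_TO_HAND:
        fingers, thumb = GESTURE_TO_HAND[gesture]
        set_servo("fingers", fingers)
        set_servo("thumb", thumb)
    return wrist_angle
```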
Speech-driven gesture generation using transformer-based generative models is a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexity, limiting scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on Mamba-2. The fuzzy feature extractor, which combines a Chinese pre-trained model with Mamba-2, autonomously extracts implicit, continuous speech features. These features are synthesized into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism to apply transformations uniformly across all sequence tokens, enabling precise modeling of the nuanced interplay between speech features and gesture dynamics. We use a diffusion model to train and infer diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures dataset corroborate the efficacy of the proposed model. Compared with a Transformer-based architecture, our approach delivers competitive results while reducing memory usage by approximately 2.4 times and increasing inference speed by 2 to 4 times. Additionally, we release the CCG dataset, a Chinese Co-Speech Gestures dataset comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.
https://arxiv.org/abs/2411.16729
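The AdaLN mechanism mentioned above can be sketched as follows: LayerNorm without learned affine parameters, whose scale and shift are instead predicted from a conditioning vector (here standing in for the latent speech representation) and applied uniformly to every token. The sequence mixer is stubbed with a linear layer rather than an actual Mamba-2 block.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
        self.mixer = nn.Linear(dim, dim)          # stand-in for the Mamba-2 mixer

    def forward(self, tokens, cond):              # tokens: (B, T, D), cond: (B, C)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return tokens + self.mixer(h)             # residual connection
```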
Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a bidirectional reverse state space model. Unlike conventional bidirectional scanning mechanisms, the proposed Bidirectional Reverse State Space Model (BiR-SSM) leverages a shared weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with ResNet50 as the backbone improves model performance by 0.95%, 3.57%, and 2.9% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT's adaptability and effectiveness. The code will be made available upon acceptance.
https://arxiv.org/abs/2411.15276
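A rough sketch of the weight-sharing idea behind BiR-SSM: the same recurrence parameters process the token sequence forward and in reverse, and the two passes are summed. This toy diagonal recurrence only conveys the sharing; the actual model uses Mamba-style selective dynamics.

```python
import torch
import torch.nn as nn

class SharedBiScan(nn.Module):
    def __init__(self, dim, state=16):
        super().__init__()
        self.A = nn.Parameter(torch.rand(state) * -0.5)   # shared decay per state dim
        self.B = nn.Linear(dim, state, bias=False)        # shared input projection
        self.C = nn.Linear(state, dim, bias=False)        # shared output projection

    def scan(self, x):                                    # x: (B, T, D)
        h = torch.zeros(x.size(0), self.A.numel(), device=x.device)
        outs = []
        for t in range(x.size(1)):
            h = torch.exp(self.A) * h + self.B(x[:, t])   # simple diagonal recurrence
            outs.append(self.C(h))
        return torch.stack(outs, dim=1)

    def forward(self, x):
        return self.scan(x) + self.scan(x.flip(1)).flip(1)  # same weights, both directions
```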
Sign language processing technology development relies on extensive and reliable datasets, instructions, and ethical guidelines. We present a comprehensive Azerbaijani Sign Language Dataset (AzSLD), collected from diverse sign language users across a range of linguistic parameters, to facilitate advancements in sign recognition and translation systems and to support the local sign language community. The dataset was created within the framework of a vision-based AzSL translation project and comprises fingerspelling-alphabet data as well as word- and sentence-level sign language data. It was collected from signers of different ages, genders, and signing styles, with videos recorded from two camera angles to capture each sign in full detail. This approach ensures robust training and evaluation of gesture recognition models. AzSLD contains 30,000 videos, each carefully annotated with accurate sign labels and corresponding linguistic translations. The dataset is accompanied by technical documentation and source code to facilitate its use in training and testing, and it offers a valuable resource of labeled data for researchers and developers working on sign language recognition, translation, or synthesis. Ethical guidelines were strictly followed throughout the project, with all participants providing informed consent for collecting, publishing, and using the data.
https://arxiv.org/abs/2411.12865
The addition of a nonlinear restoring force to dynamical models of the speech gesture significantly improves the empirical accuracy of model predictions, but nonlinearity introduces challenges in parameter selection and numerical stability, especially when modelling variation in empirical data. We address this issue by introducing simple numerical methods for parameterizing nonlinear task dynamic models. We first illustrate the problem and then outline solutions in the form of power laws that scale the nonlinear stiffness terms. We apply the scaling laws to a cubic model and show how they facilitate interpretable simulations of the nonlinear gestural dynamics underpinning speech production.
https://arxiv.org/abs/2411.12720
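As a worked illustration, the toy simulation below integrates a damped gestural dynamic with linear and cubic restoring terms. The scaling shown, k3 = k / target**2, is just one plausible way to keep the cubic stiffness commensurate with the linear one; the paper derives its own power laws.

```python
import numpy as np

def simulate_gesture(target=1.0, k=8.0, b=2.0 * np.sqrt(8.0), dt=0.001, T=2.0):
    k3 = k / target**2                 # scale cubic stiffness by squared amplitude (illustrative)
    x, v = 0.0, 0.0
    traj = []
    for _ in range(int(T / dt)):
        e = x - target                 # displacement from the gestural target
        a = -b * v - k * e - k3 * e**3 # damped linear + cubic restoring force
        v += a * dt                    # simple Euler integration
        x += v * dt
        traj.append(x)
    return np.array(traj)

traj = simulate_gesture()
```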
We present research that aims to bridge users of American Sign Language (ASL) and users of spoken language and Indian Sign Language (ISL). The research yields a novel framework, developed for learner systems, that leverages large models to provide key features, including efficient real-time translation between the two sign languages and seamless LLM-powered translation into ISL; this paper presents the full study and its implementation. The core of the system is a pipeline that begins with recognition of ASL gestures using a strong Random Forest classifier. The recognized ASL is translated into text, which can be processed more easily. Natural language processing (NLP) techniques then come into play in our LLM integration, where the LLM converts the ASL-derived text into ISL while preserving the intent of the sentence or phrase. The final step synthesizes the translated text back into ISL gestures using RIFE-Net, creating an end-to-end translation experience. The framework addresses key challenges such as automatically handling gesture variability and overcoming the linguistic differences between ASL and ISL. By automating the translation process, we hope to greatly improve accessibility for sign language users and reduce the communication gap between ASL and ISL. We believe the same principles can be applied across a wide variety of sign language dialects.
https://arxiv.org/abs/2411.12685
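A schematic sketch of the described pipeline: a Random Forest maps hand-keypoint features to ASL glosses, a stubbed LLM call rewrites the glossed text for ISL, and ISL glosses are looked up in a table of pre-rendered sign clips. The helper names, feature dimensions, and gloss-to-clip table are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

GLOSSES = ["HELLO", "THANK-YOU", "HELP"]
clf = RandomForestClassifier(n_estimators=100)
clf.fit(np.random.rand(60, 42), np.random.choice(GLOSSES, 60))   # toy training data

def asl_to_text(keypoint_features):
    """keypoint_features: (42,) hand-landmark vector for one detected gesture."""
    return clf.predict(keypoint_features.reshape(1, -1))[0]

def translate_to_isl(asl_text, llm=None):
    """llm is any callable taking a prompt string; stubbed as a pass-through here."""
    prompt = f"Rewrite these ASL glosses in ISL gloss order: {asl_text}"
    return llm(prompt) if llm is not None else asl_text

def isl_to_video(isl_gloss, clip_table=None):
    clip_table = clip_table or {"HELLO": "isl_hello.mp4"}
    return clip_table.get(isl_gloss, "fingerspell.mp4")   # fall back to fingerspelling
```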
Handling sparse and unstructured geometric data, such as point clouds or event-based vision, is a pressing challenge in the field of machine vision. Recently, sequence models such as Transformers and state-space models entered the domain of geometric data. These methods require specialized preprocessing to create a sequential view of a set of points. Furthermore, prior works involving sequence models iterate geometric data with either uniform or learned step sizes, implicitly relying on the model to infer the underlying geometric structure. In this work, we propose to encode geometric structure explicitly into the parameterization of a state-space model. State-space models are based on linear dynamics governed by a one-dimensional variable such as time or a spatial coordinate. We exploit this dynamic variable to inject relative differences of coordinates into the step size of the state-space model. The resulting geometric operation computes interactions between all pairs of N points in O(N) steps. Our model deploys the Mamba selective state-space model with a modified CUDA kernel to efficiently map sparse geometric data to modern hardware. The resulting sequence model, which we call STREAM, achieves competitive results on a range of benchmarks from point-cloud classification to event-based vision and audio classification. STREAM demonstrates a powerful inductive bias for sparse geometric data by improving the PointMamba baseline when trained from scratch on the ModelNet40 and ScanObjectNN point cloud analysis datasets. It further achieves, for the first time, 100% test accuracy on all 11 classes of the DVS128 Gestures dataset.
https://arxiv.org/abs/2411.12603
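The core coupling in STREAM can be sketched as follows: the step size of a diagonal state-space recurrence is taken from the distance between consecutive points, so nearby points interact more strongly than distant ones. The real model relies on the selective Mamba kernel and a modified CUDA implementation; this pure-PyTorch loop only illustrates the idea.

```python
import torch

def geometric_ssm(points, feats, A, B, C):
    """points: (N, 3), feats: (N, D); A: (S,), B: (D, S), C: (S, D)."""
    h = torch.zeros(A.shape[0])
    outputs = []
    prev = points[0]
    for p, x in zip(points, feats):
        dt = torch.norm(p - prev)                  # relative coordinate difference
        h = torch.exp(A * dt) * h + dt * (x @ B)   # distance-aware discretization
        outputs.append(h @ C)
        prev = p
    return torch.stack(outputs)

N, D, S = 128, 16, 8
out = geometric_ssm(torch.rand(N, 3), torch.rand(N, D),
                    -torch.rand(S), torch.rand(D, S), torch.rand(S, D))
```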
The primary aim of this research is to capture American Sign Language (ASL) data from real-time camera footage and convert it into text. In addition, we focus on creating a framework that can convert text into sign language in real time, helping to break the language barrier for people who need it. For recognizing American Sign Language (ASL), we use the You Only Look Once (YOLO) model and a Convolutional Neural Network (CNN) model. The YOLO model runs in real time and automatically extracts discriminative spatio-temporal characteristics from the raw video stream without any prior knowledge, eliminating design flaws. The CNN model also runs in real time for sign language detection. We introduce a novel method for converting text-based input to sign language: a framework that takes a sentence as input, identifies keywords in that sentence, and then plays a video in which the corresponding sign language is performed in real time. To the best of our knowledge, this is one of the few studies to demonstrate real-time bidirectional sign language communication in American Sign Language (ASL).
https://arxiv.org/abs/2411.13597
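The text-to-sign half can be illustrated with a minimal keyword-to-clip lookup: keywords are pulled from the input sentence and matched against a library of pre-recorded ASL clips, which are then played back in order. The stop-word list and clip table are placeholders for the framework's actual components.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}
CLIP_LIBRARY = {"help": "asl_help.mp4", "water": "asl_water.mp4", "where": "asl_where.mp4"}

def sentence_to_clips(sentence):
    keywords = [w for w in sentence.lower().split() if w not in STOPWORDS]
    return [CLIP_LIBRARY[w] for w in keywords if w in CLIP_LIBRARY]

print(sentence_to_clips("Where is the water"))   # ['asl_where.mp4', 'asl_water.mp4']
```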
Assistive robots interact with humans and must adapt to different users' preferences to be effective. An easy and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, for example, robot movement trajectories or gestures. Existing techniques focus on generating trajectories for users to rank that maximize the outcome of the preference learning process. However, the generated trajectories do not appear to reflect the user's preference over repeated interactions. In this work, we design an algorithm to generate trajectories for users to rank that we call Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG). CMA-ES-IG prioritizes the user's experience of the preference learning process. We show that users find our algorithm more intuitive and easier to use than previous approaches across both physical and social robot tasks. This project's code is hosted at this http URL
https://arxiv.org/abs/2411.11182
The rise of the artificial intelligence of things calls for more energy-efficient edge computing paradigms, such as neuromorphic agents leveraging brain-inspired spiking neural network (SNN) models based on spatiotemporally sparse binary activations. However, the lack of efficient, high-accuracy deep SNN learning algorithms prevents practical edge deployments with a strictly bounded cost. In this paper, we propose a spatiotemporal orthogonal propagation (STOP) algorithm to tackle this challenge. Our algorithm enables fully synergistic learning of synaptic weights as well as firing thresholds and leakage factors in spiking neurons to improve SNN accuracy, within a unified temporally forward, trace-based framework that mitigates the huge memory requirement of storing the neural states of all time steps in the forward pass. Characteristically, the spatially backward neuronal errors and temporally forward traces propagate orthogonally to, and independently of, each other, substantially reducing computational overhead. Our STOP algorithm achieves high recognition accuracies of 99.53%, 94.84%, 74.92%, 98.26%, and 77.10% on the MNIST, CIFAR-10, CIFAR-100, DVS-Gesture, and DVS-CIFAR10 datasets with SNNs of intermediate scale, from LeNet-5 to ResNet-18. Compared with other deep SNN training works, our method is better suited to edge-intelligence scenarios where resources are limited but high-accuracy in-situ learning is desired.
https://arxiv.org/abs/2411.11082
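A very schematic sketch of the trace-based idea: a leaky integrate-and-fire layer maintains a temporally forward eligibility trace of its inputs, and the weight update at each step is the outer product of a spatially propagated error with that trace, so per-time-step activations never need to be stored. This toy rule is not the exact STOP algorithm.

```python
import numpy as np

def lif_layer_step(x, v, W, trace, leak=0.9, thresh=1.0, trace_decay=0.8):
    v = leak * v + W @ x                       # membrane potential update
    spikes = (v >= thresh).astype(float)
    v = v * (1.0 - spikes)                     # reset spiking neurons
    trace = trace_decay * trace + x            # forward-in-time input trace
    return spikes, v, trace

def local_update(W, error, trace, lr=1e-3):
    return W - lr * np.outer(error, trace)     # error (out,) x trace (in,)

n_in, n_out, T = 32, 8, 20
W = np.random.randn(n_out, n_in) * 0.1
v, trace = np.zeros(n_out), np.zeros(n_in)
for t in range(T):
    spikes, v, trace = lif_layer_step(np.random.rand(n_in) < 0.2, v, W, trace)
    W = local_update(W, error=spikes - 0.1, trace=trace)   # toy target rate of 0.1
```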
In recent years, brain-computer interfaces have made advances in decoding various motor-related tasks, including gesture recognition and movement classification, utilizing electroencephalogram (EEG) data. These developments are fundamental to exploring how neural signals can be interpreted to recognize specific physical actions. This study centers on a written-alphabet classification task, in which we aim to decode EEG signals associated with handwriting. To achieve this, we incorporate hand kinematics to guide the extraction of consistent embeddings from high-dimensional neural recordings using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG, are processed by a parallel convolutional neural network model that extracts features from both data sources simultaneously. The model classifies nine different handwritten characters, including symbols such as exclamation marks and commas. We evaluate the model using quantitative five-fold cross-validation and explore the structure of the embedding space through visualizations. Our approach achieves a classification accuracy of 91% on the nine-class task, demonstrating the feasibility of fine-grained handwriting decoding from EEG.
https://arxiv.org/abs/2411.09170
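A minimal sketch of the parallel two-branch design: one 1-D convolutional branch over the raw EEG channels and one over a (precomputed) CEBRA embedding sequence, with the pooled features concatenated for the nine-way character classifier. Channel counts, embedding dimension, and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class ParallelEEGNet(nn.Module):
    def __init__(self, eeg_ch=64, emb_dim=8, n_classes=9):
        super().__init__()
        self.eeg_branch = nn.Sequential(nn.Conv1d(eeg_ch, 32, 7, padding=3),
                                        nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.emb_branch = nn.Sequential(nn.Conv1d(emb_dim, 16, 7, padding=3),
                                        nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.head = nn.Linear(32 + 16, n_classes)

    def forward(self, eeg, cebra_emb):          # (B, eeg_ch, T), (B, emb_dim, T)
        f = torch.cat([self.eeg_branch(eeg).flatten(1),
                       self.emb_branch(cebra_emb).flatten(1)], dim=1)
        return self.head(f)

logits = ParallelEEGNet()(torch.randn(4, 64, 500), torch.randn(4, 8, 500))
```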
Embodied reference understanding is crucial for intelligent agents that must predict referents from human intention conveyed through gesture signals and language descriptions. This paper introduces Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source of the pointing gesture. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line that represents the referring gesture based on interactive distance. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the line represented by the gesture. Extensive experiments on the YouRefIt dataset demonstrate that our gesture understanding method significantly improves task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, a first in this domain. Comparative experiments with distance-unaware methods from previous research further validate the superiority of the attention-dynamic touch line across diverse contexts.
https://arxiv.org/abs/2411.08451
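The geometry behind a distance-aware referring line can be illustrated as below: candidate objects are scored by how well the vector from a predicted attention source to the object aligns with the line through the pointing hand. The distance-dependent adaptation of the line proposed in the paper is collapsed here to a single fixed line.

```python
import numpy as np

def score_objects(attention_src, hand_tip, object_centers):
    line_dir = hand_tip - attention_src
    line_dir = line_dir / np.linalg.norm(line_dir)
    scores = []
    for c in object_centers:
        v = c - attention_src
        cos = float(v @ line_dir) / (np.linalg.norm(v) + 1e-8)
        scores.append(cos)                      # higher score means better alignment
    return int(np.argmax(scores))

best = score_objects(np.array([0.0, 1.6, 0.0]),        # attention source (e.g., near the eyes)
                     np.array([0.4, 1.2, 0.3]),        # pointing hand position
                     [np.array([1.2, 0.4, 0.9]), np.array([-1.0, 0.5, 0.2])])
```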