Robotic surgery has reached a high level of maturity and has become an integral part of standard surgical care. However, existing surgeon consoles are bulky and take up valuable space in the operating room, present challenges for surgical team coordination, and their proprietary nature makes it difficult to take advantage of recent technological advances, especially in virtual and augmented reality. One potential area for further improvement is the integration of modern sensory gloves into robotic platforms, allowing surgeons to control robotic arms intuitively and directly with their hand movements. We propose one such system that combines an HTC Vive tracker, a Manus Meta Prime 3 XR sensory glove, and God Vision wireless smart glasses. The system controls one arm of a da Vinci surgical robot. In addition to moving the arm, the surgeon can use finger movements to control the end-effector of the surgical instrument. Hand gestures are used to implement clutching and similar functions. In particular, we introduce clutching of the instrument orientation, a functionality not available in the da Vinci system. The vibrotactile elements of the glove provide feedback to the user when gesture commands are invoked. A preliminary evaluation shows that the system has excellent tracking accuracy and allows surgeons to efficiently perform common surgical training tasks after minimal practice with the new interface, suggesting that the interface is highly intuitive. The proposed system is inexpensive, allows rapid prototyping, and opens opportunities for further innovations in the design of surgical robot interfaces.
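The orientation clutch can be illustrated with a short sketch. The snippet below is a minimal illustration under assumptions (hand and instrument orientations available as rotation objects, a boolean clutch gesture), not the authors' implementation: while the clutch gesture is held the instrument orientation stays frozen, and on release the hand-to-instrument offset is re-registered so the surgeon can re-centre the wrist.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

class OrientationClutch:
    """Minimal sketch of orientation clutching for a hand-tracked teleoperation
    interface: while the clutch gesture is held, hand rotation does not move the
    instrument; on release, the hand-to-instrument offset is re-registered."""

    def __init__(self, initial_instrument: R):
        self.instrument = initial_instrument   # commanded instrument orientation
        self.offset = None                     # instrument = offset * hand
        self.clutched = True                   # start clutched until first release

    def update(self, hand: R, clutch_gesture_active: bool) -> R:
        if clutch_gesture_active:
            # Freeze the instrument; the offset will be re-registered on release.
            self.clutched = True
            return self.instrument
        if self.clutched or self.offset is None:
            # Clutch just released: align the current hand pose with the frozen instrument.
            self.offset = self.instrument * hand.inv()
            self.clutched = False
        self.instrument = self.offset * hand
        return self.instrument

# Example: rotating the hand while clutched leaves the instrument unchanged.
clutch = OrientationClutch(R.identity())
cmd = clutch.update(R.from_euler("z", 45, degrees=True), clutch_gesture_active=True)
cmd = clutch.update(R.from_euler("z", 45, degrees=True), clutch_gesture_active=False)
print(cmd.as_euler("zyx", degrees=True))  # ~[0, 0, 0]: offset absorbed the 45 degree rotation
```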
https://arxiv.org/abs/2403.13941
Recent research has begun to examine the potential of automatically finding and fixing accessibility issues that manifest in software. While this work makes important progress, it has generally been skewed toward identifying issues that affect users with certain disabilities, such as visual or hearing impairments; other groups of users with different types of disabilities also need software tooling support to improve their experience. As such, this paper aims to automatically identify accessibility issues that affect users with motor impairments. To move toward this goal, this paper introduces a novel approach, called MotorEase, capable of identifying accessibility issues in mobile app UIs that impact motor-impaired users. Motor-impaired users often have limited ability to interact with touch-based devices, and instead may make use of a switch or other assistive mechanism -- hence UIs must be designed to support both limited touch gestures and the use of assistive devices. MotorEase adapts computer vision and text processing techniques to enable a semantic understanding of app UI screens, enabling the detection of violations related to four popular, previously unexplored UI design guidelines that support motor-impaired users: (i) visual touch target size, (ii) expanding sections, (iii) persisting elements, and (iv) adjacent icon visual distance. We evaluate MotorEase on a newly derived benchmark, called MotorCheck, that contains 555 manually annotated examples of violations of the above accessibility guidelines, across 1599 screens collected from 70 applications via a mobile app testing tool. Our experiments illustrate that MotorEase is able to identify violations with an average accuracy of ~90% and a false positive rate of less than 9%, outperforming baseline techniques.
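To make the guideline checks concrete, here is a minimal sketch of the kind of rule the visual touch target size check encodes. It assumes UI elements are available as bounding boxes in density-independent pixels and uses the common 48x48 dp recommendation as the threshold; the element fields and threshold are illustrative, not MotorEase's actual detector.

```python
from dataclasses import dataclass
from typing import List

# Common accessibility guidance recommends touch targets of at least 48x48 dp;
# the exact threshold used by MotorEase may differ -- this value is illustrative.
MIN_TOUCH_TARGET_DP = 48

@dataclass
class UiElement:
    element_id: str
    x: float          # left edge, in dp
    y: float          # top edge, in dp
    width: float      # in dp
    height: float     # in dp
    clickable: bool

def touch_target_violations(elements: List[UiElement]) -> List[str]:
    """Return ids of clickable elements whose visual touch target is too small."""
    return [
        e.element_id
        for e in elements
        if e.clickable and (e.width < MIN_TOUCH_TARGET_DP or e.height < MIN_TOUCH_TARGET_DP)
    ]

if __name__ == "__main__":
    screen = [
        UiElement("send_button", 10, 10, 96, 48, clickable=True),
        UiElement("tiny_close_icon", 300, 4, 20, 20, clickable=True),   # violation
        UiElement("decorative_divider", 0, 60, 360, 2, clickable=False),
    ]
    print(touch_target_violations(screen))   # ['tiny_close_icon']
```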
https://arxiv.org/abs/2403.13690
Under-actuated robot grippers, as a pervasive robotic tool, have become a considerable research focus. Despite the simplicity of their mechanical design and control strategy, they suffer from poor versatility and weak adaptability, which limits their widespread application. To help close these research gaps, we present a novel 3-finger linkage-based gripper that realizes retractable and reconfigurable multi-mode grasps driven by a single motor. Firstly, inspired by the changes in the contact surface as a human finger moves, we design a slider-slide rail mechanism as the phalanx to achieve retraction of each finger, allowing for better performance in the enveloping grasping mode. Secondly, a reconfigurable structure is constructed to broaden the range of object dimensions the proposed gripper can grasp. By adjusting the configuration and gesture of each finger, the gripper can achieve five grasping modes. Thirdly, the proposed gripper is actuated by a single motor, yet it is capable of grasping and reconfiguring simultaneously. Finally, various experiments on grasping slender, thin, and large-volume objects are conducted to evaluate the performance of the proposed gripper in practical scenarios, demonstrating its excellent grasping capabilities.
https://arxiv.org/abs/2403.12502
Despite the development of various deep learning methods for Wi-Fi sensing, packet loss often results in non-continuous estimation of the Channel State Information (CSI), which negatively impacts the performance of the learning models. To overcome this challenge, we propose a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) for CSI recovery, named CSI-BERT. CSI-BERT can be trained in a self-supervised manner on the target dataset without the need for additional data. Furthermore, unlike traditional interpolation methods that focus on one subcarrier at a time, CSI-BERT captures the sequential relationships across different subcarriers. Experimental results demonstrate that CSI-BERT achieves lower error rates and faster speed compared to traditional interpolation methods, even when facing high loss rates. Moreover, by harnessing the recovered CSI obtained from CSI-BERT, other deep learning models like Residual Network and Recurrent Neural Network can achieve an average increase in accuracy of approximately 15% in Wi-Fi sensing tasks. The collected dataset WiGesture and code for our model are publicly available at this https URL.
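A minimal sketch of the masked-recovery idea is shown below, assuming CSI is available as per-frame amplitude vectors over the subcarriers and using placeholder dimensions; this is not the released CSI-BERT code. Frames lost to packet drops are replaced by a learned mask token and reconstructed jointly across subcarriers by a transformer encoder.

```python
import torch
import torch.nn as nn

class MaskedCsiRecovery(nn.Module):
    """Minimal sketch of BERT-style masked recovery for CSI sequences.
    Input: (batch, time, subcarriers) CSI amplitudes plus a boolean loss mask.
    Lost frames are replaced by a learned mask embedding and reconstructed by a
    transformer encoder, so all subcarriers of a frame are recovered jointly."""

    def __init__(self, num_subcarriers: int = 52, d_model: int = 128, depth: int = 4):
        super().__init__()
        self.embed = nn.Linear(num_subcarriers, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, 512, d_model))        # up to 512 frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(d_model, num_subcarriers)

    def forward(self, csi: torch.Tensor, lost: torch.Tensor) -> torch.Tensor:
        # csi: (B, T, S); lost: (B, T) boolean, True where the packet was lost.
        x = self.embed(csi)
        x = torch.where(lost.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.pos[:, : x.size(1)]
        return self.head(self.encoder(x))

# Self-supervised training step: mask random frames of complete captures and
# reconstruct them; at inference, the frames lost to packet drops are the masked ones.
model = MaskedCsiRecovery()
csi = torch.randn(8, 100, 52)
lost = torch.rand(8, 100) < 0.15
recovered = model(csi * (~lost).unsqueeze(-1).float(), lost)
loss = nn.functional.mse_loss(recovered[lost], csi[lost])
loss.backward()
```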
https://arxiv.org/abs/2403.12400
Speech-driven gesture generation is an emerging field within virtual human creation. However, a significant challenge lies in accurately determining and processing the multitude of input features (such as acoustic, semantic, emotional, personality, and even subtle unknown features). Traditional approaches, reliant on various explicit feature inputs and complex multimodal processing, constrain the expressiveness of resulting gestures and limit their applicability. To address these challenges, we present Persona-Gestor, a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures solely relying on raw speech audio. The model combines a fuzzy feature extractor and a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture. The fuzzy feature extractor harnesses a fuzzy inference strategy that automatically infers implicit, continuous fuzzy features. These fuzzy features, represented as a unified latent feature, are fed into the AdaLN transformer. The AdaLN transformer introduces a conditional mechanism that applies a uniform function across all tokens, thereby effectively modeling the correlation between the fuzzy features and the gesture sequence. This module ensures a high level of gesture-speech synchronization while preserving naturalness. Finally, we employ the diffusion model to train and infer various gestures. Extensive subjective and objective evaluations on the Trinity, ZEGGS, and BEAT datasets confirm our model's superior performance to the current state-of-the-art approaches. Persona-Gestor improves the system's usability and generalization capabilities, setting a new benchmark in speech-driven gesture synthesis and broadening the horizon for virtual human technology. Supplementary videos and code can be accessed at this https URL
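The AdaLN conditioning mechanism can be sketched in a few lines of PyTorch: a single conditioning vector (here standing in for the unified fuzzy feature) predicts a scale and shift that modulate every token uniformly. Dimensions are placeholders and this is only a sketch of the mechanism, not the Persona-Gestor model.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive LayerNorm sketch: a conditioning vector (e.g. the unified fuzzy
    speech feature) predicts a per-channel scale and shift that is applied
    uniformly to every token of the gesture sequence."""

    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model); cond: (B, d_cond)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # The same affine modulation is broadcast across all T tokens.
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

ada = AdaLN(d_model=256, d_cond=128)
gesture_tokens = torch.randn(2, 120, 256)     # e.g. 120 pose frames
speech_feature = torch.randn(2, 128)          # unified latent from the audio
out = ada(gesture_tokens, speech_feature)     # (2, 120, 256)
```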
https://arxiv.org/abs/2403.10805
By leveraging the speakers and microphones on smart devices, acoustic sensing shows great potential in applications spanning health monitoring, gesture interfaces, and imaging. However, ongoing research and development in acoustic sensing often overlooks one problem: the same speaker, when used concurrently for sensing and other traditional applications (like playing music), causes interference in both, making the approach impractical in the real world. The strong ultrasonic sensing signals mixed with music would overload the speaker's mixer. Current solutions to this overload are clipping or down-scaling, both of which degrade music playback quality as well as sensing range and accuracy. To address this challenge, we propose CoPlay, a deep learning based optimization algorithm that cognitively adapts the sensing signal. It can 1) maximize the sensing signal magnitude within the available bandwidth left by the concurrent music to optimize sensing range and accuracy and 2) minimize any consequential frequency distortion that can affect music playback. In this work, we design a deep learning model and test it on common types of sensing signals (sine wave or frequency-modulated continuous wave, FMCW) as inputs, mixed with arbitrary concurrent music and speech. First, we evaluate the model to show the quality of the generated signals. Then we conduct field studies of downstream acoustic sensing tasks in the real world. A study with 12 users showed that respiration monitoring and gesture recognition using our adapted signal achieve accuracy similar to no-concurrent-music scenarios, while clipping or down-scaling yields worse accuracy. A qualitative study also showed that music playback quality is not degraded, unlike with traditional clipping or down-scaling methods.
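CoPlay's adaptation is learned, but the competing objectives can be sketched as a differentiable loss under simple assumptions: the mixed signal must stay within the mixer's [-1, 1] range, the sensing-signal magnitude is rewarded, and changes to the audible spectrum of the music are penalized. The weights, band edges, and formulation below are illustrative only, not the paper's trained model.

```python
import torch

def coplay_style_loss(sensing: torch.Tensor, music: torch.Tensor,
                      overload_weight: float = 10.0,
                      distortion_weight: float = 1.0) -> torch.Tensor:
    """Sketch of the competing objectives when one speaker must play music and an
    adapted ultrasonic sensing signal at the same time:
      1) keep the mixed signal inside the mixer's [-1, 1] range (no overload),
      2) keep the sensing signal as strong as possible,
      3) keep the audible spectrum close to the original music (low distortion)."""
    mixed = sensing + music
    overload = torch.relu(mixed.abs() - 1.0).mean()              # mixer overload penalty
    sensing_gain = sensing.abs().mean()                          # reward sensing magnitude
    spec_mixed = torch.fft.rfft(mixed).abs()
    spec_music = torch.fft.rfft(music).abs()
    audible = slice(0, 18000)      # with fs = 48 kHz over a 1 s window, bin k is roughly k Hz
    distortion = (spec_mixed[audible] - spec_music[audible]).pow(2).mean()
    return overload_weight * overload - sensing_gain + distortion_weight * distortion

fs = 48000
t = torch.arange(fs) / fs
music = 0.8 * torch.sin(2 * torch.pi * 440 * t)                  # stand-in music signal
sensing = torch.zeros(fs, requires_grad=True)                    # signal to be adapted
coplay_style_loss(sensing, music).backward()                     # gradients shape the sensing signal
```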
https://arxiv.org/abs/2403.10796
This paper presents the design and development of an innovative interactive robotic system to enhance audience engagement using character-like personas. Built upon the foundations of persona-driven dialog agents, this work extends the agent application to the physical realm, employing robots to provide a more immersive and interactive experience. The proposed system, named the Masquerading Animated Social Kinematics (MASK), leverages an anthropomorphic robot which interacts with guests using non-verbal interactions, including facial expressions and gestures. A behavior generation system based upon a finite-state machine structure effectively conditions robotic behavior to convey distinct personas. The MASK framework integrates a perception engine, a behavior selection engine, and a comprehensive action library to enable real-time, dynamic interactions with minimal human intervention in behavior design. In our user studies, we examined whether users could recognize the intended character under film-character-based persona conditions. We conclude by discussing the role of personas in interactive agents and the factors to consider when creating an engaging user experience.
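A toy example of persona-conditioned behaviour selection with a finite-state machine is sketched below; the states, perception events, and action names are invented for illustration and do not reflect the MASK system's actual tables.

```python
# Minimal finite-state-machine sketch of persona-conditioned behaviour selection.
# States, events, and action names are illustrative placeholders.
PERSONA_BEHAVIOURS = {
    "grumpy":   {("idle", "guest_approaches"): ("engaged", "frown_and_glance"),
                 ("engaged", "guest_waves"):   ("engaged", "slow_reluctant_wave"),
                 ("engaged", "guest_leaves"):  ("idle", "dismissive_shrug")},
    "cheerful": {("idle", "guest_approaches"): ("engaged", "smile_and_nod"),
                 ("engaged", "guest_waves"):   ("engaged", "enthusiastic_wave"),
                 ("engaged", "guest_leaves"):  ("idle", "wave_goodbye")},
}

class PersonaFSM:
    def __init__(self, persona: str, initial_state: str = "idle"):
        self.table = PERSONA_BEHAVIOURS[persona]
        self.state = initial_state

    def on_event(self, event: str) -> str:
        """Advance the state machine and return the action to send to the robot."""
        self.state, action = self.table.get((self.state, event), (self.state, "idle_breathing"))
        return action

fsm = PersonaFSM("grumpy")
for event in ["guest_approaches", "guest_waves", "guest_leaves"]:
    print(event, "->", fsm.on_event(event))
```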
https://arxiv.org/abs/2403.10041
Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging applications across various fields like film, robotics, and virtual reality. Recent advancements have utilized the diffusion model and attention mechanisms to improve gesture synthesis. However, due to the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address the challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance the quality of gestures. Leveraging the foundational Mamba block, we introduce MambaTalk, enhancing gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.
https://arxiv.org/abs/2403.09471
Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work considers a listener as a reactive agent with reflexive behaviors to the speaker's voice and facial motions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through a VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state of the art according to quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capability of the proposed approach in generating diverse and realistic expressions, eye blinks, and head gestures.
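The contrastive part of the pre-training can be illustrated with a standard InfoNCE-style loss that pulls together speaker and listener embeddings from the same interaction window and pushes apart mismatched pairs; this sketch shows the general mechanism, not the DIM training code.

```python
import torch
import torch.nn.functional as F

def dyadic_contrastive_loss(speaker_emb: torch.Tensor, listener_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style sketch for dyadic interaction modeling: the speaker and listener
    embeddings from the same interaction window form a positive pair; embeddings from
    different windows in the batch act as negatives."""
    spk = F.normalize(speaker_emb, dim=-1)           # (B, D)
    lst = F.normalize(listener_emb, dim=-1)          # (B, D)
    logits = spk @ lst.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(spk.size(0), device=spk.device)
    # Symmetric loss: match speakers to their listeners and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

speaker_emb = torch.randn(16, 256, requires_grad=True)   # pooled speaker motion tokens
listener_emb = torch.randn(16, 256, requires_grad=True)  # pooled listener motion tokens
dyadic_contrastive_loss(speaker_emb, listener_emb).backward()
```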
https://arxiv.org/abs/2403.09069
We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering image quality, identity preservation and temporal consistency while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally we show applications in video editing and personalization.
https://arxiv.org/abs/2403.08764
Open-sourced, user-friendly tools form the bedrock of scientific advancement across disciplines. The widespread adoption of data-driven learning has led to remarkable progress in multi-fingered dexterity, bimanual manipulation, and applications ranging from logistics to home robotics. However, existing data collection platforms are often proprietary, costly, or tailored to specific robotic morphologies. We present OPEN TEACH, a new teleoperation system leveraging VR headsets to immerse users in mixed reality for intuitive robot control. Built on the affordable Meta Quest 3, which costs $500, OPEN TEACH enables real-time control of various robots, including multi-fingered hands and bimanual arms, through an easy-to-use app. Using natural hand gestures and movements, users can manipulate robots at up to 90Hz with smooth visual feedback and interface widgets offering closeup environment views. We demonstrate the versatility of OPEN TEACH across 38 tasks on different robots. A comprehensive user study indicates significant improvement in teleoperation capability over the AnyTeleop framework. Further experiments exhibit that the collected data is compatible with policy learning on 10 dexterous and contact-rich manipulation tasks. Currently supporting Franka, xArm, Jaco, and Allegro platforms, OPEN TEACH is fully open-sourced to promote broader adoption. Videos are available at this https URL.
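The control-loop structure of such a teleoperation system can be sketched as below, with placeholder read_hand_pose and send_end_effector_delta functions standing in for the headset stream and the robot-specific command interface; this is not the OPEN TEACH code, only an illustration of a fixed-rate, relative-motion mapping.

```python
import time
import numpy as np

CONTROL_HZ = 90                     # OPEN TEACH reports control rates up to 90 Hz
MOTION_SCALE = 1.0                  # hand-to-robot motion scaling (illustrative)

def read_hand_pose():
    """Placeholder for the VR headset's hand-tracking stream (position in metres)."""
    return np.zeros(3)

def send_end_effector_delta(delta):
    """Placeholder for the robot-specific Cartesian delta command."""
    pass

def teleop_loop():
    period = 1.0 / CONTROL_HZ
    prev = read_hand_pose()
    while True:
        start = time.monotonic()
        hand = read_hand_pose()
        # Relative mapping: stream the scaled hand displacement since the last tick.
        send_end_effector_delta(MOTION_SCALE * (hand - prev))
        prev = hand
        # Sleep off the remainder of the tick to hold the control rate.
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```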
https://arxiv.org/abs/2403.07870
Compared with images, skeleton-based motion representations are robust for action localization and understanding owing to their invariance to perspective, lighting, and occlusion. Yet they are often ambiguous and incomplete when taken out of context, even for human annotators. As infants discern gestures before associating them with words, actions can be conceptualized before being grounded with labels. Therefore, we propose the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), that partitions a skeleton-based motion sequence into discovered, semantically meaningful pre-action segments. By fine-tuning our pre-training network with a small amount of annotated data, we show results out-performing SOTA methods by a large margin.
https://arxiv.org/abs/2403.07354
Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories based on short segments of kinematic and video data. We conduct an ablation study to evaluate the impact of fusing different input modalities and their representations on gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture using the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state-of-the-art (SOTA) with 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. By relying on a computationally efficient model, it achieves real-time performance of 1.1-1.3 ms for processing a 1-second input window.
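A minimal sketch of the fusion idea follows, assuming per-frame kinematic vectors and precomputed per-frame video features, with illustrative dimensions and class count; the paper's actual architecture and fusion scheme may differ.

```python
import torch
import torch.nn as nn

class KinematicVideoFusion(nn.Module):
    """Sketch of multimodal fusion for surgical gesture recognition: per-frame
    kinematic features and per-frame video features are projected to a shared
    width, concatenated along the token axis, and processed jointly by a
    transformer encoder before classifying the gesture of the input window."""

    def __init__(self, kin_dim: int = 26, vid_dim: int = 512,
                 d_model: int = 128, num_gestures: int = 15):
        super().__init__()
        self.kin_proj = nn.Linear(kin_dim, d_model)
        self.vid_proj = nn.Linear(vid_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, num_gestures)

    def forward(self, kinematics: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # kinematics: (B, T, kin_dim); video_feats: (B, T, vid_dim), e.g. CNN features.
        tokens = torch.cat([self.kin_proj(kinematics), self.vid_proj(video_feats)], dim=1)
        fused = self.encoder(tokens)                 # (B, 2T, d_model)
        return self.classifier(fused.mean(dim=1))    # gesture logits for the window

model = KinematicVideoFusion()
logits = model(torch.randn(4, 30, 26), torch.randn(4, 30, 512))   # 1-second windows
```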
https://arxiv.org/abs/2403.06705
As technological advancements continue to expand the capabilities of multi-unmanned-aerial-vehicle (mUAV) systems, human operators face challenges in scalability and efficiency due to the complex cognitive load and operations associated with motion adjustments and team coordination. Such cognitive demands limit the feasible size of mUAV teams and necessitate extensive operator training, impeding broader adoption. This paper develops Hand Gesture Based Interactive Control (HGIC), a novel interface system that utilizes computer vision techniques to intuitively translate hand gestures into modular commands for robot teaming. Through learning control models, these commands enable efficient and scalable mUAV motion control and adjustments. HGIC eliminates the need for specialized hardware and offers two key benefits: 1) minimal training requirements through natural gestures; and 2) enhanced scalability and efficiency via adaptable commands. By reducing the cognitive burden on operators, HGIC opens the door for more effective large-scale mUAV applications in complex, dynamic, and uncertain scenarios. HGIC will be open-sourced for the research community after the paper is published online, aiming to drive forward innovations in human-mUAV interactions.
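The modular command layer can be sketched independently of the vision pipeline: a recognized gesture label is dispatched to a team-level command. The gesture names and swarm commands below are placeholders, not HGIC's actual command set.

```python
# Sketch of a modular command layer: a recognized hand-gesture label (produced by
# any vision-based classifier) is dispatched to a team-level mUAV command.
# Gesture names and command functions here are illustrative placeholders.

def ascend(team):        print(f"team {team}: ascend 2 m")
def hold_position(team): print(f"team {team}: hold position")
def form_line(team):     print(f"team {team}: form line")
def return_home(team):   print(f"team {team}: return to launch")

GESTURE_COMMANDS = {
    "open_palm":   hold_position,
    "thumb_up":    ascend,
    "two_fingers": form_line,
    "fist":        return_home,
}

def dispatch(gesture_label: str, team: str = "alpha") -> None:
    command = GESTURE_COMMANDS.get(gesture_label)
    if command is None:
        print(f"unrecognised gesture '{gesture_label}', ignoring")
        return
    command(team)

for g in ["thumb_up", "two_fingers", "wave"]:
    dispatch(g)
```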
https://arxiv.org/abs/2403.05478
Sonomyography (SMG) is a non-invasive technique that uses ultrasound imaging to detect the dynamic activity of muscles. Wearable SMG systems have recently gained popularity as human-computer interfaces due to their superior performance compared to conventional methods. This paper demonstrates real-time positional proportional control of multiple gestures using a multiplexed 8-channel wearable SMG system. The amplitude-mode ultrasound signals from the SMG system were utilized to detect muscle activity from the forearm of 8 healthy individuals. The derived signals were used to control the on-screen movement of a cursor. A target-achievement task was performed to analyze the performance of our SMG-based human-machine interface. Our wearable SMG system provided accurate, stable, and intuitive control in real time, achieving an average success rate greater than 80% with all gestures. Furthermore, we evaluated the wearable SMG system's ability to detect volitional movement and decode movement kinematics from SMG trajectories using standard performance metrics. Our results provide insights that validate SMG as an intuitive human-machine interface.
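Positional proportional control means the cursor position tracks the normalized activation level rather than its rate of change. The sketch below assumes an A-mode-derived activation value calibrated between rest and maximum contraction; the calibration numbers are illustrative.

```python
import numpy as np

def proportional_cursor_position(activation: float, rest_level: float, max_level: float,
                                 screen_span: float = 1080.0) -> float:
    """Positional proportional control sketch: the cursor position is a linear
    function of the normalized muscle activation derived from A-mode ultrasound,
    clamped between the calibrated rest and maximum-contraction levels."""
    normalized = np.clip((activation - rest_level) / (max_level - rest_level), 0.0, 1.0)
    return normalized * screen_span

# Illustrative calibration from a rest trial and a maximum-contraction trial.
rest_level, max_level = 0.12, 0.85
for activation in [0.12, 0.50, 0.85]:
    print(proportional_cursor_position(activation, rest_level, max_level))
```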
https://arxiv.org/abs/2403.05308
Micro-action is an imperceptible non-verbal behaviour characterised by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, the identification, differentiation, and understanding of micro-actions pose challenges due to the imperceptible and inaccessible nature of these subtle human behaviors in everyday life. In this study, we collect a new micro-action dataset designated as Micro-action-52 (MA-52), and propose a benchmark named micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides a whole-body perspective including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels, and encompasses a full array of realistic and natural micro-actions from 205 participants and 22,422 video instances collated from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and temporal shift module (TSM) blocks into the ResNet architecture to model the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between video and action labels; the loss is used to better distinguish between visually similar yet distinct micro-action categories. The extended application in emotion recognition demonstrates one of the important values of our proposed dataset and method. In the future, human behaviour, emotion, and psychological assessment will be explored in more depth. The dataset and source code are released at this https URL.
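The joint-embedding loss can be sketched as a similarity-based classification between pooled video features and one embedding per action label; the formulation below illustrates the idea and is not MANet's exact loss.

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(video_emb: torch.Tensor, label_emb: torch.Tensor,
                         targets: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a joint-embedding loss for micro-action recognition: each video
    embedding should be closest (by cosine similarity) to the embedding of its own
    action label among all class-label embeddings, which helps separate visually
    similar but semantically distinct micro-actions."""
    v = F.normalize(video_emb, dim=-1)           # (B, D) pooled video features
    c = F.normalize(label_emb, dim=-1)           # (C, D) one embedding per action label
    logits = v @ c.t() / temperature             # (B, C) similarities to every class
    return F.cross_entropy(logits, targets)

video_emb = torch.randn(32, 256, requires_grad=True)
label_emb = torch.randn(52, 256, requires_grad=True)   # 52 micro-action categories
targets = torch.randint(0, 52, (32,))
joint_embedding_loss(video_emb, label_emb, targets).backward()
```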
https://arxiv.org/abs/2403.05234
This research aims to further understanding in the field of continuous authentication using behavioral biometrics. We contribute a novel dataset that encompasses the gesture data of 15 users playing Minecraft with a Samsung Tablet, each for a duration of 15 minutes. Utilizing this dataset, we employed machine learning (ML) binary classifiers, namely Random Forest (RF), K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC), to determine the authenticity of specific user actions. Our most robust model was SVC, which achieved an average accuracy of approximately 90%, demonstrating that touch dynamics can effectively distinguish users. However, further studies are needed to make this a viable option for authentication systems.
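The classification setup is standard enough to sketch directly with scikit-learn; the feature matrix below is a random stand-in for the per-gesture touch-dynamics features, and label 1 denotes the genuine user.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder features: rows are gesture samples (e.g. swipe duration, speed,
# pressure statistics); label 1 = genuine user, 0 = any other user.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))
y = rng.integers(0, 2, size=600)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```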
https://arxiv.org/abs/2403.03832
Recent advancements in multimodal Human-Robot Interaction (HRI) datasets have highlighted the fusion of speech and gesture, expanding robots' capabilities to absorb explicit and implicit HRI insights. However, existing speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing, revealing limitations in scaling to intricate domains and prioritizing human command data over robot behavior records. To bridge these gaps, we introduce NatSGD, a multimodal HRI dataset encompassing natural human commands through speech and gestures, synchronized with robot behavior demonstrations. NatSGD serves as a foundational resource at the intersection of machine learning and HRI research, and we demonstrate its effectiveness in training robots to understand tasks through multimodal human commands, emphasizing the significance of jointly considering speech and gestures. We have released our dataset, simulator, and code to facilitate future research in human-robot interaction system learning; access these resources at this https URL.
https://arxiv.org/abs/2403.02274
3D hand pose estimation has found broad application in areas such as gesture recognition and human-machine interaction. As performance improves, the complexity of the systems also increases, which can limit the comparative analysis and practical implementation of these methods. In this paper, we propose a simple yet effective baseline that not only surpasses state-of-the-art (SOTA) methods but also demonstrates computational efficiency. To establish this baseline, we abstract existing work into two components: a token generator and a mesh regressor, and then examine their core structures. A core structure, in this context, is one that fulfills intrinsic functions, brings about significant improvements, and achieves excellent performance without unnecessary complexities. Our proposed approach is decoupled from any modifications to the backbone, making it adaptable to any modern model. Our method outperforms existing solutions, achieving state-of-the-art (SOTA) results across multiple datasets. On the FreiHAND dataset, our approach produced a PA-MPJPE of 5.7 mm and a PA-MPVPE of 6.0 mm. Similarly, on the DexYCB dataset, we observed a PA-MPJPE of 5.5 mm and a PA-MPVPE of 5.0 mm. In terms of speed, our method reached up to 33 frames per second (fps) when using HRNet and up to 70 fps when employing FastViT-MA36.
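The two-component abstraction can be sketched as follows, with illustrative dimensions (21 hand-joint tokens, a 778-vertex MANO-style mesh) and a deliberately simple token generator; the paper's concrete core structures may differ.

```python
import torch
import torch.nn as nn

class TokenGenerator(nn.Module):
    """Sketch: turn pooled backbone image features into a fixed set of hand tokens."""
    def __init__(self, feat_dim: int = 2048, num_tokens: int = 21, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_tokens * d_model)
        self.num_tokens, self.d_model = num_tokens, d_model

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, feat_dim) pooled features from any backbone.
        return self.proj(image_feats).view(-1, self.num_tokens, self.d_model)

class MeshRegressor(nn.Module):
    """Sketch: regress 3D mesh vertices from the hand tokens."""
    def __init__(self, d_model: int = 256, num_tokens: int = 21, num_vertices: int = 778):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_vertices = nn.Linear(d_model, 3)
        self.upsample = nn.Linear(num_tokens, num_vertices)   # token count -> vertex count

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.to_vertices(self.encoder(tokens))                 # (B, 21, 3)
        return self.upsample(x.transpose(1, 2)).transpose(1, 2)    # (B, 778, 3)

tokens = TokenGenerator()(torch.randn(2, 2048))
vertices = MeshRegressor()(tokens)        # (2, 778, 3) MANO-resolution mesh
```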
https://arxiv.org/abs/2403.01813
Recognizing speaking in humans is a central task towards understanding social interactions. Ideally, speaking would be detected from individual voice recordings, as done previously for meeting scenarios. However, individual voice recordings are hard to obtain in the wild, especially in crowded mingling scenarios due to cost, logistics, and privacy concerns. As an alternative, machine learning models trained on video and wearable sensor data make it possible to recognize speech by detecting its related gestures in an unobtrusive, privacy-preserving way. These models themselves should ideally be trained using labels obtained from the speech signal. However, existing mingling datasets do not contain high quality audio recordings. Instead, speaking status annotations have often been inferred by human annotators from video, without validation of this approach against audio-based ground truth. In this paper we revisit no-audio speaking status estimation by presenting the first publicly available multimodal dataset with high-quality individual speech recordings of 33 subjects in a professional networking event. We present three baselines for no-audio speaking status segmentation: a) from video, b) from body acceleration (chest-worn accelerometer), c) from body pose tracks. In all cases we predict a 20Hz binary speaking status signal extracted from the audio, a time resolution not available in previous datasets. In addition to providing the signals and ground truth necessary to evaluate a wide range of speaking status detection methods, the availability of audio in REWIND makes it suitable for cross-modality studies not feasible with previous mingling datasets. Finally, our flexible data consent setup creates new challenges for multimodal systems under missing modalities.
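The 20 Hz binary speaking-status signal can be illustrated with a simple energy-threshold sketch over 50 ms windows; a real pipeline would use a proper voice-activity detector, and the threshold below is arbitrary.

```python
import numpy as np

def binary_speaking_status(audio: np.ndarray, sample_rate: int = 16000,
                           status_hz: int = 20, energy_threshold: float = 1e-3) -> np.ndarray:
    """Sketch: reduce an individual voice recording to a 20 Hz binary speaking-status
    signal by thresholding short-window energy. A real pipeline would use a proper
    voice-activity detector; the threshold here is illustrative."""
    window = sample_rate // status_hz                      # samples per 50 ms status frame
    num_frames = len(audio) // window
    frames = audio[: num_frames * window].reshape(num_frames, window)
    energy = np.mean(frames ** 2, axis=1)
    return (energy > energy_threshold).astype(np.int8)     # one label per 50 ms

audio = np.random.randn(16000 * 3) * 0.05                  # 3 s of stand-in audio
status = binary_speaking_status(audio)
print(status.shape)                                        # (60,) -> 20 labels per second
```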
https://arxiv.org/abs/2403.01229