Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception and motion regression and can lead to poor perception of three-dimensional transformations. They also ignore possible overlaps or gaps between teeth in the predicted dentition, which are generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network that decouples the prediction tasks from feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the initial and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature-consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the relative gestures between teeth, which can be easily extended to other 3D point cloud tasks. We also propose an arch-width-guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved accuracy and speed compared with existing methods.
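To make the collision supervision concrete, below is a minimal sketch of a differentiable proximity penalty between two teeth represented as point clouds, written in PyTorch. It only illustrates the general idea of a soft, gradient-friendly overlap penalty; the function name, the margin, and the hinge form are assumptions, not DTAN's actual loss.

```python
import torch

def collision_loss(tooth_a, tooth_b, margin=0.1):
    """Differentiable proximity/overlap penalty between two point clouds.

    tooth_a: (N, 3), tooth_b: (M, 3). Points of the two teeth that come closer
    than `margin` are penalized, so gradients push the predicted rigid poses
    apart. A generic soft-penetration proxy, not necessarily DTAN's exact loss.
    """
    d = torch.cdist(tooth_a, tooth_b)            # (N, M) pairwise distances
    nearest, _ = d.min(dim=1)                    # distance from each point of A to B
    return torch.relu(margin - nearest).mean()   # hinge penalty on violations

# Toy usage: two random point clouds that partially overlap.
a = torch.rand(256, 3, requires_grad=True)
b = torch.rand(256, 3) + torch.tensor([0.8, 0.0, 0.0])
loss = collision_loss(a, b)
loss.backward()                                  # gradients flow back to the point coordinates
```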
https://arxiv.org/abs/2409.11937
The event camera has demonstrated significant success across a wide range of areas due to its low latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. EventAug lets models learn richer motion patterns, object variants, and local spatio-temporal relations, thus improving robustness to varied moving speeds, occlusions, and action disruptions. Experimental results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be made publicly available to the community.
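As a rough illustration of what event-level masking augmentation can look like, the sketch below drops events that fall inside a spatial patch and a temporal window of a raw (x, y, t, p) event stream. EventAug's SSEM/TSEM select salient regions rather than random ones, and MSTI rescales timestamps to vary motion speed; everything here (function name, parameters, random region selection) is an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def mask_events(events, H, W, patch=32, t_frac=0.1, rng=None):
    """Toy spatial + temporal masking on an (N, 4) array of (x, y, t, polarity)."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, t = events[:, 0], events[:, 1], events[:, 2]

    # Spatial mask: drop events inside a randomly placed patch.
    x0, y0 = rng.integers(0, W - patch), rng.integers(0, H - patch)
    keep_s = ~((x >= x0) & (x < x0 + patch) & (y >= y0) & (y < y0 + patch))

    # Temporal mask: drop events inside a random time window.
    span = t.max() - t.min()
    t0 = rng.uniform(t.min(), t.max() - t_frac * span)
    keep_t = ~((t >= t0) & (t < t0 + t_frac * span))

    return events[keep_s & keep_t]

# Toy usage on synthetic events for a 128x128 sensor (timestamps in microseconds).
ev = np.stack([np.random.randint(0, 128, 5000),           # x
               np.random.randint(0, 128, 5000),           # y
               np.sort(np.random.uniform(0, 1e6, 5000)),  # t
               np.random.choice([-1, 1], 5000)], axis=1)  # polarity
aug = mask_events(ev, H=128, W=128)
```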
https://arxiv.org/abs/2409.11813
Controlling hands in the high-dimensional action space has been a longstanding challenge, yet humans naturally perform dexterous tasks with ease. In this paper, we draw inspiration from human embodied cognition and reconsider dexterous hands as learnable systems. Specifically, we introduce MoDex, a framework that employs a neural hand model to capture the dynamical characteristics of hand movements. Based on this model, a bidirectional planning method is developed that is efficient in both training and inference. The method is further integrated with a large language model to generate various gestures such as "Scissorshand" and "Rock&Roll." Moreover, we show that decomposing the system dynamics into a pretrained hand model and an external model improves data efficiency, as supported by both theoretical analysis and empirical experiments. Additional visualization results are available at this https URL.
https://arxiv.org/abs/2409.10983
Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. "In-the-wild" datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.
https://arxiv.org/abs/2409.10357
Ultrasound imaging of the forearm has demonstrated significant potential for accurate hand gesture classification. Despite this progress, there has been limited focus on developing a stand-alone, end-to-end gesture recognition system that is mobile, real-time, and more user-friendly. To bridge this gap, this paper explores the deployment of deep neural networks for forearm-ultrasound-based hand gesture recognition on edge devices. Utilizing quantization techniques, we achieve substantial reductions in model size while maintaining high accuracy and low latency. Our best model, with Float16 quantization, achieves a test accuracy of 92% and an inference time of 0.31 seconds on a Raspberry Pi. These results demonstrate the feasibility of efficient, real-time gesture recognition on resource-limited edge devices, paving the way for wearable ultrasound-based systems.
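The paper does not name its deployment toolchain, but post-training Float16 quantization of the kind reported is commonly done with TensorFlow Lite; the sketch below shows that recipe on a stand-in Keras model. The placeholder architecture, class count, and file name are assumptions, not the authors' network.

```python
import tensorflow as tf

# Stand-in for the trained ultrasound gesture classifier (placeholder architecture).
trained_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),   # e.g. 10 gesture classes
])

# Post-training Float16 quantization with TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_keras_model(trained_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()

with open("gesture_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)

# On the edge device, inference then runs through the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_path="gesture_fp16.tflite")
interpreter.allocate_tensors()
```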
https://arxiv.org/abs/2409.09915
Diffusion models have shown a remarkable ability to synthesize images, including the generation of humans in specific poses. However, current models face challenges in adequately expressing conditional control for detailed hand pose generation, leading to significant distortion in the hand regions. To tackle this problem, we first curate the How2Sign dataset to provide richer and more accurate hand pose annotations. In addition, we introduce adaptive, multi-modal fusion to integrate characters' physical features expressed in different modalities such as skeleton, depth, and surface normal. Furthermore, we propose a novel Region-Aware Cycle Loss (RACL) that enables the diffusion model training to focus on improving the hand region, resulting in improved quality of generated hand gestures. More specifically, the proposed RACL computes a weighted keypoint distance between the full-body pose keypoints of the generated image and the ground truth, generating higher-quality hand poses while balancing overall pose accuracy. Moreover, we use two hand-region metrics, hand-PSNR and hand-Distance, to evaluate hand pose generation. Our experimental evaluations demonstrate the effectiveness of the proposed approach in improving the quality of digital human pose generation with diffusion models, especially in the hand region. The source code is available at this https URL.
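The description of RACL suggests a hand-weighted keypoint distance; the following sketch shows one plausible form of such a term. The weighting scheme, index handling, and function name are assumptions rather than the paper's exact formulation.

```python
import torch

def weighted_keypoint_distance(pred_kpts, gt_kpts, hand_idx, hand_weight=5.0):
    """Hedged sketch of a RACL-style weighted keypoint distance.

    pred_kpts, gt_kpts: (K, 2) full-body keypoints from the generated image and
    the ground truth; hand_idx: indices of hand keypoints. Only the idea of
    upweighting the hand region while covering the whole body is shown here.
    """
    w = torch.ones(pred_kpts.shape[0])
    w[hand_idx] = hand_weight
    per_kpt = (pred_kpts - gt_kpts).norm(dim=-1)   # Euclidean error per keypoint
    return (w * per_kpt).sum() / w.sum()           # weighted mean distance

# Toy usage: 133 whole-body keypoints, of which the last 42 are hand points (assumed layout).
pred, gt = torch.rand(133, 2), torch.rand(133, 2)
loss = weighted_keypoint_distance(pred, gt, hand_idx=torch.arange(91, 133))
```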
https://arxiv.org/abs/2409.09149
The advent and growing popularity of Virtual Reality (VR) and Mixed Reality (MR) solutions have revolutionized the way we interact with digital platforms. The cutting-edge gaze-controlled typing methods, now prevalent in high-end models of these devices, e.g., Apple Vision Pro, have not only improved the user experience but also mitigated traditional keystroke inference attacks that relied on hand gestures, head movements, and acoustic side-channels. However, this advancement has paradoxically given birth to a new, potentially more insidious cyber threat, GAZEploit. In this paper, we unveil GAZEploit, a novel eye-tracking-based attack specifically designed to exploit this eye-tracking information by leveraging the common use of virtual appearances in VR applications. This widespread usage significantly enhances the practicality and feasibility of our attack compared to existing methods. GAZEploit takes advantage of this vulnerability to remotely extract gaze estimations and steal sensitive keystroke information across various typing scenarios, including messages, passwords, URLs, emails, and passcodes. Our research, involving 30 participants, achieved over 80% accuracy in keystroke inference. Alarmingly, our study also identified more than 15 top-rated apps in the Apple Store as vulnerable to the GAZEploit attack, emphasizing the urgent need for bolstered security measures for this state-of-the-art VR/MR text entry method.
https://arxiv.org/abs/2409.08122
Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. This approach utilizes classifier-free guidance, allowing the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
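Classifier-free guidance itself is a standard mechanism; the sketch below shows one guided denoising step for an audio-conditioned keypoint diffusion model of the kind described. The model interface, tensor shapes, and guidance scale are assumptions, not DiffTED's actual code.

```python
import torch

def cfg_denoise(model, x_t, t, audio_feat, guidance_scale=2.0):
    """One classifier-free-guidance denoising step (illustrative interface).

    x_t: noisy TPS keypoint sequence at step t; audio_feat: conditioning audio
    features; passing None drops the condition, which the model must have been
    trained to accept (the usual CFG trick of random condition dropout).
    """
    eps_cond = model(x_t, t, audio_feat)   # conditional noise prediction
    eps_uncond = model(x_t, t, None)       # unconditional noise prediction
    # Guided prediction: push the output toward the audio-consistent direction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a stand-in denoiser; shapes are assumed, not the paper's.
dummy_model = lambda x, t, cond: torch.zeros_like(x)
x_noisy = torch.randn(1, 60, 50, 2)        # (batch, frames, TPS keypoints, xy)
audio = torch.randn(1, 60, 128)            # per-frame audio features
eps_hat = cfg_denoise(dummy_model, x_noisy, t=10, audio_feat=audio)
```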
https://arxiv.org/abs/2409.07649
In recent years, robots have become an important part of our day-to-day lives across a variety of applications. Human-robot interaction has had a positive impact on the field of robotics by enabling people to interact and communicate with robots. Gesture recognition techniques combined with machine learning algorithms have shown remarkable progress in recent years, particularly in human-robot interaction (HRI). This paper comprehensively reviews the latest advancements in gesture recognition methods and their integration with machine learning approaches to enhance HRI. Furthermore, it presents vision-based gesture recognition with a depth-sensing system for safe and reliable human-robot interaction, and analyses the role of machine learning algorithms such as deep learning, reinforcement learning, and transfer learning in improving the accuracy and robustness of gesture recognition systems for effective communication between humans and robots.
https://arxiv.org/abs/2409.06503
Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via reference UNet, another crucial factor, pose control, has not yet been well studied by existing methods, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video will be shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose-condition side. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.
https://arxiv.org/abs/2409.06202
Natural co-speech gestures are essential for improving the experience of human-robot interaction (HRI). However, current gesture generation approaches have many limitations: the gestures are not natural, do not align with the speech and content, or lack diverse speaker styles. Therefore, this work aims to reproduce the work by Yoon et al., which generates natural gestures in simulation from tri-modal inputs, and apply it to a robot. During evaluation, "motion variance" and "Fréchet Gesture Distance (FGD)" are employed to evaluate the performance objectively. Then, human participants were recruited to subjectively evaluate the gestures. Results show that the movements in that paper have been successfully transferred to the robot, and that the gestures have diverse styles and are correlated with the speech. Moreover, there is a significant likeability and style difference between different gestures.
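Fréchet Gesture Distance is computed like FID: a Fréchet distance between Gaussians fitted to feature embeddings of real and generated gestures. A minimal version is sketched below; it assumes the embeddings (normally produced by a pretrained gesture autoencoder, omitted here) are already given.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(feat_real, feat_gen):
    """FGD between two sets of gesture feature embeddings (rows = samples)."""
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

# Toy usage with random embeddings standing in for autoencoder features.
fgd = frechet_gesture_distance(np.random.randn(200, 32), np.random.randn(200, 32))
```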
https://arxiv.org/abs/2409.05010
In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition, since multiscale features can capture hands of varying size, pose, and shape, which is a challenge in hand gesture recognition. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures, which enhances the model's ability. This multiscale hierarchy is obtained by extracting attention at different dimensions in different transformer stages, with initial stages modeling high-resolution features and later stages modeling low-resolution features. Our approach also leverages multimodal data, utilizing depth maps, infrared data, and surface normals along with RGB images from the NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters. The source code is available at this https URL.
https://arxiv.org/abs/2409.03890
Unmanned Aerial Vehicles (UAVs) have greatly revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of UAV datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. These datasets play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking. They facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re-identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.
https://arxiv.org/abs/2409.03245
Smartphones and wearable devices have been integrated into our daily lives, offering personalized services. However, many apps become overprivileged as their collected sensing data contains unnecessary sensitive information. For example, mobile sensing data could reveal private attributes (e.g., gender and age) and unintended sensitive features (e.g., hand gestures when entering passwords). To prevent sensitive information leakage, existing methods must obtain private labels and users need to specify privacy policies. However, they only achieve limited control over information disclosure. In this work, we present Hippo to dissociate hierarchical information including private metadata and multi-grained activity information from the sensing data. Hippo achieves fine-grained control over the disclosure of sensitive information without requiring private labels. Specifically, we design a latent guidance-based diffusion model, which generates multi-grained versions of raw sensor data conditioned on hierarchical latent activity features. Hippo enables users to control the disclosure of sensitive information in sensing data, ensuring their privacy while preserving the necessary features to meet the utility requirements of applications. Hippo is the first unified model that achieves two goals: perturbing the sensitive attributes and controlling the disclosure of sensitive information in mobile sensing data. Extensive experiments show that Hippo can anonymize personal attributes and transform activity information at various resolutions across different types of sensing data.
https://arxiv.org/abs/2409.03796
Indonesia ranks fourth globally in the number of deaf cases. Individuals with hearing impairments often find communication challenging, necessitating the use of sign language. However, there are limited public services that offer such inclusivity. On the other hand, advancements in artificial intelligence (AI) present promising solutions to overcome communication barriers faced by the deaf. This study aims to explore the application of AI in developing models for a simplified sign language translation app and dictionary, designed for integration into public service facilities, to facilitate communication for individuals with hearing impairments, thereby enhancing inclusivity in public services. The researchers compared the performance of LSTM and 1D CNN + Transformer (1DCNNTrans) models for sign language recognition. Through rigorous testing and validation, it was found that the LSTM model achieved an accuracy of 94.67%, while the 1DCNNTrans model achieved an accuracy of 96.12%. Model performance evaluation indicated that although the LSTM exhibited lower inference latency, it showed weaknesses in classifying classes with similar keypoints. In contrast, the 1DCNNTrans model demonstrated greater stability and higher F1 scores for classes with varying levels of complexity compared to the LSTM model. Both models showed excellent performance, exceeding 90% validation accuracy and demonstrating rapid classification of 50 sign language gestures.
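As a rough sketch of the LSTM baseline compared in this study, the snippet below builds a small Keras LSTM classifier over per-frame keypoint vectors for 50 gesture classes. Sequence length, keypoint dimensionality, and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import tensorflow as tf

NUM_CLASSES, SEQ_LEN, KPT_DIM = 50, 30, 126   # e.g. 42 hand landmarks x (x, y, z) — assumed

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, KPT_DIM)),
    tf.keras.layers.LSTM(128, return_sequences=True),   # stacked LSTMs over keypoint frames
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```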
https://arxiv.org/abs/2409.01975
The surge of abnormal activities in crowded locations such as airports, train stations, bus stops, and shopping malls underscores the need for intelligent surveillance systems. An intelligent surveillance system can differentiate between normal and suspicious activities through real-time video analysis, enabling appropriate measures to be taken promptly and efficiently according to the severity of an anomaly. Video-based human activity recognition has intrigued many researchers with its pressing issues and a variety of applications ranging from simple hand gesture recognition to crucial behavior recognition in surveillance systems. This paper provides a critical survey of video-based Human Activity Recognition (HAR) techniques, beginning with an examination of basic approaches for detecting and recognizing suspicious behavior, followed by a critical analysis of machine learning and deep learning techniques such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Hidden Markov Models (HMM), and k-means clustering. A detailed investigation and comparison of these learning techniques is carried out on the basis of feature extraction techniques, parameter initialization, optimization algorithms, accuracy, etc. The purpose of this review is to prioritize promising schemes and to assist researchers with emerging advancements in the field's future endeavors. The paper also pragmatically discusses existing challenges in HAR and examines the field's prospects.
https://arxiv.org/abs/2409.00731
In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations that account for gestures' variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
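A standard way to implement the multimodal part of such pre-training is a symmetric InfoNCE objective that pairs gesture and speech embeddings from the same time window; a minimal sketch follows. The temperature, projection details, and symmetric form are assumptions and may differ from the paper's setup.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(gesture_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE over (B, D) gesture and speech embeddings, where row i
    of each modality comes from the same gesture-speech pair."""
    g = F.normalize(gesture_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = g @ s.t() / temperature                 # (B, B) cosine-similarity matrix
    targets = torch.arange(g.shape[0])               # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for the skeletal and speech encoders.
loss = multimodal_info_nce(torch.randn(32, 256), torch.randn(32, 256))
```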
https://arxiv.org/abs/2409.10535
Diphthong vowels exhibit a degree of inherent dynamic change, the extent of which can vary synchronically and diachronically, such that diphthong vowels can become monophthongs and vice versa. Modelling this type of change requires defining diphthongs in opposition to monophthongs. However, formulating an explicit definition has proven elusive in acoustics and articulation, as diphthongisation is often gradient in these domains. In this study, we consider whether diphthong vowels form a coherent phonetic category from the articulatory point of view. We present articulometry and acoustic data from six speakers of Northern Anglo-English producing a full set of phonologically long vowels. We analyse several measures of diphthongisation, all of which suggest that diphthongs are not categorically distinct from long monophthongs. We account for this observation with an Articulatory Phonology/Task Dynamic model in which diphthongs and long monophthongs have a common gestural representation, comprising two articulatory targets in each case, but they differ according to gestural constriction and location of the component gestures. We argue that a two-target representation for all long vowels is independently supported by phonological weight, as well as by the nature of historical diphthongisation and present-day dynamic vowel variation in British English.
https://arxiv.org/abs/2409.00275
In recent years, quadruped robots have attracted significant attention due to their practical advantages in maneuverability, particularly when navigating rough terrain and climbing stairs. As these robots become more integrated into various industries, including construction and healthcare, researchers have increasingly focused on developing intuitive interaction methods such as speech and gestures that do not require separate devices such as keyboards or joysticks. This paper aims to investigate a comfortable and efficient method of interacting with quadruped robots, which possess a familiar form factor. To this end, we conducted two preliminary studies to observe how individuals naturally interact with a quadruped robot in natural and controlled settings, followed by a prototype experiment examining human preferences for body-based and hand-based gesture controls using a Unitree Go1 Pro quadruped robot. We assessed the user experience of 13 participants using the User Experience Questionnaire and measured the time taken to complete specific tasks. Our preliminary findings indicate that humans have a natural preference for communicating with robots through hand and body gestures rather than speech. In addition, participants reported higher satisfaction and completed tasks more quickly when using body gestures to interact with the robot. This contradicts the fact that most gesture-based control technologies for quadruped robots are hand-based. The video is available at this https URL.
https://arxiv.org/abs/2408.17066
Speech dysfluency modeling is the core module for spoken language learning and speech therapy. However, there are three challenges. First, current state-of-the-art solutions suffer from poor scalability. Second, there is a lack of a large-scale dysfluency corpus. Third, there is no effective learning framework. In this paper, we propose SSDM: Scalable Speech Dysfluency Modeling, which (1) adopts articulatory gestures as a scalable forced alignment; (2) introduces the connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling. A demo is available at this https URL.
https://arxiv.org/abs/2408.16221