Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both high accuracy and computational cost concerns. The architecture follows a global-to-local scheme. Given the features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow on the 1/16 resolution, capturing large displacement, which is then refined on the 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on Jetson Orin Nano and RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods, while maintaining comparable accuracy. Our approach achieves around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL.
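The global-to-local scheme above can be sketched in a few lines: global matching correlates features between the two images at coarse (1/16) resolution to capture large displacements, and the resulting flow is upsampled to the next finer level for refinement. The sketch below is a minimal NumPy illustration of that idea, assuming cosine-normalized dot-product correlation and nearest-neighbour upsampling; NeuFlow's actual layers and refinement CNN differ.

```python
import numpy as np

def global_match(feat1, feat2):
    """Global matching at coarse (e.g. 1/16) resolution: for each position in
    feat1, pick the best-correlated position in feat2 and return the
    displacement as an initial flow field. feat1, feat2: (H, W, C)."""
    H, W, C = feat1.shape
    f1 = feat1.reshape(H * W, C)
    f2 = feat2.reshape(H * W, C)
    # Cosine-normalized correlation volume between all pairs of positions.
    f1 = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    f2 = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    corr = f1 @ f2.T                          # (H*W, H*W)
    best = corr.argmax(axis=1)                # best target index per source
    by, bx = np.unravel_index(best, (H, W))
    gy, gx = np.mgrid[0:H, 0:W]
    flow = np.stack([bx.reshape(H, W) - gx, by.reshape(H, W) - gy], axis=-1)
    return flow.astype(np.float32)

def upsample_flow(flow, scale=2):
    """Upsample a (H, W, 2) flow to the next finer level (e.g. 1/16 -> 1/8);
    displacement values scale with resolution. Lightweight refinement
    layers would then correct this initial estimate."""
    return flow.repeat(scale, axis=0).repeat(scale, axis=1) * scale
```

Shifting a feature map by one pixel and matching recovers a unit flow, which the finer level then refines.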
https://arxiv.org/abs/2403.10425
Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at this https URL.
https://arxiv.org/abs/2403.09975
This study addresses the prediction of geomagnetic disturbances by exploiting machine learning techniques. Specifically, the Long Short-Term Memory recurrent neural network, which is particularly suited to long time series, is employed in the analysis of in-situ measurements of solar wind plasma and magnetic field acquired over more than one solar cycle, from $2005$ to $2019$, at the Lagrangian point L$1$. The problem is approached as a binary classification aiming to predict, one hour in advance, a decrease in the SYM-H geomagnetic activity index below the threshold of $-50$ nT, which is generally regarded as indicative of magnetospheric perturbations. The strong class imbalance issue is tackled by using a loss function tailored to optimize appropriate skill scores in the training phase of the neural network. Besides classical skill scores, value-weighted skill scores are then employed to evaluate predictions; these are suitable for problems, such as the one faced here, characterized by strong temporal variability. For the first time, the content of magnetic helicity and energy carried by solar transients, associated with their detection and likelihood of geo-effectiveness, was considered among the input features of the network architecture. Its predictive capability is demonstrated through a correlation-driven feature selection method that ranks the most relevant characteristics involved in the neural network prediction model. The optimal performance of the adopted neural network in properly forecasting the onset of geomagnetic storms, which is a crucial point for issuing real warnings in an operational setting, is finally shown.
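The skill-score machinery the abstract refers to can be illustrated with the True Skill Statistic (TSS), a standard score for imbalanced storm/no-storm forecasting: it is the hit rate minus the false alarm rate, so a trivial "always quiet" classifier scores zero. This plain-Python sketch shows the kind of score such a loss is tailored to optimize; it is not the paper's exact loss function.

```python
def confusion_counts(y_true, y_pred):
    """Confusion-matrix counts for binary labels (1 = storm,
    i.e. SYM-H predicted to fall below -50 nT)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def true_skill_statistic(tp, fp, fn, tn):
    """TSS = hit rate - false alarm rate. Unlike accuracy, it is not
    inflated by the overwhelming majority of quiet-time samples, which is
    why skill scores are preferred for rare-event forecasting."""
    hit_rate = tp / (tp + fn) if tp + fn else 0.0
    false_alarm = fp / (fp + tn) if fp + tn else 0.0
    return hit_rate - false_alarm
```

With 2 storms out of 8 windows, one hit, one miss, and one false alarm, TSS is 1/2 - 1/6 = 1/3, even though plain accuracy would be a misleading 6/8.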
https://arxiv.org/abs/2403.09847
Segment anything models (SAMs) are gaining attention for their zero-shot generalization capability in segmenting objects of unseen classes and in unseen domains when properly prompted. Interactivity is a key strength of SAMs, allowing users to iteratively provide prompts that specify objects of interest to refine outputs. However, to realize the interactive use of SAMs for 3D medical imaging tasks, rapid inference times are necessary. High memory requirements and long processing delays remain constraints that hinder the adoption of SAMs for this purpose. Specifically, while 2D SAMs applied to 3D volumes contend with repetitive computation to process all slices independently, 3D SAMs suffer from an exponential increase in model parameters and FLOPs. To address these challenges, we present FastSAM3D which accelerates SAM inference to 8 milliseconds per 128×128×128 3D volumetric image on an NVIDIA A100 GPU. This speedup is accomplished through 1) a novel layer-wise progressive distillation scheme that enables knowledge transfer from a complex 12-layer ViT-B to a lightweight 6-layer ViT-Tiny variant encoder without training from scratch; and 2) a novel 3D sparse flash attention to replace vanilla attention operators, substantially reducing memory needs and improving parallelization. Experiments on three diverse datasets reveal that FastSAM3D achieves a remarkable speedup of 527.38x compared to 2D SAMs and 8.75x compared to 3D SAMs on the same volumes without significant performance decline. Thus, FastSAM3D opens the door for low-cost truly interactive SAM-based 3D medical imaging segmentation with commonly used GPU hardware. Code is available at this https URL.
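The layer-wise alignment idea behind the distillation scheme can be sketched as pairing each of the student's 6 layers with one of the teacher's 12 layers and penalizing the feature mismatch. The uniform "every second teacher layer" mapping and the plain MSE below are illustrative assumptions; the paper's progressive schedule is more involved.

```python
import numpy as np

def layerwise_distill_loss(teacher_feats, student_feats):
    """Align each student layer with a corresponding teacher layer.
    teacher_feats: list of 12 (N, D) feature arrays from the ViT-B teacher;
    student_feats: list of 6 (N, D) arrays from the ViT-Tiny student.
    Student layer i is paired with teacher layer 2*i + 1 (an assumed
    uniform mapping); the loss is the mean per-layer feature MSE."""
    assert len(teacher_feats) == 2 * len(student_feats)
    losses = []
    for i, s in enumerate(student_feats):
        t = teacher_feats[2 * i + 1]
        losses.append(np.mean((s - t) ** 2))
    return float(np.mean(losses))
```

A student whose features already match its assigned teacher layers incurs zero loss, so training pulls each student layer toward the behaviour of a deeper teacher prefix rather than starting from scratch.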
https://arxiv.org/abs/2403.09827
Grasping complex computing concepts often poses a challenge for students who struggle to anchor these new ideas to familiar experiences and understandings. To help with this, a good analogy can bridge the gap between unfamiliar concepts and familiar ones, providing an engaging way to aid understanding. However, creating effective educational analogies is difficult even for experienced instructors. We investigate to what extent large language models (LLMs), specifically ChatGPT, can provide access to personally relevant analogies on demand. Focusing on recursion, a challenging threshold concept, we conducted an investigation analyzing the analogies generated by more than 350 first-year computing students. They were provided with a code snippet and tasked to generate their own recursion-based analogies using ChatGPT, optionally including personally relevant topics in their prompts. We observed a great deal of diversity in the analogies produced with student-prescribed topics, in contrast to the otherwise generic analogies, highlighting the value of student creativity when working with LLMs. Not only did students enjoy the activity and report an improved understanding of recursion, but they described more easily remembering analogies that were personally and culturally relevant.
https://arxiv.org/abs/2403.09409
Enterprises and organizations are faced with potential threats from insider employees that may lead to serious consequences. Previous studies on insider threat detection (ITD) mainly focus on detecting abnormal users or abnormal time periods (e.g., a week or a day). However, a user may have hundreds of thousands of activities in the log, and even within a day there may exist thousands of activities for a user, requiring a high investigation budget to verify abnormal users or activities given the detection results. On the other hand, existing works are mainly post-hoc methods rather than real-time detection, which cannot report insider threats in time before they cause loss. In this paper, we conduct the first study towards real-time ITD at activity level, and present a fine-grained and efficient framework LAN. Specifically, LAN simultaneously learns the temporal dependencies within an activity sequence and the relationships between activities across sequences with graph structure learning. Moreover, to mitigate the data imbalance problem in ITD, we propose a novel hybrid prediction loss, which integrates self-supervision signals from normal activities and supervision signals from abnormal activities into a unified loss for anomaly detection. We evaluate the performance of LAN on two widely used datasets, i.e., CERT r4.2 and CERT r5.2. Extensive and comparative experiments demonstrate the superiority of LAN, outperforming 9 state-of-the-art baselines by at least 9.92% and 6.35% in AUC for real-time ITD on CERT r4.2 and r5.2, respectively. Moreover, LAN can also be applied to post-hoc ITD, surpassing 8 competitive baselines by at least 7.70% and 4.03% in AUC on the two datasets. Finally, the ablation study, parameter analysis, and compatibility analysis evaluate the impact of each module and hyper-parameter in LAN.
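The hybrid-loss idea above can be sketched as a weighted sum of a self-supervised term computed only on normal activities (here a reconstruction error, as in autoencoder-style anomaly detection) and a supervised cross-entropy term on the few labeled activities. The specific terms, the reconstruction proxy, and the `alpha` weighting are illustrative assumptions, not LAN's exact formulation.

```python
import math

def hybrid_loss(recon_errors, labels, scores, alpha=0.5):
    """recon_errors: per-activity self-supervised reconstruction errors;
    the self-supervised term averages them over normal activities only
    (labels == 0). labels/scores: binary labels and predicted anomaly
    probabilities, carrying the supervised signal from abnormal activities.
    Returns alpha * self-supervised term + (1 - alpha) * BCE term."""
    normal_errs = [e for e, y in zip(recon_errors, labels) if y == 0]
    self_sup = sum(normal_errs) / len(normal_errs) if normal_errs else 0.0
    eps = 1e-12  # numerical guard for log(0)
    bce = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
               for y, p in zip(labels, scores)) / len(labels)
    return alpha * self_sup + (1 - alpha) * bce
```

Unifying both signals lets the model exploit the abundant unlabeled-normal majority while still learning from the rare labeled anomalies, which is exactly the imbalance the abstract targets.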
https://arxiv.org/abs/2403.09209
Due to advancements in digital cameras, it is easy to gather multiple images (or videos) of an object under different conditions. Therefore, image-set classification has attracted more attention, and different solutions have been proposed to model them. A popular way to model image sets is subspaces, which form a manifold called the Grassmann manifold. In this contribution, we extend the application of Generalized Relevance Learning Vector Quantization to the Grassmann manifold. The proposed model returns a set of prototype subspaces and a relevance vector. While prototypes model typical behaviours within classes, the relevance factors specify the most discriminative principal vectors (or images) for the classification task. Both provide insight into the model's decisions by highlighting influential images and pixels for predictions. Moreover, because the model learns prototypes, its complexity during inference is independent of dataset size, unlike previous works. We applied it to several recognition tasks including handwritten digit recognition, face recognition, activity recognition, and object recognition. Experiments demonstrate that it outperforms previous works with lower complexity and can successfully model variation, such as handwriting style or lighting conditions. Moreover, the presence of relevances makes the model robust to the selection of the subspaces' dimensionality.
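The role of the relevance factors can be sketched via principal angles: the similarity between two subspaces is computed from the singular values of the product of their basis matrices, and each principal angle is weighted by a learned relevance. The NumPy sketch below assumes orthonormal basis matrices and a chordal-style distance; the paper's exact dissimilarity and learning rule may differ.

```python
import numpy as np

def relevance_distance(U1, U2, relevance):
    """Relevance-weighted distance between two d-dimensional subspaces of
    R^n, given as orthonormal basis matrices U1, U2 of shape (n, d).
    Principal angles come from the SVD of U1^T U2; each angle is weighted
    by a (learned, here: given) relevance factor, so the most
    discriminative principal directions dominate the comparison."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    s = np.clip(s, -1.0, 1.0)                # numerical safety for arccos
    angles = np.arccos(s)                    # principal angles, ascending
    return float(np.sum(relevance * np.sin(angles) ** 2))
```

Identical subspaces give distance zero; orthogonal ones give the sum of the relevances, so down-weighting an unreliable principal direction (e.g., one dominated by lighting) directly shrinks its influence on classification.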
https://arxiv.org/abs/2403.09183
The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models by reducing the inference time or increasing user interactivity by allowing new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating both branches of works is nontrivial, limiting the potential of diffusion models. To solve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve $\times 10$ faster panorama generation than existing solutions, and the generation speed of 1.57 FPS in region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation named semantic palette, where high-quality images are generated in real-time from given multiple hand-drawn regions, encoding prescribed semantic meanings (e.g., eagle, girl). Our code and demo application are available at this https URL.
https://arxiv.org/abs/2403.09055
The development of techniques to analyze and detect animal behavior is a crucial activity for the livestock sector, as it makes it possible to monitor stress and animal welfare and contributes to decision making on the farm. Because animal behavior is traditionally analyzed by humans, a process that is error-prone and time-consuming, applications that automate this analysis can assist breeders in making decisions to improve production performance and reduce costs. Aggressiveness in pigs is an example of a behavior studied to reduce its impact through animal classification and identification. However, this process is laborious and susceptible to errors, which can be reduced through automation by visually classifying videos captured in a controlled environment. The captured videos can be used for training and, as a result, for classification through computer vision and artificial intelligence, employing neural network techniques. The main techniques utilized in this study are variants of transformers: STAM, TimeSformer, and ViViT, as well as techniques using convolutions, such as ResNet3D2, ResNet(2+1)D, and CnnLstm. These techniques were employed for pig video classification with the objective of identifying aggressive and non-aggressive behaviors. In this work, various techniques were compared to analyze the contribution of transformers, in addition to the effectiveness of convolutional techniques, in video classification. The performance was evaluated using accuracy, precision, and recall. The TimeSformer technique showed the best results in video classification, with a median accuracy of 0.729.
https://arxiv.org/abs/2403.08528
Predictive process monitoring is a process mining task aimed at forecasting information about a running process trace, such as the most correct next activity to be executed. In medical domains, predictive process monitoring can provide valuable decision support in atypical and nontrivial situations. Decision support and quality assessment in medicine cannot ignore domain knowledge, in order to be grounded on all the available information (which is not limited to data) and to be truly acceptable to end users. In this paper, we propose a predictive process monitoring approach relying on the use of a {\em transformer}, a deep learning architecture based on the attention mechanism. A major contribution of our work lies in the incorporation of ontological domain-specific knowledge, carried out through a graph positional encoding technique. The paper presents and discusses the encouraging experimental results we are collecting in the domain of stroke management.
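Graph positional encodings are commonly derived from eigenvectors of the graph Laplacian, giving each node (here, each concept in the ontology or activity in the process graph) a position vector that a transformer can consume alongside token embeddings. The sketch below shows this general technique on a toy graph; it is an assumption about the family of methods, not the paper's exact encoding.

```python
import numpy as np

def laplacian_positional_encoding(adj, k):
    """Return the k Laplacian eigenvectors with the smallest nonzero
    eigenvalues: one k-dimensional positional vector per node, which can
    be added to (or concatenated with) token embeddings so the transformer
    sees the graph structure. adj: symmetric (n, n) adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                           # combinatorial Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)    # ascending eigenvalues
    return eigvecs[:, 1:k + 1]                # skip the constant eigenvector
```

Nodes that are close in the graph receive similar encodings, which is the structural signal the attention layers otherwise lack.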
https://arxiv.org/abs/2403.08836
Natural disasters and urban accidents drive the demand for rescue robots to provide safer, faster, and more efficient rescue trajectories. In this paper, a feature learning-based bio-inspired neural network (FLBBINN) is proposed to quickly generate a heuristic rescue path in complex and dynamic environments, as traditional approaches usually cannot respond satisfactorily in real time to sudden environmental changes. The neurodynamic model is incorporated into the feature learning method, which can use environmental information to improve path planning strategies. Task assignments and collision-free rescue trajectories are generated from robot poses and the dynamic landscape of neural activity. A dual-channel scale filter, a neural activity channel, and a secondary distance fusion are employed to extract and filter feature neurons. After completion of the feature learning process, a neurodynamics-based feature matrix is established to quickly generate new heuristic rescue paths with parameter-driven topological adaptability. The proposed FLBBINN aims to reduce the computational complexity of the neural network-based approach and enable the feature learning method to achieve real-time responses to environmental changes. Several simulations and experiments have been conducted to evaluate the performance of the proposed FLBBINN. The results show that the proposed FLBBINN significantly improves the speed, efficiency, and optimality of rescue operations.
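The "dynamic landscape of neural activity" idea common to bio-inspired neural network planners can be sketched on a grid: the target neuron is clamped to maximal activity, activity propagates to neighbouring neurons with decay, obstacle neurons stay inactive, and the robot's path simply climbs the activity gradient. The decay factor, neighbourhood, and update rule below are illustrative assumptions, not FLBBINN's specific neurodynamic model.

```python
def activity_landscape(rows, cols, target, obstacles, gamma=0.8, iters=50):
    """Propagate neural activity from the target across a grid with decay
    factor gamma; obstacle neurons are clamped to zero activity."""
    act = [[0.0] * cols for _ in range(rows)]
    for _ in range(iters):
        for r in range(rows):
            for c in range(cols):
                if (r, c) in obstacles:
                    continue
                if (r, c) == target:
                    act[r][c] = 1.0
                    continue
                nbrs = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                best = max((act[nr][nc] for nr, nc in nbrs
                            if 0 <= nr < rows and 0 <= nc < cols), default=0.0)
                act[r][c] = gamma * best

    return act

def climb(act, start, target, obstacles):
    """Follow the activity gradient from start to target. Activity strictly
    increases along the path (gamma < 1), so the walk cannot cycle."""
    path, pos = [start], start
    while pos != target:
        r, c = pos
        nbrs = [(nr, nc) for nr, nc in
                [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                if 0 <= nr < len(act) and 0 <= nc < len(act[0])
                and (nr, nc) not in obstacles]
        pos = max(nbrs, key=lambda p: act[p[0]][p[1]])
        path.append(pos)
    return path
```

Because activity decays as gamma per step from the target, climbing the gradient yields a shortest obstacle-avoiding path, and re-running the propagation after an environmental change updates the whole landscape at once.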
https://arxiv.org/abs/2403.08238
Traditional deep learning methods struggle to simultaneously segment, recognize, and forecast human activities from sensor data. This limits their usefulness in many fields such as healthcare and assisted living, where real-time understanding of ongoing and upcoming activities is crucial. This paper introduces P2LHAP, a novel Patch-to-Label Seq2Seq framework that tackles all three tasks in an efficient single-task model. P2LHAP divides sensor data streams into a sequence of "patches", which serve as input tokens, and outputs a sequence of patch-level activity labels including the predicted future activities. A unique smoothing technique based on surrounding patch labels is proposed to identify activity boundaries accurately. Additionally, P2LHAP learns patch-level representations with sensor-channel-independent Transformer encoders and decoders. All channels share embedding and Transformer weights across all sequences. Evaluated on three public datasets, P2LHAP significantly outperforms the state-of-the-art in all three tasks, demonstrating its effectiveness and potential for real-world applications.
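The boundary-smoothing step can be sketched as a majority vote over surrounding patch labels, which suppresses isolated mislabelled patches so that activity segments stay contiguous. The window size and the plain voting rule below are illustrative assumptions rather than P2LHAP's exact technique.

```python
from collections import Counter

def smooth_patch_labels(labels, window=1):
    """Replace each patch label by the majority label among the patches
    within +/- window positions. Isolated outlier labels are absorbed by
    their neighbours, so activity boundaries become clean transitions."""
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        votes = Counter(labels[lo:hi])
        # most_common sorts by count; ties keep the first-encountered label
        smoothed.append(votes.most_common(1)[0][0])
    return smoothed
```

A single spurious "run" patch inside a walking segment is voted away, while the genuine walk-to-sit boundary survives.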
https://arxiv.org/abs/2403.08214
In today's digital landscape, the Web has become increasingly centralized, raising concerns about user privacy violations. Decentralized Web architectures, such as Solid, offer a promising solution by empowering users with better control over their data in their personal `Pods'. However, a significant challenge remains: users must navigate numerous applications to decide which application can be trusted with access to their data Pods. This often involves reading lengthy and complex Terms of Use agreements, a process that users often find daunting or simply ignore. This compromises user autonomy and impedes detection of data misuse. We propose a novel formal description of Data Terms of Use (DToU), along with a DToU reasoner. Users and applications specify their own parts of the DToU policy with local knowledge, covering permissions, requirements, prohibitions and obligations. Automated reasoning verifies compliance, and also derives policies for output data. This constitutes a ``perennial'' DToU language, where the policy authoring only occurs once, and we can conduct ongoing automated checks across users, applications and activity cycles. Our solution is built on Turtle, Notation 3 and RDF Surfaces, for the language and the reasoning engine. It ensures seamless integration with other semantic tools for enhanced interoperability. We have successfully integrated this language into the Solid framework, and conducted performance benchmarks. We believe this work demonstrates the practicality of a perennial DToU language and the potential of a paradigm shift in how users interact with data and applications in a decentralized Web, offering both improved privacy and usability.
https://arxiv.org/abs/2403.07587
In this paper, we present a novel approach for joint activity detection (AD), channel estimation (CE), and data detection (DD) in uplink grant-free non-orthogonal multiple access (NOMA) systems. Our approach employs an iterative and parallel interference removal strategy inspired by parallel interference cancellation (PIC), enhanced with deep learning to jointly tackle the AD, CE, and DD problems. Based on this approach, we develop three PIC frameworks, each of which is designed for either coherent or non-coherent schemes. The first framework performs joint AD and CE using received pilot signals in the coherent scheme. Building upon this framework, the second framework utilizes both the received pilot and data signals for CE, further enhancing the performances of AD, CE, and DD in the coherent scheme. The third framework is designed to accommodate the non-coherent scheme involving a small number of data bits, which simultaneously performs AD and DD. Through joint loss functions and interference cancellation modules, our approach supports end-to-end training, contributing to enhanced performances of AD, CE, and DD for both coherent and non-coherent schemes. Simulation results demonstrate the superiority of our approach over traditional techniques, exhibiting enhanced performances of AD, CE, and DD while maintaining lower computational complexity.
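Classical parallel interference cancellation, which the frameworks above build on, can be sketched with NumPy for a toy spread-spectrum uplink: every user is detected, each user's contribution is reconstructed, and all users are re-detected in parallel after subtracting the others' interference. The spreading codes, channel gains, and hard-decision BPSK detector are illustrative assumptions; the paper replaces such hand-crafted stages with learned, end-to-end-trained modules.

```python
import numpy as np

def pic_detect(y, S, h, iters=3):
    """Parallel interference cancellation for y = S @ (h * x) + noise,
    with BPSK symbols x in {-1, +1}. S: (N, K) unit-norm spreading codes,
    h: (K,) real channel gains. At each iteration every user is re-detected
    in parallel after subtracting the other users' reconstructed signals."""
    K = S.shape[1]
    x_hat = np.sign(S.T @ y)                 # matched-filter initialization
    x_hat[x_hat == 0] = 1
    for _ in range(iters):
        new = np.empty(K)
        for k in range(K):
            # interference = everything except user k's own contribution
            interference = S @ (h * x_hat) - S[:, k] * h[k] * x_hat[k]
            zk = S[:, k] @ (y - interference)
            new[k] = 1.0 if zk >= 0 else -1.0
        x_hat = new
    return x_hat
```

With non-orthogonal codes the matched filter alone is biased by cross-correlation; subtracting reconstructed interference removes that bias, which is the effect the learned PIC modules amplify.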
https://arxiv.org/abs/2403.07255
We propose FocusCLIP, integrating subject-level guidance--a specialized mechanism for target-specific supervision--into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our novel contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps emulating human visual attention mechanisms to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions to provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpassed CLIP by an average of 8.61% across five previously unseen datasets covering three human-centric tasks. FocusCLIP achieved an average accuracy of 33.65% compared to 25.04% by CLIP. We observed a 3.98% improvement in activity recognition, a 14.78% improvement in age classification, and a 7.06% improvement in emotion recognition. Moreover, using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. Furthermore, we also demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks. FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification using the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
https://arxiv.org/abs/2403.06904
Human-robot physical interaction contains crucial information for optimizing user experience, enhancing robot performance, and objectively assessing user adaptation. This study introduces a new method to evaluate human-robot co-adaptation in lower limb exoskeletons by analyzing muscle activity and interaction torque as a two-dimensional random variable. We introduce the Interaction Portrait (IP), which visualizes this variable's distribution in polar coordinates. We applied this metric to compare a recent torque controller (HTC) based on kinematic state feedback and a novel feedforward controller (AMTC) with online learning, proposed herein, against a time-based controller (TBC) during treadmill walking at varying speeds. Compared to TBC, both HTC and AMTC significantly lower users' normalized oxygen uptake, suggesting enhanced user-exoskeleton coordination. IP analysis reveals this improvement stems from two distinct co-adaptation strategies, unidentifiable by traditional muscle activity or interaction torque analyses alone. HTC encourages users to yield control to the exoskeleton, decreasing muscular effort but increasing interaction torque, as the exoskeleton compensates for user dynamics. Conversely, AMTC promotes user engagement through increased muscular effort and reduced interaction torques, aligning it more closely with rehabilitation and gait training applications. IP phase evolution provides insight into each user's interaction strategy development, showcasing IP analysis's potential in comparing and designing novel controllers to optimize human-robot interaction in wearable robots.
https://arxiv.org/abs/2403.06851
Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories based on short segments of kinematic and video data. We conduct an ablation study to evaluate the impact of fusing different input modalities and their representations on gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture using the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state-of-the-art (SOTA) with 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. It achieves real-time performance, processing a 1-second input window in 1.1-1.3 ms, by relying on a computationally efficient model.
https://arxiv.org/abs/2403.06705
Context-aware Human Activity Recognition (HAR) is a hot research area in mobile computing, and the most effective solutions in the literature are based on supervised deep learning models. However, the actual deployment of these systems is limited by the scarcity of labeled data that is required for training. Neuro-Symbolic AI (NeSy) provides an interesting research direction to mitigate this issue, by infusing common-sense knowledge about human activities and the contexts in which they can be performed into HAR deep learning classifiers. Existing NeSy methods for context-aware HAR rely on knowledge encoded in logic-based models (e.g., ontologies) whose design, implementation, and maintenance to capture new activities and contexts require significant human engineering efforts, technical knowledge, and domain expertise. Recent works show that pre-trained Large Language Models (LLMs) effectively encode common-sense knowledge about human activities. In this work, we propose ContextGPT: a novel prompt engineering approach to retrieve from LLMs common-sense knowledge about the relationship between human activities and the context in which they are performed. Unlike ontologies, ContextGPT requires limited human effort and expertise. An extensive evaluation carried out on two public datasets shows how a NeSy model obtained by infusing common-sense knowledge from ContextGPT is effective in data scarcity scenarios, leading to similar (and sometimes better) recognition rates than logic-based approaches with a fraction of the effort.
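A prompt-engineering step in the spirit of ContextGPT can be sketched as building a natural-language query that asks an LLM which activities are consistent with the user's current context; this template is a hypothetical illustration, as the paper's actual prompts are not given in the abstract:

```python
def build_context_prompt(context, candidate_activities):
    """Assemble a hypothetical ContextGPT-style prompt: describe the user's
    context and ask the LLM to select the plausible activities from a
    fixed candidate set (the HAR classifier's label space)."""
    facts = "; ".join(f"{k}: {v}" for k, v in sorted(context.items()))
    activities = ", ".join(candidate_activities)
    return (
        "Given a user's current context (" + facts + "), "
        "which of the following activities are plausible? "
        "Answer with a comma-separated subset of: " + activities + "."
    )

prompt = build_context_prompt(
    {"location": "gym", "time": "18:30", "speed_kmh": "9"},
    ["running", "sleeping", "cycling"],
)
```

The LLM's answer would then play the role the ontology plays in earlier NeSy systems: a common-sense constraint infused into the deep HAR classifier, obtained without hand-engineering a logic-based knowledge model.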
https://arxiv.org/abs/2403.06586
Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, considering that it is also applied to VAD segments during test, and show that this results in a relative reduction of Speaker Error Rate (SER) of up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than annotated speaker segments results in a relative SER reduction of up to 20%.
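The Speaker Error Rate metric reported here can be illustrated with a toy word-level version: the fraction of correctly recognized words attributed to the wrong speaker. Real SA-ASR scoring also handles insertions, deletions, and speaker-label permutations; this sketch assumes pre-aligned word sequences:

```python
def speaker_error_rate(ref, hyp):
    """Toy word-level SER: among aligned (word, speaker) pairs, count words
    that were recognized correctly but assigned to the wrong speaker.
    A simplification of proper SA-ASR scoring, for illustration only."""
    assert len(ref) == len(hyp), "sketch assumes pre-aligned sequences"
    errors = sum(
        1
        for (ref_word, ref_spk), (hyp_word, hyp_spk) in zip(ref, hyp)
        if ref_word == hyp_word and ref_spk != hyp_spk
    )
    return errors / len(ref)

ref = [("hello", "A"), ("there", "A"), ("yes", "B")]
hyp = [("hello", "A"), ("there", "B"), ("yes", "B")]
ser = speaker_error_rate(ref, hyp)   # 1 wrongly attributed word out of 3
```

A "relative SER reduction of 28%" then means the fine-tuned system's SER is 28% lower than the baseline's (e.g., 0.072 instead of 0.10), not an absolute 28-point drop.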
https://arxiv.org/abs/2403.06570
This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders: contrastive predictive coding (CPC) pre-trained on English, and a recent model based on multilingual wav2vec 2.0 (MMS).
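The "projection" in VAP refers to predicting each participant's voice activity over a discretized future horizon. A minimal sketch of how such a prediction target can be built from binary voice-activity frames follows; the bin sizes here are illustrative, not the published model's bin durations:

```python
def vap_target(future_activity, bin_sizes):
    """Build a VAP-style training target: for each speaker, mark a future
    bin active if voice activity covers a majority of its frames.

    future_activity: one list of 0/1 frames per speaker (dyadic: two lists).
    bin_sizes: number of frames in each consecutive future bin.
    """
    targets = []
    for frames in future_activity:
        bins, start = [], 0
        for size in bin_sizes:
            chunk = frames[start:start + size]
            bins.append(1 if sum(chunk) * 2 > len(chunk) else 0)
            start += size
        targets.append(bins)
    return targets

# Two speakers, 10 future frames each, binned into windows of 2/3/5 frames:
# speaker 1 is finishing a turn, speaker 2 is about to take over.
t = vap_target([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], [2, 3, 5])
```

The cross-attention Transformer described in the abstract would be trained to output these per-speaker future-bin activations from the two participants' audio streams.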
https://arxiv.org/abs/2403.06487