Multimodal AI Agents are AI models that can interactively and cooperatively assist human users in solving day-to-day tasks. Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI Agents. Such AR capabilities help AI Agents see and hear the actions users take, paralleling the multimodal perception of human users. Existing AI Agents, whether Large Language Models (LLMs) or Multimodal Vision-Language Models (VLMs), are reactive in nature: they cannot take an action without reading or listening to the human user's prompts. Proactivity, on the other hand, lets an AI Agent help the human user detect and correct mistakes in agent-observed tasks, encourage users when they perform tasks correctly, or simply engage in conversation with the user, akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances that may require the agent to intervene proactively. This allows the agent to understand when it can intervene in a conversation with human users, for example to help the user correct mistakes in tasks such as cooking, using AR. Our YETI Agent learns scene-understanding signals based on an interpretable notion of Structural Similarity (SSIM) between consecutive video frames. We also define an alignment signal with which the AI Agent can learn to identify whether the video frames corresponding to the user's actions on the task are consistent with the expected actions. These signals are used by our AI Agent to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark, in which an expert agent guides a user to complete procedural tasks.
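A minimal sketch of how a frame-level SSIM signal of the kind described above could be computed, assuming OpenCV and scikit-image are available; the threshold and the reading of low SSIM as a scene change are illustrative assumptions, not the paper's trained model:

    # Sketch: SSIM between consecutive grayscale frames as a scene-change signal.
    import cv2
    from skimage.metrics import structural_similarity as ssim

    def frame_change_signal(video_path, ssim_threshold=0.75):
        """Yield (frame_index, ssim_score, changed) for consecutive frame pairs."""
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        idx = 1
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            score = ssim(prev_gray, gray)                 # 1.0 means identical frames
            yield idx, score, score < ssim_threshold      # low SSIM -> likely scene change
            prev_gray, idx = gray, idx + 1
        cap.release()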
https://arxiv.org/abs/2501.09355
Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
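To make the "in tandem" idea concrete, a hypothetical decision rule over the two models' outputs might look like the following; the probability names and thresholds are illustrative assumptions, not the parameters of the evaluated system:

    # Hypothetical fusion of a lexical end-of-turn probability (TurnGPT-style) with a
    # voice-activity projection score (VAP-style); thresholds are illustrative only.
    def turn_action(end_of_turn_prob, user_continue_prob):
        if end_of_turn_prob > 0.8 and user_continue_prob < 0.3:
            return "take_turn"          # user has likely finished; robot may speak
        if end_of_turn_prob > 0.4:
            return "prepare_response"   # start LLM generation early but hold the floor
        return "wait"                   # keep listening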
https://arxiv.org/abs/2501.08946
UV map estimation is used in computer vision for detailed analysis of human posture or activity. Previous methods assign pixels to body model vertices by comparing pixel descriptors independently, without enforcing global coherence or plausibility in the UV map. We propose Pose-Constrained Continuous Surface Embeddings (PC-CSE), which integrates estimated 2D human pose into the pixel-to-vertex assignment process. The pose provides global anatomical constraints, ensuring that UV maps remain coherent while preserving local precision. Evaluation on DensePose COCO demonstrates consistent improvement, regardless of the chosen 2D human pose model. Whole-body poses offer better constraints by incorporating additional details about the hands and feet. Conditioning UV maps with human pose reduces invalid mappings and enhances anatomical plausibility. In addition, we highlight inconsistencies in the ground-truth annotations.
https://arxiv.org/abs/2501.08815
Extended Reality is a revolutionary method of delivering multimedia content to users. A large contributor to its popularity is the sense of immersion and interactivity enabled by having real-world motion reflected in the virtual experience accurately and immediately. This user motion, mainly caused by head rotations, induces several technical challenges. For instance, which content is generated and transmitted depends heavily on where the user is looking. Seamless systems, taking user motion into account proactively, will therefore require accurate predictions of upcoming rotations. Training and evaluating such predictors requires vast amounts of orientational input data, which is expensive to gather, as it requires human test subjects. A more feasible approach is to gather a modest dataset through test subjects, and then extend it to a more sizeable set using synthetic data generation methods. In this work, we present a head rotation time series generator based on TimeGAN, an extension of the well-known Generative Adversarial Network, designed specifically for generating time series. This approach is able to extend a dataset of head rotations with new samples closely matching the distribution of the measured time series.
https://arxiv.org/abs/2501.09050
Seasickness is a prevalent issue that adversely impacts both passenger experiences and the operational efficiency of maritime crews. While techniques that redirect attention have proven effective in alleviating motion sickness symptoms in terrestrial environments, applying similar strategies to manage seasickness poses unique challenges due to the prolonged and intense motion environment associated with maritime travel. In this study, we propose a mindfulness brain-computer interface (BCI), specifically designed to redirect attention with the aim of mitigating seasickness symptoms in real-world settings. Our system utilizes a single-channel headband to capture prefrontal EEG signals, which are then wirelessly transmitted to computing devices for the assessment of mindfulness states. The results are translated into real-time feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in attentional focus from physiological discomfort to mindfulness practices. A total of 43 individuals participated in a real-world maritime experiment consisting of three sessions: a real-feedback mindfulness session, a resting session, and a pseudofeedback mindfulness session. Notably, 81.39% of participants reported that the mindfulness BCI intervention was effective, and there was a significant reduction in the severity of seasickness, as measured by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in the theta/beta ratio, corresponding with the alleviation of seasickness symptoms. A decrease in overall EEG band power during the real-feedback mindfulness session suggests that the mindfulness BCI fosters a more tranquil and downregulated state of brain activity. Together, this study presents a novel nonpharmacological, portable, and effective approach for seasickness intervention, with the potential to enhance the cruising experience for both passengers and crews.
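For context, the theta/beta ratio reported above is typically computed from spectral band-power estimates; a sketch using conventional band limits (theta 4-8 Hz, beta 13-30 Hz) and an assumed sampling rate, not the study's exact settings:

    # Sketch: theta/beta ratio from a single-channel prefrontal EEG segment.
    import numpy as np
    from scipy.signal import welch

    def theta_beta_ratio(eeg, fs=250.0):
        """eeg: 1-D signal; fs: sampling rate in Hz (assumed value)."""
        freqs, psd = welch(eeg, fs=fs, nperseg=int(2 * fs))   # 2-second Welch windows
        def band_power(lo, hi):
            mask = (freqs >= lo) & (freqs < hi)
            return np.trapz(psd[mask], freqs[mask])
        return band_power(4.0, 8.0) / band_power(13.0, 30.0)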
https://arxiv.org/abs/2501.08518
Human Activity Recognition (HAR) has gained significant importance with the growing use of sensor-equipped devices and large datasets. This paper evaluates the performance of three categories of models: classical machine learning, deep learning architectures, and Restricted Boltzmann Machines (RBMs), using five key HAR benchmark datasets (UCI-HAR, OPPORTUNITY, PAMAP2, WISDM, and Berkeley MHAD). We assess various models, including Decision Trees, Random Forests, Convolutional Neural Networks (CNN), and Deep Belief Networks (DBNs), using metrics such as accuracy, precision, recall, and F1-score for a comprehensive comparison. The results show that CNN models offer superior performance across all datasets, especially on the Berkeley MHAD. Classical models like Random Forest do well on smaller datasets but face challenges with larger, more complex data. RBM-based models also show notable potential, particularly for feature learning. This paper offers a detailed comparison to help researchers choose the most suitable model for HAR tasks.
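A compressed sketch of this kind of comparison using scikit-learn classifiers and metrics; dataset loading, windowing, and the deep models are omitted, and the hyperparameters are placeholders rather than the paper's configurations:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    def compare_models(models, X_train, y_train, X_test, y_test):
        results = {}
        for name, model in models.items():
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            results[name] = {
                "accuracy": accuracy_score(y_test, y_pred),
                "precision": precision_score(y_test, y_pred, average="macro"),
                "recall": recall_score(y_test, y_pred, average="macro"),
                "f1": f1_score(y_test, y_pred, average="macro"),
            }
        return results

    models = {
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(n_estimators=100),
    }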
https://arxiv.org/abs/2501.08471
The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
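The overall pipeline can be pictured as three frozen components chained together; in this sketch they are passed in as callables, since the actual model wrappers (a LLaMA-3 rephraser, a MiniGPT-v2-based span generator, and a VideoChatGPT-based scorer) are assumed rather than reproduced:

    # Structural sketch of a tuning-free, zero-shot VMR pipeline with frozen models.
    def zero_shot_vmr(query, video, rephrase_llm, span_generator, span_scorer, top_k=1):
        clean_query = rephrase_llm(query)                    # mitigate language bias
        candidates = span_generator(video, clean_query)      # list of (start_s, end_s) spans
        scored = [(span_scorer(video, clean_query, span), span) for span in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [span for _, span in scored[:top_k]]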
https://arxiv.org/abs/2501.07972
Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. Effective workflow adjustment is crucial, as in many real-world scenarios the initial plan must adapt to unforeseen challenges and changing conditions in real time to ensure the efficient execution of complex tasks. In this paper, we define workflows as activity-on-vertex (AOV) graphs. We continuously refine the workflow by dynamically adjusting task allocations with LLM agents, based on historical performance and the previous AOV graph. To further enhance system performance, we emphasize modularity in workflow design, measured in terms of parallelism and dependence complexity. Our proposed multi-agent framework achieves efficient concurrent execution of sub-tasks, goal achievement, and error tolerance. Empirical results across different practical tasks demonstrate dramatic improvements in the efficiency of multi-agent frameworks through dynamic workflow updating and modularization.
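A minimal sketch of representing a workflow as an AOV graph with networkx, with a simple stand-in measure of parallelism; the task names and the level-counting metric are illustrative, not the paper's formulation:

    import networkx as nx

    # An AOV graph: vertices are tasks, edges are dependencies.
    workflow = nx.DiGraph()
    workflow.add_edges_from([
        ("draft_plan", "write_code"),
        ("draft_plan", "write_tests"),
        ("write_code", "integrate"),
        ("write_tests", "integrate"),
    ])
    assert nx.is_directed_acyclic_graph(workflow)   # an AOV graph must stay acyclic

    # Group tasks into dependency levels; tasks on the same level can run concurrently.
    level = {}
    for node in nx.topological_sort(workflow):
        preds = list(workflow.predecessors(node))
        level[node] = 1 + max((level[p] for p in preds), default=0)
    max_parallelism = max(list(level.values()).count(l) for l in set(level.values()))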
https://arxiv.org/abs/2501.07834
Conventional human activity recognition (HAR) relies on classifiers trained to predict discrete activity classes, inherently limiting recognition to activities explicitly present in the training set. Such classifiers invariably fail when encountering unseen activities, assigning them zero likelihood. We propose Open Vocabulary HAR (OV-HAR), a framework that overcomes this limitation by first converting each activity into natural language and breaking it into a sequence of elementary motions. This descriptive text is then encoded into a fixed-size embedding. The model is trained to regress this embedding, which is subsequently decoded back into natural language using a pre-trained embedding inversion model. Unlike other works that rely on auto-regressive large language models (LLMs) at their core, OV-HAR achieves open vocabulary recognition without the computational overhead of such models. The generated text can be transformed into a single activity class using LLM prompt engineering. We have evaluated our approach on different modalities, including vision (pose), IMU, and pressure sensors, demonstrating robust generalization across unseen activities and modalities, offering a fundamentally different paradigm from contemporary classifiers.
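As a rough illustration of regressing and then decoding a text embedding, a nearest-neighbour decode over candidate descriptions can stand in for the pre-trained embedding-inversion model; encode_text here is a hypothetical frozen sentence encoder, not part of the paper:

    import numpy as np

    def decode_embedding(pred_emb, candidate_texts, encode_text):
        """Return the candidate description closest (by cosine similarity) to the regressed embedding."""
        cand = np.stack([encode_text(t) for t in candidate_texts])
        cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
        pred = pred_emb / np.linalg.norm(pred_emb)
        return candidate_texts[int(np.argmax(cand @ pred))]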
https://arxiv.org/abs/2501.07408
The Internet of Things (IoT) and mobile technology have significantly transformed healthcare by enabling real-time monitoring and diagnosis of patients. Recognizing medical-related human activities (MRHA) is pivotal for healthcare systems, particularly for identifying actions that are critical to patient well-being. However, challenges such as high computational demands, low accuracy, and limited adaptability persist in Human Motion Recognition (HMR). While some studies have integrated HMR with IoT for real-time healthcare applications, limited research has focused on recognizing MRHA as essential for effective patient monitoring. This study proposes a novel HMR method for MRHA detection, leveraging multi-stage deep learning techniques integrated with IoT. The approach employs EfficientNet to extract optimized spatial features from skeleton frame sequences using seven Mobile Inverted Bottleneck Convolutions (MBConv) blocks, followed by ConvLSTM to capture spatio-temporal patterns. A classification module with global average pooling, a fully connected layer, and a dropout layer generates the final predictions. The model is evaluated on the NTU RGB+D 120 and HMDB51 datasets, focusing on MRHA, such as sneezing, falling, walking, sitting, etc. It achieves 94.85% accuracy for cross-subject evaluations and 96.45% for cross-view evaluations on NTU RGB+D 120, along with 89.00% accuracy on HMDB51. Additionally, the system integrates IoT capabilities using a Raspberry Pi and GSM module, delivering real-time alerts via Twilio's SMS service to caregivers and patients. This scalable and efficient solution bridges the gap between HMR and IoT, advancing patient monitoring, improving healthcare outcomes, and reducing costs.
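A condensed Keras sketch of the described layout (per-frame spatial features, ConvLSTM over time, then pooling, dropout, and a dense classifier); the frame size, sequence length, class count, and the use of an off-the-shelf EfficientNet-B0 backbone are placeholders rather than the paper's exact MBConv configuration:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    num_frames, h, w, c, num_classes = 16, 64, 64, 3, 12
    backbone = tf.keras.applications.EfficientNetB0(include_top=False, weights=None,
                                                    input_shape=(h, w, c))

    inputs = layers.Input(shape=(num_frames, h, w, c))
    x = layers.TimeDistributed(backbone)(inputs)                   # per-frame spatial features
    x = layers.ConvLSTM2D(64, kernel_size=3, padding="same")(x)    # spatio-temporal patterns
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])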
https://arxiv.org/abs/2501.07039
Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings. Leveraging the advanced text processing capabilities of large language models (LLMs), X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset covering a wide range of daily life scenarios, resulting in 432 simulated video life logs that mirror realistic daily activities in contextually rich scenarios. The video life-log durations span from 23 minutes to 16.4 hours. The evaluation of several baseline systems and multimodal large language models (MLLMs) reveals their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding and underscoring the need for more advanced models.
https://arxiv.org/abs/2501.06835
Temporal sentence grounding in videos (TSGV) faces challenges due to public TSGV datasets containing significant temporal biases, which are attributed to the uneven temporal distributions of target moments. Existing methods generate augmented videos, where target moments are forced to have varying temporal locations. However, since the video lengths of the given datasets have small variations, only changing the temporal locations results in poor generalization ability in videos with varying lengths. In this paper, we propose a novel training framework complemented by diversified data augmentation and a domain discriminator. The data augmentation generates videos with various lengths and target moment locations to diversify temporal distributions. However, augmented videos inevitably exhibit distinct feature distributions which may introduce noise. To address this, we design a domain adaptation auxiliary task to diminish feature discrepancies between original and augmented videos. We also encourage the model to produce distinct predictions for videos with the same text queries but different moment locations to promote debiased training. Experiments on Charades-CD and ActivityNet-CD datasets demonstrate the effectiveness and generalization abilities of our method in multiple grounding structures, achieving state-of-the-art results.
https://arxiv.org/abs/2501.06746
Speech enhancement (SE) aims to improve the clarity, intelligibility, and quality of speech signals for various speech-enabled applications. However, air-conducted (AC) speech is highly susceptible to ambient noise, particularly in low signal-to-noise ratio (SNR) and non-stationary noise environments. Incorporating multi-modal information has shown promise in enhancing speech in such challenging scenarios. Electromyography (EMG) signals, which capture muscle activity during speech production, offer noise-resistant properties beneficial for SE in adverse conditions. Most previous EMG-based SE methods required 35 EMG channels, limiting their practicality. To address this, we propose a novel method that considers only 8-channel EMG signals with acoustic signals using a modified SEMamba network with added cross-modality modules. Our experiments demonstrate substantial improvements in speech quality and intelligibility over traditional approaches, especially in extremely low SNR settings. Notably, compared to the SE (AC) approach, our method achieves a significant PESQ gain of 0.235 under matched low SNR conditions and 0.527 under mismatched conditions, highlighting its robustness.
https://arxiv.org/abs/2501.06530
Contemporary neural networks intended for natural language processing (NLP) are not designed with specific linguistic rules. This suggests that they may acquire a general understanding of language. This attribute has led to extensive research in deciphering their internal representations. A pioneering method involves an experimental setup using human brain data to explore whether a translation between brain and neural network representations can be established. Since this technique emerged, more sophisticated NLP models have been developed. In our study, we apply this method to evaluate four new NLP models aiming to identify the one most compatible with brain activity. Additionally, to explore how the brain comprehends text semantically, we alter the text by removing punctuation in four different ways to understand its impact on semantic processing by the human brain. Our findings indicate that the RoBERTa model aligns best with brain activity, outperforming BERT in accuracy according to our metrics. Furthermore, for BERT, higher accuracy was noted when punctuation was excluded, and increased context length did not significantly diminish accuracy compared to the original results with punctuation.
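The brain-to-network "translation" setup referred to above is commonly implemented as a regularised linear mapping with cross-validated correlation as the alignment score; a generic sketch under that assumption, not the study's exact procedure:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    def alignment_score(embeddings, brain, alpha=1.0, n_splits=5):
        """embeddings: (n_samples, d_model); brain: (n_samples, n_voxels)."""
        scores = []
        for train, test in KFold(n_splits, shuffle=True, random_state=0).split(embeddings):
            model = Ridge(alpha=alpha).fit(embeddings[train], brain[train])
            pred = model.predict(embeddings[test])
            # mean Pearson correlation across voxels on held-out samples
            r = [np.corrcoef(pred[:, v], brain[test][:, v])[0, 1] for v in range(brain.shape[1])]
            scores.append(np.nanmean(r))
        return float(np.mean(scores))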
https://arxiv.org/abs/2501.06278
Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, which is key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from this https URL.
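A generic late-fusion sketch in PyTorch of the kind of network described (separate encoders for RGB, IMU, and skeleton features concatenated before an engagement head); the feature dimensions, layer sizes, and number of engagement levels are assumptions, not the paper's architecture:

    import torch
    import torch.nn as nn

    class EngagementFusion(nn.Module):
        def __init__(self, rgb_dim=512, imu_dim=64, skel_dim=128, num_levels=3):
            super().__init__()
            self.rgb_enc = nn.Sequential(nn.Linear(rgb_dim, 128), nn.ReLU())
            self.imu_enc = nn.Sequential(nn.Linear(imu_dim, 32), nn.ReLU())
            self.skel_enc = nn.Sequential(nn.Linear(skel_dim, 64), nn.ReLU())
            self.head = nn.Linear(128 + 32 + 64, num_levels)

        def forward(self, rgb, imu, skel):
            # Concatenate per-modality features, then classify engagement level.
            fused = torch.cat([self.rgb_enc(rgb), self.imu_enc(imu), self.skel_enc(skel)], dim=-1)
            return self.head(fused)   # logits over engagement levels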
https://arxiv.org/abs/2501.05936
Predicting future events is an important activity with applications across multiple fields and domains. For example, the capacity to foresee stock market trends, natural disasters, business developments, or political events can facilitate early preventive measures and uncover new opportunities. Multiple diverse computational methods for attempting future predictions, including predictive analysis, time series forecasting, and simulations, have been proposed. This study evaluates the performance of several large language models (LLMs) in supporting future prediction tasks, an under-explored domain. We assess the models across three scenarios: Affirmative vs. Likelihood questioning, Reasoning, and Counterfactual analysis. For this, we create a dataset by finding and categorizing news articles based on entity type and popularity. We gather news articles before and after the LLMs' training cutoff date in order to thoroughly test and compare model performance. Our research highlights LLMs' potential and limitations in predictive modeling, providing a foundation for future improvements.
https://arxiv.org/abs/2501.05925
Office Assistant Robots (OARs) offer a promising solution to proactively provide in-situ support to enhance employee well-being and productivity in office spaces. We introduce OfficeMate, a social OAR designed to assist with practical tasks, foster social interaction, and promote health and well-being. Through a pilot evaluation with seven participants in an office environment, we found that users see potential in OARs for reducing stress and promoting healthy habits and value the robot's ability to provide companionship and physical activity reminders in the office space. However, concerns regarding privacy, communication, and the robot's interaction timing were also raised. The feedback highlights the need to carefully consider the robot's appearance and behaviour to ensure it enhances user experience and aligns with office social norms. We believe these insights will better inform the development of adaptive, intelligent OAR systems for future office space integration.
https://arxiv.org/abs/2501.05141
Monitoring complex assembly processes is critical for maintaining productivity and ensuring compliance with assembly standards. However, variability in human actions and subjective task preferences complicate accurate task anticipation and guidance. To address these challenges, we introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) Network for egocentric activity anticipation, utilizing multimodal fusion to improve prediction accuracy. Integrated with the Operator Action Monitoring Unit (OAMU), the system provides proactive operator guidance, preventing deviations in the assembly process. OAMU employs two strategies: (1) Top-5 MMTF-RU predictions, combined with a reference graph and an action dictionary, for next-step recommendations; and (2) Top-1 MMTF-RU predictions, integrated with a reference graph, for detecting sequence deviations and predicting anomaly scores via an entropy-informed confidence mechanism. We also introduce Time-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and ensure timely task completion. Our approach is validated on the industrial Meccano dataset and the large-scale EPIC-Kitchens-55 dataset, demonstrating its effectiveness in dynamic environments.
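One way to picture the entropy-informed confidence mechanism is as a score that is high only when a confident next-action prediction is contradicted by the observed action; the normalisation and the reference-graph check below are illustrative assumptions, not the paper's exact formulation:

    import numpy as np

    def entropy(probs):
        p = np.asarray(list(probs), dtype=float)
        p = p / p.sum()
        return float(-np.sum(p * np.log(p + 1e-12)))

    def anomaly_score(pred_probs, observed_action, allowed_next_actions):
        """pred_probs: dict mapping candidate next actions to predicted probabilities (>= 2 entries)."""
        # Confidence is 1 minus the normalised entropy of the prediction distribution.
        confidence = 1.0 - entropy(pred_probs.values()) / np.log(len(pred_probs))
        deviates = observed_action not in allowed_next_actions   # reference-graph check
        return confidence * float(deviates)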
https://arxiv.org/abs/2501.05108
As interest in studying in-the-wild human-robot interaction grows, there is a need for methods to collect data over time and in naturalistic or potentially private environments. HRI researchers have increasingly used the diary method for these studies, asking study participants to self-administer a structured data collection instrument, i.e., a diary, over a period of time. Although the diary method offers a unique window into settings that researchers may not have access to, it also lacks the interactivity and probing that interview-based methods offer. In this paper, we explore a novel data collection method in which a robot plays the role of an interactive diary. We developed the Diary Robot system and performed in-home deployments for a week to evaluate the feasibility and effectiveness of this approach. Using traditional text-based and audio-based diaries as benchmarks, we found that robots are able to effectively elicit the intended information. We reflect on our findings, and describe scenarios where the utilization of robots in diary studies as a data collection instrument may be especially applicable.
https://arxiv.org/abs/2501.04860
The performance of pavement under loading depends on the strength of the subgrade. However, experimental estimation of pavement strength properties such as California bearing ratio (CBR), unconfined compressive strength (UCS) and resistance value (R) is often tedious, time-consuming and costly, inspiring a growing interest in machine learning based tools as simple, cheap and fast alternatives. Thus, the potential application of two boosting techniques, categorical boosting (CatBoost) and extreme gradient boosting (XGBoost), together with support vector regression (SVR), is explored in this study for estimating the properties of subgrade soil modified with hydrated lime activated rice husk ash (HARSH). Using 121 experimental data samples with varying proportions of HARSH, plastic limit, liquid limit, plasticity index, clay activity, optimum moisture content, and maximum dry density as input for CBR, UCS and R estimation, four evaluation metrics, namely coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), are used to evaluate the models' performance. The results indicate that XGBoost outperformed CatBoost and SVR in estimating these properties, yielding R2 of 0.9994, 0.9995 and 0.9999 in estimating the CBR, UCS and R respectively. Also, SVR outperformed CatBoost in estimating the CBR and R, with an R2 of 0.9997 in each case. On the other hand, CatBoost outperformed SVR in estimating the UCS with an R2 of 0.9994. Feature sensitivity analysis shows that the three machine learning techniques agree that increasing the HARSH proportion leads to corresponding changes in the values of the estimated properties. A comparison with previous results also shows the superiority of XGBoost in estimating subgrade properties.
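An illustrative training-and-evaluation loop for one of the targets (e.g., CBR) with the four reported metrics; the feature column names follow the abstract but are hypothetical, and the split, hyperparameters, and data loading are placeholders rather than the study's procedure:

    import numpy as np
    from xgboost import XGBRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (r2_score, mean_squared_error,
                                 mean_absolute_error, mean_absolute_percentage_error)

    # Hypothetical column names corresponding to the inputs listed in the abstract.
    features = ["harsh_pct", "plastic_limit", "liquid_limit", "plasticity_index",
                "clay_activity", "omc", "mdd"]

    def evaluate(X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
        model = XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        return {"R2": r2_score(y_te, pred),
                "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
                "MAE": mean_absolute_error(y_te, pred),
                "MAPE": mean_absolute_percentage_error(y_te, pred)}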
https://arxiv.org/abs/2501.04826