The maturity classification of specialty crops such as strawberries and tomatoes is an essential downstream agricultural activity for selective harvesting and quality control (QC) at production and packaging sites. Recent advances in Deep Learning (DL) have produced encouraging results on color images for maturity classification applications; however, hyperspectral imaging (HSI) outperforms color-vision-based methods. Multivariate analysis methods and Convolutional Neural Networks (CNN) deliver promising results, but the large amount of input data and the associated preprocessing requirements hinder practical application. Conventionally, the reflectance intensity across a given electromagnetic spectrum is used to estimate fruit maturity. We present a feature extraction method and empirically demonstrate that the peak reflectance within 500-670 nm (pigment band) and the wavelength of that peak, together with the trough reflectance and its corresponding wavelength within 671-790 nm (chlorophyll band), are convenient-to-compute yet distinctive features for maturity classification; the feature set is designed to capture exactly these traits. The proposed feature selection is beneficial because preprocessing such as dimensionality reduction is avoided before every prediction. The best SOTA methods among 3D-CNN, 1D-CNN, and SVM achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our dataset. Results show that the proposed method outperforms the SOTA, yielding above 98.0% accuracy in strawberry and 96.0% in tomato classification. A comparative analysis of time efficiency shows the proposed method performs prediction at 13 Frames Per Second (FPS), compared to the maximum 1.16 FPS attained by the full-spectrum SVM classifier.
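As a rough illustration of the four features the abstract describes (peak reflectance and its wavelength in the pigment band, trough reflectance and its wavelength in the chlorophyll band), the extraction can be sketched as follows; the function name, the list-of-samples spectrum representation, and the sampling grid are assumptions for illustration, not the authors' code:

```python
# Hypothetical sketch of the four-feature extraction: peak reflectance and
# wavelength in the pigment band (500-670 nm), trough reflectance and
# wavelength in the chlorophyll band (671-790 nm). Band limits follow the
# abstract; everything else is an illustrative assumption.

def extract_maturity_features(wavelengths, reflectance):
    """Return [peak_val, peak_wl, trough_val, trough_wl] for one spectrum."""
    pigment = [(r, w) for w, r in zip(wavelengths, reflectance) if 500 <= w <= 670]
    chloro = [(r, w) for w, r in zip(wavelengths, reflectance) if 671 <= w <= 790]
    peak_val, peak_wl = max(pigment)      # highest reflectance in pigment band
    trough_val, trough_wl = min(chloro)   # lowest reflectance in chlorophyll band
    return [peak_val, peak_wl, trough_val, trough_wl]
```

Because only a peak and a trough are located in two fixed subbands, no dimensionality reduction is needed per prediction, which is the efficiency argument the abstract makes.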
https://arxiv.org/abs/2405.09955
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, removing the need for, and the computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline that uses ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse, and DIHARD III diarization benchmarks.
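To give a concrete (and deliberately simplified) picture of how frame-level attention weights could double as VAD decisions, here is a minimal thresholding-and-smoothing sketch; the threshold, the minimum-run smoothing, and the function name are assumptions for illustration and not the ECAPA2 pipeline itself:

```python
def attention_vad(frame_attention, threshold=0.5, min_run=3):
    """Mark frames as speech where the attention weight exceeds a threshold,
    keeping only runs of at least `min_run` consecutive frames (a crude
    stand-in for the smoothing a real VAD stage would apply)."""
    flags = [a > threshold for a in frame_attention]
    out = [False] * len(flags)
    i = 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j < len(flags) and flags[j]:
                j += 1
            if j - i >= min_run:       # keep only sufficiently long runs
                for k in range(i, j):
                    out[k] = True
            i = j
        else:
            i += 1
    return out
```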
https://arxiv.org/abs/2405.09142
One of the challenges of human-swarm interaction (HSI) is how to manage the operator's workload. To do this, we propose a novel neurofeedback technique for the real-time measurement of workload using functional near-infrared spectroscopy (fNIRS). The objective is to develop a baseline for workload measurement in human-swarm interaction using fNIRS and to develop an interface that dynamically adapts to the operator's workload. The proposed method consists of using an fNIRS device to measure brain activity, processing the signal through a machine learning algorithm, and passing the result to the HSI interface. By dynamically adapting the HSI interface, the swarm operator's workload could be reduced and performance improved.
https://arxiv.org/abs/2405.07834
This paper undertakes an empirical study to revisit the latest advancements in Multimodal Large Language Models (MLLMs): Video Assistant. This study, namely FreeVA, aims to extend existing image-based MLLMs to the video domain in a training-free manner. The study provides an essential, must-know baseline and reveals several surprising findings: 1) FreeVA, leveraging only an offline image-based MLLM without additional training, excels in zero-shot video question-answering (e.g., MSVD-QA, ActivityNet-QA, and MSRVTT-QA), even surpassing state-of-the-art methods that involve video instruction tuning. 2) While mainstream video-based MLLMs typically initialize with an image-based MLLM (e.g., LLaVA) and then fine-tune using video instruction tuning, the study indicates that utilizing the widely adopted VideoInstruct-100K for video instruction tuning doesn't actually lead to better performance compared to not training at all. 3) The evaluation metrics commonly used in existing works are significantly influenced by changes in the GPT API version over time. If ignored, this could affect the fairness and uniformity of comparisons between different methods and impact the analysis and judgment of researchers in the field. The advancement of MLLMs is currently thriving, drawing numerous researchers into the field. We aim for this work to serve as a plug-and-play, simple yet effective baseline, encouraging the direct evaluation of existing MLLMs in the video domain while also standardizing the field of video conversational models to a certain extent. Also, we encourage researchers to reconsider: Have current video MLLM methods truly acquired knowledge beyond image MLLMs? Code is available at this https URL
https://arxiv.org/abs/2405.07798
Robots executing tasks following human instructions in domestic or industrial environments essentially require both adaptability and reliability. The Behavior Tree (BT) emerges as an appropriate control architecture for these scenarios due to its modularity and reactivity. Existing BT generation methods, however, either do not involve interpreting natural language or cannot theoretically guarantee the BTs' success. This paper proposes a two-stage framework for BT generation, which first employs large language models (LLMs) to interpret goals from high-level instructions, then constructs an efficient goal-specific BT through the Optimal Behavior Tree Expansion Algorithm (OBTEA). We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning. Experiments on a service robot validate the proficiency of LLMs in producing grammatically correct and accurately interpreted goals, demonstrate OBTEA's superiority over the baseline BT Expansion algorithm on various metrics, and finally confirm the practical deployability of our framework. The project website is this https URL.
https://arxiv.org/abs/2405.07474
In today's digital landscape, where cyber attacks have become the norm, the detection of cyber attacks and threats is critically imperative across diverse domains. Our research presents a new empirical framework for cyber threat modeling, adept at parsing and categorizing cyber-related information from news articles, enhancing real-time vigilance for market stakeholders. At the core of this framework is a fine-tuned BERT model, which we call CANAL - Cyber Activity News Alerting Language Model - tailored for cyber categorization using a novel silver labeling approach powered by Random Forest. We benchmark CANAL against larger, costlier LLMs, including GPT-4, LLaMA, and Zephyr, examining their zero- to few-shot performance in cyber news classification. CANAL demonstrates superior performance, outperforming all other LLM counterparts in both accuracy and cost-effectiveness. Furthermore, we introduce the Cyber Signal Discovery module, a strategic component designed to efficiently detect emerging cyber signals from news articles. Collectively, CANAL and the Cyber Signal Discovery module equip our framework to provide a robust and cost-effective solution for businesses that require agile responses to cyber intelligence.
https://arxiv.org/abs/2405.06772
Engaging in recreational activities in public spaces poses challenges for blind people, often involving dependency on sighted help. Window shopping is a key recreational activity that remains inaccessible. In this paper, we investigate the information needs, challenges, and current approaches blind people have to recreational window shopping to inform the design of existing wayfinding and navigation technology for supporting blind shoppers in exploration and serendipitous discovery. We conduct a formative study with a total of 18 blind participants, comprising both focus groups (N=8) and requirements-analysis interviews (N=10). We find that there is a desire for push notifications of promotional information and pull notifications about shops of interest, such as the targeted audience of a brand. Information about obstacles and points of interest requires customization depending on one's mobility aid as well as the presence of crowds, children, and wheelchair users. We translate these findings into specific information modalities and renderings in the context of two existing AI-infused assistive applications: NavCog (a turn-by-turn navigation app) and Cabot (a navigation robot).
https://arxiv.org/abs/2405.06611
This paper presents an approach for energy-neutral Internet of Things (IoT) scenarios where the IoT devices (IoTDs) rely entirely on their energy harvesting capabilities to sustain operation. We use a Markov chain to represent the operation and transmission states of the IoTDs, a modulated Poisson process to model their energy harvesting process, and a discrete-time Markov chain to model their battery state. The aim is to efficiently manage the duty cycling of the IoTDs, so as to prolong their battery life and reduce instances of low energy availability. We propose a duty-cycling management scheme based on K-nearest neighbors, aiming to strike a trade-off between energy efficiency and detection accuracy. This is done by incorporating spatial and temporal correlations among IoTDs' activity, as well as their energy harvesting capabilities. We also allow the base station to wake up specific IoTDs if more information about an event is needed upon initial detection. Our proposed scheme shows significant improvements in energy savings and performance, with up to 11 times lower misdetection probability and 50% lower energy consumption for high-density scenarios compared to a random duty cycling benchmark.
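The core of a K-nearest-neighbors duty-cycling decision can be sketched very simply: a device's next wake/sleep state is voted on by its spatially nearest neighbors' recent activity, capturing the spatial correlation of events the abstract mentions. The tuple format, function name, and majority-vote rule below are illustrative assumptions, not the paper's exact scheme:

```python
import math

def knn_should_wake(device, neighbors, k=3):
    """Decide a device's next duty cycle from its k spatially nearest
    neighbors' recent detections: wake if most nearby devices saw an event.
    device / neighbors are hypothetical (x, y, detected_event) tuples."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    nearest = sorted(neighbors, key=lambda n: dist(device, n))[:k]
    votes = sum(1 for n in nearest if n[2])
    return votes * 2 > k  # majority vote among the k nearest
```

A real scheme would also weigh each device's battery state and harvesting forecast, per the Markov battery model described above.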
https://arxiv.org/abs/2405.06372
This work investigates how tutoring discourse interacts with students' proximal knowledge to explain and predict students' learning outcomes. Our work is conducted in the context of high-dosage human tutoring where 9th-grade students (N = 1080) attended small group tutorials and individually practiced problems on an Intelligent Tutoring System (ITS). We analyzed whether tutors' talk moves and students' performance on the ITS predicted scores on math learning assessments. We trained Random Forest Classifiers (RFCs) to distinguish high and low assessment scores based on tutor talk moves, students' ITS performance metrics, and their combination. A decision tree was extracted from each RFC to yield an interpretable model. We found AUCs of 0.63 for talk moves, 0.66 for ITS, and 0.77 for their combination, suggesting interactivity between the two feature sources. Specifically, the best decision tree emerged from combining the tutor talk moves that encouraged rigorous thinking and students' ITS mastery. In essence, tutor talk that encouraged mathematical reasoning predicted achievement for students who demonstrated high mastery on the ITS, whereas tutors' revoicing of students' mathematical ideas and contributions was predictive for students with low ITS mastery. Implications for practice are discussed.
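Since the comparison above hinges on AUC values for each feature source, it may help to recall what AUC measures: the probability that a randomly chosen high-scoring student is ranked above a randomly chosen low-scoring one by the classifier. A minimal rank-based computation (not the authors' code; names are illustrative):

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a positive example outranks a
    negative one, counting ties as half. labels are 0/1, scores are the
    classifier's predicted probabilities or margins."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On this scale, 0.5 is chance, so the jump from 0.63/0.66 (either source alone) to 0.77 (combined) is the interactivity signal the abstract reports.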
https://arxiv.org/abs/2405.06218
In this paper, we present a robust version of the well-known exact low-resolution electromagnetic tomography (eLORETA) technique, named ReLORETA, to localize brain sources in the presence of different forward model uncertainties. Methods: We first assume that the true lead field matrix is a transformation of the existing lead field matrix distorted by uncertainties and propose an iterative approach to estimate this transformation accurately. Major sources of forward model uncertainty, including differences in geometry, conductivity, and source space resolution between the real and simulated head models, and misaligned electrode positions, are then simulated to test the proposed method. Results: ReLORETA and eLORETA are applied to simulated focal sources in different regions of the brain in the presence of various noise levels, as well as to real data from a patient with focal epilepsy. The results show that ReLORETA is considerably more robust and accurate than eLORETA in all cases. Conclusion: Having successfully dealt with the forward model uncertainties, ReLORETA proved to be a promising method for real-world clinical applications. Significance: eLORETA is one of the localization techniques that could be used to study brain activity for medical applications such as determining the epileptogenic zone in patients with medically refractory epilepsy. However, the major limitation of eLORETA is its sensitivity to uncertainties in the forward model. Since this problem can substantially undermine its performance in real-world applications where the exact lead field matrix is unknown, developing a more robust method capable of dealing with these uncertainties is of significant interest.
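The key assumption above is the relationship L_true = T L between the true and the distorted lead field matrices. ReLORETA estimates T iteratively; purely as a toy illustration of the algebraic relationship being inverted (a one-shot 2x2 closed form, with illustrative names, not the paper's iterative estimator):

```python
def estimate_transform_2x2(L, L_true):
    """Recover T in L_true = T @ L for 2x2 matrices via L's explicit
    inverse: T = L_true @ inv(L). Real lead fields are large and
    rank-deficient, which is why an iterative estimate is needed."""
    (a, b), (c, d) = L
    det = a * d - b * c
    Linv = [[d / det, -b / det], [-c / det, a / det]]
    return [[sum(L_true[i][k] * Linv[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]
```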
https://arxiv.org/abs/2405.05790
In this study, we utilized statistical analysis and machine learning methods to examine whether rehabilitation exercises can improve patients' post-stroke functional abilities, as well as to forecast the improvement in functional abilities. Our dataset comprises patients' rehabilitation exercises and demographic information recorded in unstructured electronic health record (EHR) data and free-text rehabilitation procedure notes. We collected data for 265 stroke patients from the University of Pittsburgh Medical Center. We employed a pre-existing natural language processing (NLP) algorithm to extract data on rehabilitation exercises and developed a rule-based NLP algorithm to extract Activity Measure for Post-Acute Care (AM-PAC) scores, covering the basic mobility (BM) and applied cognitive (AC) domains, from procedure notes. Changes in AM-PAC scores were classified based on the minimal clinically important difference (MCID), and significance was assessed using Friedman and Wilcoxon tests. To identify impactful exercises, we used Chi-square tests, Fisher's exact tests, and logistic regression for odds ratios. Additionally, we developed five machine learning models, logistic regression (LR), Adaboost (ADB), support vector machine (SVM), gradient boosting (GB), and random forest (RF), to predict outcomes in functional ability. Statistical analyses revealed significant associations between functional improvements and specific exercises. The RF model achieved the best performance in predicting functional outcomes. In this study, we identified three rehabilitation exercises that significantly contributed to patients' post-stroke functional ability improvement in the first two months. Additionally, the successful application of a machine learning model to predict patient-specific functional outcomes underscores the potential for precision rehabilitation.
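The MCID-based outcome definition above amounts to a simple threshold rule on the score change. A minimal sketch (the function name, label strings, and symmetric-threshold handling of decline are assumptions; the study's exact MCID values are not given here):

```python
def classify_ampac_change(baseline, followup, mcid):
    """Label an AM-PAC score change against the minimal clinically
    important difference (MCID). A change smaller in magnitude than
    the MCID is treated as not clinically meaningful."""
    delta = followup - baseline
    if delta >= mcid:
        return "improved"
    if delta <= -mcid:
        return "declined"
    return "unchanged"
```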
https://arxiv.org/abs/2405.05993
Background: Gravity confounds arm movement ability in post-stroke hemiparesis. Reducing its influence allows effective practice leading to recovery. Yet, there is a scarcity of wearable devices suitable for personalized use across diverse therapeutic activities in the clinic. Objective: In this study, we investigated the safety, feasibility, and efficacy of anti-gravity therapy using the ExoNET device in post-stroke participants. Methods: Twenty chronic stroke survivors underwent six 45-minute occupational therapy sessions while wearing the ExoNET, randomized into either the treatment group (ExoNET tuned to gravity support) or the control group (ExoNET tuned to a slack condition). Clinical outcomes were evaluated by a blinded rater at baseline, post-intervention, and six-week follow-up sessions. Kinetic, kinematic, and patient experience outcomes were also assessed. Results: Mixed-effect models showed a significant improvement in Box and Blocks scores in the post-intervention session for the treatment group (effect size: 2.1, p = .04). No significant effects were found between the treatment and control groups for ARAT scores and other clinical metrics. Direct kinetic effects revealed a significant reduction in muscle activity during free exploration, with an effect size of -7.12% (p < .005). There were no significant longitudinal kinetic or kinematic trends. Subject feedback suggested a generally positive perception of the anti-gravity therapy. Conclusions: Anti-gravity therapy with the ExoNET is a safe and feasible treatment for post-stroke rehabilitation. The device provided anti-gravity forces, did not encumber range of motion, and clinical metrics of anti-gravity therapy demonstrated improvements in gross manual dexterity. Further research is required to explore potential benefits in broader clinical metrics.
https://arxiv.org/abs/2405.04707
Integrating large language models (LLMs) and knowledge graphs (KGs) holds great promise for revolutionizing intelligent education, but challenges remain in achieving personalization, interactivity, and explainability. We propose FOKE, a Forest Of Knowledge and Education framework that synergizes foundation models, knowledge graphs, and prompt engineering to address these challenges. FOKE introduces three key innovations: (1) a hierarchical knowledge forest for structured domain knowledge representation; (2) a multi-dimensional user profiling mechanism for comprehensive learner modeling; and (3) an interactive prompt engineering scheme for generating precise and tailored learning guidance. We showcase FOKE's application in programming education, homework assessment, and learning path planning, demonstrating its effectiveness and practicality. Additionally, we implement Scholar Hero, a real-world instantiation of FOKE. Our research highlights the potential of integrating foundation models, knowledge graphs, and prompt engineering to revolutionize intelligent education practices, ultimately benefiting learners worldwide. FOKE provides a principled and unified approach to harnessing cutting-edge AI technologies for personalized, interactive, and explainable educational services, paving the way for further research and development in this critical direction.
https://arxiv.org/abs/2405.03734
Large general-purpose transformer models have recently become the mainstay in the realm of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed to be used in real-time conditions, and this limitation makes them unsuitable for a wide range of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the Whisper pretrained models. As a result of a number of architectural optimisations, Whispy is able to consume live audio streams and generate high-level, coherent voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy impacts the Whisper output. Experimental results show how Whispy excels in robustness, promptness, and accuracy.
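Wrapping an offline model like Whisper for live use generally requires buffering the incoming stream into (possibly overlapping) windows that the model can transcribe incrementally. The sketch below shows only that generic buffering step, with illustrative names and parameters; it is not Whispy's actual mechanism:

```python
def chunk_stream(samples, window, hop):
    """Yield overlapping windows from a buffered audio-sample stream:
    the kind of chunking any streaming wrapper around an offline ASR
    model needs before invoking the model on each chunk."""
    for start in range(0, len(samples) - window + 1, hop):
        yield samples[start:start + window]
```

Choosing hop < window gives overlap, which lets consecutive transcriptions be stitched into a coherent output at the cost of redundant computation.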
https://arxiv.org/abs/2405.03484
Transparency and explainability in image classification are essential for establishing trust in machine learning models and for detecting biases and errors. State-of-the-art explainability methods generate saliency maps to show where a specific class is identified, without providing a detailed explanation of the model's decision process. To address this need, we introduce a post-hoc method that explains the entire feature extraction process of a Convolutional Neural Network. These explanations include a layer-wise representation of the features the model extracts from the input. Such features are represented as saliency maps generated by clustering and merging similar feature maps, to which we associate a weight derived by generalizing Grad-CAM to the proposed methodology. To further enhance these explanations, we include a set of textual labels collected through a gamified crowdsourcing activity and processed using NLP techniques and Sentence-BERT. Finally, we show an approach to generate global explanations by aggregating labels across multiple images.
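The "clustering and merging similar feature maps" step can be illustrated with a greedy cosine-similarity grouping over flattened maps; the threshold, the greedy assignment to a cluster's first member, and the averaging are simplifying assumptions for illustration, not the paper's exact procedure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened feature maps."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_feature_maps(maps, threshold=0.9):
    """Greedily group maps whose cosine similarity to a cluster's first
    member exceeds `threshold`, then average each group into one merged
    map (a stand-in for the merged saliency maps described above)."""
    clusters = []
    for m in maps:
        for c in clusters:
            if cosine(m, c[0]) >= threshold:
                c.append(m)
                break
        else:
            clusters.append([m])
    return [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
```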
https://arxiv.org/abs/2405.03301
Reconstructing human dynamic vision from brain activity is a challenging task with great scientific significance. The difficulty stems from two primary issues: (1) vision-processing mechanisms in the brain are highly intricate and not fully revealed, making it challenging to directly learn a mapping between fMRI and video; (2) the temporal resolution of fMRI is significantly lower than that of natural videos. To overcome these issues, this paper proposes a two-stage model named Mind-Animator, which achieves state-of-the-art performance on three public datasets. Specifically, during the fMRI-to-feature stage, we decouple semantic, structural, and motion features from fMRI through fMRI-vision-language tri-modal contrastive learning and sparse causal attention. In the feature-to-video stage, these features are merged into videos by an inflated Stable Diffusion model. We substantiate through permutation tests that the reconstructed video dynamics are indeed derived from fMRI, rather than hallucinations of the generative model. Additionally, the visualization of voxel-wise and ROI-wise importance maps confirms the neurobiological interpretability of our model.
https://arxiv.org/abs/2405.03280
We propose a novel, brain-inspired deep neural network model known as the Deep Oscillatory Neural Network (DONN). Deep neural networks such as Recurrent Neural Networks possess sequence processing capabilities, but their internal states are not designed to exhibit brain-like oscillatory activity. With this motivation, the DONN is designed to have oscillatory internal dynamics. Neurons of the DONN are either nonlinear neural oscillators or traditional neurons with sigmoidal or ReLU activation. The neural oscillator used in the model is the Hopf oscillator, with the dynamics described in the complex domain. Input can be presented to the neural oscillator in three possible modes. The sigmoid and ReLU neurons also use complex-valued extensions, and all the weight stages are likewise complex-valued. Training follows the general principle of weight change by minimizing the output error and therefore has an overall resemblance to complex backpropagation. A generalization of the DONN to convolutional networks, known as the Oscillatory Convolutional Neural Network, is also proposed. The two proposed oscillatory networks are applied to a variety of benchmark problems in signal and image/video processing. The performance of the proposed models is comparable or superior to published results on the same datasets.
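The Hopf oscillator named above has a standard complex-domain normal form, dz/dt = (mu + i*omega)z - |z|^2 z, whose amplitude settles onto a stable limit cycle of radius sqrt(mu) for mu > 0. A minimal Euler-integration sketch (parameter values, step size, and the plain-Euler integrator are illustrative choices, not the paper's implementation):

```python
def hopf_trajectory(mu=1.0, omega=6.28, dt=0.001, steps=30000, z0=0.1 + 0.0j):
    """Integrate dz/dt = (mu + i*omega)*z - |z|**2 * z with Euler steps.
    For mu > 0 the amplitude |z| converges to sqrt(mu) while the phase
    rotates at angular frequency omega - the oscillatory internal state
    the DONN builds its neurons from."""
    z = z0
    for _ in range(steps):
        z = z + dt * ((mu + 1j * omega) * z - abs(z) ** 2 * z)
    return z
```

Running this from a small initial amplitude shows the self-sustained oscillation: |z| grows toward sqrt(mu) = 1 rather than decaying, unlike a damped linear oscillator.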
https://arxiv.org/abs/2405.03725
Predicting future human pose is a fundamental application for machine intelligence, which drives robots to plan their behavior and paths ahead of time to seamlessly accomplish human-robot collaboration in real-world 3D scenarios. Despite encouraging results, existing approaches rarely consider the effects of the external scene on the motion sequence, leading to pronounced artifacts and physical implausibilities in the predictions. To address this limitation, this work introduces a novel multi-modal sense-informed motion prediction approach, which conditions high-fidelity generation on two modalities of information - the external 3D scene and internal human gaze - and is able to recognize their salience for future human activity. Furthermore, the gaze information is regarded as the human intention; combining it with both motion and scene features, we construct a ternary intention-aware attention to supervise the generation to match where the human wants to reach. Meanwhile, we introduce semantic coherence-aware attention to explicitly distinguish the salient point clouds from the underlying ones, ensuring a reasonable interaction of the generated sequence with the 3D scene. On two real-world benchmarks, the proposed method achieves state-of-the-art performance in both 3D human pose and trajectory prediction.
https://arxiv.org/abs/2405.02911
Panoramic Activity Recognition (PAR) aims to identify multi-granularity behaviors performed by multiple persons in panoramic scenes, including individual activities, group activities, and global activities. Previous methods 1) heavily rely on manually annotated detection boxes in training and inference, hindering further practical deployment; or 2) directly employ normal detectors to detect multiple persons with varying sizes and spatial occlusion in panoramic scenes, limiting the performance gain of PAR. To this end, we consider learning a detector that adapts to varying-size, occluded persons and is optimized along with the recognition module in an all-in-one framework. Therefore, we propose a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework to jointly recognize individual, group, and global activities in panoramic activity scenes by learning an adapt-focused detector and multi-granularity prototypes as the pretext tasks in an end-to-end way. Specifically, to accommodate the varying sizes and spatial occlusion of multiple persons in crowded panoramic scenes, we introduce a panoramic adapt-focuser, achieving size-adapting detection of individuals by comprehensively selecting and performing fine-grained detections on object-dense sub-regions identified through the original detections. In addition, to mitigate information loss due to inaccurate individual localizations, we introduce a bi-propagation prototyper that promotes closed-loop interaction and informative consistency across different granularities by facilitating bidirectional information propagation among the individual, group, and global levels. Extensive experiments demonstrate the strong performance of AdaFPP and emphasize its powerful applicability for PAR.
https://arxiv.org/abs/2405.02538
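The adapt-focuser's "select dense sub-regions, then re-detect at fine granularity" step can be sketched with a simple density grid over the initial detections. The grid size and count threshold below are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def dense_subregions(boxes, img_w, img_h, grid=(4, 8), min_count=3):
    """Toy version of object-dense sub-region selection: tile the panorama
    with a coarse grid, count detection-box centers per cell, and return the
    cells crowded enough to warrant a second, fine-grained detection pass.
    `boxes` is an (N, 4) array of (x0, y0, x1, y1) detections."""
    gy, gx = grid
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    col = np.clip((cx / img_w * gx).astype(int), 0, gx - 1)
    row = np.clip((cy / img_h * gy).astype(int), 0, gy - 1)
    counts = np.zeros((gy, gx), dtype=int)
    np.add.at(counts, (row, col), 1)  # histogram of centers per grid cell
    regions = []
    for r, c in zip(*np.nonzero(counts >= min_count)):
        x0, y0 = c * img_w / gx, r * img_h / gy
        regions.append((float(x0), float(y0),
                        float(x0 + img_w / gx), float(y0 + img_h / gy)))
    return regions

# Ten small boxes crowded into the top-left corner of a 1600x400 panorama
# yield a single dense cell: the top-left grid tile.
boxes = np.array([[10 + 5 * i, 10, 30 + 5 * i, 60] for i in range(10)], float)
print(dense_subregions(boxes, 1600, 400))
```

In the full framework each returned region would be cropped, upscaled, and passed back through the detector, with the refined boxes merged into the original set.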
Deciphering the intricacies of the human brain has captivated curiosity for centuries. Recent strides in Brain-Computer Interface (BCI) technology, particularly using motor imagery, have restored motor functions such as reaching, grasping, and walking in paralyzed individuals. However, unraveling natural language from brain signals remains a formidable challenge. Electroencephalography (EEG) is a non-invasive technique used to record electrical activity in the brain by placing electrodes on the scalp. Previous studies of EEG-to-text decoding have achieved high accuracy on small closed vocabularies, but still fall short of high accuracy when dealing with large open vocabularies. We propose a novel method, EEG2TEXT, to improve the accuracy of open vocabulary EEG-to-text decoding. Specifically, EEG2TEXT leverages EEG pre-training to enhance the learning of semantics from EEG signals and proposes a multi-view transformer to model the EEG signal processing by different spatial regions of the brain. Experiments show that EEG2TEXT has superior performance, outperforming the state-of-the-art baseline methods by a large margin of up to 5% in absolute BLEU and ROUGE scores. EEG2TEXT shows great potential for a high-performance open-vocabulary brain-to-text system to facilitate communication.
https://arxiv.org/abs/2405.02165
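The multi-view idea, modeling EEG processing by different spatial regions of the brain, amounts to partitioning the electrode channels into regional groups and encoding each group as its own view before a downstream transformer. A minimal sketch, where the channel grouping and random projection weights are hypothetical stand-ins for EEG2TEXT's learned components:

```python
import numpy as np

# Hypothetical grouping of 8 EEG channels by cortical region.
REGIONS = {
    "frontal": [0, 1, 2],
    "temporal": [3, 4],
    "parietal": [5, 6],
    "occipital": [7],
}

def multi_view_encode(eeg, d_model=16, seed=0):
    """Toy multi-view encoding: each spatial region gets its own projection,
    and the per-region embeddings are stacked as separate views that a
    multi-view transformer could then attend over. Random weights stand in
    for learned parameters."""
    rng = np.random.default_rng(seed)
    views = []
    for chans in REGIONS.values():
        W = rng.standard_normal((len(chans), d_model)) / np.sqrt(len(chans))
        views.append(eeg[:, chans] @ W)  # (T, d_model) embedding for one region
    return np.stack(views, axis=0)       # (num_regions, T, d_model)

eeg = np.random.default_rng(1).standard_normal((100, 8))  # 100 time samples, 8 channels
z = multi_view_encode(eeg)
print(z.shape)  # (4, 100, 16)
```

Keeping the regions as separate views, rather than flattening all channels into one sequence, is what lets the downstream model weigh spatial regions differently per decoding step.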