In this study, we propose the first hardware implementation of a context-based recurrent spiking neural network (RSNN), emphasizing the integration of dual information streams within neocortical pyramidal neurons, specifically the Context-Dependent Leaky Integrate-and-Fire (CLIF) neuron model, an essential element of RSNNs. We present a quantized version of the CLIF neuron (qCLIF), developed through a hardware-software codesign approach that exploits the sparse activity of RSNNs. Implemented in a 45 nm technology node, the qCLIF is compact (900 µm^2) and achieves 90% accuracy on the DVS gesture classification dataset despite 8-bit quantization. Our analysis spans network configurations from 10 to 200 qCLIF neurons, supporting up to 82k synapses within a 1.86 mm^2 footprint, demonstrating scalability and efficiency.
https://arxiv.org/abs/2404.18066
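The abstract above does not give the CLIF update equations, so the sketch below is only illustrative: a generic leaky integrate-and-fire step in 8-bit fixed-point arithmetic with a second, contextual input stream gating the feed-forward one. The leak constant, threshold, and gating rule are all assumptions, not the paper's design.

```python
LEAK_NUM, LEAK_SHIFT = 230, 8   # leak factor ~ 230/256 ~ 0.9 (assumed)
V_TH = 64                       # spike threshold on the int8 grid (assumed)

def qclif_step(v, i_basal, i_context):
    """One 8-bit update of a context-dependent LIF neuron (illustrative).

    v         : int8 membrane potential
    i_basal   : int8 feed-forward (basal) input current
    i_context : int8 contextual (apical) input current
    Returns (new_v, spike).
    """
    # A context stream that amplifies the basal drive when the two streams
    # agree in sign -- a loose stand-in for dual-stream integration.
    gain = 2 if i_context != 0 and (i_basal > 0) == (i_context > 0) else 1
    v = (v * LEAK_NUM) >> LEAK_SHIFT             # leaky decay in fixed point
    v = max(-128, min(127, v + gain * i_basal))  # saturating int8 addition
    if v >= V_TH:
        return 0, 1                              # fire and reset
    return v, 0
```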
The conversion of brain activity into text using electroencephalography (EEG) has gained significant traction in recent years. Many researchers are working to develop new models to decode EEG signals into text form. Although this area has shown promising developments, it still faces numerous challenges that necessitate further improvement, so it is important to outline its recent developments and future research directions. In this review article, we thoroughly summarize the progress in EEG-to-text conversion. First, we discuss how EEG-to-text technology has evolved and the challenges that remain. Second, we survey existing techniques used in this field, including methods for collecting EEG data, the steps to process these signals, and the development of systems capable of translating these signals into coherent text. We conclude with potential future research directions, emphasizing the need for enhanced accuracy, reduced system constraints, and the exploration of novel applications across varied sectors. By addressing these aspects, this review aims to contribute to developing more accessible and effective Brain-Computer Interface (BCI) technology for a broader user base.
https://arxiv.org/abs/2405.00726
In this paper, we map out the landscape of options available to visual artists for creating personal artworks, including crafting, adapting and navigating deep generative models. Following that, we argue for revisiting model crafting, defined as the design and manipulation of generative models for creative goals, and motivate studying and designing for model crafting as a creative activity in its own right.
https://arxiv.org/abs/2404.17688
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights, and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
https://arxiv.org/abs/2404.17498
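As a rough illustration of the caption-conditioned temporal pooling described above, here is a PyTorch sketch; the cosine scoring and softmax temperature are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def caption_conditioned_pool(frame_feats, caption_feat, temperature=0.07):
    """Pool frame embeddings, weighting frames by relevance to one caption.

    frame_feats  : (T, D) per-frame embeddings (e.g., from a CLIP image tower)
    caption_feat : (D,) embedding of a sampled caption
    Returns a single (D,) video representation.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    caption_feat = F.normalize(caption_feat, dim=-1)
    relevance = frame_feats @ caption_feat            # (T,) cosine scores
    weights = torch.softmax(relevance / temperature, dim=0)
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)
```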
Conversational tutoring systems (CTSs) offer learning experiences through interactions based on natural language. They are recognized for promoting cognitive engagement and improving learning outcomes, especially in reasoning tasks. Nonetheless, the cost associated with authoring CTS content is a major obstacle to widespread adoption and to research on effective instructional design. In this paper, we discuss and evaluate a novel type of CTS that leverages recent advances in large language models (LLMs) in two ways: First, the system enables AI-assisted content authoring by inducing an easily editable tutoring script automatically from a lesson text. Second, the system automates the script orchestration in a learning-by-teaching format via two LLM-based agents (Ruffle&Riley) acting as a student and a professor. The system allows for free-form conversations that follow the ITS-typical inner- and outer-loop structure. We evaluate Ruffle&Riley's ability to support biology lessons in two between-subject online user studies (N = 200) comparing the system to simpler QA chatbots and to a reading activity. Analyzing system usage patterns, pre/post-test scores, and user experience surveys, we find that Ruffle&Riley users report high levels of engagement and understanding, and perceive the offered support as helpful. Even though Ruffle&Riley users require more time to complete the activity, we did not find significant differences in short-term learning gains over the reading activity. Our system architecture and user study provide various insights for designers of future CTSs. We further open-source our system to support ongoing research on the effective instructional design of LLM-based learning technologies.
https://arxiv.org/abs/2404.17460
We use the process and findings from a case study of design educators' practices of assessment and feedback to fuel theorizing about how to make AI useful in service of human experience. We build on Suchman's theory of situated actions. We perform a qualitative study of 11 educators in 5 fields, who teach design processes situated in project-based learning contexts. Through qualitative data gathering and analysis, we derive codes: design process; assessment and feedback challenges; and computational support. We twice invoke creative cognition's family resemblance principle: first, to explain how design instructors already use assessment rubrics, and second, to explain the analogous role for design creativity analytics: no particular trait is necessary or sufficient; each only tends to indicate good design work. Human teachers remain essential. We develop a set of situated design creativity analytics--Fluency, Flexibility, Visual Consistency, Multiscale Organization, and Legible Contrast--to support instructors' efforts by providing on-demand, learning-objectives-based assessment and feedback to students. We theorize a methodology, which we call situating analytics, first because making AI support living human activity depends on aligning what analytics measure with situated practices. Further, we realize that analytics can become most significant to users by situating them through interfaces that integrate them into the material contexts of their use. Here, this means situating design creativity analytics in actual design environments. Through the case study, we identify situating analytics as a methodology for explaining analytics to users, because the iterative process of alignment with practice has the potential to enable data scientists to derive analytics that make sense as part of, and in support of, situated human experiences.
https://arxiv.org/abs/2404.17390
We introduce the task of human action anomaly detection (HAAD), which aims to identify anomalous motions in an unsupervised manner given only a pre-determined normal category of training action samples. Compared to prior human-related anomaly detection tasks, which primarily focus on unusual events in videos, HAAD involves the learning of specific action labels to recognize semantically anomalous human behaviors. To address this task, we propose a normalizing flow (NF)-based detection framework in which the sample likelihood is effectively leveraged to indicate anomalies. As action anomalies often occur in specific body parts, in addition to full-body action feature learning we incorporate extra encoding streams into our framework for finer modeling of body subsets. Our framework is thus multi-level, jointly discovering global and local motion anomalies. Furthermore, to account for potentially jittery data during recording, we apply the discrete cosine transform, converting the action samples from the temporal to the frequency domain to mitigate data instability. Extensive experimental results on two human action datasets demonstrate that our method outperforms baselines formed by adapting state-of-the-art human activity AD approaches to our HAAD task.
https://arxiv.org/abs/2404.17381
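The DCT preprocessing step above is concrete enough to sketch; the normalizing flow itself is left as an abstract callable since the abstract does not fix its architecture, and the truncation length is an assumed hyperparameter.

```python
import numpy as np
from scipy.fft import dct

def motion_to_frequency(joints, keep=16):
    """Convert a (T, J, C) motion clip into truncated per-joint DCT features.

    Working in the frequency domain suppresses frame-level jitter; keeping
    only the first `keep` coefficients (assumed) yields a smooth summary.
    """
    coeffs = dct(joints, axis=0, norm="ortho")   # DCT along the time axis
    return coeffs[: min(keep, joints.shape[0])].reshape(-1)

def anomaly_score(flow_log_prob, features):
    """NF-based detection: low likelihood under a flow fitted on the 'normal'
    action category means anomalous. `flow_log_prob` is any fitted
    normalizing-flow log-density function."""
    return -flow_log_prob(features)              # higher = more anomalous
```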
In clinical practice, tri-modal medical image fusion, compared to the existing dual-modal technique, can provide a more comprehensive view of lesions, aiding physicians in evaluating a disease's shape, location, and biological activity. However, due to the limitations of imaging equipment and considerations for patient safety, the quality of medical images is usually limited, leading to sub-optimal fusion performance and affecting the depth of image analysis by the physician. Thus, there is an urgent need for a technology that can both enhance image resolution and integrate multi-modal information. Although current image processing methods can effectively address image fusion and super-resolution individually, solving both problems synchronously remains extremely challenging. In this paper, we propose TFS-Diff, a model that simultaneously realizes tri-modal medical image fusion and super-resolution. Specifically, TFS-Diff is based on a diffusion model that generates images through a random iterative denoising process. We also develop a simple objective function, the proposed fusion super-resolution loss, which effectively evaluates the uncertainty in the fusion and ensures the stability of the optimization process. A channel attention module is also proposed to effectively integrate key information from different modalities for clinical diagnosis, avoiding the information loss caused by multiple image processing steps. Extensive experiments on public Harvard datasets show that TFS-Diff significantly surpasses existing state-of-the-art methods in both quantitative and visual evaluations. The source code will be made available on GitHub.
https://arxiv.org/abs/2404.17357
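The abstract names a channel attention module without giving its form; a squeeze-and-excitation-style block is one common formulation and is sketched below as an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention over fused features."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global channel statistics
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # re-weight modality channels
```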
In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation and an extendable open-source, open-hardware platform for developing tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed robots with three omnidirectional wheels, each 10 cm in diameter; high-accuracy localization through a microdot pattern overlaid on top of the activity sheets; and a software framework for application development and control, while remaining affordable (a per-unit cost of about 30 USD at the prototype stage). We illustrate the potential of tabletop swarm user interfaces through a set of smart grid algorithm application scenarios developed with Misaka.
https://arxiv.org/abs/2404.17125
We present a novel multimodal dataset for Cognitive Load Assessment in REaltime (CLARE). The dataset contains physiological and gaze data from 24 participants, with self-reported cognitive load scores as ground-truth labels. The dataset consists of four modalities, namely Electrocardiography (ECG), Electrodermal Activity (EDA), Electroencephalogram (EEG), and Gaze tracking. To map diverse levels of mental load onto participants during the experiments, each participant completed four nine-minute sessions of a computer-based operator performance and mental workload task (the MATB-II software), with complexity varying in one-minute segments. During the experiment, participants reported their cognitive load every 10 seconds. For the dataset, we also provide benchmark binary classification results with machine learning and deep learning models under two different evaluation schemes, namely 10-fold and leave-one-subject-out (LOSO) cross-validation. Benchmark results show that for 10-fold evaluation, the convolutional neural network (CNN) based deep learning model achieves the best classification performance with ECG, EDA, and Gaze. In contrast, for LOSO, the best performance is achieved by the deep learning model with ECG, EDA, and EEG.
https://arxiv.org/abs/2404.17098
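To make the two evaluation schemes concrete, here is a minimal leave-one-subject-out loop with scikit-learn on synthetic stand-in features; the feature extraction and classifier are placeholders, not the paper's models.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 16))              # stand-in ECG/EDA/EEG/gaze features
y = rng.integers(0, 2, size=240)            # binarized self-reported load
groups = np.repeat(np.arange(24), 10)       # 24 participants

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(f"LOSO mean accuracy: {np.mean(scores):.3f}")
```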
The interactions between tumor cells and the tumor microenvironment (TME) dictate the therapeutic efficacy of radiation and many systemic therapies in breast cancer. However, to date, there is no widely available method to reproducibly measure tumor and immune phenotypes for each patient's tumor. Given this unmet clinical need, we applied multiple instance learning (MIL) algorithms to assess the activity of ten biologically relevant pathways from hematoxylin and eosin (H&E) slides of primary breast tumors. We employed different feature extraction approaches and state-of-the-art model architectures. Using binary classification, our models attained area under the receiver operating characteristic (AUROC) scores above 0.70 for nearly all gene expression pathways and in some cases exceeded 0.80. Attention maps suggest that our trained models recognize biologically relevant spatial patterns of cell sub-populations from H&E. These efforts represent a first step towards developing computational H&E biomarkers that reflect facets of the TME and hold promise for augmenting precision oncology.
https://arxiv.org/abs/2404.16397
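The abstract specifies MIL over H&E slides but not the aggregator; gated attention pooling (Ilse et al., 2018) is a standard choice and is sketched here, with dimensions chosen arbitrarily.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Gated-attention MIL head: tiles -> slide-level pathway-activity logit."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attn_V = nn.Linear(dim, hidden)
        self.attn_U = nn.Linear(dim, hidden)
        self.attn_w = nn.Linear(hidden, 1)
        self.head = nn.Linear(dim, 1)            # binary pathway activity

    def forward(self, tiles):                    # tiles: (N, dim), one slide
        a = self.attn_w(torch.tanh(self.attn_V(tiles)) *
                        torch.sigmoid(self.attn_U(tiles)))  # (N, 1)
        a = torch.softmax(a, dim=0)
        slide = (a * tiles).sum(dim=0)           # attention-weighted bag vector
        return self.head(slide), a               # logit + tile attention map
```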
Learning from Demonstration (LfD) stands as an efficient framework for imparting human-like skills to robots. Nevertheless, designing an LfD framework capable of seamlessly imitating, generalizing, and reacting to disturbances for long-horizon manipulation tasks in dynamic environments remains a challenge. To tackle this challenge, we present Logic Dynamic Movement Primitives (Logic-DMP), which combines Task and Motion Planning (TAMP) with an optimal control formulation of DMP, allowing us to incorporate motion-level via-point specifications and to handle task-level variations or disturbances in dynamic environments. We conduct a comparative analysis of our proposed approach against several baselines, evaluating its generalization ability and reactivity across three long-horizon manipulation tasks. Our experiment demonstrates the fast generalization and reactivity of Logic-DMP for handling task-level variants and disturbances in long-horizon manipulation tasks.
https://arxiv.org/abs/2404.16138
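For readers unfamiliar with DMPs, the base dynamical system Logic-DMP builds on can be integrated in a few lines; the optimal-control reformulation and via-point handling from the paper are not reproduced here, and the gains are textbook defaults.

```python
import numpy as np

def dmp_rollout(y0, g, forcing, T=1.0, dt=0.01,
                alpha=25.0, beta=25.0 / 4.0, alpha_x=3.0):
    """Integrate a 1-D discrete Dynamic Movement Primitive.

    `forcing(x)` is the learned nonlinearity shaping the motion; with zero
    forcing the system converges smoothly to the goal g.
    """
    y, z, x = y0, 0.0, 1.0                       # state and canonical phase
    path = [y]
    for _ in range(int(T / dt)):
        f = forcing(x) * x * (g - y0)            # phase- and goal-scaled term
        z += dt * (alpha * (beta * (g - y) - z) + f)
        y += dt * z
        x += dt * (-alpha_x * x)                 # canonical system decay
        path.append(y)
    return np.array(path)

traj = dmp_rollout(y0=0.0, g=1.0, forcing=lambda x: 0.0)  # converges to g
```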
Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations: e.g., it struggles to distinguish ambiguous verb concepts, to accurately localize roles given fixed verb-centric template input, and to achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via a Language EXplainer (LEX), which significantly boosts the model's comprehension capabilities through three explainers: 1) a verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) a grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) a noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
https://arxiv.org/abs/2404.15785
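To give a flavor of the description-enhanced zero-shot scoring that the verb explainer enables, here is a sketch; the text encoder is an abstract callable, and the descriptions, which the paper generates with a language model, are hand-written stand-ins here.

```python
import torch
import torch.nn.functional as F

def zero_shot_verb(image_emb, text_encoder, verb_descriptions):
    """Pick the verb whose generated descriptions best match the image.

    verb_descriptions : dict verb -> list of description strings, e.g.
        {"jumping": ["a person propelling off the ground", ...], ...}
    """
    image_emb = F.normalize(image_emb, dim=-1)
    scores = {}
    for verb, descs in verb_descriptions.items():
        embs = torch.stack([F.normalize(text_encoder(d), dim=-1) for d in descs])
        scores[verb] = float((embs @ image_emb).mean())  # mean similarity
    return max(scores, key=scores.get)
```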
Report generation models offer fine-grained textual interpretations of medical images like chest X-rays, yet they often lack interactivity (i.e. the ability to steer the generation process through user queries) and localized interpretability (i.e. visually grounding their predictions), which we deem essential for future adoption in clinical practice. While there have been efforts to tackle these issues, they are either limited in their interactivity by not supporting textual queries or fail to also offer localized interpretability. Therefore, we propose a novel multitask architecture and training paradigm integrating textual prompts and bounding boxes for diverse aspects like anatomical regions and pathologies. We call this approach the Chest X-Ray Explainer (ChEX). Evaluations across a heterogeneous set of 9 chest X-ray tasks, including localized image interpretation and report generation, showcase its competitiveness with SOTA models while additional analysis demonstrates ChEX's interactive capabilities.
https://arxiv.org/abs/2404.15770
Health monitoring of remote critical infrastructure is a complex and expensive activity due to the limited accessibility of the infrastructure. Inspection drones are ubiquitous assets that enhance the reliability of critical infrastructures through improved accessibility. However, due to the harsh operating environment, it is crucial to monitor their health to ensure successful inspection operations. The battery is a key component that determines the overall reliability of inspection drones and, with an appropriate health management approach, contributes to reliable and robust inspections. In this context, this paper presents a novel hybrid probabilistic approach for end-of-discharge (EOD) voltage prediction of Li-Po batteries. The hybridization is achieved in an error-correction configuration, which combines physics-based discharge and probabilistic error-correction models to quantify aleatoric and epistemic uncertainty. The performance of the hybrid probabilistic methodology was empirically evaluated on a dataset comprising EOD voltage measurements under varying load conditions. The dataset was obtained from real inspection drones operated on different flights, focused on offshore wind turbine inspections. The proposed approach has been tested with different probabilistic methods and demonstrates a 14.8% improvement in probabilistic accuracy compared to the best standalone probabilistic method. In addition, the aleatoric and epistemic uncertainties provide robust estimations that enhance the diagnosis of battery health states.
https://arxiv.org/abs/2405.00055
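A minimal sketch of the error-correction configuration, assuming a toy open-circuit-voltage discharge model and a Gaussian process on the residuals; the paper's actual physics model, features, and probabilistic method are not specified in the abstract.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def physics_voltage(soc, current, r_int=0.02):
    """Toy physics model: open-circuit voltage minus the ohmic drop."""
    return (3.0 + 1.2 * soc) - r_int * current

soc = np.linspace(1.0, 0.1, 50)
current = np.full_like(soc, 10.0)
measured = physics_voltage(soc, current) - 0.05 * (1 - soc) ** 2  # synthetic

# Error correction: a GP fitted on the physics-model residuals. The RBF part
# captures the systematic (epistemic) error; the white-noise kernel absorbs
# the aleatoric measurement noise.
X = np.column_stack([soc, current])
gp = GaussianProcessRegressor(RBF() + WhiteKernel()).fit(
    X, measured - physics_voltage(soc, current))

correction, std = gp.predict(X, return_std=True)
v_pred = physics_voltage(soc, current) + correction   # hybrid EOD prediction
```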
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. In industrial scenes, there are often a variety of unpredictable anomalies, and VAD methods can play a significant role in these scenarios. However, there is a lack of applicable datasets and methods specifically tailored for industrial production scenarios due to concerns regarding privacy and security. To bridge this gap, we propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios. The industrial processes in our dataset were chosen through on-site factory research and discussions with engineers. The dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of the industrial process, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively exploit the periodic information within a basic reconstruction model. Our framework leverages a LoRA adapter to explore the effective migration of pretrained models, initially trained on synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in the field of industrial video anomaly detection and advance video understanding tasks as well as smart factory deployment.
https://arxiv.org/abs/2404.15033
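The period memory idea can be caricatured with a phase-indexed table of reconstruction errors; the paper's module is learned and integrated with the reconstruction model, so this is only a conceptual sketch.

```python
import numpy as np

def phase_aware_scores(errors, period):
    """Score each frame against the error history at the same cycle phase.

    errors : (T,) numpy array of per-frame reconstruction errors
    period : cycle length in frames, from the dataset's periodicity labels
    """
    memory = [[] for _ in range(period)]        # one error list per phase
    scores = np.zeros_like(errors)
    for t, e in enumerate(errors):
        phase = t % period
        if memory[phase]:                       # deviation from phase baseline
            scores[t] = e - np.median(memory[phase])
        memory[phase].append(e)
    return scores
```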
Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
https://arxiv.org/abs/2404.14906
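A sketch of the late-fusion pattern described above; any pretrained vision-language image tower can stand in for the encoder, and averaging per-view probabilities is one simple fusion rule, assumed here.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Encode each camera view with a frozen VLM tower, fuse class probabilities."""

    def __init__(self, encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = encoder                  # e.g., a pretrained CLIP tower
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, views):                   # views: (V, B, 3, H, W)
        probs = []
        for frames in views:                    # one pass per camera view
            with torch.no_grad():               # keep the VLM encoder frozen
                emb = self.encoder(frames)      # (B, embed_dim)
            probs.append(self.head(emb).softmax(dim=-1))
        return torch.stack(probs).mean(dim=0)   # fused class probabilities
```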
In the field of fraud detection, the availability of comprehensive and privacy-compliant datasets is crucial for advancing machine learning research and developing effective anti-fraud systems. Traditional datasets often focus on transaction-level information, which, while useful, overlooks the broader context of customer behavior patterns that is essential for detecting sophisticated fraud schemes. The scarcity of such data, primarily due to privacy concerns, significantly hampers the development and testing of predictive models that can operate effectively at the customer level. Addressing this gap, our study introduces a benchmark that contains structured datasets specifically designed for customer-level fraud detection. The benchmark not only adheres to strict privacy guidelines to ensure user confidentiality but also provides a rich source of information by encapsulating customer-centric features. The benchmark allows for the comprehensive evaluation of various machine learning models, facilitating a deeper understanding of their strengths and weaknesses in predicting fraudulent activities. Through this work, we seek to bridge the existing gap in data availability, offering researchers and practitioners a valuable resource that empowers the development of next-generation fraud detection techniques.
https://arxiv.org/abs/2404.14746
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at \href{this https URL }{here}.
https://arxiv.org/abs/2404.14471
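The prompt-based recasting of score regression as video-text matching can be sketched as follows; the prompt wording, score bins, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def score_by_matching(video_emb, text_encoder, prompts, temperature=0.07):
    """Predict a score as the expectation over verbalized score candidates.

    prompts : dict score -> prompt string, e.g. (hypothetical wording)
        {s: f"a dive executed with a score of {s}" for s in range(11)}
    """
    scores = sorted(prompts)
    texts = torch.stack([F.normalize(text_encoder(prompts[s]), dim=-1)
                         for s in scores])                 # (K, D)
    sims = texts @ F.normalize(video_emb, dim=-1)          # (K,)
    p = torch.softmax(sims / temperature, dim=0)
    return float(sum(pi * s for pi, s in zip(p, scores)))  # expected score
```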
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets of MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
https://arxiv.org/abs/2404.14066