While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations reduce model API costs by up to 90%. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The consistent advantage of plot-based over text-based prompting (up to a 120% performance increase on zero-shot synthetic tasks, and up to a 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach's potential for making the best use of the native capabilities of foundation models.
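For intuition, here is a minimal sketch of the idea: the same series can be handed to a multimodal model either serialized as text or rendered as a plot image, and the image stays a fixed size regardless of series length. The `query_multimodal_model` call is a hypothetical placeholder for whichever GPT or Gemini client is used; this is not the authors' code.

```python
"""Sketch: text vs. plot representation of a time series for a multimodal model."""
import base64
import io

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

# Synthetic noisy sine wave standing in for a real sensor stream.
t = np.linspace(0, 10, 500)
series = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(t.size)

# Text representation: the raw values serialized into the prompt.
text_prompt = (
    "Identify the dominant trend in this time series:\n"
    + ", ".join(f"{v:.3f}" for v in series)
)

# Visual representation: the same data rendered as a plot image.
fig, ax = plt.subplots(figsize=(6, 3), dpi=100)
ax.plot(t, series, linewidth=1)
ax.set_xlabel("time (s)")
ax.set_ylabel("value")
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
plt.close(fig)
image_b64 = base64.b64encode(buf.getvalue()).decode()

# The text prompt grows linearly with the number of samples, while the plot is a
# single image input -- the likely source of the reported API-cost reduction.
print(f"text prompt characters: {len(text_prompt)}")
print(f"encoded image bytes:    {len(image_b64)}")

# response = query_multimodal_model(  # hypothetical client call, not a real API
#     text="Identify the dominant trend in this plotted time series.",
#     image_base64=image_b64,
# )
```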
https://arxiv.org/abs/2410.02637
Being able to accurately monitor the screen exposure of young children is important for research on phenomena linked to screen use such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that utilizes egocentric images from a wearable sensor, termed the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach by using a dataset of children's free-living activities, demonstrating significant improvement over existing methods in plain vision language models and object detection models. Results supported the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children's naturalistic settings.
https://arxiv.org/abs/2410.01966
Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of that influence will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.
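As a rough illustration of what intrinsic moral rewards can look like on the IPD, here is a sketch using the standard payoff matrix; the exact reward definitions in the paper may differ, so treat these as assumptions rather than the authors' formulation.

```python
"""Sketch: action-based (deontological) and consequence-based (utilitarian)
intrinsic rewards on the Iterated Prisoner's Dilemma."""

# Standard IPD payoffs: (my_payoff, opponent_payoff) indexed by (my_action, their_action).
C, D = "C", "D"
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}

def utilitarian_reward(my_action: str, their_action: str) -> float:
    """Consequence-based: reward the total welfare of both players."""
    mine, theirs = PAYOFFS[(my_action, their_action)]
    return float(mine + theirs)

def deontological_reward(my_action: str, their_prev_action: str) -> float:
    """Action-based: penalize defecting against a partner who cooperated last round."""
    return -1.0 if (my_action == D and their_prev_action == C) else 0.0

# Example: the agent defects after the opponent cooperated.
print(utilitarian_reward(D, C))    # 5 + 0 = 5 (total welfare)
print(deontological_reward(D, C))  # -1.0 (norm violation)
```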
https://arxiv.org/abs/2410.01639
Multimodal language modeling constitutes a recent breakthrough which leverages advances in large language models to pretrain capable multimodal models. The integration of natural language during pretraining has been shown to significantly improve learned representations, particularly in computer vision. However, the efficacy of multimodal language modeling in the realm of functional brain data, specifically for advancing pathology detection, remains unexplored. This study pioneers EEG-language models trained on clinical reports and 15000 EEGs. We extend methods for multimodal alignment to this novel domain and investigate which textual information in reports is useful for training EEG-language models. Our results indicate that models learn richer representations from being exposed to a variety of report segments, including the patient's clinical history, description of the EEG, and the physician's interpretation. Compared to models exposed to narrower clinical text information, we find such models to retrieve EEGs based on clinical reports (and vice versa) with substantially higher accuracy. Yet, this is only observed when using a contrastive learning approach. Particularly in regimes with few annotations, we observe that representations of EEG-language models can significantly improve pathology detection compared to those of EEG-only models, as demonstrated by both zero-shot classification and linear probes. In sum, these results highlight the potential of integrating brain activity data with clinical text, suggesting that EEG-language models represent significant progress for clinical applications.
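The contrastive learning approach the abstract credits for the retrieval gains is, in spirit, a CLIP-style symmetric InfoNCE objective over paired EEG and report embeddings. A minimal sketch follows; the encoder architectures, batch size, and temperature are illustrative assumptions.

```python
"""Sketch: CLIP-style contrastive alignment of EEG and clinical-report embeddings."""
import torch
import torch.nn.functional as F

def clip_style_loss(eeg_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired EEG / report embeddings."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = eeg_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```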
https://arxiv.org/abs/2409.07480
In this work, we investigate how a model's tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model's ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model's knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model's capacity for knowledge acquisition and retention.
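A hedged reading of the definition: knowledge entropy is the Shannon entropy of how the model's engagement is distributed over its memory sources. The sketch below uses synthetic engagement vectors to show how broad versus narrow engagement maps to high versus low entropy; the paper's exact formulation may differ.

```python
"""Sketch: Shannon entropy over a model's (normalized) engagement with memory sources."""
import numpy as np

def knowledge_entropy(memory_coefficients: np.ndarray) -> float:
    """Entropy of the normalized absolute engagement across memory sources."""
    p = np.abs(memory_coefficients)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Broad engagement (early pretraining) vs. reliance on a few sources (late pretraining).
broad = np.ones(1000)          # uniform engagement across 1000 sources
narrow = np.zeros(1000)
narrow[:5] = 1.0               # only a handful of active sources
print(knowledge_entropy(broad))   # ~log(1000) ≈ 6.91
print(knowledge_entropy(narrow))  # ~log(5)    ≈ 1.61
```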
https://arxiv.org/abs/2410.01380
Conversational Question Generation (CQG) enhances the interactivity of conversational question-answering systems in fields such as education, customer service, and entertainment. However, traditional CQG, focusing primarily on the immediate context, lacks the conversational foresight necessary to guide conversations toward specified conclusions. This limitation significantly restricts their ability to achieve conclusion-oriented conversational outcomes. In this work, we redefine the CQG task as Conclusion-driven Conversational Question Generation (CCQG) by focusing on proactivity, not merely reacting to the unfolding conversation but actively steering it towards a conclusion-oriented question-answer pair. To address this, we propose a novel approach, called Proactive Conversational Question Planning with self-Refining (PCQPR). Concretely, by integrating a planning algorithm inspired by Monte Carlo Tree Search (MCTS) with the analytical capabilities of large language models (LLMs), PCQPR predicts future conversation turns and continuously refines its questioning strategies. This iterative self-refining mechanism ensures the generation of contextually relevant questions strategically devised to reach a specified outcome. Our extensive evaluations demonstrate that PCQPR significantly surpasses existing CQG methods, marking a paradigm shift towards conclusion-oriented conversational question-answering systems.
https://arxiv.org/abs/2410.01363
Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. However, current video understanding models struggle with localizing these unusual events, likely because of their insufficient representation in the models' pretraining datasets. To explore foundation models' capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization, featuring three video datasets: UAG-OOPS, UAG-SSBD, and UAG-FunQA, and an instruction-tuning dataset: OOPS-UAG-Instruct, to improve model capabilities. UAL-Bench evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show that the VLM-LLM approach excels in localizing short-span unusual events and predicting their onset (start time) more accurately than Vid-LLMs. We also propose a new metric, R@1, TD <= p, to address limitations in existing evaluation methods. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advancements in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges for existing foundation models, suggesting future research directions on this important task.
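Reading the metric as stated in the abstract, a plausible implementation of R@1, TD <= p is: count a video as a hit when the top-ranked predicted onset lies within p seconds of the ground-truth onset. The input formats below are assumptions.

```python
"""Sketch: Recall@1 with a temporal-distance threshold, as described in the abstract."""

def recall_at_1_td(predicted_onsets, gt_onsets, p: float) -> float:
    """Fraction of videos whose top-1 predicted onset is within p seconds of ground truth."""
    hits = sum(
        1 for pred, gt in zip(predicted_onsets, gt_onsets)
        if abs(pred - gt) <= p
    )
    return hits / len(gt_onsets)

# Toy example: 3 of 4 top-1 predictions fall within p = 1.0 s of the true onset.
print(recall_at_1_td([4.8, 12.0, 30.5, 7.2], [5.0, 11.5, 33.0, 7.0], p=1.0))  # 0.75
```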
https://arxiv.org/abs/2410.01180
Analyzing and finding anomalies in multi-dimensional datasets is a cumbersome but vital task across different domains. In the context of financial fraud detection, analysts must quickly identify suspicious activity among transactional data. This is an iterative process made of complex exploratory tasks such as recognizing patterns, grouping, and comparing. To mitigate the information overload inherent to these steps, we present a tool combining automated information highlights, Large Language Model generated textual insights, and visual analytics, facilitating exploration at different levels of detail. We perform a segmentation of the data per analysis area and visually represent each one, making use of automated visual cues to signal which require more attention. Upon user selection of an area, our system provides textual and graphical summaries. The text, acting as a link between the high-level and detailed views of the chosen segment, allows for a quick understanding of relevant details. A thorough exploration of the data comprising the selection can be done through graphical representations. The feedback gathered in a study performed with seven domain experts suggests our tool effectively supports and guides exploratory analysis, easing the identification of suspicious information.
https://arxiv.org/abs/2410.00727
Spiking Neural Networks (SNNs) and neuromorphic computing offer bio-inspired advantages such as sparsity and ultra-low power consumption, providing a promising alternative to conventional networks. However, training deep SNNs from scratch remains a challenge, as SNNs process and transmit information by quantizing the real-valued membrane potentials into binary spikes. This can lead to information loss and vanishing spikes in deeper layers, impeding effective training. While weight initialization is known to be critical for training deep neural networks, what constitutes an effective initial state for a deep SNN is not well-understood. Existing weight initialization methods designed for conventional networks (ANNs) are often applied to SNNs without accounting for their distinct computational properties. In this work we derive an optimal weight initialization method specifically tailored for SNNs, taking into account the quantization operation. We show theoretically that, unlike standard approaches, this method enables the propagation of activity in deep SNNs without loss of spikes. We demonstrate this behavior in numerical simulations of SNNs with up to 100 layers across multiple time steps. We present an in-depth analysis of the numerical conditions, regarding layer width and neuron hyperparameters, which are necessary to accurately apply our theoretical findings. Furthermore, our experiments on MNIST demonstrate higher accuracy and faster convergence when using the proposed weight initialization scheme. Finally, we show that the newly introduced weight initialization is robust against variations in several network and neuron hyperparameters.
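The vanishing-spike failure mode the paper targets is easy to reproduce: with ANN-style initialization, quantizing membrane potentials to binary spikes makes activity collapse within a few layers. The sketch below only illustrates the problem, not the proposed initialization; the width, threshold, and init scale are arbitrary choices, not the paper's settings.

```python
"""Sketch: spike activity dies out with depth under ANN-style (He) initialization."""
import numpy as np

rng = np.random.default_rng(0)
width, depth, threshold = 256, 10, 1.0

spikes = (rng.random(width) < 0.5).astype(float)  # ~50% of input neurons spike
for layer in range(depth):
    # He-style initialization, which ignores the spike quantization.
    w = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
    membrane = w @ spikes                           # real-valued membrane potentials
    spikes = (membrane >= threshold).astype(float)  # quantization to binary spikes
    print(f"layer {layer + 1:2d}: {int(spikes.sum())} active neurons")
    if spikes.sum() == 0:                           # activity has died out
        break
```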
https://arxiv.org/abs/2410.00580
In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA. The code is released here: this https URL.
https://arxiv.org/abs/2409.20365
OBJECTIVE: To evaluate the efficacy of different forms of virtual reality (VR) treatment, delivered as either immersive virtual reality (IVR) or non-immersive virtual reality (NIVR), in comparison to conventional therapy (CT) in improving physical and psychological status among stroke patients. METHODS: The literature search was conducted on seven databases: ACM Digital Library, Medline (via PubMed), Cochrane, IEEE Xplore, Web of Science, and Scopus. The effect sizes of the main outcomes were calculated using Cohen's d. Pooled results were used to present an overall estimate of the treatment effect using a random-effects model. RESULTS: A total of 22 randomized controlled trials were evaluated. Three trials demonstrated that immersive virtual reality improved upper limb activity, function, and activities of daily living to a degree comparable to CT. Eighteen trials showed that NIVR had benefits similar to CT for upper limb activity and function, balance and mobility, and activities of daily living and participation. A comparison between the different forms of VR showed that IVR may be more beneficial than NIVR for upper limb training and activities of daily living. CONCLUSIONS: This study found that IVR therapies may be more effective than NIVR, but not than CT, in improving upper limb activity, function, and activities of daily living. However, there is no evidence on the durability of IVR treatment. More research involving studies with larger samples is needed to assess the long-term effects and promising benefits of immersive virtual reality technology.
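For readers unfamiliar with the effect-size measure used here, Cohen's d is the difference in group means divided by the pooled standard deviation. The numbers in the sketch below are made up for illustration and are not taken from the reviewed trials.

```python
"""Sketch: Cohen's d for two independent groups using the pooled standard deviation."""
import math

def cohens_d(m1: float, s1: float, n1: int, m2: float, s2: float, n2: int) -> float:
    """Standardized mean difference between a treatment and a control group."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical example: VR group mean 54 (SD 10, n 20) vs. CT group mean 48 (SD 11, n 20).
print(round(cohens_d(54, 10, 20, 48, 11, 20), 2))  # ≈ 0.57, a medium effect
```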
https://arxiv.org/abs/2409.20260
This paper addresses streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from a linear-complexity advantage. We explore the efficiency of the Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and triggers token output in a streaming fashion, while aggregating feature frames for better learning of token representations. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
https://arxiv.org/abs/2410.00070
Brain-related research topics in artificial intelligence have recently gained popularity, particularly due to the expansion of what multimodal architectures can do, from computer vision to natural language processing. Our main goal in this work is to explore the possibilities and limitations of these architectures for decoding spoken text from non-invasive fMRI recordings. Contrary to vision and textual data, fMRI data represent a complex modality due to the variety of brain scanners, which implies (i) the variety of recorded signal formats, (ii) the low resolution and noise of the raw signals, and (iii) the scarcity of pretrained models that can be leveraged as foundation models for generative learning. These points make the problem of non-invasive decoding of text from fMRI recordings very challenging. In this paper, we propose an end-to-end multimodal LLM for decoding spoken text from fMRI signals. The proposed architecture is founded on (i) an encoder derived from a specific transformer, incorporating an augmented embedding layer and a better-adjusted attention mechanism than that present in the state of the art, and (ii) a frozen large language model adapted to align the embedding of the input text and the encoded embedding of brain activity in order to decode the output text. A benchmark is performed on a corpus consisting of a set of human-human and human-robot interactions where fMRI and conversational signals are recorded synchronously. The obtained results are very promising, as our proposal outperforms the evaluated models and is able to generate text that captures the semantics present in the ground truth more accurately. The implementation code is provided in this https URL.
https://arxiv.org/abs/2409.19710
Surgical procedures are inherently complex and dynamic, with intricate dependencies and various execution paths. Accurate identification of the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial to understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is based on a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, utilizing the surgical activity grammar, processes visual data obtained from laparoscopic images through surgical action detectors, ensuring a more precise interpretation of the visual information. Experimental results on the benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.
https://arxiv.org/abs/2409.19579
Segmentation of anatomical structures and pathological regions in medical images is essential for modern clinical diagnosis, disease research, and treatment planning. While significant advancements have been made in deep learning-based segmentation techniques, many of these methods still suffer from limitations in data efficiency, generalizability, and interactivity. As a result, developing precise segmentation methods that require fewer labeled datasets remains a critical challenge in medical image analysis. Recently, the introduction of foundation models like CLIP and Segment-Anything-Model (SAM), with robust cross-domain representations, has paved the way for interactive and universal image segmentation. However, further exploration of these models for data-efficient segmentation in medical imaging is still needed and highly relevant. In this paper, we introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans using text prompts, in both zero-shot and weakly supervised settings. Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, and leveraging the Multi-modal Information Bottleneck (M2IB) to create visual prompts for generating segmentation masks from SAM in the zero-shot setting. We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further. Extensive testing across four diverse segmentation tasks and medical imaging modalities (breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT) demonstrates the high accuracy of our proposed framework. Our code is available at this https URL.
https://arxiv.org/abs/2409.19483
We introduce Brain-JEPA, a brain dynamics foundation model with the Joint-Embedding Predictive Architecture (JEPA). This pioneering model achieves state-of-the-art performance in demographic prediction, disease diagnosis/prognosis, and trait prediction through fine-tuning. Furthermore, it excels in off-the-shelf evaluations (e.g., linear probing) and demonstrates superior generalizability across different ethnic groups, significantly surpassing the previous large model for brain activity. Brain-JEPA incorporates two innovative techniques: Brain Gradient Positioning and Spatiotemporal Masking. Brain Gradient Positioning introduces a functional coordinate system for brain functional parcellation, enhancing the positional encoding of different Regions of Interest (ROIs). Spatiotemporal Masking, tailored to the unique characteristics of fMRI data, addresses the challenge of heterogeneous time-series patches. These methodologies enhance model performance and advance our understanding of the neural circuits underlying cognition. Overall, Brain-JEPA is paving the way toward addressing pivotal questions of building a brain functional coordinate system and masking brain activity at the AI-neuroscience interface, and toward setting a potentially new paradigm in brain activity analysis through downstream adaptation.
https://arxiv.org/abs/2409.19407
Activity recognition is a challenging task due to the large scale of trajectory data and the need for prompt and efficient processing. Existing methods have attempted to mitigate this problem by employing traditional LSTM architectures, but these approaches often suffer from inefficiencies in processing large datasets. In response to this challenge, we propose VecLSTM, a novel framework that enhances the performance and efficiency of LSTM-based neural networks. Unlike conventional approaches, VecLSTM incorporates vectorization layers, leveraging optimized mathematical operations to process input sequences more efficiently. We have implemented VecLSTM and incorporated it into the MySQL database. To evaluate the effectiveness of VecLSTM, we compare its performance against a conventional LSTM model using a dataset comprising 1,467,652 samples with seven unique labels. Experimental results demonstrate superior accuracy and efficiency compared to the state of the art, with VecLSTM achieving a validation accuracy of 85.57%, a test accuracy of 85.47%, and a weighted F1-score of 0.86. Furthermore, VecLSTM significantly reduces training time, offering a 26.2% reduction compared to traditional LSTM models.
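The abstract does not spell out the vectorization layers, so the sketch below only captures the general shape: a batched feature-projection ("vectorization") stage in front of an LSTM classifier over trajectory sequences with seven activity labels. Layer sizes and the interpretation of the vectorization stage as a learned projection are assumptions.

```python
"""Sketch: a vectorization stage feeding an LSTM classifier for trajectory-based activity recognition."""
import torch
import torch.nn as nn

class VecLSTMSketch(nn.Module):
    def __init__(self, in_features=3, vec_dim=64, hidden=128, num_classes=7):
        super().__init__()
        # "Vectorization": map raw trajectory points (e.g. lat, lon, t) to dense vectors.
        self.vectorize = nn.Sequential(nn.Linear(in_features, vec_dim), nn.ReLU())
        self.lstm = nn.LSTM(vec_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):               # x: (batch, seq_len, in_features)
        z = self.vectorize(x)           # applied to all time steps in one batched op
        _, (h, _) = self.lstm(z)
        return self.head(h[-1])         # logits over the seven activity labels

logits = VecLSTMSketch()(torch.randn(4, 50, 3))
print(logits.shape)  # torch.Size([4, 7])
```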
https://arxiv.org/abs/2409.19258
Understanding the correlation between EEG features and cognitive tasks is crucial for elucidating brain function. Brain activity synchronizes during speaking and listening tasks. However, it is challenging to estimate task-dependent brain activity characteristics with methods with low spatial resolution but high temporal resolution, such as EEG, rather than methods with high spatial resolution, like fMRI. This study introduces a novel approach to EEG feature estimation that utilizes the weights of deep learning models to explore this association. We demonstrate that attention maps generated from Vision Transformers and EEGNet effectively identify features that align with findings from prior studies. EEGNet emerged as the most accurate model regarding subject independence and the classification of Listening and Speaking tasks. The application of Mel-Spectrogram with ViTs enhances the resolution of temporal and frequency-related EEG characteristics. Our findings reveal that the characteristics discerned through attention maps vary significantly based on the input data, allowing for tailored feature extraction from EEG signals. By estimating features, our study reinforces known attributes and predicts new ones, potentially offering fresh perspectives in utilizing EEG for medical purposes, such as early disease detection. These techniques will make substantial contributions to cognitive neuroscience.
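To make the input pipeline concrete: a single EEG channel can be converted to a Mel-spectrogram image before being fed to a Vision Transformer. The sampling rate, window lengths, and number of mel bands below are illustrative assumptions, not the study's settings.

```python
"""Sketch: turning one EEG channel into a Mel-spectrogram suitable for a ViT."""
import numpy as np
import librosa

sr = 256                                              # assumed EEG sampling rate in Hz
eeg = np.random.randn(10 * sr).astype(np.float32)     # 10 s of one channel (placeholder data)

mel = librosa.feature.melspectrogram(
    y=eeg, sr=sr, n_fft=256, hop_length=64, n_mels=32, fmax=sr / 2
)
mel_db = librosa.power_to_db(mel, ref=np.max)         # log-scaled, image-like representation

# (n_mels, time_frames); resized/tiled to the ViT input resolution downstream.
print(mel_db.shape)
```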
https://arxiv.org/abs/2409.19174
The analysis of a crime scene is a pivotal activity in forensic investigations. Crime Scene Investigators and forensic science practitioners rely on best practices, standard operating procedures, and critical thinking to produce rigorous scientific reports that document the scenes of interest and meet the quality standards expected in the courts. However, crime scene examination is a complex and multifaceted task often performed in environments susceptible to deterioration, contamination, and alteration, despite the use of contact-free and non-destructive methods of analysis. In this context, the documentation of the sites and the identification and isolation of traces of evidential value remain challenging endeavours. In this paper, we propose a photogrammetric reconstruction of the crime scene for inspection in virtual reality (VR) and focus on fully automatic object recognition with deep learning (DL) algorithms through a client-server architecture. A pre-trained Faster R-CNN model was chosen as the method that best categorizes the relevant objects at the scene, as selected by experts in the VR environment. These operations can considerably improve and accelerate crime scene analysis and help the forensic expert extract measurements and analyse the objects of interest in detail. Experimental results on a simulated crime scene have shown that the proposed method can be effective in finding and recognizing objects with potential evidentiary value, enabling timely analyses of crime scenes, particularly those with health and safety risks (e.g. fires, explosions, chemicals, etc.), while minimizing subjective bias and contamination of the scene.
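A minimal sketch of the server-side detection step, using torchvision's COCO-pretrained Faster R-CNN as a stand-in for the detector the authors selected and fine-tuned for scene-relevant categories; the image path is a hypothetical placeholder.

```python
"""Sketch: Faster R-CNN inference on a scene photograph (COCO-pretrained stand-in)."""
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = Image.open("scene_photo.jpg").convert("RGB")   # hypothetical client-supplied image
with torch.no_grad():
    pred = model([to_tensor(image)])[0]

# Keep confident detections; boxes and labels would be sent back to the VR client for display.
keep = pred["scores"] > 0.7
for box, label, score in zip(pred["boxes"][keep], pred["labels"][keep], pred["scores"][keep]):
    print(int(label), round(float(score), 3), [round(float(v), 1) for v in box])
```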
https://arxiv.org/abs/2409.18458
Beta-amyloid positron emission tomography (A$\beta$-PET) imaging has become a critical tool in Alzheimer's disease (AD) research and diagnosis, providing insights into the pathological accumulation of amyloid plaques, one of the hallmarks of AD. However, the high cost, limited availability, and exposure to radioactivity restrict the widespread use of A$\beta$-PET imaging, leading to a scarcity of comprehensive datasets. Previous studies have suggested that structural magnetic resonance imaging (MRI), which is more readily available, may serve as a viable alternative for synthesizing A$\beta$-PET images. In this study, we propose an approach to utilize 3D diffusion models to synthesize A$\beta$-PET images from T1-weighted MRI scans, aiming to overcome the limitations associated with direct PET imaging. Our method generates high-quality A$\beta$-PET images for cognitive normal cases, although it is less effective for mild cognitive impairment (MCI) patients due to the variability in A$\beta$ deposition patterns among subjects. Our preliminary results suggest that incorporating additional data, such as a larger sample of MCI cases and multi-modality information including clinical and demographic details, cognitive and functional assessments, and longitudinal data, may be necessary to improve A$\beta$-PET image synthesis for MCI patients.
https://arxiv.org/abs/2409.18282