Purpose: Surgical performance depends not only on surgeons' technical skills but also on team communication within and across the different professional groups present during the operation. Automatically identifying team communication in the OR is therefore crucial for patient safety and for advancing computer-assisted surgical workflow analysis and intra-operative support systems. As a first step, we propose a new task of detecting communication briefings involving all OR team members, i.e. the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. Methods: We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach that encodes both scene context and action features and uses an efficient neural network model to output the results. Results: Experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. They also highlight the lack of research on group activities in the OR, underscoring the significance of our dataset. Conclusion: We investigate the team Time-out and the StOP?-protocol in the OR by presenting the first OR dataset with temporal annotations of group activity protocols, and by introducing a novel group activity detection approach that outperforms existing approaches. Code is available at this https URL.
https://arxiv.org/abs/2502.08299
Future Event Prediction (FEP) is an essential activity whose demand and applications span multiple domains. While traditional methods such as simulation, predictive modeling, and time-series forecasting have demonstrated promising outcomes, their application to forecasting complex events is not entirely reliable, because numerical data cannot accurately capture the semantic information related to events. One way to forecast is to gather and aggregate collective opinions about the future, as cumulative perspectives can help estimate the likelihood of upcoming events. In this work, we organize the existing research and frameworks that aim to support future event prediction based on crowd wisdom through aggregating individual forecasts. We discuss the challenges involved, available datasets, and the scope for improvement and future research directions for this task. We also introduce a novel data model to represent individual forecast statements.
https://arxiv.org/abs/2502.08205
While functional magnetic resonance imaging (fMRI) offers rich spatial resolution, it is limited by high operational costs and significant infrastructural demands. In contrast, electroencephalography (EEG) provides millisecond-level precision in capturing electrical activity but lacks the spatial resolution necessary for precise neural localization. To bridge these gaps, we introduce E2fNet, a simple yet effective deep learning model for synthesizing fMRI images from low-cost EEG data. E2fNet is specifically designed to capture and translate meaningful features from EEG across electrode channels into accurate fMRI representations. Extensive evaluations across three datasets demonstrate that E2fNet consistently outperforms existing methods, achieving state-of-the-art results in terms of the structural similarity index measure (SSIM). Our findings suggest that E2fNet is a promising, cost-effective solution for enhancing neuroimaging capabilities. The code is available at this https URL.
https://arxiv.org/abs/2502.08025
Surgical simulation offers a promising addition to conventional surgical training. However, available simulation tools lack photorealism and rely on hardcoded behaviour. Denoising Diffusion Models are a promising alternative for high-fidelity image synthesis, but existing state-of-the-art conditioning methods fall short in providing precise control or interactivity over the generated scenes. We introduce SurGrID, a Scene Graph to Image Diffusion Model, allowing for controllable surgical scene synthesis by leveraging Scene Graphs. These graphs encode a surgical scene's components' spatial and semantic information, which are then translated into an intermediate representation using our novel pre-training step that explicitly captures local and global information. Our proposed method improves the fidelity of generated images and their coherence with the graph input over the state-of-the-art. Further, we demonstrate the simulation's realism and controllability in a user assessment study involving clinical experts. Scene Graphs can be effectively used for precise and interactive conditioning of Denoising Diffusion Models for simulating surgical scenes, enabling high fidelity and interactive control over the generated content.
https://arxiv.org/abs/2502.07945
Trajectory User Linking (TUL), which links anonymous trajectories with users who generate them, plays a crucial role in modeling human mobility. Despite significant advancements in this field, existing studies primarily neglect the high-order inter-trajectory relationships, which represent complex associations among multiple trajectories, manifested through multi-location co-occurrence patterns emerging when trajectories intersect at various Points of Interest (POIs). Furthermore, they also overlook the variable influence of POIs on different trajectories, as well as the user class imbalance problem caused by disparities in user activity levels and check-in frequencies. To address these limitations, we propose a novel HyperGraph-based multi-perspective Trajectory User Linking model (HGTUL). Our model learns trajectory representations from both relational and spatio-temporal perspectives: (1) it captures high-order associations among trajectories by constructing a trajectory hypergraph and leverages a hypergraph attention network to learn the variable impact of POIs on trajectories; (2) it models the spatio-temporal characteristics of trajectories by incorporating their temporal and spatial information into a sequential encoder. Moreover, we design a data balancing method to effectively address the user class imbalance problem and experimentally validate its significance in TUL. Extensive experiments on three real-world datasets demonstrate that HGTUL outperforms state-of-the-art baselines, achieving improvements of 2.57%~20.09% and 5.68%~26.00% in ACC@1 and Macro-F1 metrics, respectively.
https://arxiv.org/abs/2502.07549
Background. Recently, dynamic total-body positron emission tomography (PET) imaging has become possible due to new scanner devices. Although clustering algorithms were proposed for PET analysis earlier, there is still little research systematically evaluating these algorithms for processing dynamic total-body PET images. Materials and methods. Here, we compare the performance of 15 unsupervised clustering methods, including K-means either by itself or after principal component analysis (PCA) or independent component analysis (ICA), Gaussian mixture model (GMM), fuzzy c-means (FCM), agglomerative clustering, spectral clustering, and several newer clustering algorithms, for classifying time activity curves (TACs) in dynamic PET images. We use dynamic total-body $^{15}$O-water PET images collected from 30 patients with suspected or confirmed coronary artery disease. To evaluate the clustering algorithms quantitatively, we use them to classify 5000 TACs from each image based on whether the curve is taken from the brain, right heart ventricle, right kidney, lower right lung lobe, or urinary bladder. Results. According to our results, the best methods are GMM, FCM, and ICA combined with mini-batch K-means, which classified the TACs with median accuracies of 89\%, 83\%, and 81\%, respectively, in a processing time of half a second or less on average per image. Conclusion. GMM, FCM, and ICA with mini-batch K-means show promise for dynamic total-body PET analysis.
https://arxiv.org/abs/2502.07511
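As a stand-in for the clustering comparison above (the study itself evaluates GMM, FCM, ICA with mini-batch K-means, and others, on real $^{15}$O-water data), a bare-bones K-means over synthetic time–activity curves shows the shape of the task; all curve values and settings here are illustrative, not from the paper:

```python
import random

def kmeans(curves, k, iters=50, seed=0):
    """Minimal K-means for time-activity curves (each curve a list of floats).
    Returns (cluster label per curve, final centroids)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(curves, k)]
    assign = [0] * len(curves)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, c in enumerate(curves):
            assign[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(c, centroids[j])),
            )
        # Update step: each centroid becomes the mean of its member curves.
        for j in range(k):
            members = [curves[i] for i in range(len(curves)) if assign[i] == j]
            if members:
                centroids[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return assign, centroids
```

On two well-separated synthetic TAC families, the labels recover the grouping; the real evaluation instead checks agreement with anatomical regions (brain, right ventricle, kidney, lung lobe, bladder).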
Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions.
https://arxiv.org/abs/2502.07272
This work introduces a spike-based wearable analytics system utilizing Spiking Neural Networks (SNNs) deployed on an in-memory computing engine based on RRAM crossbars, which are known for their compactness and energy efficiency. Given the hardware constraints and noise characteristics of the underlying RRAM crossbars, we propose online, real-time adaptation of pre-trained SNNs using Direct Feedback Alignment (DFA) instead of traditional backpropagation (BP). DFA learning, which allows layer-parallel gradient computation, acts as a fast, energy- and area-efficient method for online adaptation of SNNs on RRAM crossbars, yielding better algorithmic performance than BP-based adaptation. Through extensive simulations using our in-house hardware evaluation engine, DFA_Sim, we find that DFA achieves up to 64.1% lower energy consumption, 10.1% lower area overhead, and a 2.1x reduction in latency compared to BP, while delivering up to 7.55% higher inference accuracy on human activity recognition (HAR) tasks.
https://arxiv.org/abs/2502.06736
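The mechanical difference between DFA and BP can be sketched on a toy dense network (a generic DFA illustration in numpy, not the paper's SNN-on-RRAM implementation; all dimensions and values are made up): the top-layer gradient is identical to backprop's, while the hidden layer receives the output error through a fixed random matrix B instead of W2.T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny dense network x -> h = tanh(W1 x) -> y = W2 h, with squared-error loss.
n_in, n_hid, n_out = 4, 8, 2
W1 = rng.normal(0.0, 0.5, (n_hid, n_in))
W2 = rng.normal(0.0, 0.5, (n_out, n_hid))
# DFA: a FIXED random feedback matrix carries the error to the hidden layer.
B = rng.normal(0.0, 0.5, (n_hid, n_out))

def dfa_grads(x, target):
    """One DFA pass: returns (dW1, dW2, loss). dW2 equals the true backprop
    gradient; dW1 uses the random projection B in place of W2.T, so each
    layer's update can be computed in parallel once the error is known."""
    h = np.tanh(W1 @ x)
    y = W2 @ h
    e = y - target                    # output error
    dW2 = np.outer(e, h)              # same as the backprop top-layer gradient
    dh = (B @ e) * (1.0 - h ** 2)     # DFA: random feedback, not W2.T @ e
    dW1 = np.outer(dh, x)
    return dW1, dW2, float(0.5 * e @ e)
```

In a training loop one would then apply `W1 -= lr * dW1; W2 -= lr * dW2`; because B never changes, no weight transport or sequential backward sweep is needed.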
Code review is a crucial but often complex, subjective, and time-consuming activity in software development. Over the past decades, significant efforts have been made to automate this process. Early approaches focused on knowledge-based systems (KBS) that apply rule-based mechanisms to detect code issues, providing precise feedback but struggling with complex, context-dependent cases. More recent work has shifted toward fine-tuning pre-trained language models for code review, enabling broader issue coverage but often at the expense of precision. In this paper, we propose a hybrid approach that combines the strengths of KBS and learning-based systems (LBS) to generate high-quality, comprehensive code reviews. Our method integrates knowledge at three distinct stages of the language model pipeline: during data preparation (Data-Augmented Training, DAT), at inference (Retrieval-Augmented Generation, RAG), and after inference (Naive Concatenation of Outputs, NCO). We empirically evaluate our combination strategies against standalone KBS and LBS fine-tuned on a real-world dataset. Our results show that these hybrid strategies enhance the relevance, completeness, and overall quality of review comments, effectively bridging the gap between rule-based tools and deep learning models.
https://arxiv.org/abs/2502.06633
Environmental crime currently represents the third largest criminal activity worldwide while threatening ecosystems as well as human health. Among the crimes related to this activity, improper waste management can nowadays be countered more easily thanks to the increasing availability and decreasing cost of Very-High-Resolution Remote Sensing images, which enable semi-automatic territory scanning in search of illegal landfills. This paper proposes a pipeline, developed in collaboration with professionals from a local environmental agency, for detecting candidate illegal dumping sites leveraging a classifier of Remote Sensing images. To identify the best configuration for such classifier, an extensive set of experiments was conducted and the impact of diverse image characteristics and training settings was thoroughly analyzed. The local environmental agency was then involved in an experimental exercise where outputs from the developed classifier were integrated in the experts' everyday work, resulting in time savings with respect to manual photo-interpretation. The classifier was eventually run with valuable results on a location outside of the training area, highlighting potential for cross-border applicability of the proposed pipeline.
https://arxiv.org/abs/2502.06607
Public health researchers are increasingly interested in using social media data to study health-related behaviors, but manually labeling this data can be labor-intensive and costly. This study explores whether zero-shot labeling using large language models (LLMs) can match or surpass conventional crowd-sourced annotation for Twitter posts related to sleep disorders, physical activity, and sedentary behavior. Multiple annotation pipelines were designed to compare labels produced by domain experts, crowd workers, and LLM-driven approaches under varied prompt-engineering strategies. Our findings indicate that LLMs can rival human performance in straightforward classification tasks and significantly reduce labeling time, yet their accuracy diminishes for tasks requiring more nuanced domain knowledge. These results clarify the trade-offs between automated scalability and human expertise, demonstrating conditions under which LLM-based labeling can be efficiently integrated into public health research without undermining label quality.
https://arxiv.org/abs/2502.06150
Traveling waves are widely observed in the brain, but their precise computational function remains unclear. One prominent hypothesis is that they enable the transfer and integration of spatial information across neural populations. However, few computational models have explored how traveling waves might be harnessed to perform such integrative processing. Drawing inspiration from the famous ``Can one hear the shape of a drum?'' problem -- which highlights how spectral modes encode geometric information -- we introduce a set of convolutional recurrent neural networks that learn to produce traveling waves in their hidden states in response to visual stimuli. By applying a spectral decomposition to these wave-like activations, we obtain a powerful new representational space that outperforms equivalently local feed-forward networks on tasks requiring global spatial context. In particular, we observe that traveling waves effectively expand the receptive field of locally connected neurons, supporting long-range encoding and communication of information. We demonstrate that models equipped with this mechanism and spectral readouts solve visual semantic segmentation tasks demanding global integration, where local feed-forward models fail. As a first step toward traveling-wave-based representations in artificial networks, our findings suggest potential efficiency benefits and offer a new framework for connecting to biological recordings of neural activity.
https://arxiv.org/abs/2502.06034
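The spectral-decomposition step described above can be illustrated with a minimal stand-in (not the paper's model): take hidden-state activity over time and read off each unit's magnitude spectrum with an FFT, turning a time-indexed representation into a frequency-indexed one.

```python
import numpy as np

def spectral_readout(hidden_states):
    """hidden_states: (T, N) array of hidden activity over T timesteps for N
    units. Returns the (T//2 + 1, N) magnitude spectrum of each unit, i.e. a
    representation indexed by temporal frequency rather than by time."""
    return np.abs(np.fft.rfft(hidden_states, axis=0))
```

A unit oscillating at k cycles over the window produces a peak in frequency bin k, so spatially structured waves leave a geometric signature in the spectrum, loosely mirroring the drum-shape analogy in the abstract.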
Validating the safety and performance of an autonomous vehicle (AV) requires benchmarking on real-world driving logs. However, typical driving logs contain mostly uneventful scenarios with minimal interactions between road users. Identifying interactive scenarios in real-world driving logs enables the curation of datasets that amplify critical signals and provide a more accurate assessment of an AV's performance. In this paper, we present a novel metric that identifies interactive scenarios by measuring an AV's surprise potential on others. First, we identify three dimensions of the design space to describe a family of surprise potential measures. Second, we exhaustively evaluate and compare different instantiations of the surprise potential measure within this design space on the nuScenes dataset. To determine how well a surprise potential measure correctly identifies an interactive scenario, we use a reward model learned from human preferences to assess alignment with human intuition. Our proposed surprise potential, arising from this exhaustive comparative study, achieves a correlation of more than 0.82 with the human-aligned reward function, outperforming existing approaches. Lastly, we validate motion planners on curated interactive scenarios to demonstrate downstream applications.
https://arxiv.org/abs/2502.05677
Drones, or unmanned aerial vehicles, have traditionally been used for military missions, warfare, and espionage. However, drone usage has increased significantly due to industrial applications involving security and inspection, transportation, research, and recreational flying. This increased volume of drone activity in public spaces requires regulatory action to protect privacy and ensure safety. Hence, detecting illegal drone activities such as boundary encroachment becomes a necessity. Such detection tasks are usually automated and performed by deep learning models trained on annotated image datasets. This paper builds on previous work and extends an already published open-source dataset. A description and analysis of the entire dataset is provided. The dataset is used to train the YOLOv7 deep learning model and some of its minor variants, and the results are reported. Since the detection models operate on a single image input, a simple cross-correlation-based tracker is used to reduce detection drops and improve tracking performance in videos. Finally, the entire drone detection system is summarized.
https://arxiv.org/abs/2502.05292
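The paper does not spell out its cross-correlation tracker, but the core step such a tracker repeats per video frame is a normalized cross-correlation template match; a pure-Python sketch with illustrative values (not the paper's implementation):

```python
import math

def ncc_match(frame, template):
    """Find the (row, col) in `frame` (2D list of floats) where `template`
    scores highest under normalized cross-correlation. One such match per
    frame, seeded near the last detection, is a minimal tracker step."""
    th, tw = len(template), len(template[0])
    tmean = sum(sum(r) for r in template) / (th * tw)
    tdev = [[v - tmean for v in r] for r in template]
    tnorm = math.sqrt(sum(v * v for r in tdev for v in r))
    best, best_pos = -2.0, (0, 0)
    for i in range(len(frame) - th + 1):
        for j in range(len(frame[0]) - tw + 1):
            patch = [row[j:j + tw] for row in frame[i:i + th]]
            pmean = sum(sum(r) for r in patch) / (th * tw)
            pdev = [[v - pmean for v in r] for r in patch]
            pnorm = math.sqrt(sum(v * v for r in pdev for v in r))
            if pnorm == 0 or tnorm == 0:
                continue  # flat patch: correlation undefined, skip it
            score = sum(a * b for ra, rb in zip(tdev, pdev)
                        for a, b in zip(ra, rb)) / (tnorm * pnorm)
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos, best
```

Mean subtraction and normalization make the score invariant to brightness and contrast changes between frames, which is why NCC is a common choice for cheap single-object tracking between detector runs.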
Human Activity Recognition (HAR) such as fall detection has become increasingly critical due to the aging population, necessitating effective monitoring systems to prevent serious injuries and fatalities associated with falls. This study focuses on fine-tuning the Vision Transformer (ViT) model specifically for HAR using radar-based Time-Doppler signatures. Unlike traditional image datasets, these signals present unique challenges due to their non-visual nature and the high degree of similarity among various activities. Directly fine-tuning the ViT with all parameters proves suboptimal for this application. To address this challenge, we propose a novel approach that employs Low-Rank Adaptation (LoRA) fine-tuning in the weight space to facilitate knowledge transfer from pre-trained ViT models. Additionally, to extract fine-grained features, we enhance feature representation through the integration of a serial-parallel adapter in the feature space. Our innovative joint fine-tuning method, tailored for radar-based Time-Doppler signatures, significantly improves HAR accuracy, surpassing existing state-of-the-art methodologies in this domain. Our code is released at this https URL.
https://arxiv.org/abs/2502.04740
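LoRA itself (independent of the paper's radar-specific serial-parallel adapters) amounts to freezing a pre-trained weight W and learning a low-rank correction B A; a minimal numpy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2        # layer dims; LoRA rank r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero init

def lora_forward(x):
    """y = (W + B @ A) @ x: the frozen weight plus a trainable rank-r update.
    Only A and B, i.e. r * (d_in + d_out) parameters, receive gradients,
    versus d_out * d_in for full fine-tuning."""
    return W @ x + B @ (A @ x)
```

Zero-initializing B makes the adapted layer exactly match the pre-trained one at the start of fine-tuning, so training departs smoothly from the pre-trained ViT behavior.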
Deep learning methods have been widely used for Human Activity Recognition (HAR) using recorded signals from Inertial Measurement Unit (IMU) sensors installed on various parts of the human body. For this type of HAR, several challenges exist, the most significant of which is the analysis of multivariate IMU sensor data. Here, we introduce a Hierarchically Unsupervised Fusion (HUF) model designed to extract and fuse features from IMU sensor data via a hybrid structure of Convolutional Neural Networks (CNNs) and Autoencoders (AEs). First, we design a stacked CNN-AE to embed short-time signals into sets of high-dimensional features. Second, we develop another CNN-AE network to locally fuse the extracted features from each sensor unit. Finally, we unify all the sensor features through a third CNN-AE architecture for global feature fusion, creating a unique feature set. Additionally, we analyze the effects of varying the model hyperparameters. The best results are achieved with eight convolutional layers in each AE. Furthermore, we determine that an overcomplete AE with 256 kernels in the code layer is suitable for feature extraction in the first block of the proposed HUF model; this number is reduced to 64 in the last block of the model to match the size of the features passed to the classifier. The tuned model is applied to the UCI-HAR, DaLiAc, and Parkinson's disease gait datasets, achieving classification accuracies of 97%, 97%, and 88%, respectively, which is nearly 3% better than state-of-the-art supervised methods.
https://arxiv.org/abs/2502.04489
We present an approach to identifying which ransomware adversaries are most likely to target specific entities, thereby assisting these entities in formulating better protection strategies. Ransomware poses a formidable cybersecurity threat characterized by profit-driven motives, a complex underlying economy supporting criminal syndicates, and the overt nature of its attacks. This type of malware has consistently ranked among the most prevalent, with a rapid escalation in activity observed. Recent estimates indicate that approximately two-thirds of organizations experienced ransomware attacks in 2023 \cite{Sophos2023Ransomware}. A central tactic in ransomware campaigns is publicizing attacks to coerce victims into paying ransoms. Our study utilizes public disclosures from ransomware victims to predict the likelihood of an entity being targeted by a specific ransomware variant. We employ a Large Language Model (LLM) architecture that uses a unique chain-of-thought, multi-shot prompt methodology to define adversary SKRAM (Skills, Knowledge, Resources, Authorities, and Motivation) profiles from ransomware bulletins, threat reports, and news items. This analysis is enriched with publicly available victim data and is further enhanced by a heuristic for generating synthetic data that reflects victim profiles. Our work culminates in the development of a machine learning model that assists organizations in prioritizing ransomware threats and formulating defenses based on the tactics, techniques, and procedures (TTP) of the most likely attackers.
https://arxiv.org/abs/2502.04421
Despite remarkable capabilities, large language models (LLMs) struggle to continually update their knowledge without catastrophic forgetting. In contrast, humans effortlessly integrate new information, detect conflicts with existing beliefs, and selectively update their mental models. This paper introduces a cognitive-inspired investigation paradigm to study continual knowledge updating in LLMs. We implement two key components inspired by human cognition: (1) Dissonance and Familiarity Awareness, analyzing model behavior to classify information as novel, familiar, or dissonant; and (2) Targeted Network Updates, which track neural activity to identify frequently used (stubborn) and rarely used (plastic) neurons. Through carefully designed experiments in controlled settings, we uncover a number of empirical findings demonstrating the potential of this approach. First, dissonance detection is feasible using simple activation and gradient features, suggesting potential for cognitive-inspired training. Second, we find that non-dissonant updates largely preserve prior knowledge regardless of targeting strategy, revealing inherent robustness in LLM knowledge integration. Most critically, we discover that dissonant updates prove catastrophically destructive to the model's knowledge base, indiscriminately affecting even information unrelated to the current updates. This suggests fundamental limitations in how neural networks handle contradictions and motivates the need for new approaches to knowledge updating that better mirror human cognitive mechanisms.
https://arxiv.org/abs/2502.04390
Activation functions are fundamental elements of deep learning architectures as they significantly influence training dynamics. ReLU, while widely used, is prone to the dying neuron problem, which has been mitigated by variants such as LeakyReLU, PReLU, and ELU that better handle negative neuron outputs. Recently, self-gated activations like GELU and Swish have emerged as state-of-the-art alternatives, leveraging their smoothness to ensure stable gradient flow and prevent neuron inactivity. In this work, we introduce the Gompertz Linear Unit (GoLU), a novel self-gated activation function defined as $\mathrm{GoLU}(x) = x \, \mathrm{Gompertz}(x)$, where $\mathrm{Gompertz}(x) = e^{-e^{-x}}$. The GoLU activation leverages the asymmetry in the Gompertz function to reduce variance in the latent space more effectively compared to GELU and Swish, while preserving robust gradient flow. Extensive experiments across diverse tasks, including Image Classification, Language Modeling, Semantic Segmentation, Object Detection, Instance Segmentation, and Diffusion, highlight GoLU's superior performance relative to state-of-the-art activation functions, establishing GoLU as a robust alternative to existing activation functions.
https://arxiv.org/abs/2502.03654
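The GoLU definition above is simple enough to state directly; a pure-Python scalar sketch, for illustration:

```python
import math

def gompertz(x: float) -> float:
    """Gompertz gate: e^(-e^(-x)), an asymmetric sigmoid rising from 0 to 1."""
    return math.exp(-math.exp(-x))

def golu(x: float) -> float:
    """Gompertz Linear Unit: GoLU(x) = x * Gompertz(x)."""
    return x * gompertz(x)
```

Like GELU and Swish, GoLU behaves as the identity for large positive inputs and vanishes for large negative ones, but the Gompertz gate is asymmetric about zero (Gompertz(0) = e^{-1} ≈ 0.368 rather than 0.5), which is the property the abstract credits for the reduced latent-space variance.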
Decoding visual images from brain activity has significant potential for advancing brain-computer interaction and enhancing the understanding of human perception. Recent approaches align the representation spaces of images and brain activity to enable visual decoding. In this paper, we introduce the use of human-aligned image encoders to map brain signals to images. We hypothesize that these models more effectively capture perceptual attributes associated with the rapid visual stimuli presentations commonly used in visual brain data recording experiments. Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.
https://arxiv.org/abs/2502.03081