Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
https://arxiv.org/abs/2504.09507
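The voting-based fusion step in the FVOS abstract above can be illustrated with a pixel-wise majority vote. This is a minimal sketch, not the authors' implementation: it assumes the multi-scale predictions have already been resized to a common resolution and reduced to binary masks for a single object, and the name `vote_fuse` is hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object, 0 = background


def vote_fuse(masks: List[Mask]) -> Mask:
    """Pixel-wise strict-majority vote over binary masks of identical shape."""
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    needed = len(masks) // 2 + 1  # votes required for a strict majority
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            votes = sum(m[y][x] for m in masks)
            row.append(1 if votes >= needed else 0)
        fused.append(row)
    return fused


# Three hypothetical predictions of the same frame at different scales.
scale_a = [[1, 0], [1, 1]]
scale_b = [[1, 1], [0, 1]]
scale_c = [[0, 0], [1, 1]]
fused = vote_fuse([scale_a, scale_b, scale_c])
```

With an odd number of input masks a strict majority always exists; with an even number, ties resolve to background under this rule.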
Mainstream visual object tracking frameworks predominantly rely on template matching paradigms. Their performance heavily depends on the quality of template features, which becomes increasingly challenging to maintain in complex scenarios involving target deformation, occlusion, and background clutter. While existing spatiotemporal memory-based trackers emphasize memory capacity expansion, they lack effective mechanisms for dynamic feature selection and adaptive fusion. To address this gap, we propose a Dynamic Attention Mechanism in Spatiotemporal Memory Network (DASTM) with two key innovations: 1) A differentiable dynamic attention mechanism that adaptively adjusts channel-spatial attention weights by analyzing spatiotemporal correlations between the templates and memory features; 2) A lightweight gating network that autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios. Extensive evaluations on OTB-2015, VOT 2018, LaSOT, and GOT-10K benchmarks demonstrate our DASTM's superiority, achieving state-of-the-art performance in success rate, robustness, and real-time efficiency, thereby offering a novel solution for real-time tracking in complex environments.
https://arxiv.org/abs/2503.16768
Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at this https URL.
https://arxiv.org/abs/2503.14979
The artificial lateral line (ALL) is a bioinspired flow sensing system for underwater robots, comprising distributed flow sensors. The ALL has been successfully applied to detect the undulatory flow fields generated by body undulation and tail-flapping of bioinspired robotic fish. However, its feasibility and performance in sensing the undulatory flow fields produced by human leg kicks during swimming have not been systematically tested and studied. This paper presents a novel sensing framework to investigate the undulatory flow field generated by a swimmer's leg kicks, leveraging bioinspired ALL sensing. To evaluate the feasibility of using the ALL system for this purpose, this paper designs an experimental platform integrating an ALL system and a lab-fabricated human leg model. To enhance the accuracy of flow sensing, this paper proposes a feature extraction method that dynamically fuses time-domain and time-frequency characteristics. Specifically, time-domain features are extracted using one-dimensional convolutional neural networks and bidirectional long short-term memory networks (1DCNN-BiLSTM), while time-frequency features are extracted using the short-time Fourier transform and two-dimensional convolutional neural networks (STFT-2DCNN). These features are then dynamically fused based on attention mechanisms to achieve accurate sensing of the undulatory flow field. Furthermore, extensive experiments are conducted to test various scenarios inspired by human swimming, such as leg kick pattern recognition and kicking leg localization, achieving satisfactory results.
https://arxiv.org/abs/2503.07312
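The attention-based fusion of time-domain and time-frequency features described in the ALL abstract above can be sketched as a softmax-weighted sum of the two branch outputs. The abstract does not give the scoring function, so the dot-product score, the `score_weights` vector, and the function names below are assumptions made for illustration only.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    time_feat: List[float],
    tf_feat: List[float],
    score_weights: List[float],
) -> Tuple[List[float], List[float]]:
    """Fuse two equal-length feature vectors with per-branch attention weights.

    Each branch gets a scalar score (dot product with score_weights, which
    would be learned in practice); softmax turns the scores into weights.
    """
    scores = [
        sum(w * f for w, f in zip(score_weights, feat))
        for feat in (time_feat, tf_feat)
    ]
    alpha = softmax(scores)
    fused = [alpha[0] * a + alpha[1] * b for a, b in zip(time_feat, tf_feat)]
    return fused, alpha


# With zero scoring weights both branches are weighted equally.
fused, alpha = attention_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

Because the weights depend on the input features, the balance between the two branches can change from sample to sample, which is the sense in which such a fusion is "dynamic".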
Chord recognition serves as a critical task in music information retrieval due to the abstract and descriptive nature of chords in music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. This complexity also arises from the inherent long-tail distribution of chords, where rare chord types are underrepresented in most datasets, leading to insufficient training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages conformer blocks that integrate convolutional neural networks with transformers, thus enabling the model to capture both local patterns and global dependencies effectively. By addressing challenges such as class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling class imbalance, providing robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.
https://arxiv.org/abs/2502.11840
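The ChordFormer abstract above mentions a reweighted loss to counter the long-tail distribution of chord types. The abstract does not state the weighting formula, so the following sketch uses a common inverse-frequency heuristic as a stand-in; `class_weights` and `weighted_nll` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List


def class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


def weighted_nll(prob_of_true_class: float, label: str,
                 weights: Dict[str, float]) -> float:
    """Negative log-likelihood of one frame, scaled by its class weight."""
    return -weights[label] * math.log(prob_of_true_class)


# Toy label set: a common chord type and a rare one.
labels = ["maj"] * 8 + ["dim7"] * 2
weights = class_weights(labels)
rare_loss = weighted_nll(0.5, "dim7", weights)
common_loss = weighted_nll(0.5, "maj", weights)
```

At the same predicted probability, an error on the rare class contributes four times the loss of an error on the common class in this toy setting, which pushes training toward more balanced class-wise accuracy.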
Vehicular communication systems face significant challenges due to high mobility and rapidly changing environments, which affect the channel over which the signals travel. To address these challenges, neural network (NN)-based channel estimation methods have been suggested. These methods are primarily trained on high signal-to-noise ratio (SNR) data, on the assumption that training an NN in less noisy conditions can result in good generalisation. This study examines the effectiveness of training NN-based channel estimators on mixed SNR datasets compared to training solely on high SNR datasets, as seen in several related works. Estimators evaluated in this work include an architecture that uses convolutional layers and self-attention mechanisms; a method that employs temporal convolutional networks and data pilot-aided estimation; two methods that combine classical methods with multilayer perceptrons; and the current state-of-the-art model that combines Long Short-Term Memory networks with data pilot-aided and temporal averaging methods as post-processing. Our results indicate that using only high SNR data for training is not always optimal, and the SNR range in the training dataset should be treated as a hyperparameter that can be adjusted for better performance. This is illustrated by the better performance of some models in low SNR conditions when trained on the mixed SNR dataset, as opposed to when trained exclusively on high SNR data.
https://arxiv.org/abs/2502.06824
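Treating the training SNR range as a hyperparameter, as the abstract above suggests, amounts to drawing a per-sample SNR from a configurable interval before adding noise. The sketch below is a simplification under stated assumptions: it uses real-valued Gaussian noise on a toy signal (a real channel model would be complex-valued), and `noise_std` and `add_noise_at_random_snr` are hypothetical helpers.

```python
import math
import random
from typing import List, Tuple


def noise_std(signal_power: float, snr_db: float) -> float:
    """Noise standard deviation giving signal_power / noise_power = SNR."""
    noise_power = signal_power / (10 ** (snr_db / 10))
    return math.sqrt(noise_power)


def add_noise_at_random_snr(
    clean: List[float],
    snr_range_db: Tuple[float, float],
    rng: random.Random,
) -> Tuple[List[float], float]:
    """Draw an SNR uniformly from snr_range_db and corrupt the signal."""
    low, high = snr_range_db
    power = sum(x * x for x in clean) / len(clean)
    snr_db = rng.uniform(low, high)
    sigma = noise_std(power, snr_db)
    noisy = [x + rng.gauss(0.0, sigma) for x in clean]
    return noisy, snr_db


rng = random.Random(0)
clean = [1.0, -1.0, 1.0, -1.0]
# (0, 40) gives a mixed-SNR sample; (40, 40) would reproduce high-SNR-only training.
noisy, snr_db = add_noise_at_random_snr(clean, (0.0, 40.0), rng)
```

Sweeping `snr_range_db` over a few candidate intervals and comparing validation error at low SNR is one way to tune the range like any other hyperparameter.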
Accurate detection of traffic anomalies is crucial for effective urban traffic management and congestion mitigation. We use the Spatiotemporal Generative Adversarial Network (STGAN) framework combining Graph Neural Networks and Long Short-Term Memory networks to capture complex spatial and temporal dependencies in traffic data. We apply STGAN to real-time, minute-by-minute observations from 42 traffic cameras across Gothenburg, Sweden, collected over several months in 2020. The images are processed to compute a flow metric representing vehicle density, which serves as input for the model. Training is conducted on data from April to November 2020, and validation is performed on a separate dataset from November 14 to 23, 2020. Our results demonstrate that the model effectively detects traffic anomalies with high precision and low false positive rates. The detected anomalies include camera signal interruptions, visual artifacts, and extreme weather conditions affecting traffic flow.
https://arxiv.org/abs/2502.01391
A crucial step to efficiently integrate Whole Slide Images (WSIs) in computational pathology is assigning a single high-quality feature vector, i.e., one embedding, to each WSI. With the existence of many pre-trained deep neural networks and the emergence of foundation models, extracting embeddings for sub-images (i.e., tiles or patches) is straightforward. However, for WSIs, given their high resolution and gigapixel nature, inputting them into existing GPUs as a single image is not feasible. As a result, WSIs are usually split into many patches. By feeding each patch to a pre-trained model, each WSI can then be represented by a set of patches and, in turn, a set of embeddings. In such a setup, WSI representation learning reduces to set representation learning, where for each WSI we have access to a set of patch embeddings. To obtain a single embedding from a set of patch embeddings for each WSI, multiple set-based learning schemes have been proposed in the literature. In this paper, we evaluate the WSI search performance of multiple recently developed aggregation techniques (mainly set representation learning techniques) including simple average or max pooling operations, Deep Sets, Memory networks, Focal attention, Gaussian Mixture Model (GMM) Fisher Vector, and deep sparse and binary Fisher Vector on four different primary sites from TCGA: bladder, breast, kidney, and colon. Further, we benchmark the search performance of these methods against the median of minimum distances of patch embeddings, a non-aggregating approach used for WSI retrieval.
https://arxiv.org/abs/2501.17822
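The simplest aggregators named in the WSI abstract above (average and max pooling) and the non-aggregating baseline (median of minimum distances) fit in a few lines. This is an illustrative sketch with toy 2-D embeddings: the one-directional form of the median-of-minimum-distances score and the Euclidean metric are assumptions, since the abstract does not define them.

```python
import math
from statistics import median
from typing import List

Embedding = List[float]


def mean_pool(patches: List[Embedding]) -> Embedding:
    """One WSI embedding as the element-wise mean of its patch embeddings."""
    return [sum(col) / len(patches) for col in zip(*patches)]


def max_pool(patches: List[Embedding]) -> Embedding:
    """One WSI embedding as the element-wise maximum of its patch embeddings."""
    return [max(col) for col in zip(*patches)]


def median_of_min_distances(query: List[Embedding],
                            candidate: List[Embedding]) -> float:
    """Set-to-set distance without aggregation.

    For each query patch, take the distance to its nearest candidate patch,
    then return the median of those minima.
    """
    minima = [min(math.dist(q, c) for c in candidate) for q in query]
    return median(minima)


patches = [[0.0, 2.0], [2.0, 0.0]]
pooled_mean = mean_pool(patches)
pooled_max = max_pool(patches)
score = median_of_min_distances([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0]])
```

Pooling yields one vector per slide, so search reduces to ordinary nearest-neighbour lookup, whereas the median-of-minimum-distances score compares every query patch against every candidate patch.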
Emotion recognition is a critical task in human-computer interaction, enabling more intuitive and responsive systems. This study presents a multimodal emotion recognition system that combines low-level information from audio and text, leveraging both Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory Networks (BiLSTMs). The proposed system consists of two parallel networks: an Audio Block and a Text Block. Mel Frequency Cepstral Coefficients (MFCCs) are extracted and processed by a BiLSTM network and a 2D convolutional network to capture low-level intrinsic and extrinsic features from speech. Simultaneously, a combined BiLSTM-CNN network extracts the low-level sequential nature of text from word embeddings corresponding to the available audio. This low-level information from speech and text is then concatenated and processed by several fully connected layers to classify the speech emotion. Experimental results demonstrate that the proposed EmoTech accurately recognizes emotions from combined audio and text inputs, achieving an overall accuracy of 84%. This solution outperforms previously proposed approaches for the same dataset and modalities.
https://arxiv.org/abs/2501.12674
Long-range sequence modeling is a crucial aspect of natural language processing and time series analysis. However, traditional models like Recurrent Neural Networks (RNNs) and Transformers suffer from computational and memory inefficiencies, especially when dealing with long sequences. This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that leverages a hierarchical logarithmic tree structure to efficiently store and retrieve past information. LMNs dynamically summarize historical context, significantly reducing the memory footprint and computational complexity of attention mechanisms from O(n^2) to O(log(n)). The model employs a single-vector, targeted attention mechanism to access stored information, and the memory block construction worker (summarizer) layer operates in two modes: a parallel execution mode during training for efficient processing of hierarchical tree structures and a sequential execution mode during inference, which acts as a memory management system. It also implicitly encodes positional information, eliminating the need for explicit positional encodings. These features make LMNs a robust and scalable solution for processing long-range sequences in resource-constrained environments, offering practical improvements in efficiency and scalability. The code is publicly available under the MIT License on GitHub: this https URL.
https://arxiv.org/abs/2501.07905
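To show how a hierarchical tree can keep only a logarithmic number of summaries, as the LMN abstract above describes, the sketch below merges equal-level summaries in the manner of a binary counter. It is not the authors' architecture: the `LogMemory` class is hypothetical, and the element-wise mean stands in for the learned summarizer layer.

```python
from typing import List, Tuple

Vector = List[float]


def summarize(older: Vector, newer: Vector) -> Vector:
    """Placeholder summarizer: element-wise mean of two summaries."""
    return [(a + b) / 2 for a, b in zip(older, newer)]


class LogMemory:
    """Keeps at most floor(log2(n)) + 1 summaries after n appended items."""

    def __init__(self) -> None:
        # (level, summary) pairs, oldest first; a level-k slot covers 2**k items.
        self.slots: List[Tuple[int, Vector]] = []

    def append(self, item: Vector) -> None:
        level, summary = 0, item
        # Merge equal-level neighbours, like carrying in binary addition.
        while self.slots and self.slots[-1][0] == level:
            _, older = self.slots.pop()
            summary = summarize(older, summary)
            level += 1
        self.slots.append((level, summary))


memory = LogMemory()
for step in range(1000):
    memory.append([1.0])
slot_count = len(memory.slots)
```

In this scheme the number of stored slots after n items equals the number of 1-bits in n, so a query that attends over the slots touches O(log n) vectors instead of n.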
Parkinson's Disease (PD) is a degenerative neurological disorder that impairs motor and non-motor functions, significantly reducing quality of life and increasing mortality risk. Early and accurate detection of PD progression is vital for effective management and improved patient outcomes. Current diagnostic methods, however, are often costly and time-consuming, and they require specialized equipment and expertise. This work proposes an innovative approach to predicting PD progression using regression methods, Long Short-Term Memory (LSTM) networks, and Kolmogorov-Arnold Networks (KAN). KAN, utilizing spline-parametrized univariate functions, allows for dynamic learning of activation patterns, unlike traditional linear models. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive tool for evaluating PD symptoms and is commonly used to measure disease progression. Additionally, protein or peptide abnormalities are linked to PD onset and progression. Identifying these associations can aid in predicting disease progression and understanding molecular changes. Comparing multiple models, including LSTM and KAN, this study aims to identify the method that delivers the highest metrics. The analysis reveals that KAN, with its dynamic learning capabilities, outperforms other approaches in predicting PD progression. This research highlights the potential of AI and machine learning in healthcare, paving the way for advanced computational models to enhance clinical predictions and improve patient care and treatment strategies in PD management.
https://arxiv.org/abs/2412.20744
Accurate tool wear prediction is essential for maintaining productivity and minimizing costs in machining. However, the complex nature of the tool wear process poses significant challenges to achieving reliable predictions. This study explores data-driven methods, in particular deep learning, for tool wear prediction. Traditional data-driven approaches often focus on a single process, relying on multi-sensor setups and extensive data generation, which limits generalization to new settings. Moreover, multi-sensor integration is often impractical in industrial environments. To address these limitations, this research investigates the transferability of predictive models using minimal training data, validated across two processes. Furthermore, it uses a simple setup with a single acceleration sensor to establish a low-cost data generation approach that facilitates the generalization of models to other processes via transfer learning. The study evaluates several machine learning models, including convolutional neural networks (CNN), long short-term memory networks (LSTM), support vector machines (SVM) and decision trees, trained on different input formats such as feature vectors and short-time Fourier transform (STFT). The performance of the models is evaluated on different amounts of training data, including scenarios with significantly reduced datasets, providing insight into their effectiveness under constrained data conditions. The results demonstrate the potential of specific models and configurations for effective tool wear prediction, contributing to the development of more adaptable and efficient predictive maintenance strategies in machining. Notably, the ConvNeXt model performs exceptionally well, achieving 99.1% accuracy in identifying tool wear using data from only four milling tools operated until they are worn.
https://arxiv.org/abs/2412.19950
This project aims to develop a robust video surveillance system, which can segment videos into smaller clips based on the detection of activities. It uses CCTV footage, for example, to record only major events, such as the appearance of a person or a thief, so that storage is optimized and digital searches are easier. It utilizes the latest techniques in object detection and tracking, including Convolutional Neural Networks (CNNs) like YOLO, SSD, and Faster R-CNN, as well as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), to achieve high accuracy in detection and capture temporal dependencies. The approach incorporates adaptive background modeling through Gaussian Mixture Models (GMM) and optical flow methods like Lucas-Kanade to detect motion. Multi-scale and contextual analysis are used to improve detection across different object sizes and environments. A hybrid motion segmentation strategy combines statistical and deep learning models to manage complex movements, while optimizations for real-time processing ensure efficient computation. Tracking methods, such as Kalman Filters and Siamese networks, are employed to maintain smooth tracking even in cases of occlusion. Results demonstrate high precision and recall in detecting and tracking objects, with significant improvements in processing times and accuracy due to real-time optimizations and illumination-invariant features. The impact of this research lies in its potential to transform video surveillance, reducing storage requirements and enhancing security through reliable and efficient object detection and tracking.
https://arxiv.org/abs/2412.05331
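Once a detector has flagged which frames contain activity, cutting the footage into clips, as the surveillance abstract above describes, is a run-grouping problem. The sketch below assumes per-frame boolean flags are already available from some detector; `activity_clips`, `max_gap`, and `min_len` are hypothetical names and parameters, not part of the described system.

```python
from typing import List, Sequence, Tuple


def activity_clips(
    active: Sequence[bool],
    max_gap: int = 0,
    min_len: int = 1,
) -> List[Tuple[int, int]]:
    """Group active frames into (start, end) clips, both indices inclusive.

    Runs separated by at most max_gap inactive frames are merged, and
    clips shorter than min_len frames are discarded.
    """
    clips = []
    start = last_active = None
    for index, flag in enumerate(active):
        if not flag:
            continue
        if start is None:
            start = index
        elif index - last_active - 1 > max_gap:
            clips.append((start, last_active))
            start = index
        last_active = index
    if start is not None:
        clips.append((start, last_active))
    return [(s, e) for s, e in clips if e - s + 1 >= min_len]


flags = [False, True, True, False, True, False, False, False, True, True, True]
merged = activity_clips(flags, max_gap=1)
strict = activity_clips(flags, max_gap=0, min_len=2)
```

Only the frames inside the returned intervals need to be stored, which is where the storage saving comes from; `max_gap` trades clip fragmentation against a few extra idle frames.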
Wastewater treatment plants face unique challenges for process control due to their complex dynamics, slow time constants, and stochastic delays in observations and actions. These characteristics make conventional control methods, such as Proportional-Integral-Derivative controllers, suboptimal for achieving efficient phosphorus removal, a critical component of wastewater treatment to ensure environmental sustainability. This study addresses these challenges using a novel deep reinforcement learning approach based on the Soft Actor-Critic algorithm, integrated with a custom simulator designed to model the delayed feedback inherent in wastewater treatment plants. The simulator incorporates Long Short-Term Memory networks for accurate multi-step state predictions, enabling realistic training scenarios. To account for the stochastic nature of delays, agents were trained under three delay scenarios: no delay, constant delay, and random delay. The results demonstrate that incorporating random delays into the reinforcement learning framework significantly improves phosphorus removal efficiency while reducing operational costs. Specifically, the delay-aware agent achieved a 36% reduction in phosphorus emissions, 55% higher reward, 77% lower target deviation from the regulatory limit, and 9% lower total costs than traditional control methods in the simulated environment. These findings underscore the potential of reinforcement learning to overcome the limitations of conventional control strategies in wastewater treatment, providing an adaptive and cost-effective solution for phosphorus removal.
https://arxiv.org/abs/2411.18305
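The three training scenarios in the wastewater abstract above (no delay, constant delay, random delay) can all be expressed with one observation buffer that returns a measurement from some number of steps in the past. This sketch covers observation delay only and is not the paper's simulator; the `DelayedSensor` class and its parameters are hypothetical.

```python
import random
from collections import deque


class DelayedSensor:
    """Returns the measurement taken `delay` steps ago.

    The delay is drawn uniformly from [min_delay, max_delay] at every step:
    (0, 0) gives no delay, (k, k) a constant delay, (0, k) a random delay.
    """

    def __init__(self, min_delay: int, max_delay: int, rng: random.Random):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rng = rng
        self.history = deque(maxlen=max_delay + 1)

    def observe(self, measurement: float) -> float:
        self.history.append(measurement)
        delay = self.rng.randint(self.min_delay, self.max_delay)
        # Early in an episode there may not be enough history yet.
        delay = min(delay, len(self.history) - 1)
        return self.history[-1 - delay]


readings = [10.0, 11.0, 12.0, 13.0]
constant = DelayedSensor(2, 2, random.Random(0))
constant_seen = [constant.observe(r) for r in readings]
instant = DelayedSensor(0, 0, random.Random(0))
instant_seen = [instant.observe(r) for r in readings]
```

An agent trained against the random-delay variant never knows how stale its observation is, which is the condition the abstract reports as giving the best results.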
Phishing is one of the most effective ways in which cybercriminals obtain sensitive details such as credentials for online banking, digital wallets, state secrets, and many more from potential victims. They do this by spamming users with malicious URLs with the sole purpose of tricking them into divulging sensitive information, which is later used for various cybercrimes. In this research, we conducted a comprehensive review of current state-of-the-art machine learning and deep learning phishing detection techniques to expose their vulnerabilities and identify future research directions. For better analysis and observation, we split machine learning techniques into Bayesian, non-Bayesian, and deep learning. We reviewed the most recent advances in Bayesian and non-Bayesian classifiers before examining their corresponding weaknesses to indicate future research directions. While examining weaknesses in both Bayesian and non-Bayesian classifiers, we also compared the performance of each with a deep learning classifier. For a proper review of deep learning-based classifiers, we looked at Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory networks (LSTMs). We conducted an empirical analysis to evaluate the performance of each classifier along with many of the proposed state-of-the-art anti-phishing techniques to identify future research directions. We also made a series of proposals on how the performance of the under-performing algorithms can be improved, in addition to proposing a two-stage prediction model.
https://arxiv.org/abs/2411.16751
Firing rate models are dynamical systems widely used in applied and theoretical neuroscience to describe local cortical dynamics in neuronal populations. By providing a macroscopic perspective of neuronal activity, these models are essential for investigating oscillatory phenomena, chaotic behavior, and associative memory processes. Despite their widespread use, the application of firing rate models to associative memory networks has received limited mathematical exploration, and most existing studies are focused on specific models. Conversely, well-established associative memory designs, such as Hopfield networks, lack key biologically-relevant features intrinsic to firing rate models, including positivity and interpretable synaptic matrices that reflect excitatory and inhibitory interactions. To address this gap, we propose a general framework that ensures the emergence of re-scaled memory patterns as stable equilibria in the firing rate dynamics. Furthermore, we analyze the conditions under which the memories are locally and globally asymptotically stable, providing insights into constructing biologically-plausible and robust systems for associative memory retrieval.
https://arxiv.org/abs/2411.07388
Current cardiac cine magnetic resonance image (cMR) studies focus on the end diastole (ED) and end systole (ES) phases, while ignoring the abundant temporal information in the whole image sequence. This is because whole sequence segmentation is currently a tedious process and inaccurate. Conventional whole sequence segmentation approaches first estimate the motion field between frames, which is then used to propagate the mask along the temporal axis. However, the mask propagation results could be prone to error, especially for the basal and apex slices, where through-plane motion leads to significant morphology and structural change during the cardiac cycle. Inspired by recent advances in video object segmentation (VOS), based on spatio-temporal memory (STM) networks, we propose a continuous STM (CSTM) network for semi-supervised whole heart and whole sequence cMR segmentation. Our CSTM network takes full advantage of the spatial, scale, temporal and through-plane continuity prior of the underlying heart anatomy structures, to achieve accurate and fast 4D segmentation. Results of extensive experiments across multiple cMR datasets show that our method can improve the 4D cMR segmentation performance, especially for the hard-to-segment regions.
https://arxiv.org/abs/2410.23191
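The STM idea mentioned in the abstract above can be illustrated with a generic memory read: each location in the current frame compares its key against keys stored from earlier frames and retrieves a similarity-weighted average of the stored values. This is a simplified sketch of that generic attention-style read, not the CSTM architecture. The function names, the toy two-dimensional keys, and the one-dimensional "foreground score" values are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query_keys, memory_keys, memory_values):
    """For each query location, return a similarity-weighted average of memory values."""
    dim = len(memory_values[0])
    out = []
    for q in query_keys:
        # Dot-product similarity between this query key and every memory key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, memory_values))
                    for d in range(dim)])
    return out


# Toy memory: one stored "foreground" location and one "background" location.
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[1.0], [0.0]]  # hypothetical foreground score per memory location

# Two query locations in the current frame, each resembling one memory key.
query_keys = [[10.0, 0.0], [0.0, 10.0]]
read = memory_read(query_keys, memory_keys, memory_values)
```

In this toy case the first query retrieves a value close to 1 and the second a value close to 0, because each query key matches only one memory key strongly.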
Personality analysis from online short videos has gained prominence due to its applications in personalized recommendation systems, sentiment analysis, and human-computer interaction. Traditional assessment methods, such as questionnaires based on the Big Five Personality Framework, are limited by self-report biases and are impractical for large-scale or real-time analysis. Leveraging the rich, multi-modal data present in short videos offers a promising alternative for more accurate personality inference. However, integrating these diverse and asynchronous modalities poses significant challenges, particularly in aligning time-varying data and ensuring models generalize well to new domains with limited labeled data. In this paper, we propose a novel multi-modal personality analysis framework that addresses these challenges by synchronizing and integrating features from multiple modalities and enhancing model generalization through domain adaptation. We introduce a timestamp-based modality alignment mechanism that synchronizes data based on spoken word timestamps, ensuring accurate correspondence across modalities and facilitating effective feature integration. To capture temporal dependencies and inter-modal interactions, we employ Bidirectional Long Short-Term Memory networks and self-attention mechanisms, allowing the model to focus on the most informative features for personality prediction. Furthermore, we develop a gradient-based domain adaptation method that transfers knowledge from multiple source domains to improve performance in target domains with scarce labeled data. Extensive experiments on real-world datasets demonstrate that our framework significantly outperforms existing methods in personality prediction tasks, highlighting its effectiveness in capturing complex behavioral cues and robustness in adapting to new domains.
https://arxiv.org/abs/2411.00813
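One plausible reading of the timestamp-based alignment described above is to pool the features of another modality over each spoken word's time span. The sketch below assumes words arrive as `(token, start_s, end_s)` tuples and video frames as `(time_s, feature_vector)` pairs. These formats and the function name `align_frames_to_words` are hypothetical and are not taken from the paper.

```python
def align_frames_to_words(words, frames):
    """Average the frame features that fall inside each word's [start, end) interval.

    words:  list of (token, start_s, end_s)
    frames: list of (time_s, feature_vector)
    Returns a list of (token, mean_feature_vector), with None when no frame overlaps.
    """
    aligned = []
    for token, start, end in words:
        inside = [feat for t, feat in frames if start <= t < end]
        if inside:
            dim = len(inside[0])
            mean = [sum(f[d] for f in inside) / len(inside) for d in range(dim)]
        else:
            mean = None  # no visual evidence for this word
        aligned.append((token, mean))
    return aligned


# Hypothetical transcript timestamps and per-frame visual features.
words = [("hello", 0.0, 0.5), ("world", 0.5, 1.0)]
frames = [(0.1, [1.0, 0.0]), (0.3, [3.0, 0.0]), (0.7, [0.0, 2.0])]

aligned = align_frames_to_words(words, frames)
```

After this step every word has one feature vector per modality, which is the kind of synchronized sequence a BiLSTM or self-attention layer could consume.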
Reproducibility in scientific research, particularly within the realm of natural language processing (NLP), is essential for validating and verifying the robustness of experimental findings. This paper delves into the reproduction and evaluation of dialogue summarization models, focusing specifically on the discrepancies observed between original studies and our reproduction efforts. Dialogue summarization is a critical aspect of NLP, aiming to condense conversational content into concise and informative summaries, thus aiding in efficient information retrieval and decision-making processes. Our research involved a thorough examination of several dialogue summarization models using the AMI (Augmented Multi-party Interaction) dataset. The models assessed include Hierarchical Memory Networks (HMNet) and various versions of Pointer-Generator Networks (PGN), namely PGN(DKE), PGN(DRD), PGN(DTS), and PGN(DALL). The primary objective was to evaluate the informativeness and quality of the summaries generated by these models through human assessment, a method that introduces subjectivity and variability in the evaluation process. The analysis began with Dataset 1, where the sample standard deviation of 0.656 indicated a moderate dispersion of data points around the mean.
https://arxiv.org/abs/2410.15962
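The sample standard deviation quoted in the abstract above is the usual n - 1 estimator of spread around the mean. The sketch below computes it for a made-up list of ratings and cross-checks the result against Python's `statistics.stdev`. The ratings are hypothetical and are not data from the paper.

```python
import math
import statistics


def sample_std(scores):
    """Sample standard deviation, dividing by n - 1 (Bessel's correction)."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))


# Hypothetical human ratings for one set of summaries.
ratings = [3, 4, 4, 5, 2]

spread = sample_std(ratings)
reference = statistics.stdev(ratings)  # stdlib implementation of the same estimator
```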
We developed Long Short-Term Memory (LSTM) models to predict the formation of active regions (ARs) on the solar surface. Using Doppler shift velocity, continuum intensity, and magnetic field observations from the Solar Dynamics Observatory (SDO) Helioseismic and Magnetic Imager (HMI), we created time-series datasets of acoustic power and magnetic flux, which are used to train LSTM models to predict continuum intensity 12 hours in advance. These novel machine learning (ML) models are able to capture variations of the acoustic power density associated with upcoming magnetic flux emergence and continuum intensity decrease. The models were tested on data for 5 ARs that were unseen by the models during training. Model 8, the best-performing model trained, successfully predicted emergence for all test active regions in an experimental setting and for three of them in an operational setting. The model predicted the emergence of AR11726, AR13165, and AR13179 10, 29, and 5 hours in advance, respectively, and variations of this model achieved average RMSE values of 0.11 for both active and quiet areas on the solar disc. This work lays the foundations for ML-aided prediction of solar ARs.
https://arxiv.org/abs/2409.17421
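The forecasting setup in the abstract above (predict a quantity a fixed number of steps ahead from a window of past observations, then score with RMSE) can be sketched as follows. The hourly cadence, the window length, the single input series, and the helper names `make_windows` and `rmse` are assumptions for illustration. The paper's actual preprocessing and model inputs are not described in the abstract.

```python
import math


def make_windows(inputs, targets, input_len, horizon):
    """Pair each input window with the target `horizon` steps after the window's last sample."""
    pairs = []
    for start in range(len(inputs) - input_len - horizon + 1):
        window = inputs[start:start + input_len]
        target = targets[start + input_len - 1 + horizon]
        pairs.append((window, target))
    return pairs


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Hypothetical hourly series: an input feature (e.g. acoustic power) and a
# target (e.g. continuum intensity), here just index values for readability.
inputs = [float(i) for i in range(40)]
targets = [float(i) for i in range(40)]

# 24-hour input windows, predicting 12 hours past the end of each window.
pairs = make_windows(inputs, targets, input_len=24, horizon=12)

example_error = rmse([1.0, 2.0], [1.0, 4.0])
```

Each `(window, target)` pair is one training example for a sequence model. With 40 hourly samples, a 24-hour window, and a 12-hour horizon, only five such pairs exist.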