RNN

Application of Long-Short Term Memory and Convolutional Neural Networks for Real-Time Bridge Scour Forecast

2024-04-25 12:04:36

Tahrima Hashem, Negin Yousefpour

arXiv_CV

arXiv_CV RNN CNN Deep_Learning Prediction
Abstract

Scour around bridge piers is a critical challenge for infrastructure around the world. In the absence of analytical models and due to the complexity of the scour process, it is difficult for current empirical methods to achieve accurate predictions. In this paper, we exploit the power of deep learning algorithms to forecast the scour depth variations around bridge piers based on historical sensor monitoring data, including riverbed elevation, flow elevation, and flow velocity. We investigated the performance of Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models for real-time scour forecasting using data collected from bridges in Alaska and Oregon from 2006 to 2021. The LSTM models achieved mean absolute error (MAE) ranging from 0.1 m to 0.5 m for predicting bed level variations a week in advance, showing reasonable performance. The Fully Convolutional Network (FCN) variant of CNN outperformed other CNN configurations, showing comparable performance to LSTMs at significantly lower computational cost. We explored various innovative random-search heuristics for hyperparameter tuning and model optimisation, which reduced computational cost compared to the grid-search method. The impact of different combinations of sensor features on scour prediction showed the significance of the historical time series of scour for predicting upcoming events. Overall, this study provides a greater understanding of the potential of Deep Learning (DL) for real-time scour forecasting and early warning in bridges with diverse scour and flow characteristics, including riverine and tidal/coastal bridges.
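
The abstract does not describe its random-search heuristics in detail, so the following is only a minimal sketch of plain random search, the baseline idea behind them. It samples a fixed budget of configurations instead of enumerating the full grid. The names `random_search`, `space`, and `evaluate` are hypothetical, and `evaluate` stands in for training a model and returning a validation error such as MAE.

```python
import random

def random_search(space, evaluate, n_trials, seed=0):
    """Sample n_trials random configurations and keep the one with the lowest score.

    space:    dict mapping a hyperparameter name to a list of candidate values
    evaluate: callable taking a configuration dict and returning an error to minimise
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently of the others.
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With a grid of 3 x 3 candidate values, a full grid search would need 9 evaluations, whereas `random_search(space, evaluate, n_trials=5)` performs exactly 5; whether that budget finds a good configuration depends on the problem.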

URL

https://arxiv.org/abs/2404.16549

PDF

https://arxiv.org/pdf/2404.16549.pdf
Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

2024-04-24 18:10:31

Badri Narayana Patro, Vijay Srinivas Agneeswaran

arXiv_AI

arXiv_AI Speech_Recognition RNN Recognition Memory_Networks Survey Attention Recommendation Transformer Pose Medical Speech
Abstract

Sequence modeling is a crucial area across various domains, including Natural Language Processing (NLP), speech recognition, time series forecasting, music generation, and bioinformatics. Recurrent Neural Networks (RNNs) and Long Short Term Memory Networks (LSTMs) have historically dominated sequence modeling tasks like Machine Translation, Named Entity Recognition (NER), etc. However, the advancement of transformers has led to a shift in this paradigm, given their superior performance. Yet, transformers suffer from $O(N^2)$ attention complexity and challenges in handling inductive bias. Several variations have been proposed to address these issues which use spectral networks or convolutions and have performed well on a range of tasks. However, they still have difficulty in dealing with long sequences. State Space Models (SSMs) have emerged as promising alternatives for sequence modeling paradigms in this context, especially with the advent of S4 and its variants, such as S4nd, Hippo, Hyena, Diagonal State Spaces (DSS), Gated State Spaces (GSS), Linear Recurrent Unit (LRU), Liquid-S4, Mamba, etc. In this survey, we categorize the foundational SSMs based on three paradigms, namely Gating architectures, Structural architectures, and Recurrent architectures. This survey also highlights diverse applications of SSMs across domains such as vision, video, audio, speech, language (especially long sequence modeling), medical (including genomics), chemical (like drug design), recommendation systems, and time series analysis, including tabular data. Moreover, we consolidate the performance of SSMs on benchmark datasets like Long Range Arena (LRA), WikiText, Glue, Pile, ImageNet, Kinetics-400, sstv2, as well as video datasets such as Breakfast, COIN, LVU, and various time series datasets. The project page for the Mamba-360 work is available at this https URL.
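
To illustrate why SSMs scale more gently with sequence length than $O(N^2)$ attention, here is a minimal sketch of a discretized linear state space recurrence, $x_k = a \odot x_{k-1} + b\,u_k$, $y_k = c^\top x_k$, assuming a diagonal state matrix and a scalar input and output. The function and parameter names are illustrative and not taken from the survey; models such as S4 or Mamba add structured parameterisations and input-dependent gating on top of this basic form.

```python
def ssm_scan(a, b, c, inputs):
    """Run a diagonal linear state space recurrence over a scalar input sequence.

    a, b, c: lists of floats with one entry per state dimension
             (diagonal state transition, input weights, output weights)
    inputs:  list of scalar inputs u_1, ..., u_N
    Returns the list of scalar outputs y_1, ..., y_N.
    """
    state = [0.0] * len(a)  # x_0 = 0
    outputs = []
    for u in inputs:
        # One constant-cost state update per time step: O(N) over the sequence.
        state = [a_i * x_i + b_i * u for a_i, x_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs
```

For example, `ssm_scan([0.5], [1.0], [1.0], [1.0, 0.0, 0.0])` returns `[1.0, 0.5, 0.25]`: a single impulse decays geometrically through the state, which is how the recurrence carries information across time steps.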

URL

https://arxiv.org/abs/2404.16112

PDF

https://arxiv.org/pdf/2404.16112.pdf
Neural Proto-Language Reconstruction

2024-04-24 06:56:46

Chenxuan Cui, Ying Chen, Qinxin Wang, David R. Mortensen

arXiv_CL

arXiv_CL RNN Prediction Transformer Pose Reconstruction
Abstract

Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNN and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods, including data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neural machine translation model for the reconstruction task. We find that with the additional VAE structure, the Transformer model has a better performance on the WikiHan dataset, and the data augmentation step stabilizes the training.

URL

https://arxiv.org/abs/2404.15690

PDF

https://arxiv.org/pdf/2404.15690.pdf
GRSN: Gated Recurrent Spiking Neurons for POMDPs and MARL

2024-04-24 02:20:50

Lang Qin, Ziming Wang, Runhao Jiang, Rui Yan, Huajin Tang

arXiv_AI

arXiv_AI RNN Reinforcement_Learning Inference Pose Agent
Abstract

Spiking neural networks (SNNs) are widely applied in various fields due to their energy-efficient and fast-inference capabilities. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with similar performance as recurrent neural networks (RNNs) but with about 50% power consumption.

URL

https://arxiv.org/abs/2404.15597

PDF

https://arxiv.org/pdf/2404.15597.pdf
Deep Multi-View Channel-Wise Spatio-Temporal Network for Traffic Flow Prediction

2024-04-23 13:39:04

Hao Miao, Senzhang Wang, Meiyue Zhang, Diansheng Guo, Funing Sun, Fan Yang

arXiv_AI

arXiv_AI RNN CNN Relation Prediction Pose
Abstract

Accurately forecasting traffic flows is critically important to many real applications including public safety and intelligent transportation systems. The challenges of this problem include both the dynamic mobility patterns of the people and the complex spatial-temporal correlations of the urban traffic data. Meanwhile, most existing models ignore the diverse impacts of the various traffic observations (e.g. vehicle speed and road occupancy) on the traffic flow prediction, and different traffic observations can be considered as different channels of input features. We argue that the analysis in multiple-channel traffic observations might help to better address this problem. In this paper, we study the novel problem of multi-channel traffic flow prediction, and propose a deep Multi-View Channel-wise Spatio-Temporal Network (MVC-STNet) model to effectively address it. Specifically, we first construct the localized and globalized spatial graph where the multi-view fusion module is used to effectively extract the local and global spatial dependencies. Then LSTM is used to learn the temporal correlations. To effectively model the different impacts of various traffic observations on traffic flow prediction, a channel-wise graph convolutional network is also designed. Extensive experiments are conducted over the PEMS04 and PEMS08 datasets. The results demonstrate that the proposed MVC-STNet outperforms state-of-the-art methods by a large margin.

URL

https://arxiv.org/abs/2404.15034

PDF

https://arxiv.org/pdf/2404.15034.pdf
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach

2024-04-22 15:54:53

Yao Wan, Guanghua Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Pan Zhou, Hai Jin, Lichao Sun

arXiv_AI

arXiv_AI RNN Deep_Learning Classification Inference Language_Model Transformer Pose Chat
Abstract

Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code in GitHub has been a common practice for training deep-learning-based models for code completion, it may induce some legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks to a more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The acquired posteriors from these shadow models are subsequently employed to train a membership classifier. Subsequently, the membership classifier can be effectively employed to deduce the membership status of a given code sample based on the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer from the membership leakage issue, which can be easily detected by our proposed membership inference approach with accuracies of 0.842 and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ample room for further improvement. Finally, we also try to explain the findings from the perspective of model memorization.

URL

https://arxiv.org/abs/2404.14296

PDF

https://arxiv.org/pdf/2404.14296.pdf
Automated Text Mining of Experimental Methodologies from Biomedical Literature

2024-04-21 21:19:36

Ziqing Guo

arXiv_CL

arXiv_CL RNN Classification Language_Model Bert Pose Medical
Abstract

Biomedical literature is a rapidly expanding field of science and technology. Classification of biomedical texts is an essential part of biomedicine research, especially in the field of biology. This work proposes the fine-tuned DistilBERT, a methodology-specific, pre-trained generative classification language model for mining biomedicine texts. The model has proven its effectiveness in linguistic understanding capabilities; it is 40% smaller than BERT while being 60% faster. The main objective of this project is to improve the model and assess its performance compared to the non-fine-tuned model. We used DistilBERT as a support model and pre-trained it on a corpus of 32,000 abstracts and complete text articles; our results were impressive and surpassed those of traditional literature classification methods using RNN or LSTM. Our aim is to integrate this highly specialised and specific model into different research industries.

URL

https://arxiv.org/abs/2404.13779

PDF

https://arxiv.org/pdf/2404.13779.pdf
Social Force Embedded Mixed Graph Convolutional Network for Multi-class Trajectory Prediction

2024-04-20 13:37:55

Quancheng Du, Xiao Wang, Shouguo Yin, Lingxi Li, Huansheng Ning

arXiv_RO

arXiv_RO RNN CNN Deep_Learning Relation Prediction Pose Autonomous Action Agent
Abstract

Accurate prediction of agent motion trajectories is crucial for autonomous driving, contributing to the reduction of collision risks in human-vehicle interactions and ensuring ample response time for other traffic participants. Current research predominantly focuses on traditional deep learning methods, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These methods leverage relative distances to forecast the motion trajectories of a single class of agents. However, in complex traffic scenarios, the motion patterns of various types of traffic participants exhibit inherent randomness and uncertainty. Relying solely on relative distances may not adequately capture the nuanced interaction patterns between different classes of road users. In this paper, we propose a novel multi-class trajectory prediction method named the social force embedded mixed graph convolutional network (SFEM-GCN). SFEM-GCN comprises three graph topologies: the semantic graph (SG), position graph (PG), and velocity graph (VG). These graphs encode various social force relationships among different classes of agents in complex scenes. Specifically, SG utilizes one-hot encoding of agent-class information to guide the construction of graph adjacency matrices based on semantic information. PG and VG create adjacency matrices to capture motion interaction relationships between different classes of agents. These graph structures are then integrated into a mixed graph, where learning is conducted using a spatiotemporal graph convolutional neural network (ST-GCNN). To further enhance prediction performance, we adopt temporal convolutional networks (TCNs) to generate the predicted trajectory with fewer parameters. Experimental results on publicly available datasets demonstrate that SFEM-GCN surpasses state-of-the-art methods in terms of accuracy and robustness.

URL

https://arxiv.org/abs/2404.13378

PDF

https://arxiv.org/pdf/2404.13378.pdf
Comparative Analysis on Snowmelt-Driven Streamflow Forecasting Using Machine Learning Techniques

2024-04-20 09:02:50

Ukesh Thapa, Bipun Man Pati, Samit Thapa, Dhiraj Pyakurel, Anup Shrestha

arXiv_AI

arXiv_AI RNN CNN Deep_Learning Transformer Pose
Abstract

The rapid advancement of machine learning techniques has led to their widespread application in various domains including water resources. However, snowmelt modeling remains an area that has not been extensively explored. In this study, we propose a state-of-the-art (SOTA) deep learning sequential model, leveraging the Temporal Convolutional Network (TCN), for snowmelt-driven discharge modeling in the Himalayan basin of the Hindu Kush Himalayan Region. To evaluate the performance of our proposed model, we conducted a comparative analysis with other popular models including Support Vector Regression (SVR), Long Short Term Memory (LSTM), and Transformer. Furthermore, nested cross-validation (CV) is used with five outer folds and three inner folds, and hyper-parameter tuning is performed on the inner folds. To evaluate the performance of the model, mean absolute error (MAE), root mean square error (RMSE), R-squared ($R^{2}$), Kling-Gupta Efficiency (KGE), and Nash-Sutcliffe Efficiency (NSE) are computed for each outer fold. The average metrics revealed that TCN outperformed the other models, with an average MAE of 0.011, RMSE of 0.023, $R^{2}$ of 0.991, KGE of 0.992, and NSE of 0.991. The findings of this study demonstrate the effectiveness of the deep learning model as compared to traditional machine learning approaches for snowmelt-driven streamflow forecasting. Moreover, the superior performance of TCN highlights its potential as a promising deep learning model for similar hydrological applications.
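
For readers unfamiliar with the two hydrology-specific scores named above, the following minimal sketch computes NSE and KGE from paired observed and simulated series. It assumes the original 2009 formulation of KGE (correlation, variability ratio, and bias ratio); the abstract does not state which KGE variant the authors used, and the function names are illustrative.

```python
import math
from statistics import mean, pstdev

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus squared error relative to the variance of obs."""
    obs_mean = mean(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    obs_spread = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - squared_error / obs_spread

def kge(obs, sim):
    """Kling-Gupta Efficiency (2009 form); undefined if either series is constant."""
    obs_mean, sim_mean = mean(obs), mean(sim)
    obs_std, sim_std = pstdev(obs), pstdev(sim)
    covariance = sum((o - obs_mean) * (s - sim_mean) for o, s in zip(obs, sim)) / len(obs)
    r = covariance / (obs_std * sim_std)  # linear correlation
    alpha = sim_std / obs_std             # variability ratio
    beta = sim_mean / obs_mean            # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Both scores equal 1 for a perfect simulation, and NSE equals 0 when the simulation is no better than predicting the observed mean, which gives a sense of scale for the reported values near 0.99.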

URL

https://arxiv.org/abs/2404.13327

PDF

https://arxiv.org/pdf/2404.13327.pdf
Explainable Deepfake Video Detection using Convolutional Neural Network and CapsuleNet

2024-04-19 12:21:27

Gazi Hasin Ishrak, Zalish Mahmud, MD. Zami Al Zunaed Farabe, Tahera Khanom Tinni, Tanzim Reza, Mohammad Zavid Parvez

arXiv_CV

arXiv_CV RNN GAN CNN Detection Deep_Learning Adversarial Relation
Abstract

Deepfake technology, derived from deep learning, seamlessly inserts individuals into digital media, irrespective of their actual participation. Its foundation lies in machine learning and Artificial Intelligence (AI). Initially, deepfakes served research, industry, and entertainment. While the concept has existed for decades, recent advancements render deepfakes nearly indistinguishable from reality. Accessibility has soared, empowering even novices to create convincing deepfakes. However, this accessibility raises security concerns. The primary deepfake creation algorithm, GAN (Generative Adversarial Network), employs machine learning to craft realistic images or videos. Our objective is to utilize CNN (Convolutional Neural Network) and CapsuleNet with LSTM to differentiate between deepfake-generated frames and originals. Furthermore, we aim to elucidate our model's decision-making process through Explainable AI, fostering transparent human-AI relationships and offering practical examples for real-life scenarios.

URL

https://arxiv.org/abs/2404.12841

PDF

https://arxiv.org/pdf/2404.12841.pdf
Best Practices for a Handwritten Text Recognition System

2024-04-17 13:00:05

George Retsinas, Giorgos Sfikas, Basilis Gatos, Christophoros Nikou

arXiv_CV

arXiv_CV RNN CNN Recognition Deep_Learning Optimization Pose 3D
Abstract

Handwritten text recognition has developed rapidly in recent years, following the rise of deep learning and its applications. Though deep learning methods provide a notable boost in performance concerning text recognition, non-trivial deviation in performance can be detected even when small pre-processing or architectural/optimization elements are changed. This work follows a "best practice" rationale: we highlight simple yet effective empirical practices that can further help training and provide well-performing handwritten text recognition systems. Specifically, we considered three basic aspects of a deep HTR system and we proposed simple yet effective solutions: 1) retain the aspect ratio of the images in the preprocessing step, 2) use max-pooling for converting the 3D feature map of CNN output into a sequence of features and 3) assist the training procedure via an additional CTC loss which acts as a shortcut on the max-pooled sequential features. Using these proposed simple modifications, one can attain close to state-of-the-art results, while considering a basic convolutional-recurrent (CNN+LSTM) architecture, for both IAM and RIMES datasets. Code is available at this https URL.
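
The second practice, turning a 3D CNN feature map into a feature sequence with max-pooling, can be sketched as below. This assumes the map is laid out as channels x height x width and that pooling collapses the height axis so each image column becomes one time step for the recurrent layers; the abstract does not spell out the axis, and the function name is hypothetical.

```python
def maxpool_to_sequence(feature_map):
    """Collapse the height axis of a [channels][height][width] feature map by max.

    Returns a sequence indexed as [width][channels]: one feature vector per
    image column, ready to be fed to a recurrent layer step by step.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [max(feature_map[c][h][w] for h in range(height)) for c in range(channels)]
        for w in range(width)
    ]
```

The alternative of flattening channels and height into one long vector per column ties the feature size to the image height, whereas max-pooling keeps it equal to the channel count.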

URL

https://arxiv.org/abs/2404.11339

PDF

https://arxiv.org/pdf/2404.11339.pdf
Deep Neural Networks via Complex Network Theory: a Perspective

2024-04-17 08:42:42

Emanuele La Malfa, Gabriele La Malfa, Giuseppe Nicosia, Vito Latora

arXiv_AI

arXiv_AI RNN CNN Deep_Learning Relation
Abstract

Deep Neural Networks (DNNs) can be represented as graphs whose links and vertices iteratively process data and solve tasks sub-optimally. Complex Network Theory (CNT), merging statistical physics with graph theory, provides a method for interpreting neural networks by analysing their weights and neuron structures. However, classic works adapt CNT metrics that only permit a topological analysis as they do not account for the effect of the input data. In addition, CNT metrics have been applied to a limited range of architectures, mainly including Fully Connected neural networks. In this work, we extend the existing CNT metrics with measures that sample from the DNNs' training distribution, shifting from a purely topological analysis to one that connects with the interpretability of deep learning. For the novel metrics, in addition to the existing ones, we provide a mathematical formalisation for Fully Connected, AutoEncoder, Convolutional and Recurrent neural networks, of which we vary the activation functions and the number of hidden layers. We show that these metrics differentiate DNNs based on the architecture, the number of hidden layers, and the activation function. Our contribution provides a method rooted in physics for interpreting DNNs that offers insights beyond the traditional input-output relationship and the CNT topological analysis.

URL

https://arxiv.org/abs/2404.11172

PDF

https://arxiv.org/pdf/2404.11172.pdf
Function Approximation for Reinforcement Learning Controller for Energy from Spread Waves

2024-04-17 02:04:10

Soumyendu Sarkar, Vineet Gundecha, Sahand Ghorbanpour, Alexander Shmakov, Ashwin Ramesh Babu, Avisek Naug, Alexandre Pichard, Mathieu Cocho

arXiv_AI

arXiv_AI RNN Reinforcement_Learning Attention Optimization Transformer Pose Agent
Abstract

The industrial multi-generator Wave Energy Converters (WEC) must handle multiple simultaneous waves coming from different directions called spread waves. These complex devices in challenging circumstances need controllers with multiple objectives of energy capture efficiency, reduction of structural stress to limit maintenance, and proactive protection against high waves. The Multi-Agent Reinforcement Learning (MARL) controller trained with the Proximal Policy Optimization (PPO) algorithm can handle these complexities. In this paper, we explore different function approximations for the policy and critic networks in modeling the sequential nature of the system dynamics and find that they are key to better performance. We investigated the performance of a fully connected neural network (FCN), LSTM, and Transformer model variants with varying depths and gated residual connections. Our results show that the transformer model of moderate depth with gated residual connections around the multi-head attention, multi-layer perceptron, and the transformer block (STrXL) proposed in this paper is optimal and boosts energy efficiency by an average of 22.1% for these complex spread waves over the existing spring damper (SD) controller. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion for angled waves. Demo: this https URL

URL

https://arxiv.org/abs/2404.10991

PDF

https://arxiv.org/pdf/2404.10991.pdf
Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification

2024-04-16 17:35:25

Yu-Yang Li, Yu Bai, Cunshi Wang, Mengwei Qu, Ziteng Lu, Roberto Soria, Jifeng Liu

arXiv_CL

arXiv_CL RNN Deep_Learning Classification Face Optimization Language_Model Transformer
Abstract

Light curves serve as a valuable source of information on stellar formation and evolution. With the rapid advancement of machine learning techniques, they can be effectively processed to extract astronomical patterns and information. In this study, we present a comprehensive evaluation of deep-learning and large language model (LLM) based models for the automatic classification of variable star light curves, based on large datasets from the Kepler and K2 missions. Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision. Employing AutoDL optimization, we achieve striking performance with the 1D-Convolution+BiLSTM architecture and the Swin Transformer, hitting accuracies of 94% and 99% respectively, with the latter demonstrating a notable 83% accuracy in discerning the elusive Type II Cepheids, which comprise merely 0.02% of the total dataset. We unveil StarWhisper LightCurve (LC), an innovative series comprising three LLM-based models: LLM, multimodal large language model (MLLM), and Large Audio Language Model (LALM). Each model is fine-tuned with strategic prompt engineering and customized training methods to explore the emergent abilities of these models for astronomical data. Remarkably, the StarWhisper LC series exhibits high accuracies around 90%, significantly reducing the need for explicit feature engineering, thereby paving the way for streamlined parallel data processing and the progression of multifaceted multimodal models in astronomical applications. The study furnishes two detailed catalogs illustrating the impacts of phase and sampling intervals on deep learning classification accuracy, showing that a substantial decrease of up to 14% in observation duration and 21% in sampling points can be realized without compromising accuracy by more than 10%.

URL

https://arxiv.org/abs/2404.10757

PDF

https://arxiv.org/pdf/2404.10757.pdf
Clustering and Data Augmentation to Improve Accuracy of Sleep Assessment and Sleep Individuality Analysis

2024-04-16 05:56:41

Shintaro Tamai, Masayuki Numao, Ken-ichi Fukui

arXiv_AI

arXiv_AI RNN Action
Abstract

Recently, with growing health awareness, novel methods allow individuals to monitor sleep at home. Utilizing sleep sounds offers advantages over conventional methods like smartwatches: it is non-intrusive and capable of detecting various physiological activities. This study aims to construct a machine learning-based sleep assessment model providing evidence-based assessments, such as poor sleep due to frequent movement during sleep onset. Extracting sleep sound events, deriving latent representations using VAE, clustering with GMM, and training LSTM for subjective sleep assessment achieved a high accuracy of 94.8% in distinguishing sleep satisfaction. Moreover, TimeSHAP revealed differences in impactful sound event types and timings for different individuals.

Abstract (translated)

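In recent years, with growing health awareness, new methods have made it possible to monitor sleep at home. Compared with conventional methods such as smartwatches, using sleep sounds is non-intrusive and can detect various physiological activities. This study aims to build a machine learning-based sleep assessment model that provides evidence-based assessments, such as poor sleep caused by frequent movement during sleep onset. By extracting sleep sound events, deriving latent representations with a VAE, clustering with a GMM, and training an LSTM for subjective sleep assessment, the model achieves 94.8% accuracy in distinguishing sleep satisfaction. In addition, TimeSHAP reveals differences between individuals in which sound event types and timings are most impactful.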

URL

https://arxiv.org/abs/2404.10299

PDF

https://arxiv.org/pdf/2404.10299.pdf
Using Long Short-term Memory to merge precipitation data over mountainous area in Sierra Nevada

2024-04-15 21:01:31

Yihan Wang

arXiv_AI

arXiv_AI RNN Detection Deep_Learning Attention Quantitative
Abstract

Obtaining reliable precipitation estimates with high resolution in time and space is of great importance to hydrological studies. However, accurately estimating precipitation over high, complex mountainous terrain is a challenging task. The three widely used precipitation measurement approaches, namely rain gauges, precipitation radars, and satellite-based precipitation sensors, have their own pros and cons in producing reliable precipitation products over complex areas. One way to decrease the detection error probability and improve data reliability is precipitation data merging. With the rapid advancements in computational capabilities and the escalating volume and diversity of earth observational data, Deep Learning (DL) models have gained considerable attention in geoscience. In this study, a deep learning technique, namely Long Short-term Memory (LSTM), was employed to merge a radar-based precipitation product and a satellite-based Global Precipitation Measurement (GPM) product, Integrated Multi-Satellite Retrievals for GPM (IMERG), at hourly scale. The merged results are compared with the widely used reanalysis precipitation product, Multi-Radar Multi-Sensor (MRMS), and assessed against gauge observational data from the California Data Exchange Center (CDEC). The findings indicated that the LSTM-based merged precipitation notably underestimated gauge observations and, at times, failed to provide meaningful estimates, showing predominantly near-zero values. Relying solely on individual Quantitative Precipitation Estimates (QPEs) without additional meteorological input proved insufficient for generating a reliable merged QPE. However, the merged results effectively captured the temporal trends of the observations, outperforming MRMS in this aspect. This suggested that incorporating bias correction techniques could potentially enhance the accuracy of the merged product.

Abstract (translated)

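Obtaining reliable precipitation estimates at high temporal and spatial resolution is very important for hydrological studies. However, accurately estimating precipitation over high, complex mountainous terrain is a challenging task. The three widely used precipitation measurement approaches, namely rain gauges, precipitation radars, and satellite-based precipitation sensors, each have advantages and disadvantages in producing reliable precipitation products over complex areas. One way to lower the probability of detection error and improve data reliability is precipitation data merging. With rapid advances in computing capability and the growing volume of earth observation data, Deep Learning (DL) models have attracted wide attention in geoscience. In this study, a deep learning technique, Long Short-term Memory (LSTM), was used to merge a radar-based precipitation product with the satellite-based Global Precipitation Measurement (GPM) product, Integrated Multi-Satellite Retrievals for GPM (IMERG), at hourly scale. The merged results were compared with the widely used reanalysis precipitation product Multi-Radar Multi-Sensor (MRMS) and assessed against gauge observations from the California Data Exchange Center (CDEC). The results show that the LSTM-based merged precipitation clearly underestimated the gauge observations and at times failed to provide meaningful estimates, showing mostly near-zero values. Relying only on individual Quantitative Precipitation Estimates (QPEs), without additional meteorological input, was not enough to generate a reliable merged QPE. However, the merged results effectively captured the temporal trends of the observations and outperformed MRMS in this respect. This suggests that introducing bias correction techniques could potentially improve the accuracy of the merged product.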

URL

https://arxiv.org/abs/2404.10135

PDF

https://arxiv.org/pdf/2404.10135.pdf
Anatomy of Industrial Scale Multilingual ASR

2024-04-15 14:48:43

Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar, Enver Fakhan, Ahmed Efty, Daniel McCrystal, Sam Flamini, Domenic Donato, Takuya Yoshioka

arXiv_CL

arXiv_CL Speech_Recognition RNN Recognition Inference Unsupervised Speech
Abstract

This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Abstract (translated)

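This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving a variety of application needs. Our system uses a diverse training dataset made up of unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We describe our model architecture in detail: a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ, and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation shows competitive word error rates (WERs) against larger and more computationally expensive models such as Whisper large and Canary-1B. In addition, our architectural choices bring several key advantages, including improved code-switching capability, a 5x inference speedup over an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we take a system-centric approach to analyzing various aspects of fully-fledged ASR models, in order to gain practically relevant insights for real-world services operating at scale.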

URL

https://arxiv.org/abs/2404.09841

PDF

https://arxiv.org/pdf/2404.09841.pdf
Machine Learning Techniques for Python Source Code Vulnerability Detection

2024-04-15 08:01:02

Talaya Farasat, Joachim Posegga

arXiv_AI

arXiv_AI RNN Detection
Abstract

Software vulnerabilities are a fundamental reason for the prevalence of cyber attacks, and their identification is a crucial yet challenging problem in cyber security. In this paper, we apply and compare different machine learning algorithms for source code vulnerability detection, specifically for the Python programming language. Our experimental evaluation demonstrates that our Bidirectional Long Short-Term Memory (BiLSTM) model achieves a remarkable performance (average Accuracy = 98.6%, average F-Score = 94.7%, average Precision = 96.2%, average Recall = 93.3%, average ROC = 99.3%), thereby establishing a new benchmark for vulnerability detection in Python source code.
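
The abstract reports averaged metrics without stating how they are computed. As a minimal sketch, assuming the standard binary-classification definitions (vulnerable vs. not vulnerable), the relationship between accuracy, precision, recall, and F-score can be illustrated as follows. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Zero denominators yield 0.0 rather than an error.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is F1: the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative (made-up) counts for an imbalanced dataset.
m = binary_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data such as this, accuracy can stay high while recall is noticeably lower, which is one reason to read the F-score and recall alongside the accuracy figure.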

Abstract (translated)

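Software vulnerabilities are a fundamental reason for the prevalence of cyber attacks, and identifying them is a crucial yet challenging problem in cyber security. In this paper, we apply and compare different machine learning algorithms for source code vulnerability detection in the Python programming language. Our experimental evaluation shows that our Bidirectional Long Short-Term Memory (BiLSTM) model achieves remarkable performance (average Accuracy = 98.6%, average F-Score = 94.7%, average Precision = 96.2%, average Recall = 93.3%, average ROC = 99.3%), establishing a new benchmark for vulnerability detection in Python source code.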

URL

https://arxiv.org/abs/2404.09537

PDF

https://arxiv.org/pdf/2404.09537.pdf
TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals

2024-04-15 06:01:48

Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppanen, Xiaobai Li

arXiv_CV

arXiv_CV RNN CNN Embedding Inference Transformer Pose
Abstract

Engagement analysis finds various applications in healthcare, education, advertisement, and services. Deep Neural Networks used for this analysis have complex architectures and need large amounts of input data, computational power, and inference time. These constraints make it challenging to embed such systems into devices for real-time use. To address these limitations, we present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture. To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer. In parallel, to efficiently extract rich patterns from the temporal-frequency domain and boost processing speed, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form. Evaluated on the EngageNet dataset, the proposed method outperforms existing baselines while utilizing only two behavioral features (head pose rotations), compared to the 98 used in baseline models. Furthermore, comparative analysis shows that TCCT-Net's architecture offers an order-of-magnitude improvement in inference speed compared to state-of-the-art image-based Recurrent Neural Network (RNN) methods. The code will be released at this https URL.

Abstract (translated)

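Engagement analysis has various applications in healthcare, education, advertisement, and services. The Deep Neural Networks used for this analysis have complex architectures and require large amounts of input data, computational power, and inference time. These constraints make it difficult to embed such systems into devices for real-time use. To address these limitations, we propose a novel two-stream feature fusion architecture called "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net). To better learn meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer. In parallel, to efficiently extract rich patterns from the temporal-frequency domain and increase processing speed, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information as a 2D tensor. Evaluated on the EngageNet dataset, the proposed method outperforms existing baselines while using only two behavioral features (head pose rotations), compared to the 98 used in baseline models. In addition, comparative analysis shows that TCCT-Net's architecture offers an order-of-magnitude improvement in inference speed over state-of-the-art image-based Recurrent Neural Network (RNN) methods. The code will be released at this https URL.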

URL

https://arxiv.org/abs/2404.09474

PDF

https://arxiv.org/pdf/2404.09474.pdf
Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics

2024-04-14 13:14:13

Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Qihua Zhou, Yuanqing Xia

arXiv_CV

arXiv_CV RNN Inference Transformer
Abstract

The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNN, RNN), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption, but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their strong generalization capability. However, they require a large amount of computation power, which limits their applications in real-time intelligent video analytics. In this paper, we find that visual foundation models like Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage the fact that ViT can be accelerated through token pruning by offloading and feeding only Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, which provides a simple but efficient mechanism for determining PoIs, i.e., the probable locations of objects in subsequent frames. Through extensive evaluations on public datasets, our findings reveal that Arena can boost inference speeds by up to 1.58× and 1.82× on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
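
The idea of feeding only Patches-of-Interest to a ViT can be sketched roughly as below. This is not Arena's algorithm: the abstract describes probability-based patch sampling, whereas this illustration simply keeps every patch that overlaps a given bounding box. The function name `patches_of_interest`, the 16-pixel patch size, and the example box are all assumptions made for the sake of a small, self-contained example.

```python
def patches_of_interest(frame_w, frame_h, patch, boxes):
    """Return sorted (row, col) indices of patches overlapping any box.

    The frame is divided into a grid of square patches of side `patch`.
    Each box is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    cols = frame_w // patch
    rows = frame_h // patch
    keep = set()
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the area covered by whole patches.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(cols * patch, x1), min(rows * patch, y1)
        if x0 >= x1 or y0 >= y1:
            continue  # Box lies outside the frame or is empty.
        for r in range(y0 // patch, (y1 - 1) // patch + 1):
            for c in range(x0 // patch, (x1 - 1) // patch + 1):
                keep.add((r, c))
    return sorted(keep)


# A 64x64 frame with 16x16 patches gives a 4x4 grid of 16 patches.
pois = patches_of_interest(64, 64, 16, [(10, 10, 40, 20)])
kept_fraction = len(pois) / 16
print(pois)
print(kept_fraction)
```

Only the selected patches would be turned into tokens and sent onward, so both the transmitted data and the number of tokens the transformer processes shrink with the kept fraction.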

Abstract (translated)

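The advent of edge computing has made real-time intelligent video analytics possible. Previous works, based on traditional model architectures (e.g., CNN, RNN), use various strategies to filter out non-region-of-interest content in order to minimize bandwidth and computation consumption, but perform poorly in adverse environments. Recently, visual foundation models based on transformers have performed well in adverse environments thanks to their strong generalization capability. However, they require a large amount of computation, which limits their use in real-time intelligent video analytics. In this paper, we find that visual foundation models such as Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We exploit the fact that ViT can be accelerated through token pruning by offloading and feeding only Patches-of-Interest (PoIs) to the downstream models. In addition, we use probability-based patch sampling, a simple but efficient mechanism for determining PoIs, that is, the probable locations of objects in subsequent frames. Through extensive evaluations on public datasets, we find that Arena can boost inference speeds by up to 1.58× and 1.82× on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.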

URL

https://arxiv.org/abs/2404.09245

PDF

https://arxiv.org/pdf/2404.09245.pdf