New knowledge originates from the old. The various elements deposited in the training history are a rich resource for improving deep learning models. In this survey, we comprehensively review and summarize the topic of ``Historical Learning: Learning Models with Learning History'', which learns better neural models with the help of their learning history during optimization, from three detailed aspects: Historical Type (what), Functional Part (where), and Storage Form (how). To the best of our knowledge, this is the first survey that systematically studies methodologies that make use of various historical statistics when training deep neural networks. We also discuss connections with related topics such as recurrent/memory networks, ensemble learning, and reinforcement learning, expose future challenges of this topic, and encourage the community to consider historical learning principles when designing algorithms. The paper list related to historical learning is available at \url{this https URL}.
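As one minimal illustration of reusing the training history that this survey covers, the sketch below maintains an exponential moving average (EMA) of model weights during training; the decay value and training-loop placement are illustrative assumptions, not prescriptions from the survey.

```python
import copy
import torch

def make_ema(model):
    """Create a frozen copy of the model to accumulate historical weights."""
    ema = copy.deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    """Blend the current weights into the running historical average,
    typically called once per optimization step."""
    for p_ema, p in zip(ema.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```

At evaluation time the EMA copy, rather than the raw model, is typically used for prediction.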
https://arxiv.org/abs/2303.12992
This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task, whereby we demonstrate that small space-time-memory networks with finite memory can achieve results competitive with the state of the art, but at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models are able to jointly benefit from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving J&F competitive with the state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to 5x faster and with 32x fewer parameters.
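The abstract does not spell out the unified loss, so the following is only a hedged sketch of combining a temperature-scaled KL distillation term with a pixel-wise supervised contrastive term over sampled pixel embeddings; the temperatures, weighting, and sampling scheme are all assumptions.

```python
import torch
import torch.nn.functional as F

def distill_contrastive_loss(student_logits, teacher_logits, feats, labels,
                             tau=2.0, temp=0.1, alpha=0.5):
    """student_logits/teacher_logits: (N, C) for N sampled pixels;
    feats: (N, D) pixel embeddings; labels: (N,) class ids."""
    # Temperature-scaled KL distillation from the pre-trained teacher.
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                  F.softmax(teacher_logits / tau, dim=1),
                  reduction="batchmean") * tau * tau

    # Pixel-wise supervised contrastive term (positives share a label).
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / temp
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)                       # exclude self-pairs
    off_diag = 1.0 - torch.eye(len(z), device=z.device)
    log_prob = sim - torch.log((torch.exp(sim) * off_diag).sum(1, keepdim=True))
    con = -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

    return alpha * kd + (1 - alpha) * con
```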
https://arxiv.org/abs/2303.07815
I propose a novel dual-attention model (DAM) for aspect-level sentiment classification. Many methods have been proposed for this task, such as support vector machines over manually designed features, long short-term memory networks based on attention mechanisms, and graph neural networks based on dependency parsing. While these methods all achieve decent performance, I argue that they all miss one important piece of syntactic information: dependency labels. Based on this idea, this paper proposes a model that feeds dependency labels into the attention mechanism for this task. We evaluate the proposed approach on three datasets: the laptop and restaurant datasets from SemEval 2014 and a Twitter dataset. Experimental results show that the dual-attention model performs well on all three datasets.
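The abstract does not specify how dependency labels enter the attention computation; one plausible minimal sketch, with all names and dimensions hypothetical, adds a learned per-label bias to the attention scores over context tokens.

```python
import torch
import torch.nn as nn

class LabelAwareAttention(nn.Module):
    """Aspect-conditioned attention over context tokens with a learned
    bias for each dependency label (illustrative, not the paper's DAM)."""
    def __init__(self, dim, num_dep_labels):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)
        self.label_bias = nn.Embedding(num_dep_labels, 1)

    def forward(self, aspect, tokens, dep_labels):
        # aspect: (B, D); tokens: (B, T, D); dep_labels: (B, T) label ids
        q = aspect.unsqueeze(1).expand_as(tokens)
        s = self.score(torch.cat([tokens, q], dim=-1)).squeeze(-1)
        s = s + self.label_bias(dep_labels).squeeze(-1)   # dependency-label bias
        attn = torch.softmax(s, dim=-1)
        return (attn.unsqueeze(-1) * tokens).sum(dim=1)
```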
https://arxiv.org/abs/2303.07689
3D single object tracking has been a crucial problem for decades, with numerous applications such as autonomous driving. Despite its wide-ranging use, this task remains challenging due to the significant appearance variation caused by occlusion and by size differences among tracked targets. To address these issues, we present MBPTrack, which adopts a Memory mechanism to utilize past information and formulates localization in a coarse-to-fine scheme using Box Priors given in the first frame. Specifically, past frames with targetness masks serve as an external memory, and a transformer-based module propagates tracked target cues from the memory to the current frame. To precisely localize objects of all sizes, MBPTrack first predicts the target center via Hough voting. By leveraging the box priors given in the first frame, we adaptively sample reference points around the target center that roughly cover targets of different sizes. Then, we obtain dense feature maps by aggregating point features into the reference points, on which localization can be performed more effectively. Extensive experiments demonstrate that MBPTrack achieves state-of-the-art performance on KITTI, nuScenes and the Waymo Open Dataset, while running at 50 FPS on a single RTX3090 GPU.
https://arxiv.org/abs/2303.05071
Temporal sentence localization in videos (TSLV) aims to retrieve the most relevant segment in an untrimmed video according to a given sentence query. However, almost all existing TSLV approaches suffer from the same limitations: (1) they focus on either frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate both; (2) they neglect to leverage rich semantic contexts to further benefit query reasoning. To address these issues, in this paper we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning: for visual reasoning, we design a visual graph memory to leverage the visual information of the video; for semantic reasoning, a semantic graph memory is introduced to explicitly leverage the semantic knowledge contained in the classes and attributes of video objects, and to perform correlation reasoning in the semantic space. Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
https://arxiv.org/abs/2303.01046
Monaural speech enhancement has been widely studied using real-valued networks in the time-frequency (TF) domain. However, since the input and the target are naturally complex-valued in the TF domain, a fully complex network is highly desirable for effectively learning the feature representation and modelling the sequence in the complex domain. Moreover, phase, an important factor for the perceptual quality of speech, has been shown to be learnable together with magnitude from noisy speech using complex masking or complex spectral mapping. Many recent studies focus on either complex masking or complex spectral mapping alone, ignoring their performance boundaries. To address these issues, we propose a fully complex dual-path dual-decoder conformer network (D2Former) that uses joint complex masking and complex spectral mapping for monaural speech enhancement. In D2Former, we extend the conformer network into the complex domain and form a dual-path complex TF self-attention architecture for effectively modelling the complex-valued TF sequence. We further boost the TF feature representation in the encoder and the decoders with a dual-path learning structure, exploiting complex dilated convolutions for time dependency and complex feedforward sequential memory networks (CFSMN) for frequency recurrence. In addition, we improve the performance boundaries of complex masking and complex spectral mapping by combining the strengths of the two training targets in a joint-learning framework. As a consequence, D2Former takes full advantage of complex-valued operations, dual-path processing, and joint training targets. Compared to previous models, D2Former achieves state-of-the-art results on the VoiceBank+DEMAND benchmark with the smallest model size of 0.87M parameters.
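As a hedged sketch of what joint complex masking and complex spectral mapping might look like (the abstract does not give D2Former's actual formulation), the two branch outputs can be blended and trained with a magnitude plus real/imaginary loss; the blend weight and the loss form are assumptions.

```python
import torch

def joint_complex_estimate(noisy_spec, mask, mapped_spec, beta=0.5):
    """Blend the complex-masking branch with the complex spectral-mapping
    branch. All tensors are complex-valued with shape (B, F, T)."""
    return beta * (noisy_spec * mask) + (1 - beta) * mapped_spec

def complex_spectral_loss(est, clean):
    """L1 on magnitudes plus L1 on the complex difference, a common pair
    of criteria for complex-valued targets."""
    return (est.abs() - clean.abs()).abs().mean() + (est - clean).abs().mean()
```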
https://arxiv.org/abs/2302.11832
Predicting the future direction of stock markets using historical data has been a fundamental component of financial forecasting. This historical data contains information about a stock in each specific time span, such as the opening, closing, lowest, and highest prices. Leveraging this data, the future direction of the market is commonly predicted using time-series models such as long short-term memory (LSTM) networks. This work proposes modeling and predicting market movements with a fundamentally new approach, namely by utilizing image and byte-based number representations of the stock data processed with recently introduced vision-language models. We conduct a large set of experiments on the hourly stock data of the German share index and evaluate various architectures on stock price prediction using historical stock data. We conduct a comprehensive evaluation of the results with various metrics to accurately depict the actual performance of the different approaches. Our evaluation results show that our novel approach, based on representing stock data as text (bytes) and images, significantly outperforms strong deep learning-based baselines.
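The byte-based serialization is not defined in the abstract, so the following is a purely illustrative sketch of rendering hourly OHLC bars as a text (byte) sequence that a language or vision-language model could consume; the exact format is an assumption.

```python
def ohlc_to_text(bars):
    """Serialize OHLC bars into a plain-text byte sequence (format is
    illustrative). bars: iterable of (open, high, low, close) floats."""
    return " | ".join(
        f"O:{o:.2f} H:{h:.2f} L:{lo:.2f} C:{c:.2f}" for o, h, lo, c in bars
    )

# ohlc_to_text([(15701.2, 15733.0, 15689.5, 15721.7)])
# -> 'O:15701.20 H:15733.00 L:15689.50 C:15721.70'
```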
https://arxiv.org/abs/2301.10166
The General Associative Memory Model (GAMM) has a constant, state-dependent energy surface that leads the output dynamics to fixed points, retrieving single memories from a collection of memories that can be asynchronously preloaded. We introduce a new class of General Sequential Episodic Memory Models (GSEMM) that, in the adiabatic limit, exhibit a temporally changing energy surface, leading to a series of meta-stable states that are sequential episodic memories. The dynamic energy surface is enabled by newly introduced asymmetric synapses with signal propagation delays in the network's hidden layer. We study the theoretical and empirical properties of two memory models from the GSEMM class that differ in their activation functions: LISEM has non-linearities in the feature layer, whereas DSEM has non-linearity in the hidden layer. In principle, DSEM has a storage capacity that grows exponentially with the number of neurons in the network. We introduce a learning rule for the synapses based on the energy minimization principle and show that it can learn single memories and their sequential relationships online. This rule is similar to the Hebbian learning algorithm and to Spike-Timing Dependent Plasticity (STDP), which describe conditions under which synapses between neurons change strength. Thus, GSEMM combines the static and dynamic properties of episodic memory under a single theoretical framework and bridges neuroscience, machine learning, and artificial intelligence.
https://arxiv.org/abs/2212.05563
Deformable image registration, i.e., the task of aligning multiple images into one coordinate system by non-linear transformation, serves as an essential preprocessing step for neuroimaging data. Recent research on deformable image registration is mainly focused on improving registration accuracy using multi-stage alignment methods, where the source image is repeatedly deformed in stages by the same neural network until it is well aligned with the target image. Conventional methods for multi-stage registration often blur the source image, as the pixel/voxel values are repeatedly interpolated from the image generated by the previous stage. However, maintaining image quality, such as sharpness, during registration is crucial for medical data analysis. In this paper, we study the problem of anti-blur deformable image registration and propose a novel solution, called Anti-Blur Network (ABN), for multi-stage image registration. Specifically, we use a pair of networks, a short-term registration network and a long-term memory network, to learn the nonlinear deformations at each stage: the short-term registration network learns how to improve registration accuracy incrementally, and the long-term memory network combines all the previous deformations so that interpolation can be performed on the raw image directly, preserving image sharpness. Extensive experiments on both natural and medical image datasets demonstrate that ABN can accurately register images while preserving their sharpness. Our code and data can be found at this https URL
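The key property the abstract describes, combining all previous deformations so that only a single interpolation touches the raw image, can be sketched with standard displacement-field composition; this generic version (using PyTorch's grid_sample, with fields in normalized coordinates) stands in for ABN's actual long-term memory network.

```python
import torch
import torch.nn.functional as F

def identity_grid(b, h, w):
    """Identity sampling grid in normalized [-1, 1] coords, shape (B, H, W, 2)."""
    theta = torch.eye(2, 3).unsqueeze(0).repeat(b, 1, 1)
    return F.affine_grid(theta, (b, 1, h, w), align_corners=True)

def compose_fields(u_prev, u_new, grid):
    """Accumulate displacements: u_total(x) = u_new(x) + u_prev(x + u_new(x)).
    u_prev, u_new: (B, 2, H, W) displacement fields."""
    warped_prev = F.grid_sample(u_prev, grid + u_new.permute(0, 2, 3, 1),
                                align_corners=True)
    return warped_prev + u_new

def warp_once(image, u_total, grid):
    """Interpolate the raw image a single time with the accumulated field,
    avoiding the blur of stage-by-stage resampling."""
    return F.grid_sample(image, grid + u_total.permute(0, 2, 3, 1),
                         align_corners=True)
```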
https://arxiv.org/abs/2212.03277
Existing video object segmentation (VOS) benchmarks focus on short-term videos, which last only about 3-5 seconds and in which objects are visible most of the time. These videos are poorly representative of practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. In this paper, we present a new benchmark dataset and evaluation methodology named LVOS, which consists of 220 videos with a total duration of 421 minutes. To the best of our knowledge, LVOS is the first densely annotated long-term VOS dataset. The videos in LVOS last 1.59 minutes on average, 20 times longer than videos in existing VOS datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term reappearance and cross-temporal similar objects. Moreover, we provide additional language descriptions to encourage the exploration of integrating linguistic and visual features for video object segmentation. Based on LVOS, we assess existing video object segmentation algorithms and propose a Diverse Dynamic Memory network (DDMemory) that consists of three complementary memory banks to adequately exploit temporal information. The experimental results demonstrate the strengths and weaknesses of prior methods, pointing to promising directions for further study. Our objective is to provide the community with a large and varied benchmark to boost the advancement of long-term VOS. Data and code are available at \url{this https URL}.
https://arxiv.org/abs/2211.10181
The encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks and the introduction and development of the attention mechanism, the encoder-decoder is still the de facto neural network architecture for state-of-the-art models. While the motivation for decoding information from some hidden space is straightforward, the strict separation of the encoding and decoding steps into an encoder and a decoder in the model architecture is not necessarily a must. Compared to the task of autoregressive language modeling in the target language, machine translation simply has an additional source sentence as context. Given that neural language models nowadays can already handle rather long contexts in the target language, it is natural to ask whether simply concatenating the source and target sentences and training a language model to do translation would work. In this work, we investigate this concept for machine translation. Specifically, we experiment with bilingual translation, translation with additional target-side monolingual data, and multilingual translation. In all cases, this alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.
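A minimal sketch of the investigated setup: source and target are concatenated into one sequence for a decoder-only language model, with the loss computed only on target tokens. The separator token and the -100 ignore label are common conventions assumed here, not details from the paper.

```python
def build_lm_example(src_ids, tgt_ids, sep_id, ignore_label=-100):
    """Turn a translation pair into a single language-modeling example.
    The source acts purely as context: its positions are masked out of
    the loss via ignore_label (the default ignore_index of cross-entropy
    in common frameworks)."""
    input_ids = src_ids + [sep_id] + tgt_ids
    labels = [ignore_label] * (len(src_ids) + 1) + tgt_ids
    return input_ids, labels
```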
https://arxiv.org/abs/2210.11807
Graph neural networks (GNNs) have achieved great success on various graph problems. However, most GNNs are message passing neural networks (MPNNs) built on the homophily assumption, in which nodes with the same label are connected. Real-world problems, however, often exhibit heterophily, where nodes with different labels are connected. MPNNs fail on heterophilous graphs because they mix information from different distributions and are not good at capturing global patterns. Therefore, in this paper we propose a novel Graph Memory Network model for Heterophilous Graphs (HP-GMN) to address the heterophily problem. In HP-GMN, local information and global patterns are learned through local statistics and the memory to facilitate prediction. We further propose regularization terms to help the memory learn global information. We conduct extensive experiments showing that our method achieves state-of-the-art performance on both homophilous and heterophilous graphs.
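The abstract does not detail HP-GMN's read/write operations, so the following is only a rough sketch of one plausible memory read: node features attend over learnable global memory slots and are enriched with the retrieved pattern.

```python
import torch
import torch.nn.functional as F

def read_memory(node_feats, memory):
    """node_feats: (N, D) local node representations;
    memory: (K, D) learnable global pattern slots.
    Returns node features concatenated with their retrieved patterns."""
    attn = F.softmax(node_feats @ memory.t(), dim=-1)   # (N, K)
    return torch.cat([node_feats, attn @ memory], dim=-1)
```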
https://arxiv.org/abs/2210.08195
In this article, we propose a variational inference formulation of auto-associative memories, allowing us to combine perceptual inference and memory retrieval in the same mathematical framework. In this formulation, the prior probability distribution over latent representations is made memory-dependent, thus pulling the inference process towards previously stored representations. We then study how different neural network approaches to variational inference can be applied in this framework. We compare methods relying on amortized inference, such as Variational Autoencoders, with methods relying on iterative inference, such as Predictive Coding, and suggest combining both approaches to design new auto-associative memory models. We evaluate the obtained algorithms on the CIFAR10 and CLEVR image datasets and compare them with other associative memory models such as Hopfield Networks, End-to-End Memory Networks, and Neural Turing Machines.
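One simple way to realize a memory-dependent prior, in the spirit of the formulation here but with all specifics assumed, is a uniform mixture of isotropic Gaussians centered on previously stored latent representations; maximizing this term during inference pulls the posterior towards stored representations.

```python
import torch

def memory_prior_log_prob(z, memory, sigma=0.5):
    """Log-density of latents z under a uniform mixture of isotropic
    Gaussians centered on stored memories. z: (B, D); memory: (M, D)."""
    d2 = torch.cdist(z, memory).pow(2)                        # (B, M)
    log_norm = -0.5 * z.size(1) * torch.log(
        torch.tensor(2 * torch.pi * sigma ** 2))
    return (torch.logsumexp(-d2 / (2 * sigma ** 2), dim=1)
            - torch.log(torch.tensor(float(memory.size(0)))) + log_norm)
```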
https://arxiv.org/abs/2210.08013
Prior works on event-based optical flow estimation have investigated several gradient-based learning methods to train neural networks for predicting optical flow. However, they do not utilize the fast data rate of event data streams and rely on a spatio-temporal representation constructed from a collection of events over a fixed period of time (often between two grayscale frames). As a result, optical flow is only evaluated at a frequency much lower than the rate at which data is produced by an event-based camera, leading to a temporally sparse optical flow estimate. To predict temporally dense optical flow, we cast the problem as a sequential learning task and propose a training methodology for sequential networks that predict continuously on an event stream. We propose two types of networks: one focused on performance and another focused on compute efficiency. We first train long short-term memory (LSTM) networks on the DSEC dataset and demonstrate 10x temporally denser optical flow estimation than existing flow estimation approaches. The additional benefit of a memory that draws on long temporal correlations results in a 19.7% improvement in the flow prediction accuracy of LSTMs over similar networks with no memory elements. We subsequently show that the inherent recurrence of spiking neural networks (SNNs) enables them to learn and estimate temporally dense optical flow with 31.8% fewer parameters than the LSTM, but with a slightly increased error. This demonstrates the potential for energy-efficient implementations of fast optical flow prediction using SNNs.
https://arxiv.org/abs/2210.01244
Data from multiple sources can often be modeled as multimodal time-series events that have different sampling frequencies, data compositions, temporal relations, and characteristics. Different types of events have complex nonlinear relationships, and the timing of each event is irregular. Neither the classical recurrent neural network (RNN) model nor the current state-of-the-art Transformer model handles these features well. In this paper, a feature fusion framework for multimodal irregular time-series events is proposed based on long short-term memory (LSTM) networks. First, complex features are extracted according to the irregular patterns of different events. Second, the nonlinear correlations and complex temporal dependencies between these features are captured and fused into a tensor. Finally, a feature gate is used to control the access frequency of different tensors. Extensive experiments on the MIMIC-III dataset demonstrate that the proposed framework significantly outperforms existing methods in terms of AUC (the area under the receiver operating characteristic curve) and AP (average precision).
https://arxiv.org/abs/2209.01728
Anomalies in time series provide insights into critical scenarios across a range of industries, from banking and aerospace to information technology, security, and medicine. However, identifying anomalies in time-series data is particularly challenging due to the imprecise definition of anomalies, the frequent absence of labels, and the enormously complex temporal correlations present in such data. The LSTM autoencoder is an encoder-decoder scheme for anomaly detection based on long short-term memory (LSTM) networks that learns to reconstruct time-series behavior and then uses the reconstruction error to identify abnormalities. We introduce the Denoising Architecture as a complement to this LSTM encoder-decoder model and investigate its effect on real-world as well as artificially generated datasets. We demonstrate that the proposed architecture increases both the accuracy and the training speed, thereby making the LSTM autoencoder more efficient for unsupervised anomaly detection tasks.
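A minimal sketch of the LSTM autoencoder scheme, with an optional input-corruption step standing in for the proposed denoising architecture (whose actual design the abstract does not detail); layer sizes and noise level are illustrative.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Reconstruct a window of a time series; a high reconstruction error
    flags an anomaly."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x, noise_std=0.0):
        x_in = x + noise_std * torch.randn_like(x)  # denoising-style corruption
        _, (h, _) = self.encoder(x_in)
        dec_in = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)  # repeat summary
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)

def anomaly_score(model, x):
    """Per-window mean squared reconstruction error; threshold to detect."""
    return ((model(x) - x) ** 2).mean(dim=(1, 2))
```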
https://arxiv.org/abs/2208.14337
Scour is the number one cause of bridge failure in many parts of the world. Considering the lack of reliability of existing empirical equations for scour depth estimation, and the complexity and uncertainty of scour as a physical phenomenon, it is essential to develop more reliable solutions for scour risk assessment. This study introduces a novel AI approach for early forecasting of scour based on real-time monitoring data obtained from sonar and stage sensors installed at bridge piers. Long short-term memory (LSTM) networks, a prominent deep learning algorithm successfully used for time-series forecasting in other fields, were developed and trained using more than 11 years of river stage and bed elevation readings obtained from the Alaska scour monitoring program. The capability of the AI models in scour prediction is shown for three case-study bridges. Results show that LSTMs can capture the temporal and seasonal patterns of both flow and riverbed variations around bridge piers through cycles of scour and filling, and can provide reasonable predictions of upcoming scour depth as early as seven days in advance. It is expected that the proposed solution can be implemented by transportation authorities in the development of emerging AI-based early warning systems, enabling superior bridge scour management.
https://arxiv.org/abs/2208.10500
We consider the problem of training a neural network to store a set of patterns with maximal noise robustness. A solution, in terms of optimal weights and state update rules, is derived by training each individual neuron to perform either kernel classification or interpolation with a minimum weight norm. By applying this method to feed-forward and recurrent networks, we derive optimal networks that include, as special cases, many of the hetero- and auto-associative memory models that have been proposed over the past years, such as modern Hopfield networks and Kanerva's sparse distributed memory. We generalize Kanerva's model and demonstrate a simple way to design a kernel memory network that can store an exponential number of continuous-valued patterns with a finite basin of attraction. The framework of kernel memory networks offers a simple and intuitive way to understand the storage capacity of previous memory models, and allows for new biological interpretations in terms of dendritic non-linearities and synaptic clustering.
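The storage-by-kernel-interpolation idea can be sketched in a few lines of NumPy: fit minimum-norm readout weights via the pseudo-inverse of the Gram matrix so that every stored pattern is a fixed point, then retrieve by iterating the map. The RBF kernel and step count are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian kernel between row-stacked pattern matrices."""
    d2 = np.square(a[:, None, :] - b[None, :, :]).sum(-1)
    return np.exp(-gamma * d2)

def store(patterns):
    """Minimum-norm readout weights mapping each stored pattern to itself."""
    return np.linalg.pinv(rbf(patterns, patterns)) @ patterns   # (P, N)

def retrieve(x, patterns, weights, steps=10):
    """Iterate the kernel map; a noisy cue falls into a nearby attractor."""
    for _ in range(steps):
        x = rbf(x[None, :], patterns)[0] @ weights
    return x
```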
https://arxiv.org/abs/2208.09416
When evaluating quantities of interest that depend on the solutions to differential equations, we inevitably face the trade-off between accuracy and efficiency. Especially for parametrized, time-dependent problems in engineering computations, it is often the case that acceptable computational budgets limit the availability of high-fidelity, accurate simulation data. Multi-fidelity surrogate modeling has emerged as an effective strategy to overcome this difficulty. Its key idea is to leverage many low-fidelity simulation data, less accurate but much faster to compute, to improve the approximations obtained with limited high-fidelity data. In this work, we introduce a novel data-driven framework of multi-fidelity surrogate modeling for parametrized, time-dependent problems using long short-term memory (LSTM) networks, to enhance output predictions both for unseen parameter values and forward in time simultaneously, a task known to be particularly challenging for data-driven models. We demonstrate the wide applicability of the proposed approach on a variety of engineering problems with high- and low-fidelity data generated through fine versus coarse meshes, small versus large time steps, or finite element full-order versus deep learning reduced-order models. Numerical results show that the proposed multi-fidelity LSTM networks not only improve significantly over single-fidelity regression, but also outperform multi-fidelity models based on feed-forward neural networks.
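A hedged sketch of the multi-fidelity idea: an LSTM consumes the problem parameters together with the cheap low-fidelity trajectory at each time step and regresses the corresponding high-fidelity output; the architecture details are assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class MultiFidelityLSTM(nn.Module):
    """Map (parameters, low-fidelity sequence) -> high-fidelity sequence."""
    def __init__(self, n_params, n_lofi, n_out, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_params + n_lofi, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_out)

    def forward(self, params, lofi_seq):
        # params: (B, P), broadcast over time; lofi_seq: (B, T, L)
        p = params.unsqueeze(1).expand(-1, lofi_seq.size(1), -1)
        h, _ = self.lstm(torch.cat([p, lofi_seq], dim=-1))
        return self.head(h)
```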
https://arxiv.org/abs/2208.03115
Recently, several spatial-temporal memory-based methods have verified that storing intermediate frames and their masks as memory is helpful for segmenting target objects in videos. However, they mainly focus on better matching between the current frame and the memory frames, without explicitly paying attention to the quality of the memory. Therefore, frames with poor segmentation masks are prone to be memorized, which leads to a segmentation-error accumulation problem and further degrades segmentation performance. In addition, the linear growth of memory frames with the number of processed frames limits the ability of the models to handle long videos. To this end, we propose a Quality-aware Dynamic Memory Network (QDMN) that evaluates the segmentation quality of each frame, allowing the memory bank to selectively store accurately segmented frames and prevent the error accumulation problem. We then combine segmentation quality with temporal consistency to dynamically update the memory bank and improve the practicality of the model. Without any bells and whistles, our QDMN achieves new state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks. Moreover, extensive experiments demonstrate that the proposed Quality Assessment Module (QAM) can be applied to memory-based methods as a generic plugin and significantly improves performance. Our source code is available at this https URL.
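The selective-memorization idea reads as a quality-gated update of the memory bank; the threshold, eviction rule, and external quality score below are illustrative stand-ins for QDMN's Quality Assessment Module, whose design the abstract does not specify.

```python
def update_memory(memory, frame_feat, mask, quality, thresh=0.8, max_size=8):
    """Memorize a frame only if its predicted segmentation quality is high
    enough; when full, evict the oldest non-reference entry so the
    annotated first frame is always kept."""
    if quality < thresh:
        return memory                      # skip poorly segmented frames
    memory.append((frame_feat, mask))
    if len(memory) > max_size:
        del memory[1]                      # index 0 holds the reference frame
    return memory
```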
https://arxiv.org/abs/2207.07922