In recent years, deep learning has increasingly gained attention in the field of traffic prediction. Existing traffic prediction models often rely on GCNs or attention mechanisms with O(N^2) complexity to dynamically extract traffic node features, which makes them neither efficient nor lightweight. Additionally, these models typically utilize only historical data for prediction, without considering the impact of information about the prediction target itself. To address these issues, we propose a Pattern-Matching Dynamic Memory Network (PM-DMNet). PM-DMNet employs a novel dynamic memory network to capture traffic pattern features with only O(N) complexity, significantly reducing computational overhead while achieving excellent performance. PM-DMNet also introduces two prediction methods: Recursive Multi-step Prediction (RMP) and Parallel Multi-step Prediction (PMP), which leverage the time features of the prediction targets to assist in the forecasting process. Furthermore, a transfer attention mechanism is integrated into PMP, transforming historical data features to better align with the predicted target states, thereby capturing trend changes more accurately and reducing errors. Extensive experiments demonstrate the superiority of the proposed model over existing benchmarks. The source code is available at: this https URL.
https://arxiv.org/abs/2408.07100
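The complexity claim is easy to see in code. Below is a minimal sketch (not the authors' implementation; all names and sizes are illustrative) of a pattern-matching memory readout: each of the N node embeddings attends over a small learnable bank of M pattern prototypes with M fixed, so the cost is O(N·M), i.e. linear in N, rather than the O(N^2) of pairwise node-to-node attention.

```python
import torch
import torch.nn as nn

class PatternMemoryReadout(nn.Module):
    """Illustrative sketch: each traffic node attends over a small, fixed-size
    bank of learnable pattern prototypes (M << N), so the readout costs
    O(N * M), linear in the number of nodes N, instead of the O(N^2) of
    pairwise node-to-node attention."""
    def __init__(self, d_model: int, num_patterns: int = 16):
        super().__init__()
        self.patterns = nn.Parameter(torch.randn(num_patterns, d_model))
        self.query = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model) node features
        q = self.query(x)                                   # (B, N, d)
        scores = q @ self.patterns.t() / x.size(-1) ** 0.5  # (B, N, M)
        attn = torch.softmax(scores, dim=-1)
        return x + attn @ self.patterns                     # residual readout

x = torch.randn(2, 307, 64)   # e.g. 307 sensors, 64-dim features
print(PatternMemoryReadout(64)(x).shape)  # torch.Size([2, 307, 64])
```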
Existing video object segmentation (VOS) benchmarks focus on short-term videos, which last only about 3-5 seconds and in which objects are visible most of the time. These videos are poorly representative of practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. In this paper, we therefore present a new benchmark dataset and evaluation methodology named LVOS, which consists of 220 videos with a total duration of 421 minutes. To the best of our knowledge, LVOS is the first densely annotated long-term VOS dataset. The videos in LVOS last 1.59 minutes on average, 20 times longer than videos in existing VOS datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term reappearance and cross-temporally similar objects. Moreover, we provide additional language descriptions to encourage the exploration of integrating linguistic and visual features for video object segmentation. Based on LVOS, we assess existing video object segmentation algorithms and propose a Diverse Dynamic Memory network (DDMemory) that consists of three complementary memory banks to exploit temporal information adequately. The experimental results demonstrate the strengths and weaknesses of prior methods, pointing out promising directions for further study. Our objective is to provide the community with a large and varied benchmark to boost the advancement of long-term VOS. Data and code are available at \url{this https URL}.
https://arxiv.org/abs/2211.10181
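The abstract does not detail DDMemory's three banks, but the idea of complementary memories with different update policies can be sketched as follows (the bank roles, a fixed reference bank, a sparsely updated long-term bank, and a dense short-term bank, are our assumptions, not the paper's design):

```python
import torch

class DiverseMemoryBanks:
    """Illustrative sketch only: three complementary banks with different
    write policies, assuming a fixed reference bank (annotated first frame),
    a sparsely updated long-term bank, and a dense short-term bank."""
    def __init__(self, long_every: int = 10, short_size: int = 3):
        self.reference, self.long_term, self.short_term = [], [], []
        self.long_every, self.short_size = long_every, short_size

    def write(self, t: int, feat: torch.Tensor) -> None:
        if t == 0:
            self.reference.append(feat)           # annotated first frame
        if t % self.long_every == 0:
            self.long_term.append(feat)           # sparse long-term history
        self.short_term = (self.short_term + [feat])[-self.short_size:]

    def read(self, query: torch.Tensor) -> torch.Tensor:
        mem = torch.stack(self.reference + self.long_term + self.short_term)
        attn = torch.softmax(mem.flatten(1) @ query.flatten(), dim=0)
        return (attn[:, None, None] * mem).sum(0)  # attention-weighted readout

banks = DiverseMemoryBanks()
for t in range(25):
    banks.write(t, torch.randn(8, 16))
print(banks.read(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```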
Recently, several spatial-temporal memory-based methods have verified that storing intermediate frames and their masks as memory is helpful for segmenting target objects in videos. However, they mainly focus on better matching between the current frame and the memory frames without explicitly paying attention to the quality of the memory. Consequently, frames with poor segmentation masks are prone to being memorized, which leads to a segmentation-mask error-accumulation problem and further degrades segmentation performance. In addition, the linear growth of memory frames with the number of processed frames also limits the ability of the models to handle long videos. To this end, we propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame, allowing the memory bank to selectively store accurately segmented frames and thus prevent the error-accumulation problem. We then combine segmentation quality with temporal consistency to dynamically update the memory bank, improving the practicability of the models. Without any bells and whistles, our QDMN achieves new state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks. Moreover, extensive experiments demonstrate that the proposed Quality Assessment Module (QAM) can be applied to memory-based methods as a generic plugin and significantly improves performance. Our source code is available at this https URL.
https://arxiv.org/abs/2207.07922
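A minimal sketch of the quality-aware update rule as we read it from the abstract (the threshold, the recency decay, and the eviction rule are assumptions): frames whose predicted mask quality is too low are never memorized, which blocks error accumulation, and the bank is kept bounded so memory no longer grows linearly with video length.

```python
import torch

def update_memory(memory, frame_feat, mask_quality,
                  max_size=5, tau=0.7, decay=0.95):
    """Quality-aware memory update (assumed rule, not QDMN's exact one):
    reject frames below quality threshold `tau`; decay stored qualities so
    stale entries fade; evict the lowest-scoring entry when the bank is full."""
    if mask_quality < tau:
        return memory                                # skip poorly segmented frames
    memory = [(f, q * decay) for f, q in memory]     # older entries fade
    memory.append((frame_feat, mask_quality))
    if len(memory) > max_size:                       # evict lowest decayed quality
        memory.pop(min(range(len(memory)), key=lambda i: memory[i][1]))
    return memory

mem = []
for t in range(8):
    mem = update_memory(mem, torch.randn(16), mask_quality=0.5 + 0.06 * t)
print(len(mem))  # at most max_size frames survive, all above threshold
```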
Recent studies try to build task-oriented dialogue systems in an end-to-end manner, and existing works have made great progress on this task. However, two issues remain to be addressed: (1) how to effectively represent knowledge bases and incorporate them into the dialogue system, and (2) how to efficiently reason over the knowledge bases given queries. To solve these issues, we design a novel Transformer-based Dynamic Memory Network (DMN) with a novel Memory Mask scheme, which can dynamically generate context-aware knowledge base representations and reason over the knowledge bases simultaneously. Furthermore, we incorporate the dynamic memory network into the Transformer and propose the Dynamic Memory Enhanced Transformer (DMET), which can aggregate information from the dialogue history and knowledge bases to generate better responses. Through extensive experiments, we show that our method achieves superior performance over state-of-the-art methods.
https://arxiv.org/abs/2010.05740
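The Memory Mask's construction is not specified in the abstract, but its effect, restricting reasoning to admissible knowledge-base rows, can be sketched as masked cross-attention (a hedged illustration, not DMET's actual layer):

```python
import torch

def masked_kb_attention(query, kb, kb_mask):
    """Cross-attention over knowledge-base rows with a memory mask (how the
    mask is built is not specified here; it is just a boolean per KB row).
    Masked rows get -inf logits, so the dialogue context only reasons over
    the admissible KB entries."""
    # query: (B, d), kb: (B, R, d), kb_mask: (B, R) bool
    logits = torch.einsum('bd,brd->br', query, kb) / kb.size(-1) ** 0.5
    logits = logits.masked_fill(~kb_mask, float('-inf'))
    attn = torch.softmax(logits, dim=-1)
    return torch.einsum('br,brd->bd', attn, kb)   # context-aware KB summary

q, kb = torch.randn(2, 64), torch.randn(2, 10, 64)
mask = torch.ones(2, 10, dtype=torch.bool)
mask[:, 5:] = False                               # last 5 rows are inadmissible
print(masked_kb_attention(q, kb, mask).shape)     # torch.Size([2, 64])
```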
Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely upon supervised learning and require large-scale labeled event datasets, which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a Few-Shot Learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits a Dynamic Memory Network (DMN) to not only learn better prototypes for event types but also produce more robust sentence encodings for event mentions. Unlike vanilla prototypical networks, which simply compute event prototypes by averaging and therefore consume event mentions only once, our model is more robust and can distill contextual information from event mentions multiple times thanks to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only handles sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.
https://arxiv.org/abs/1910.11621
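A simplified sketch of the multi-hop prototype idea (the hop count and the GRU-based memory update are assumptions): each hop re-attends to the support mentions given the current prototype, so informative mentions are consumed more than once instead of being averaged a single time.

```python
import torch
import torch.nn as nn

class MultiHopPrototype(nn.Module):
    """Sketch of computing an event-type prototype with multi-hop memory
    instead of plain averaging (a simplified reading of DMB-PN)."""
    def __init__(self, d: int, hops: int = 3):
        super().__init__()
        self.hops = hops
        self.update = nn.GRUCell(d, d)

    def forward(self, support: torch.Tensor) -> torch.Tensor:
        # support: (K, d) encoded event mentions of one event type
        proto = support.mean(dim=0, keepdim=True)             # hop-0: average
        for _ in range(self.hops):
            attn = torch.softmax(support @ proto.t(), dim=0)  # (K, 1)
            read = (attn * support).sum(dim=0, keepdim=True)  # (1, d)
            proto = self.update(read, proto)                  # refine prototype
        return proto.squeeze(0)

proto = MultiHopPrototype(32)(torch.randn(5, 32))
print(proto.shape)  # torch.Size([32])
```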
With increasing information from social media, more and more videos are available. Therefore, the ability to reason over a video is important and deserves discussion. The Dialog System Technology Challenge (DSTC7) (Yoshino et al. 2018) proposed an Audio Visual Scene-aware Dialog (AVSD) task, which contains five modalities, including video, dialogue history, summary, and caption, as a scene-aware environment. In this paper, we propose the entropy-enhanced dynamic memory network (DMN) to effectively model the video modality. The attention-based GRU in the proposed model can improve the model's ability to comprehend and memorize sequential information. The entropy mechanism sharpens the attention distribution, so each to-be-answered question can focus more specifically on a small set of video segments. After the entropy-enhanced DMN secures the video context, we apply an attention model that incorporates the summary and caption to generate an accurate answer to the question about the video. In the official evaluation, our system achieves improved performance over the released baseline model on both subjective and objective evaluation metrics.
https://arxiv.org/abs/1908.08191
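The entropy mechanism can be illustrated as a regularizer (the loss weighting and exact formulation are assumptions): adding the attention distribution's entropy to the loss pushes the weights toward a peak, so each question concentrates on a few video segments.

```python
import torch

def attention_entropy_penalty(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # attn: (B, T) softmax weights over T video segments; lower entropy = sharper
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

attn = torch.softmax(torch.randn(4, 20), dim=-1)
qa_loss = torch.tensor(0.0)        # stand-in for the answer-generation loss
total = qa_loss + 0.1 * attention_entropy_penalty(attn)  # 0.1 weight is assumed
```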
Template-matching methods for visual tracking have gained popularity recently due to their good performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target's appearance variations during tracking. The reading and writing process of the external memory is controlled by an LSTM network with the search feature map as input. A spatial attention mechanism is applied to concentrate the LSTM input on the potential target as the location of the target is at first unknown. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. In order to alleviate the drift problem, we also design a "negative" memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost the tracking performance, an auxiliary classification loss is added after the feature extractor part. Unlike tracking-by-detection methods where the object's information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as with other trackers --- the capacity can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art tracking methods while retaining real-time speed.
https://arxiv.org/abs/1907.07613
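The "negative" memory unit can be sketched as follows (a simplification; the paper's response fusion may differ): cross-correlate the search features with both the object template and a stored distractor template, then subtract the distractor response to cancel wrong peaks.

```python
import torch
import torch.nn.functional as F

def response_with_negative_memory(search, pos_template, neg_template):
    """Sketch of the negative-memory idea: the distractor template's
    correlation response is subtracted from the object template's response
    so that distractor peaks are suppressed (plain subtraction is our
    simplification)."""
    # search: (1, C, H, W); templates: (1, C, h, w)
    pos = F.conv2d(search, pos_template)   # correlation with object template
    neg = F.conv2d(search, neg_template)   # correlation with distractor
    return pos - neg.clamp(min=0)          # cancel only positive wrong peaks

search = torch.randn(1, 32, 22, 22)
tmpl, neg = torch.randn(1, 32, 6, 6), torch.randn(1, 32, 6, 6)
print(response_with_negative_memory(search, tmpl, neg).shape)
# torch.Size([1, 1, 17, 17])
```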
Dialogue Act (DA) classification is a challenging problem in dialogue interpretation that aims to attach semantic labels to utterances and characterize the speaker's intention. Many existing approaches formulate DA classification as anything from multi-class classification to structured prediction, and they suffer from two limitations: (a) the methods either rely on handcrafted features or have limited memory; (b) adversarial examples cannot be correctly classified under traditional training. To address these issues, in this paper we first cast the problem as question answering and propose an improved dynamic memory network with a hierarchical pyramidal utterance encoder. Moreover, we apply adversarial training to train the proposed model. We evaluate our model on two public datasets, the Switchboard Dialogue Act corpus and the MapTask corpus. Extensive experiments show that our proposed model is not only robust but also achieves better performance than several state-of-the-art baselines.
https://arxiv.org/abs/1811.05021
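The abstract does not say which adversarial training variant is used; a common embedding-level choice, FGM-style perturbation along the loss gradient, is sketched below as one plausible instantiation.

```python
import torch

def adversarial_embedding(emb: torch.Tensor, loss: torch.Tensor,
                          epsilon: float = 1.0) -> torch.Tensor:
    """FGM-style adversarial perturbation on input embeddings (one common
    scheme; the paper's exact variant is an assumption). The perturbation
    follows the loss gradient, normalized to an epsilon-ball."""
    grad, = torch.autograd.grad(loss, emb, retain_graph=True)
    return emb + epsilon * grad / (grad.norm() + 1e-12)

emb = torch.randn(4, 10, 64, requires_grad=True)
loss = emb.sum()                        # stand-in for the DA classification loss
adv = adversarial_embedding(emb, loss)  # a second forward pass would use `adv`
```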
The task of event detection involves identifying and categorizing event triggers. Contextual information has been shown to be effective for the task. However, existing methods that utilize contextual information only process the context once. We argue that the context can be better exploited by processing it multiple times, allowing the model to perform complex reasoning and to generate a better context representation, thus improving overall performance. Meanwhile, the dynamic memory network (DMN) has demonstrated promising capability in capturing contextual information and has been applied successfully to various tasks. In light of the multi-hop mechanism the DMN uses to model context, we propose the trigger detection dynamic memory network (TD-DMN) to tackle the event detection problem. We performed five-fold cross-validation on the ACE-2005 dataset, and the experimental results show that the multi-hop mechanism does improve performance, with the proposed model achieving the best $F_1$ score compared to state-of-the-art methods.
https://arxiv.org/abs/1810.03449
Template-matching methods for visual tracking have gained popularity recently due to their comparable performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target's appearance variations during tracking. An LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing process of the memory block. As the location of the target is at first unknown in the search feature map, an attention mechanism is applied to concentrate the LSTM input on the potential target. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. Unlike tracking-by-detection methods where the object's information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, unlike other tracking methods where the model capacity is fixed after offline training --- the capacity of our tracker can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on OTB and VOT demonstrate that our tracker MemTrack performs favorably against state-of-the-art tracking methods while retaining a real-time speed of 50 fps.
https://arxiv.org/abs/1803.07268
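Gated residual template learning can be sketched in a few lines (the gate's parameterization is an assumption): only a gated fraction of the memory-retrieved template is added to the fixed initial template, damping aggressive model updates.

```python
import torch

def gated_residual_template(initial: torch.Tensor, retrieved: torch.Tensor,
                            gate_logits: torch.Tensor) -> torch.Tensor:
    """Gated residual template learning sketch: a sigmoid gate controls how
    much of the memory-retrieved template is combined with the fixed initial
    template. All tensors: (C, h, w); the gate source is assumed."""
    gate = torch.sigmoid(gate_logits)          # elementwise residual gate
    return initial + gate * retrieved          # conservative template update

init = torch.randn(32, 6, 6)
final = gated_residual_template(init, torch.randn(32, 6, 6),
                                torch.zeros(32, 6, 6))
print(final.shape)  # torch.Size([32, 6, 6])
```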
Working memory is an essential component of reasoning -- the capacity to answer a new question by manipulating acquired knowledge. Current memory-augmented neural networks offer a differentiable method to realize limited reasoning with the support of a working memory module. Memory modules are often implemented as a set of memory slots without explicit relational exchange of content, which does not naturally match multi-relational domains in which data is structured. We design a new model, dubbed the Relational Dynamic Memory Network (RDMN), to fill this gap. The memory can have a single component or multiple components, each of which realizes a multi-relational graph of memory slots. The memory is dynamically updated during the reasoning process under the control of a central controller. We evaluate the capability of RDMN on several important application domains: software vulnerability, molecular bioactivity, and chemical reaction. The results demonstrate the efficacy of the proposed model.
https://arxiv.org/abs/1808.04247
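One RDMN-style reasoning step might look like the following sketch (our simplification, with illustrative sizes): memory slots exchange messages along the relational graph, and a central controller is updated from an attention readout of the slots.

```python
import torch
import torch.nn as nn

class RelationalMemoryStep(nn.Module):
    """Sketch of one relational-memory reasoning step (a simplification of
    RDMN): slots on a relational graph exchange messages along its edges,
    then a central controller reads the slots by attention and updates."""
    def __init__(self, d: int):
        super().__init__()
        self.msg = nn.Linear(d, d)
        self.ctrl = nn.GRUCell(d, d)

    def forward(self, slots, adj, controller):
        # slots: (S, d); adj: (S, S) relation matrix; controller: (1, d)
        slots = slots + torch.relu(adj @ self.msg(slots))  # relational exchange
        attn = torch.softmax(slots @ controller.squeeze(0), dim=0)
        read = (attn.unsqueeze(1) * slots).sum(0, keepdim=True)
        return slots, self.ctrl(read, controller)          # updated state

slots, ctrl = torch.randn(7, 32), torch.randn(1, 32)
adj = (torch.rand(7, 7) > 0.5).float()
slots, ctrl = RelationalMemoryStep(32)(slots, adj, ctrl)
```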
Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that video QA has three unique attributes compared with image QA: (1) it deals with long sequences of images containing richer information not only in quantity but also in variety; (2) motion and appearance information are usually correlated with each other and can provide useful attention cues to each other; (3) different questions require different numbers of frames to infer the answer. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our network builds on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms for video QA. Specifically, there are three salient aspects: (1) a co-memory attention mechanism that utilizes cues from both motion and appearance to generate attention; (2) a temporal conv-deconv network to generate multi-level contextual facts; (3) a dynamic fact ensemble method to construct temporal representations dynamically for different questions. We evaluate our method on the TGIF-QA dataset, and our results significantly outperform the state of the art on all four TGIF-QA tasks.
https://arxiv.org/abs/1803.10906
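The co-memory attention can be sketched at a single fact level (the paper's conv-deconv multi-level facts and dynamic fact ensemble are omitted here): the motion memory produces attention cues over appearance facts and vice versa.

```python
import torch

def co_memory_attention(motion, appearance, m_mot, m_app):
    """Sketch of co-memory attention: each stream's memory generates the
    attention for the other stream, so motion and appearance guide each
    other (single-level facts only; a simplification)."""
    # motion/appearance: (T, d) per-frame facts; m_mot/m_app: (d,) memories
    a_app = torch.softmax(appearance @ m_mot, dim=0)  # motion cues -> appearance
    a_mot = torch.softmax(motion @ m_app, dim=0)      # appearance cues -> motion
    read_app = (a_app.unsqueeze(1) * appearance).sum(0)
    read_mot = (a_mot.unsqueeze(1) * motion).sum(0)
    return read_mot, read_app

r_m, r_a = co_memory_attention(torch.randn(16, 64), torch.randn(16, 64),
                               torch.randn(64), torch.randn(64))
print(r_m.shape, r_a.shape)  # torch.Size([64]) torch.Size([64])
```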
Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language. Most current algorithms are incapable of answering open-domain questions that require reasoning beyond the image contents. To address this issue, we propose a novel framework that endows the model with the capability to answer more complex questions by leveraging massive external knowledge through dynamic memory networks. Specifically, the questions, along with the corresponding images, trigger a process to retrieve relevant information from external knowledge bases, which is embedded into a continuous vector space that preserves entity-relation structures. Afterwards, we employ dynamic memory networks to attend to the large body of facts in the knowledge graph and the images, and then perform reasoning over these facts to generate corresponding answers. Extensive experiments demonstrate that our model not only achieves state-of-the-art performance on the visual question answering task but can also answer open-domain questions effectively by leveraging external knowledge.
https://arxiv.org/abs/1712.00733
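A hedged sketch of two key pieces of this pipeline: a TransE-style score as one common entity-relation-preserving embedding (the paper's exact choice is not stated in the abstract), and a single attention readout over retrieved facts with a joint question+image query (the iterative episodic updates are collapsed into one pass).

```python
import torch

def transe_score(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # A triple (head, relation, tail) is plausible when h + r lands near t.
    return -(h + r - t).norm(p=1, dim=-1)

def attend_facts(question: torch.Tensor, image: torch.Tensor,
                 facts: torch.Tensor) -> torch.Tensor:
    query = question + image                    # simple multimodal fusion (assumed)
    attn = torch.softmax(facts @ query, dim=0)  # relevance of each retrieved fact
    return (attn.unsqueeze(1) * facts).sum(0)   # knowledge context for answering

h, r, t = (torch.randn(64) for _ in range(3))
print(transe_score(h, r, t))                    # higher = more plausible triple
ctx = attend_facts(torch.randn(64), torch.randn(64), torch.randn(40, 64))
```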
We examine Memory Networks for the task of question answering (QA) under the common real-world scenario where training examples are scarce and under the weakly supervised scenario where only extrinsic labels are available for training. We propose extensions to the Dynamic Memory Network (DMN), specifically within the attention mechanism, and call the resulting neural architecture the Dynamic Memory Tensor Network (DMTN). Ultimately, our proposed extensions result in over 80% improvement in the number of tasks passed against the baseline standard DMN, and 20% more tasks passed compared to the state-of-the-art End-to-End Memory Network, on Facebook's weakly supervised single-task 1K bAbI dataset.
https://arxiv.org/abs/1703.03939
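The abstract only says the attention mechanism is extended; one tensor-style option, a Neural-Tensor-Network bilinear score between a fact and the memory, is sketched below purely for illustration (the sizes and form are not the paper's).

```python
import torch
import torch.nn as nn

class TensorAttentionScore(nn.Module):
    """Illustrative tensor-style attention score: k bilinear slices relate
    each fact f to the memory m, then a linear layer maps the k interaction
    scores to a scalar attention logit per fact."""
    def __init__(self, d: int, k: int = 4):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, d, d) / d ** 0.5)
        self.out = nn.Linear(k, 1)

    def forward(self, f: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # f: (T, d) facts, m: (d,) memory; bilinear score per slice of W
        bilinear = torch.einsum('td,kde,e->tk', f, self.W, m)  # (T, k)
        return self.out(torch.tanh(bilinear)).squeeze(-1)      # (T,)

scores = TensorAttentionScore(32)(torch.randn(6, 32), torch.randn(32))
print(torch.softmax(scores, dim=0).shape)  # torch.Size([6])
```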
Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook's bAbI dataset), text classification for sentiment analysis (Stanford Sentiment Treebank) and sequence modeling for part-of-speech tagging (WSJ-PTB). The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.
https://arxiv.org/abs/1506.07285
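The episodic memory module at the heart of the DMN can be condensed as follows (the gate's feature set is abridged from the paper's, and the soft attention-weighted episode is a simplification of the gated GRU over facts):

```python
import torch
import torch.nn as nn

class EpisodicMemory(nn.Module):
    """Condensed DMN episodic memory sketch: a gate scored from interactions
    between each fact f, the question q, and the current memory m selects
    relevant facts; the episode then updates the memory with a GRU."""
    def __init__(self, d: int, passes: int = 3):
        super().__init__()
        self.passes = passes
        self.gate = nn.Sequential(nn.Linear(4 * d, d), nn.Tanh(),
                                  nn.Linear(d, 1))
        self.mem = nn.GRUCell(d, d)

    def forward(self, facts: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # facts: (T, d), q: (d,)
        m = q.clone()                                  # memory starts at question
        for _ in range(self.passes):
            z = torch.cat([facts * q, facts * m,
                           (facts - q).abs(), (facts - m).abs()], dim=-1)
            g = torch.softmax(self.gate(z).squeeze(-1), dim=0)      # (T,)
            episode = (g.unsqueeze(1) * facts).sum(0, keepdim=True)
            m = self.mem(episode, m.unsqueeze(0)).squeeze(0)        # new memory
        return m

m = EpisodicMemory(32)(torch.randn(8, 32), torch.randn(32))
print(m.shape)  # torch.Size([32])
```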
Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. One such architecture, the dynamic memory network (DMN), obtained high accuracy on a variety of language tasks. However, it was not shown whether the architecture achieves strong results for question answering when supporting facts are not marked during training or whether it could be applied to other modalities such as images. Based on an analysis of the DMN, we propose several improvements to its memory and input modules. Together with these changes we introduce a novel input module for images in order to be able to answer visual questions. Our new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.
https://arxiv.org/abs/1603.01417
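One of the DMN+ changes, the attention-based GRU, replaces the GRU's update gate with each fact's scalar attention gate, so the episode summary keeps the ordering information that a plain weighted sum over facts would lose. A sketch:

```python
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    """Sketch of the DMN+ attention-based GRU: the scalar attention gate g_i
    of each fact replaces the usual update gate, giving
    h_i = g_i * h~_i + (1 - g_i) * h_{i-1}."""
    def __init__(self, d: int):
        super().__init__()
        self.Wr = nn.Linear(2 * d, d)   # reset gate from [fact, h]
        self.Wh = nn.Linear(2 * d, d)   # candidate state

    def forward(self, facts: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
        # facts: (T, d); gates: (T,) attention weights, one per fact
        h = torch.zeros(facts.size(1))
        for f, g in zip(facts, gates):
            r = torch.sigmoid(self.Wr(torch.cat([f, h])))
            h_tilde = torch.tanh(self.Wh(torch.cat([f, r * h])))
            h = g * h_tilde + (1 - g) * h   # attention gate replaces update gate
        return h                            # final hidden state = episode

ep = AttentionGRU(32)(torch.randn(8, 32), torch.softmax(torch.randn(8), 0))
print(ep.shape)  # torch.Size([32])
```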