LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

2021-11-29 14:18:47

Jingjing Jiang, Ziyi Liu, Yifan Liu, Nanning Zheng

arXiv_CV


Abstract

Video Question Answering (VideoQA), which aims to correctly answer a given question based on an understanding of multi-modal video content, is challenging because video content is so rich. From the perspective of video understanding, a good VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate the diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first uses graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations. The resulting representations are then integrated by the proposed Diversity-aware Visual-Linguistic Reasoning module (DaVL). DaVL accounts for the differences among the types of representations and can flexibly adjust the importance of each type when generating the question-related joint representation, making it an effective and general method for integrating representations. The proposed LiVLR is lightweight and shows its superiority on two VideoQA benchmarks, MSRVTT-QA and KnowIT VQA. Extensive ablation studies demonstrate the effectiveness of LiVLR's key components.
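
The idea of adjusting the importance of different representation types when forming a question-related joint representation can be illustrated with a small sketch. This is not the DaVL module from the paper, whose details the abstract does not give. It is a generic question-conditioned softmax weighting, and every name in it (`fuse`, the representation labels, the scaled dot-product scoring) is an assumption made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(question, representations):
    """Weight each representation type by its relevance to the question.

    Hypothetical helper: `representations` maps a type name to a vector
    with the same dimension as `question`. Relevance is a scaled dot
    product (an assumption, not the paper's scoring function).
    Returns (weights per type, weighted-sum joint vector).
    """
    names = list(representations)
    scale = math.sqrt(len(question))
    scores = [dot(question, representations[n]) / scale for n in names]
    weights = softmax(scores)

    joint = [0.0] * len(question)
    for weight, name in zip(weights, names):
        for i, value in enumerate(representations[name]):
            joint[i] += weight * value
    return dict(zip(names, weights)), joint


# Toy example with made-up 2-d vectors for three representation types.
question = [1.0, 0.0]
reps = {
    "visual_appearance": [1.0, 0.0],
    "visual_motion": [0.0, 1.0],
    "linguistic": [-1.0, 0.0],
}
weights, joint = fuse(question, reps)
```

In this toy setup the type whose vector aligns best with the question receives the largest weight, so it contributes most to the joint vector. A learned module would replace the fixed dot-product score with trainable parameters.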

URL

https://arxiv.org/abs/2111.14547

PDF

https://arxiv.org/pdf/2111.14547.pdf