Paper Reading AI Learner

Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting

2024-04-19 10:43:25
Fengyi Fu, Shancheng Fang, Weidong Chen, Zhendong Mao

Abstract

Automatic live video commenting has attracted increasing attention due to its significance in narration generation, topic explanation, etc. However, current methods do not account for the diverse sentiments of the generated comments. Sentiment is critical in interactive commenting, yet it has received little research attention so far. Thus, in this paper, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch attention module, to achieve diverse video commenting with multiple sentiments and multiple semantics. Specifically, our sentiment-oriented diversity encoder elegantly combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance; its output is then fused with cross-modal features to generate live video comments. Furthermore, a batch attention module is proposed to alleviate the problem of missing sentiment samples caused by data imbalance, which is common for live videos since video popularity varies widely. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms state-of-the-art methods in both the quality and the diversity of the generated comments. Related code is available at this https URL.
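
The sentiment-oriented diversity encoder is described only at a high level in the abstract. Below is a minimal PyTorch-style sketch of how a sentiment-conditioned VAE encoder with a random mask mechanism might be wired up; every name (SentimentDiversityEncoder, mask_ratio, latent_dim, and so on) is an illustrative assumption, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class SentimentDiversityEncoder(nn.Module):
        """Hypothetical sketch: VAE encoder with random masking and sentiment guidance."""
        def __init__(self, d_model=512, n_sentiments=3, latent_dim=128, mask_ratio=0.15):
            super().__init__()
            self.sentiment_emb = nn.Embedding(n_sentiments, d_model)  # sentiment guidance signal
            self.mask_ratio = mask_ratio
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.to_mu = nn.Linear(d_model, latent_dim)      # VAE mean head
            self.to_logvar = nn.Linear(d_model, latent_dim)  # VAE log-variance head

        def forward(self, comment_emb, sentiment_id):
            # comment_emb: (B, T, d_model) token embeddings; sentiment_id: (B,) class ids.
            if self.training:
                # Randomly drop a fraction of token embeddings to encourage semantic diversity.
                keep = (torch.rand(comment_emb.shape[:2], device=comment_emb.device)
                        > self.mask_ratio).unsqueeze(-1)
                comment_emb = comment_emb * keep
            # Add the sentiment embedding as a conditioning signal on every token.
            h = comment_emb + self.sentiment_emb(sentiment_id).unsqueeze(1)
            h = self.encoder(h).mean(dim=1)  # pool over the token dimension
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Reparameterization trick: z ~ N(mu, sigma^2); in the paper's pipeline the
            # latent code is then fused with cross-modal features to decode a comment.
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            return z, mu, logvar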
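The batch attention module is likewise only named in the abstract. A hedged sketch of one plausible reading, in which each sample attends to the other samples in its mini-batch so that rare-sentiment samples can borrow related features, follows; the BatchAttention class and the treatment of the batch as an attention sequence are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    class BatchAttention(nn.Module):
        """Hypothetical sketch: cross-sample attention over the mini-batch."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, feats):
            # feats: (B, d_model) pooled per-sample features.
            seq = feats.unsqueeze(0)             # view the batch as a length-B sequence
            mixed, _ = self.attn(seq, seq, seq)  # every sample attends to all others
            return self.norm(feats + mixed.squeeze(0))  # residual + layer norm

Under this reading, a minority-sentiment sample whose own features are weak can still be refined by attending to majority-sentiment samples in the same batch, which is one way the missing-sample problem could be alleviated.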

URL

https://arxiv.org/abs/2404.12782

PDF

https://arxiv.org/pdf/2404.12782.pdf

