Paper Reading AI Learner

Game-MUG: Multimodal Oriented Game Situation Understanding and Commentary Generation Dataset

2024-04-30 00:39:26
Zhihao Zhang, Feiqi Cao, Yingbin Mo, Yiran Zhang, Josiah Poon, Caren Han

Abstract

The dynamic nature of esports makes game situations difficult for average viewers to follow. Esports broadcasting relies on expert game casters, but caster-dependent commentary alone is not enough to fully convey the game situation. Understanding becomes richer when diverse multimodal esports information is included, such as audience talk/emotions, game audio, and game match event information. This paper introduces GAME-MUG, a new multimodal dataset for game situation understanding and audience-engaged commentary generation, together with a strong baseline. Our dataset is collected from 2020-2022 LOL game live streams on YouTube and Twitch, and includes multimodal esports game information, namely text, audio, and time-series event logs, for detecting the game situation. In addition, we propose a new audience-conversation-augmented commentary dataset that covers both game situation and audience conversation understanding, and we introduce a robust joint multimodal dual learning model as a baseline. We examine the model's game situation/event understanding ability and its commentary generation capability to show the effectiveness of the multimodal coverage and the joint integration learning approach.

Abstract (translated)

The dynamic nature of esports makes the situation complicated for average viewers. Esports live streams involve expert game casters, but relying solely on caster commentary is not enough to fully understand the game situation. It can be enriched by including diverse multimodal esports information, including audience talk/emotions, game audio, and game match event information. This paper introduces GAME-MUG, a new multimodal game situation understanding and audience-engaged commentary generation dataset, along with a strong baseline. Our data is collected from 2020-2022 League of Legends (LOL) live streams on YouTube and Twitch, and includes multimodal esports game information, including text, audio, and time-series event logs, for detecting the game situation. In addition, we also propose a new audience conversation augmented commentary dataset covering game situation and audience conversation understanding, and introduce a robust joint multimodal dual learning model as a baseline. We examine the model's game situation/event understanding ability and commentary generation capability to demonstrate the effectiveness of the multimodal coverage and the joint integration learning approach.

URL

https://arxiv.org/abs/2404.19175

PDF

https://arxiv.org/pdf/2404.19175.pdf
