Early Multimodal Prediction of Cross-Lingual Meme Virality on Reddit: A Time-Window Analysis

2025-10-07 10:27:36
Sedat Dogan, Nina Dethlefs, Debarati Chakraborty

Abstract

Predicting the virality of online content remains challenging, especially for culturally complex, fast-evolving memes. This study investigates the feasibility of early prediction of meme virality using a large-scale, cross-lingual dataset from 25 diverse Reddit communities. We propose a robust, data-driven method that defines virality through a hybrid engagement score, learning a percentile-based threshold from a chronologically held-out training set to prevent data leakage. We evaluate a suite of models, including Logistic Regression, XGBoost, and a Multi-Layer Perceptron (MLP), with a comprehensive multimodal feature set across increasing time windows (30-420 min). Crucially, useful signals emerge quickly: our best-performing model, XGBoost, achieves a PR-AUC > 0.52 within just 30 minutes. Our analysis reveals a clear "evidentiary transition," in which feature importance shifts dynamically from static context to temporal dynamics as a meme gains traction. This work establishes a robust, interpretable, and practical benchmark for early virality prediction in scenarios where full diffusion cascade data is unavailable, contributing a novel cross-lingual dataset and a methodologically sound definition of virality. To our knowledge, this study is the first to combine time series data with static content and network features to predict early meme virality.
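
The leakage-free labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each post already has a precomputed engagement score (the abstract does not give the hybrid score's formula), and the function names, the 80/20 split, and the 75th-percentile cutoff are all hypothetical choices made for illustration. The point it shows is that the threshold is fitted on the older (training) posts only and then applied unchanged to the newer posts.

```python
def chronological_split(posts, train_frac=0.8):
    """Split (timestamp, score) pairs so every training post is older than every test post."""
    ordered = sorted(posts, key=lambda p: p[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def label_viral(scores, threshold):
    """Label a post viral (1) when its score meets the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


# Toy data: (timestamp, engagement score). Scores are assumed precomputed.
posts = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0),
         (5, 6.0), (6, 7.0), (7, 8.0), (8, 50.0), (9, 3.0)]

train, test = chronological_split(posts, train_frac=0.8)

# The threshold is learned from training scores only; test scores never influence it.
threshold = percentile([score for _, score in train], 75.0)
test_labels = label_viral([score for _, score in test], threshold)
```

Computing the percentile over all ten posts instead would let the late outlier (score 50.0) shift the cutoff, which is the kind of leakage a chronological split is meant to avoid.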

URL

https://arxiv.org/abs/2510.05761

PDF

https://arxiv.org/pdf/2510.05761.pdf

