Abstract
Emoticons are symbolic representations that generally accompany textual content to visually enhance or summarize the true intention of a written message. Although widely used on social media, the core semantics of these emoticons have not been extensively explored across multiple modalities. Combining textual and visual information within a single message offers a richer way of conveying meaning. Hence, this research aims to analyze the relationship among sentences, visuals, and emoticons. For an orderly exposition, this paper first provides a detailed examination of various techniques for extracting multimodal features, emphasizing the pros and cons of each method. Building on a comprehensive examination of several multimodal algorithms, with specific emphasis on fusion approaches, we propose a novel contrastive learning based multimodal architecture. The proposed model jointly trains a dual-branch encoder with contrastive learning to accurately map text and images into a common latent space. Our key finding is that integrating contrastive learning with the two encoder branches yields superior results. The experimental results demonstrate that our methodology surpasses existing multimodal approaches in terms of accuracy and robustness. The proposed model attained an accuracy of 91% and an MCC score of 90% when assessing emoticons on the Multimodal-Twitter Emoticon dataset acquired from Twitter. We provide evidence that deep features acquired through contrastive learning are more efficient, suggesting that the proposed fusion technique also possesses strong generalisation capabilities for recognising emoticons across multiple modalities.
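To make the dual-branch idea concrete, the sketch below shows one common way such an architecture can be set up: two projection heads map pre-extracted text and image features into a shared latent space, and a symmetric InfoNCE-style contrastive loss pulls matching text-image pairs together while pushing apart mismatched pairs in the batch. This is a minimal illustration under assumed settings, not the authors' implementation; the feature dimensions, projection sizes, temperature, and the omission of the downstream emoticon classification head are all assumptions not given in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchContrastiveModel(nn.Module):
    """Illustrative dual-branch encoder: text and image features are
    projected into a common latent space for contrastive training.
    All layer sizes below are assumptions, not values from the paper."""

    def __init__(self, text_dim=768, image_dim=2048, latent_dim=256, temperature=0.07):
        super().__init__()
        # Projection heads standing in for the paper's text and image branches.
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.image_proj = nn.Sequential(
            nn.Linear(image_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.temperature = temperature

    def forward(self, text_feats, image_feats):
        # L2-normalise so dot products become cosine similarities in the shared space.
        z_text = F.normalize(self.text_proj(text_feats), dim=-1)
        z_image = F.normalize(self.image_proj(image_feats), dim=-1)
        return z_text, z_image

    def contrastive_loss(self, z_text, z_image):
        # Symmetric InfoNCE: the matching text/image pair is the positive,
        # every other pair in the batch serves as a negative.
        logits = z_text @ z_image.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for pre-extracted embeddings.
model = DualBranchContrastiveModel()
text_feats = torch.randn(16, 768)
image_feats = torch.randn(16, 2048)
z_t, z_i = model(text_feats, image_feats)
loss = model.contrastive_loss(z_t, z_i)
loss.backward()
```

In a setup like this, the jointly learned latent space can then feed a small classifier for emoticon recognition; the contrastive objective serves as the alignment mechanism the abstract credits for the improved accuracy and robustness.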
URL
https://arxiv.org/abs/2408.02571