Audiovisual emotion recognition (ER) in videos offers immense potential over unimodal approaches, as it effectively leverages the inter- and intra-modal dependencies between the visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. This framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to relying on a single modality alone. The proposed model leverages separate backbones for capturing intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.
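Below is a minimal PyTorch sketch (not the authors' code) of the cross-attention fusion idea described above: each modality attends to the other before a joint transformer layer mixes the fused tokens. The module names, dimensions, and two-dimensional output head are illustrative assumptions.

```python
# A minimal sketch of key-based cross-attention fusion between audio and visual
# token sequences, using standard PyTorch modules (shapes are assumptions).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Visual tokens attend to audio tokens and vice versa.
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.joint = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # e.g. valence/arousal regression (assumed)

    def forward(self, vis, aud):
        # vis: (B, Tv, dim) visual embeddings, aud: (B, Ta, dim) audio embeddings
        v, _ = self.v_from_a(query=vis, key=aud, value=aud)   # inter-modal attention
        a, _ = self.a_from_v(query=aud, key=vis, value=vis)
        joint = self.joint(torch.cat([v, a], dim=1))          # joint intra/inter mixing
        return self.head(joint.mean(dim=1))                   # pooled prediction

fused = CrossModalFusion()
out = fused(torch.randn(8, 16, 256), torch.randn(8, 32, 256))  # -> (8, 2)
```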
https://arxiv.org/abs/2403.10488
Human affective behavior analysis aims to delve into human expressions and behaviors to deepen our understanding of human emotions. Basic expression categories (EXPR) and Action Units (AUs) are two essential components in this analysis, which categorize emotions and break down facial movements into elemental units, respectively. Despite advancements, existing approaches in expression classification and AU detection often necessitate complex models and substantial computational resources, limiting their applicability in everyday settings. In this work, we introduce the first lightweight framework adept at efficiently tackling both expression classification and AU detection. This framework employs a frozen CLIP image encoder alongside a trainable multilayer perceptron (MLP), enhanced with Conditional Value at Risk (CVaR) for robustness and a loss landscape flattening strategy for improved generalization. Experimental results on the Aff-wild2 dataset demonstrate superior performance in comparison to the baseline while maintaining minimal computational demands, offering a practical solution for affective behavior analysis. The code is available at this https URL
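A hedged sketch of how the CVaR component could look on top of frozen CLIP features: instead of the average loss, only the worst alpha-fraction of per-sample losses in a batch is minimized, which is one common way CVaR-style robustness is implemented. The MLP shape, the alpha value, and the eight-class setup are assumptions, not the paper's exact configuration.

```python
# Sketch: CVaR-style objective over a trainable MLP head on frozen CLIP features.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cvar_loss(logits, targets, alpha=0.25):
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(alpha * per_sample.numel()))
    worst, _ = torch.topk(per_sample, k)   # hardest alpha-fraction of the batch
    return worst.mean()

# Trainable MLP on (assumed precomputed, frozen) CLIP image features.
mlp = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 8))  # 8 expression classes (assumed)
feats, labels = torch.randn(64, 512), torch.randint(0, 8, (64,))
loss = cvar_loss(mlp(feats), labels)
loss.backward()
```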
https://arxiv.org/abs/2403.09915
We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from the behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses the predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods, with up to a 7.4% improvement in mean per joint position error. Using head direction as a proxy for gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information.
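The following is an illustrative sketch, under stated assumptions, of a residual graph convolution over a gaze-pose graph in which body joints plus a gaze node form the vertices; the node count, feature size, and learnable adjacency are hypothetical and not taken from GazeMotion.

```python
# Residual graph convolution over a gaze-pose graph (assumed node layout).
import torch
import torch.nn as nn

class ResidualGraphConv(nn.Module):
    def __init__(self, num_nodes=22, dim=64):  # e.g. 21 body joints + 1 gaze node (assumed)
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))
        self.lin = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (B, num_nodes, dim)
        mixed = torch.einsum("ij,bjd->bid", self.adj, x)   # learnable adjacency mixing
        return x + self.act(self.lin(mixed))               # residual connection

layer = ResidualGraphConv()
y = layer(torch.randn(4, 22, 64))         # -> (4, 22, 64)
```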
https://arxiv.org/abs/2403.09885
This manuscript presents a methodical examination of the utilization of Artificial Intelligence in the assessment of emotions in texts related to healthcare, with a particular focus on the incorporation of Natural Language Processing and deep learning technologies. We scrutinize numerous research studies that employ AI to augment sentiment analysis, categorize emotions, and forecast patient outcomes based on textual information derived from clinical narratives, patient feedback on medications, and online health discussions. The review demonstrates noteworthy progress in the precision of algorithms used for sentiment classification, the prognostic capabilities of AI models for neurodegenerative diseases, and the creation of AI-powered systems that offer support in clinical decision-making. Remarkably, the utilization of AI applications has exhibited an enhancement in personalized therapy plans by integrating patient sentiment and contributing to the early identification of mental health disorders. There persist challenges, which encompass ensuring the ethical application of AI, safeguarding patient confidentiality, and addressing potential biases in algorithmic procedures. Nevertheless, the potential of AI to revolutionize healthcare practices is unmistakable, offering a future where healthcare is not only more knowledgeable and efficient but also more empathetic and centered around the needs of patients. This investigation underscores the transformative influence of AI on healthcare, delivering a comprehensive comprehension of its role in examining emotional content in healthcare texts and highlighting the trajectory towards a more compassionate approach to patient care. The findings advocate for a harmonious synergy between AI's analytical capabilities and the human aspects of healthcare.
https://arxiv.org/abs/2403.09762
Drawing is an art that enables people to express their imagination and emotions. However, individuals usually face challenges in drawing, especially when translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista - a novel system integrating AR and generative AI technologies. ARtVista not only recommends reference images aligned with users' abstract ideas and generates sketches for users to draw but also goes beyond, crafting vibrant paintings in various painting styles. ARtVista also offers users an alternative approach to create striking paintings by simulating the paint-by-number concept on reference images, empowering users to create visually stunning artwork devoid of the necessity for advanced drawing skills. We perform a pilot study and reveal positive feedback on its usability, emphasizing its effectiveness in visualizing user ideas and aiding the painting process to achieve stunning pictures without requiring advanced drawing skills. The source code will be available at this https URL.
https://arxiv.org/abs/2403.08876
In recent years, there have been frequent incidents of foreign objects intruding onto railways and airport runways. These objects can include pedestrians, vehicles, animals, and debris. This paper introduces an improved YOLOv5 architecture incorporating FasterNet and attention mechanisms to enhance the detection of foreign objects on railways and airport runways. This study proposes a new dataset, AARFOD (Aero and Rail Foreign Object Detection), which combines two public datasets for detecting foreign objects in aviation and railway systems. The dataset aims to improve the recognition capabilities of foreign object targets. Experimental results on this large dataset demonstrate significant performance improvements of the proposed model over the baseline YOLOv5 model while reducing computational requirements: the improved YOLO model shows a 1.2% improvement in precision, a 1.0% improvement in recall, and a 0.6% improvement in mAP@.5, while mAP@.5-.95 remained unchanged. The number of parameters was reduced by approximately 25.12%, and GFLOPs were reduced by about 10.63%. In the ablation experiments, we find that the FasterNet module significantly reduces the number of model parameters, and the introduction of the attention mechanism mitigates the performance loss caused by the lightweight design.
https://arxiv.org/abs/2403.08511
Class imbalance is one of the most prevalent and important issues in the domain of classification. In this paper, we present a new generalized framework with an Adaptive Weight function for the soft-margin Weighted SVM (AW-WSVM), which aims to address the imbalance and outlier sensitivity of the standard support vector machine (SVM) for classifying two-class data. The weight coefficient is introduced into the unconstrained soft-margin support vector machine, and the sample weights are updated before each round of training. The Adaptive Weight function (AW function) is constructed from the distance between the samples and the decision hyperplane, assigning a different weight to each sample. A weight update method is proposed that takes into account the proximity of the support vectors to the decision hyperplane. Before training, the weights of the samples are initialized according to their categories. Subsequently, the samples close to the decision hyperplane are identified and assigned larger weights, while lower weights are assigned to samples that are far from the decision hyperplane. Furthermore, we also put forward an effective way to eliminate noise. To evaluate the strength of the proposed generalized framework, we conducted experiments on standard datasets and emotion classification datasets with different imbalance ratios (IR). The experimental results show that the proposed generalized framework outperforms in terms of accuracy, recall, and G-mean, validating the effectiveness of the weighting strategy provided in this paper in enhancing support vector machines.
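A small sketch of the reweighting loop under explicit assumptions: since the exact AW function is not reproduced here, a Gaussian of the distance to the decision hyperplane is used so that samples near the boundary receive larger weights, and the weights are refreshed before each retraining round via scikit-learn's sample_weight mechanism.

```python
# Hedged sketch: iterative sample reweighting for a linear SVM based on the
# distance to the decision hyperplane (the AW function form is an assumption).
import numpy as np
from sklearn.svm import LinearSVC

def adaptive_weights(svm, X, y, sigma=1.0, pos_init=1.0, neg_init=1.0):
    w0 = np.where(y == 1, pos_init, neg_init)          # class-dependent initialization
    d = np.abs(svm.decision_function(X))                # margin distance to the hyperplane
    return w0 * np.exp(-(d ** 2) / (2 * sigma ** 2))    # emphasize near-boundary samples

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0.8).astype(int)   # imbalanced toy labels

svm = LinearSVC(dual=False).fit(X, y)                   # initial fit, uniform weights
for _ in range(3):                                      # reweight and retrain
    w = adaptive_weights(svm, X, y)
    svm = LinearSVC(dual=False).fit(X, y, sample_weight=w)
```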
https://arxiv.org/abs/2403.08378
Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce four new evaluation metrics to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, model, and data will be made public.
https://arxiv.org/abs/2403.08255
As more than 70% of the reviews in existing opinion summarization datasets are positive, current opinion summarization approaches are reluctant to generate negative summaries given negative input texts. To address such sentiment bias, a direct approach that does not over-rely on a specific framework is to generate additional data with large language models to balance the emotional distribution of the dataset. However, data augmentation based on large language models faces two disadvantages: 1) potential issues or toxicity in the augmented data; 2) expensive costs. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. Specifically, a small number of synthesized negative reviews is obtained by rewriting positive texts via a large language model. Then, a disentangled reconstruction model is trained on the generated data. After training, a large amount of synthetic data can be obtained by decoding the new representations obtained from combinations of different sample representations and filtering based on confusion degree and sentiment classification. Experiments show that our framework can alleviate sentiment bias as effectively as using only large models, but more economically.
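As a hedged illustration of the final filtering stage, the snippet below keeps only synthetic reviews that an off-the-shelf sentiment classifier labels as negative with high confidence; the confusion-degree criterion and the disentangled reconstruction model are not reproduced, and the model checkpoint is simply a convenient public one.

```python
# Keep only generated reviews that a sentiment classifier confidently calls negative.
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def keep_negative(candidates, threshold=0.9):
    results = classifier(candidates)
    return [text for text, r in zip(candidates, results)
            if r["label"] == "NEGATIVE" and r["score"] >= threshold]

synthetic = ["The battery died after two days and support never replied.",
             "Honestly a pleasant surprise, works exactly as advertised."]
print(keep_negative(synthetic))   # expected: only the first review survives
```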
https://arxiv.org/abs/2403.07693
Emotion recognition in conversation (ERC) is a task that predicts the emotion of an utterance in the context of a conversation. It depends tightly on the dialogue context, speaker identity information, multiparty dialogue scenarios, and so on. However, the state-of-the-art method (instructERC) only identifies the speaker and ignores the commonsense knowledge behind speakers during a conversation (i.e., the reactions of the listeners, the intention of the speaker, etc.), which could deeply mine speaker information. To this end, we propose CKERC, a novel framework that joins large language models with commonsense knowledge for emotion recognition in conversation. We design prompts to generate interlocutors' commonsense from historical utterances with a large language model, and we use the interlocutor commonsense identification task for LLM pre-training to fine-tune the speaker's implicit clues. By addressing the above challenge, our method achieves state-of-the-art results. Extensive experiments on three widely used datasets, i.e., IEMOCAP, MELD, and EmoryNLP, demonstrate our method's superiority. We also conduct an in-depth analysis and further demonstrate the effectiveness of commonsense knowledge for the ERC task with large language models.
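A minimal, assumed sketch of the kind of prompt such commonsense generation could use: ask an LLM to infer the listeners' reaction and the speaker's intention from the dialogue history. The wording and helper name are illustrative, not CKERC's actual prompts.

```python
# Build an illustrative commonsense-generation prompt from dialogue history.
def commonsense_prompt(history, speaker):
    dialogue = "\n".join(f"{s}: {u}" for s, u in history)
    return (
        "Given the conversation below, describe in one sentence each:\n"
        "(1) the likely reaction of the listeners, and\n"
        f"(2) the intention of the speaker {speaker} in the last utterance.\n\n"
        f"{dialogue}\n\nCommonsense:"
    )

history = [("Joey", "I got the part!"),
           ("Monica", "That's amazing, when do you start?")]
print(commonsense_prompt(history, speaker="Monica"))
```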
https://arxiv.org/abs/2403.07260
In this article, we present CuentosIE (TalesEI: chatbot of tales with a message to develop Emotional Intelligence), an educational chatbot on emotions that also provides teachers and psychologists with a tool to monitor their students/patients through indicators and data compiled by CuentosIE. The use of "tales with a message" is justified by their simplicity and easy understanding, thanks to their moral or associated metaphors. The main contributions of CuentosIE are the selection, collection, and classification of a set of highly specialized tales, as well as the provision of tools (searching, reading comprehension, chatting, recommending, and classifying) that are useful for both educating users about emotions and monitoring their emotional development. The preliminary evaluation of the tool has obtained encouraging results, which provides an affirmative answer to the question posed in the title of the article.
https://arxiv.org/abs/2403.07193
Remote photoplethysmography (rPPG) is a promising technology that captures physiological signals from face videos, with potential applications in medical health, emotional computing, and biosecurity recognition. The demand for rPPG tasks has expanded from demonstrating good performance on intra-dataset testing to cross-dataset testing (i.e., domain generalization). However, most existing methods have overlooked the prior knowledge of rPPG, resulting in poor generalization ability. In this paper, we propose a novel framework that simultaneously utilizes explicit and implicit prior knowledge in the rPPG task. Specifically, we systematically analyze the causes of noise sources (e.g., different camera, lighting, skin types, and movement) across different domains and incorporate these prior knowledge into the network. Additionally, we leverage a two-branch network to disentangle the physiological feature distribution from noises through implicit label correlation. Our extensive experiments demonstrate that the proposed method not only outperforms state-of-the-art methods on RGB cross-dataset evaluation but also generalizes well from RGB datasets to NIR datasets. The code is available at this https URL.
https://arxiv.org/abs/2403.06947
We propose FocusCLIP, integrating subject-level guidance--a specialized mechanism for target-specific supervision--into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our novel contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps emulating human visual attention mechanisms to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions to provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpassed CLIP by an average of 8.61% across five previously unseen datasets covering three human-centric tasks. FocusCLIP achieved an average accuracy of 33.65% compared to 25.04% by CLIP. We observed a 3.98% improvement in activity recognition, a 14.78% improvement in age classification, and a 7.06% improvement in emotion recognition. Moreover, using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. Furthermore, we also demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks. FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification using the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
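For context, here is a hedged sketch of the zero-shot CLIP setup that FocusCLIP builds on, scoring an image against emotion prompts with a stock CLIP checkpoint; the ROI-heatmap and pose-description branches that constitute the paper's contribution are not reproduced, and the image path and prompt template are placeholders.

```python
# Zero-shot emotion scoring with a standard CLIP model (placeholder inputs).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emotions = ["happy", "sad", "angry", "surprised"]
prompts = [f"a photo of a person who looks {e}" for e in emotions]
image = Image.open("person.jpg")   # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(emotions, probs[0].tolist())))
```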
https://arxiv.org/abs/2403.06904
The internet has brought both benefits and harms to society. A prime example of the latter is misinformation, including conspiracy theories, which flood the web. Recent advances in natural language processing, particularly the emergence of large language models (LLMs), have improved the prospects of accurate misinformation detection. However, most LLM-based approaches to conspiracy theory detection focus only on binary classification and fail to account for the important relationship between misinformation and affective features (i.e., sentiment and emotions). Driven by a comprehensive analysis of conspiracy text that reveals its distinctive affective features, we propose ConspEmoLLM, the first open-source LLM that integrates affective information and is able to perform diverse tasks relating to conspiracy theories. These tasks include not only conspiracy theory detection, but also classification of theory type and detection of related discussion (e.g., opinions towards theories). ConspEmoLLM is fine-tuned based on an emotion-oriented LLM using our novel ConDID dataset, which includes five tasks to support LLM instruction tuning and evaluation. We demonstrate that when applied to these tasks, ConspEmoLLM largely outperforms several open-source general domain LLMs and ChatGPT, as well as an LLM that has been fine-tuned using ConDID, but which does not use affective features. This project will be released on this https URL.
https://arxiv.org/abs/2403.06765
We present a data-driven control architecture for modifying the kinematics of robots and artificial avatars to encode specific information, such as the presence or absence of an emotion, in the movements of an avatar or robot driven by a human operator. We validate our approach on an experimental dataset obtained during the reach-to-grasp phase of a pick-and-place task.
https://arxiv.org/abs/2403.06557
Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar, we draw upon two critical insights from a human perspective: 1) The connection between audio and the non-deterministic facial dynamics, encompassing expressions, blinks, poses, should exhibit synchronous and one-to-many mapping. 2) Vibrant expressions are often accompanied by emotion-aware high-definition (HD) textures and finely detailed teeth. However, both aspects are frequently overlooked by existing methods. To this end, this paper proposes using normalizing Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently (FlowVQTalker). Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. The generation process commences with random sampling from the modeled distribution, guided by the accompanying audio, enabling both lip-synchronization and the uncertain nonverbal facial cues generation. Furthermore, our designed vector-quantization image generator treats the creation of expressive facial images as a code query task, utilizing a learned codebook to provide rich, high-quality textures that enhance the emotional perception of the results. Extensive experiments are conducted to showcase the effectiveness of our approach.
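Below is an illustrative sketch of the vector-quantization "code query" step: encoder features are mapped to their nearest entries in a learned codebook with a straight-through gradient. The codebook size, feature dimension, and surrounding encoder/decoder are assumptions rather than FlowVQTalker's implementation.

```python
# Nearest-codebook lookup with a straight-through estimator (assumed sizes).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, N, dim) encoder features
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, codes)                         # (B, N, num_codes)
        idx = dist.argmin(dim=-1)                            # nearest code per feature
        zq = self.codebook(idx)                              # quantized features
        # straight-through estimator so gradients flow back to the encoder
        return z + (zq - z).detach(), idx

vq = VectorQuantizer()
zq, codes = vq(torch.randn(2, 49, 64))                       # -> (2, 49, 64), (2, 49)
```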
https://arxiv.org/abs/2403.06375
Although automatically animating audio-driven talking heads has recently received growing interest, previous efforts have mainly concentrated on achieving lip synchronization with the audio, neglecting two crucial elements for generating expressive videos: emotion style and art style. In this paper, we present an innovative audio-driven talking face generation method called Style2Talker. It involves two stylized stages, namely Style-E and Style-A, which integrate text-controlled emotion style and picture-controlled art style into the final output. In order to prepare the scarce emotional text descriptions corresponding to the videos, we propose a labor-free paradigm that employs large-scale pretrained models to automatically annotate emotional text labels for existing audiovisual datasets. Incorporating the synthetic emotion texts, the Style-E stage utilizes a large-scale CLIP model to extract emotion representations, which are combined with the audio, serving as the condition for an efficient latent diffusion model designed to produce emotional motion coefficients of a 3DMM model. Moving on to the Style-A stage, we develop a coefficient-driven motion generator and an art-specific style path embedded in the well-known StyleGAN. This allows us to synthesize high-resolution artistically stylized talking head videos using the generated emotional motion coefficients and an art style source picture. Moreover, to better preserve image details and avoid artifacts, we provide StyleGAN with the multi-scale content features extracted from the identity image and refine its intermediate feature maps by the designed content encoder and refinement network, respectively. Extensive experimental results demonstrate our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style.
https://arxiv.org/abs/2403.06365
The effectiveness of central bank communication is a crucial aspect of monetary policy transmission. While recent research has examined the influence of policy communication by the chairs of the Federal Reserve on various financial variables, much of the literature relies on rule-based or dictionary-based methods to parse the chairs' language, leaving nuanced information about the policy stance contained in nonverbal emotion out of the analysis. In the current study, we propose the Fine-Grained Monetary Policy Analysis Framework (FMPAF), a novel approach that integrates large language models (LLMs) with regression analysis to provide a comprehensive analysis of the impact of the press-conference communications of chairs of the Federal Reserve on financial markets. We conduct extensive comparisons of model performance under different levels of granularity, modalities, and communication scenarios. Based on our preferred specification, a one-unit increase in the sentiment score is associated with an increase of approximately 500 basis points in the price of the S&P 500 Exchange-Traded Fund and a 15-basis-point decrease in the policy interest rate, while not leading to a significant response in exchange rates.
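A simplified, hedged sketch of the regression layer of such a framework: regress a market outcome on an LLM-derived sentiment score per press conference with OLS. The data below are synthetic and the variable names are illustrative; the paper's actual specification, controls, and data are not reproduced.

```python
# OLS of a market outcome on a per-event sentiment score (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"sentiment": rng.normal(size=120)})        # LLM sentiment per press conference
df["spy_return_bps"] = 500 * df["sentiment"] + rng.normal(scale=50, size=120)

model = sm.OLS(df["spy_return_bps"], sm.add_constant(df["sentiment"])).fit()
print(model.params)    # slope ~ basis-point response per unit of sentiment
print(model.pvalues)
```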
https://arxiv.org/abs/2403.06115
This paper delves into enhancing the classification performance on the GoEmotions dataset, a large, manually annotated dataset for emotion detection in text. The primary goal of this paper is to address the challenges of detecting subtle emotions in text, a complex issue in Natural Language Processing (NLP) with significant practical applications. The findings offer valuable insights into addressing the challenges of emotion detection in text and suggest directions for future research, including the potential for a survey paper that synthesizes methods and performances across various datasets in this domain.
https://arxiv.org/abs/2403.06108
The lack of a suitable tool for analyzing conversational texts in the Persian language has made various analyses of these texts, including sentiment analysis, difficult. In this research, we try to make these texts easier for machines to understand by providing PSC, the Persian Slang Converter, a tool for converting conversational texts into formal ones, and by using the most up-to-date deep learning methods together with PSC to improve machine sentiment learning on short Persian texts. More than 10 million unlabeled texts from various social networks and movie subtitles (as conversational texts) and about 10 million news texts (as formal texts) were used to train the unsupervised models and to build the formalization tool. 60,000 texts from comments of Instagram users, labeled as positive, negative, or neutral, serve as the supervised data for training the short-text emotion classification model. Using the formalization tool, 57% of the words in the conversational corpus were converted. Finally, by combining the formalizer, a FastText model, and a deep LSTM network, an accuracy of 81.91% was obtained on the test data.
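A minimal sketch, under assumed architecture choices, of a classifier in the spirit described above: an LSTM runs over pretrained word vectors (e.g., FastText embeddings of the formalized text) and predicts positive/negative/neutral; the layer sizes and depth are illustrative, not the paper's exact network.

```python
# LSTM sentiment classifier over precomputed word-embedding sequences (assumed sizes).
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, num_classes=3):  # positive/negative/neutral
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):            # x: (B, T, emb_dim) sequence of FastText vectors
        _, (h, _) = self.lstm(x)     # h: (num_layers, B, hidden)
        return self.head(h[-1])      # classify from the last layer's final state

model = LSTMSentiment()
logits = model(torch.randn(16, 40, 300))   # 16 comments, 40 tokens each -> (16, 3)
```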
https://arxiv.org/abs/2403.06023