Paper Reading AI Learner

POV Learning: Individual Alignment of Multimodal Models using Human Perception

2024-05-07 16:07:29
Simon Werner, Katharina Christ, Laura Bernardy, Marion G. Müller, Achim Rettinger

Abstract

Aligning machine learning systems with human expectations is mostly attempted by training on manually vetted human behavioral samples, typically explicit feedback. This is done at the population level, since the context capturing the subjective Point-Of-View (POV) of a concrete person in a specific situation is not retained in the data. However, we argue that alignment at the individual level can considerably boost the subjective predictive performance for the individual user interacting with the system. Since perception differs for each person, the same situation is observed differently. Consequently, the basis for decision making, the subsequent reasoning processes, and the observable reactions differ. We hypothesize that individual perception patterns can be used to improve alignment at the individual level. We test this by integrating perception information into machine learning systems and measuring their predictive performance with respect to individual subjective assessments. For our empirical study, we collect a novel data set of multimodal stimuli and corresponding eye-tracking sequences for the novel task of Perception-Guided Crossmodal Entailment and tackle it with our Perception-Guided Multimodal Transformer. Our findings suggest that exploiting individual perception signals for the machine learning of subjective human assessments provides a valuable cue for individual alignment. It not only improves the overall predictive performance from the point of view of the individual user but might also contribute to steering AI systems towards every person's individual expectations and values.
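
The core idea of treating eye-tracking fixations as an extra token stream fused with text and image tokens inside a single transformer encoder can be illustrated with a minimal sketch. The PyTorch toy below is a hypothetical illustration, not the paper's actual Perception-Guided Multimodal Transformer: all module names, dimensions, the per-fixation feature layout (x, y, duration, pupil size), and the early-fusion strategy are assumptions made for the example.

```python
# Hypothetical sketch: fuse eye-tracking (perception) tokens with precomputed
# text and image embeddings in one transformer encoder for a binary
# entailment decision. Architecture and dimensions are illustrative only.
import torch
import torch.nn as nn


class PerceptionGuidedEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, gaze_dim=4):
        super().__init__()
        # Project per-fixation gaze features (assumed here: x, y, duration,
        # pupil size) into the shared embedding space of the other tokens.
        self.gaze_proj = nn.Linear(gaze_dim, d_model)
        # Learned type embeddings distinguish text, image, and gaze tokens.
        self.type_emb = nn.Embedding(3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, 2)  # entailment vs. no entailment

    def forward(self, text_emb, image_emb, gaze_feats):
        # text_emb:   (B, T_t, d_model)  precomputed text token embeddings
        # image_emb:  (B, T_i, d_model)  precomputed image patch embeddings
        # gaze_feats: (B, T_g, gaze_dim) raw per-fixation eye-tracking features
        gaze_emb = self.gaze_proj(gaze_feats)
        tokens = [
            text_emb + self.type_emb.weight[0],
            image_emb + self.type_emb.weight[1],
            gaze_emb + self.type_emb.weight[2],
        ]
        cls = self.cls.expand(text_emb.size(0), -1, -1)
        x = self.encoder(torch.cat([cls] + tokens, dim=1))
        return self.head(x[:, 0])  # classify from the [CLS] position


# Toy forward pass with random tensors standing in for real embeddings
# and gaze recordings.
model = PerceptionGuidedEncoder()
logits = model(torch.randn(2, 12, 256),  # 12 text tokens
               torch.randn(2, 49, 256),  # 7x7 image patches
               torch.randn(2, 30, 4))    # 30 fixations
print(logits.shape)  # torch.Size([2, 2])
```

Under this reading, individual alignment would amount to training or fine-tuning such a model on one person's gaze recordings paired with that person's subjective entailment labels, so the learned representation reflects that individual's perception patterns rather than a population average.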

URL

https://arxiv.org/abs/2405.04443

PDF

https://arxiv.org/pdf/2405.04443.pdf

