Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval

2023-11-03 07:23:21
Junkyu Jang, Eugene Hwang, Sung-Hyuk Park

Abstract

Fashion stylists have historically bridged the gap between consumers' desires and perfect outfits, which involve intricate combinations of colors, patterns, and materials. Although recent advancements in fashion recommendation systems have made strides in outfit compatibility prediction and complementary item retrieval, these systems rely heavily on pre-selected customer choices. We therefore introduce a novel approach to fashion recommendation: the text-to-outfit retrieval task, which generates a complete outfit set based solely on a textual description given by the user. Our model is devised at three semantic levels (item, style, and outfit), where each level progressively aggregates data to form a coherent outfit recommendation from the textual input. Here, we leverage strategies similar to those of the contrastive language-image pretraining (CLIP) model to address the intricate style matrix within outfit sets. On the Maryland Polyvore and Polyvore Outfit datasets, our approach significantly outperformed state-of-the-art models from text-video retrieval tasks, solidifying its effectiveness in the fashion recommendation domain. This research not only pioneers a new facet of fashion recommendation systems but also introduces a method that captures the essence of individual style preferences through textual descriptions.
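
To make the abstract's mention of CLIP-style contrastive learning concrete, here is a minimal, hypothetical PyTorch sketch of a symmetric contrastive loss between text embeddings and outfit embeddings. All names, the mean-pooling aggregation, and the temperature value are illustrative stand-ins; the paper's actual item/style/outfit hierarchy is more elaborate than this.

    import torch
    import torch.nn.functional as F

    def clip_style_contrastive_loss(text_emb, outfit_emb, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of (text, outfit) pairs,
        in the spirit of CLIP. Matching pairs lie on the diagonal of the
        similarity matrix."""
        text_emb = F.normalize(text_emb, dim=-1)
        outfit_emb = F.normalize(outfit_emb, dim=-1)
        logits = text_emb @ outfit_emb.t() / temperature  # (B, B) similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_t2o = F.cross_entropy(logits, targets)       # text -> outfit
        loss_o2t = F.cross_entropy(logits.t(), targets)   # outfit -> text
        return (loss_t2o + loss_o2t) / 2

    def aggregate_outfit(item_embs):
        """Toy item -> outfit aggregation: mean-pool the item embeddings.
        A stand-in for the paper's progressive item/style/outfit levels."""
        return item_embs.mean(dim=1)                      # (B, n_items, D) -> (B, D)

    # Usage with random stand-in embeddings:
    B, n_items, D = 8, 4, 512
    text_emb = torch.randn(B, D)
    outfit_emb = aggregate_outfit(torch.randn(B, n_items, D))
    loss = clip_style_contrastive_loss(text_emb, outfit_emb)

At retrieval time, the same similarity matrix would be used directly: the outfit whose embedding scores highest against the query text's embedding is returned.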

URL

https://arxiv.org/abs/2311.02122

PDF

https://arxiv.org/pdf/2311.02122.pdf

