Abstract
Fashion stylists have historically bridged the gap between consumers' desires and perfect outfits, which involve intricate combinations of colors, patterns, and materials. Although recent advancements in fashion recommendation systems have made strides in outfit compatibility prediction and complementary item retrieval, these systems rely heavily on pre-selected customer choices. We therefore introduce a new approach to fashion recommendation: a text-to-outfit retrieval task that generates a complete outfit set based solely on textual descriptions given by users. Our model operates at three semantic levels (item, style, and outfit), where each level progressively aggregates information to form a coherent outfit recommendation from the textual input. Here, we leverage strategies similar to those in the contrastive language-image pretraining (CLIP) model to address the intricate style relationships within outfit sets. On the Maryland Polyvore and Polyvore Outfit datasets, our approach significantly outperformed state-of-the-art text-video retrieval models, demonstrating its effectiveness in the fashion recommendation domain. This research not only pioneers a new facet of fashion recommendation systems but also introduces a method that captures the essence of individual style preferences through textual descriptions.
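The CLIP-like strategy the abstract refers to is typically a symmetric contrastive (InfoNCE) objective over matched text/outfit embedding pairs. The sketch below is a minimal illustration of that general objective, not the paper's actual implementation; the function names, the use of in-batch negatives, and the temperature value are assumptions for the example.

```python
import numpy as np

def _log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_emb: np.ndarray, outfit_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of (text, outfit) embedding pairs.

    Row i of text_emb and outfit_emb is assumed to be a matched pair;
    every other row in the batch serves as an in-batch negative.
    """
    # Cosine similarity via L2-normalized embeddings.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    o = outfit_emb / np.linalg.norm(outfit_emb, axis=1, keepdims=True)
    logits = (t @ o.T) / temperature  # (batch, batch) similarity matrix

    # Cross-entropy with the diagonal (matched pairs) as the target,
    # averaged over the text-to-outfit and outfit-to-text directions.
    t2o = -np.diag(_log_softmax(logits, axis=1)).mean()
    o2t = -np.diag(_log_softmax(logits, axis=0)).mean()
    return float((t2o + o2t) / 2.0)
```

In the paper's setting, `outfit_emb` would presumably be the outfit-level representation produced by aggregating the item- and style-level features, while `text_emb` encodes the user's description.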
URL
https://arxiv.org/abs/2311.02122