Abstract
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies remains relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, we first define a part-based image hierarchy using object-level annotations within and across images. We then introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments on part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating our model's ability to capture visual hierarchies.
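The abstract names two ingredients: embeddings in hyperbolic space and a contrastive loss over pairs. The paper's exact formulation is not given here, so the following is only a minimal sketch of the kind of geometry involved, assuming the standard Poincaré-ball distance; the function names, the margin form of the loss, and the specific points are illustrative, not the authors' method.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit
    Poincare ball (curvature -1). Distances blow up near the
    boundary, which is what lets the ball embed tree-like
    hierarchies with low distortion."""
    sq_diff = np.sum((u - v) ** 2)
    alpha = 1.0 - np.sum(u ** 2)
    beta = 1.0 - np.sum(v ** 2)
    return np.arccosh(1.0 + 2.0 * sq_diff / (alpha * beta + eps))

def contrastive_loss(anchor, positive, negative, margin=1.0):
    """A generic margin-based contrastive loss computed with
    hyperbolic rather than Euclidean distances (illustrative;
    the paper uses pairwise entailment metrics on top of this
    geometry)."""
    d_pos = poincare_distance(anchor, positive)
    d_neg = poincare_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)
```

A common intuition in such setups is that the norm of an embedding tracks its level in the hierarchy: coarse concepts (whole objects) sit near the origin, while fine-grained ones (parts) are pushed toward the boundary, where there is exponentially more room.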
URL
https://arxiv.org/abs/2411.17490