Abstract
In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights the identification of prominent objects and their spatial locations within the visual modality, enabling the integration of visual semantic-spatial interactions while maintaining independence between the two modalities. This integration effectively combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes the structured contextual visual scene information from segmentation to conduct local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments on the MS-COCO and Flickr30K benchmarks substantiate the superior performance, inference efficiency and generalization of the proposed 3SHNet compared with contemporary state-of-the-art methods. Specifically, on the larger MS-COCO 5K test set, we achieve improvements of 16.3%, 24.8% and 18.3% in rSum score over state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency. Moreover, our performance on cross-dataset generalization improves by 18.6%. Data and code are available at this https URL.
URL
https://arxiv.org/abs/2404.17273