Journal of Tianjin University of Science & Technology, 2024, Vol. 39, Issue 4: 63-72, 10. DOI: 10.13364/j.issn.1672-6510.20230177
融合语义增强和位置编码的图文匹配方法
Image-Text Matching Method Combining Semantic Enhancement and Position Encoding
Abstract
Image-text matching is one of the fundamental cross-modal tasks. Its core challenge is how to accurately evaluate the similarity between image semantics and text semantics. Existing methods maximize the distinction between relevant and irrelevant distributions by introducing a relevance threshold to obtain better semantic alignment. However, the extracted features themselves lack semantic correlation, and image regions and text words that lack spatial position information are difficult to align accurately, which inevitably limits the learning of the relevance threshold and prevents accurate semantic alignment. To address this problem, in this article we propose an image-text matching method that combines semantic enhancement and position encoding with adaptive, learnable relevance attention. Specifically, an undirected fully connected graph of image regions (text words) is first constructed from the preliminarily extracted features, and graph attention is used to aggregate neighbor information to obtain semantically enhanced features. Then, the absolute position information of each image region is encoded, and the most separable relevant and irrelevant distributions are obtained from the similarity between image regions and text words carrying spatial semantics, so as to better learn the optimal relevance boundary between the two distributions. Finally, the effectiveness of the proposed method was verified on the public datasets Flickr30k and MS-COCO through comparison experiments using the Recall@K metric.
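The semantic-enhancement step described in the abstract, aggregating neighbor information over a fully connected graph with graph attention, can be sketched as follows. This is a minimal single-head, GAT-style illustration with hypothetical names and dimensions; the paper's actual network architecture and parameters are not specified here.

```python
import numpy as np

def graph_attention_enhance(features, W, a):
    """Aggregate neighbor information over a fully connected graph of
    region (or word) features -- single-head graph attention sketch."""
    h = features @ W                          # (n, d') projected features
    n = h.shape[0]
    # attention logits e_ij = LeakyReLU(a^T [h_i || h_j]) for every pair
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = np.concatenate([h[i], h[j]]) @ a
            e[i, j] = z if z > 0 else 0.2 * z  # LeakyReLU, slope 0.2
    # softmax over each node's neighbors, then weighted aggregation
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ h                          # semantically enhanced features

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))           # e.g. 5 image regions, 8-dim
W = rng.standard_normal((8, 8))               # illustrative projection
a = rng.standard_normal(16)                   # illustrative attention vector
enhanced = graph_attention_enhance(feats, W, a)
print(enhanced.shape)  # (5, 8)
```

Each output row is a convex combination of all projected node features, so every region's representation is enriched with context from its neighbors, which is the intent of the semantic-enhancement stage.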
Keywords
cross-modal image-text matching / graph attention / position encoding / relevance threshold

Classification
Information Technology and Security Science
Zhao Tingting, Chang Yuguang, Guo Yu, Chen Yarui, Wang Yuan. Image-Text Matching Method Combining Semantic Enhancement and Position Encoding [J]. Journal of Tianjin University of Science & Technology, 2024, 39(4): 63-72, 10.

Funding
National Natural Science Foundation of China (61976156)
Tianjin Enterprise Science and Technology Commissioner Project (20YDTPJC00560)