

Image-Text Matching Method Combining Semantic Enhancement and Position Encoding

Abstract


Image-text matching is one of the fundamental cross-modal tasks; its core is how to accurately evaluate the similarity between image semantics and text semantics. Existing methods introduce a relevance threshold to maximize the separation between relevant and irrelevant distributions and thereby obtain better semantic alignment. However, the features themselves lack semantic interrelations, and image regions without spatial position information are difficult to align accurately with text words, which inevitably limits the learning of the relevance threshold and prevents accurate semantic alignment. To address this problem, this paper proposes an image-text matching method with adaptive relevance-learnable attention that combines semantic enhancement and position encoding. First, an undirected fully connected graph of image regions (text words) is constructed on top of the preliminarily extracted features, and graph attention is used to aggregate neighbor information, yielding semantically enhanced features. Then, the absolute position information of each image region is encoded, so that the similarities between spatially aware image regions and text words produce maximally separated relevant and irrelevant distributions, allowing the optimal relevance boundary between the two distributions to be learned more effectively. Finally, comparative experiments on the public datasets Flickr30K and MS-COCO using the Recall@K metric verify the effectiveness of the proposed method.
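To make the semantic-enhancement step concrete, below is a minimal PyTorch sketch of graph attention over a fully connected graph of features, as the abstract describes. The class name, single-head design, and feature dimension are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class GraphAttentionEnhancer(nn.Module):
    """Single-head graph attention over a fully connected graph of features."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_nodes, dim) -- e.g. region features of one image,
        # or word features of one sentence, as nodes of the graph
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Fully connected graph: every node attends to every other node.
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        # Residual connection keeps the original semantics while adding context.
        return x + attn @ v

regions = torch.randn(2, 36, 1024)            # toy batch of region features
enhanced = GraphAttentionEnhancer(1024)(regions)
print(enhanced.shape)                          # torch.Size([2, 36, 1024])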
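The absolute position encoding of image regions can be illustrated as follows: each detected region's bounding box is normalized by the image size and projected into the feature space, where it can be added to the region feature. The 5-dimensional box descriptor (normalized corners plus relative area) is a common convention assumed here; the paper's exact encoding may differ.

import torch
import torch.nn as nn

def encode_region_positions(boxes: torch.Tensor, img_w: float, img_h: float,
                            proj: nn.Linear) -> torch.Tensor:
    # boxes: (n_regions, 4) in absolute pixel coordinates (x1, y1, x2, y2)
    x1, y1, x2, y2 = boxes.unbind(-1)
    norm = torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                        (x2 - x1) * (y2 - y1) / (img_w * img_h)], dim=-1)
    return proj(norm)   # (n_regions, dim) positional embedding

proj = nn.Linear(5, 1024)
boxes = torch.tensor([[10., 20., 200., 180.], [50., 60., 300., 400.]])
pos = encode_region_positions(boxes, img_w=640., img_h=480., proj=proj)
print(pos.shape)        # torch.Size([2, 1024])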
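The relevance boundary between relevant and irrelevant distributions can be sketched as attention with a learnable threshold that masks out low-similarity word-region pairs. This reflects only the general idea stated in the abstract; the paper's adaptive formulation may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdedAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.t = nn.Parameter(torch.tensor(0.2))  # learnable relevance boundary

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words: (n_words, dim), regions: (n_regions, dim)
        sims = F.normalize(words, dim=-1) @ F.normalize(regions, dim=-1).T
        relevant = F.relu(sims - self.t)           # zero out irrelevant pairs
        attn = relevant / (relevant.sum(-1, keepdim=True) + 1e-8)
        return attn @ regions                       # region context per word

words, regions = torch.randn(12, 1024), torch.randn(36, 1024)
ctx = ThresholdedAttention()(words, regions)
print(ctx.shape)        # torch.Size([12, 1024])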
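For the evaluation, the Recall@K metric on Flickr30K and MS-COCO can be computed as below for image-to-text retrieval, assuming the standard convention of 5 captions per image; the similarity matrix here is random toy data.

import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    # sims: (n_images, 5 * n_images); captions 5*i .. 5*i+4 belong to image i
    n = sims.shape[0]
    hits = 0
    for i in range(n):
        topk = np.argsort(-sims[i])[:k]            # indices of top-K captions
        if any(5 * i <= j < 5 * i + 5 for j in topk):
            hits += 1                              # a ground-truth caption was retrieved
    return hits / n

sims = np.random.rand(100, 500)
print(f"R@1={recall_at_k(sims, 1):.3f}  R@5={recall_at_k(sims, 5):.3f}")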

ZHAO Tingting; CHANG Yuguang; GUO Yu; CHEN Yarui; WANG Yuan

College of Artificial Intelligence, Tianjin University of Science & Technology, Tianjin 300457, China

Computer and Automation

Keywords: cross-modal image-text matching; graph attention; position encoding; relevance threshold

《天津科技大学学报》 (Journal of Tianjin University of Science & Technology), 2024(4)

Pages: 63-72 (10 pages)

Funding: National Natural Science Foundation of China (61976156); Tianjin Enterprise Science and Technology Commissioner Project (20YDTPJC00560)

DOI: 10.13364/j.issn.1672-6510.20230177
