计算机技术与发展 2024, Vol. 34, Issue 5: 44-51, 8. DOI: 10.20165/j.cnki.ISSN1673-629X.2024.0039
Multimodal Interaction Network for Image Captioning
Abstract
In image captioning, multimodal approaches are widely exploited by simultaneously providing visual inputs and semantic attributes to capture multi-level information. However, most approaches still use the two modalities in isolation, without modeling the correlation between them. To fill this gap, we first introduce a Bi-Directional Attention Flow (BiDAF) module that extends the self-attention mechanism in a bi-directional manner to model complex interactions between the two modalities. Then, through a Gated Linear Memory (GLM) module, which realizes the same function as a Long Short-Term Memory (LSTM) network with only one forget gate, the decoder complexity is effectively reduced while multimodal interaction information is still captured. Finally, we employ BiDAF and GLM as the encoder and the decoder of the image captioning model, respectively, forming the Multimodal Interaction Network (MINet). Experimental results on COCO show that MINet not only has a more concise decoder, generates better image descriptions, and achieves higher evaluation scores than existing multimodal methods, but is also more efficient, describing images well without pre-training.

Keywords: multimodal / image captioning / self-attention / long short-term memory / visual / semantic
Classification: Information Technology and Security Science
Citation: 段毛毛, 魏燚伟. 基于多模态交互网络的图像描述 [J]. 计算机技术与发展, 2024, 34(5): 44-51, 8.

Funding: Talent introduction project of China University of Petroleum-Beijing, Karamay Campus (XQZX20200021)
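The abstract describes the GLM decoder only as "realizing the same function as an LSTM with only one forget gate". A minimal sketch of what such a single-gate recurrent cell could look like is given below; the class name, weight shapes, and update rule are assumptions for illustration, not the paper's actual implementation. The single gate interpolates between the previous state and a candidate state, so only two weight matrices are needed instead of the four of a standard LSTM.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GatedLinearMemoryCell:
    """Hypothetical single-forget-gate memory cell (illustrative sketch).

    A standard LSTM maintains gates i, f, o plus a candidate; here one
    forget gate f both discards old state and admits new content via
    the interpolation h' = f * h + (1 - f) * c.
    """

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One gate -> two weight matrices total, vs. four in an LSTM.
        self.W_f = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * 0.1
        self.b_f = np.zeros(hidden_dim)
        self.W_c = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * 0.1
        self.b_c = np.zeros(hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        """Advance one time step: x is the input, h the previous state."""
        z = np.concatenate([x, h])
        f = sigmoid(self.W_f @ z + self.b_f)   # single forget gate in (0, 1)
        c = np.tanh(self.W_c @ z + self.b_c)   # candidate content in (-1, 1)
        return f * h + (1.0 - f) * c           # gated linear interpolation


# Usage: run a short sequence through the cell.
cell = GatedLinearMemoryCell(input_dim=4, hidden_dim=8)
h = np.zeros(8)
for t in range(3):
    h = cell.step(np.ones(4), h)
print(h.shape)
```

Because the update is a convex combination of the bounded candidate and the previous state, the hidden state stays in (-1, 1), which is one plausible reason such a cell can be trained stably despite having fewer gates.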