
基于多模态交互网络的图像描述



计算机技术与发展 (Computer Technology and Development), 2024, Vol. 34, Issue 5: 44-51, 8. DOI: 10.20165/j.cnki.ISSN1673-629X.2024.0039


Multimodal Interaction Network for Image Captioning

段毛毛 (Duan Maomao)¹, 魏燚伟 (Wei Yiwei)¹

Author information

1. School of Petroleum, China University of Petroleum (Beijing) at Karamay, Karamay, Xinjiang 834000


Abstract

In image captioning, multimodal approaches are widely used: they supply visual inputs and semantic attributes together to capture multi-level information. However, most approaches still use the two modalities in isolation, without considering the correlation between them. To fill this gap, we first introduce a Bi-Directional Attention Flow (BiDAF) module that extends the self-attention mechanism in a bi-directional manner to model complex interactions between the modalities. Then, a Gated Linear Memory (GLM) module, which realizes the same function as a Long Short-Term Memory (LSTM) with only one forget gate, effectively reduces decoder complexity while capturing multimodal interaction information. Finally, we apply BiDAF and GLM as the encoder and the decoder of the image captioning model, respectively, forming a Multimodal Interactive Network (MINet). Experiments on COCO show that, compared with existing multimodal methods, MINet has a more concise decoder, produces better image descriptions, and achieves higher evaluation scores, and that it is also more efficient at image description without pre-training.
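The abstract gives no equations, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. `bidirectional_attention` lets visual features attend over semantic attribute embeddings and the reverse, using plain scaled dot-product attention with no learned projections. `gated_memory_step` is a linear recurrent update controlled by a single forget gate. All function names, the element-wise gate form, and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values rows."""
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(dim)])
    return out


def bidirectional_attention(visual, semantic):
    """Assumed form of a two-way interaction: attention runs in both directions."""
    visual_ctx = attend(visual, semantic)    # visual rows summarise semantic rows
    semantic_ctx = attend(semantic, visual)  # semantic rows summarise visual rows
    return visual_ctx, semantic_ctx


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_step(memory, x, w_f, b_f):
    """One update with a single forget gate f (assumed): m' = f * m + (1 - f) * x."""
    new_memory = []
    for m, xi, w, b in zip(memory, x, w_f, b_f):
        f = sigmoid(w * xi + b)  # element-wise gate, a simplifying assumption
        new_memory.append(f * m + (1.0 - f) * xi)
    return new_memory


# Toy inputs: 3 visual region vectors and 2 semantic attribute vectors, dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.5, 0.5], [1.0, -1.0]]
visual_ctx, semantic_ctx = bidirectional_attention(visual, semantic)

# With zero gate weights and biases, f = 0.5, so the memory moves halfway to x.
print(gated_memory_step([0.0, 0.0], [2.0, 4.0], [0.0, 0.0], [0.0, 0.0]))  # [1.0, 2.0]
```

Because there is only one gate, the update is a convex blend of the previous memory and the new input, which is the sense in which such a cell is lighter than a full LSTM with input, forget, and output gates.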

Key words

multimodal / image captioning / self-attention / long short-term memory / visual / semantic

Classification

Information technology and security science

Cite this article

段毛毛, 魏燚伟. 基于多模态交互网络的图像描述[J]. 计算机技术与发展, 2024, 34(5): 44-51, 8.

Funding

Talent Introduction Project of China University of Petroleum (Beijing) at Karamay (XQZX20200021)

计算机技术与发展 (Computer Technology and Development)

OA, CSTPCD

ISSN: 1673-629X
