Computer Engineering, 2024, Vol. 50, Issue (2): 1-14, 14. DOI: 10.19678/j.issn.1000-3428.0067514
Survey of Text-based Visual Question Answering
Abstract
Traditional Visual Question Answering (VQA) focuses only on the visual objects in an image and ignores any text the image contains. Text-based Visual Question Answering (TextVQA) attends to the text in the image in addition to the visual information, which allows it to answer questions more accurately and efficiently. In recent years, TextVQA has become a research focus in the multimodal field, with important application prospects in scenarios that involve text, such as autonomous driving and scene understanding. This paper describes the concept of TextVQA and its existing problems and challenges, and systematically analyzes the task in terms of methods, datasets, and future research directions. It focuses on existing TextVQA methods, which it divides into three stages: feature extraction, feature fusion, and answer prediction. According to the technique used in the fusion stage, the methods are grouped into three categories: simple attention, Transformer-based, and pre-training methods. The advantages and disadvantages of the different methods are summarized, and their performance on public datasets is analyzed and compared. Four commonly used public datasets are introduced, and their characteristics and evaluation metrics are analyzed. Finally, the paper discusses the problems and challenges facing the TextVQA task and outlines future research directions.
Keywords
Text-based Visual Question Answering (TextVQA) / text information / natural language processing / computer vision / multimodal fusion
Classification
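Information Technology and Security Science
Cite this article
朱贵德, 黄海. 文本视觉问答综述 [Survey of Text-based Visual Question Answering][J]. 计算机工程 (Computer Engineering), 2024, 50(2): 1-14, 14.
Funding
General Program of the National Natural Science Foundation of China (62272416).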