
Textbook Visual Question Answering Method Based on Image-Text Comprehension Enhancement (基于图文理解增强的教科书视觉问答方法)

Hu Jingchang, Qiang Pengpeng, Tan Hongye, Wang Hongyu, Mu Yongli

Journal of Shanxi University (Natural Science Edition), 2026, Vol. 49, Issue 2: 263-271, 9.

DOI: 10.13451/j.sxu.ns.2024104


Enhancing Image and Text Comprehension for Textbook Visual Question Answering

Hu Jingchang¹, Qiang Pengpeng¹, Tan Hongye², Wang Hongyu¹, Mu Yongli³

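Author information

1. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006
2. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006; Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, Taiyuan, Shanxi 030006
3. Zhilin Information Technology Co., Ltd., Taiyuan, Shanxi 030000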


Abstract

Textbook Visual Question Answering is a multimodal task in smart education that requires a deep understanding of textbook images, text, and questions to infer the correct answers. However, existing general-purpose Visual Question Answering methods perform poorly on this task, for two main reasons. First, they recognize only simple object attributes, lack subject-specific information, and are susceptible to interference from redundant information unrelated to the question. Second, they struggle to capture key information in the text. To address these problems, a textbook visual question answering method based on image description enhancement is proposed. It comprises three modules: (1) Text encoding and understanding: a large language model extracts keywords from the question, and statements in the text related to those keywords are retrieved, which enhances text understanding and eliminates interference from redundant information. (2) Image encoding and description: a question-image attention mechanism is applied during image description to generate fine-grained, question-constrained image descriptions based on the question keywords, thereby enhancing image understanding. (3) Answer prediction: a pre-trained vision-language model fuses the text information with the visual information to improve the model's reasoning ability. Experimental results on relevant datasets demonstrate that the proposed method effectively improves the understanding of textbook information and thereby answer prediction accuracy. Accuracy on the test set and the validation set improved by 1.82% and 1.72%, respectively.
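
The retrieval step in module (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states that a large language model extracts the question keywords, whereas this sketch substitutes a simple stopword filter, and the names `STOPWORDS`, `extract_keywords`, and `retrieve_relevant` are hypothetical. It only shows the general idea of keeping the sentences that share keywords with the question and discarding the rest.

```python
import re

# Hypothetical stopword list. The paper uses a large language model for
# keyword extraction; this filter is only a stand-in for illustration.
STOPWORDS = {"what", "which", "is", "the", "a", "an", "of", "in", "to",
             "does", "do", "are", "and", "for", "on", "by"}


def extract_keywords(question):
    """Return the set of lowercase non-stopword tokens in the question."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def retrieve_relevant(question, passage, top_k=2):
    """Return up to top_k sentences sharing the most keywords with the question."""
    keywords = extract_keywords(question)
    # Split the passage into sentences at whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    scored = []
    for idx, sent in enumerate(sentences):
        words = set(re.findall(r"[a-z]+", sent.lower()))
        overlap = len(keywords & words)
        if overlap > 0:  # sentences with no keyword overlap are treated as redundant
            scored.append((overlap, idx, sent))
    # Rank by overlap (ties broken by position), keep top_k, restore passage order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    top = sorted(scored[:top_k], key=lambda item: item[1])
    return [sent for _, _, sent in top]


passage = ("The mantle lies beneath the crust. "
           "Volcanoes form where magma reaches the surface. "
           "The crust is the outermost layer of the Earth.")
question = "Which layer lies beneath the crust?"
for sentence in retrieve_relevant(question, passage):
    print(sentence)
```

In this toy passage the sentence about volcanoes shares no keyword with the question and is dropped, which mirrors the abstract's goal of removing redundant information before answer prediction.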

Keywords

visual question answering / intelligent education / image captioning / image-text comprehension enhancement

Classification

Information technology and security science

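Cite this article

Hu Jingchang, Qiang Pengpeng, Tan Hongye, Wang Hongyu, Mu Yongli. Enhancing Image and Text Comprehension for Textbook Visual Question Answering [J]. Journal of Shanxi University (Natural Science Edition), 2026, 49(2): 263-271, 9.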

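Funding

National Natural Science Foundation of China (62076155)

Taiyuan Xiaodian District-Shanxi University Industry-University-Research Cooperation Project "Promotion and Application of Automatic Short-Answer Scoring Technology in Comprehensive Evaluation Systems" (202301S06)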

Journal of Shanxi University (Natural Science Edition)

ISSN: 0253-2395
