Journal of Donghua University (English Edition), 2025, Vol. 42, Issue 2: 177-186, 10. DOI: 10.19884/j.1672-5220.202403004
Context-Aware Visual Entailment Driven by Specific Instructions
Abstract
Visual entailment (VE) is a prototypical task in multimodal visual reasoning, where current methods frequently use large language models (LLMs) as a knowledge base to assist in answering questions. These methods rely heavily on the textual modality, which inherently cannot capture the full extent of the information contained in images. We propose a context-aware visual entailment (CAVE) model, which introduces a novel aggregation module designed to extract high-level semantic features from images. This module integrates lower-level semantic image features into high-level visual tokens, formatting them similarly to text tokens so that they can serve as inputs for LLMs. The CAVE model compensates for the loss of image information and integrates it more effectively with textual comprehension. Additionally, the CAVE model incorporates a new input format and training methodology rooted in instruction tuning and in-context learning techniques. The objective of this research is to maximize the inherent logical reasoning capabilities of LLMs. Experimental results on the E-SNLI-VE dataset show that the proposed CAVE model exhibits outstanding performance.
Keywords
visual entailment (VE) / textual-visual integration / instruction tuning / in-context learning
Classification
Computer and Automation
Citation
韩宇凤, 郝矿荣, 唐雪嵩, 隗兵. Context-Aware Visual Entailment Driven by Specific Instructions [J]. Journal of Donghua University (English Edition), 2025, 42(2): 177-186, 10.
Funding
Fundamental Research Funds for the Central Universities, China (No. 2232021A-10)
Shanghai Pujiang Program, China (No. 22PJ1423400)
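
Illustrative sketch
The abstract describes an aggregation module that turns lower-level image features into a small set of high-level visual tokens shaped like text tokens, but it does not say how the aggregation is computed. The sketch below is one plausible reading, not the authors' method: it assumes attention pooling with a few query vectors, followed by a linear projection to the LLM embedding width. All names (`aggregate`, `project`, `build_llm_input`), the sizes, and the random weights are hypothetical stand-ins for learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate(patch_feats, queries):
    """Pool N patch features into len(queries) visual tokens.

    Assumption: each query attends over all patches (scaled dot-product
    attention) and returns the weighted average of the patch features.
    """
    dim = len(patch_feats[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in queries:
        weights = softmax([dot(q, p) * scale for p in patch_feats])
        tokens.append(
            [sum(w * p[j] for w, p in zip(weights, patch_feats)) for j in range(dim)]
        )
    return tokens


def project(tokens, weight):
    """Linear map from image-feature width to LLM embedding width.

    `weight` is a list of rows with shape [out_dim][in_dim].
    """
    return [[dot(row, t) for row in weight] for t in tokens]


def build_llm_input(visual_tokens, text_embeds):
    """Place visual tokens before the text token embeddings in one sequence."""
    return visual_tokens + text_embeds


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


patch_feats = rand_matrix(16, 8)   # 16 image patches, 8-dim features (toy sizes)
queries = rand_matrix(4, 8)        # 4 queries -> 4 visual tokens
proj = rand_matrix(12, 8)          # maps 8-dim features to a 12-dim LLM width
text_embeds = rand_matrix(5, 12)   # 5 text tokens already in the LLM width

visual_tokens = project(aggregate(patch_feats, queries), proj)
sequence = build_llm_input(visual_tokens, text_embeds)
print(len(sequence), len(sequence[0]))  # 9 12
```

In this toy setup, 16 patch features are reduced to 4 visual tokens that share the width of the 5 text embeddings, so the LLM would receive a single 9-token sequence. In a trained model the queries and projection would be learned rather than random.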