Context-Aware Visual Entailment Driven by Specific Instructions

Han Yufeng¹, Hao Kuangrong¹, Tang Xuesong¹, Wei Bing¹

Journal of Donghua University (English Edition), 2025, Vol. 42, Issue 2: 177-186. DOI: 10.19884/j.1672-5220.202403004


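Author information

1. College of Information Science and Technology, Donghua University, Shanghai 201620, China; Engineering Research Center of Digitized Textile & Apparel Technology, Ministry of Education, Donghua University, Shanghai 201620, China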

Abstract

Visual entailment (VE) is a prototypical task in multimodal visual reasoning, where current methods frequently utilize large language models (LLMs) as the knowledge base to assist in answering questions. These methods heavily rely on the textual modality, which inherently cannot capture the full extent of information contained within images. We propose a context-aware visual entailment (CAVE) model, which introduces a novel aggregation module designed to extract high-level semantic features from images. This module integrates lower-level semantic image features into high-level visual tokens, formatting them similarly to text tokens so that they can serve as inputs for LLMs. The CAVE model compensates for the loss of image information and integrates it more effectively with textual comprehension. Additionally, the CAVE model incorporates a new input format and training methodology, which is rooted in instruction tuning and in-context learning techniques. The objective of this research is to maximize the inherent logical reasoning capabilities of LLMs. Experimental results on the E-SNLI-VE dataset show that the proposed CAVE model exhibits outstanding performance.
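The abstract does not specify how the aggregation module is built. As a rough illustration only, the idea of condensing many lower-level image features into a few visual tokens that share the width of text embeddings can be sketched with learned-query cross-attention. This is an assumed stand-in technique, not the CAVE architecture; the function `aggregate_visual_tokens`, every weight matrix, and all dimensions below are hypothetical, and random arrays stand in for real image features and text embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_visual_tokens(patch_feats, queries, w_k, w_v, w_out):
    """Pool many patch features into a few visual tokens (hypothetical sketch).

    patch_feats: (num_patches, d_img) lower-level image features
    queries:     (num_tokens, d_attn) learned query vectors (assumed)
    w_k, w_v:    (d_img, d_attn) key and value projections (assumed)
    w_out:       (d_attn, d_llm) projection to the LLM embedding width (assumed)
    """
    keys = patch_feats @ w_k
    values = patch_feats @ w_v
    # Each query scores every patch; rows of attn sum to 1.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = softmax(scores, axis=-1)
    pooled = attn @ values
    return pooled @ w_out, attn


rng = np.random.default_rng(0)
num_patches, d_img, d_attn, d_llm, num_tokens = 196, 64, 32, 48, 8

patch_feats = rng.normal(size=(num_patches, d_img))  # stand-in image features
queries = rng.normal(size=(num_tokens, d_attn))
w_k = rng.normal(size=(d_img, d_attn)) * 0.1
w_v = rng.normal(size=(d_img, d_attn)) * 0.1
w_out = rng.normal(size=(d_attn, d_llm)) * 0.1

visual_tokens, attn = aggregate_visual_tokens(patch_feats, queries, w_k, w_v, w_out)

# Stand-in for 12 embedded instruction/text tokens of the same width.
text_embeds = rng.normal(size=(12, d_llm))
# Visual tokens are simply prepended to the text embeddings as LLM input.
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(visual_tokens.shape, llm_input.shape)  # (8, 48) (20, 48)
```

In this sketch the 196 patch features are reduced to 8 tokens, so the image occupies only a small part of the LLM input sequence while sharing the text embedding width. Whether CAVE uses a comparable pooling scheme is not stated in the abstract.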

Key words

visual entailment (VE); textual-visual integration; instruction tuning; in-context learning

Classification

Information technology and security science

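Cite this article

Han Yufeng, Hao Kuangrong, Tang Xuesong, Wei Bing. Context-Aware Visual Entailment Driven by Specific Instructions [J]. Journal of Donghua University (English Edition), 2025, 42(2): 177-186.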

Funding

Fundamental Research Funds for the Central Universities, China (No. 2232021A-10)

Shanghai Pujiang Program, China (No. 22PJ1423400)

Journal of Donghua University (English Edition)

ISSN: 1672-5220
