A Review of Vision Transformer for Image Classification

Zhi Min¹, Lu Jingfang¹

Journal of Zhengzhou University (Engineering Science), 2024, Vol. 45, Issue 4: 19-29, 11. DOI: 10.13705/j.issn.1671-6833.2024.01.015

Author information

1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010022, China

Abstract

ViT, a model based on the Transformer architecture, has shown good results in image classification tasks. This study systematically summarizes the application of ViT to image classification. First, the functional characteristics of the ViT framework and its four modules (patch module, position encoding, multihead attention mechanism, and feed-forward neural network) are briefly introduced. Second, the application of ViT in image classification is summarized in terms of the improvements made to each of these four modules. Because different model structures and improvement measures can significantly affect the final classification performance, the paper presents a side-by-side comparison of various types of ViTs. Finally, the advantages and limitations of ViT in image classification are pointed out, possible future research directions for overcoming these limitations are proposed, and the extension of ViT to a wider range of computer vision tasks, such as video understanding, is explored.

Key words

ViT model/image classification/multihead attention/feed-forward network layer/position encoding

Classification

Information Technology and Security Science

Cite this article

Zhi Min, Lu Jingfang. A Review of Vision Transformer for Image Classification[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(4): 19-29, 11.

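Funding

Natural Science Foundation of Inner Mongolia Autonomous Region (2023MS06009)

Fundamental Research Funds of Inner Mongolia Normal University (2022JBXC018)

Graduate Research Innovation Fund of Inner Mongolia Normal University (CXJJS22138)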

Journal of Zhengzhou University (Engineering Science)

Open access; indexed in Peking University Core Journals and CSTPCD

ISSN: 1671-6833
