Journal of Zhengzhou University (Engineering Science), 2024, Vol. 45, Issue (4): 19-29, 11. DOI: 10.13705/j.issn.1671-6833.2024.01.015
A Review of Vision Transformer for Image Classification
Abstract
Vision Transformer (ViT), a model based on the Transformer architecture, has shown good results in image classification tasks. This study systematically summarizes the application of ViT to image classification. First, the functional characteristics of the ViT framework and its four modules (patch module, position encoding, multihead attention mechanism, and feed-forward neural network) are briefly introduced. Second, the application of ViT to image classification is summarized in terms of the improvements made to each of the four modules. Because different model structures and improvement measures can have a significant impact on final classification performance, the paper makes a side-by-side comparison of various types of ViTs. Finally, the advantages and limitations of ViT in image classification are pointed out, and possible future research directions are proposed to overcome these limitations and to extend ViT to a wider range of computer vision fields, such as video understanding.
Keywords
ViT model / image classification / multihead attention / feed-forward network layer / position encoding
Classification
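Information Technology and Security Science
Cite this article
ZHI Min (智敏), LU Jingfang (陆静芳). A Review of Vision Transformer for Image Classification[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(4): 19-29, 11.
Funding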
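Natural Science Foundation of Inner Mongolia Autonomous Region (2023MS06009)
Fundamental Research Funds of Inner Mongolia Normal University (2022JBXC018)
Graduate Research Innovation Fund of Inner Mongolia Normal University (CXJJS22138)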