
A Review of Vision Transformer for Image Classification (面向图像分类的Vision Transformer研究综述)

Open access; indexed in the Peking University Core Journals list (北大核心) and CSTPCD

As a model built on the Transformer architecture, ViT has shown good results in image classification tasks. This paper systematically summarizes the application of ViT to image classification. First, the ViT framework and the functional characteristics of its four modules (patch module, position encoding, multi-head attention, and feed-forward neural network) are briefly introduced. Second, the application of ViT to image classification is reviewed along the lines of the improvements made to each of these four modules. Third, because different model structures and improvement measures significantly affect final classification performance, the ViT variants covered in the paper are compared side by side, with their parameter counts, classification accuracy, advantages, and disadvantages listed in detail. Finally, the advantages and limitations of ViT in image classification are pointed out, and possible future research directions are proposed to overcome these limitations and to further extend ViT to other computer vision tasks; extending ViT to broader computer vision fields such as video understanding could also be explored.
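To illustrate how the four modules named in the abstract fit together, the following is a minimal NumPy sketch of a single-block ViT forward pass. It is not taken from the reviewed paper: the function names (`patchify`, `tiny_vit`, `init_params`), the sizes, the random weights, the prepended class token, the pre-norm layout, and the use of ReLU in place of GELU are all assumptions made for illustration.

```python
import numpy as np

def patchify(img, p):
    """Patch module: split an (H, W, C) image into flattened p x p patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)                 # (num_patches, p*p*c)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, heads):
    """Multi-head self-attention over a (tokens, dim) sequence."""
    n, d = x.shape
    dh = d // heads
    def split(t):                                   # (n, d) -> (heads, n, dh)
        return t.reshape(n, heads, dh).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)         # merge heads
    return out @ wo

def feed_forward(x, w1, w2):
    """Feed-forward network; ReLU stands in for GELU (assumption)."""
    return np.maximum(x @ w1, 0.0) @ w2

def init_params(rng, img_shape=(8, 8, 3), patch=4, dim=16, classes=10):
    """Hypothetical random parameters; a real model would learn these."""
    h, w, c = img_shape
    n = (h // patch) * (w // patch)
    g = lambda *s: rng.normal(0.0, 0.02, s)
    return {
        "embed": g(patch * patch * c, dim),
        "cls": g(1, dim),
        "pos": g(n + 1, dim),                       # learned position encoding
        "attn": [g(dim, dim) for _ in range(4)],    # wq, wk, wv, wo
        "ffn": [g(dim, 4 * dim), g(4 * dim, dim)],
        "head": g(dim, classes),
    }

def tiny_vit(img, params, patch=4, heads=2):
    tokens = patchify(img, patch) @ params["embed"]     # 1. patch module
    tokens = np.vstack([params["cls"], tokens])         # prepend class token
    tokens = tokens + params["pos"]                     # 2. position encoding
    x = tokens + multi_head_attention(layer_norm(tokens), *params["attn"], heads)  # 3.
    x = x + feed_forward(layer_norm(x), *params["ffn"])                            # 4.
    return layer_norm(x[0]) @ params["head"]            # class logits

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
logits = tiny_vit(img, init_params(rng))
print(logits.shape)
```

With an 8 x 8 image and 4 x 4 patches, the sequence has four patch tokens plus one class token. Because attention compares every token with every other token, its cost grows quadratically with the number of patches.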

ZHI Min (智敏); LU Jingfang (陆静芳)

College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, Inner Mongolia, China

Subject: Computer Science and Automation

Keywords: ViT model; image classification; multihead attention; feed-forward network layer; position encoding

Journal of Zhengzhou University (Engineering Science) (《郑州大学学报(工学版)》), 2024, No. 4

Pages 19-29 (11 pages)

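Funding: Natural Science Foundation of Inner Mongolia Autonomous Region (2023MS06009); Fundamental Research Funds of Inner Mongolia Normal University (2022JBXC018); Graduate Research Innovation Fund of Inner Mongolia Normal University (CXJJS22138)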

DOI: 10.13705/j.issn.1671-6833.2024.01.015
