Computer Engineering and Applications (计算机工程与应用), 2025, Vol. 61, Issue (23): 24-37, 14. DOI: 10.3778/j.issn.1002-8331.2503-0014
视觉Transformer在细粒度图像分类中的应用综述
Survey of Vision Transformers for Fine-Grained Image Classification
Abstract
Fine-grained image classification (FGIC) aims to distinguish subcategories that are visually highly similar and differ only in subtle details. With the rapid advancement of deep learning, FGIC algorithms have gradually shifted from traditional fully supervised learning to weakly supervised approaches. Vision Transformers (ViTs), which rely on multi-head self-attention, eliminate the reliance on manual annotations and overcome the limitations of convolutional neural networks (CNNs) in receptive field size and global modeling capacity, and have become one of the mainstream methods for this task. This paper first outlines the key characteristics and challenges of FGIC and briefly introduces the architecture and advantages of ViT. Existing ViT-based improvements are then categorized by feature fusion strategy into hierarchical fusion, multi-local fusion, and multi-granularity fusion. The modifications in each category are described in detail, and their underlying mechanisms are systematically analyzed and summarized. In addition, commonly used public datasets are reviewed, and future research directions are proposed in light of current limitations, with the aim of further exploring the potential of ViT in FGIC tasks.
Key words
fine-grained image classification (FGIC) / vision Transformer (ViT) / feature fusion
Classification
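Information technology and security science
Cite this article
温世雄, 智敏. Survey of Vision Transformers for Fine-Grained Image Classification [J]. Computer Engineering and Applications, 2025, 61(23): 24-37, 14.
Funding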
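Natural Science Foundation of Inner Mongolia (2023MS06009)
Key Scientific Research Project of Higher Education Institutions of Inner Mongolia (NJZZ21004)
Hohhot Basic Research and Applied Basic Research Project (2024-规-基-33)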