Abstract
Objective Object tracking finds extensive applications in fields such as video surveillance, autonomous driving, and UAV aerial photography. The self-attention operations of existing Transformer-based trackers flatten 2D features into 1D sequences, which discards the spatial priors of target objects and degrades tracking performance. To address this limitation, this paper proposed a tracker named Pool-Swin Transformer Tracker (PSTransT). Methods PSTransT integrated each stage of the Swin Transformer with a pooling layer, enabling the effective extraction of features at different scales while preserving spatial positional information. Specifically, PSTransT utilized a cross-scale fusion-based Swin Transformer model for contextual modeling, in which each stage was parallelized with a pooling layer to effectively capture spatial features at various scales while maintaining feature richness. Furthermore, the method utilized a Transformer-based feature fusion network that leveraged the self-attention mechanism to jointly learn the correlations between template features and search features, aiming to better capture the dynamic changes of the tracked target and its local contextual information. Results The effectiveness of PSTransT was extensively evaluated on multiple benchmark datasets. The method achieved a success rate of 69.2% on LaSOT, which was 19.6% higher than SiamRPN++, and reached a mean average overlap (mAO) of 72.2% on GOT-10k, surpassing SiamRPN++ by 20.5%. Conclusion Experimental results demonstrate that preserving spatial prior information alongside contextual information benefits object tracking performance. PSTransT outperforms the other compared methods.
Key words
object tracking/Swin Transformer/pooling layer/contextual modeling
Classification
Information Technology and Security Science
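The two mechanisms summarized in the abstract — a pooling branch run in parallel with each backbone stage, and attention-based fusion of template and search features — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the identity stand-in for a Swin stage, the add-based fusion of the pooled branch, the single attention head, and the random projection weights are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool2d(x, k):
    # x: (H, W, C); non-overlapping k x k average pooling (H, W divisible by k)
    H, W, C = x.shape
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def pooled_parallel_stage(x, stage_fn, k=2):
    # Hypothetical cross-scale stage: run a (stand-in) Swin stage on x,
    # pool x in a parallel branch, upsample the pooled map back to full
    # resolution, and fuse by addition so the 2D spatial layout survives.
    main = stage_fn(x)
    pooled = avg_pool2d(x, k)
    up = pooled.repeat(k, axis=0).repeat(k, axis=1)  # nearest-neighbour upsample
    return main + up

def cross_attention(search_tokens, template_tokens):
    # Single-head attention: search tokens (queries) attend to template
    # tokens (keys/values). Projection weights are random stand-ins.
    d = search_tokens.shape[-1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = search_tokens @ Wq, template_tokens @ Wk, template_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # rows sum to 1
    return attn @ v

# Toy shapes: 8x8 search feature map and 16 template tokens, 16 channels.
rng = np.random.default_rng(1)
search_map = rng.standard_normal((8, 8, 16))
fused_map = pooled_parallel_stage(search_map, stage_fn=lambda t: t)
search_tokens = fused_map.reshape(-1, 16)      # flatten only at the fusion step
template_tokens = rng.standard_normal((16, 16))
out = cross_attention(search_tokens, template_tokens)
print(fused_map.shape, out.shape)  # (8, 8, 16) (64, 16)
```

The sketch keeps the feature map two-dimensional through the backbone stage and only flattens it for the fusion step, which mirrors the abstract's point that spatial priors should be preserved before attention-based correlation.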