Journal of Donghua University (English Edition), 2025, Vol. 42, Issue 2: 187-196 (10 pages). DOI: 10.19884/j.1672-5220.202403011
An Efficient Temporal Decoding Module for Action Recognition
Abstract
Action recognition, a fundamental task in video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on models pre-trained for image recognition, which limits their ability to extract temporal information. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to strengthen the focus on temporal relationships between video frames. First, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, after self-attention learning, the queries are combined with the video frame features extracted by the backbone network to extract video context information. Finally, these prediction queries, now carrying rich temporal features, are used for category prediction. Experiments on the HMDB51, MSRDailyAct3D, Diving48 and Breakfast datasets show that adding the proposed temporal decoder to TokShift-Transformer and VideoMAE encoders significantly improves Top-1 accuracy over the original models. Averaged across the four datasets, the temporal decoder improves performance by more than 11% for TokShift-Transformer and by nearly 5% for VideoMAE. Furthermore, the work explores combining the decoder with various other action recognition networks, including TimeSformer, as encoders, which yields an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
Keywords
action recognition / video understanding / temporal relationship / temporal decoder / Transformer
Classification
Computer and Automation
Citation
HUANG Qiubo, MEI Jianmin, ZHAO Wupeng, LU Yiru, WANG Mei, CHEN Dehua. An Efficient Temporal Decoding Module for Action Recognition[J]. Journal of Donghua University (English Edition), 2025, 42(2): 187-196.
Funding
Shanghai Municipal Commission of Economy and Information Technology, China (No. 202301054)
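Illustrative sketch
The three decoding steps described in the abstract (learnable queries, self-attention followed by combination with frame features, and category prediction) can be sketched roughly as follows. This is a minimal PyTorch illustration and not the authors' implementation: the class name `TemporalDecoderSketch`, the layer sizes, the number of queries, and the averaging of per-query logits are all assumptions made for the example. It relies on the standard `nn.TransformerDecoderLayer`, which applies self-attention over the queries and then cross-attention from the queries to the encoder output.

```python
import torch
import torch.nn as nn


class TemporalDecoderSketch(nn.Module):
    """Hypothetical query-based temporal decoder (not the authors' code)."""

    def __init__(self, feat_dim=768, num_queries=4, num_classes=51,
                 num_heads=8, num_layers=1):
        super().__init__()
        # Learnable video-level action category prediction queries.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        # Each standard decoder layer runs self-attention over the queries,
        # then cross-attention from the queries to the frame features.
        layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) from any backbone encoder.
        batch = frame_feats.size(0)
        tgt = self.queries.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(tgt=tgt, memory=frame_feats)
        # Assumption: average per-query logits into one video-level prediction.
        return self.classifier(decoded).mean(dim=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TemporalDecoderSketch(feat_dim=64, num_queries=4,
                                  num_classes=51, num_heads=4)
    feats = torch.randn(2, 8, 64)  # 2 clips, 8 frames, 64-dim frame features
    print(model(feats).shape)      # torch.Size([2, 51])
```

Because the sketch only consumes a `(batch, num_frames, feat_dim)` tensor, it could in principle sit on top of any encoder that outputs per-frame features, which is the kind of plug-in use the abstract describes; how the published module actually aggregates its queries should be checked against the linked repository.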