Journal of Donghua University (English Edition), 2025, Vol. 42, Issue (2): 187-196 (10 pages). DOI: 10.19884/j.1672-5220.202403011

An Efficient Temporal Decoding Module for Action Recognition

Huang Qiubo¹, Mei Jianmin¹, Zhao Wupeng¹, Lu Yiru¹, Wang Mei¹, Chen Dehua¹

Author information

1. College of Computer Science and Technology, Donghua University, Shanghai 201620

Abstract

Action recognition, a fundamental task in the field of video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on models pre-trained for image recognition, which limits their ability to extract temporal information. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to enhance the focus on temporal relationships between video frames. First, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, they are combined with the video frame features extracted by the backbone network after self-attention learning to extract video context information. Finally, these prediction queries, now carrying rich temporal features, are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48 and Breakfast datasets show that adding the proposed temporal decoder to TokShift-Transformer and VideoMAE, used as encoders, significantly improves Top-1 accuracy over the original models. Across the four datasets, the temporal decoder yields an average performance increase exceeding 11% for TokShift-Transformer and nearly 5% for VideoMAE. Furthermore, the work explores the combination of the decoder with various action recognition networks, including TimeSformer, as encoders, which results in an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
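The three decoding steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It rests on several assumptions: self-attention is applied to the queries before they attend to the frame features, attention is single-head with no projection matrices or layer normalization, the queries are averaged before a linear classification head, and random values stand in for parameters that would be learned. The names `TemporalDecoderSketch`, `num_queries`, `dim` and `num_classes` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


class TemporalDecoderSketch:
    """Hypothetical, simplified stand-in for a query-based temporal decoder."""

    def __init__(self, num_queries, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Random stand-ins for parameters that would normally be learned.
        self.queries = rng.normal(scale=0.02, size=(num_queries, dim))
        self.head = rng.normal(scale=0.02, size=(dim, num_classes))

    def __call__(self, frame_feats):
        # frame_feats: (T, dim) per-frame features from a backbone encoder.
        q = self.queries
        # Step 1: the prediction queries attend to each other.
        q = q + attention(q, q, q)
        # Step 2: the queries attend to the frame features (cross-attention).
        q = q + attention(q, frame_feats, frame_feats)
        # Step 3: assumption: average the queries, then classify.
        logits = q.mean(axis=0) @ self.head
        return softmax(logits)


# Usage with random stand-in features: 8 frames, 64-dim, 51 classes (as in HMDB51).
frames = np.random.default_rng(1).normal(size=(8, 64))
decoder = TemporalDecoderSketch(num_queries=4, dim=64, num_classes=51)
probs = decoder(frames)
```

Because the decoder only consumes a `(T, dim)` feature matrix, a sketch like this is agnostic to which backbone produced the features, which is the property the abstract relies on when pairing the module with different encoders.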

Key words

action recognition/video understanding/temporal relationship/temporal decoder/Transformer

Classification

Information Technology and Security Science

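Cite this article

Huang Qiubo, Mei Jianmin, Zhao Wupeng, Lu Yiru, Wang Mei, Chen Dehua. An Efficient Temporal Decoding Module for Action Recognition[J]. Journal of Donghua University (English Edition), 2025, 42(2): 187-196.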

Funding

Shanghai Municipal Commission of Economy and Information Technology, China (No. 202301054)

Journal of Donghua University (English Edition)

ISSN: 1672-5220
