Fine-grained 2D convolutional network model for action recognition based on spatio-temporal multi-scale correlation feature fusion
To address the problems that traditional 2-dimensional (2D) convolutional networks extract spatio-temporal features at only a single scale and make insufficient use of long-range temporal correlations between frames in fine-grained action datasets, this paper proposes a fine-grained 2D convolutional network model for action recognition based on spatio-temporal multi-scale correlation feature fusion. First, to model the multi-scale spatial correlations of videos and strengthen the spatial representation of fine-grained video data, the model uses a multi-scale 'feature squeeze and feature excitation' scheme that makes the spatial features extracted by the network richer and more effective. Then, to fully exploit the motion information along the time dimension of fine-grained video data, a temporal window self-attention mechanism is introduced; it retains the strong long-range dependency modeling ability of self-attention while performing attention operations only along the time dimension, so long-range temporal dependencies are modeled at a low computational cost. Finally, considering that the extracted spatio-temporal features contribute unequally to the classification of different action types, an adaptive feature fusion module is introduced that dynamically assigns different weights to the features to achieve adaptive feature fusion. The model reaches Top-1 accuracies of 86.0% and 46.9% on the two fine-grained action recognition datasets Diving48 and Something-Something V1, improving the Top-1 accuracy of the original backbone network by 3.8% and 1.3%, respectively. Experimental results show that, using only video frame information as input, the model achieves recognition accuracy comparable to existing algorithms based on Transformer and 3-dimensional convolutional neural networks (3D CNN).
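The abstract does not give the exact formulation of the multi-scale 'feature squeeze and feature excitation' module, so the following is only a minimal PyTorch-style sketch of the general idea: channel features are squeezed by adaptive average pooling at several hypothetical spatial scales, excited by small bottleneck MLPs, and the resulting channel weights are fused and applied to the input. All names (MultiScaleSE), the scale set (1, 2, 4), and the reduction ratio are illustrative assumptions, not the paper's actual design:

import torch
import torch.nn as nn

class MultiScaleSE(nn.Module):
    """Illustrative multi-scale squeeze-and-excitation block (assumed design).

    For each spatial scale s, the feature map is 'squeezed' by adaptive
    average pooling to s x s, passed through a bottleneck 'excitation' MLP,
    and the channel weights from all scales are averaged and applied back
    to the input feature map.
    """
    def __init__(self, channels: int, scales=(1, 2, 4), reduction: int = 16):
        super().__init__()
        self.scales = scales
        self.excite = nn.ModuleList([
            nn.Sequential(
                nn.Linear(channels * s * s, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )
            for s in scales
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) frame-level feature map from the 2D backbone
        n, c, _, _ = x.shape
        weights = []
        for s, mlp in zip(self.scales, self.excite):
            squeezed = nn.functional.adaptive_avg_pool2d(x, s)  # (N, C, s, s)
            weights.append(mlp(squeezed.flatten(1)))            # (N, C)
        w = torch.sigmoid(torch.stack(weights, dim=0).mean(0))  # fuse scales
        return x * w.view(n, c, 1, 1)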
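For the temporal window self-attention, the key cost argument is that attending only along the time axis costs O(T^2) per spatial location rather than O((THW)^2) for full spatio-temporal attention. The sketch below illustrates that idea under stated assumptions: the window partitioning details are not given in the abstract, so attention here runs over all T frames at each spatial position, and the module name and head count are hypothetical:

import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Illustrative temporal-only self-attention (assumed design).

    Each spatial location is treated as an independent sequence of T frame
    features; multi-head self-attention is applied along time only, so the
    quadratic cost is in T, not in T*H*W.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W) stacked per-frame features
        n, t, c, h, w = x.shape
        # fold spatial positions into the batch: (N*H*W, T, C)
        seq = x.permute(0, 3, 4, 1, 2).reshape(n * h * w, t, c)
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        seq = seq + out  # residual connection
        return seq.reshape(n, h, w, t, c).permute(0, 3, 4, 1, 2)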
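The adaptive feature fusion module is described only as dynamically assigning different weights to features; one plausible reading is a learned gate that blends two feature branches per channel. The sketch below shows that reading only; the gating network, its input (pooled spatial and temporal branch features), and the convex-combination form are all assumptions:

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative adaptive fusion of two feature branches (assumed design).

    A small gating network predicts per-channel weights from the
    concatenated branch features, so the relative contribution of the
    spatial and temporal branches can vary with the input action.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
        # spatial, temporal: (N, C) globally pooled features of each branch
        g = self.gate(torch.cat([spatial, temporal], dim=1))  # (N, C) in [0, 1]
        return g * spatial + (1.0 - g) * temporal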
HU Zhengping; WANG Xinyu; DONG Jiawei; ZHAO Yanshuang; LIU Yang
School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao 066004, China
fine-grained action recognition; multi-scale spatio-temporal correlation feature; long-range dependency modeling; self-attention mechanism
《高技术通讯》 (High Technology Letters), 2024(006)
Pages 590-601 (12 pages)
Supported by the National Natural Science Foundation of China General Program (61771420) and the National Natural Science Foundation of China (62001413).