Acta Electronica Sinica (电子学报), 2025, Vol. 53, Issue 11: 4116-4131 (16 pages). DOI: 10.12263/DZXB.20250652

Leveraging Action Description Generation and Cross-Modal Semantic Alignment for Skeleton-Based Action Recognition

Li Yutong (李雨桐)¹, Ma Miao (马苗)², Chen Jianrui (陈建芮)²

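Author information

1. School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710119, China
2. School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710119, China; Key Laboratory of Modern Teaching Technology, Ministry of Education, Xi'an, Shaanxi 710062, China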

Abstract

Action recognition aims to model and analyze human motions to automatically identify and understand human behaviors, and it has been widely applied in various fields such as intelligent surveillance, human-computer interaction, and smart education. In recent years, self-supervised skeleton-based action recognition has emerged as an important research area due to its low computational cost, strong adaptability, and minimal reliance on labeled samples. However, existing methods often rely on template-based prompts to generate action concept descriptions, which suffer from the lack of spatio-temporal information and limited semantic modeling capability. To address these issues, this paper proposes a cross-modal prior-assisted self-supervised skeleton-based action recognition method, aiming to effectively integrate skeletal structural features with semantic priors to achieve more semantically rich action representations. On one hand, it employs a dual-branch decoupled skeleton encoder to separately model the spatial structure and temporal dynamics of actions, and integrates a cross-domain contrastive learning strategy to establish feature alignment and consistency constraints from spatial, temporal, and global perspectives, thereby obtaining skeleton-modal features rich in spatio-temporal structure and global context. On the other hand, it feeds temporally concatenated action images along with prompt instructions into a vision-language model to generate action descriptions, and utilizes the text encoder of the contrastive language-image pre-training (CLIP) model to extract text features, thereby supplementing the limited fine-grained semantic representation capability of the skeleton modality. Furthermore, a cross-modal contrastive learning strategy is proposed, where the textual semantics are dynamically modulated under the guidance of skeleton features using a feature-wise linear modulation (FiLM) mechanism, enabling effective semantic alignment between skeleton and text modalities. Experimental results show that the recognition accuracy of the proposed method outperforms more than ten state-of-the-art approaches, including C2VL, on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, and surpasses eight competitive methods, such as ACA2Net, on the PKU-MMD-II dataset. The proposed method integrates skeletal structural information with semantic priors, achieving effective complementarity between skeleton features and language semantics, and providing a new perspective for skeleton-based action recognition with low annotation cost. In future work, we will further explore domain-adaptive fine-tuning strategies to enhance the open-set description capability of vision-language models, and develop an online collaborative optimization framework to jointly optimize description generation and action recognition, thereby improving the practicality, intelligence, and interpretability of the proposed method in complex dynamic scenarios such as real-time human-computer interaction and smart education.
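
The abstract names FiLM modulation and cross-modal contrastive learning but gives no formulas. The following is only a minimal NumPy sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`film_modulate`, `info_nce`), the linear layers (`w_gamma`, `w_beta`, `w_proj`), the feature sizes, the random placeholder features standing in for skeleton-encoder and CLIP text-encoder outputs, and the use of a symmetric InfoNCE loss as the contrastive objective.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def film_modulate(text_feat, skel_feat, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: per-channel scale and shift of text features.

    gamma and beta are predicted from the skeleton features, so the
    skeleton modality steers how the text features are modulated.
    """
    gamma = skel_feat @ w_gamma + b_gamma  # (B, D_text) scale
    beta = skel_feat @ w_beta + b_beta     # (B, D_text) shift
    return gamma * text_feat + beta


def info_nce(skel_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input is a matched pair."""
    logits = l2_normalize(skel_emb) @ l2_normalize(text_emb).T / temperature

    def diag_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


# Placeholder data (hypothetical sizes; real features would come from the encoders).
rng = np.random.default_rng(0)
B, D_SKEL, D_TEXT = 4, 6, 4
skel = rng.normal(size=(B, D_SKEL))
text = rng.normal(size=(B, D_TEXT))

w_gamma = rng.normal(scale=0.1, size=(D_SKEL, D_TEXT))
b_gamma = np.ones(D_TEXT)    # start near identity scaling
w_beta = rng.normal(scale=0.1, size=(D_SKEL, D_TEXT))
b_beta = np.zeros(D_TEXT)    # start near zero shift
w_proj = rng.normal(size=(D_SKEL, D_TEXT))  # maps skeleton features to text width

text_mod = film_modulate(text, skel, w_gamma, b_gamma, w_beta, b_beta)
loss = info_nce(skel @ w_proj, text_mod)
print("contrastive loss:", float(loss))
```

In a trainable version these arrays would be learnable parameters updated by gradient descent; the sketch only shows the forward computation.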

Key words

skeleton-based action recognition / action description generation / cross-modal semantic alignment / vision-language model / contrastive learning / self-supervised learning

Classification

Information Technology and Security Science

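Cite this article

Li Yutong, Ma Miao, Chen Jianrui. Leveraging Action Description Generation and Cross-Modal Semantic Alignment for Skeleton-Based Action Recognition[J]. Acta Electronica Sinica, 2025, 53(11): 4116-4131, 16.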

Funding

National Natural Science Foundation of China (No. 62377031)

Acta Electronica Sinica

OA, CSCD

ISSN: 0372-2112
