Special Video Recognition Based on Semantic Embedding Learning (基于语义嵌入学习的特类视频识别)

Wu Xiaoyu (吴晓雨), Pu Yujiang (蒲禹江), Wang Shengjin (王生进), Liu Zihao (刘子豪)

Acta Electronica Sinica (电子学报), 2023, Vol. 51, Issue 11: 3225-3237 (13 pages). DOI: 10.12263/DZXB.20220601

Special Video Recognition Based on Semantic Embedding Learning

Wu Xiaoyu (吴晓雨)¹, Pu Yujiang (蒲禹江)², Wang Shengjin (王生进)³, Liu Zihao (刘子豪)²

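Author information

1. School of Information and Communication Engineering, Communication University of China, Beijing 100024; State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024
2. School of Information and Communication Engineering, Communication University of China, Beijing 100024
3. Department of Electronic Engineering, Tsinghua University, Beijing 100084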

Abstract

As a special category of video, violent video is one of the hidden dangers facing the Internet environment, and intelligent recognition of violent videos is of great significance for maintaining Internet content security. Because violent videos are collected from diverse sources, their distribution usually shows large intra-class variance and small inter-class variance, and common violence recognition frameworks have difficulty adapting to complex and variable violent scenes. Meanwhile, the word "violence" is itself highly abstract, so learning a generic semantic representation of violence from limited data is a major difficulty. To address these problems, we present a novel multimodal violent video recognition model based on semantic embedding learning. The model consists of three main parts. (1) Multimodal feature extraction. Since videos are inherently multimodal, we use three different deep neural networks to extract feature representations of three modalities: appearance, motion, and audio. (2) Multimodal feature fusion. To obtain a robust, universal video representation, we design a lightweight multimodal feature fusion module, the Multimodal Efficient Fusion Module (MEFM). The module comprises common space mapping and multimodal feature interaction, which together suppress interference between information from different modalities while letting the multimodal features interact fully. (3) Semantic embedding learning. To accommodate violence datasets from different sources, we propose a multi-task learning method based on semantic embedding. It computes the semantic center of violence by introducing a center loss and uses a cosine embedding loss to pull violent samples toward the center while pushing non-violent samples away from it, forming a semantically discriminative feature representation that improves the model's generalization ability and reduces noise interference. Experiments on three publicly available datasets, VSD2015, Violent Flows, and RWF-2000, show that the proposed violent video recognition framework achieves competitive results, improving on the state of the art by 4.79%, 0.81%, and 1.5%, respectively.
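
To make the semantic embedding idea in part (3) more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact loss formulas, weights, or margin, so everything here is an assumption: the center is approximated by the mean of the violent embeddings (the paper learns it through a center loss), the cosine term follows the common cosine embedding loss definition, and the names `w_center`, `w_cos`, and `margin` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_center(vectors):
    """Assumed stand-in for the learned semantic center: the mean embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def center_loss(x, center):
    """Half the squared Euclidean distance between an embedding and the center."""
    return 0.5 * sum((a - c) ** 2 for a, c in zip(x, center))


def cosine_embedding_loss(x, center, is_violent, margin=0.0):
    """Pull violent samples toward the center; push non-violent ones away."""
    sim = cosine(x, center)
    if is_violent:
        return 1.0 - sim            # zero when perfectly aligned with the center
    return max(0.0, sim - margin)   # penalize only similarity above the margin


def semantic_embedding_loss(embeddings, labels, w_center=1.0, w_cos=1.0, margin=0.0):
    """Average of the weighted center and cosine terms over a batch.

    Assumes at least one violent sample (label True) in the batch.
    The weights and the margin are hypothetical hyperparameters.
    """
    violent = [x for x, y in zip(embeddings, labels) if y]
    center = mean_center(violent)
    total = 0.0
    for x, y in zip(embeddings, labels):
        if y:  # the center term applies to violent samples only
            total += w_center * center_loss(x, center)
        total += w_cos * cosine_embedding_loss(x, center, y, margin)
    return total / len(embeddings)
```

In a multi-task setup of the kind the abstract describes, a term like this would presumably be added to an ordinary classification loss; how the two are weighted is not stated in the abstract.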

Keywords

violent video recognition; multimodal feature fusion; semantic embedding; multi-task learning

Classification

Information technology and security science

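Cite this article

Wu Xiaoyu, Pu Yujiang, Wang Shengjin, Liu Zihao. Special Video Recognition Based on Semantic Embedding Learning [J]. Acta Electronica Sinica, 2023, 51(11): 3225-3237, 13.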

Funding

National Natural Science Foundation of China (No. 61801441)

Acta Electronica Sinica (电子学报)

Open access; indexed in Peking University Core Journals (北大核心), CSCD, and CSTPCD

ISSN: 0372-2112
