

Knowledge Graph-Based Video Classification Algorithm for Film and Television Drama



Because video is perceived through several modalities, a complete hierarchical video tag classification algorithm combines visual and textual modalities to train a joint model that infers video content. However, most existing studies are applicable only to coarse-grained classification, whereas classification for film and television drama requires finer-grained identification. This study proposes a knowledge graph-based video classification algorithm. Firstly, the algorithm extracts visual and textual features using a multimodal pre-training model that is trained on large-scale generic data. A multi-task video label prediction model is then trained to obtain three levels of labels for the video: content labels, theme labels, and entity labels. Introducing a similarity task into the multi-task network increases the difficulty of training the classification model. The similarity task fits similar samples more tightly, while the learned features better express the differences between samples. Secondly, for entity labels, an entity correction model with a local attention head is proposed. By introducing co-occurrence information from the knowledge graph, it can fuse, de-duplicate, or extend the prediction results and produce a more accurate entity label prediction. Based on semi-structured data retrieved from Douban, this paper constructs a film and television knowledge graph and conducts an empirical study of the video tag classification model for film and television. Experimental results show that, firstly, the cross-entropy loss function and the loss function of the similarity task jointly constrain the training of the classification model, which optimizes the feature representation. Top-1 accuracy is improved by 3.70%, 3.35%, and 16.57% for content labels, theme labels, and entity labels, respectively. Secondly, the entity correction model with global/local attention heads improves the Top-1 accuracy of entity labels from 38.7% to 45.6% after the introduction of knowledge graph information. This work is a new attempt at multimodal video classification using image-text pair data and offers a new research direction for short-video classification when only a small number of data samples are available.
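
The joint constraint described in the abstract, a cross-entropy loss over the three label levels combined with a similarity-task loss, can be illustrated with a minimal sketch. The abstract does not give the form of the similarity loss, so the pairwise cosine term, the `margin` and `weight` parameters, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_loss(u, v, same_label, margin=0.5):
    """Assumed pairwise term (hypothetical): pull same-label pairs
    together and push different-label pairs below `margin`."""
    sim = cosine_similarity(u, v)
    if same_label:
        return 1.0 - sim
    return max(0.0, sim - margin)


def joint_loss(level_logits, level_targets, emb_a, emb_b, same_label, weight=1.0):
    """Sum of per-level cross-entropy (content, theme, entity) plus the
    weighted similarity term. `weight` is an assumed hyperparameter."""
    ce = sum(
        softmax_cross_entropy(logits, target)
        for logits, target in zip(level_logits, level_targets)
    )
    return ce + weight * similarity_loss(emb_a, emb_b, same_label)


# Toy example: one video with logits for the three label levels, and a
# pair of feature vectors that share a label.
logits = [[2.0, 0.1], [0.3, 1.5, 0.2], [0.0, 0.0, 0.0, 3.0]]
targets = [0, 1, 3]
loss = joint_loss(logits, targets, [1.0, 0.0], [0.9, 0.1], same_label=True)
```

In this sketch both terms act on the same features, so lowering the total loss requires features that classify correctly at every level and also keep same-label samples close, which is the shared constraint the abstract credits for the improved feature representation.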


School of Information, Renmin University of China, Beijing 100872, China



knowledge graph; video label classification; multimodal content understanding; entity correction

Journal of Frontiers of Computer Science and Technology, 2024 (001)

161-174 / 14

This work was supported by the General Program of the National Natural Science Foundation of China (72071203).

