Abstract
To address the problems of inefficient feature extraction, inadequate feature fusion, and low prediction accuracy in current audio-video multimodal emotion analysis, a multimodal emotion analysis model based on feature fusion and multi-task learning was proposed. First, the pre-trained models BERT (bidirectional encoder representations from transformers), Wav2Vec (waveform to vector), and CLIP (contrastive language-image pre-training) were used to generate low-order feature representations from text, audio, and images respectively, which were then fed into a neural network to extract high-order features capturing local and temporal characteristics. Next, the proposed attention fusion module was employed to fuse the three modalities interactively. Finally, multi-task learning was integrated to further improve emotion recognition accuracy. Experimental results on the public Chinese multimodal dataset CH-SIMS show a significant improvement in emotion classification accuracy.
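The pipeline described above can be illustrated with a minimal sketch, assuming PyTorch: pre-trained encoders are taken to have already produced low-order feature sequences, a single-layer GRU stands in for the high-order (local and temporal) feature extractor, cross-modal attention plays the role of the fusion module, and separate heads give per-modality and fused predictions as in the CH-SIMS multi-task setup. All module names, dimensions, and pooling choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed PyTorch) of the pipeline outlined in the abstract.
# Dimensions are typical for BERT / Wav2Vec / CLIP outputs but are assumptions.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Cross-modal attention: one modality attends to the other two."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, key_value):
        fused, _ = self.attn(query, key_value, key_value)
        return fused


class MultimodalSentimentModel(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, image_dim=512, hidden=128):
        super().__init__()
        # Per-modality extractors of high-order (local + temporal) features;
        # a GRU is used here only as a stand-in.
        self.text_rnn = nn.GRU(text_dim, hidden, batch_first=True)
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.image_rnn = nn.GRU(image_dim, hidden, batch_first=True)
        self.fusion = AttentionFusion(hidden)
        # Multi-task heads: one sentiment score per modality plus one for the
        # fused representation, mirroring the CH-SIMS label layout.
        self.heads = nn.ModuleDict(
            {m: nn.Linear(hidden, 1) for m in ("text", "audio", "image", "fused")}
        )

    def forward(self, text_feat, audio_feat, image_feat):
        t, _ = self.text_rnn(text_feat)
        a, _ = self.audio_rnn(audio_feat)
        v, _ = self.image_rnn(image_feat)
        # Text attends to the concatenated audio/visual sequences.
        fused = self.fusion(t, torch.cat([a, v], dim=1)).mean(dim=1)
        return {
            "text": self.heads["text"](t.mean(dim=1)),
            "audio": self.heads["audio"](a.mean(dim=1)),
            "image": self.heads["image"](v.mean(dim=1)),
            "fused": self.heads["fused"](fused),
        }
```

In a multi-task setup of this kind, the total loss would typically be a weighted sum of the losses on the four outputs, so the unimodal labels regularize the fused prediction.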
Key words
video multimodality/emotion recognition/attention mechanism/multi-task learning
Classification
Information Technology and Security Science