Application Research of Computers (计算机应用研究), 2025, Vol. 42, Issue (6): 1662-1667. DOI: 10.19734/j.issn.1001-3695.2024.10.0465
Automatic audio captioning based on multi-modal representation learning
Abstract
Modality discrepancies have long posed significant challenges for automatic audio captioning (AAC), as for multi-modal research in general. Helping models comprehend text information plays a pivotal role in establishing a seamless connection between the text and audio modalities. Recent studies have concentrated on narrowing the disparity between these two modalities via contrastive learning; however, bridging the gap with a simple contrastive loss function alone is difficult. To reduce the influence of modal differences and improve the model's use of features from both modalities, this paper proposes SimTLNet, an audio captioning method based on multi-modal representation learning. SimTLNet introduces a novel representation module, TRANSLATOR, constructs a twin representation structure, and jointly optimizes the model weights through contrastive learning and momentum updates, enabling the model to learn the common high-dimensional semantic information shared by the audio and text modalities. The proposed method achieves METEOR, CIDEr, and SPIDEr-FL scores of 0.251, 0.782, and 0.480 on the AudioCaps dataset, and 0.187, 0.475, and 0.303 on the Clotho V2 dataset, which are comparable with state-of-the-art methods and show that it effectively bridges the difference between the two modalities.
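The abstract describes SimTLNet only at a high level. As a rough, non-authoritative sketch of how a twin representation structure with a contrastive loss and momentum updates can be wired together, the following PyTorch code illustrates the general pattern; the module names (Translator, TwinRepresentation), dimensions, temperature, and momentum values are all assumptions for illustration and do not reproduce the paper's actual TRANSLATOR module.

# Minimal sketch, assuming MoCo-style EMA twins and a symmetric InfoNCE loss;
# not the paper's implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Translator(nn.Module):
    """Hypothetical projection head mapping encoder features to a shared space."""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

class TwinRepresentation(nn.Module):
    def __init__(self, momentum=0.995, temperature=0.07):
        super().__init__()
        self.audio_head = Translator()
        self.text_head = Translator()
        # Momentum ("twin") copies, updated by EMA instead of gradients.
        self.audio_head_m = copy.deepcopy(self.audio_head)
        self.text_head_m = copy.deepcopy(self.text_head)
        for p in self.audio_head_m.parameters():
            p.requires_grad = False
        for p in self.text_head_m.parameters():
            p.requires_grad = False
        self.m = momentum
        self.t = temperature

    @torch.no_grad()
    def _momentum_update(self):
        # EMA: slowly track the gradient-trained heads.
        for q, k in [(self.audio_head, self.audio_head_m),
                     (self.text_head, self.text_head_m)]:
            for pq, pk in zip(q.parameters(), k.parameters()):
                pk.data.mul_(self.m).add_(pq.data, alpha=1 - self.m)

    def forward(self, audio_feat, text_feat):
        za = self.audio_head(audio_feat)   # (B, D) audio embeddings
        zt = self.text_head(text_feat)     # (B, D) text embeddings
        with torch.no_grad():
            self._momentum_update()
            za_m = self.audio_head_m(audio_feat)  # momentum targets
            zt_m = self.text_head_m(text_feat)
        # Symmetric InfoNCE: audio queries vs. momentum text keys, and vice versa.
        logits_a2t = za @ zt_m.T / self.t
        logits_t2a = zt @ za_m.T / self.t
        labels = torch.arange(za.size(0), device=za.device)
        return 0.5 * (F.cross_entropy(logits_a2t, labels) +
                      F.cross_entropy(logits_t2a, labels))

A forward pass with random features, e.g. loss = TwinRepresentation()(torch.randn(8, 768), torch.randn(8, 768)), returns a scalar contrastive loss that would be combined with the captioning objective. The slowly moving momentum copies provide stable targets for the contrastive loss, a common device (as in MoCo- or ALBEF-style training) for stabilizing cross-modal alignment.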
Keywords
audio captioning / representation learning / contrastive learning / modality discrepancies / twin network
Classification
Computer and Automation
Citation
谭力文, 周翊, 柳银, 曹寅. Automatic audio captioning based on multi-modal representation learning [J]. Application Research of Computers (计算机应用研究), 2025, 42(6): 1662-1667.
Funding
National Natural Science Foundation of China (62301096)
Natural Science Foundation of Chongqing (CSTB2023NSCQ-MSX0659)
National Key Research and Development (R&D) Program of China (2024QY2630)
Xi'an Jiaotong-Liverpool University Research Development Fund (RDF-22-01-084)