Automatic audio captioning method based on multi-modal representation learning

Tan Liwen (谭力文), Zhou Yi (周翊), Liu Yin (柳银), Cao Yin (曹寅)

Application Research of Computers (计算机应用研究), 2025, Vol. 42, Issue 6: 1662-1667, 6. DOI: 10.19734/j.issn.1001-3695.2024.10.0465

Automatic audio captioning based on multi-modal representation learning

Tan Liwen¹, Zhou Yi¹, Liu Yin¹, Cao Yin²

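Author information

  • 1. School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065
  • 2. Department of Intelligent Science, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215000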

Abstract

Modality discrepancies have long posed a significant challenge for automatic audio captioning (AAC) and for multi-modal research in general. Helping the model comprehend text information is key to connecting the text and audio modalities. Recent studies have concentrated on narrowing the gap between these two modalities via contrastive learning. However, bridging the gap with only a simple contrastive loss function is difficult. To reduce the influence of modality differences and make better use of the features of both modalities, this paper proposes SimTLNet, an audio captioning method based on multi-modal representation learning. The method introduces a novel representation module, TRANSLATOR, constructs a twin representation structure, and jointly optimizes the model weights through contrastive learning and momentum updates, which enables the model to concurrently learn the common high-dimensional semantic information shared by the audio and text modalities. The proposed method achieves 0.251, 0.782, and 0.480 for METEOR, CIDEr, and SPIDEr-FL on the AudioCaps dataset and 0.187, 0.475, and 0.303 on the Clotho V2 dataset, respectively. These results are comparable with state-of-the-art methods and indicate that the method effectively bridges the difference between the two modalities.
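
The abstract names two training ingredients, contrastive learning and momentum updates, without giving formulas. The sketch below shows one common way these are realized; it is not SimTLNet's actual implementation. It pairs a symmetric InfoNCE-style loss over matched audio and text embeddings with an exponential-moving-average (EMA) update of a second, slowly changing copy of the weights. The function names, the temperature of 0.07, the momentum coefficient, and the use of NumPy are all assumptions made for illustration; the TRANSLATOR module and the twin representation structure are not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss; row i of audio_emb is assumed to pair with row i of text_emb.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matched pairs sit on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)


def momentum_update(target_params, online_params, m=0.99):
    """EMA update: the target weights drift slowly toward the online weights.

    Parameters are plain dicts of arrays (a hypothetical layout for illustration).
    """
    return {
        name: m * target_params[name] + (1.0 - m) * online_params[name]
        for name in target_params
    }
```

Under these assumptions, a batch whose audio and text embeddings are correctly paired yields a lower loss than the same batch with the pairing shuffled, and a momentum coefficient close to 1 keeps the target copy nearly unchanged at each step.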

Key words

audio captioning/representation learning/contrastive learning/modality discrepancies/twin network

Classification

Computers and automation

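Cite this article

Tan Liwen, Zhou Yi, Liu Yin, Cao Yin. Automatic audio captioning based on multi-modal representation learning[J]. Application Research of Computers, 2025, 42(6): 1662-1667, 6.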

Funding

National Natural Science Foundation of China (62301096)

Natural Science Foundation of Chongqing (CSTB2023NSCQ-MSX0659)

National Key Research and Development (R&D) Program (2024QY2630)

Xi'an Jiaotong-Liverpool University funded project (RDF-22-01-084)

Application Research of Computers

Open access (OA); Peking University Core Journal

ISSN 1001-3695
