Audio-visual segmentation network with multi-dimensional cross-attention fusion

Li Fanfan (李凡凡)¹, Zhang Yuanyuan (张垣垣)¹, Zhang Yonglong (章永龙)¹, Zhu Junwu (朱俊武)¹

Application Research of Computers (计算机应用研究), 2025, Vol. 42, Issue 6: 1656-1661, 6. DOI: 10.19734/j.issn.1001-3695.2024.08.0369

Audio-visual segmentation network with multi-dimensional cross-attention fusion

Author information

1. College of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225100, China
Abstract

Audio-visual segmentation (AVS) aims to locate and accurately segment sounding objects in images using both visual and auditory information. Most existing research focuses on methods for fusing audio-visual information, but fine-grained audio-visual analysis remains underexplored, particularly the alignment of continuous audio features with spatial pixel-level information. This paper therefore proposes an audio-visual segmentation attention fusion (AVSAF) method based on contrastive learning. First, the method uses a multi-head cross-attention mechanism and memory tokens to construct an audio-visual token fusion module that reduces the loss of multi-modal information. Second, it introduces contrastive learning to minimize the discrepancy between audio and visual features, enhancing their alignment. A dual-layer decoder is then employed to accurately predict and segment the target's position. Finally, extensive experiments are carried out on the S4 and MS3 subsets of the AVSBench-Object dataset. The J value increases by 3.04 and 4.71 percentage points, respectively, and the F value increases by 2.4 and 3.5 percentage points, respectively, demonstrating the effectiveness of the proposed method on audio-visual segmentation tasks.
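
The abstract does not specify the exact contrastive objective, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE loss over paired audio and visual embeddings. The function names (`l2_normalize`, `info_nce`), the temperature value, and the use of plain Python lists are all assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    audio[i] and visual[i] are assumed to come from the same clip
    (a positive pair); every other combination is treated as a negative.
    The temperature of 0.07 is an illustrative default, not a value
    reported in the paper.
    """
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in visual]
    n = len(a)

    # logits[i][j] = cosine similarity of audio i and visual j, scaled.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the audio-to-visual and visual-to-audio directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))
```

With this formulation the loss is close to zero when each audio embedding points in the same direction as its paired visual embedding, and it grows when the pairs are mismatched, which is the sense in which contrastive learning reduces the discrepancy between the two modalities.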

Keywords

audio-visual segmentation; multi-modal; contrastive learning; attention mechanism

Classification

Information technology and security science

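Cite this article

Li Fanfan, Zhang Yuanyuan, Zhang Yonglong, Zhu Junwu. Audio-visual segmentation network with multi-dimensional cross-attention fusion[J]. Application Research of Computers, 2025, 42(6): 1656-1661, 6.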

Funding

National Natural Science Foundation of China (61872313)

Application Research of Computers (计算机应用研究)

Open access (OA); Peking University Core Journal

ISSN: 1001-3695