计算机应用研究 (Application Research of Computers), 2025, Vol. 42, Issue 6: 1656-1661, 6. DOI: 10.19734/j.issn.1001-3695.2024.08.0369
多维度交叉注意力融合的视听分割网络
Audio-visual segmentation network with multi-dimensional cross-attention fusion
Abstract
Audio-visual segmentation (AVS) aims to locate and accurately segment the sounding objects in images based on both visual and auditory information. While most existing research focuses on methods for audio-visual information fusion, fine-grained audio-visual analysis remains underexplored, particularly the alignment of continuous audio features with spatial pixel-level information. This paper therefore proposes an audio-visual segmentation attention fusion (AVSAF) method based on contrastive learning. First, the method uses a multi-head cross-attention mechanism and a memory token to construct an audio-visual token fusion module that reduces the loss of multi-modal information. Second, it introduces contrastive learning to minimize the discrepancy between audio and visual features, enhancing their alignment. A dual-layer decoder is then employed to accurately predict and segment the target's position. Finally, extensive experiments were conducted on the S4 and MS3 sub-datasets of the AVSBench-Object dataset. The J value increases by 3.04 and 4.71 percentage points, respectively, and the F value increases by 2.4 and 3.5 percentage points, respectively, demonstrating the effectiveness of the proposed method on audio-visual segmentation tasks.
Key words
audio-visual segmentation/multi-modal/contrastive learning/attention mechanism
Classification
Information technology and security science
Cite this article
李凡凡, 张垣垣, 章永龙, 朱俊武. 多维度交叉注意力融合的视听分割网络 [Audio-visual segmentation network with multi-dimensional cross-attention fusion][J]. 计算机应用研究, 2025, 42(6): 1656-1661, 6.
Funding
National Natural Science Foundation of China (61872313)