Journal of Radars 2026, Vol. 15, Issue (2): 441-462, 22. DOI: 10.12000/JR25250
DGS-CapNet: A Spatial-Frequency-Aware Model for SAR Image Captioning
Abstract
Synthetic Aperture Radar (SAR), as an active microwave remote sensing system, offers all-weather, all-day observation capabilities and has considerable application value in disaster monitoring, urban management, and military reconnaissance. Although deep learning techniques have achieved remarkable progress in interpreting SAR images, existing methods for target recognition and detection primarily focus on local feature extraction and single-target discrimination. They struggle to comprehensively characterize the global semantic structure and multitarget relationships in complex scenes, and the interpretation process remains highly dependent on human expertise, with limited automation. SAR image captioning aims to translate visual information into natural language, serving as a key technology for bridging the gap between "perceiving targets" and "cognizing scenes," which is of great importance for enhancing the automation and intelligence of SAR image interpretation. However, the inherent speckle noise, the scarcity of textural details, and the substantial semantic gap in SAR images further exacerbate the difficulty of cross-modal understanding. To address these challenges, this paper proposes a spatial-frequency-aware model for SAR image captioning. First, a spatial-frequency-aware module is constructed. It employs a Discrete Cosine Transform (DCT) mask attention mechanism to reweight spectral components for noise suppression and structure enhancement, combined with a Gabor multiscale texture enhancement submodule to improve sensitivity to directional and edge details. Second, a cross-modal semantic enhancement loss function is designed to bridge the semantic gap between visual features and natural language through bidirectional image-text alignment and mutual information maximization. Furthermore, a large-scale fine-grained SAR image captioning dataset, FSAR-Cap, containing 72,400 high-quality image-text pairs, is constructed. The experimental results demonstrate that the proposed method achieves CIDEr scores of 151.00 and 95.14 on the SARLANG and FSAR-Cap datasets, respectively. Qualitatively, the model effectively suppresses hallucinations and accurately captures fine-grained spatial-textural details, considerably outperforming mainstream methods.
Keywords
SAR image captioning / Spatial-frequency awareness / DCT mask attention / Multiscale texture enhancement / Cross-modal alignment / Image-text dataset
Classification
Information Technology and Security Science
Cite this article
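ZHANG Jinqi, ZHUANG Di, ZHANG Lamei, ZOU Bin, DONG Hongwei, SI Lingyu, MENG Qingbiao, WU Youming. DGS-CapNet: A Spatial-Frequency-Aware Model for SAR Image Captioning [J]. Journal of Radars, 2026, 15(2): 441-462, 22.
Funding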
The National Natural Science Foundation of China (62271172)
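
Illustrative sketch (not from the paper): the abstract describes a DCT mask attention mechanism that reweights spectral components for noise suppression. The NumPy sketch below shows only the underlying idea: transform a feature map with a 2-D DCT, multiply the spectrum by a mask, and transform back. The function names (`dct_matrix`, `lowpass_mask`, `dct_mask_reweight`) and the hand-set low-pass mask are assumptions made for illustration; in the paper the mask weights are presumably produced by an attention mechanism, and the abstract does not give its form.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix C, so that C @ x is the DCT of x."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalisation
    return c


def lowpass_mask(h: int, w: int, keep: int, floor: float = 0.2) -> np.ndarray:
    """Hypothetical fixed mask: weight 1 for low frequencies, `floor` elsewhere."""
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    return np.where(u + v < keep, 1.0, floor)


def dct_mask_reweight(feat: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Reweight the 2-D DCT spectrum of `feat` by `mask`, then invert."""
    h, w = feat.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ feat @ cw.T            # forward 2-D DCT
    return ch.T @ (mask * spectrum) @ cw   # inverse 2-D DCT (C is orthogonal)


# Usage: a smooth ramp corrupted by multiplicative, speckle-like noise.
rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 16)[:, None] * np.ones((16, 16))
noisy = clean * rng.gamma(shape=4.0, scale=0.25, size=clean.shape)
filtered = dct_mask_reweight(noisy, lowpass_mask(16, 16, keep=6))
```

An all-ones mask leaves the input unchanged, and a mask that keeps only the DC coefficient reduces the map to its mean, which gives two easy sanity checks for any learned mask that replaces the fixed one here.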
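
Illustrative sketch (not from the paper): the abstract mentions bidirectional image-text alignment combined with mutual information maximization but does not give the loss formula. One common way to realise both at once is a symmetric InfoNCE contrastive loss, which is a known lower-bound estimator of mutual information. The sketch below assumes that form; the name `bidirectional_alignment_loss` and the temperature value are hypothetical, and the paper's actual loss may differ.

```python
import numpy as np


def log_softmax(z: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def bidirectional_alignment_loss(img: np.ndarray, txt: np.ndarray,
                                 temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch; row i of `img` pairs with row i of `txt`."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # cosine similarities, scaled
    # Image-to-text: each image must pick out its own caption among all captions.
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    # Text-to-image: each caption must pick out its own image among all images.
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return float(0.5 * (i2t + t2i))


# Usage: perfectly matched pairs versus the same captions shifted by one slot.
emb = np.eye(4)
aligned = bidirectional_alignment_loss(emb, emb)
shuffled = bidirectional_alignment_loss(emb, emb[[1, 2, 3, 0]])
```

Averaging the two directions keeps the loss symmetric in its arguments, so neither modality is treated as the sole query side.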