
DGS-CapNet: A Spatial-frequency-aware Model for SAR Image Captioning



Journal of Radars, 2026, Vol. 15, Issue 2: pp. 441-462 (22 pages). DOI: 10.12000/JR25250


Zhang Jinqi¹, Zhuang Di¹, Zhang Lamei¹, Zou Bin¹, Dong Hongwei², Si Lingyu², Meng Qingbiao³, Wu Youming³

Author information

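1. School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
2. National Key Laboratory of Space-based Integrated Information Systems, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
3. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China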


Abstract

Synthetic Aperture Radar (SAR), an active microwave remote sensing system, offers all-weather, day-and-night observation and has considerable application value in disaster monitoring, urban management, and military reconnaissance. Although deep learning has substantially advanced SAR image interpretation, existing target recognition and detection methods focus mainly on local feature extraction and single-target discrimination. They struggle to characterize the global semantic structure of complex scenes and the relationships among multiple targets, and the interpretation process still depends heavily on human expertise, with limited automation. SAR image captioning translates visual information into natural language and is a key technology for bridging the gap between "perceiving targets" and "understanding scenes"; it is therefore important for making SAR image interpretation more automated and intelligent. However, the speckle noise inherent to SAR imaging, the scarcity of textural detail, and the large semantic gap between SAR images and natural language make cross-modal understanding especially difficult. To address these challenges, this paper proposes DGS-CapNet, a spatial-frequency-aware model for SAR image captioning. First, a spatial-frequency-aware module is constructed. It uses a Discrete Cosine Transform (DCT) mask attention mechanism that reweights spectral components to suppress noise and enhance structure, combined with a Gabor multiscale texture enhancement submodule that improves sensitivity to directional and edge details. Second, a cross-modal semantic enhancement loss is designed to bridge the semantic gap between visual features and natural language through bidirectional image-text alignment and mutual information maximization. In addition, a large-scale, fine-grained SAR image captioning dataset, FSAR-Cap, containing 72,400 high-quality image-text pairs, is constructed. Experimental results show that the proposed method achieves CIDEr scores of 151.00 and 95.14 on the SARLANG and FSAR-Cap datasets, respectively. Qualitatively, the model effectively suppresses hallucinations and accurately captures fine-grained spatial and textural details, and it considerably outperforms mainstream methods.
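The abstract names two mechanisms without giving their equations: DCT mask attention, which reweights the spectral components of a feature map, and a loss built on bidirectional image-text alignment with mutual information maximization. The NumPy sketch below illustrates one plausible reading of each under stated assumptions; it is not the authors' implementation. `dct_mask_attention` applies an orthonormal 2-D DCT-II to a (C, H, W) feature map, multiplies the spectrum by sigmoid weights, and inverts the transform; the per-frequency `mask_logits` are a fixed input here, whereas in DGS-CapNet they are presumably learned or predicted. `bidirectional_infonce` is a standard symmetric InfoNCE contrastive loss, swapped in as an assumption for the paper's cross-modal semantic enhancement loss; InfoNCE is a common choice for mutual information maximization because, for N matched pairs, log N minus the loss lower-bounds the image-text mutual information. All function names and the temperature `tau = 0.07` are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: row k holds frequency k, so D @ x is the DCT of x."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # spatial index
    d = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] *= np.sqrt(1.0 / n)   # DC row scaling
    d[1:, :] *= np.sqrt(2.0 / n)  # AC row scaling
    return d


def dct_mask_attention(feat: np.ndarray, mask_logits: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: reweight the 2-D DCT spectrum of a (C, H, W) feature map.

    mask_logits has shape (H, W); here it is a fixed input standing in for a
    learned or predicted mask (an assumption, not the paper's design).
    """
    _, h, w = feat.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    spec = dh @ feat @ dw.T                    # forward 2-D DCT, per channel
    mask = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoid: weights in (0, 1)
    spec = spec * mask                         # same mask broadcast over channels
    return dh.T @ spec @ dw                    # inverse 2-D DCT (basis is orthonormal)


def _log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def bidirectional_infonce(img: np.ndarray, txt: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over N matched pairs; row i of img pairs with row i of txt.

    Assumed stand-in for the paper's cross-modal semantic enhancement loss.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau             # (N, N) cosine similarity / temperature
    idx = np.arange(logits.shape[0])       # positive pairs lie on the diagonal
    loss_i2t = -_log_softmax(logits)[idx, idx].mean()    # image -> text
    loss_t2i = -_log_softmax(logits.T)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((4, 16, 16))
    logits = np.full((16, 16), -6.0)  # hypothetical mask: damp most frequencies
    logits[:4, :4] = 6.0              # and keep the 4x4 lowest frequencies
    print(dct_mask_attention(feat, logits).shape)
    print(bidirectional_infonce(rng.standard_normal((8, 32)), rng.standard_normal((8, 32))))
```

Because the DCT basis is orthonormal, an all-ones mask returns the input unchanged, so in this sketch any suppression comes entirely from the mask weights. The Gabor multiscale texture branch is omitted.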


Key words

SAR image captioning; Spatial-frequency awareness; DCT mask attention; Multiscale texture enhancement; Cross-modal alignment; Image-text dataset

Classification

Information Technology and Security Science

Cite this article

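ZHANG Jinqi, ZHUANG Di, ZHANG Lamei, ZOU Bin, DONG Hongwei, SI Lingyu, MENG Qingbiao, WU Youming. DGS-CapNet: A spatial-frequency-aware model for SAR image captioning[J]. Journal of Radars, 2026, 15(2): 441-462.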

Funding

The National Natural Science Foundation of China (62271172)

Journal of Radars

ISSN: 2095-283X
