Abstract
Existing models that extract buildings from remote sensing images often miss buildings and produce incomplete shapes. To address these problems, we constructed a dual-branch encoder building extraction model that combines a convolutional neural network (CNN) and a vision transformer (ViT). We built the CNN branch from parameterized convolution shuffle units and the ViT branch from deformable self-attention transformers, and the two branches encode features independently. Between the encoder and the decoder, we introduced a semantic and detail infusion layer reconstructed with triplet attention to fuse and filter the multi-scale feature maps. In the decoder, we used a dynamic upsampling unit to precisely reconstruct the large-scale feature maps, and we computed the training loss as a weighted combination of the generalized Dice loss and the focal loss. Experimental results on the MDD and WHU-Building datasets show that the model achieves F1-scores of 91.57% and 92.34%, respectively, which are 7.14% and 6.51% higher than those of the Swin-UNet model. The extracted building shapes are more realistic and complete, and the model outperforms current mainstream semantic segmentation models in extraction accuracy and generalization.
Key words
remote sensing building extraction/CNN-ViT dual-branch encoder/triplet attention/semantic and detail infusion/dynamic upsampling
Classification
Astronomy and Earth Science
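The training loss named in the abstract, a weighted combination of the generalized Dice loss and the focal loss, can be illustrated with a minimal sketch. The abstract gives neither the combination weights nor the focal-loss hyperparameters, so `dice_weight`, `gamma`, `alpha`, and `EPS` below are assumed values chosen for illustration only, and the function names are hypothetical. The sketch treats building extraction as binary segmentation over a flat list of pixels and uses plain Python rather than the authors' training framework.

```python
import math

EPS = 1e-7  # numerical guard against division by zero and log(0); assumed value


def generalized_dice_loss(probs, target):
    """Generalized Dice loss for binary segmentation over flat pixel lists.

    probs:  predicted building probabilities in [0, 1]
    target: ground-truth labels, 1 for building and 0 for background
    """
    numerator = 0.0
    denominator = 0.0
    # Background (0) and building (1) are treated as two classes, each
    # weighted by the inverse squared count of its ground-truth pixels.
    for cls in (0, 1):
        p = [q if cls == 1 else 1.0 - q for q in probs]
        g = [1.0 if t == cls else 0.0 for t in target]
        weight = 1.0 / (sum(g) ** 2 + EPS)
        numerator += weight * sum(pi * gi for pi, gi in zip(p, g))
        denominator += weight * sum(pi + gi for pi, gi in zip(p, g))
    return 1.0 - 2.0 * numerator / (denominator + EPS)


def focal_loss(probs, target, gamma=2.0, alpha=0.25):
    """Mean binary focal loss; gamma and alpha are assumed defaults."""
    total = 0.0
    for q, t in zip(probs, target):
        # p_t is the probability assigned to the true class of this pixel.
        p_t = q if t == 1 else 1.0 - q
        alpha_t = alpha if t == 1 else 1.0 - alpha
        p_t = min(max(p_t, EPS), 1.0)
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def combined_loss(probs, target, dice_weight=0.5):
    """Weighted sum of the two losses; dice_weight is an assumed value."""
    dice = generalized_dice_loss(probs, target)
    focal = focal_loss(probs, target)
    return dice_weight * dice + (1.0 - dice_weight) * focal


mask = [1, 1, 0, 0]
good = combined_loss([0.9, 0.8, 0.1, 0.2], mask)  # mostly correct prediction
bad = combined_loss([0.1, 0.2, 0.9, 0.8], mask)   # mostly wrong prediction
print(good < bad)
```

The Dice term responds to region overlap and class imbalance, while the focal term down-weights easy pixels, which is one plausible reason for combining them when buildings occupy a small share of each image.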