Quality control method for class-imbalanced oceanographic data based on multi-model combination
OA; Peking University Core Journals; CSTPCD
This paper proposes a two-layer quality control framework for ocean data based on a combination of multiple models. Several common classification algorithms serve as base learners that make a primary prediction of the data quality labels, and a Voting or Stacking strategy then determines the quality flag of the data. To address class imbalance, an adaptive undersampling strategy lowers the imbalance ratio of the data, and the Focal Loss function improves the model's ability to recognize hard-to-classify samples. The method is validated on sea surface temperature and air temperature data from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS). The results show that, for the very rare class of erroneous samples, the F1 score (the harmonic mean of precision and recall) of the Voting or Stacking methods can reach 0.9806 and 0.9812 on sea surface temperature data, and 0.9985 and 0.9983 on air temperature data.
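The building blocks named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The label convention (1 = erroneous record, 0 = good record), the function names, the fixed `max_ratio` undersampling rule (a simple stand-in for the paper's adaptive strategy, whose details the abstract does not give), and the focal loss defaults `gamma=2.0` and `alpha=0.25` are all assumptions made for illustration.

```python
import math
import random

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample; p is the predicted probability of class 1.

    gamma and alpha defaults are illustrative assumptions, not values from the paper.
    """
    p = min(max(p, 1e-12), 1 - 1e-12)          # avoid log(0)
    p_t = p if y == 1 else 1 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha   # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def undersample(samples, labels, max_ratio, seed=0):
    """Randomly drop majority-class (label 0) samples so that
    majority/minority <= max_ratio. A fixed-ratio stand-in (assumption),
    not the adaptive strategy described in the paper."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(max_ratio * len(minority)))
    chosen = sorted(minority + rng.sample(majority, keep))
    return [samples[i] for i in chosen], [labels[i] for i in chosen]

def majority_vote(predictions):
    """Hard voting: predictions is a list of per-model label lists.
    Returns the per-sample majority label; ties go to class 1 (assumption)."""
    n_models = len(predictions)
    return [1 if 2 * sum(col) >= n_models else 0 for col in zip(*predictions)]

def f1_score(y_true, y_pred):
    """F1 of the positive (erroneous) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this sketch the focal loss shrinks the contribution of confidently classified samples through the factor `(1 - p_t) ** gamma`, so training concentrates on hard samples, while undersampling reduces the imbalance ratio before the base learners are fitted. A Stacking variant would replace `majority_vote` with a second-layer model trained on the base learners' outputs.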
宋巍;张贵庆;谢京容;董明媚;岳心阳;杨扬
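College of Information Technology, Shanghai Ocean University, Shanghai 201306; National Marine Data and Information Service, Tianjin 300171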
Oceanography
Keywords: quality control; ocean-atmosphere data; ensemble learning; class imbalance
Marine Forecasts (《海洋预报》), 2024, No. 3
Pages 61-70 (10 pages)
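Funding: National Key Research and Development Program of China (2021YFC3101601); Capacity Building Project for Local Universities, Science and Technology Commission of Shanghai Municipality (20050501900).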