Microcomputer Applications (微型电脑应用), 2025, Vol. 41, Issue 4: 21-24, 4.
基于多条件时间序列的海量并行数据清洗算法
Massive Parallel Data Cleaning Algorithm Based on Multi-conditional Time Series
Abstract
To address the duplicate, missing, and invalid records found in massive data sets across many fields, a massive parallel data cleaning algorithm based on multi-conditional time series is studied. The approximate symbol aggregation algorithm is used to discretize and symbolize the multi-conditional time series, and a similarity measure is then computed on the processed series. A massive data cleaning algorithm based on sequence similarity measurement is implemented on the MapReduce parallel computing platform so that the cleaning of massive data runs in parallel. The experimental results show that the distances between time series after cleaning are more consistent with the real values, and that high-quality data can be obtained through cleaning. The introduction of parallel processing also greatly reduces the data cleaning time.
Key words
multi-conditional time series/massive parallel data/data cleaning/MapReduce
Classification
Information Technology and Security Science
Cite this article
高祖彦, 段昌盛. Massive Parallel Data Cleaning Algorithm Based on Multi-conditional Time Series [J]. Microcomputer Applications (微型电脑应用), 2025, 41(4): 21-24, 4.
Funding
University Industry-University-Research Innovation Fund, Science and Technology Development Center, Ministry of Education (2018A03016)
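Illustrative sketch
The discretization and similarity steps named in the abstract can be illustrated with a small example. This is a minimal sketch, not the authors' implementation. It assumes that the "approximate symbol aggregation algorithm" refers to the standard Symbolic Aggregate approXimation (SAX) and that the similarity measure is the usual SAX lower-bounding distance (MINDIST). The function names, segment count, and alphabet size are hypothetical choices made for illustration. In a MapReduce job, a function like `sax` would plausibly run in the map phase, once per series, but that mapping is also an assumption.

```python
import bisect
import math
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale a series to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def breakpoints(alphabet_size):
    """Cut points that split the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax(series, segments, alphabet_size):
    """Convert a numeric series into a SAX word over 'a', 'b', 'c', ..."""
    cuts = breakpoints(alphabet_size)
    symbols = []
    for value in paa(znormalize(series), segments):
        index = bisect.bisect_right(cuts, value)
        symbols.append(chr(ord("a") + index))
    return "".join(symbols)


def mindist(word_a, word_b, series_length, alphabet_size):
    """SAX lower-bounding distance between two words of equal length."""
    cuts = breakpoints(alphabet_size)
    total = 0.0
    for ca, cb in zip(word_a, word_b):
        i, j = sorted((ord(ca) - ord("a"), ord(cb) - ord("a")))
        if j - i > 1:  # identical or adjacent symbols contribute zero
            total += (cuts[j - 1] - cuts[i]) ** 2
    return math.sqrt(series_length / len(word_a)) * math.sqrt(total)


word = sax([1, 2, 3, 4, 5, 6, 7, 8], segments=4, alphabet_size=4)
print(word)
```

Under these assumptions, two records whose SAX words have a small `mindist` would be candidates for duplicate detection, while the threshold and the handling of missing or invalid values are left unspecified here because the abstract does not state them.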