Computer Engineering and Applications, Issue (20): 115-118, 4. DOI: 10.3778/j.issn.1002-8331.1210-0329
Massive data parallel random sampling based on Hadoop
Wan Wan (宛婉) 1, Zhou Guoxiang (周国祥) 1
Author information
- 1. School of Computer and Information, Hefei University of Technology, Hefei 230009, China
Abstract
In today's "information explosion" society, massive data poses new challenges for data mining. As data mining moves to cloud computing platforms for parallel processing, parallel random sampling offers a way to further reduce the volume of data to be processed. This paper presents a MapReduce-based parallel sampling algorithm that not only cleans up dirty data but also achieves equal-probability sampling, and it needs to scan the data being processed only once. The algorithm is run on the Hadoop platform and its performance is compared with that of common random sampling. The results show that the new algorithm achieves very high time efficiency. It is an effective method that lays a good foundation for future research on sampling and can also promote data mining on massive data.
Key words
cloud computing/Hadoop/MapReduce/parallel computing/data mining/random sampling
Classification
Information technology and security science
Cite this article
Wan Wan, Zhou Guoxiang. Massive data parallel random sampling based on Hadoop [J]. Computer Engineering and Applications, 2014, (20): 115-118, 4.
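
The abstract describes a single-scan, equal-probability sampling scheme split across map and reduce phases, with dirty data removed along the way. This record does not give the paper's actual algorithm, so the following is only a minimal sketch of one well-known way to get those properties: each clean record receives a uniform random tag, every map task keeps its k smallest tags, and the reduce step keeps the k smallest overall. The names `is_clean`, `map_sample`, and `reduce_sample`, and the cleaning rule itself, are hypothetical and chosen for illustration; the sketch runs in plain Python rather than on Hadoop.

```python
import heapq
import random


def is_clean(record):
    # Hypothetical cleaning rule (an assumption): drop empty or blank records.
    return bool(record and record.strip())


def map_sample(records, k, rng):
    """Map side: one pass over a split, keeping the k clean records
    that drew the smallest random tags."""
    heap = []  # max-heap on tag (stored negated), holds at most k entries
    for record in records:
        if not is_clean(record):
            continue  # dirty data is filtered during the same scan
        tag = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, record))
        elif tag < -heap[0][0]:
            # New tag beats the largest tag currently kept: swap it in.
            heapq.heapreplace(heap, (-tag, record))
    return [(-neg_tag, record) for neg_tag, record in heap]


def reduce_sample(partials, k):
    """Reduce side: merge per-split candidates and keep the k smallest tags.
    Tags are i.i.d. uniform, so every clean record is equally likely to be kept."""
    merged = [pair for part in partials for pair in part]
    return [record for _, record in heapq.nsmallest(k, merged)]


# Usage: three input splits, two of which contain dirty (blank) records.
rng = random.Random(42)
splits = [["a", "", "b", "c"], ["d", "  ", "e"], ["f"]]
partials = [map_sample(split, 3, rng) for split in splits]
sample = reduce_sample(partials, 3)
```

Because each map task emits at most k candidates, the data sent to the reduce step stays small regardless of the input size, and no second pass over the input is needed.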