Parallel Random Sampling of Massive Data on the Hadoop Platform

Wan Wan, Zhou Guoxiang

Computer Engineering and Applications, Issue (20): 115-118, 4. DOI: 10.3778/j.issn.1002-8331.1210-0329

Massive data parallel random sampling based on Hadoop

Wan Wan¹, Zhou Guoxiang¹

Author information

1. School of Computer and Information, Hefei University of Technology, Hefei 230009

Abstract

In today's "information explosion" society, data mining faces new challenges because of massive data volumes. As data mining moves to cloud computing platforms for parallel execution, parallel random sampling offers a way to further reduce the size of the data. This paper presents a MapReduce parallel sampling algorithm that both cleans up dirty data and achieves equal-probability sampling. The algorithm needs to scan the data being processed only once. The algorithm is run on the Hadoop platform and its performance is compared with that of common random sampling. The results show that the new algorithm achieves very high time efficiency. It is an effective method that lays a good foundation for future research on sampling and can also promote data mining on massive data.
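The abstract does not describe the algorithm's internals, so the following is only a minimal sketch of one common way to get single-pass, equal-probability sampling with data cleaning in a map/reduce shape. It is not the authors' method. It assumes a random-tag scheme: every clean record gets a uniform random tag, each split keeps its k smallest tags, and the merge step keeps the k smallest tags overall. The names `is_clean`, `map_split`, and `reduce_samples` are hypothetical, the cleaning rule is a placeholder, and the code is plain Python rather than the Hadoop API.

```python
import heapq
import random


def is_clean(record):
    # Hypothetical cleaning rule (an assumption): drop empty or blank records.
    return bool(record and record.strip())


def map_split(records, k, rng):
    """Map phase: scan one split once, keeping the k clean records
    with the smallest random tags."""
    heap = []  # max-heap on tag, stored as (-tag, record), at most k entries
    for record in records:
        if not is_clean(record):
            continue  # dirty data is filtered during the same scan
        tag = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, record))
        elif -tag > heap[0][0]:
            # New tag is smaller than the largest tag kept so far.
            heapq.heapreplace(heap, (-tag, record))
    return [(-neg_tag, record) for neg_tag, record in heap]


def reduce_samples(partials, k):
    """Reduce phase: merge per-split candidates and keep the k smallest tags."""
    merged = [pair for part in partials for pair in part]
    return [record for _, record in heapq.nsmallest(k, merged)]


if __name__ == "__main__":
    rng = random.Random(42)
    splits = [["a", "", "b"], ["c", "   ", "d", "e"], ["f"]]
    partials = [map_split(split, 3, rng) for split in splits]
    print(reduce_samples(partials, 3))
```

Because the tags are independent and uniform, the k records with the smallest tags form a uniform sample without replacement of all clean records, regardless of how the data is split. Each split forwards at most k candidates, so the merge step stays small.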

Key words

cloud computing/Hadoop/MapReduce/parallel computing/data mining/random sampling

Classification

Information Technology and Security Science

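Cite this article

Wan Wan, Zhou Guoxiang. Parallel Random Sampling of Massive Data on the Hadoop Platform [J]. Computer Engineering and Applications, 2014, (20): 115-118, 4.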

Computer Engineering and Applications

OA / Peking University Core Journal / CSCD / CSTPCD

ISSN: 1002-8331
