Computer Technology and Development, 2019, Vol. 29, Issue (3): 30-34, 5. DOI: 10.3969/j.issn.1673-629X.2019.03.006
Research on Parallelization of Association Rules Mining Algorithm Based on Spark
Abstract
Association rule mining is an important data mining task. Association rule mining algorithms uncover latent relationships in data, and the Apriori algorithm is a typical representative. Spark is a distributed, memory-based big data framework well suited to iterative computation. To improve the efficiency of mining strong association rules, we propose a Spark-based parallelization scheme for the Apriori algorithm. The scheme uses the distributed architecture and cluster scheduling mechanism of the Spark platform to distribute the transaction data set across multiple sub-nodes. Each sub-node invokes transformation operations to obtain its local candidate itemsets and their support, and stores them in memory. The local candidate itemsets from all nodes are then aggregated to generate the global candidate itemsets and the global frequent itemsets. The process iterates until no next-level candidate set exists. Performance experiments show that the parallel Apriori algorithm on the Spark platform can effectively find frequent itemsets in large data sets and extract strong association rules, with high accuracy and timeliness.
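The iteration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not call Spark: plain Python lists stand in for the data partitions held by the sub-nodes, and every name (`local_counts`, `next_candidates`, `parallel_apriori`, the sample transactions, the minimum support of 3) is a hypothetical choice made for illustration. Candidate generation here enumerates item combinations for brevity, which only suits small examples.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, candidates):
    """Count, within one partition, how many transactions contain each candidate."""
    counts = Counter()
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:  # candidate is a subset of the transaction
                counts[cand] += 1
    return counts


def next_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori property)."""
    items = sorted({item for itemset in frequent for item in itemset})
    result = set()
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1)):
            result.add(frozenset(combo))
    return result


def parallel_apriori(partitions, min_support):
    """Return {itemset: global support count} for every frequent itemset."""
    # Level 1 candidates: every single item seen in any partition.
    candidates = {frozenset([item]) for part in partitions for t in part for item in t}
    frequent = {}
    k = 1
    while candidates:
        # "Map" step: each partition counts its local candidates independently.
        partials = [local_counts(part, candidates) for part in partitions]
        # "Reduce" step: sum the local counts into global support counts.
        total = sum(partials, Counter())
        level = {c: n for c, n in total.items() if n >= min_support}
        frequent.update(level)
        k += 1
        candidates = next_candidates(set(level), k)  # empty set ends the loop
    return frequent


# Hypothetical transaction data, split into two partitions.
partitions = [
    [{"bread", "milk"}, {"bread", "diaper", "beer"}],
    [{"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}, {"bread", "milk", "beer"}],
]
frequent = parallel_apriori(partitions, min_support=3)
```

In a real Spark job the per-partition counting and the summation would presumably be expressed as RDD transformations and an aggregation; the sketch only mirrors that structure, with the loop ending once no next-level candidates remain.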
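Key words
Apriori/association rules/parallelization/Spark/recommendation algorithm/frequent itemsets/mining
Classification
Information Technology and Security Science
Cite this article
Xu Dexin (许德心), Li Lingjuan (李玲娟). Research on Parallelization of Association Rules Mining Algorithm Based on Spark[J]. Computer Technology and Development, 2019, 29(3): 30-34, 5.
Funding
National Natural Science Foundation of China (61302158, 61571238)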