计算机工程与应用2018,Vol.54Issue(13):52-58,110,8.DOI:10.3778/j.issn.1002-8331.1705-0114
基于Spark的FP_Growth算法的并行与优化
Parallelization and optimization of FP_Growth algorithm based on Spark
摘要
Abstract
PFP_Growth algorithm is the parallelization of FP_Growth algorithm on the Hadoop platform based on MapReduce. The algorithm does not consider the balance of the load while grouping the transaction set, which causes the time inconsistency of different nodes to accomplish the tasks and even a bigger difference. Thus, it reduces the efficiency of the algorithm. To improve the efficiency of the algorithm, this paper proposes a Spark-based RPFP algorithm, which optimizes PFP_Growth algorithm through balancing the groups and reducing the time complexity. To balance the group, the large load is placed into the group with the smallest total load. The address of the element is fast accessed by adding a Hash table to the head table, which reduces the time complexity. Experimental results show that RPFP algorithm effectively improves the mining efficiency of the frequent itemsets.关键词
FP_Growth算法/频繁项集挖掘/负载均衡/链头表结构/SparkKey words
FP_Growth algorithm/frequent itemset mining/load balance/head table/Spark分类
信息技术与安全科学引用本文复制引用
石陆魁,张欣,师胜利..基于Spark的FP_Growth算法的并行与优化[J].计算机工程与应用,2018,54(13):52-58,110,8.基金项目
河北省自然科学基金(No.F2016202144,No.F2017202145). (No.F2016202144,No.F2017202145)