Computer Technology and Development (计算机技术与发展), 2018, Vol. 28, Issue 6: 35-38, 4. DOI: 10.3969/j.issn.1673-629X.2018.06.008
Research on Parallelization of Concept-adapting Very Fast Decision Tree Classification Algorithm Based on Spark
Abstract
To improve the efficiency of classification and mining on stream data, we study a parallelization scheme for deploying CVFDT (concept-adapting very fast decision tree) on the stream computing platform Spark and design a Spark-based implementation of CVFDT. First, CVFDT is parallelized across attributes, that is, the split-point calculation is parallelized. Then, while building the CVFDT decision tree on Spark, all attribute lists of a node are transformed into Spark's resilient distributed datasets (RDDs); parallel tasks compute over each RDD, and the resulting optimal split points are collected and compared. The Hoeffding bound is calculated as the node-splitting condition to find the optimal split point, and the decision tree is created recursively. Experiments show that the classification efficiency of CVFDT in a Spark cluster environment improves significantly relative to a stand-alone environment. The improved parallel CVFDT adapts better to large-scale stream data processing, and a reasonable setting of RDD filtering can further improve classification efficiency.
Keywords
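data streams/CVFDT/parallelization/Spark/resilient distributed datasets
Classification
Information Technology and Security Science
Cite this article
庄荣 (Zhuang Rong), 李玲娟 (Li Lingjuan). Research on parallelization of CVFDT classification algorithm based on Spark [J]. Computer Technology and Development, 2018, 28(6): 35-38, 4.
Funding
National Natural Science Foundation of China (61302158, 61571238)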
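Illustrative sketch
The abstract describes collecting the best split point found for each attribute and using the Hoeffding bound as the node-splitting condition. The following is a minimal single-machine sketch of that comparison step only, not the authors' Spark implementation. It uses the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The function names, the dictionary of per-attribute gains (standing in for results gathered from parallel tasks), and the choice of a gain-style split measure with range R are assumptions made for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split measure (assumed, e.g. log2 of the
    number of classes for information gain).
    delta: allowed probability of choosing the wrong attribute.
    n: number of examples seen at the node.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_split(gains_by_attribute, value_range, delta, n):
    """Return the attribute to split on, or None if the evidence is insufficient.

    gains_by_attribute is a hypothetical stand-in for the per-attribute best
    split scores that parallel tasks would produce; it needs at least two entries.
    """
    ranked = sorted(gains_by_attribute.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, best_gain), (_, second_gain) = ranked[0], ranked[1]
    eps = hoeffding_bound(value_range, delta, n)
    # Split only when the best attribute beats the runner-up by more than eps.
    return best_attr if (best_gain - second_gain) > eps else None
```

In a cluster setting, the scores passed to `choose_split` would come from tasks that each evaluate one attribute list; here they are supplied directly so the splitting condition can be read in isolation.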