Research on Parallelization of Concept-adapting Very Fast Decision Tree Classification Algorithm Based on Spark

Zhuang Rong¹, Li Lingjuan¹

Computer Technology and Development, 2018, Vol. 28, Issue (6): 35-38, 4. DOI: 10.3969/j.issn.1673-629X.2018.06.008

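Author information

1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China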

Abstract

To improve the efficiency of classifying and mining stream data, we study a parallelization scheme for deploying CVFDT (concept-adapting very fast decision tree) on the Spark stream computing platform and design an implementation of CVFDT based on Spark. First, CVFDT is parallelized across attributes, that is, the split-point calculation is parallelized. Then, while building the CVFDT decision tree on Spark, all attribute lists of a node are transformed into resilient distributed datasets (RDDs), Spark's distinctive data abstraction; parallel tasks compute the best split point for each RDD, and these per-attribute results are then collected and compared. The Hoeffding bound is calculated as the node-splitting condition to find the optimal split point, and the decision tree is created recursively. Experiments show that CVFDT classifies significantly more efficiently on a Spark cluster than in a stand-alone environment. The parallelized CVFDT adapts better to large-scale stream data processing, and a reasonable setting of RDD filtering can further improve classification efficiency.
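The split decision summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a thread pool stands in for Spark tasks over per-attribute RDDs, information gain is assumed as the split measure, attributes are assumed to be categorical, and all names (`info_gain`, `hoeffding_bound`, `choose_split`, the default `delta`) are hypothetical. The Hoeffding bound used is the standard form ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split measure, δ the allowed error probability, and n the number of examples seen at the node.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def info_gain(attr_values, labels):
    """Information gain of splitting on one categorical attribute."""
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_split(attribute_lists, labels, delta=1e-6):
    """Return the attribute to split on, or None if the evidence is insufficient.

    attribute_lists maps an attribute name to its column of values.
    At least two attributes are required.
    """
    names = list(attribute_lists)
    # One task per attribute list; the thread pool is only a stand-in
    # for the parallel Spark tasks described in the abstract.
    with ThreadPoolExecutor() as pool:
        gains = list(pool.map(lambda name: info_gain(attribute_lists[name], labels), names))
    # Collect the per-attribute results and compare the two best candidates.
    ranked = sorted(zip(gains, names), reverse=True)
    (best_gain, best_name), (second_gain, _) = ranked[0], ranked[1]
    # For information gain, the range R is log2(number of classes).
    eps = hoeffding_bound(math.log2(len(set(labels))), delta, len(labels))
    return best_name if best_gain - second_gain > eps else None
```

With few examples the bound is wide and `choose_split` returns `None`, so the node waits for more data; as n grows the bound shrinks and a clearly better attribute is eventually selected.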

Key words

data streams/CVFDT/parallelization/Spark/resilient distributed datasets

Classification

Information Technology and Security Science

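Cite this article

Zhuang Rong, Li Lingjuan. Research on Parallelization of Concept-adapting Very Fast Decision Tree Classification Algorithm Based on Spark[J]. Computer Technology and Development, 2018, 28(6): 35-38, 4.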

Funding

National Natural Science Foundation of China (61302158, 61571238)

Computer Technology and Development

OA, CSTPCD

ISSN: 1673-629X
