Parallel deep forest algorithm based on Spark and NRSCA strategy
To address the problems that parallel deep forest algorithms face in big data environments, namely too many redundant and irrelevant features, low utilization of the features at both ends of the feature vector, slow model convergence, and low parallel efficiency of the cascade forest, this paper proposes a parallel deep forest algorithm based on Spark and the NRSCA strategy (PDF-SNRSCA). First, the algorithm introduces a feature selection strategy based on neighborhood rough sets and the Fisher score (FS-NRS), which filters features by measuring their relevance and redundancy and thereby reduces the number of redundant and irrelevant features. Second, it introduces a scanning strategy based on random selection and equidistant extraction (S-RSEE), which ensures that all features are used with equal probability and solves the low utilization of end features in multi-grained scanning. Finally, the algorithm uses the Spark framework to train the cascade forest in parallel. It introduces a feature filtering mechanism based on an importance index (FFM-II) that screens out non-critical features and balances the dimensions of the augmented class vectors and the original class vectors, which accelerates model convergence. It also designs a task scheduling mechanism based on SCA (TSM-SCA) that redistributes tasks to keep the cluster load balanced, which solves the low parallel efficiency of the cascade forest. Experiments show that PDF-SNRSCA effectively improves the classification performance of deep forest and substantially improves the efficiency of its parallel training.
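The abstract names the Fisher score as one ingredient of FS-NRS but does not give its formula. As a rough illustration only, the sketch below computes the standard textbook Fisher score for each feature (between-class scatter of the class means divided by the within-class variance, both weighted by class size). It does not cover the neighborhood rough set part of FS-NRS, and the function name `fisher_scores` and the toy data are assumptions made for this example, not taken from the paper.

```python
from collections import defaultdict


def fisher_scores(X, y):
    """Return one Fisher score per feature column of X (a list of rows), given labels y.

    Hypothetical helper for illustration; not the paper's FS-NRS implementation.
    """
    n_features = len(X[0])

    # Group the rows by class label.
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = 0.0  # class-size-weighted spread of the class means
        within = 0.0   # class-size-weighted variance inside each class
        for rows in rows_by_class.values():
            values = [row[j] for row in rows]
            n_k = len(values)
            mean_k = sum(values) / n_k
            var_k = sum((v - mean_k) ** 2 for v in values) / n_k
            between += n_k * (mean_k - overall_mean) ** 2
            within += n_k * var_k
        if within == 0.0:
            # Feature is constant inside every class: perfectly separating or useless.
            scores.append(float("inf") if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.2, 0.8]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
```

A higher score suggests a feature whose class means are far apart relative to its spread inside each class, so on the toy data feature 0 ranks well above feature 1. How the paper combines this score with the neighborhood rough set measure is not stated in the abstract.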
Mao Yimin; Liu Shaofen
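School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000 || School of Information Engineering, Shaoguan University, Shaoguan, Guangdong 512026; School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000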
Computer Science and Automation
parallel deep forest algorithm; Spark framework; neighborhood rough sets; sine cosine algorithm; multi-grained scanning
Application Research of Computers, 2024 (001)
pp. 126-133 (8 pages)
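Guangdong Provincial Key Promotion Project (2022ZDJS048); Shaoguan Science and Technology Project (220607154531533); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0109605)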