Computer Applications and Software, 2016, Vol. 33, Issue 11: 240-243, 296, 5. DOI: 10.3969/j.issn.1000-386x.2016.11.056
一种朴素贝叶斯文本分类算法的分布并行实现
DISTRIBUTED PARALLEL IMPLEMENTATION OF A NAIVE BAYESIAN TEXT CLASSIFICATION ALGORITHM
Abstract
To address the data sparsity, inaccurate classification, and low efficiency of the naive Bayes algorithm in text classification, this paper proposes a Dirichlet naive Bayes text classification algorithm based on MapReduce. First, the TF-IDF weights are corrected by adjusting them for factors such as how much meaning a feature term carries and how it is distributed across classes. Then, the Dirichlet data smoothing method from statistical language modeling is introduced to reduce the impact of sparse data on classification performance, and the algorithm is parallelized with the MapReduce programming model on the Hadoop cloud computing platform. Experimental comparison shows that the algorithm significantly improves on the accuracy and recall of the traditional naive Bayes text classification algorithm, and that it scales well and handles large data volumes.
Keywords
Naive Bayes / Text classification / TF-IDF correction / Data smoothing / MapReduce parallelization
Classification
Information Technology and Security Science
Cite this article
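Guo Xukun (郭绪坤), Fan Bingbing (范冰冰). Distributed parallel implementation of a naive Bayesian text classification algorithm [J]. Computer Applications and Software, 2016, 33(11): 240-243, 296, 5.
Funding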
Youth Project of the 2015 Major Scientific Research Program, Department of Education of Guangdong Province.
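
Illustrative sketch
The abstract names two ingredients, Dirichlet smoothing and MapReduce-style counting, without giving formulas. The Python sketch below shows one common way these fit together, under stated assumptions: it uses the standard Dirichlet-prior estimate from language modeling, P(w|c) = (count(w,c) + mu * P(w|C)) / (|c| + mu), and it imitates the map and reduce steps in a single process rather than on Hadoop. The function names, the value of `mu`, and the add-one background model are all hypothetical choices made for illustration, and the paper's TF-IDF correction is not reproduced because the abstract does not specify it.

```python
import math
from collections import Counter


def map_doc(label, tokens):
    """Map step: emit ((label, word), 1) for every token of one document."""
    return [((label, w), 1) for w in tokens]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (label, word) key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def train(docs):
    """Build count tables from (label, tokens) pairs."""
    pairs = [p for label, tokens in docs for p in map_doc(label, tokens)]
    counts = reduce_counts(pairs)
    class_len = Counter()   # total tokens per class, |c|
    coll = Counter()        # collection-wide word counts
    for (label, w), n in counts.items():
        class_len[label] += n
        coll[w] += n
    doc_count = Counter(label for label, _ in docs)
    return counts, class_len, coll, doc_count


def word_prob(w, label, model, mu=10.0):
    """Dirichlet-smoothed P(w | label); mu is an assumed prior strength."""
    counts, class_len, coll, _ = model
    total = sum(coll.values())
    # Assumed background model: add-one with a single slot for unseen words.
    p_bg = (coll[w] + 1) / (total + len(coll) + 1)
    return (counts[(label, w)] + mu * p_bg) / (class_len[label] + mu)


def classify(tokens, model, mu=10.0):
    """Return the label with the highest log posterior."""
    _, _, _, doc_count = model
    n_docs = sum(doc_count.values())

    def score(label):
        log_prior = math.log(doc_count[label] / n_docs)
        return log_prior + sum(math.log(word_prob(w, label, model, mu)) for w in tokens)

    return max(doc_count, key=score)


docs = [
    ("sports", ["ball", "goal", "ball"]),
    ("tech", ["code", "bug", "code"]),
]
model = train(docs)
print(classify(["ball", "goal"], model))
print(classify(["code", "unseen"], model))
```

Because the smoothed estimate interpolates toward the collection-wide distribution, a word never seen in a class still receives a nonzero probability, which is the sparsity problem the abstract refers to. In a real Hadoop job the map and reduce functions would run over distributed input splits; here they only mirror that structure.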