Journal of Frontiers of Computer Science and Technology, 2017, Vol. 11, Issue 5: 732-741, 10. DOI: 10.3778/j.issn.1673-9418.1608041
Feature Extension and Category Research for Short Text Based on Spark Platform
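WANG Wen 1, ZHAO Kankan 2, LI Cuiping 1, CHEN Hong 2, SUN Hui 1
Author information
- 1. Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education, Renmin University of China, Beijing 100872
- 2. School of Information, Renmin University of China, Beijing 100872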
Abstract
Short text classification often suffers from high feature dimensionality, sparse features, and poor classification accuracy. Feature extension can address these problems effectively, but it greatly reduces execution efficiency. To improve both the accuracy and the efficiency of short text classification, this paper proposes an association-rule-based feature extension method designed for the Spark platform. First, given a background data set for the short text corpus, the method extends the original corpus and complements its features by mining association rules and their corresponding confidences. Next, it applies a new cascade SVM (support vector machine) algorithm with distance-based selection during classification. Finally, the feature extension and classification algorithms for short text are implemented on the Spark platform, so that distributed processing improves the efficiency of short text handling. Experiments show that, compared with the traditional method, the new method achieves a 4-fold improvement in efficiency and a 15% increase in classification accuracy, of which feature extension contributes 10% and classification optimization 5%.
Keywords
short text classification / feature extension / association rule / Spark platform
Classification
Information Technology and Security Science
Cite this article
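WANG Wen, ZHAO Kankan, LI Cuiping, CHEN Hong, SUN Hui. Feature Extension and Category Research for Short Text Based on Spark Platform [J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(5): 732-741, 10.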
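
Illustrative sketch

The abstract describes extending short texts with features drawn from association rules and their confidences, but gives no algorithmic detail. The following is a minimal single-machine sketch of that general idea, not the paper's Spark implementation. The function names `mine_rules` and `extend_features`, the restriction to one-to-one rules, the thresholds, and the use of the rule confidence as the weight of an added feature are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_rules(background, min_support=0.3, min_confidence=0.6):
    """Mine one-to-one rules a -> b from tokenized background documents.

    Returns {a: {b: confidence}}. Thresholds are illustrative assumptions.
    """
    n_docs = len(background)
    term_count = Counter()
    pair_count = Counter()
    for doc in background:
        terms = sorted(set(doc))  # count each term once per document
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))

    rules = {}
    for (a, b), n_ab in pair_count.items():
        if n_ab / n_docs < min_support:
            continue  # the pair co-occurs too rarely
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n_ab / term_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.setdefault(lhs, {})[rhs] = confidence
    return rules


def extend_features(doc, rules):
    """Return {term: weight}; original terms get 1.0, added terms get the
    highest confidence of any rule that implies them (an assumption)."""
    features = {term: 1.0 for term in doc}
    for term in doc:
        for rhs, confidence in rules.get(term, {}).items():
            if rhs not in doc:
                features[rhs] = max(features.get(rhs, 0.0), confidence)
    return features


# Toy background corpus, invented for this example.
background = [
    ["spark", "rdd", "cluster"],
    ["spark", "rdd"],
    ["spark", "cluster"],
    ["svm", "kernel"],
    ["svm", "kernel", "margin"],
]
rules = mine_rules(background)
extended_rdd = extend_features(["rdd"], rules)
extended_spark = extend_features(["spark"], rules)
```

In this toy corpus every document containing "rdd" also contains "spark", so the one-word text `["rdd"]` gains "spark" as an extra feature. On Spark, the counting step would presumably be distributed across the cluster; that part is not shown here.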