Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2017, Vol. 11, Issue 6: 897-907 (11 pages). DOI: 10.3778/j.issn.1673-9418.1609008
Evaluation of Sequential Data Quality Using Spark (基于Spark的序列数据质量评价)
Abstract
Sequential data are prevalent in many real-world applications. Quality evaluation of sequential data, which has attracted attention from both academia and industry, is an important prerequisite for extracting knowledge from such data. Recently, a method based on the probabilistic suffix tree was proposed for evaluating sequential data quality. However, this method cannot handle large-scale data sets. To overcome this limitation, this paper proposes a Spark-based algorithm, called STALK (sequential data quality evaluation with Spark), for evaluating the quality of large-scale sequential data. Moreover, this paper uses novel pruning strategies to improve the efficiency of STALK. Specifically, on the Spark platform, large-scale sequential data are used to build the model efficiently, and the data quality of a query sequence can then be evaluated rapidly against the generated model. Experiments on real-world sequential data sets demonstrate that STALK is effective, efficient, and scalable.
Keywords
data quality/probabilistic suffix tree/Spark/parallel computing
Classification
Information Technology and Security Science
Cite this article
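HAN Chao (韩超), DUAN Lei (段磊), DENG Song (邓松), WANG Huifeng (王慧锋), TANG Changjie (唐常杰). Evaluation of Sequential Data Quality Using Spark [J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(6): 897-907.
Funding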
The National Natural Science Foundation of China under Grant Nos. 61572332 and 51507084;
the Postdoctoral Science Foundation of China under Grant Nos. 2016T90850 and 2016M591890;
the Fundamental Research Funds for the Central Universities of China under Grant No. 2016SCU04A22.