Computer & Digital Engineering (计算机与数字工程), 2019, Vol. 47, Issue 3: 625-627, 661, 4. DOI: 10.3969/j.issn.1672-9722.2019.03.028
基于SNM算法的大数据量中文商品清洗方法
Large Amount of Data in Chinese Commodity Cleaning Method Based on the SNM Algorithm
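ZHANG Miaomiao (张苗苗) 1, SU Yong (苏勇) 1
Author information
- 1. School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003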
Abstract
The SNM algorithm, i.e. the sorted-neighborhood (adjacent sorting) algorithm, is the most commonly used algorithm in English data cleaning. Because of semantic differences between English and Chinese, however, Chinese data cleaning has not yet developed a complete theory, and most existing Chinese data cleaning algorithms are based on English data cleaning algorithms. This article introduces data cleaning step by step and focuses on applying the SNM algorithm to Chinese data cleaning. It first introduces the traditional SNM algorithm and discusses its shortcomings, then improves on these defects and proposes practical application scenarios. An experiment comparing the traditional SNM method with the improved SNM algorithm shows that the improved algorithm has obvious advantages in eliminating similar duplicate records.
Key words
SNM algorithm/data cleaning/duplicate records
Classification
Information technology and security science
Cite this article
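ZHANG Miaomiao, SU Yong. Large Amount of Data in Chinese Commodity Cleaning Method Based on the SNM Algorithm [J]. Computer & Digital Engineering (计算机与数字工程), 2019, 47(3): 625-627, 661, 4.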