Computer Engineering and Applications, Issue (19): 123-127, 5. DOI: 10.3778/j.issn.1002-8331.1312-0162
Improved SNM algorithm based on length filtering and effective weights
Abstract
Heterogeneous database integration produces approximately duplicate records, but only in limited numbers. Detecting them with the traditional SNM algorithm requires comparing all records in the window, which is inefficient. To address this defect, an improved SNM algorithm based on length filtering and effective weights is proposed. Using the length ratio of two records in the window, record pairs that cannot be approximate duplicates are excluded first, which reduces the number of record comparisons and improves detection efficiency. By further setting a validity factor and a weight proportion for each record attribute, the algorithm computes effective weights and then detects duplicate records according to those weights, which improves both the recall ratio and the precision ratio. Experimental results show that the improved algorithm outperforms the SNM algorithm on various performance measures.
Keywords
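approximately duplicate records / data cleaning / effective weights / Sorted-Neighborhood Method (SNM)
Classification
Information Technology and Security Science
Cite this article
郭文龙 (Guo Wenlong). Improved SNM algorithm based on length filtering and effective weights [J]. Computer Engineering and Applications, 2014, (19): 123-127, 5.
Funding
Class A Science and Technology Project of the Fujian Provincial Department of Education (No. JA12335); Fujian Jiangxia University Young Research Talent Cultivation Fund (No. JXZ20130010).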
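The two steps described in the abstract (length filtering inside the sliding window, then a comparison driven by effective weights) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the function names, the `min_len_ratio` and `threshold` values, the rule that an attribute's validity factor is 1 when both records have a non-empty value and 0 otherwise, and the use of `difflib.SequenceMatcher` as the field similarity measure. It is not the author's implementation.

```python
from difflib import SequenceMatcher


def length_ratio(a, b):
    """Ratio of the shorter length to the longer one, in [0, 1]."""
    longer = max(len(a), len(b))
    return 1.0 if longer == 0 else min(len(a), len(b)) / longer


def effective_weights(rec_a, rec_b, weights):
    """Rescale attribute weights over the attributes that are valid.

    Assumption: an attribute is valid (factor 1) only when both records
    hold a non-empty value for it; otherwise its factor is 0.
    """
    valid = [1.0 if a and b else 0.0 for a, b in zip(rec_a, rec_b)]
    total = sum(w * v for w, v in zip(weights, valid))
    if total == 0:
        return [0.0] * len(weights)
    return [w * v / total for w, v in zip(weights, valid)]


def record_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-field similarities using effective weights."""
    eff = effective_weights(rec_a, rec_b, weights)
    return sum(
        w * SequenceMatcher(None, a, b).ratio()
        for w, a, b in zip(eff, rec_a, rec_b)
    )


def improved_snm(records, weights, key=lambda r: r[0], window=3,
                 min_len_ratio=0.6, threshold=0.85):
    """Sorted-neighborhood pass with a length filter before comparison."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Each record is compared with the next (window - 1) records.
        for j in order[pos + 1: pos + window]:
            a, b = records[i], records[j]
            # Length filter: skip pairs whose total lengths differ too much.
            if length_ratio("".join(a), "".join(b)) < min_len_ratio:
                continue
            if record_similarity(a, b, weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data: (name, address, phone)
records = [
    ("John Smith", "12 Main Street", "5551234"),
    ("Jon Smith", "12 Main Street", ""),
    ("Alice Wong", "99 Oak Avenue, Apartment 1234, Springfield", "5559876"),
]
weights = [0.5, 0.3, 0.2]

print(improved_snm(records, weights))  # prints [(0, 1)]
```

In this sample, the third record is dropped by the length filter before any field comparison takes place, and the missing phone number in the second record has its weight redistributed to the name and address fields instead of lowering the score.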