Computer Engineering, 2017, Vol. 43, Issue 9: 51-55, 61, 6. DOI: 10.3969/j.issn.1000-3428.2017.09.010
A Data Blocking Algorithm Based on Label Propagation
Abstract
Data blocking can reduce the growing computational cost of Entity Resolution (ER) on large-scale data, but traditional algorithms struggle to balance efficiency and effectiveness. To achieve a better balance, this paper proposes a data blocking algorithm based on label propagation. In this algorithm, record similarity is estimated by the number of identical lexical items shared between records, and potential approximately duplicated records are detected by label propagation, which reduces the time complexity. Experimental results on a common test data set show that the proposed algorithm improves the F-Measure value and effectively reduces running time, making it suitable for data blocking on large-scale data.
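The two steps named in the abstract (similarity from shared lexical items, then label propagation to group likely duplicates) can be illustrated with a minimal sketch. The abstract does not give the paper's tokenization, similarity threshold, or update rule, so everything here is an assumption made for illustration: whitespace tokenization, a hypothetical `min_shared` threshold, asynchronous label updates with ties broken by the smallest label, and the function names themselves.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Split a record into a set of lowercase lexical items (assumed: whitespace tokens)."""
    return set(record.lower().split())


def shared_token_graph(records, min_shared=2):
    """Build a weighted graph: edge weight = number of lexical items two records share."""
    index = defaultdict(list)  # token -> ids of records containing it (ascending order)
    for rid, rec in enumerate(records):
        for tok in tokenize(rec):
            index[tok].append(rid)
    counts = defaultdict(int)  # (i, j) with i < j -> shared-token count
    for ids in index.values():
        for i, j in combinations(ids, 2):
            counts[(i, j)] += 1
    graph = defaultdict(dict)
    for (i, j), weight in counts.items():
        if weight >= min_shared:  # hypothetical threshold, not taken from the paper
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


def label_propagation(n, graph, max_iter=20):
    """Asynchronous label propagation: each node adopts its neighbours' heaviest label."""
    labels = list(range(n))  # every record starts in its own block
    for _ in range(max_iter):
        changed = False
        for node in range(n):
            neighbours = graph.get(node, {})
            if not neighbours:
                continue  # isolated records keep their own label
            votes = defaultdict(int)
            for nb, weight in neighbours.items():
                votes[labels[nb]] += weight
            best = max(votes.values())
            candidates = [lab for lab, v in votes.items() if v == best]
            if labels[node] not in candidates:
                labels[node] = min(candidates)  # deterministic tie-break (assumption)
                changed = True
        if not changed:
            break
    return labels


def block_records(records, min_shared=2):
    """Group record ids into blocks that share a propagated label."""
    graph = shared_token_graph(records, min_shared)
    labels = label_propagation(len(records), graph)
    blocks = defaultdict(list)
    for rid, lab in enumerate(labels):
        blocks[lab].append(rid)
    return sorted(blocks.values())
```

Counting shared tokens through an inverted index avoids comparing every pair of records directly, and only records that end up in the same block would then need a detailed ER comparison. Whether this matches the complexity argument in the paper cannot be confirmed from the abstract alone.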
Keywords
data quality / data cleaning / Entity Resolution (ER) / approximately duplicated record / data blocking / Label Propagation Algorithm (LPA)
Classification
Information technology and security science
Cite this article
Ran Detong (冉德彤), You Hongliang (游宏梁). A Data Blocking Algorithm Based on Label Propagation [J]. Computer Engineering, 2017, 43(9): 51-55, 61, 6.