计算机工程 (Computer Engineering), Issue (5): 31-35, 40, 6. DOI: 10.3969/j.issn.1000-3428.2014.05.007
基于可信度模型的重复主数据检测算法
Duplicate Master Data Detection Algorithm Based on Credibility Model
Abstract
To avoid the effect of duplicate master data from multiple business systems on the quality, synchronization, and mining of master data, this paper proposes the fastCdrDetection (Fast Cluster Duplicate Records Detection) algorithm. It includes a duplicate master data detection model and a credible record generating algorithm, both of which take into account data source reliability, data refresh time, and data length. A non-recursive algorithm, FiledMatch, is established for character string similarity calculation. To address the problems caused by abbreviations and misspellings in Chinese input, a sourceKeys algorithm is constructed to preprocess duplicate records that originate from the same business system and share the same business keys, which makes duplicate master data detection more efficient. Experiments are carried out on power grid data comprising 630 thousand raw material records and 230 thousand simulated records. The results show that the recall rate of the fastCdrDetection algorithm is 88%, compared with 74% for the PQS algorithm, and that its accuracy is 95%, compared with 61%. The effectiveness of the algorithm is thus verified.
Key words
multiple data sources/duplicate master data/credibility model/detection algorithm/data credibility
Classification
Information Technology and Security Science
Cite this article
王继奎 (Wang Jikui), 李少波 (Li Shaobo). Duplicate Master Data Detection Algorithm Based on Credibility Model [J]. 计算机工程 (Computer Engineering), 2014, (5): 31-35, 40, 6.
Funding
Supported by the National Key Technology R&D Program of China (2012BAF12B14).