
Duplicate Master Data Detection Algorithm Based on Credibility Model

Wang Jikui, Li Shaobo

Computer Engineering (计算机工程), Issue (5): 31-35, 40, 6. DOI: 10.3969/j.issn.1000-3428.2014.05.007

Duplicate Master Data Detection Algorithm Based on Credibility Model

Wang Jikui¹, Li Shaobo²

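Author information

1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610041
2. Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang 550003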

Abstract

To prevent duplicate master data from multiple business systems from degrading master data quality, synchronization, and mining, this paper proposes the fastCdrDetection (Fast Cluster Duplicate Records Detection) algorithm. It comprises a duplicate master data detection model and a credible record generation algorithm that takes data source reliability, data refresh time, and data length into account. A non-recursive algorithm, FiledMatch, is established for string similarity calculation. To address the problems caused by abbreviations and misspellings in Chinese input, a sourceKeys algorithm is constructed to preprocess duplicate records that originate from the same business system and share the same business keys, which improves the efficiency of duplicate master data detection. Experiments are carried out on 630 thousand raw material records from a power grid and 230 thousand simulated records. The results show that fastCdrDetection achieves a recall of 88%, compared with 74% for the PQS algorithm, and an accuracy of 95%, compared with 61%, which verifies the effectiveness of the algorithm.
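
The abstract names three factors for choosing a credible record from a cluster of duplicates (data source reliability, data refresh time, and data length) but does not give the formula. The following is only a minimal sketch of how such a score might be combined. The weighted sum, the weights, the freshness decay, the `SOURCE_RELIABILITY` table, and the names `Record`, `credibility`, and `most_credible` are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    source: str    # business system the record came from
    updated: date  # last refresh time of the record
    value: str     # the master data field being compared


# Hypothetical per-source reliability in [0, 1]; not from the paper.
SOURCE_RELIABILITY = {"erp": 0.9, "crm": 0.6}


def credibility(rec, newest, longest, weights=(0.5, 0.3, 0.2)):
    """Assumed weighted sum of reliability, freshness, and length."""
    w_src, w_time, w_len = weights
    reliability = SOURCE_RELIABILITY.get(rec.source, 0.0)
    # Freshness decays with the age in days relative to the newest record.
    age_days = (newest - rec.updated).days
    freshness = 1.0 / (1.0 + age_days)
    # Longer values are treated as more complete, relative to the longest one.
    completeness = len(rec.value) / longest if longest else 0.0
    return w_src * reliability + w_time * freshness + w_len * completeness


def most_credible(cluster):
    """Return the record with the highest assumed credibility score."""
    newest = max(r.updated for r in cluster)
    longest = max(len(r.value) for r in cluster)
    return max(cluster, key=lambda r: credibility(r, newest, longest))
```

With these assumed weights, a record from a less reliable source can still win if it is much fresher, which is the kind of trade-off a credibility model has to settle. The actual weighting used by fastCdrDetection would have to be taken from the full paper.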

Key words

multiple data source/duplicate master data/credibility model/detection algorithm/data credibility

Classification

Information Technology and Security Science

Cite this article

Wang Jikui, Li Shaobo. Duplicate Master Data Detection Algorithm Based on Credibility Model[J]. Computer Engineering, 2014, (5): 31-35, 40, 6.

Funding

Supported by the National Key Technology R&D Program of China (2012BAF12B14).

Computer Engineering (计算机工程)

OA / Peking University Core Journals / CSCD / CSTPCD

ISSN: 1000-3428
