Duplicate Master Data Detection Algorithm Based on Credibility Model

Wang Jikui, Li Shaobo

Computer Engineering (计算机工程), Issue (5): 31-35, 40, 6. DOI: 10.3969/j.issn.1000-3428.2014.05.007

Wang Jikui (1), Li Shaobo (2)

Author information

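  • 1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610041
  • 2. Key Laboratory of Advanced Manufacturing Technology of the Ministry of Education, Guizhou University, Guiyang 550003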

Abstract

To prevent duplicate master data from multiple business systems from degrading master data quality, synchronization, and mining, this paper proposes the fastCdrDetection (Fast Cluster Duplicate Records Detection) algorithm. It includes a duplicate master data detection model and a credible record generating algorithm that take into account data source reliability, data refresh time, and data length. A non-recursive algorithm, FiledMatch, is established for string similarity calculation. To eliminate problems caused by abbreviations and misspellings in Chinese input, a sourceKeys algorithm is constructed to preprocess duplicate records that arise from the same business system and share the same business keys, which makes duplicate master data detection more efficient. Experiments are carried out on 630 thousand raw material records from a power grid and 230 thousand simulated data records. The results show that the recall rate of the fastCdrDetection algorithm is 88%, versus 74% for the PQS algorithm, and that its accuracy is 95%, versus 61%. This verifies the effectiveness of the algorithm.
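The two ideas named in the abstract, grouping records by source system and business key and then picking a credible record from each group, can be illustrated with a minimal sketch. It is not the paper's fastCdrDetection or sourceKeys implementation: the record fields, the `SOURCE_RELIABILITY` values, the helper names, and the lexicographic ranking (reliability first, then refresh date, then value length) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical reliability per source system (assumed values, not from the paper).
SOURCE_RELIABILITY = {"erp": 0.9, "crm": 0.6}


def group_by_source_key(records):
    """Group records that share the same source system and business key."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["source"], rec["key"])].append(rec)
    return dict(groups)


def credibility(rec):
    """Rank tuple: source reliability, then refresh date, then value length.

    The lexicographic ordering is an assumption of this sketch.
    """
    return (
        SOURCE_RELIABILITY.get(rec["source"], 0.0),
        rec["updated"],
        len(rec["name"]),
    )


def most_credible(cluster):
    """Return the record with the highest credibility rank in a cluster."""
    return max(cluster, key=credibility)


records = [
    {"source": "erp", "key": "M001", "name": "steel bolt M8", "updated": date(2013, 5, 1)},
    {"source": "erp", "key": "M001", "name": "steel bolt M8 x 40", "updated": date(2013, 9, 1)},
    {"source": "crm", "key": "C77", "name": "bolt", "updated": date(2014, 1, 1)},
]

groups = group_by_source_key(records)
best = most_credible(groups[("erp", "M001")])
print(best["name"])  # steel bolt M8 x 40
```

Here the two `erp` records tie on source reliability, so the more recently refreshed one is kept. A weighted sum of the three factors would be an equally plausible reading of the abstract.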

Key words

multiple data sources/duplicate master data/credibility model/detection algorithm/data credibility

Classification

Information technology and security science

Cite this article

Wang Jikui, Li Shaobo. Duplicate Master Data Detection Algorithm Based on Credibility Model[J]. Computer Engineering, 2014, (5): 31-35, 40, 6.

Funding

Supported by the National Key Technology R&D Program of China (2012BAF12B14).

Computer Engineering

OA/PKU Core/CSCD/CSTPCD

ISSN 1000-3428
