
A Data Blocking Algorithm Based on Label Propagation (一种基于标签传播的数据分块算法)

RAN Detong (冉德彤), YOU Hongliang (游宏梁)

Computer Engineering (计算机工程), 2017, Vol. 43, Issue 9: 51-55, 61 (6 pages). DOI: 10.3969/j.issn.1000-3428.2017.09.010

RAN Detong (冉德彤) ¹, YOU Hongliang (游宏梁) ¹

Author information

  • 1. China Defense Science and Technology Information Center, Beijing 100142, China

Abstract

Data blocking can reduce the growing computational cost of Entity Resolution (ER) on large-scale data, but traditional algorithms struggle to balance efficiency and effectiveness. To achieve a better balance, this paper proposes a data blocking algorithm based on label propagation. The algorithm estimates record similarity by the number of identical lexical items shared between records, and detects potential approximately duplicated records by label propagation, which reduces time complexity. Experimental results on a common test data set show that the proposed algorithm improves the F-Measure and effectively reduces running time, making it suitable for data blocking on large-scale data.
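
The abstract names two steps: similarity from shared lexical items, then label propagation to group likely duplicates. The following is a minimal sketch of that general idea, not the authors' implementation. The whitespace tokenizer, the `min_shared` threshold, the tie-breaking rule, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(record):
    """Split a record into a set of lowercase tokens (assumed tokenizer)."""
    return set(record.lower().split())


def build_graph(records, min_shared=2):
    """Weight each record pair by the number of tokens the two records share.

    An inverted index (token -> record ids) limits comparisons to pairs that
    share at least one token. `min_shared` is a hypothetical threshold.
    """
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for tok in tokenize(rec):
            index[tok].add(rid)

    shared = defaultdict(Counter)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[a][b] += 1
            shared[b][a] += 1

    graph = {rid: {} for rid in range(len(records))}
    for a, nbrs in shared.items():
        for b, w in nbrs.items():
            if w >= min_shared:
                graph[a][b] = w
    return graph


def label_propagation(graph, max_iters=20):
    """Asynchronous weighted label propagation with deterministic ties.

    Each node adopts the label with the largest total edge weight among its
    neighbours; ties go to the smallest label (an assumed rule).
    """
    labels = {node: node for node in graph}
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated record keeps its own label
            scores = Counter()
            for nbr, w in graph[node].items():
                scores[labels[nbr]] += w
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def blocks(records, min_shared=2):
    """Group record indices into blocks by their final label."""
    labels = label_propagation(build_graph(records, min_shared))
    grouped = defaultdict(list)
    for rid, lab in labels.items():
        grouped[lab].append(rid)
    return sorted(grouped.values())
```

Only records that end up in the same block would then be compared in detail by a downstream ER step. Note that in this sketch a very frequent token still produces many candidate pairs, so a practical version would likely filter such tokens first.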

Key words

data quality/data cleaning/Entity Resolution (ER)/approximately duplicated record/data blocking/Label Propagation Algorithm (LPA)

Classification

Information Technology and Security Science

Cite this article

RAN Detong, YOU Hongliang. A Data Blocking Algorithm Based on Label Propagation[J]. Computer Engineering (计算机工程), 2017, 43(9): 51-55, 61.

Computer Engineering (计算机工程)

Indexed in: OA, Peking University Core Journals (北大核心), CSCD, CSTPCD

ISSN 1000-3428
