
Research on De-duplication Methods for Chinese Short Texts (中文短文本去重方法研究)

高翔 李兵

Computer Engineering and Applications (计算机工程与应用), Issue (16): 192-197, 6. DOI: 10.3778/j.issn.1002-8331.1309-0424

Research on method to detect reduplicative Chinese short texts

高翔¹ 李兵²

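Author information

  • 1. HSBC Business School, Peking University, Shenzhen, Guangdong 518055
  • 2. School of Information, University of International Business and Economics, Beijing 100029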

Abstract

This article presents an effective algorithmic framework for text de-duplication, focusing on the redundancy problem in Chinese short texts. Because short texts are brief and arrive in huge volumes, the framework draws on the Bloom Filter, the Trie tree, and the SimHash algorithm. In the first stage, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near-duplicates. The article specifies the parameters used in the framework and verifies its feasibility and rationality.
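
The two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Bloom filter size and hash count, the use of MD5, the character-bigram features, the Hamming-distance threshold of 3, and all function and class names are assumptions chosen for illustration, and the linear scan over stored fingerprints is kept only for brevity.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are illustrative values."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, text):
        # Derive k bit positions by salting a single hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{text}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def simhash(text, bits=64):
    """SimHash fingerprint over character bigrams (an assumed feature choice)."""
    features = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Stage 1 drops exact duplicates; stage 2 drops near-duplicates."""
    seen = BloomFilter()
    kept, fingerprints = [], []
    for text in texts:
        if text in seen:
            continue  # exact duplicate (or a rare Bloom false positive)
        seen.add(text)
        fp = simhash(text)
        if any(hamming(fp, other) <= max_distance for other in fingerprints):
            continue  # near-duplicate of a text that was already kept
        kept.append(text)
        fingerprints.append(fp)
    return kept
```

Calling `deduplicate(list_of_texts)` returns the texts that survive both stages, in their original order. Because a Bloom filter can report false positives, a small fraction of unseen texts may be dropped in stage 1; a Trie tree, the alternative named in the abstract, would avoid this at the cost of more memory.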

Keywords

text de-duplication/Chinese short texts/Bloom Filter/Trie tree/SimHash algorithm

Classification

Information Technology and Security Science

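Cite this article

高翔, 李兵. 中文短文本去重方法研究 (Research on method to detect reduplicative Chinese short texts) [J]. 计算机工程与应用 (Computer Engineering and Applications), 2014, (16): 192-197, 6.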

Funding

Humanities and Social Sciences Project of the Ministry of Education (No. 11YJA870017).

Computer Engineering and Applications (计算机工程与应用)

Indexing: OA, Peking University Core Journals (北大核心), CSCD, CSTPCD

ISSN: 1002-8331
