Computer Engineering and Applications (计算机工程与应用), Issue (16): 192-197, 6. DOI: 10.3778/j.issn.1002-8331.1309-0424
Research on method to detect reduplicative Chinese short texts (中文短文本去重方法研究)
Abstract
This article presents an effective algorithm framework for text de-duplication, focusing on the redundancy problem in Chinese short texts. Because short texts are brief and arrive in huge volumes, the framework draws on the Bloom Filter, the Trie tree, and the SimHash algorithm. In the first stage, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near-duplicates. The article also specifies the parameters used in the framework and verifies its feasibility and rationality.
Keywords
text de-duplication / Chinese short texts / Bloom Filter / Trie tree / SimHash algorithm
Classification
Information Technology and Security Science
Cite this article
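GAO Xiang, LI Bing. Research on method to detect reduplicative Chinese short texts [J]. Computer Engineering and Applications, 2014, (16): 192-197, 6.
Funding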
Humanities and Social Sciences Project of the Ministry of Education (No. 11YJA870017).
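
The two-stage framework summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Bloom filter size (`num_bits`), the number of hash functions (`num_hashes`), the character-bigram features, the 64-bit fingerprint, and the Hamming-distance threshold (`max_distance`) are all assumed values chosen for illustration, since the abstract does not state the parameters the paper uses.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):  # assumed sizes
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, text):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{text}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def simhash(text, bits=64):
    """SimHash fingerprint over character bigrams (an assumed feature choice)."""
    features = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):  # threshold is an assumption
    seen = BloomFilter()
    kept, fingerprints = [], []
    for text in texts:
        # Stage 1: drop exact duplicates with the Bloom filter.
        if text in seen:
            continue
        seen.add(text)
        # Stage 2: drop near-duplicates whose SimHash fingerprints are close.
        fp = simhash(text)
        if any(hamming(fp, other) <= max_distance for other in fingerprints):
            continue
        kept.append(text)
        fingerprints.append(fp)
    return kept
```

Calling `deduplicate(list_of_texts)` returns the texts that survive both stages, in their original order. Two caveats apply to this sketch: a Bloom filter can occasionally report an unseen text as a duplicate, and the pairwise fingerprint comparison in stage 2 is quadratic in the number of kept texts, so a large corpus would need an index over the fingerprints.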