
Study on a Large-Scale Document Deduplication Algorithm Based on Language Cadence

陈钒 冯志勇 李晓红 赵庚

Computer Engineering and Applications (计算机工程与应用), 2011, Vol. 47, Issue (11): 15-18, 33. DOI: 10.3778/j.issn.1002-8331.2011.11.005


Study on large scale duplicated text deletion algorithm based on language cadence

陈钒 1, 冯志勇 2, 李晓红 1, 赵庚 1

Author Information

  • 1. School of Computer Science and Software, Tianjin University, Tianjin 300072, China
  • 2. Department of Information Science and Technology, School of Science and Engineering, Tianjin University of Finance and Economics, Tianjin 300200, China


Abstract

A study of large-scale text on the Web finds that language cadence can mark a text uniquely. A large-scale duplicated text detection algorithm based on language cadence is proposed here; it achieves higher precision and efficiency than algorithms based on semantics or text structure. Punctuation marks capture the basic language cadence of each text. This cadence is extracted to create a language cadence code for every paragraph, so that duplicates can be detected quickly and easily. Experimental results show that the algorithm has good recall and precision in duplicated-paragraph detection. It finds duplicated content not only at the page level but also at the paragraph level, so it can detect duplicated content even when the paragraphs appear in a different order.
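The abstract does not give the exact encoding the authors use, but the idea it describes — hashing each paragraph's punctuation sequence into a "cadence code" and matching paragraphs by that code — can be sketched as follows. The punctuation set, the MD5 hash, and the function names here are illustrative assumptions, not the paper's actual implementation.

```python
import hashlib

# Assumed punctuation set: common Chinese full-width marks plus ASCII ones.
PUNCT = set("，。！？；：、,.!?;:")

def cadence_code(paragraph: str) -> str:
    """Keep only punctuation (the paragraph's 'cadence') and hash it."""
    cadence = "".join(ch for ch in paragraph if ch in PUNCT)
    return hashlib.md5(cadence.encode("utf-8")).hexdigest()

def find_duplicate_paragraphs(doc_a: str, doc_b: str) -> list[str]:
    """Return paragraphs of doc_b whose cadence code also occurs in doc_a.

    Because matching is per paragraph, duplicated content is found even
    when the paragraphs appear in a different order in the two documents.
    """
    codes_a = {cadence_code(p) for p in doc_a.split("\n") if p.strip()}
    return [p for p in doc_b.split("\n")
            if p.strip() and cadence_code(p) in codes_a]
```

Note that two punctuation-free paragraphs would share the empty cadence and collide; a real system would need a fallback (e.g. content shingles) for such cases.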

Key words

duplicated text detection / language cadence / punctuation

Category

Information Technology and Security Science

Cite this article

陈钒, 冯志勇, 李晓红, 赵庚. Study on large scale duplicated text deletion algorithm based on language cadence [J]. Computer Engineering and Applications, 2011, 47(11): 15-18, 33.

Funding

General Program of the Major Research Plan of the National Natural Science Foundation of China.

Computer Engineering and Applications

OA / CSCD / CSTPCD

ISSN 1002-8331
