Computer Engineering and Applications, 2011, Vol. 47, Issue (11): 15-18, 33, 5. DOI: 10.3778/j.issn.1002-8331.2011.11.005
Study on a large-scale document deduplication algorithm based on language cadence
Abstract
A study of large-scale Web text finds that language cadence can identify a text uniquely. Based on this finding, a large-scale duplicate text detection algorithm based on language cadence is proposed. It achieves higher precision and efficiency than algorithms based on semantics or text structure. Punctuation marks the basic language cadence of each text. This cadence can be captured to create a language cadence code for every paragraph in a text, so that duplicates can be detected quickly and easily. Experimental results show that the algorithm achieves good recall and precision in duplicate paragraph detection. It can find duplicated content at the paragraph level as well as the page level, so it can detect duplicated content even when the paragraph order differs.
Keywords
duplicate text detection / language cadence / punctuation
Classification
Information Technology and Security Science
Cite this article
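Chen Fan, Feng Zhiyong, Li Xiaohong, Zhao Geng. Study on a large-scale document deduplication algorithm based on language cadence [J]. Computer Engineering and Applications, 2011, 47(11): 15-18, 33, 5.
Funding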
National Natural Science Foundation of China, Major Research Plan, General Program.
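
Illustrative sketch
The abstract describes deriving a cadence code for each paragraph from its punctuation and matching paragraphs by that code. The abstract does not give the exact encoding, so the following Python sketch is only one plausible reading, not the authors' method. It assumes the code is the length of each run of non-punctuation characters followed by the punctuation mark that ends it. The punctuation set and the names `cadence_code` and `find_duplicate_paragraphs` are hypothetical.

```python
from collections import defaultdict

# Assumption: an illustrative punctuation set (ASCII plus common Chinese marks),
# not the set used in the paper.
PUNCTUATION = set(",.;:!?，。；：！？、")


def cadence_code(paragraph):
    """Hypothetical cadence code: for each punctuation mark, record how many
    non-space characters precede it since the previous mark, then the mark."""
    parts = []
    run = 0
    for ch in paragraph:
        if ch.isspace():
            continue  # whitespace does not contribute to the cadence
        if ch in PUNCTUATION:
            parts.append(f"{run}{ch}")
            run = 0
        else:
            run += 1
    if run:
        parts.append(str(run))  # trailing text with no closing punctuation
    return "|".join(parts)


def find_duplicate_paragraphs(docs):
    """Group paragraphs by cadence code.

    docs maps a document id to its list of paragraphs. The result maps each
    code shared by more than one paragraph to (document id, paragraph index)
    pairs, so matches do not depend on paragraph order within a document.
    """
    index = defaultdict(list)
    for doc_id, paragraphs in docs.items():
        for i, paragraph in enumerate(paragraphs):
            code = cadence_code(paragraph)
            if code:
                index[code].append((doc_id, i))
    return {code: locs for code, locs in index.items() if len(locs) > 1}


docs = {
    "a": ["One, two. Three!", "Solo line."],
    "b": ["Solo line.", "One, two. Three!"],
}
duplicates = find_duplicate_paragraphs(docs)
```

In this sketch, two different short paragraphs with the same punctuation rhythm would receive the same code, so a practical system would likely confirm candidate matches by comparing the text itself.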