
中文文本同频词统计规律及在关键词提取中的应用

李晓超 赵书良 罗燕 陈敏 柳萌萌

计算机应用研究 (Application Research of Computers), 2016, Vol. 33, Issue 4: 1007-1012. DOI: 10.3969/j.issn.1001-3695.2016.04.010

Statistics law of same frequency words in Chinese text and its application to keywords extraction

李晓超 (1), 赵书良 (2), 罗燕 (3), 陈敏 (1), 柳萌萌 (2)

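Author information

  • 1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024
  • 2. Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang 050024
  • 3. Institute of Mobile Internet of Things, Hebei Normal University, Shijiazhuang 050024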

Abstract

This paper presents a statistical law for same-frequency words in Chinese text, based on a large number of experiments. Starting from Zipf's law, it derives a mathematical expression for same-frequency words that fits Chinese text better. It also re-establishes the formula for the boundary point between high-frequency and low-frequency words and verifies its correctness. Finally, it applies the proposed statistical law to keyword extraction. Previous research on how to handle low-frequency words was rare, and none gave a concrete solution; this paper provides a standard method for handling low-frequency words in keyword extraction. It notes that the text length must be no less than 3,010 words, and that words occurring only once or twice can be ignored when calculating TF-IDF values. This method improves efficiency by a factor of 2 to 7 with no loss of keywords.
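As a rough illustration of the pruning idea in the abstract, the sketch below skips words that occur only once or twice before computing TF-IDF, and only does so when the text has at least 3,010 words. It is not the paper's implementation: the function names, the `min_count` and `min_length` parameters, the smoothed IDF variant, and the assumption that the text is already segmented into word tokens are all choices made here for the example. Reading the 3,010-word figure as a precondition for pruning is also an interpretation of the abstract.

```python
import math
from collections import Counter


def same_frequency_counts(doc_tokens):
    """Map each frequency n to the number of distinct words occurring exactly n times."""
    return Counter(Counter(doc_tokens).values())


def tf_idf(doc_tokens, corpus, min_count=3, min_length=3010):
    """Score the words of one document against a corpus of token sets.

    Assumption: words with fewer than `min_count` occurrences are skipped,
    but only when the document has at least `min_length` tokens.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    prune = total >= min_length
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        if prune and count < min_count:
            continue  # low-frequency word: no TF-IDF computed
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant; the abstract does not specify one).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / total) * idf
    return scores
```

Because words occurring once or twice typically make up a large share of the distinct words in a text, skipping them removes most of the per-word IDF lookups, which is the kind of saving the abstract reports.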

Keywords

same frequency words / Zipf's law / Booth's law / keyword extraction / TF-IDF algorithm

Classification

Information Technology and Security Science

Cite this article

李晓超, 赵书良, 罗燕, 陈敏, 柳萌萌. 中文文本同频词统计规律及在关键词提取中的应用[J]. 计算机应用研究, 2016, 33(4): 1007-1012.

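Funding

National Natural Science Foundation of China (71271067); National Social Science Foundation of China (13BTY011); Major Program of the National Social Science Foundation of China (13&ZD091); Master's Fund of the College of Mathematics and Information Science, Hebei Normal University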

计算机应用研究 (Application Research of Computers)

Indexing: OA, Peking University Core Journals (北大核心), CSCD, CSTPCD

ISSN 1001-3695
