Statistical Law of Same-Frequency Words in Chinese Text and Its Application to Keyword Extraction

Application Research of Computers (计算机应用研究), 2016, Vol. 33, Issue 4: 1007-1012, 6. DOI: 10.3969/j.issn.1001-3695.2016.04.010

Statistics law of same frequency words in Chinese text and its application to keywords extraction

Li Xiaochao ¹, Zhao Shuliang ², Luo Yan ³, Chen Min ¹, Liu Mengmeng ²

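Author information

1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024
2. Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang 050024
3. Institute of Mobile Internet of Things, Hebei Normal University, Shijiazhuang 050024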

Abstract

Based on a large number of experiments, this paper presents a statistical law for same-frequency words in Chinese text. Starting from Zipf’s law, it derives a mathematical expression for same-frequency words that fits Chinese text better. It also re-establishes the formula for the boundary point between high-frequency and low-frequency words and verifies its correctness. Finally, it applies the proposed law to keyword extraction. Earlier research on how to handle low-frequency words is scarce, and none of it gives a concrete solution; this paper provides a standard method for handling low-frequency words in keyword extraction. For texts of no fewer than 3,010 words, the TF-IDF calculation can skip words that occur only once or twice. The method improves efficiency by a factor of 2 to 7 without losing any keywords.
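
As a rough illustration of the idea in the abstract, the sketch below counts same-frequency words (how many distinct words share each occurrence count) and computes TF-IDF while skipping words that occur only once or twice. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `doc_freq` input, the smoothed IDF variant, and the treatment of "text length" as a token count after word segmentation are all assumptions, and the abstract does not give the paper's same-frequency formula.

```python
import math
from collections import Counter


def same_frequency_counts(tokens):
    """Map each frequency n to the number of distinct words occurring exactly n times."""
    word_freq = Counter(tokens)
    return Counter(word_freq.values())


def tfidf_scores(tokens, doc_freq, n_docs, min_count=3, length_threshold=3010):
    """TF-IDF for one segmented document, with a low-frequency cutoff.

    doc_freq: hypothetical mapping from word to the number of corpus
    documents containing it. The cutoff is applied only when the text has
    at least length_threshold tokens (an assumed reading of the abstract).
    """
    counts = Counter(tokens)
    total = len(tokens)
    apply_cutoff = total >= length_threshold
    scores = {}
    for word, count in counts.items():
        if apply_cutoff and count < min_count:
            continue  # skip words occurring once or twice
        tf = count / total
        # One common smoothed IDF variant; the paper's exact choice is not stated.
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
        scores[word] = tf * idf
    return scores
```

On a long document most distinct words occur only once or twice, so skipping them shrinks the loop over `counts` considerably, which is where a speed-up of the kind the abstract reports would come from.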

Key words

same frequency words/Zipf’s law/Booth’s law/keyword extraction/TF-IDF algorithm

Classification

Information technology and security science

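Cite this article

Li Xiaochao, Zhao Shuliang, Luo Yan, Chen Min, Liu Mengmeng. Statistics law of same frequency words in Chinese text and its application to keywords extraction[J]. Application Research of Computers, 2016, 33(4): 1007-1012, 6.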

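Funding

National Natural Science Foundation of China (71271067); National Social Science Foundation of China (13BTY011); Major Program of the National Social Science Foundation of China (13&ZD091); Master's Fund of the College of Mathematics and Information Science, Hebei Normal University ()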

Application Research of Computers

Open access; indexed in PKU Core (北大核心), CSCD, CSTPCD

ISSN: 1001-3695
