Application Research of Computers (计算机应用研究), 2016, Vol. 33, Issue 4: 1007-1012, 6. DOI: 10.3969/j.issn.1001-3695.2016.04.010
Statistics law of same frequency words in Chinese text and its application to keywords extraction (中文文本同频词统计规律及在关键词提取中的应用)
Abstract
This paper presents a statistical law for same-frequency words in Chinese text, based on a large number of experiments. Starting from Zipf’s law, it derives a mathematical expression for same-frequency words that applies better to Chinese text. It also re-establishes the formula for the boundary point between high-frequency and low-frequency words and verifies its correctness. Finally, it applies the proposed law to keyword extraction. Previous academic research rarely addressed how to handle low-frequency words, and none gave a concrete solution. This paper provides a standard method for handling low-frequency words in keyword extraction. It notes that the text length must be no less than 3,010 words and that words occurring only once or twice can be ignored when calculating TF-IDF values. The method improves efficiency by a factor of 2 to 7 without losing any keywords.
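The two quantities the abstract relies on can be illustrated with a minimal Python sketch: the count of same-frequency words (how many distinct words occur exactly n times) and a TF-IDF pass that skips words occurring only once or twice. This is not the paper's implementation. The function names `same_frequency_counts` and `tfidf_pruned`, the `min_count` parameter, and the smoothed IDF variant are assumptions made for illustration. The input is assumed to be already segmented into words, and the 3,010-word length condition from the abstract is not enforced here.

```python
import math
from collections import Counter


def same_frequency_counts(tokens):
    """Map each frequency n to the number of distinct words occurring exactly n times."""
    word_freq = Counter(tokens)
    return Counter(word_freq.values())


def tfidf_pruned(doc_tokens, corpus, min_count=3):
    """TF-IDF scores for one document, skipping words with fewer than min_count occurrences.

    min_count=3 drops words occurring once or twice, as the abstract describes.
    The IDF formula below (smoothed, with +1) is an assumed variant, not the paper's.
    """
    word_freq = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]

    scores = {}
    for word, count in word_freq.items():
        if count < min_count:
            continue  # low-frequency word: no TF-IDF computation at all
        doc_freq = sum(1 for words in doc_sets if word in words)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / total) * idf
    return scores


# Toy, hypothetical data: pre-segmented tokens standing in for Chinese words.
document = ["a", "a", "a", "b", "b", "c"]
other = ["b", "c", "d"]

print(dict(same_frequency_counts(document)))
print(tfidf_pruned(document, [document, other]))
```

Because low-frequency words usually make up most of a text's vocabulary, skipping them before the document-frequency lookup removes most of the per-word work, which is the kind of saving the abstract reports.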
Keywords
same frequency words / Zipf’s law / Booth’s law / keyword extraction / TF-IDF algorithm
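Classification
Information technology and security science
Cite this article
Li Xiaochao, Zhao Shuliang, Luo Yan, Chen Min, Liu Mengmeng. Statistics law of same frequency words in Chinese text and its application to keywords extraction [J]. Application Research of Computers, 2016, 33(4): 1007-1012, 6.
Funding
National Natural Science Foundation of China (71271067); National Social Science Foundation of China (13BTY011); Major Program of the National Social Science Foundation of China (13&ZD091); Master's Fund of the College of Mathematics and Information Science, Hebei Normal University ()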