Abstract
With the explosion of information and the development of information analysis, word frequency analysis is becoming increasingly popular, and defining high-frequency words is its cornerstone. Drawing on a review of the prior literature, this paper first summarizes the four methods currently used to define high-frequency words, i.e., TOPN, WF>=M, %WF=P, and the T formula. After briefly discussing the main shortcomings of these four methods, such as excessive reliance on experience, subjectivity, lack of theoretical grounding, and inapplicability or impracticability, the paper empirically tests and verifies the normal distribution of high-frequency words in depositories and accordingly proposes the F formula for threshold analysis of high-frequency words. Finally, the paper compares the T formula and the F formula across many datasets, thereby establishing the theoretical and practical validity of the F formula, which is based on the normal distribution, for research on high-frequency word thresholds.
Key words
word frequency analysis/normal distribution/high-frequency words/Zipf's Law
Classification
Social sciences