Digital Library Forum (数字图书馆论坛), Issue (8): 18-24, 7. DOI: 10.3772/j.issn.1673-2286.2019.08.003
Improvement and Application of TF-IDF-CHI in Agricultural Science Text Feature Extraction
Abstract
This paper addresses shortcomings of the traditional TF-IDF method and verifies the improved method's effectiveness through text classification experiments in the agricultural domain. The improved method, called ImpTF-IDF-CHI, reconstructs the feature-word weighting function by adding chi-square test values and weight correction factors. First, feature words are extracted with ImpTF-IDF-CHI, the document frequency method, the information gain method, and TF-IDF. The extracted feature words are then used for text classification, and the methods are compared on the classification results. Across all tests, ImpTF-IDF-CHI gave the best results: naive Bayes text classification using ImpTF-IDF-CHI features reached an accuracy of 94% and an F1 value of 0.844. The experiments demonstrate the effectiveness of the ImpTF-IDF-CHI method, which offers high accuracy, good stability, and strong topical representativeness in text feature extraction. The method can be applied to text categorization, feature expression, and topic extraction.
Keywords
Feature Extraction / TF-IDF / Chi-Square Statistics / Text Categorization / Agricultural Science
Classification
Information Technology and Security Science
Cite this article
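Du Ruopeng (杜若鹏), Xian Guojian (鲜国建), Kou Yuantao (寇远涛). Improvement and Application of TF-IDF-CHI in Agricultural Science Text Feature Extraction [J]. Digital Library Forum (数字图书馆论坛), 2019, (8): 18-24, 7.
Funding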
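This research was supported by the National Social Science Fund of China project "Research on the Construction and Application of Panoramic Abstract Knowledge Graphs for Scientific and Technical Papers" (No. 19BTQ61), the Agricultural Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (No. CAAS-ASTIP-2016-AII), and the China Knowledge Centre for Engineering Sciences and Technology construction project (No. CKCEST-2018-1-15).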
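Illustrative sketch

The abstract describes ImpTF-IDF-CHI as a TF-IDF weighting extended with chi-square test values and weight correction factors, but this record does not give the formula. The Python sketch below is therefore only one plausible reading, not the authors' method: it assumes the score of a term for a class is the product tf * idf * chi-square, it omits the correction factors entirely, and the function names (`chi_square`, `tf_idf_chi`) and the toy corpus `DOCS` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, class label) pairs.
DOCS = [
    (["wheat", "yield", "soil"], "crop"),
    (["wheat", "irrigation", "yield"], "crop"),
    (["cattle", "feed", "yield"], "livestock"),
    (["cattle", "vaccine", "feed"], "livestock"),
]


def chi_square(term, label, docs):
    """Chi-square statistic of `term` with respect to class `label`."""
    a = b = c = d = 0
    for tokens, doc_label in docs:
        has_term = term in tokens
        in_class = doc_label == label
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document outside class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document outside class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def tf_idf_chi(docs, label):
    """Score terms for `label` as tf * idf * chi-square (assumed combination)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens, _ in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    class_tokens = [t for tokens, lab in docs if lab == label for t in tokens]
    tf = Counter(class_tokens)
    total = len(class_tokens)
    scores = {}
    for term, count in tf.items():
        idf = math.log(n_docs / doc_freq[term]) + 1.0  # +1 keeps idf positive
        scores[term] = (count / total) * idf * chi_square(term, label, docs)
    return scores


scores = tf_idf_chi(DOCS, "crop")
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this toy corpus, "yield" is as frequent as "wheat" within the crop class but also appears in a livestock document, so the chi-square factor ranks it well below "wheat". That is the kind of class-discriminating behaviour the abstract attributes to adding chi-square values to TF-IDF.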