Computer Engineering (计算机工程), 2011, Vol. 37, Issue 1: 16-18, 21, 4. DOI: 10.3969/j.issn.1000-3428.2011.01.006
Improved Algorithm of Text Feature Weighting Based on Information Gain (基于信息增益的文本特征权重改进算法)
Abstract
The idf function of the traditional tf.idf algorithm evaluates a feature's ability to discriminate between documents only at a macroscopic level. It cannot reflect differences in how a feature is distributed within each document and within each class of the training set, which reduces the accuracy of text representation. To address this problem, this paper proposes an improved feature weighting method called tf.igt.igc. Starting from an analysis of the characteristics of feature distribution, the method introduces the concept of information gain from information theory to account jointly for these two dimensions of feature distribution, and thereby overcomes the shortcomings of the traditional formula. Experimental results on two publicly available corpora show that, compared with two other feature weighting methods, tf.igt.igc computes feature weights more effectively.
Key words
feature distribution / feature weighting / text classification
Classification
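Information Technology and Security Science
Cite this article
Li Kaiqi (李凯齐), Diao Xingchun (刁兴春), Cao Jianjun (曹建军). Improved Algorithm of Text Feature Weighting Based on Information Gain [J]. Computer Engineering, 2011, 37(1): 16-18, 21, 4.
Funding
China Postdoctoral Science Foundation (20090461425)
Jiangsu Planned Projects for Postdoctoral Research Funds (0901014B)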
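The abstract's central observation, that idf alone cannot see how a feature is distributed across classes, can be illustrated with a small sketch. The code below is not the paper's tf.igt.igc formula, which the abstract does not state. It only computes the textbook idf and the textbook information gain of a term's presence or absence with respect to the class labels. The toy corpus, the labels, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def idf(docs, term):
    """Classic idf: log(N / df). Ignores class labels entirely."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def information_gain(docs, labels, term):
    """Textbook IG of the term's presence/absence w.r.t. the class labels.

    IG = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["ball", "goal", "team"],
    ["ball", "score"],
    ["stock", "market", "team"],
    ["market", "price"],
]
labels = ["sport", "sport", "finance", "finance"]

for term in ("ball", "team"):
    print(term, idf(docs, term), information_gain(docs, labels, term))
```

In this toy corpus "ball" and "team" each occur in two of the four documents, so their idf values are identical. "ball" occurs only in the sport class while "team" is split evenly across both classes, so the information gain separates them (1 bit versus 0 bits). This is the kind of class-level distribution difference the abstract says idf cannot capture.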