情报杂志 (Journal of Intelligence), 2014, Issue (9): 157-162, 180, 7. DOI: 10.3969/j.issn.1002-1965.2014.09.028
基于语料信息度量的文本分类性能影响研究
Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement
李湘东 (Li Xiangdong) 1, 巴志超 (Ba Zhichao) 2, 黄莉 (Huang Li) 1
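Author information
- 1. School of Information Management, Wuhan University, Wuhan 430072
- 2. Center for Studies of Information Resources, Wuhan University, Wuhan 430072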
Abstract
Categorization performance usually varies across corpora and across categorization algorithms. This article analyzes the underlying reasons why categorization results differ between specialized corpora and self-built corpora, and on that basis proposes a new method to improve categorization performance. It measures corpus information by comparing the automatic categorization performance of different corpora through three defined indexes: category clustering density, category complexity, and category definition. It then examines the relationship between the three indexes and categorization performance using multi-factor analysis of variance, to determine how each index affects the performance of different algorithms, and proposes an overlap text categorization method based on category definition to verify the validity of the index. The experiments show that all three indexes affect the categorization performance of the different algorithms to some extent: the higher the clustering density, the lower the complexity, and the higher the category definition, the better the categorization performance.
Keywords
corpus / self-built corpus / category information / categorization algorithm / categorization performance
Classification
Information Technology and Security Science
Cite this article
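Li Xiangdong, Ba Zhichao, Huang Li. Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement [J]. Journal of Intelligence, 2014, (9): 157-162, 180, 7.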