基于语料信息度量的文本分类性能影响研究

Journal of Intelligence (情报杂志), Issue (9): 157-162, 180, 7. DOI: 10.3969/j.issn.1002-1965.2014.09.028

Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement

Li Xiangdong (李湘东)¹, Ba Zhichao (巴志超)², Huang Li (黄莉)¹

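Author information

  • 1. School of Information Management, Wuhan University, Wuhan 430072
  • 2. Center for Studies of Information Resources, Wuhan University, Wuhan 430072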

Abstract

Categorization performance usually varies across corpora and across categorization algorithms. This article proposes a new method for improving categorization performance, based on an analysis of the underlying reasons why categorization results differ between specialized corpora and self-built corpora. It measures corpus information by comparing automatic categorization performance on different corpora and defining three indexes: category clustering density, category complexity, and category definition. It then examines the relationship between the three indexes and categorization performance using multi-factor analysis of variance, to determine how each index affects the performance of different algorithms, and proposes an overlap text categorization method based on category definition to verify the validity of the indexes. The experiments show that all three indexes affect the categorization performance of different algorithms to some extent: the higher the clustering density, the lower the complexity, and the higher the category definition, the better the categorization performance.

Key words

corpus/self-built corpus/category information/categorization algorithm/categorization performance

Classification

Information technology and security science

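Cite this article

Li Xiangdong, Ba Zhichao, Huang Li. Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement [J]. Journal of Intelligence (情报杂志), 2014, (9): 157-162, 180, 7.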

Journal of Intelligence (情报杂志)

Indexed in: OA / PKU Core (北大核心) / CHSSCD / CSSCI

ISSN 1002-1965
