Computer Engineering and Applications, 2025, Vol. 61, Issue (4): 192-210, 19. DOI: 10.3778/j.issn.1002-8331.2310-0009
Simple and Effective Weakly Supervised Chinese Text Classification Algorithm
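CHEN Zhongtao 1, ZHOU Yatong 1
Author information
- 1. School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401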
Abstract
Most current weakly supervised text classification algorithms based on seed words must search the dataset for every seed word and extend the category dictionary from the matches, and seed words that occur infrequently are also weaker at identifying their category. To address this, a simple and effective weakly supervised Chinese text classification algorithm (SEWClass) is designed. It uses the initial weights of a pre-trained language model to generate an abstract understanding of the text, and from this it derives abstract constraints and figurative constraints that are used to construct the initial training data. A dimensionality reduction model and a classifier are built jointly according to the number of categories, which suits two properties of weakly supervised text classification: the number of categories must be specified in advance, and the training data must grow during self-training. With the two constraints, the pseudo-labeled data have high precision, and only the dimensionality reduction model is trained during self-training, which improves recall and efficiency. SEWClass requires only one seed word per category, such as the category name, to complete the classification task, and its performance does not depend on whether the seed word occurs in the dataset. On two Chinese datasets, THUCNews and toutiao, SEWClass performs much better than other weakly supervised algorithms.
Key words
weakly supervised / text classification / self-training / seed word
Classification
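Computer and automation
Cite this article
CHEN Zhongtao, ZHOU Yatong. Simple and effective weakly supervised Chinese text classification algorithm[J]. Computer Engineering and Applications, 2025, 61(4): 192-210, 19.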