Simple and Effective Weakly Supervised Chinese Text Classification Algorithm

Open access (OA); Peking University Core Journal (北大核心)

Most existing seed-word-based weakly supervised text classification algorithms must search the dataset for every seed word and expand the category dictionary from those matches, and seed words that occur infrequently have correspondingly weak category recognition ability. A simple and effective weakly supervised Chinese text classification algorithm (SEWClass) is therefore designed. The method uses the initial weights of a pre-trained language model to generate an abstract understanding of each text, and on that basis derives abstract constraints and concrete constraints that are used to build the pseudo-labeled data for the first round of training. A dimensionality reduction model and a classifier are constructed jointly according to the number of categories, which suits two characteristics of weakly supervised text classification: the categories must be specified in advance, and the training data grows during self-training. With the two constraints, the pseudo-labeled data have high precision, and only the dimensionality reduction model is trained during self-training, which improves recall and algorithmic efficiency. SEWClass needs only one seed word per category, such as the category name, to complete the classification task, and its performance does not depend on whether the seed word appears in the dataset. On two Chinese datasets, THUCNews and toutiao, SEWClass performs far better than other weakly supervised algorithms.
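The general pattern the abstract describes (build a small, high-precision pseudo-labeled set from one seed per category, then self-train to recover recall) can be illustrated with a generic sketch. This is not the SEWClass implementation: the abstract does not specify the constraints, the dimensionality reduction model, or the classifier, so everything below is an assumption made for illustration. Random vectors stand in for pre-trained language model embeddings, a similarity margin stands in for the constraints, and class prototypes stand in for the trained model. The names `pseudo_label`, `self_train`, `margin`, and `decay` are hypothetical.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def pseudo_label(doc_vecs, class_vecs, margin):
    """Keep only documents whose best class beats the runner-up by `margin`.

    Assumption: this margin test is a stand-in for the paper's abstract and
    concrete constraints, which the abstract does not define.
    Requires at least two classes.
    """
    sims = cosine_sim(doc_vecs, class_vecs)
    ranked = np.sort(sims, axis=1)
    confident = (ranked[:, -1] - ranked[:, -2]) >= margin
    idx = np.flatnonzero(confident)
    return idx, sims[idx].argmax(axis=1)


def self_train(doc_vecs, seed_vecs, margin=0.2, rounds=5, decay=0.5):
    """Prototype-based self-training (illustrative, not the paper's model).

    Each round re-estimates one prototype per class from the confident
    documents, then relaxes the margin so more documents are admitted.
    """
    class_vecs = seed_vecs.astype(float).copy()
    for _ in range(rounds):
        idx, labels = pseudo_label(doc_vecs, class_vecs, margin)
        for k in range(len(class_vecs)):
            members = doc_vecs[idx[labels == k]]
            if len(members):
                class_vecs[k] = members.mean(axis=0)
        margin *= decay
    return cosine_sim(doc_vecs, class_vecs).argmax(axis=1)


# Synthetic demo: 3 classes in 8 dimensions, one noisy seed vector per class.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)
true = rng.integers(0, 3, size=300)
docs = centers[true] + 0.2 * rng.normal(size=(300, 8))
seeds = centers + 0.1 * rng.normal(size=(3, 8))

pred = self_train(docs, seeds)
accuracy = (pred == true).mean()
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

Starting with a strict margin and relaxing it mirrors the precision-first, recall-later ordering in the abstract; in a real setting the embeddings would come from a pre-trained language model rather than from synthetic data.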

CHEN Zhongtao (陈中涛); ZHOU Yatong (周亚同)

School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China

Computer Science and Automation

Keywords: weakly supervised; text classification; self-training; seed word

Computer Engineering and Applications (《计算机工程与应用》), 2025 (4)

Pages 192-210 (19 pages)

DOI: 10.3778/j.issn.1002-8331.2310-0009
