Sentiment classification of imbalanced text with three-way borderline oversampling
Open access (OA); indexed in the Peking University Core Journals list (北大核心) and CSTPCD
In practical applications, minority-class samples often carry important information, yet traditional machine learning methods usually classify them with low accuracy and at a high misclassification cost. To address sentiment classification of imbalanced text data, this paper builds on three-way sampling (3WS) and oversampling to propose two algorithms: three-way oversampling (three-way SMOTE, 3WOS) and three-way borderline oversampling (three-way borderline-SMOTE, 3WOBS). 3WOS better identifies data in the boundary region, and 3WOBS enhances the information contained in that region. First, the text data are modeled as a hypersphere, and the support vectors on its boundary are obtained. Second, 3WOS oversamples these boundary support vectors directly to generate synthetic samples and update the sample set, whereas 3WOBS generates each synthetic sample and then decides, according to a given condition, whether to accept it before updating the sample set. Finally, the updated sample set is used to train different base classifiers for comparative experiments. The experiments use three imbalanced datasets with different imbalance ratios. In addition, the idea of granular computing is introduced during training to ensure model robustness. The experimental results show that 3WOS-ITSC and 3WOBS-ITSC achieve higher accuracy and lower cost than the other models, offering a new approach to imbalanced text classification.
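The two oversampling variants described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the hypersphere is built or which condition 3WOBS uses to accept a synthetic sample, so the code below makes two labeled assumptions. The boundary "support vectors" are approximated by the minority samples farthest from the minority centroid, and the acceptance rule keeps a candidate only if it lies closer to some minority sample than to any majority sample. All function and parameter names (`boundary_samples`, `oversample`, `quantile`, `check`) are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def boundary_samples(minority, quantile=0.8):
    """ASSUMPTION: stand-in for the hypersphere's boundary support vectors.

    Returns the minority samples farthest from the minority centroid.
    """
    c = centroid(minority)
    ranked = sorted(minority, key=lambda p: dist(p, c))
    return ranked[int(len(ranked) * quantile):]


def oversample(minority, majority, n_new, check=False, k=3, seed=0):
    """SMOTE-style interpolation seeded at boundary samples.

    check=False: every synthetic sample is kept (3WOS-like behaviour).
    check=True:  a candidate is kept only if it passes the hypothetical
                 acceptance rule below (3WOBS-like behaviour).
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    edge = boundary_samples(minority)
    synthetic = []
    attempts = 0
    while len(synthetic) < n_new and attempts < 100 * n_new:
        attempts += 1
        base = rng.choice(edge)
        # k nearest minority neighbours of the chosen boundary sample
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(p, base),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()
        candidate = [x + gap * (y - x) for x, y in zip(base, partner)]
        if check and majority:
            # ASSUMPTION: reject candidates that sit nearer to the majority class
            nearest_min = min(dist(candidate, p) for p in minority)
            nearest_maj = min(dist(candidate, q) for q in majority)
            if nearest_min > nearest_maj:
                continue
        synthetic.append(candidate)
    return synthetic
```

In use, the returned synthetic vectors would be appended to the minority class before training a base classifier. A real pipeline would first map each text to a numeric feature vector and would likely replace the centroid heuristic with a proper hypersphere model.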
Yu Qiyang (余啟煬); Fang Yu (方宇); Li Zhaochen (李昭宸); Liu Chang (刘畅); Yang Mei (杨梅)
School of Computer Science, Southwest Petroleum University, Chengdu 610500, China
Computer Science and Automation
sentiment classification; imbalanced data; three-way decision; sampling; granular computing
Journal of Chongqing University of Technology (《重庆理工大学学报》), 2024 (005)
201-211 / 11
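National Natural Science Foundation of China (62006200); Central Government Guiding Local Science and Technology Development Special Project (2021ZYD0003); Sichuan Youth Science and Technology Innovation Team Project (2019JDTD0017); Southwest Petroleum University 2021 First-Class Undergraduate Course Cultivation and Construction Project (X2021YLKC035); Southwest Petroleum University Postgraduate English-Taught Course Construction Project (2020QY04); Second Batch of Industry-University Cooperative Education Projects (202102211111)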