计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2024, Vol. 18, Issue 9: 2370-2383 (14 pages). DOI: 10.3778/j.issn.1673-9418.2311063
Research on Processing and Application of Imbalanced Textual Data on Social Platforms
Abstract
As society becomes increasingly informatized, extracting useful information from the massive volume of text available online with natural language processing (NLP) tools is of great practical value. However, texts collected from social platforms suffer from a scarcity of valuable samples and from class imbalance. This paper proposes two methods to address these problems: SimDyFeFL (SimBERT & dynamic feedback Focal Loss), which targets crisis-related information recognition in Chinese, and EdaDyFeFL (EDA & dynamic feedback Focal Loss), which targets cyber troll detection in English. Specifically, SimBERT and EDA (easy data augmentation) are used to augment the original data, in which class sizes differ greatly, until the classes contain similar numbers of samples; a Focal Loss function with a dynamic feedback process is then incorporated to weight each class. BERT (bidirectional encoder representations from transformers), RoBERTa (robustly optimized BERT pre-training approach), and BERT_DPCNN (BERT deep pyramid convolutional neural networks) text classification models are then built for three-stage comparative experiments to validate the proposed methods. Extensive experiments on two real datasets, one in Chinese and one in English, show that text classification models improved with SimDyFeFL and EdaDyFeFL perform significantly better: the accuracy of the Chinese model increases by 7.70 percentage points and that of the English model by 5.15 percentage points. Compared with the best results on the Kaggle platform, the accuracy of the English model is 2.92 percentage points higher, and the Macro F1 and Weighted F1 scores are 2.83 and 2.95 percentage points higher, respectively.
Key words
text classification on social platforms / processing of imbalanced data / SimBERT / EDA (easy data augmentation) / Focal Loss
Classification
Information Technology and Security Science
Cite this article
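姜钰棋, 侯智文, 王一帆, 翟晗名, 卜凡亮. Research on Processing and Application of Imbalanced Textual Data on Social Platforms[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2370-2383, 14.
Funding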
This work was supported by the Double First-Class Innovation Research Project for People's Public Security University of China (2023SYL08).
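
Illustrative sketch
The abstract mentions weighting each class with a Focal Loss function. As a rough illustration only, the standard class-weighted Focal Loss, -alpha_t * (1 - p_t)^gamma * log(p_t), might be written as below. The paper's dynamic feedback process is not specified in the abstract and is not reproduced here; the helper `inverse_frequency_weights` and the example class counts are hypothetical stand-ins for a static weighting, not the authors' scheme.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, alpha, gamma=2.0):
    """Standard Focal Loss for one sample.

    logits: raw class scores; target: index of the true class;
    alpha: per-class weights; gamma: focusing parameter
    (gamma = 0 with unit weights reduces to cross-entropy).
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_frequency_weights(class_counts):
    """Hypothetical static weighting (an assumption, not the paper's
    dynamic feedback): rarer classes get larger weights, summing to 1."""
    inv = [1.0 / c for c in class_counts]
    total = sum(inv)
    return [w / total for w in inv]


# Made-up class counts for illustration: 900 majority vs. 100 minority samples.
alpha = inverse_frequency_weights([900, 100])
confident = focal_loss([4.0, 0.0], target=0, alpha=alpha)
mistaken = focal_loss([0.0, 4.0], target=0, alpha=alpha)
print(alpha, confident, mistaken)
```

The factor (1 - p_t)^gamma shrinks the loss of samples the model already classifies confidently, so training concentrates on hard and minority-class examples; a dynamic scheme would presumably adjust `alpha` during training rather than fixing it from class counts.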