Computer Engineering and Applications, Issue (11): 126-129, 4. DOI: 10.3778/j.issn.1002-8331.1302-0127
Building foul words text corpus
Zhu Xiaoxu (朱晓旭) 1, Qian Peide (钱培德) 1
Author information
- 1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006
Abstract
As an unofficial form of language, foul words are widespread in Web reviews and have a negative impact on Web civility. This paper analyzes and describes the hazards and characteristics of foul words. Focusing on Web foul words, it designs a corpus collection method that combines automatic machine processing with manual work. More than 6000 sentences were collected from a large volume of Web reviews to form a Foul Words Corpus. An automatic foul-word identification experiment was carried out using SVM and Maximum Entropy classifiers. The results show that recall and accuracy both exceed 97%.
Keywords
foul words text / corpus / text classification / automatic identification
Classification
Information Technology and Security Science
Cite this article
Zhu Xiaoxu, Qian Peide. Building foul words text corpus [J]. Computer Engineering and Applications, 2014, (11): 126-129, 4.
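The abstract reports recall and accuracy for the automatic identification experiment. The following is a minimal sketch of how these two figures can be computed from gold and predicted labels. It assumes a binary labeling (1 = foul, 0 = clean) and that "accuracy" means the fraction of all sentences labeled correctly. The function name `recall_and_accuracy` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def recall_and_accuracy(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Return (recall, accuracy) for binary labels, where 1 = foul and 0 = clean."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one labeled sentence is required")

    # True positives: foul sentences that the classifier flagged as foul.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    # False negatives: foul sentences that the classifier missed.
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Sentences of either class that were labeled correctly.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(gold)
    return recall, accuracy


# Hypothetical labels for eight sentences (illustrative only, not data from the paper).
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 1]
recall, accuracy = recall_and_accuracy(gold, pred)
print(f"recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this sketch, recall counts only the foul sentences, while accuracy counts both classes, so the two figures can differ when the classes are unbalanced.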