Building a Foul Words Text Corpus

Zhu Xiaoxu, Qian Peide

Computer Engineering and Applications, Issue (11): 126-129, 4. DOI: 10.3778/j.issn.1002-8331.1302-0127

Zhu Xiaoxu¹, Qian Peide¹

Author information

1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

Abstract

As an unofficial form of language, foul words are widespread in Web reviews and have a bad impact on Web civility. This paper analyzes and describes the hazards and characteristics of foul words. Focusing on Web foul words, it designs a corpus collection method that combines automatic machine processing with manual work. Over 6000 sentences are collected from a large volume of Web reviews into a Foul Words Corpus. An experiment on automatic identification of foul words is carried out using SVM and Maximum Entropy classifiers. The results show that recall and accuracy are both over 97%.
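
The two figures reported above can be made concrete with a small sketch. It assumes "recall" and "accuracy" carry their standard binary-classification meanings (the abstract does not define them), and the function name, label strings, and toy labels are hypothetical illustrations, not data or code from the paper.

```python
def recall_and_accuracy(gold, predicted, positive="foul"):
    """Return (recall, accuracy) for a binary sentence-labelling task.

    recall   = correctly flagged positive sentences / all truly positive sentences
    accuracy = correctly labelled sentences / all sentences
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one labelled sentence")
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    actual_pos = sum(1 for g in gold if g == positive)
    correct = sum(1 for g, p in pairs if g == p)
    # Recall is undefined with no positive examples; report 0.0 in that case.
    recall = true_pos / actual_pos if actual_pos else 0.0
    accuracy = correct / len(gold)
    return recall, accuracy


# Hypothetical toy labels (assumption for illustration only).
gold = ["foul", "foul", "clean", "clean", "foul"]
predicted = ["foul", "clean", "clean", "clean", "foul"]
recall, accuracy = recall_and_accuracy(gold, predicted)
```

In this toy case two of the three truly foul sentences are flagged and four of the five labels are correct, so the two measures can differ noticeably on the same predictions.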

Key words

foul words/corpus/text classification/automatic identification

Classification

Information Technology and Security Science

Cite this article

Zhu Xiaoxu, Qian Peide. Building foul words text corpus [J]. Computer Engineering and Applications, 2014, (11): 126-129, 4.

Computer Engineering and Applications

Indexed in: OA, Peking University Core Journals, CSCD, CSTPCD

ISSN: 1002-8331
