Computer Engineering and Applications (计算机工程与应用), Issue (5): 208-213, 6. DOI: 10.3778/j.issn.1002-8331.1403-0103
Method of new Chinese words identification from large scale network corpora (一种基于海量语料的网络热点新词识别方法)
Abstract
Identifying new words in large-scale corpora is a basic task in automatic Chinese processing. It is difficult because it requires both fast processing of large-scale corpora and fairly intelligent methods. Building on a survey of prior work, this paper constructs a framework for identifying new Chinese words in large-scale network corpora. The framework comprises a repeat extraction algorithm based on hierarchical pruning, a new word detection method based on statistical learning, and a POS guessing method based on combined features. Experiments and analysis show that the framework can rapidly extract repeats from large-scale corpora and construct the set of candidate new words, and that it performs new word detection and POS guessing efficiently and with good results.
Key words
large scale corpora/repeat/hierarchical pruning algorithm/new words detection/combined features
Classification
Information Technology and Security Science
Cite this article
Zhang Haijun (张海军), Li Yong (李勇), Yan Qiqi (闫琪琪). Method of new Chinese words identification from large scale network corpora [J]. Computer Engineering and Applications, 2015, (5): 208-213, 6.
Funding
National Natural Science Foundation of China (No. 61163045); Natural Science Foundation of Xinjiang Uygur Autonomous Region (No. 2012211A057); Xinjiang Normal University Key Discipline Bidding Project (No. 12XSXZ0601); Xinjiang Normal University Graduate Innovation Fund Project (No. 20131201).
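The abstract names a repeat extraction algorithm based on hierarchical pruning but does not spell it out on this page. As a rough illustration only, the sketch below shows one common level-wise scheme, assuming that a string of length k+1 is counted only when both its length-k prefix and its length-k suffix were already frequent at the previous level. The function name `extract_repeats` and the parameters `min_count` and `max_len` are hypothetical and are not taken from the paper, whose actual algorithm may differ.

```python
from collections import Counter


def extract_repeats(text, min_count=2, max_len=4):
    """Return {substring: count} for substrings of length 1..max_len
    that occur at least min_count times in text.

    Hypothetical sketch of level-wise pruning, not the paper's algorithm.
    """
    repeats = {}

    # Level 1: count single characters and keep the frequent ones.
    counts = Counter(text)
    frequent = {g for g, c in counts.items() if c >= min_count}
    repeats.update({g: counts[g] for g in frequent})

    # Level k: count a k-gram only if its (k-1)-length prefix and suffix
    # both survived the previous level; otherwise it cannot be frequent.
    for k in range(2, max_len + 1):
        if not frequent:
            break
        counts = Counter()
        for i in range(len(text) - k + 1):
            gram = text[i:i + k]
            if gram[:-1] in frequent and gram[1:] in frequent:
                counts[gram] += 1
        frequent = {g for g, c in counts.items() if c >= min_count}
        repeats.update({g: counts[g] for g in frequent})

    return repeats


if __name__ == "__main__":
    print(extract_repeats("abcabcab", min_count=2, max_len=3))
```

The pruning step is what keeps memory bounded on large inputs: most long strings are discarded without ever being counted, because one of their shorter parts was already rare. In a real pipeline the surviving repeats would form the candidate set passed on to new word detection.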