A Method for Identifying Trending New Internet Words Based on Massive Corpora

Computer Engineering and Applications, Issue (5): 208-213, 6. DOI: 10.3778/j.issn.1002-8331.1403-0103

Method of new Chinese words identification from large scale network corpora

Zhang Haijun¹, Li Yong², Yan Qiqi³

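Author information

1. School of Elementary Education, Xinjiang Normal University, Urumqi 830054
2. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054
3. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054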
Abstract

Identifying new words in large-scale corpora is a basic task in automatic Chinese language processing. The task is difficult because it requires both processing large-scale corpora quickly and applying fairly sophisticated methods. Building on a survey of existing research, this paper constructs a framework for identifying new Chinese words in large-scale network corpora. The framework comprises a repeat extraction algorithm based on hierarchical pruning, a new word detection method based on statistical learning, and a part-of-speech (POS) guessing method based on combined features. Experiments and analysis show that the framework can rapidly extract repeats from large-scale corpora and construct the set of candidate new words, and that it performs new word detection and POS guessing with high efficiency and good results.
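The abstract does not give the details of the repeat extraction algorithm, so the following is only a minimal sketch of one common way to extract repeats level by level with pruning, not the authors' method. It assumes an Apriori-style rule: a string of length n can occur at least `min_freq` times only if both of its substrings of length n-1 do. The function name `extract_repeats` and the parameters `min_freq` and `max_len` are hypothetical and chosen for illustration.

```python
from collections import Counter


def extract_repeats(text, min_freq=2, max_len=4):
    """Return {substring: count} for substrings of length 2..max_len
    that occur at least min_freq times, found one length at a time.

    Illustrative sketch only; the pruning rule is an assumption.
    """
    # Level 1: single characters that occur often enough.
    counts = Counter(text)
    frequent = {s for s, c in counts.items() if c >= min_freq}
    repeats = {}
    for n in range(2, max_len + 1):
        level = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Pruning: count a candidate only if its prefix and suffix
            # of length n-1 both survived the previous level.
            if gram[:-1] in frequent and gram[1:] in frequent:
                level[gram] += 1
        frequent = {s for s, c in level.items() if c >= min_freq}
        if not frequent:
            break  # no longer repeats can exist
        repeats.update({s: level[s] for s in frequent})
    return repeats
```

Under this assumption, most candidates at each length are discarded with two set lookups before they are ever counted, which is what keeps the per-level counters small on a large corpus. The surviving repeats would then serve as the candidate set for a later new word detection step.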

Keywords

large-scale corpora/repeat/hierarchical pruning algorithm/new word detection/combined features

Classification

Information Technology and Security Science

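Cite this article

Zhang Haijun, Li Yong, Yan Qiqi. A method for identifying trending new Internet words based on massive corpora [J]. Computer Engineering and Applications, 2015, (5): 208-213, 6.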

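Funding

National Natural Science Foundation of China (No. 61163045); Natural Science Foundation of Xinjiang Uygur Autonomous Region (No. 2012211A057); Key Discipline Bidding Project of Xinjiang Normal University (No. 12XSXZ0601); Graduate Innovation Fund Project of Xinjiang Normal University (No. 20131201).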

Computer Engineering and Applications

OA, Peking University Core Journal, CSCD, CSTPCD

ISSN: 1002-8331
