
Vietnamese Word Segmentation Based on CRFs and an Ambiguity Model

Journal of Data Acquisition and Processing (数据采集与处理), 2017, Vol. 32, Issue 3: 636-642, 7. DOI: 10.16337/j.1004-9037.2017.03.024

Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model

Xiong Mingming (熊明明) 1, Li Ying (李英) 1, Guo Jianyi (郭剑毅) 1, Mao Cunli (毛存礼) 2, Yu Zhengtao (余正涛) 1

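Author information

  • 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500
  • 2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, 650500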

Abstract

The lexical features of Vietnamese are discussed, and the essential characteristics of the language are integrated into conditional random fields (CRFs) to propose a Vietnamese word segmentation method based on CRFs and an ambiguity model. A segmentation corpus of 25 981 Vietnamese entries, produced by automatic annotation followed by manual proofreading, serves as the CRF training corpus. Crossing ambiguity is widespread in Vietnamese sentences. To reduce its effect, 5 377 ambiguity fragments are extracted from the training corpus using dictionary-based forward and reverse matching algorithms, and an ambiguity model is obtained by training a maximum entropy model. Both are then incorporated into the segmentation model. The training corpus is divided evenly into ten parts for cross-validation experiments, in which the segmentation accuracy reaches 96.55%. The experimental results show that, compared with the Vietnamese segmentation tool VnTokenizer, the method clearly improves the accuracy, recall, and F-value of Vietnamese word segmentation.
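The abstract states only that ambiguity fragments are extracted by dictionary-based forward and reverse matching; it does not give the procedure. The following is a minimal sketch of one common way to do this, assuming maximum matching over whitespace-separated syllables and treating any span where the two segmentations disagree as an ambiguity fragment. The function names, the `max_len` parameter, and the toy lexicon and sentence are illustrative assumptions, not taken from the paper.

```python
from typing import List, Set, Tuple

Word = Tuple[str, ...]  # a word is a tuple of one or more syllables


def forward_max_match(syllables: List[str], lexicon: Set[Word], max_len: int = 4) -> List[Word]:
    """Greedy left-to-right segmentation, longest dictionary word first.

    max_len (longest word in syllables) is a hypothetical default.
    """
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = tuple(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


def backward_max_match(syllables: List[str], lexicon: Set[Word], max_len: int = 4) -> List[Word]:
    """Greedy right-to-left segmentation, longest dictionary word first."""
    words, j = [], len(syllables)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = tuple(syllables[j - n:j])
            if n == 1 or cand in lexicon:
                words.append(cand)
                j -= n
                break
    words.reverse()
    return words


def boundaries(words: List[Word]) -> Set[int]:
    """End offsets (in syllables) of each word in a segmentation."""
    cuts, pos = set(), 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    return cuts


def ambiguity_fragments(syllables: List[str], lexicon: Set[Word], max_len: int = 4) -> List[List[str]]:
    """Return spans on which the forward and backward segmentations disagree."""
    fwd_cuts = boundaries(forward_max_match(syllables, lexicon, max_len))
    bwd_cuts = boundaries(backward_max_match(syllables, lexicon, max_len))
    fragments, start = [], 0
    # Walk over the boundaries both segmentations share; a span between two
    # shared boundaries is ambiguous if its interior cuts differ.
    for end in sorted(fwd_cuts & bwd_cuts):
        inner_fwd = {c for c in fwd_cuts if start < c < end}
        inner_bwd = {c for c in bwd_cuts if start < c < end}
        if inner_fwd != inner_bwd:
            fragments.append(syllables[start:end])
        start = end
    return fragments


# Toy lexicon (illustrative only): "học sinh" (pupil) and "sinh học" (biology).
toy_lexicon = {("học", "sinh"), ("sinh", "học")}
sentence = "tôi học sinh học".split()
print(ambiguity_fragments(sentence, toy_lexicon))  # [['học', 'sinh', 'học']]
```

In this toy case the forward pass groups "học sinh" while the backward pass groups "sinh học", so the overlapping three-syllable span is reported as a crossing-ambiguity fragment. Fragments collected this way could then be labeled and used as training instances for a separate classifier, such as the maximum entropy model mentioned in the abstract.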

Keywords

conditional random fields (CRFs) / Vietnamese word segmentation / morphology / essential characteristics / maximum entropy / ambiguity model

Classification

Information Technology and Security Science

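Cite this article

Xiong Mingming, Li Ying, Guo Jianyi, Mao Cunli, Yu Zhengtao. Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model [J]. Journal of Data Acquisition and Processing, 2017, 32(3): 636-642, 7.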

Funding

Supported by the National Natural Science Foundation of China (61262041, 61472168, 61562052)

Supported by the Key Project of the Natural Science Foundation of Yunnan Province (2013FA030)

Journal of Data Acquisition and Processing (数据采集与处理)

OA / Peking University Core Journals (北大核心) / CSCD / CSTPCD

ISSN 1004-9037
