Journal of Data Acquisition and Processing, 2017, Vol. 32, Issue 3: 636-642 (7 pages). DOI: 10.16337/j.1004-9037.2017.03.024
Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model
Abstract
This paper discusses the lexical features of Vietnamese and integrates its essential characteristics into conditional random fields (CRFs), proposing a Vietnamese word segmentation method based on CRFs and an ambiguity model. A segmentation corpus of 25 981 Vietnamese sentences, obtained by automatic annotation followed by manual proofreading, serves as the CRF training corpus. Crossing (overlapping) ambiguity is widespread in Vietnamese sentences. To reduce its effect, 5 377 ambiguity fragments are extracted from the training corpus using dictionary-based forward and reverse matching, and an ambiguity model is obtained by training a maximum entropy model on them. The CRF model and the ambiguity model are then both incorporated into the segmentation model. The training corpus is split evenly into ten parts for cross-validation experiments, in which the segmentation accuracy reaches 96.55%. Experimental results show that, compared with the Vietnamese segmentation tool VnTokenizer, the method clearly improves the accuracy, recall and F value of Vietnamese word segmentation.
Keywords
conditional random fields (CRFs) / Vietnamese word segmentation / morphology / essential characteristics / maximum entropy / ambiguity model
Classification
Information Technology and Security Science
Cite this article
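Xiong Mingming, Li Ying, Guo Jianyi, Mao Cunli, Yu Zhengtao. Vietnamese word segmentation with conditional random fields and ambiguity model [J]. Journal of Data Acquisition and Processing, 2017, 32(3): 636-642.
Funding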
Supported by the National Natural Science Foundation of China (61262041, 61472168, 61562052)
and the Key Project of the Natural Science Foundation of Yunnan Province (2013FA030).
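Illustrative sketch

The abstract mentions extracting ambiguity fragments by dictionary-based forward and reverse matching. The Python sketch below shows one plausible reading of that step and is not the authors' implementation: a sentence is segmented by forward and by backward maximum matching over syllables, and any span where the two segmentations disagree is reported as a candidate crossing-ambiguity fragment. The function names, the `max_len` limit of four syllables and the toy lexicon are all assumptions made for illustration.

```python
def forward_max_match(syllables, lexicon, max_len=4):
    """Greedy left-to-right longest match; unknown syllables become single words."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


def backward_max_match(syllables, lexicon, max_len=4):
    """Greedy right-to-left longest match; unknown syllables become single words."""
    words, j = [], len(syllables)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = " ".join(syllables[j - n:j])
            if n == 1 or cand in lexicon:
                words.append(cand)
                j -= n
                break
    words.reverse()
    return words


def word_ends(words):
    """Syllable positions at which a word ends."""
    ends, pos = set(), 0
    for w in words:
        pos += len(w.split(" "))
        ends.add(pos)
    return ends


def ambiguity_fragments(syllables, lexicon, max_len=4):
    """Spans between shared boundaries where the two segmentations differ."""
    fwd_ends = word_ends(forward_max_match(syllables, lexicon, max_len))
    bwd_ends = word_ends(backward_max_match(syllables, lexicon, max_len))
    fragments, start = [], 0
    for end in sorted(fwd_ends & bwd_ends):
        inner_fwd = {b for b in fwd_ends if start < b < end}
        inner_bwd = {b for b in bwd_ends if start < b < end}
        if inner_fwd != inner_bwd:  # the two directions cut this span differently
            fragments.append(" ".join(syllables[start:end]))
        start = end
    return fragments


if __name__ == "__main__":
    # Toy lexicon (assumed): "ông già" and "già đi" overlap on the syllable "già".
    toy_lexicon = {"ông già", "già đi"}
    sentence = "ông già đi nhanh quá".split()
    print(ambiguity_fragments(sentence, toy_lexicon))
```

In a pipeline like the one the abstract describes, fragments collected this way would be the training instances for the maximum entropy ambiguity model. How the paper labels and features those instances is not stated here, so that part is left out.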