Computer Engineering and Applications, 2016, Vol. 52, Issue 23: 19-24, 6. DOI: 10.3778/j.issn.1002-8331.1512-0386
Automatic identification of address description in unstructured Chinese natural language
Abstract
Natural-language address descriptions are available on the Internet in massive quantities and carry a wealth of spatial information. Because such text is unstructured, this paper proposes a two-step approach that automatically extracts word and syntax information from a corpus of Chinese natural-language address descriptions, as a basis for discovering the associated spatial knowledge. In the first step, a gazetteer-independent Chinese word segmentation algorithm is designed from the statistical regularities with which character strings co-occur in the address corpus. A predefined list of common words that indicate or restrict other words in address statements can be introduced into this algorithm to improve segmentation quality and to facilitate part-of-speech tagging. In the second step, a finite state machine model is built to represent the common syntaxes of Chinese address description, and is then applied to automatically match and recognize the syntactic structures of the segmented and tagged address statements. Experiments on statistical segmentation and syntactic recognition, run on a large address corpus collected from the Internet, demonstrate the effectiveness and practicality of this approach.
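The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives neither the co-occurrence statistic, the predefined word list, nor the state machine, so the `INDICATORS` table, the `min_count` threshold, the `TRANSITIONS` grammar, and the tiny corpus below are all hypothetical stand-ins chosen only to show the general shape of the idea.

```python
from collections import Counter

# Hypothetical indicator characters and the tags they imply; the paper's
# actual predefined word list is not given in the abstract.
INDICATORS = {"市": "CITY", "区": "DISTRICT", "路": "ROAD", "号": "NUMBER"}


def bigram_counts(corpus):
    """Count adjacent character pairs across all address strings."""
    counts = Counter()
    for text in corpus:
        counts.update(zip(text, text[1:]))
    return counts


def segment(text, counts, min_count=2):
    """Cut after an indicator character or before a rarely seen character pair."""
    words, current = [], ""
    for i, ch in enumerate(text):
        current += ch
        is_last = i == len(text) - 1
        if is_last or ch in INDICATORS or counts[(ch, text[i + 1])] < min_count:
            words.append(current)
            current = ""
    return words


def tag(word):
    """Tag a word by its final indicator character, or NAME if it has none."""
    return INDICATORS.get(word[-1], "NAME")


# Step 2: a toy finite state machine over tags. This grammar is an assumption
# made for illustration, not the syntax model built in the paper.
TRANSITIONS = {
    "START": {"CITY": "CITY", "NAME": "CITY"},
    "CITY": {"DISTRICT": "DISTRICT", "ROAD": "ROAD"},
    "DISTRICT": {"ROAD": "ROAD"},
    "ROAD": {"NUMBER": "NUMBER"},
}
ACCEPTING = {"ROAD", "NUMBER"}


def recognize(tags):
    """Return True if the tag sequence is accepted by the toy state machine."""
    state = "START"
    for t in tags:
        state = TRANSITIONS.get(state, {}).get(t)
        if state is None:
            return False
    return state in ACCEPTING


# Made-up miniature corpus, used only to produce some co-occurrence counts.
corpus = ["西安市长安路", "西安市小寨路", "西安长安路", "西安小寨路"]
counts = bigram_counts(corpus)

words = segment("西安长安路", counts)
tags = [tag(w) for w in words]
print(words, tags, recognize(tags))
```

The segmentation here needs no gazetteer: boundaries come only from character-pair frequencies and the indicator list, which mirrors the gazetteer-independent idea in spirit. A realistic version would need a far larger corpus and a more robust statistic than a raw count threshold.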
Key words
address description / natural language / Chinese word segmentation / syntactic recognition
Classification
Information Technology and Security Science
Cite this article
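ZHAO Weifeng, ZHANG Qin. Automatic identification of address description in unstructured Chinese natural language [J]. Computer Engineering and Applications, 2016, 52(23): 19-24, 6.
Funding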
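National Natural Science Foundation of China (No. 41301513); Open Research Fund of the State Key Laboratory of Geo-Information Engineering (No. SKLGIE 2014-M-4-2); Fundamental Research Funds for the Central Universities (No. 2014G1261056).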