Computer Engineering and Applications, 2016, Vol. 52, Issue 23: 19-24, 6. DOI: 10.3778/j.issn.1002-8331.1512-0386

Automatic identification of address description in unstructured Chinese natural language

Zhao Weifeng¹, Zhang Qin²

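Author information

1. School of Geological Engineering and Geomatics, Chang'an University, Xi'an 710054
2. State Key Laboratory of Geo-Information Engineering, Xi'an 710054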
Abstract

Natural-language address descriptions, which are abundant on the Internet, carry a wealth of spatial information. Because such text is unstructured, this paper proposes a two-step approach that automatically extracts word and syntax information from a corpus of Chinese natural-language address descriptions, to support further discovery of the associated spatial knowledge. In the first step, a gazetteer-independent Chinese word segmentation algorithm is designed based on the statistical regularities with which character strings co-occur in the address corpus. A predefined list of common words that indicate or restrict other words in address statements can be introduced into this algorithm to improve segmentation quality and facilitate part-of-speech tagging. In the second step, a finite state machine model is built to represent the common syntaxes of Chinese address descriptions, and is then applied to automatically match and recognize the syntactic structures of the segmented and tagged address statements. Experiments on statistical segmentation and syntactic recognition, carried out on a large address corpus collected from the Internet, demonstrate the effectiveness and usability of the approach.
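The second step can be illustrated with a minimal Python sketch. It is not the authors' model: the tag set (A, R, N, B), the suffix-based indicator list, the state names and the transition table are all hypothetical, chosen only to show how a finite state machine might match the tag sequence of an already segmented address statement.

```python
# Illustrative sketch only: tags, indicator words, states and transitions
# are assumptions made for this example, not taken from the paper.

# Hypothetical indicator suffixes mapped to hypothetical tags:
# A = administrative area, R = road, N = house number.
INDICATOR_TAGS = {
    "省": "A", "市": "A", "区": "A", "县": "A",
    "路": "R", "街": "R",
    "号": "N",
}

# Hypothetical transition table: (current state, tag) -> next state.
TRANSITIONS = {
    ("START", "A"): "AREA",
    ("AREA", "A"): "AREA",
    ("START", "R"): "ROAD",
    ("AREA", "R"): "ROAD",
    ("ROAD", "N"): "NUMBER",
    ("AREA", "B"): "POI",
    ("ROAD", "B"): "POI",
    ("NUMBER", "B"): "POI",
}

# States in which a statement counts as a complete address.
ACCEPTING = {"ROAD", "NUMBER", "POI"}


def tag_token(token):
    """Tag a token by its last character; untagged tokens default to B (building/POI)."""
    return INDICATOR_TAGS.get(token[-1], "B")


def recognize(tokens):
    """Return the tag sequence if the FSM accepts the tokens, otherwise None."""
    state = "START"
    tags = []
    for token in tokens:
        tag = tag_token(token)
        next_state = TRANSITIONS.get((state, tag))
        if next_state is None:
            return None  # no transition defined: the syntax is not recognized
        state = next_state
        tags.append(tag)
    return tags if state in ACCEPTING else None


if __name__ == "__main__":
    # Made-up example statement, already split into words.
    print(recognize(["陕西省", "西安市", "雁塔路", "126号"]))
```

A table-driven machine keeps the syntax rules as data, so new address patterns could be covered by adding transitions rather than changing the matching loop.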

Key words

address description/natural language/Chinese word segmentation/syntactic recognition

Classification

Information technology and security science

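Cite this article

Zhao Weifeng, Zhang Qin. Automatic identification of address description in unstructured Chinese natural language[J]. Computer Engineering and Applications, 2016, 52(23): 19-24, 6.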

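Funding

National Natural Science Foundation of China (No. 41301513); Open Research Fund of the State Key Laboratory of Geo-Information Engineering (No. SKLGIE 2014-M-4-2); Fundamental Research Funds for the Central Universities (No. 2014G1261056).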

Computer Engineering and Applications

OA; Peking University Core Journal; CSCD; CSTPCD

ISSN: 1002-8331
