Extraction method of Chinese organization full names and abbreviations based on Web page information and word segmentation

Application Research of Computers, 2017, Vol. 34, Issue 4: 972-976, 5. DOI: 10.3969/j.issn.1001-3695.2017.04.003

Extraction method of organization full names and abbreviations based on Web page and word segmentation

Zhang Junling (张俊玲) ¹, Geng Guanggang (耿光刚) ², Yan Zhiwei (延志伟) ³, Li Xiaodong (李晓东) ³

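Author information

1. University of Chinese Academy of Sciences, Beijing 100190
2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
3. China Internet Network Information Center, Beijing 100190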

Abstract

In the past, search engines could handle the correspondence between full names and abbreviations only through manually added entries, which led to omitted abbreviations and a low recall rate for search results. To address these problems, this paper proposes a method for extracting the full names and abbreviations of organizations based on Web pages and word segmentation. The method first obtains the source code of an organization's website homepage. It then extracts the organization's full name from the source code and extracts candidate abbreviations based on a collection of contextual features of organization names. Finally, it computes the similarity between each candidate abbreviation and the full name to determine which candidates are genuine abbreviations. In experiments on 1,287 organization websites, the method achieved a full-name accuracy of 93.9%, and an abbreviation recall and accuracy of 85.3% and 90.8%, respectively. The experimental results show that the method is effective.
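The final step of the pipeline, scoring candidate abbreviations against the full name, can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so the measure below is a stand-in assumption rather than the authors' method: a candidate must be an in-order subsequence of the full name, and its score is the fraction of segmented words it draws at least one character from. The names `abbreviation_score` and `select_abbreviations`, the pre-segmented input, and the threshold value are all hypothetical.

```python
from typing import List


def abbreviation_score(candidate: str, full_words: List[str]) -> float:
    """Toy similarity in [0, 1] between a candidate and a segmented full name.

    Assumption (not from the paper): a valid abbreviation is a shorter,
    in-order subsequence of the full name, and it is better when it takes
    characters from more of the segmented words.
    """
    full = "".join(full_words)
    if not candidate or len(candidate) >= len(full):
        return 0.0
    # owners[i] is the index of the word that character i of `full` belongs to.
    owners = [i for i, word in enumerate(full_words) for _ in word]
    covered = set()
    pos = 0
    for ch in candidate:
        pos = full.find(ch, pos)  # greedy left-to-right match
        if pos == -1:
            return 0.0  # character missing or out of order
        covered.add(owners[pos])
        pos += 1
    return len(covered) / len(full_words)


def select_abbreviations(candidates: List[str], full_words: List[str],
                         threshold: float = 0.8) -> List[str]:
    """Keep candidates whose score reaches the (hypothetical) threshold."""
    return [c for c in candidates
            if abbreviation_score(c, full_words) >= threshold]


if __name__ == "__main__":
    words = ["中国", "科学", "院"]  # assumed output of a word segmenter
    for cand in ["中科院", "科学院", "清华", "中国科学院"]:
        print(cand, abbreviation_score(cand, words))
```

The greedy match is a simplification: it does not search for the alignment that covers the most words, and it ignores the contextual features the method uses to collect candidates in the first place.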

Keywords

extraction of organization abbreviations / extraction of organization full names / Web page analysis / abbreviation similarity calculation

Classification

Information technology and security science

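Cite this article

Zhang Junling, Geng Guanggang, Yan Zhiwei, Li Xiaodong. Extraction method of organization full names and abbreviations based on Web page and word segmentation[J]. Application Research of Computers, 2017, 34(4): 972-976, 5.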

Funding

National Natural Science Foundation of China (61375039, 61272433)

Application Research of Computers

OA / Peking University Core Journals / CSCD / CSTPCD

ISSN: 1001-3695
