A Webpage Information Extraction Method Based on Regex Construction Learning

Zhu Wenyan (朱文琰)¹, Zheng Xiaoxiong (郑肖雄)¹

Computer Applications and Software (计算机应用与软件), 2017, Vol. 34, Issue 2: 14-19, 79, 7. DOI: 10.3969/j.issn.1000-386x.2017.02.003

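Author information

1. Key Laboratory of Intelligent Information Processing, School of Computer Science and Technology, Fudan University, Shanghai 200433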

Abstract

As one of the main approaches to information extraction, regular-expression-based methods have been widely used for many years. However, constructing regular expressions demands high quality and involves high complexity, so it usually requires considerable manual effort. This paper therefore proposes a method based on regular expression state transition that learns to construct complex regular expressions. The method takes a given initial RegEx together with positively and negatively labeled samples. Applying two main kinds of regular expression transformation to the input RegEx yields a set of candidate RegExes. Each candidate is scored by its F-measure on the information extraction task, and a greedy heuristic strategy selects the best-scoring RegEx as the output. The algorithm is evaluated on multiple datasets. Experiments show that the proposed method outperforms standard machine learning methods in both performance and accuracy, and that it remains effective with small training sets and on cross-domain datasets.
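The greedy, F-measure-guided search described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the two transformation types, so `narrow_class` and `bound_quantifier` are hypothetical stand-ins, and scoring each RegEx as a full-match classifier over labeled strings is a simplifying assumption. All function names and sample data are likewise invented for the example.

```python
import re
from typing import Callable, Iterator, List, Tuple

Transform = Callable[[str], Iterator[str]]


def f_measure(pattern: str, positives: List[str], negatives: List[str]) -> float:
    """F1 score of `pattern` used as a full-match classifier (an assumption)."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return 0.0  # an invalid candidate scores zero and is never selected
    tp = sum(1 for s in positives if rx.fullmatch(s))
    fp = sum(1 for s in negatives if rx.fullmatch(s))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(positives)
    return 2 * precision * recall / (precision + recall)


def replace_each(pattern: str, old: str, new: str) -> Iterator[str]:
    """Yield every pattern obtained by replacing one occurrence of `old`.

    Purely textual: it does not parse the regex, so escaped tokens such as
    a literal `\\+` are not handled. Good enough for a sketch.
    """
    start = pattern.find(old)
    while start != -1:
        yield pattern[:start] + new + pattern[start + len(old):]
        start = pattern.find(old, start + 1)


def narrow_class(pattern: str) -> Iterator[str]:
    """Hypothetical transformation 1: narrow one `\\w` to `\\d`."""
    yield from replace_each(pattern, r"\w", r"\d")


def bound_quantifier(pattern: str) -> Iterator[str]:
    """Hypothetical transformation 2: replace one `+` with a fixed count."""
    for n in (3, 4):
        yield from replace_each(pattern, "+", "{%d}" % n)


def learn_regex(initial: str, positives: List[str], negatives: List[str],
                transforms: List[Transform], max_steps: int = 20) -> Tuple[str, float]:
    """Greedy hill climbing: move to the best candidate while F1 improves."""
    best, best_score = initial, f_measure(initial, positives, negatives)
    for _ in range(max_steps):
        candidates = [c for t in transforms for c in t(best)]
        if not candidates:
            break
        scored = [(f_measure(c, positives, negatives), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves on the current RegEx
        best, best_score = top, top_score
    return best, best_score


# Invented sample data: phone-like strings as positives, near misses as negatives.
POSITIVES = ["555-1234", "867-5309"]
NEGATIVES = ["foo-bar", "12-34567", "ab-1234"]

if __name__ == "__main__":
    print(learn_regex(r"\w+-\w+", POSITIVES, NEGATIVES,
                      [narrow_class, bound_quantifier]))
```

Starting from the deliberately loose `\w+-\w+`, each step generates candidates from the current RegEx, keeps the one with the highest F-measure, and stops once no candidate improves the score. A greedy search of this kind can settle on a local optimum, so the result depends on the initial RegEx and on which transformations are available.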

Keywords

RegEx construction / State transition / Web information extraction

Classification

Information Technology and Security Science

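Cite this article

Zhu Wenyan, Zheng Xiaoxiong. A webpage information extraction method based on regex construction learning [J]. Computer Applications and Software, 2017, 34(2): 14-19, 79, 7.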

Computer Applications and Software

OA / Peking University Core Journals / CSTPCD

ISSN: 1000-386X
