Computer Applications and Software (计算机应用与软件), 2017, Vol. 34, Issue 2: 14-19, 79, 7. DOI: 10.3969/j.issn.1000-386x.2017.02.003
基于正则表达式构建学习的网页信息抽取方法
A WEBPAGE INFORMATION EXTRACTION METHOD BASED ON REGEX CONSTRUCTION LEARNING
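Zhu Wenyan (朱文琰) 1, Zheng Xiaoxiong (郑肖雄) 1
Author information
- 1. Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433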
Abstract
Regular-expression-based extraction is one of the main methods in the field of information extraction and has been widely used for many years. However, constructing high-quality regular expressions is complex and usually requires substantial manual effort. This paper therefore proposes a method, based on regular expression state transitions, that learns to construct complex regular expressions. The method takes an initial input RegEx together with positively and negatively labeled samples. It applies two main kinds of regular expression transformation to the input RegEx to obtain a set of candidate RegExes, scores each candidate by its F-measure on the information extraction task, and selects an optimal regular expression as output using a greedy heuristic strategy. The algorithm is evaluated on multiple datasets. Experiments show that the proposed method outperforms standard machine learning methods in both performance and accuracy, and that it remains effective with small training sets and on cross-domain datasets.
Keywords
RegEx construction / State transition / Web information extraction
Classification
Information technology and security science
Cite this article
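Zhu Wenyan, Zheng Xiaoxiong. A webpage information extraction method based on RegEx construction learning [J]. Computer Applications and Software, 2017, 34(2): 14-19, 79, 7.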
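
As a rough illustration of the greedy search outlined in the abstract, the following Python sketch starts from an initial RegEx, generates candidates by transformation, scores each candidate by F-measure on labeled samples, and keeps the best candidate until no improvement is found. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say what the two kinds of transformation are, so `loosen` and `tighten` are hypothetical stand-ins, the token-tuple representation and the function names are invented for this example, and scoring by whole-string matching is a simplification of a real extraction task.

```python
import re
from typing import Iterator, List, Tuple

# A pattern is a tuple of regex tokens, e.g. (".", "-", "."), joined to form the RegEx.
Pattern = Tuple[str, ...]


def f_measure(pattern: Pattern, positives: List[str], negatives: List[str]) -> float:
    """F1 of a pattern, treating a full match as a positive prediction."""
    try:
        rx = re.compile("".join(pattern))
    except re.error:
        return 0.0  # an invalid candidate scores zero
    tp = sum(1 for s in positives if rx.fullmatch(s))
    fp = sum(1 for s in negatives if rx.fullmatch(s))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(positives)
    return 2 * precision * recall / (precision + recall)


def loosen(pattern: Pattern) -> Iterator[Pattern]:
    """Hypothetical transformation 1: add a '+' quantifier to one token."""
    for i, tok in enumerate(pattern):
        if not tok.endswith("+"):
            yield pattern[:i] + (tok + "+",) + pattern[i + 1:]


def tighten(pattern: Pattern) -> Iterator[Pattern]:
    """Hypothetical transformation 2: narrow one '.' wildcard to the digit class."""
    for i, tok in enumerate(pattern):
        if tok.startswith("."):
            yield pattern[:i] + (r"\d" + tok[1:],) + pattern[i + 1:]


def greedy_learn(initial: Pattern, positives: List[str], negatives: List[str],
                 max_steps: int = 20) -> Tuple[Pattern, float]:
    """Greedy hill climbing: move to the best-scoring candidate while it improves."""
    best, best_score = initial, f_measure(initial, positives, negatives)
    for _ in range(max_steps):
        candidates = list(loosen(best)) + list(tighten(best))
        if not candidates:
            break
        scored = [(f_measure(c, positives, negatives), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the F-measure
        best, best_score = top, top_score
    return best, best_score


if __name__ == "__main__":
    positives = ["12-34", "5-678", "90-1"]   # toy labeled samples (made up)
    negatives = ["ab-cd", "1-x", "x-2"]
    pattern, score = greedy_learn((".", "-", "."), positives, negatives)
    print("".join(pattern), score)
```

On this toy data the search moves from `.-.` to a digits-dash-digits pattern in four greedy steps. Because each step commits to the single best candidate, the procedure can stop at a local optimum; that is the usual trade-off of a greedy heuristic against an exhaustive search over candidates.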