Computer Technology and Development (计算机技术与发展), 2017, Vol. 27, Issue 10: 24-29, 6. DOI: 10.3969/j.issn.1673-629X.2017.10.006
一种规则与SVM结合的论文抽取方法
An Extraction Method for Papers via Integration of Rules with SVM
Abstract
Traditional methods for extracting information from PDF papers are based mainly on either rules or machine learning. Rule-based extraction has clear advantages for fixed-format data: a few simple extraction rules can locate and extract the data accurately. However, it needs fairly complex rules to handle flexibly formatted data and adapts poorly to varying paper formats, so it falls short of machine-learning-based extraction in flexibility and accuracy. To address this, an extraction method for PDF papers that integrates rules with SVM is proposed, making full use of the advantages of both rules and machine learning. After fixed-format information is extracted with simple rules, sample characteristics are chosen to build the training set, and the optimal kernel function is selected to generate the SVM model used for SVM-based information extraction. Taking the SVM extraction results as the main body, rule-based verification is then carried out using some appropriately formulated rules. Experimental results show that the method achieves better results in extracting the metadata and chapter headings of PDF papers.
Keywords
PDF papers / rules / support vector machine (SVM) / sample characteristics / hybrid method / information extraction
Classification
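Information Technology and Security Science
Cite this article
LI Xueju, WANG Zhiguang, LU Qiang. An Extraction Method for Papers via Integration of Rules with SVM [J]. Computer Technology and Development, 2017, 27(10): 24-29, 6.
Funding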
National Natural Science Foundation of China (60803159)
National Science and Technology Major Project of China (2011ZX05005-005-006)