Computer Applications and Software (计算机应用与软件), 2013, Vol. 30, Issue 4: 88-91, 4. DOI: 10.3969/j.issn.1000-386x.2013.04.025
一种基于逆序匹配重复模式的主题信息提取方法
A THEME INFORMATION EXTRACTION METHOD BASED ON REPETITIVE PATTERN REVERSE MATCHING
Abstract
Information in webpages is mainly arranged in repetitive HTML structures and presented in a consistent display style. This paper focuses on recognising theme information in webpages with complicated repetitive patterns and proposes an improved algorithm based on reverse matching of repetitive patterns. The method improves the document tree model according to the HTML tag structure and the class attribute, reconstructs the vector space model of the pages, matches the repetitive structure pattern in reverse order, and then extracts the theme information. Experimental results suggest that the method precisely recognises the theme repetitive pattern in complicated webpage structures, effectively avoids interference from non-theme repetitive pattern blocks, and performs well in precision and recall.
Key words
Information extraction / Repetitive pattern / Theme recognition / Reverse matching
Classification
Information Technology and Security Science
Cite this article
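Wu Jiehua (伍杰华), Ni Zhensheng (倪振声), Chen Youqing (陈有青). A theme information extraction method based on repetitive pattern reverse matching [J]. Computer Applications and Software (计算机应用与软件), 2013, 30(4): 88-91, 4.
Funding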
National Natural Science Foundation of China (61003045).