A Theme Information Extraction Method Based on Repetitive Pattern Reverse Matching (一种基于逆序匹配重复模式的主题信息提取方法)

Wu Jiehua (伍杰华), Ni Zhensheng (倪振声), Chen Youqing (陈有青)

Computer Applications and Software (计算机应用与软件), 2013, Vol. 30, Issue 4: 88-91, 4. DOI: 10.3969/j.issn.1000-386x.2013.04.025

Abstract

Information on web pages is mostly laid out in repetitive HTML structures and rendered in a consistent display style. This paper focuses on recognising theme information in web pages with complicated repetitive patterns and proposes an improved algorithm based on reverse matching of repetitive patterns. The method refines the document tree model according to HTML tag structure and the class attribute, reconstructs the vector space model of the page, matches the repetitive structure pattern in reverse order, and then extracts the theme information. Experimental results suggest that the method precisely recognises the theme repetitive pattern in complicated page structures, effectively avoids interference from non-theme repetitive blocks, and performs well in precision and recall.
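The abstract does not spell out the matching procedure, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes well-formed markup, fingerprints each node by its tag, class attribute, and child structure, and scans each sibling list from last to first for the longest run of identical fingerprints. The names `Node`, `TreeBuilder`, `longest_run_reversed`, `find_theme_block`, and the run-length times record-size score are hypothetical choices for illustration; the vector space model step is omitted.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, cls, parent=None):
        self.tag = tag
        self.cls = cls
        self.parent = parent
        self.children = []

    def signature(self):
        # Structural fingerprint: tag, class attribute, and child fingerprints.
        kids = ",".join(c.signature() for c in self.children)
        return f"{self.tag}.{self.cls}({kids})"


class TreeBuilder(HTMLParser):
    """Builds a simplified document tree (assumes well-formed markup)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", "")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore strays.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(c) for c in node.children)


def longest_run_reversed(children):
    """Scan siblings last-to-first; return the longest run of equal signatures."""
    best, run = [], []
    for child in reversed(children):
        if run and child.signature() == run[-1].signature():
            run.append(child)
        else:
            run = [child]
        if len(run) > len(best):
            best = list(run)
    best.reverse()  # restore document order
    return best


def find_theme_block(root, min_repeat=2):
    # Placeholder score (an assumption): repeats multiplied by record size,
    # so small repeated items such as menu links rank below richer records.
    best_score, best_run = 0, []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        run = longest_run_reversed(node.children)
        if len(run) >= min_repeat:
            score = len(run) * size(run[0])
            if score > best_score:
                best_score, best_run = score, run
    return best_run


SAMPLE = """
<html><body>
<ul class="nav"><li><a>Home</a></li><li><a>About</a></li></ul>
<div class="list">
  <div class="item"><h3>A</h3><p>x</p><span class="date">1</span></div>
  <div class="item"><h3>B</h3><p>y</p><span class="date">2</span></div>
  <div class="item"><h3>C</h3><p>z</p><span class="date">3</span></div>
  <div class="footer">end</div>
</div>
</body></html>
"""

records = find_theme_block(parse(SAMPLE))
```

On this sample page, the three `div` elements with class `item` form the selected run, while the two-entry navigation list scores lower because each of its records is smaller. A real page would need more robust parsing and a better-founded score than this placeholder.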

Key words

Information extraction / Repetitive pattern / Theme recognition / Reverse matching

Classification

Information Technology and Security Science

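Cite this article

Wu Jiehua, Ni Zhensheng, Chen Youqing. A theme information extraction method based on repetitive pattern reverse matching [J]. Computer Applications and Software, 2013, 30(4): 88-91, 4.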

Funding

National Natural Science Foundation of China (61003045).

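Computer Applications and Software

Open access; indexed in the Peking University Core Journals list (北大核心), CSCD, and CSTPCD

ISSN: 1000-386X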
