Journal of Guangdong University of Technology, 2018, Vol. 35, Issue (2): 51-56, 6. DOI: 10.12052/gdutxb.170152

Text Extraction Based on Text Block Density with Tag Path and Other Features

Yang Xian¹, Tang Chaolan¹, Li Hang²

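Author information

1. School of Art and Design, Guangdong University of Technology, Guangzhou 510090, Guangdong, China
2. School of Computers, Guangdong University of Technology, Guangzhou 510006, Guangdong, China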

Abstract

Most web pages contain a large amount of noisy information alongside their main content. To address this problem and improve the accuracy of web page content extraction, a method is proposed that is based on text block density combined with tag paths and other features. The method combines the strengths of text-block-based extraction and tag-path-based extraction. First, the main content block is located according to the density features of the text blocks. Then the tag path method is used to remove noisy nodes within that block. Finally, the content is extracted from the text nodes that remain in the block. This approach addresses two weaknesses of the individual methods: noisy information inside a text block is difficult to filter out, and the tag path method on its own tends to extract long text from noisy blocks. Experiments show that the method outperforms CETR and CETD in most cases.
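The abstract gives only the outline of the pipeline (locate the densest text block, filter its nodes by tag path, then collect the remaining text), so the following is a minimal sketch under stated assumptions rather than the authors' implementation. The density measure (characters per tag), the per-path ratio (average text length of the nodes sharing a tag path), the `min_ratio_share` threshold, and all class and function names are hypothetical choices made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "article", "section", "td"}  # assumed block containers
SKIP_TAGS = {"script", "style"}                   # text here is never content
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collect (block_id, tag_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # open tags, root first
        self.block_stack = [0]             # innermost enclosing block id
        self.next_block = 1
        self.tag_count = defaultdict(int)  # tags opened directly in each block
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in BLOCK_TAGS:
            self.block_stack.append(self.next_block)
            self.next_block += 1
        self.tag_count[self.block_stack[-1]] += 1
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return
        # Pop up to and including the matching tag (tolerates unclosed tags).
        while self.stack:
            popped = self.stack.pop()
            if popped in BLOCK_TAGS:
                self.block_stack.pop()
            if popped == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            path = "/".join(self.stack)
            self.nodes.append((self.block_stack[-1], path, text))


def extract_content(html, min_ratio_share=0.3):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()

    # Step 1: choose the block with the highest text density (chars per tag).
    chars = defaultdict(int)
    for block, _, text in parser.nodes:
        chars[block] += len(text)
    if not chars:
        return ""
    best = max(chars, key=lambda b: chars[b] / max(parser.tag_count[b], 1))

    # Step 2: inside that block, score each tag path by the average
    # text length of the nodes that share it.
    in_block = [(p, t) for b, p, t in parser.nodes if b == best]
    total, count = defaultdict(int), defaultdict(int)
    for path, text in in_block:
        total[path] += len(text)
        count[path] += 1
    ratio = {p: total[p] / count[p] for p in total}
    cutoff = min_ratio_share * max(ratio.values())

    # Step 3: keep only text nodes whose path clears the cutoff.
    return "\n".join(t for p, t in in_block if ratio[p] >= cutoff)


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="main">
  <p>The first paragraph of the article carries a long run of ordinary prose.</p>
  <p>The second paragraph continues the story with another long sentence.</p>
  <span><a href="/share">Share</a></span>
</div>
<div id="footer"><a href="/terms">Terms</a></div>
</body></html>
"""

result = extract_content(SAMPLE)
print(result)
```

On this sample the navigation and footer blocks lose in step 1 because they hold little text per tag, and the "Share" link inside the main block is dropped in step 2 because its tag path carries far less text per node than the paragraph path. A plain characters-per-tag score can favor a short block with a single tag, so a real page would likely need additional features.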

Key words

content extraction/text block/tag path/text density

Classification

Information technology and security science

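Cite this article

Yang Xian, Tang Chaolan, Li Hang. Text Extraction Based on Text Block Density with Tag Path and Other Features[J]. Journal of Guangdong University of Technology, 2018, 35(2): 51-56, 6.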

Funding

Guangdong Provincial and Ministerial Industry-University-Research Special Fund, Enterprise Innovation Platform Project (2013B090800042)

Journal of Guangdong University of Technology

ISSN: 1007-7162
