Web content extraction method based on text feature value

Journal of Guilin University of Electronic Technology, 2017, Vol. 37, Issue 2: 106-110, 5.

Meng Chuan (孟川)¹, Wu Xiaonian (武小年)¹

Author information

1. School of Information and Communication, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China

Abstract

To address the poor generality and low accuracy of existing Web text extraction methods, a content extraction method based on text feature values is proposed. First, the source code of a Web page is preprocessed and converted into a DOM tree. The DOM tree is then traversed, and the text feature value of each node is calculated from the node's text length and punctuation weight; the standard deviation is used to eliminate as much noise as possible. A Gaussian function is then used to smooth the text feature values of the nodes, which eases abrupt changes in these values and reduces the possible loss of short text nodes. The experimental results show that the proposed method does not rely on tags, needs no training data, and has good generality and high accuracy.
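The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the DOM tree has already been flattened into a list of node texts in document order. The names `feature_value`, `gaussian_smooth`, `extract_content`, the `punct_weight` parameter, the kernel settings, and the rule "keep a node if its smoothed score reaches the standard deviation of the raw scores" are all illustrative assumptions.

```python
import math
import statistics

# Punctuation treated as a hint of running prose (assumed set, ASCII and CJK).
PUNCTUATION = set(",.;:!?，。；：！？、")


def feature_value(text, punct_weight=10.0):
    """Hypothetical score: text length plus a weighted punctuation count."""
    stripped = text.strip()
    punct = sum(1 for ch in stripped if ch in PUNCTUATION)
    return len(stripped) + punct_weight * punct


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a truncated Gaussian kernel, renormalized at the edges."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):
                total += w * values[j]
                weight += w
        smoothed.append(total / weight)
    return smoothed


def extract_content(node_texts, sigma=1.0, radius=2):
    """Keep nodes whose smoothed score reaches the std. deviation of the raw scores.

    The threshold rule is an assumption made for this sketch.
    """
    raw = [feature_value(t) for t in node_texts]
    threshold = statistics.pstdev(raw)
    smoothed = gaussian_smooth(raw, sigma, radius)
    return [t for t, s in zip(node_texts, smoothed) if s >= threshold]


# Toy input: navigation links around two long paragraphs and one short caption.
sample = [
    "Home",
    "About",
    "Login",
    "The proposed method preprocesses the page source, builds a DOM tree, and scores every node by its text.",
    "A short caption.",
    "Experiments reported in the paper suggest good generality, since no tag rules or training data are needed.",
    "Contact",
    "Sitemap",
    "Help",
]
print(extract_content(sample))
```

In this toy input the short caption scores below the threshold on its own, but smoothing borrows weight from the two long neighbouring paragraphs, so it is kept while the navigation links are dropped. That is the effect the abstract attributes to Gaussian smoothing; the specific weights and threshold here are placeholders.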

Key words

content extraction / topic Web page / text feature value / Gaussian smoothing

Classification

Information Technology and Security Science

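Cite this article

Meng Chuan, Wu Xiaonian. Web content extraction method based on text feature value[J]. Journal of Guilin University of Electronic Technology, 2017, 37(2): 106-110, 5.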

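Funding

Guangxi Natural Science Foundation (2015GXNSFGA139007)

Guangxi Key Laboratory of Wireless Wideband Communication and Signal Processing Fund (GXKL061510, GXKL0614110)

Guangxi Key Laboratory of Trusted Software Fund (KX201622)

Guilin University of Electronic Technology Graduate Education Innovation Program (YJCXS201524)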

Journal of Guilin University of Electronic Technology

ISSN: 1673-808X
