Journal of Guilin University of Electronic Technology, 2017, Vol. 37, Issue (2): 106-110, 5.
基于文本特征值的正文抽取方法
Web content extraction method based on text feature value
Abstract
To address the poor generality and low accuracy of existing Web content extraction methods, a content extraction method based on text feature values is proposed. First, the Web page source code is preprocessed and converted into a DOM tree. The DOM tree is then traversed, and a text feature value is calculated for each node from the node's text length and punctuation weight; the standard deviation is used to eliminate as much noise as possible. A Gaussian function is then used to smooth the nodes' text feature values, which softens abrupt changes in those values and reduces the possible loss of short text nodes. Experimental results show that the proposed method does not depend on particular tags, requires no training data, and offers good generality and high accuracy.
Keywords
content extraction / topic Web page / text feature value / Gaussian smoothing
Classification
Information technology and security science
Cite this article
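MENG Chuan, WU Xiaonian. Web content extraction method based on text feature value [J]. Journal of Guilin University of Electronic Technology, 2017, 37(2): 106-110, 5.
Funding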
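Guangxi Natural Science Foundation (2015GXNSFGA139007)
Foundation of Guangxi Key Laboratory of Wireless Wideband Communication and Signal Processing (GXKL061510, GXKL0614110)
Foundation of Guangxi Key Laboratory of Trusted Software (KX201622)
Innovation Project of Graduate Education of Guilin University of Electronic Technology (YJCXS201524)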
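
Illustrative sketch
The pipeline outlined in the abstract (a per-node text feature value built from text length and punctuation, followed by Gaussian smoothing) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the scoring formula in `feature_value`, the `punct_weight`, `sigma`, and `radius` parameters, and the cutoff. In particular, the sketch uses a simple mean cutoff as a stand-in for the paper's standard-deviation noise filter, whose exact rule is not stated, and it takes a flat list of node texts in document order rather than a real DOM tree.

```python
import math
import statistics

# Assumed punctuation set (ASCII and full-width); not taken from the paper.
PUNCTUATION = set(",.;:!?，。；：！？、")


def feature_value(text, punct_weight=5.0):
    """Hypothetical score: text length plus a weighted punctuation count."""
    stripped = text.strip()
    punct = sum(1 for ch in stripped if ch in PUNCTUATION)
    return len(stripped) + punct_weight * punct


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a truncated Gaussian kernel.

    The kernel is renormalized at the edges so that a constant
    sequence stays constant.
    """
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        for j in range(lo, hi):
            w = math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            total += w * values[j]
            weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def extract_content(node_texts, punct_weight=5.0, sigma=1.0, radius=2):
    """Keep nodes whose smoothed score reaches the mean smoothed score.

    The mean cutoff is an assumption standing in for the paper's
    standard-deviation-based noise filter.
    """
    raw = [feature_value(t, punct_weight) for t in node_texts]
    smoothed = gaussian_smooth(raw, sigma, radius)
    threshold = statistics.mean(smoothed)
    return [t for t, s in zip(node_texts, smoothed) if s >= threshold]


if __name__ == "__main__":
    nodes = [
        "Home",
        "Login",
        "The first paragraph of the article, with commas, and a full stop.",
        "Short line.",
        "The second paragraph is also long, punctuated, and clearly part of the body.",
        "Copyright",
        "Contact",
    ]
    for text in extract_content(nodes):
        print(text)
```

In this toy input, the raw score of "Short line." is far below those of the paragraphs on either side of it, so a cutoff applied to raw scores would tend to drop it. Smoothing lets the neighbouring paragraphs pull its score up, which is the effect the abstract describes as reducing the loss of short text nodes.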