Computer Engineering & Science, 2011, Vol. 33, Issue 1: 157-160, 4. DOI: 10.3969/j.issn.1007-130X.2011.01.030
面向Web论坛的网络信息获取技术及系统实现
The Web Forum Crawling Technology and System Implementation
Abstract
Web crawlers play an important role in information gathering, but they face new challenges when applied to Web forums. This paper studies the basic techniques for crawling Web forums, and designs and implements a system for collecting forum information. A traversal strategy is proposed based on the information structure of forums, and a DOM- and block-based algorithm is proposed for main-text extraction based on how the main text is distributed within a page. Experimental results show that the traversal strategy retrieves highly topic-relevant Web pages more efficiently than traditional traversal methods, and that applying the proposed main-text extraction to Web pages effectively improves the accuracy of information collection.
Keywords
Web crawler / Web forum / main-text extraction / topic relevance
Classification
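Information Technology and Security Science
Cite this article
Peng Dong, Cai Wandong. The Web Forum Crawling Technology and System Implementation [J]. Computer Engineering & Science, 2011, 33(1): 157-160, 4.
Funding
National 863 Program of China (2009AA01Z424)
Key Support Project for Undergraduate Graduation Design, Northwestern Polytechnical University, Class of 2009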