Discovery of topic-specific information source based on web crawler and website classification

Deng Houping (邓厚平), Wu Gang (武刚)

Computer Engineering and Applications, 2016, Vol. 52, Issue (3): 59-65, 7. DOI: 10.3778/j.issn.1002-8331.1402-0062

Discovery of topic-specific information source based on web crawler and website classification

Deng Houping (邓厚平) 1, Wu Gang (武刚) 1

Author information

  • 1. School of Information, Beijing Forestry University, Beijing 100083

Abstract

The discovery of topic-specific information sources is a prerequisite for Web information integration. This paper presents a topic-specific information source discovery method that recasts the problem as website topic classification and discovers candidate websites by following external links. An improved vector space model (VSM) is established to describe the topic of a website, using both content and structure features extracted from the site. Based on the improved VSM, a classification method combining the center-vector algorithm and SVM is presented to classify website topics. A web search strategy that aims to minimize the number of crawled web pages is presented to find the pages that best represent the topic of a website. The method is tested on the discovery of forestry business websites and performs well.
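
The center-vector step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain term-frequency VSM and cosine similarity. It does not reproduce the paper's improved VSM, its structure features, or the SVM stage, and every function name and the toy training data are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Turn text into a bag-of-words term-frequency vector (a plain VSM)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def center_vector(vectors):
    """Average a list of sparse vectors into one class center."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def classify(text, centers):
    """Return the label whose center is most similar to the text."""
    vec = tf_vector(text)
    return max(centers, key=lambda label: cosine(vec, centers[label]))


# Hypothetical toy training data, not taken from the paper.
training = {
    "forestry": ["timber seedling nursery wood supply", "forest timber trade wood"],
    "other": ["football match score league", "movie ticket cinema score"],
}
centers = {
    label: center_vector([tf_vector(doc) for doc in docs])
    for label, docs in training.items()
}
```

Calling `classify(page_text, centers)` on the text of a crawled page returns the nearest topic label. In a combined scheme such as the one the abstract describes, a second classifier could be applied on top of this step; how the paper combines the two is not stated here.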

Key words

website topic/feature description/classification/crawler/information source discovery

Classification

Information Technology and Security Science

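Cite this article

Deng Houping, Wu Gang. Discovery of topic-specific information source based on web crawler and website classification [J]. Computer Engineering and Applications, 2016, 52(3): 59-65, 7.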

Funding

Supported by the Fundamental Research Funds for the Central Universities (No. BLYX200928).

Computer Engineering and Applications

OA, PKU Core, CSCD, CSTPCD

ISSN 1002-8331
