Journal on Communications, 2011, Vol. 32, Issue (7): 189-195, 7.
A Deep Web data source discovery method based on MapReduce and virtual machines
Applying MapReduce frameworks to a virtualization platform for Deep Web data source discovery
Abstract
To improve the performance of Deep Web crawlers in discovering and searching data source interfaces, a new method is proposed for processing the massive data of the Deep Web in parallel by combining the MapReduce programming model with virtualization technology. The new crawling architecture is designed around three procedures: a link-classification MapReduce, a page-classification MapReduce, and a form-classification MapReduce. Server virtualization is adopted to simulate a cluster environment for performance testing. Experimental results indicate that the method is capable of large-scale parallel data computing, improves crawling efficiency, and avoids wasteful expenditure, which demonstrates the feasibility of applying cloud computing technologies to the field of Deep Web data mining.
Keywords
data source discovery / MapReduce / Deep Web / virtualization technology / cloud computing
Classification
Information technology and security science
Cite this article
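Xin Jie, Cui Zhiming, Zhao Pengpeng, Zhang Guangming, Xian Xuefeng. Applying MapReduce frameworks to a virtualization platform for Deep Web data source discovery [J]. Journal on Communications, 2011, 32(7): 189-195, 7.
Funding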
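National Natural Science Foundation of China (60970015, 61003054)
Jiangsu Province Enterprise Doctoral Innovation Project (BK2009563)
Natural Science Research Project of Jiangsu Higher Education Institutions (10KJB520018)
Suzhou Special Fund for Technological Innovation of Science and Technology Enterprises (SG201043)
2010 Jiangsu Province Graduate Research and Innovation Program for Regular Higher Education Institutions (CX10B_041Z)
Jiangsu Province Fund for Promoting the Industrialization of Research Achievements of Regular Higher Education Institutions (JH09-46)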