Research on a Hadoop-Based System for Crawling and Storing Agricultural Product Price Data

Computer Applications and Software (计算机应用与软件), 2017, Vol. 34, Issue 3: 76-80, 5. DOI: 10.3969/j.issn.1000-386x.2017.03.013

RESEARCH ON DATA CRAWLING AND STORAGE SYSTEM OF AGRICULTURAL PRODUCT PRICE BASED ON HADOOP PLATFORM

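Yang Xiaodong (杨晓东)¹, Gao Lutao (郜鲁涛)¹, Yang Linnan (杨林楠)¹, Liu Jianyang (刘建阳)²

Author information

1. College of Basic Science and Information Engineering, Yunnan Agricultural University, Kunming 650201, Yunnan, China
2. Yunnan Information Technology Development Center, Kunming 650228, Yunnan, China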

Abstract

At present, many large agricultural product markets and agricultural e-commerce information platforms publish agricultural product prices from different regions in real time every day. Because these data are voluminous, varied, and rapidly updated, crawling and storing them, and the analysis that follows, become difficult. We therefore propose a Hadoop-based system for crawling and storing agricultural product price data. Multi-threaded crawling is implemented with the HttpClient framework combined with a thread pool, and every crawled page is checked for integrity. Pages whose information is incomplete are filtered out and crawled again until the information is complete. The crawled pages are parsed and cleaned with regular expressions, and the useful extracted data are saved as text files in HDFS (Hadoop Distributed File System). Data crawled later are appended to HDFS. Experiments show that the write performance of HDFS is sufficient for incrementally crawled data, and that the fewer the replicas and the larger the data block, the better the write performance.
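The crawl, integrity-check, and extraction steps described in the abstract can be illustrated with a minimal Java sketch. It is not the authors' implementation. The class and method names, the HTML row layout matched by `ROW`, the integrity rule in `isComplete`, and the `maxAttempts` bound on re-crawling are all assumptions made for illustration. The `fetcher` parameter stands in for the actual HTTP call (the paper uses the HttpClient framework), and writing the records to HDFS is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceCrawlSketch {
    // Hypothetical page layout: <tr><td>product</td><td>market</td><td>price</td></tr>
    static final Pattern ROW = Pattern.compile(
        "<tr>\\s*<td>([^<]+)</td>\\s*<td>([^<]+)</td>\\s*<td>([0-9]+(?:\\.[0-9]+)?)</td>\\s*</tr>");

    // Assumed integrity rule: the page ends with </html> and contains at least one price row.
    static boolean isComplete(String html) {
        return html != null && html.trim().endsWith("</html>") && ROW.matcher(html).find();
    }

    // Clean a page into tab-separated text records, one per matched table row.
    static List<String> extract(String html) {
        List<String> records = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            records.add(m.group(1).trim() + "\t" + m.group(2).trim() + "\t" + m.group(3));
        }
        return records;
    }

    // Fetch one URL, re-crawling until the page is complete or maxAttempts is reached.
    static List<String> crawlOne(String url, Function<String, String> fetcher, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String html = fetcher.apply(url);
            if (isComplete(html)) {
                return extract(html);
            }
        }
        return new ArrayList<>(); // still incomplete after maxAttempts: no records
    }

    // Crawl all URLs on a fixed-size thread pool; records come back in URL order.
    static List<String> crawlAll(List<String> urls, Function<String, String> fetcher,
                                 int threads, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> crawlOne(url, fetcher, maxAttempts));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : pool.invokeAll(tasks)) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
            throw new IllegalStateException("crawl failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass a function that performs the real HTTP request, for example `PriceCrawlSketch.crawlAll(urls, url -> download(url), 8, 3)`, where `download` is a hypothetical helper. Bounding the retries is a design choice of this sketch: the abstract only states that incomplete pages are crawled again until the information is complete.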

Key words

Distributed system / Crawler / Hadoop / HDFS / Regular expression

Classification

Information Technology and Security Science

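Cite this article

Yang Xiaodong, Gao Lutao, Yang Linnan, Liu Jianyang. Research on data crawling and storage system of agricultural product price based on Hadoop platform [J]. Computer Applications and Software, 2017, 34(3): 76-80, 5.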

Funding

National Science and Technology Support Program of the 12th Five-Year Plan (2014BAD10B03).

Computer Applications and Software (计算机应用与软件)

OA / Peking University Core Journals (北大核心) / CSTPCD

ISSN: 1000-386X
