Computer Applications and Software, 2017, Vol. 34, Issue 3: 76-80, 5. DOI: 10.3969/j.issn.1000-386x.2017.03.013
Research on Data Crawling and Storage System of Agricultural Product Price Based on Hadoop Platform
Abstract
Many large agricultural product markets and agricultural information commerce platforms now publish agricultural product prices from different regions in real time every day. Because the data is voluminous, varied, and updated quickly, crawling, storing, and subsequently analyzing it is difficult. We therefore propose a Hadoop-based system for crawling and storing agricultural product price data. Multi-threaded crawling is implemented with the HttpClient framework combined with a thread pool, followed by an integrity check: web pages with incomplete information are filtered out and crawled again until the information is complete. The crawled pages are parsed and cleaned with regular expressions, and the useful extracted data is saved as text files in HDFS (Hadoop Distributed File System). Data crawled later is appended to HDFS. Experiments show that HDFS write performance is sufficient for incrementally crawled data, and that the fewer the replicas and the larger the data block, the better the write performance.
Keywords
Distributed system / Crawler / Hadoop / HDFS / Regular expression
Classification
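Information technology and security science
Cite this article
Yang Xiaodong, Gao Lutao, Yang Linnan, Liu Jianyang. Research on Data Crawling and Storage System of Agricultural Product Price Based on Hadoop Platform [J]. Computer Applications and Software, 2017, 34(3): 76-80, 5.
Funding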
National Science and Technology Support Program of the 12th Five-Year Plan (2014BAD10B03).
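Illustrative sketch

The pipeline summarized in the abstract (thread-pool crawling, an integrity check with re-crawling, regular-expression extraction, and incremental appends) might look roughly like the following. This is a minimal sketch and not the authors' implementation: it is written in Python rather than with the Java HttpClient framework, a local text file stands in for HDFS, and the page layout, `ROW_RE`, `END_MARKER`, the `fetch` callable, and the `max_rounds` retry cap are all assumptions introduced for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page layout: one table row per price quote, e.g.
# <tr><td>Tomato</td><td>Kunming</td><td>3.20</td><td>2017-03-01</td></tr>
ROW_RE = re.compile(
    r"<tr><td>([^<]+)</td><td>([^<]+)</td>"
    r"<td>(\d+(?:\.\d+)?)</td><td>(\d{4}-\d{2}-\d{2})</td></tr>"
)
END_MARKER = "</html>"  # assumed sign that a page arrived in full


def is_complete(html):
    """Integrity check (assumed rule): closing tag present and at least one row."""
    return html is not None and END_MARKER in html and ROW_RE.search(html) is not None


def _safe_fetch(fetch, url):
    """Call the caller-supplied fetch function; treat any error as a missing page."""
    try:
        return fetch(url)
    except Exception:
        return None


def crawl(urls, fetch, workers=4, max_rounds=3):
    """Fetch urls with a thread pool and re-crawl incomplete pages.

    Returns (pages, failed): pages maps url -> html for complete pages,
    failed lists urls still incomplete after max_rounds attempts.
    """
    pages, pending = {}, list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            results = list(pool.map(lambda u: _safe_fetch(fetch, u), pending))
            retry = []
            for url, html in zip(pending, results):
                if is_complete(html):
                    pages[url] = html
                else:
                    retry.append(url)  # filtered out now, crawled again next round
            pending = retry
    return pages, pending


def extract_records(html):
    """Turn each matched table row into one tab-separated text line."""
    return ["\t".join(m.groups()) for m in ROW_RE.finditer(html)]


def append_new(path, records, seen):
    """Append records not written before; returns how many lines were added."""
    fresh = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            fresh.append(rec)
    with open(path, "a", encoding="utf-8") as out:
        out.writelines(rec + "\n" for rec in fresh)
    return len(fresh)
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client, and capping the retries with `max_rounds` is a design choice of this sketch so that a permanently broken page cannot stall the crawl; the abstract itself only states that incomplete pages are crawled again until complete.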