Computer Engineering, Issue (3): 59-62, 81, 5. DOI: 10.3969/j.issn.1000-3428.2014.03.012
Parallel Processing Algorithms for Facts Based on Hadoop Platform
Sun Li 1, He Gang 1, Li Jiyun 1
Author information
- 1. School of Computer Science and Technology, Donghua University, Shanghai 201620
Abstract
Traditional Extract, Transform, Load (ETL) tools process the massive fact data in a data warehouse inefficiently. To address this problem, two algorithms for parallel fact processing are designed and implemented on the Hadoop platform. From the two perspectives of surrogate key lookup for the fact table and aggregation of fact data at different granularities, the paper presents a multi-way parallel lookup algorithm for slowly changing dimensions and an algorithm for aggregating fact data at different granularities. The first algorithm handles slowly changing dimensions and big dimensions together. To avoid running out of memory, it uses the distributed cache to copy small dimensions into the memory of every data node, and it performs the multi-way lookup of dimension keys in the map stage to avoid the network delay caused by data transmission. The second algorithm adds a merge stage after the reduce stage, which helps aggregate fact data at different granularities effectively. Experimental results show that both algorithms are more efficient than the Hive data warehouse at parallel processing of fact data in a data warehouse.
Keywords
MapReduce model/dimension/fact/surrogate key/parallel lookup/aggregation
Classification
Information Technology and Security Science
Cite this article
Sun Li, He Gang, Li Jiyun. Parallel Processing Algorithms for Facts Based on Hadoop Platform [J]. Computer Engineering, 2014, (3): 59-62, 81, 5.
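The two ideas summarized in the abstract can be illustrated with a minimal single-process sketch in plain Python. It is not the authors' Hadoop implementation. The dimension layout (business key, surrogate key, start of validity), the sample data, and every function name below are assumptions made for illustration. In particular, the abstract does not say what the merge stage does, so the roll-up from daily to monthly totals is only one plausible reading. The in-memory indexes stand in for small dimensions shipped to each node through the distributed cache.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Hypothetical slowly changing dimension rows: (business_key, surrogate_key, valid_from).
# A version is assumed valid until the next version of the same business key starts.
CUSTOMER_DIM = [
    ("C1", 101, date(2013, 1, 1)),
    ("C1", 102, date(2013, 7, 1)),  # C1 changed on 2013-07-01
    ("C2", 201, date(2013, 1, 1)),
]
PRODUCT_DIM = [("P1", 11, date(2013, 1, 1))]


def build_index(dim_rows):
    """Build the in-memory table a mapper would load from the cached dimension file."""
    index = defaultdict(list)
    for business_key, surrogate_key, valid_from in sorted(dim_rows, key=lambda r: (r[0], r[2])):
        index[business_key].append((valid_from, surrogate_key))
    return index


def lookup(index, business_key, on_date):
    """Return the surrogate key of the version valid on on_date, or None if there is none."""
    versions = index.get(business_key, [])
    starts = [start for start, _ in versions]
    pos = bisect_right(starts, on_date)
    return versions[pos - 1][1] if pos > 0 else None


def map_fact(fact, indexes):
    """Map step: resolve all dimension keys of one fact row locally, with no shuffle."""
    customer, product, on_date, amount = fact
    return (
        lookup(indexes["customer"], customer, on_date),
        lookup(indexes["product"], product, on_date),
        on_date,
        amount,
    )


def reduce_daily(mapped_facts):
    """Reduce step: sum the measure at the finest granularity (customer key, day)."""
    totals = defaultdict(float)
    for customer_sk, _product_sk, on_date, amount in mapped_facts:
        totals[(customer_sk, on_date)] += amount
    return dict(totals)


def merge_monthly(daily_totals):
    """Merge step (assumed): roll reducer output up to (customer key, year, month)."""
    totals = defaultdict(float)
    for (customer_sk, on_date), amount in daily_totals.items():
        totals[(customer_sk, on_date.year, on_date.month)] += amount
    return dict(totals)


INDEXES = {"customer": build_index(CUSTOMER_DIM), "product": build_index(PRODUCT_DIM)}
FACTS = [
    ("C1", "P1", date(2013, 3, 5), 10.0),
    ("C1", "P1", date(2013, 3, 5), 5.0),
    ("C1", "P1", date(2013, 7, 2), 7.0),
    ("C2", "P1", date(2013, 3, 9), 4.0),
]
MAPPED = [map_fact(f, INDEXES) for f in FACTS]
DAILY = reduce_daily(MAPPED)
MONTHLY = merge_monthly(DAILY)
```

In this sketch the fact dated 2013-07-02 resolves to the newer version of customer C1, while the March facts resolve to the older one. The monthly totals are derived from the daily totals rather than by rescanning the facts, which is the kind of reuse a post-reduce merge stage would allow.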