Acta Electronica Sinica, 2025, Vol. 53, Issue (8): 2764-2778, 15. DOI: 10.12263/DZXB.20250348
Adaptive Data Balanced Partitioner in Spark
Abstract
As an open-source computing engine, Spark is widely used in big data processing and analysis owing to its simplicity, speed, and scalability. By default, Spark partitions data with hash partitioning or range partitioning, which often produces severe imbalances in data volume across partitions when the key distribution is skewed. Many optimization methods have been proposed, such as migration partitioning, greedy partitioning, and feedback partitioning, but they often suffer from heavy data transmission, high extra computation cost, and long running times. To better alleviate the impact of skewed key distributions, this paper proposes an adaptive Spark data balanced partitioning method that introduces the idea of reward and punishment to regulate the data partitioning process; at the same time, keys with large data volumes are properly split so that the amount of data in each partition stays relatively balanced. After the data are sampled and the key weights are estimated, the sample data are sorted in descending order of key weight so that all partitions receive initial data. Then, following the reward-punishment allocation strategy, the allocation probability of each partition is adaptively updated and each key to be allocated is directed to the partition with the highest probability. The adaptive data partitioning scheme is obtained once all sample data have been allocated. During the actual partitioning, data whose keys appear in the sample are allocated according to the adaptive partitioning scheme, while data whose keys do not appear are partitioned by the hash method. The experimental results show that the adaptive data balanced partitioner (ADBP) built on the new partitioning method can effectively alleviate the negative impact of key skew. On real data sets, the total running time of the WordCount program under ADBP is on average 1.51% and 29.90% shorter than with Spark's built-in partitioners, i.e., HashPartitioner and RangePartitioner, and on average 8.12%, 21.64%, and 19.62% shorter than with the existing partitioners learning automata hash partitioner (LAHP), splitting and combination algorithm for skew intermediate data block (SCID), and fine-coarse grained intermediate data placement (FCGIDP), respectively.
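The allocation procedure described above can be made concrete with a short sketch. The Scala code below is an illustrative reading only: the paper's exact probability-update formula, sampling scheme, and the splitting of very heavy keys are not given in this abstract, so the additive reward/punishment update, its constants, and the names `buildScheme` and `AdaptivePartitioner` are all assumptions made for demonstration.

```scala
import org.apache.spark.Partitioner
import scala.collection.mutable

// Illustrative sketch of the reward-punishment allocation outlined in the
// abstract. The update rule and constants below are assumptions; the paper's
// own formulas are not reproduced here, and the splitting of very heavy keys
// is omitted for brevity.
object AdaptiveScheme {
  // Build a key -> partition plan from sampled (key, count) pairs.
  def buildScheme(sampleCounts: Map[String, Long], numPartitions: Int): Map[String, Int] = {
    val total = sampleCounts.values.sum.toDouble
    // Sort keys by estimated weight in descending order so the heaviest keys
    // are placed first and every partition receives initial data.
    val sorted = sampleCounts.toSeq.sortBy { case (_, c) => -c }

    val load = Array.fill(numPartitions)(0.0)                  // weight assigned so far
    val prob = Array.fill(numPartitions)(1.0 / numPartitions)  // allocation probabilities
    val plan = mutable.Map.empty[String, Int]

    for (((key, cnt), i) <- sorted.zipWithIndex) {
      val w = cnt / total
      // The first numPartitions keys seed one partition each; afterwards each
      // key is directed to the partition with the highest current probability.
      val target =
        if (i < numPartitions) i
        else {
          var best = 0
          for (p <- 1 until numPartitions)
            if (prob(p) > prob(best)) best = p
          best
        }
      plan(key) = target
      load(target) += w

      // Reward-punishment update (assumed form): partitions lighter than the
      // average load are rewarded, heavier ones are punished, then the
      // probabilities are renormalized.
      val avg = load.sum / numPartitions
      for (p <- prob.indices)
        prob(p) = math.max(0.0, prob(p) + (if (load(p) <= avg) 0.1 * w else -0.1 * w))
      val s = prob.sum
      if (s > 0) for (p <- prob.indices) prob(p) /= s
    }
    plan.toMap
  }
}

// Partitioner that follows the adaptive scheme for keys seen in the sample
// and falls back to hash partitioning for unseen keys, as in the abstract.
class AdaptivePartitioner(scheme: Map[String, Int], override val numPartitions: Int)
    extends Partitioner {
  def getPartition(key: Any): Int =
    scheme.getOrElse(key.toString,
      java.lang.Math.floorMod(key.hashCode, numPartitions))
}
```

In a job, the plan would typically be built on the driver from a small sample (e.g., `pairs.sample(withReplacement = false, 0.01).countByKey()`) and then applied via `partitionBy(new AdaptivePartitioner(plan, n))`, so the only extra cost is the sampling pass.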
Keywords
data skew / balanced partitioning / adaptive partitioning / reward-punishment allocation / Spark
Classification
Information Technology and Security Science
Citation
何玉林, 吴东彤, 黄哲学. Adaptive Data Balanced Partitioner in Spark [J]. Acta Electronica Sinica, 2025, 53(8): 2764-2778, 15.
Funding
Natural Science Foundation of Guangdong Province (No. 2023A1515011667)
Major Science and Technology Special Project of Shenzhen (No. KJZD20230923114809020)