Acta Electronica Sinica, 2025, Vol. 53, Issue (8): 2764-2778, 15. DOI: 10.12263/DZXB.20250348
Adaptive Data Balanced Partitioner in Spark
Abstract
As an open-source computing engine, Spark is widely used in big data processing and analysis owing to its simplicity, speed, and scalability. By default, Spark partitions data with hash partitioning or range partitioning, which often produces severe imbalances in data volume across partitions when the key distribution is skewed. Many optimization methods have been proposed, such as migration partitioning, greedy partitioning, and feedback partitioning, but they often suffer from heavy data transmission, high extra computation cost, and long running times. To better alleviate the impact of skewed key distributions, this paper proposes an adaptive Spark data balanced partitioning method that introduces the idea of reward and punishment to regulate the data partitioning process; at the same time, keys with large data volumes are properly split so that the amount of data in each partition stays relatively balanced. After the data are sampled and the key weights are estimated, the sample data are sorted in descending order of key weight so that all partitions receive initial data. Then, following the reward-punishment allocation strategy, the allocation probability of each partition is adaptively updated and each key to be allocated is directed to the partition with the highest probability. The adaptive data partitioning scheme is obtained once all sample data have been allocated. During the actual partitioning, data whose keys appear in the sample are allocated according to the adaptive partitioning scheme, while data whose keys do not appear are partitioned by the hash method. The experimental results show that the adaptive data balanced partitioner (ADBP) built on the new partitioning method can effectively alleviate the negative impact of key skew. On real data sets, the total running time of the WordCount program under ADBP is on average 1.51% and 29.90% shorter than with Spark's built-in partitioners, i.e., HashPartitioner and RangePartitioner, and on average 8.12%, 21.64%, and 19.62% shorter than with the existing partitioners learning automata hash partitioner (LAHP), splitting and combination algorithm for skew intermediate data block (SCID), and fine-coarse grained intermediate data placement (FCGIDP), respectively.
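The allocation procedure described above can be made concrete with a short sketch. The Scala code below is an illustrative reading only: the paper's exact probability-update formula, sampling scheme, and the splitting of very heavy keys are not given in this abstract, so the additive reward/punishment update, its constants, and the names `buildScheme` and `AdaptivePartitioner` are all assumptions made for demonstration.

```scala
import org.apache.spark.Partitioner
import scala.collection.mutable

// Illustrative sketch of the reward-punishment allocation outlined in the
// abstract. The update rule and constants below are assumptions; the paper's
// own formulas are not reproduced here, and the splitting of very heavy keys
// is omitted for brevity.
object AdaptiveScheme {
  // Build a key -> partition plan from sampled (key, count) pairs.
  def buildScheme(sampleCounts: Map[String, Long], numPartitions: Int): Map[String, Int] = {
    val total = sampleCounts.values.sum.toDouble
    // Sort keys by estimated weight in descending order so the heaviest keys
    // are placed first and every partition receives initial data.
    val sorted = sampleCounts.toSeq.sortBy { case (_, c) => -c }

    val load = Array.fill(numPartitions)(0.0)                  // weight assigned so far
    val prob = Array.fill(numPartitions)(1.0 / numPartitions)  // allocation probabilities
    val plan = mutable.Map.empty[String, Int]

    for (((key, cnt), i) <- sorted.zipWithIndex) {
      val w = cnt / total
      // The first numPartitions keys seed one partition each; afterwards each
      // key is directed to the partition with the highest current probability.
      val target =
        if (i < numPartitions) i
        else {
          var best = 0
          for (p <- 1 until numPartitions)
            if (prob(p) > prob(best)) best = p
          best
        }
      plan(key) = target
      load(target) += w

      // Reward-punishment update (assumed form): partitions lighter than the
      // average load are rewarded, heavier ones are punished, then the
      // probabilities are renormalized.
      val avg = load.sum / numPartitions
      for (p <- prob.indices)
        prob(p) = math.max(0.0, prob(p) + (if (load(p) <= avg) 0.1 * w else -0.1 * w))
      val s = prob.sum
      if (s > 0) for (p <- prob.indices) prob(p) /= s
    }
    plan.toMap
  }
}

// Partitioner that follows the adaptive scheme for keys seen in the sample
// and falls back to hash partitioning for unseen keys, as in the abstract.
class AdaptivePartitioner(scheme: Map[String, Int], override val numPartitions: Int)
    extends Partitioner {
  def getPartition(key: Any): Int =
    scheme.getOrElse(key.toString,
      java.lang.Math.floorMod(key.hashCode, numPartitions))
}
```

In a job, the plan would typically be built on the driver from a small sample (e.g., `pairs.sample(withReplacement = false, 0.01).countByKey()`) and then applied via `partitionBy(new AdaptivePartitioner(plan, n))`, so the only extra cost is the sampling pass.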
Keywords
data skew / balanced partitioning / adaptive partitioning / reward-punishment allocation / Spark
Classification
Information Technology and Security Science
Citation
何玉林, 吴东彤, 黄哲学. Adaptive Data Balanced Partitioner in Spark [J]. Acta Electronica Sinica, 2025, 53(8): 2764-2778, 15.
Funding
Natural Science Foundation of Guangdong Province (No. 2023A1515011667)
Major Science and Technology Special Project of Shenzhen (No. KJZD20230923114809020)