A Method to Quickly Determine the Range of Configuration Parameter Values for Spark Applications

Li Rui (李瑞), Li Lele (李乐乐), Yu Zhibin (喻之斌)

Journal of Integration Technology (集成技术), 2025, Vol. 14, Issue 4: 87-105 (19 pages). DOI: 10.12146/j.issn.2095-3135.20241129002

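Li Rui 1, Li Lele 2, Yu Zhibin 3

Author information

  • 1. Southern University of Science and Technology, Shenzhen 518055 || Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055
  • 2. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055
  • 3. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055 || University of Chinese Academy of Sciences, Beijing 100049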

Abstract

With the increasing popularity of the big data processing framework Apache Spark, ensuring its safe and stable utilization while reducing overhead has become a widely discussed topic in the industry. A critical factor influencing Spark's performance is its configuration parameters. Improper parameter settings can lead to significant performance degradation or even cause large-scale system failures, resulting in substantial financial losses for users. The key challenge lies in determining the valid range of Spark configuration parameters, which varies depending on workloads, cluster resources, and input data. Furthermore, there are complex interdependencies among parameters. For instance, memory-related parameter ranges depend on the available cluster memory, while memory settings also impact Shuffle performance, indirectly influencing the range of Shuffle-related parameters. Therefore, identifying the valid range of Spark configuration parameters is highly challenging. To tackle this challenge, this study proposes a method to efficiently determine Spark configuration parameter ranges across different application scenarios. The goal is to enhance the security and stability of Spark applications while indirectly reducing time and cost overhead. Using mathematical modeling, we improve traditional parameter range determination methods in two key aspects. First, in terms of search speed, we employ a dynamic probing method that expands and contracts the search interval to determine an initial range, followed by Fibonacci search, which has a fast convergence rate, to further refine the boundaries. Second, for search conditions, our method only requires setting the initial search point to Spark's default parameter values, making it adaptable to various scenarios. Based on these two enhancements, we introduce Composite Search, a practical approach for searching Spark configuration parameter ranges. Without requiring prior knowledge of parameter values, Composite Search effectively determines parameter ranges under different workloads and cluster conditions, significantly improving speed and robustness compared to traditional methods. To evaluate the effectiveness of Composite Search, we conducted experiments on a four-node x86 cluster using all 103 TPC-DS Spark queries. The results show that, compared to traditional methods for determining parameter ranges in software systems, Composite Search achieves speedups of 5.5× and 4.9× in program-level and parameter-level searches, respectively. Additionally, the parameter ranges identified by Composite Search increased the average program success rate from 46.5% to 81.7%. When integrated with existing experiment-driven and machine learning-based tuning methods, Composite Search reduces overall tuning time by an average of 30%.
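
The two-stage idea described in the abstract (probe outward from the default value to bracket a boundary, then refine the bracket with Fibonacci-sized steps) can be illustrated with a minimal Python sketch. This is not the authors' Composite Search implementation: the function names, the `is_valid` oracle (standing in for "run the Spark application with this value and check that it succeeds"), the step-doubling rule, the integer-valued parameter, and the assumption that validity is monotone (every value up to the boundary works, every value above it fails) are all assumptions made here for illustration. Only the upper boundary and only the expanding half of the probing step are sketched.

```python
from typing import Callable, Optional, Tuple


def probe_upper_bracket(
    is_valid: Callable[[int], bool], start: int, hard_max: int, step: int = 1
) -> Tuple[int, Optional[int]]:
    """Expand upward from a valid start, doubling the step after each success.

    Returns (last_valid, first_invalid). first_invalid is None when every
    probed value up to hard_max was valid. All names here are hypothetical.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    if not is_valid(start):
        raise ValueError("the starting value must itself be valid")
    lo = start
    while lo < hard_max:
        hi = min(lo + step, hard_max)
        if not is_valid(hi):
            return lo, hi  # boundary is bracketed between lo and hi
        lo, step = hi, step * 2  # success: move up and widen the step
    return lo, None


def fibonacci_refine(is_valid: Callable[[int], bool], lo: int, hi: int) -> int:
    """Largest valid value in [lo, hi), given is_valid(lo) and not is_valid(hi).

    The bracket is split at Fibonacci-sized offsets instead of at the midpoint.
    Assumes validity is monotone inside the bracket (an assumption of this sketch).
    """
    fibs = [1, 1]
    while fibs[-1] < hi - lo:
        fibs.append(fibs[-1] + fibs[-2])
    k = len(fibs) - 1
    while hi - lo > 1:
        # Pick k so that fibs[k - 1] < hi - lo <= fibs[k].
        while k > 1 and fibs[k - 1] >= hi - lo:
            k -= 1
        mid = lo + fibs[k - 1]  # strictly between lo and hi
        if is_valid(mid):
            lo = mid
        else:
            hi = mid
    return lo


def find_upper_bound(
    is_valid: Callable[[int], bool], start: int, hard_max: int, step: int = 1
) -> int:
    """Bracket the upper boundary by probing, then refine it."""
    lo, hi = probe_upper_bracket(is_valid, start, hard_max, step)
    return lo if hi is None else fibonacci_refine(is_valid, lo, hi)


if __name__ == "__main__":
    # Simulated oracle: pretend runs succeed only up to 6000 MB of executor memory.
    calls = []

    def simulated_run(memory_mb: int) -> bool:
        calls.append(memory_mb)
        return memory_mb <= 6000

    # 1024 MB mirrors Spark's default spark.executor.memory of 1g.
    bound = find_upper_bound(simulated_run, start=1024, hard_max=65536, step=256)
    print(bound, "found after", len(calls), "simulated runs")
```

In practice each oracle call is a full application run, so the number of calls is the cost that matters; a real oracle would also need to handle timeouts and non-monotone failures, which this sketch ignores.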

Key words

Spark / configuration parameters / value range / composite search / dependency

Classification

Information Technology and Security Science

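Cite this article

Li Rui, Li Lele, Yu Zhibin. A Method to Quickly Determine the Range of Configuration Parameter Values for Spark Applications [J]. Journal of Integration Technology, 2025, 14(4): 87-105, 19.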

Journal of Integration Technology (集成技术), ISSN 2095-3135
