Concurrent Spark message distributor

Open access; Peking University Core Journal (北大核心)

In the Spark big data computing framework, the driver uses an iterative message distribution mechanism. This mechanism increases the time overhead of task submission, delays the start of task execution, and limits execution concurrency, so that multiple executors sit idle while waiting and computing resources are wasted. To address these issues, we propose an efficient and lightweight concurrent Spark message distributor based on a thread pool scheduling strategy. Unlike the iterative Spark message distributor, the concurrent distributor targets, and is better suited to, fine-grained task jobs with high scheduling overhead. It parses metadata containing key executor information to obtain the task list and the executor identifier for each task, then creates a thread pool and launches an asynchronous computation for each task, thereby distributing tasks concurrently. This minimizes the time overhead of task distribution while preserving system stability and successful task execution. In a simulated cluster built from virtual machines, a comparison with the iterative message distributor confirms the effectiveness of the concurrent distributor. The experimental results show that, with memory held constant, the concurrent Spark message distributor reduces task execution time by about 9% and increases central processing unit (CPU) utilization by about 5%. The concurrent Spark message distributor thus effectively addresses the excessive dispatch overhead and wasted computing resources that the iterative mechanism incurs for fine-grained tasks.
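The contrast between iterative and thread-pool-based dispatch described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any Spark API: the metadata layout, the function names (`parse_metadata`, `send_launch_message`, `dispatch_iteratively`, `dispatch_concurrently`), and the simulated per-message latency are all assumptions made for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parse_metadata(metadata):
    """Extract (task_id, executor_id) pairs from a hypothetical metadata dict."""
    return [(rec["task_id"], rec["executor_id"]) for rec in metadata["tasks"]]


def send_launch_message(task_id, executor_id, latency=0.01):
    """Stand-in for serializing one task and sending it to one executor.

    The sleep simulates the per-message cost; a real driver would perform
    serialization and a network send here.
    """
    time.sleep(latency)
    return (task_id, executor_id)


def dispatch_iteratively(assignments, latency=0.01):
    """Baseline: send launch messages one after another in a single loop."""
    return [send_launch_message(t, e, latency) for t, e in assignments]


def dispatch_concurrently(assignments, max_workers=8, latency=0.01):
    """Submit one asynchronous send per task to a thread pool.

    Results are collected in submission order, so the output matches the
    iterative baseline even though the sends overlap in time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(send_launch_message, t, e, latency)
            for t, e in assignments
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Hypothetical workload: 32 fine-grained tasks spread over 4 executors.
    metadata = {
        "tasks": [
            {"task_id": i, "executor_id": f"exec-{i % 4}"} for i in range(32)
        ]
    }
    assignments = parse_metadata(metadata)

    start = time.perf_counter()
    dispatch_iteratively(assignments)
    print(f"iterative:  {time.perf_counter() - start:.3f} s")

    start = time.perf_counter()
    dispatch_concurrently(assignments)
    print(f"concurrent: {time.perf_counter() - start:.3f} s")
```

In this toy setting the benefit comes only from overlapping the simulated per-message delay; the timings it prints say nothing about the roughly 9% and 5% figures reported above, which were measured on a real Spark deployment.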

He Yulin (何玉林) [1,2]; Lin Zejie (林泽杰) [1,2]; Xu Yuyang (徐毓阳) [1]; Cheng Yingchao (成英超) [1]; Huang Zhexue (黄哲学) [1,2]

[1] Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, Guangdong
[2] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong

Computer Science and Automation

parallel processing; big data computing; Spark communication mechanism; message distribution; fine-grained tasks; thread pool scheduling

Journal of Shenzhen University Science and Engineering (《深圳大学学报(理工版)》), 2025 (3)

pp. 317-325 (9 pages)

Natural Science Foundation of Guangdong Province (2023A1515011667); Science and Technology Major Project of Shenzhen (KJZD20230923114809020); Basic Research Foundation of Shenzhen (JCYJ20210324093609026); Guangdong Basic and Applied Basic Research Foundation, Guangdong-Shenzhen Joint Fund Key Project (2023B1515120020)

10.3724/SP.J.1249.2025.03317
