
Integrated Data Analysis and Processing Platform Based on Spark and MPI

周梦兵 1, 李秋彦 2, 吴欧 3, 王洋 4

集成技术 (Journal of Integration Technology), 2025, Vol. 14, Issue 4: 106-119 (14 pages). DOI: 10.12146/j.issn.2095-3135.20241203002

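Author information

  • 1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055; University of Chinese Academy of Sciences, Beijing 100049
  • 2. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055
  • 3. Software Institute, Nanjing University, Nanjing 210093
  • 4. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055; University of Chinese Academy of Sciences, Beijing 100049; Shenzhen University of Advanced Technology, Shenzhen 518107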

Abstract

Currently, AI application workloads, represented by machine learning, exhibit a dual-density characteristic: they are both compute-intensive and data-intensive. These applications require support for the storage, transmission, and fault tolerance of massive data, and they also need the performance of complex logical computations to be optimized. A traditional big data framework or high-performance computing framework used on its own can no longer meet these demands. This paper proposes a hybrid, high-performance big data processing platform based on Spark and MPI. Built on a typical large-scale cluster, the platform targets the storage and computing characteristics of dual-density applications such as machine learning and comprises three key modules: dual-paradigm hybrid computation, heterogeneous storage, and integrated high-performance communication. For computation, a module combining the Spark and MPI paradigms is designed: tasks are split and classified, and compute-intensive tasks are offloaded to the MPI computation module, which enhances the dual-paradigm hybrid computation capability. For storage, a heterogeneous storage structure and a data-metadata separation strategy are designed to match the characteristics of the different types of data that arise during computation; storing data by class yields a high-performance storage system. For communication, the paper proposes an approach that integrates high-performance communication techniques to support the computing and storage modules. Test results show that the platform provides efficient dual-paradigm hybrid computation for dual-density applications, achieving performance improvements of 4.2% to 17.3% over a standalone Spark big data platform across various computation tasks.
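
The task splitting and classification step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `Task` fields, the intensity metric (estimated FLOPs per byte moved), and the `INTENSITY_THRESHOLD` value are all hypothetical, chosen only to illustrate how tasks might be routed to a Spark-style or MPI-style backend.

```python
from dataclasses import dataclass

# Hypothetical cutoff: FLOPs per byte at or above which a task is treated
# as compute-intensive. The paper does not state such a value.
INTENSITY_THRESHOLD = 10.0


@dataclass
class Task:
    name: str
    flops: float        # estimated floating-point operations (assumed metric)
    bytes_moved: float  # estimated bytes read and written (assumed metric)


def classify(task: Task) -> str:
    """Return 'mpi' for compute-intensive tasks and 'spark' otherwise."""
    # Guard against division by zero for tasks that move no data.
    intensity = task.flops / max(task.bytes_moved, 1.0)
    return "mpi" if intensity >= INTENSITY_THRESHOLD else "spark"


def split_by_backend(tasks):
    """Group task names by the backend each task would be offloaded to."""
    plan = {"spark": [], "mpi": []}
    for task in tasks:
        plan[classify(task)].append(task.name)
    return plan


tasks = [
    Task("load_and_filter", flops=1e6, bytes_moved=1e9),   # data-intensive
    Task("matrix_multiply", flops=1e12, bytes_moved=1e9),  # compute-intensive
]
print(split_by_backend(tasks))
```

In this sketch the routing decision is a single threshold on an estimated intensity; a real platform would presumably rely on richer profiling, which the abstract does not detail.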

Key words

dual-intensive applications / big data processing / high performance computing / hybrid paradigm computing

Classification

Information Technology and Security Science

Cite this article

周梦兵, 李秋彦, 吴欧, 王洋. Integrated Data Analysis and Processing Platform Based on Spark and MPI [J]. 集成技术 (Journal of Integration Technology), 2025, 14(4): 106-119.

Funding

This work is supported by the 3rd Xinjiang Scientific Expedition Program (2021XJKK1300), the Industrial Application Research Project of Shenzhen for "Cloud Computing Architecture and Platform for Human-Machine-Thing Integration" (CJGJZD20230724093659004), and the Shenzhen Science and Technology Plan Project (SGDX20220530111001003).

集成技术 (Journal of Integration Technology), ISSN 2095-3135
