Inference Optimization for Large Models Based on Adaptive Tensor Swapping and Recomputation

梁绪宁 1, 王思琪 1, 杨海龙 1, 栾钟治 1, 刘轶 1, 钱德沛 1

Computer Engineering, 2025, Vol. 51, Issue (10): 27-36, 10. DOI: 10.19678/j.issn.1000-3428.0070644

Author Information

  • 1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Abstract

Large Language Models (LLM) have demonstrated outstanding performance in natural language processing tasks. However, their extremely large parameter scales pose a significant challenge because the limited capacity of GPU memory becomes a performance bottleneck for inference tasks. To address this issue in the context of LLM inference services, this study proposes AdaptiveLLM, which enables the adaptive selection of offloading strategies between tensor swapping and tensor recomputation based on the characteristics of inference task workloads. To evaluate the characteristics of inference task workloads, AdaptiveLLM establishes a black-box Machine Learning (ML) model through an operator-level computational complexity analysis to predict the overhead of tensor recomputation. It also predicts the overhead of tensor swapping by conducting a fine-grained analysis of KV Cache memory usage. For the adaptive selection of offloading strategies, AdaptiveLLM designs a cost-aware memory optimization strategy specifically for the preemption scheduling phase: when GPU memory is insufficient, it opts for the offloading approach with the lower overhead. For the initiation scheduling phase, it devises a fairness-based user-request scheduling strategy: when GPU memory is available, it schedules more user requests in accordance with the principle of fairness. Experimental results indicate that, compared with currently widely used LLM inference benchmark frameworks, AdaptiveLLM achieves an overall increase in throughput while reducing the average weighted turnaround time, thereby realizing fair scheduling.
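To make the cost-aware decision concrete, the following is a minimal Python sketch of the swap-versus-recompute choice described in the abstract. It is not the paper's implementation: the class `RequestState`, the functions `predict_swap_cost`, `predict_recompute_cost`, and `choose_offloading_strategy`, the bandwidth constant, and the toy latency model are all hypothetical stand-ins for AdaptiveLLM's fine-grained KV Cache analysis and black-box ML predictor.

```python
from dataclasses import dataclass

# Assumed constant, not from the paper: effective host<->GPU transfer bandwidth in GB/s.
PCIE_BANDWIDTH_GBPS = 16.0


@dataclass
class RequestState:
    """Illustrative per-request state used for the offloading decision."""
    num_layers: int           # transformer layers whose KV Cache is resident
    kv_bytes_per_token: int   # KV Cache bytes per token, summed over all layers
    num_cached_tokens: int    # tokens whose KV Cache currently lives on the GPU


def predict_swap_cost(req: RequestState) -> float:
    """Estimated seconds to move this request's KV Cache to host memory.

    Modeled here simply as KV Cache size divided by transfer bandwidth; the
    paper's fine-grained analysis of KV Cache usage may include more factors.
    """
    kv_cache_bytes = req.kv_bytes_per_token * req.num_cached_tokens
    return kv_cache_bytes / (PCIE_BANDWIDTH_GBPS * 1e9)


def predict_recompute_cost(req: RequestState, latency_model) -> float:
    """Estimated seconds to rebuild the KV Cache by re-running prefill later.

    `latency_model` stands in for the black-box ML predictor trained on
    operator-level complexity features: any callable mapping
    (num_cached_tokens, num_layers) -> seconds.
    """
    return latency_model(req.num_cached_tokens, req.num_layers)


def choose_offloading_strategy(req: RequestState, latency_model) -> str:
    """Under memory pressure, pick whichever way of freeing GPU memory is cheaper."""
    if predict_swap_cost(req) <= predict_recompute_cost(req, latency_model):
        return "swap"
    return "recompute"


if __name__ == "__main__":
    # Toy linear latency model standing in for the learned recomputation predictor.
    toy_model = lambda tokens, layers: 2e-7 * tokens * layers
    req = RequestState(num_layers=32, kv_bytes_per_token=512 * 1024,
                       num_cached_tokens=2048)
    print(choose_offloading_strategy(req, toy_model))  # prints "swap" or "recompute"
```

In this sketch the decision reduces to comparing two predicted per-request costs at preemption time, which mirrors the cost-aware memory optimization strategy the abstract describes for the preemption scheduling phase.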

Key words

Large Language Models (LLM) / inference / tensor swapping / tensor recomputation / throughput / fairness

Classification

Computer and Automation

Cite This Article

梁绪宁, 王思琪, 杨海龙, 栾钟治, 刘轶, 钱德沛. Inference Optimization for Large Models Based on Adaptive Tensor Swapping and Recomputation[J]. Computer Engineering, 2025, 51(10): 27-36, 10.

Funding

National Key Research and Development Program of China (2023YFB3001801)

National Natural Science Foundation of China (62322201, 62072018, U23B2020)

Fundamental Research Funds for the Central Universities (YWF-23-L-1121, JKF-20240198)

National Key Laboratory of Complex Software Project (SKLSDE-2023ZX-05)

