Big Data Research (大数据) ›› 2024, Vol. 10 ›› Issue (4): 172-188. doi: 10.11959/j.issn.2096-0271.2024053

• Column: Information Technology Application Innovation: Systems and Software •

• About the authors:
  HUANG Weixin (2000- ), male, is a master's student at the School of Computer Science and Technology, Huazhong University of Science and Technology. His research focuses on the optimization of deep learning systems.
  HU Weifang (1995- ), male, is a Ph.D. candidate at the School of Computer Science and Technology, Huazhong University of Science and Technology. His research focuses on distributed deep learning system platforms.
  CAO Xuejiao (1998- ), female, is a master's student at the School of Computer Science and Technology, Huazhong University of Science and Technology. Her research focuses on model selection under device-edge collaboration.
  SHI Xuanhua (1978- ), male, Ph.D., is a professor at the School of Computer Science and Technology, Huazhong University of Science and Technology. His research interests include parallel and distributed computing, cloud computing, and big data processing.

LSTM training system based on heterogeneous hardware

Weixin HUANG1,2, Weifang HU1,2, Xuejiao CAO1,2, Xuanhua SHI1,2   

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
    2. National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Huazhong University of Science and Technology, Wuhan 430074, China
  • Online:2024-07-01 Published:2024-07-01
  • Supported by:
    National Science and Technology Major Project for New Generation Artificial Intelligence (2020AAA0108501); Major Program (JD) of Hubei Province (2023BAA024)


Abstract:

In the era of big data, deep neural network models, represented by LSTM, are capable of processing massive data and perform excellently in fields such as language processing, speech recognition, and time series prediction. However, as model complexity grows, the training cost rises significantly. Existing LSTM training systems employ acceleration techniques such as operator fusion and multi-stream execution, but they neglect the parallelism available inside a single training operator, which leads to low utilization of computing resources and long training times. This paper therefore designs TurboLSTM, a training acceleration system based on a fine-grained model partitioning method and a multi-stream parallel scheduling strategy. New underlying training operators, built for two kinds of heterogeneous hardware (NVIDIA GPUs and the domestic Ascend NPU), allow tasks to make sound use of computing resources. Compared with existing training systems, TurboLSTM shortens single-operator training time by about 23% and overall model training time by about 17% on the NVIDIA GPU, and shortens single-operator training time by about 15% on the Ascend NPU, with a significantly higher utilization of computing resources. This shows that the proposed acceleration method is efficient and generalizes well.
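The intra-operator parallelism the abstract refers to can be illustrated with a plain NumPy sketch of one LSTM cell step (a hypothetical illustration for the reader, not the paper's actual GPU/NPU implementation): the four gate projections share no data dependencies, so a system like the one described could in principle dispatch each of them to its own hardware stream before the dependent element-wise tail.

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One LSTM time step.

    The four gate projections (input, forget, candidate, output) read only
    x, h, and their own weights, so they are mutually independent: these are
    the computations a multi-stream scheduler could run concurrently. Only
    the element-wise tail below needs all four results.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Independent gate computations -- candidates for separate streams.
    i = sigmoid(x @ W[0] + h @ U[0] + b[0])   # input gate
    f = sigmoid(x @ W[1] + h @ U[1] + b[1])   # forget gate
    g = np.tanh(x @ W[2] + h @ U[2] + b[2])   # cell candidate
    o = sigmoid(x @ W[3] + h @ U[3] + b[3])   # output gate

    # Dependent tail: must wait for all four gates.
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

On real hardware the same structure maps onto, for example, four CUDA streams issuing the gate matmuls concurrently, which is the kind of resource utilization gap the paper targets.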

Key words: LSTM, training acceleration, fine-grained parallelism, multi-stream scheduling

