大数据 ›› 2021, Vol. 7 ›› Issue (5): 150-163. doi: 10.11959/j.issn.2096-0271.2021054

• Research •

Method of accelerating deep learning with optimized distributed cache in containers

Kai ZHANG, Yang CHE

  1. Alibaba Technology (Beijing) Co., Ltd., Beijing 100102, China
  • Online: 2021-09-15  Published: 2021-09-01
  • About the authors: ZHANG Kai (1981- ), male, senior technical expert at Alibaba Technology (Beijing) Co., Ltd.; his main research interests include cloud computing, containers, deep learning, and distributed systems.
    CHE Yang (1982- ), male, senior technical expert at Alibaba Technology (Beijing) Co., Ltd.; his main research interests include cloud computing, containers, distributed caching, and machine learning systems.

Abstract:

When GPUs are used to train deep learning models on large-scale datasets, the data loading and preprocessing stages often limit overall performance, and a large share of GPU computing resources is wasted waiting for data to be read from remote storage. Firstly, methods of accelerating deep learning training with containers and a distributed cache were introduced, together with the system architecture and initial optimizations implemented with Alluxio and Kubernetes. Secondly, task and data co-located scheduling (TDCS), a co-scheduling policy in which training tasks and cached data are aware of each other, was elaborated. Thirdly, TDCS was implemented in a Kubernetes container cluster, which makes distributed-cache acceleration scale to large deep learning training jobs. Finally, the result of training a ResNet50 image classification model on 128 NVIDIA V100 GPU devices demonstrates that the proposed methods bring a 2 to 3 times speedup compared with reading data directly from remote storage.
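
As a concrete illustration of the co-located scheduling idea, the sketch below scores each GPU node by how much of a training task's dataset its cache worker already holds, and places the task on the best-scoring node with enough free GPUs. It is a minimal, self-contained Python sketch of the intuition only, not the paper's TDCS implementation; the node and block names, the bookkeeping of cached blocks, and the scoring rule are assumptions made for the example.

from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Node:
    """A GPU node that also hosts a distributed-cache worker (e.g. an Alluxio worker)."""
    name: str
    free_gpus: int
    cached_blocks: Set[str] = field(default_factory=set)  # dataset blocks already cached on this node

@dataclass
class TrainingTask:
    """A training job and the dataset blocks it will read."""
    name: str
    gpus: int
    dataset_blocks: Set[str]

def cache_affinity_score(task: TrainingTask, node: Node) -> float:
    """Fraction of the task's dataset that is already cached on the node."""
    if not task.dataset_blocks:
        return 0.0
    return len(task.dataset_blocks & node.cached_blocks) / len(task.dataset_blocks)

def schedule(task: TrainingTask, nodes: List[Node]) -> Optional[Node]:
    """Among nodes with enough free GPUs, prefer the one holding
    the largest share of the task's cached data (co-located scheduling)."""
    candidates = [n for n in nodes if n.free_gpus >= task.gpus]
    if not candidates:
        return None  # keep the task pending rather than placing it without GPU capacity
    return max(candidates, key=lambda n: cache_affinity_score(task, n))

if __name__ == "__main__":
    nodes = [
        Node("gpu-node-a", free_gpus=8, cached_blocks={"block-1", "block-2", "block-3"}),
        Node("gpu-node-b", free_gpus=8, cached_blocks={"block-7"}),
    ]
    task = TrainingTask("resnet50-train", gpus=4, dataset_blocks={"block-1", "block-2", "block-9"})
    chosen = schedule(task, nodes)
    print(chosen.name if chosen else "pending")  # prints: gpu-node-a

In the setting described above, an analogous preference would be expressed to the Kubernetes scheduler (for example, via node affinity) so that training pods land on nodes whose Alluxio cache workers already hold the dataset.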

Key words: deep learning, distributed cache, co-located scheduling, Alluxio, container

