Posted on Sun, May 30, 2021 · Paper Review · ML/DL Framework

Notes on DeepSpeed (ZeRO, ZeRO-2, Megatron-LM)

DeepSpeed Inference @ 2021.05.24

Speeding up inference for huge models

Performance comparison (figure)


DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce inference latency. To further reduce latency and cost, it introduces inference-customized kernels.
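Model parallelism here means sharding each weight matrix across GPUs so every device multiplies only its slice. A minimal pure-Python sketch of the column-parallel idea (toy lists standing in for GPU shards; my illustration, not DeepSpeed's actual kernels):

```python
def matmul(x, w):
    """Naive matrix multiply on lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def column_parallel_matmul(x, w_shards):
    """Each 'device' holds a column slice of W and computes its slice
    of Y = X @ W; concatenating the slices recovers the full output."""
    partials = [matmul(x, shard) for shard in w_shards]  # one per device
    # the "all-gather": stitch the column slices back together row by row
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w_full = [[1, 0, 2], [0, 1, 3]]
w_shards = [[[1, 0], [0, 1]], [[2], [3]]]  # W split by columns
assert column_parallel_matmul(x, w_shards) == matmul(x, w_full)
```

With more devices, each shard (and each partial output) shrinks proportionally, which is how MP fits a model that no single GPU can hold.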


Compressed training with Progressive Layer Dropping

1-bit LAMB

The config looks like this (a DeepSpeed config; micro batch 64 × 1024 GPUs = global batch 65536):

  {
    "train_batch_size": 65536,
    "train_micro_batch_size_per_gpu": 64,
    "optimizer": {
      "type": "OneBitLamb",
      "params": {
        "lr": 11e-3,
        "max_coeff": 0.3,
        "min_coeff": 0.01,
        "freeze_step": 1000,
        "cuda_aware": false,
        "comm_backend_name": "nccl",
        "coeff_beta": 0.9,
        "factor_max": 4.0,
        "factor_min": 0.5,
        "factor_threshold": 0.1
      }
    },
    "gradient_clipping": 1.0,
    "fp16": {
      "enabled": true,
      "loss_scale": 0,
      "initial_scale_power": 16
    }
  }
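The core trick behind 1-bit LAMB (as in 1-bit Adam) is error-compensated 1-bit communication: each worker transmits only the sign of its update plus a single scale, and feeds the quantization error back into the next round. A rough pure-Python sketch of that idea (my illustration, not DeepSpeed's implementation):

```python
def onebit_compress(update, error):
    """Error-compensated 1-bit quantization: transmit signs + one scale;
    whatever was lost is carried into the next step via `error`."""
    compensated = [u + e for u, e in zip(update, error)]
    scale = sum(abs(c) for c in compensated) / len(compensated)
    signs = [1.0 if c >= 0 else -1.0 for c in compensated]
    # the receiver reconstructs scale * sign; the residual becomes feedback
    new_error = [c - scale * s for c, s in zip(compensated, signs)]
    return signs, scale, new_error

signs, scale, err = onebit_compress([1.0, -2.0, 3.0], [0.0, 0.0, 0.0])
# reconstruction plus carried error exactly equals the compensated update
assert [scale * s + e for s, e in zip(signs, err)] == [1.0, -2.0, 3.0]
```

My understanding is that `freeze_step` in the config above controls how many warmup steps run with uncompressed communication before the 1-bit stage kicks in.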

DeepSpeed, ZeRO-Infinity @ 2021.04.19

ZeRO alone scales from a single GPU to thousands of GPUs, even pulling NVMe storage into training. *The software is integrated into ZeRO-3.
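If I read the DeepSpeed config docs right, turning this on is mostly a config change: ZeRO stage 3 plus NVMe offload targets. A hedged sketch (the `nvme_path` is a placeholder; key names per my reading of the docs, so double-check against the official reference):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_param":     { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
```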



Honestly, this figure alone is probably all you need to see.


The NVIDIA DGX-2 node consists of 16 V100-32GB GPUs along with 1.5 TB of CPU memory and 20 TB of usable NVMe storage.
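A back-of-envelope check on why the NVMe matters: the ZeRO papers count roughly 16 bytes of model state per parameter for mixed-precision Adam-style training, so a 1-trillion-parameter model needs about 16 TB — far beyond the node's 512 GB of aggregate GPU memory, but within its 20 TB of usable NVMe:

```python
# Bytes of model state per parameter in mixed-precision Adam-style
# training, as counted in the ZeRO papers:
# fp16 params (2) + fp16 grads (2) + fp32 master params (4)
# + fp32 momentum (4) + fp32 variance (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

params = 1_000_000_000_000            # a 1-trillion-parameter model
model_state_tb = params * BYTES_PER_PARAM / 1e12
gpu_memory_tb = 16 * 32 / 1000        # 16 V100-32GB GPUs on one DGX-2

print(model_state_tb)                 # 16.0 TB of model states
assert model_state_tb > gpu_memory_tb # cannot fit in GPU memory alone
assert model_state_tb < 20            # fits in the node's usable NVMe
```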