Dell PowerEdge C4140 Deep Learning Performance Comparison - Scale-up vs. Scale-out



Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies
Dell EMC | Infrastructure Solutions Group

8 Conclusion and Future Work
The PowerEdge C4140 with the Nvidia 4x NVLink architecture scales relatively well when
using the Uber Horovod distributed training library and Mellanox InfiniBand RDMA as the
high-speed link between nodes.

Table 5 shows that the PowerEdge C4140 in a multi-node configuration is within 7.8% of a
single-node non-Dell EMC 8x NVLink system on ResNet-50, the most widely used model. The
C4140-M in a multi-node configuration, however, outperforms the single-node 8x NVLink
system by at least 18% on ResNet-50. The one caveat is that the C4140-M results were
obtained with the latest versions of the NCCL and TensorFlow containers.

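The relative figures quoted from Table 5 are percentage differences against the single-node 8x NVLink baseline. The sketch below shows one plausible way to compute such a figure. It assumes the comparison is based on training throughput in images per second; the function name and the throughput values are hypothetical placeholders chosen only to reproduce the quoted percentages, not measurements from Table 5.

```python
def relative_difference(candidate: float, baseline: float) -> float:
    """Return the signed percentage by which candidate differs from baseline.

    Negative means the candidate is slower than the baseline,
    positive means it is faster.
    """
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate - baseline) / baseline * 100.0


# Placeholder throughputs (images/sec). These are NOT values from Table 5;
# they are picked so that the arithmetic yields the quoted 7.8% and 18%.
baseline_8x_nvlink = 1000.0
multi_node_c4140 = 922.0
multi_node_c4140_m = 1180.0

gap = relative_difference(multi_node_c4140, baseline_8x_nvlink)
gain = relative_difference(multi_node_c4140_m, baseline_8x_nvlink)

print(f"Multi-node C4140 vs baseline:   {gap:+.1f}%")
print(f"Multi-node C4140-M vs baseline: {gain:+.1f}%")
```

Dividing by the baseline rather than the candidate keeps both comparisons on the same reference system, so the two percentages can be read side by side.
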
Performance improvements are being added continuously at the GPU level, the library
level, and the framework level. We are continuously looking at how to improve our
performance results by experimenting with different hyperparameters.

Some of our future work in this area will explore the latest software optimizations
released by Nvidia and look at the fast.ai library, with which Jeremy Howard and
researchers at fast.ai achieved a training time of 3 hours on 8x V100 with ResNet-50.