Deep Learning Performance: Scale-up vs Scale-out

Figure 39: Long training tests to extract accuracy convergence and training time with PowerEdge C4140-K multi-node and single-node 8x V100-SXM2 with different models
Figure 39 above shows a comparison between the single-node 8x V100-SXM2 server and the PowerEdge C4140 Configuration-K in a multi-node configuration using ResNet-50. The training time difference between the 8x SXM2 server and the multi-node PowerEdge C4140 is within 7%, which shows that using Mellanox InfiniBand RDMA allows the PowerEdge C4140 to achieve performance similar to that of a scale-up server.
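
As a rough illustration of how this kind of multi-node, data-parallel training is typically set up, the sketch below uses Horovod with TensorFlow/Keras to train ResNet-50 across the GPUs of several nodes; Horovod's allreduce can run over an RDMA-capable NCCL/MPI stack on Mellanox InfiniBand. The learning-rate scaling, synthetic data, and batch size are illustrative assumptions, not the exact benchmark configuration behind the results above.

```python
# Minimal sketch (assumptions: Horovod + TensorFlow/Keras, one process per GPU,
# synthetic data in place of ImageNet). Horovod's allreduce runs over NCCL/MPI,
# which can use Mellanox InfiniBand RDMA when the cluster is configured for it.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single local GPU (e.g. 4x V100-SXM2 per node).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Common practice: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)
# DistributedOptimizer averages gradients across all workers each step.
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [
    # Start all workers from identical weights broadcast by rank 0.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Synthetic stand-in for the real ImageNet input pipeline (illustration only).
images = tf.random.uniform([64, 224, 224, 3])
labels = tf.random.uniform([64], maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(32)

model.fit(dataset, steps_per_epoch=100, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```

A two-node setup with four GPUs per node would typically be launched with something like `horovodrun -np 8 -H node1:4,node2:4 python train_resnet50.py` (hypothetical hostnames and script name).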
To show the impact of the CPU on training deep learning workloads to accuracy convergence, we ran additional tests configuring the multi-node system with PowerEdge C4140-V100-SXM2 Configuration-M servers and the Intel Xeon 6148 CPU. In Figure 40 we see that the multi-node C4140-V100-SXM2 Configuration-M system with the Intel Xeon 6148 CPU performs 1.3X faster than SN-8xV100 for the ResNet-50 model trained at several batch sizes. Again, this shows the relationship between the CPU model and deep learning performance: most of the data loading, data preprocessing, and batch transformation tasks run on the CPU, whereas the training tasks run on the GPU.
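
To make that CPU/GPU split concrete, the sketch below shows a tf.data input pipeline in which decoding, augmentation, and batching run on CPU threads and are prefetched so they overlap the GPU training step. The TFRecord feature names, image size, and file path are assumptions for illustration, not taken from the benchmark setup described here.

```python
# Sketch of a CPU-side input pipeline (assumed TFRecord layout and paths).
# A slower CPU shows up as an input bottleneck that starves the GPUs.
import tensorflow as tf

def parse_and_augment(serialized_example):
    # Assumed record layout: JPEG bytes plus an integer class label.
    features = tf.io.parse_single_example(
        serialized_example,
        {
            "image/encoded": tf.io.FixedLenFeature([], tf.string),
            "image/class/label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)   # CPU-side augmentation
    image = tf.cast(image, tf.float32) / 255.0
    return image, features["image/class/label"]

def build_input_pipeline(file_pattern, batch_size):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    dataset = files.interleave(
        tf.data.TFRecordDataset,
        num_parallel_calls=tf.data.AUTOTUNE,          # parallel reads on CPU threads
    )
    dataset = dataset.shuffle(10_000)
    dataset = dataset.map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    # Prefetch overlaps CPU preprocessing of the next batch with the current
    # GPU training step, which is where the CPU model affects throughput.
    return dataset.prefetch(tf.data.AUTOTUNE)

# Hypothetical usage: feed this dataset to model.fit running on the GPUs.
# train_ds = build_input_pipeline("/data/imagenet/train-*", batch_size=256)
```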