Deep Learning Performance: Scale-up vs Scale-out

Figure 39: Long training tests to extract accuracy convergence and training time with PowerEdge C4140-K multi-node and single-node 8x V100-SXM2 with different models
Figure 39 above shows a comparison between the single-node 8x V100-SXM2 server and the PowerEdge C4140 Configuration-K in a multi-node configuration using ResNet-50. The training time difference between the 8x SXM2 server and the multi-node PowerEdge C4140 is within 7%, which shows that using Mellanox InfiniBand RDMA allows the PowerEdge C4140 to achieve performance similar to that of a scale-up server.
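
As a rough illustration of how this kind of multi-node, data-parallel training is typically set up, the sketch below uses Horovod with TensorFlow/Keras to train ResNet-50 across the GPUs of several nodes; Horovod's allreduce can run over an RDMA-capable NCCL/MPI stack on Mellanox InfiniBand. The learning-rate scaling, synthetic data, and batch size are illustrative assumptions, not the exact benchmark configuration behind the results above.

```python
# Minimal sketch (assumptions: Horovod + TensorFlow/Keras, one process per GPU,
# synthetic data in place of ImageNet). Horovod's allreduce runs over NCCL/MPI,
# which can use Mellanox InfiniBand RDMA when the cluster is configured for it.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single local GPU (e.g. 4x V100-SXM2 per node).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Common practice: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)
# DistributedOptimizer averages gradients across all workers each step.
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [
    # Start all workers from identical weights broadcast by rank 0.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Synthetic stand-in for the real ImageNet input pipeline (illustration only).
images = tf.random.uniform([64, 224, 224, 3])
labels = tf.random.uniform([64], maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(32)

model.fit(dataset, steps_per_epoch=100, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```

A two-node setup with four GPUs per node would typically be launched with something like `horovodrun -np 8 -H node1:4,node2:4 python train_resnet50.py` (hypothetical hostnames and script name).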
To show the impact of the CPU on training deep learning workloads to accuracy convergence, we ran additional tests configuring the multi-node system with PowerEdge C4140-V100-SXM2 Configuration-M servers and the Intel Xeon 6148 CPU. In Figure 40 we see that the multi-node C4140-V100-SXM2 Configuration-M system with the Intel Xeon 6148 CPU performs 1.3X faster than SN-8xV100 for the ResNet-50 model trained at several batch sizes. Again, this shows the relationship between the CPU model and deep learning performance: most of the data loading, data preprocessing, and batch transformation tasks run on the CPU, whereas the training tasks run on the GPU.
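
To make that CPU/GPU split concrete, the sketch below shows a tf.data input pipeline in which decoding, augmentation, and batching run on CPU threads and are prefetched so they overlap the GPU training step. The TFRecord feature names, image size, and file path are assumptions for illustration, not taken from the benchmark setup described here.

```python
# Sketch of a CPU-side input pipeline (assumed TFRecord layout and paths).
# A slower CPU shows up as an input bottleneck that starves the GPUs.
import tensorflow as tf

def parse_and_augment(serialized_example):
    # Assumed record layout: JPEG bytes plus an integer class label.
    features = tf.io.parse_single_example(
        serialized_example,
        {
            "image/encoded": tf.io.FixedLenFeature([], tf.string),
            "image/class/label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)   # CPU-side augmentation
    image = tf.cast(image, tf.float32) / 255.0
    return image, features["image/class/label"]

def build_input_pipeline(file_pattern, batch_size):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    dataset = files.interleave(
        tf.data.TFRecordDataset,
        num_parallel_calls=tf.data.AUTOTUNE,          # parallel reads on CPU threads
    )
    dataset = dataset.shuffle(10_000)
    dataset = dataset.map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    # Prefetch overlaps CPU preprocessing of the next batch with the current
    # GPU training step, which is where the CPU model affects throughput.
    return dataset.prefetch(tf.data.AUTOTUNE)

# Hypothetical usage: feed this dataset to model.fit running on the GPUs.
# train_ds = build_input_pipeline("/data/imagenet/train-*", batch_size=256)
```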