Figure 20: PowerEdge C4140-V100-SXM2 Configuration-K vs PowerEdge C4140-V100-SXM2 Configuration-M
As shown in Figure 21 below, the number of CPU cores does play a role in throughput, and the biggest difference appears when running AlexNet.
7.1.8 What role does the CPU play in Deep Learning?
The CPU plays a major role in the initial phase, data preprocessing. The steps below show a pipeline in which the following four operations happen in parallel (a framework-level sketch follows the list):
a. Train on batch n (on the GPUs)
b. Copy batch n+1 to GPU memory
c. Transform batch n+2 (on the CPU)
d. Load batch n+3 from disk (on the CPU)
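To make the overlap concrete, the following is a minimal tf.data sketch, an illustration only and not the benchmark code used in this paper; the file pattern, record features, and image size are hypothetical placeholders. Background worker threads load and transform upcoming batches on the CPU while the GPUs train on the current batch.

# Minimal tf.data sketch of the overlapped input pipeline described above
# (illustration only; file pattern, feature names, and image size are
# hypothetical placeholders).
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let the runtime pick parallelism and prefetch depth

def parse_and_augment(serialized_example):
    # CPU-side transform step: decode, resize, and augment one example.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)
    return image, features["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("train-*.tfrecord"))  # d. load from disk (CPU)
    .map(parse_and_augment, num_parallel_calls=AUTOTUNE)           # c. transform (CPU)
    .batch(256)
    .prefetch(AUTOTUNE))                                           # b. stage upcoming batches
# a. the training step then consumes these batches on the GPUs while the CPU
#    keeps working ahead on the next ones.

The more CPU cores are available, the more of these map() workers can run concurrently, which is why CPU core count affects throughput, most visibly for a preprocessing-heavy model such as AlexNet.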
The data-processing loop during training is (see the sketch after this list):
a. Load mini-batch
b. Preprocess mini-batch
c. Train on mini-batch
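A minimal sketch of this load / preprocess / train loop follows, again as an illustration only; the model, optimizer, and synthetic input data are hypothetical placeholders, not the benchmark configuration used in this paper.

# Minimal sketch of the load -> preprocess -> train loop (illustration only;
# model, optimizer, and synthetic data are hypothetical placeholders).
import tensorflow as tf

# Synthetic stand-in for a real input pipeline such as the one sketched above.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([8, 224, 224, 3]),
     tf.random.uniform([8], maxval=1000, dtype=tf.int64))
).batch(4).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1000)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step(images, labels):
    # c. train on the mini-batch (runs on the GPU when one is available)
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for images, labels in dataset:        # a. load mini-batch (prefetched by tf.data)
    images = images / 255.0           # b. preprocess (here, a trivial normalization)
    loss = train_step(images, labels)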