Dell PowerEdge C4140 Deep Learning Performance Comparison - Scale-up vs. Scale - Page 51

Communication and Neural Networks Primitives



Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies
Dell EMC | Infrastructure Solutions Group
7.4.3 Communication and Neural Networks Primitives
We wanted to explore the critical kernels executed on one GPU when running the TensorFlow
benchmarks, so we used the NVIDIA profiling tool nvprof to analyze one TensorFlow benchmark
trained with C4130-P100-PCIe-16GB in multi-node mode.
Figure 44 shows the critical kernels executed. We found that the all-reduce communication
primitives were called 38.4% of the time. This suggests that the GPU may spend too much time
exchanging data and communicating rather than computing, which is something that can be
explored in depth in future projects.
Figure 44: Critical kernels executed when training with C4130-P100-16GB-SXM2 (8 GPUs) - multi-node
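The kind of breakdown reported above can be reproduced from any per-kernel timing summary. The following is a minimal sketch, not the procedure used in this study: the function name `time_share_by_group`, the kernel names, and the timings are all hypothetical. The sample values were chosen only so that the communication share comes out at the reported 38.4%. It groups kernels by a substring of their name and reports each group's share of total GPU time.

```python
from collections import defaultdict


def time_share_by_group(kernel_times, groups):
    """Return {group label: percent of total time}.

    kernel_times: list of (kernel_name, time) pairs, any consistent time unit.
    groups: {label: substring}; a kernel joins the first group whose substring
            appears in its name (case-insensitive), otherwise "other".
    """
    total = sum(t for _, t in kernel_times)
    if total <= 0:
        raise ValueError("total kernel time must be positive")
    shares = defaultdict(float)
    for name, t in kernel_times:
        lowered = name.lower()
        label = next(
            (g for g, needle in groups.items() if needle.lower() in lowered),
            "other",
        )
        shares[label] += 100.0 * t / total
    return dict(shares)


# Hypothetical kernel names and times in milliseconds (NOT data from Figure 44).
sample = [
    ("AllReduceKernel_sum_f32", 384.0),
    ("cudnn_convolution_fwd", 300.0),
    ("cudnn_convolution_bwd", 216.0),
    ("elementwise_relu", 100.0),
]
groups = {"communication": "allreduce", "convolution": "convolution"}

shares = time_share_by_group(sample, groups)
for label, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:15s} {pct:5.1f}%")
```

A large communication share in such a summary is only a hint: time spent in all-reduce kernels can overlap with computation on other streams, so it does not translate directly into lost throughput.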
The CUDA Toolkit offers several performance analysis tools for optimizing CUDA or OpenACC
applications. The Visual Profiler (nvvp) traces CUDA activities, profiles CUDA kernels, and
correlates performance instrumentation with source code. The nvprof tool collects performance
events and metrics. CUDA-MEMCHECK detects memory access issues, incorrect GPU thread
synchronization, and other issues relevant to optimizing performance [5].