Dell PowerEdge C4140 Deep Learning Performance Comparison - Scale-up vs. Scale - Page 51

Deep Learning Performance: Scale-up vs Scale-out

Architectures & Technologies

Dell EMC | Infrastructure Solutions Group


7.4.3 Communication and Neural Networks Primitives

We wanted to explore the critical kernels executed on one GPU when running the TensorFlow benchmarks, so we used the Nvidia profiling tool Nvprof to analyze one TensorFlow benchmark trained with C4130-P100-PCIe-16GB in multi-node mode. Figure 44 shows the critical kernels executed. We found that the all-reduce communication primitives were called 38.4% of the time, which suggests that the GPU may spend too much time exchanging and communicating data rather than computing. This is something that can be explored in depth in future projects.

Figure 44: Critical kernels executed when training with C4130-P100-16GB-SXM2 (8 GPUs) - multi-node
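To make the all-reduce primitive concrete, the following is a minimal conceptual sketch in plain Python. It is not the NCCL or TensorFlow implementation; it assumes each worker holds a gradient vector of equal length, and the function name `all_reduce_sum` and the sample values are hypothetical, chosen only for illustration. After the operation every worker holds the element-wise sum, which is why each training step involves data exchange among all GPUs.

```python
from typing import List


def all_reduce_sum(worker_grads: List[List[float]]) -> List[List[float]]:
    """Conceptual all-reduce: reduce (sum) across workers, then broadcast.

    worker_grads[i] is the gradient vector held by worker i. Real libraries
    use ring or tree algorithms over the interconnect; this sketch only
    shows the result each worker ends up with.
    """
    if not worker_grads:
        return []
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold vectors of the same length")

    # Reduce step: element-wise sum over all workers.
    reduced = [sum(values) for values in zip(*worker_grads)]

    # Broadcast step: every worker receives its own copy of the result.
    return [list(reduced) for _ in worker_grads]


if __name__ == "__main__":
    # Hypothetical gradients for 4 workers, 3 parameters each.
    grads = [
        [1.0, 2.0, 3.0],
        [0.5, 0.5, 0.5],
        [2.0, 0.0, 1.0],
        [0.5, 1.5, 0.5],
    ]
    for rank, result in enumerate(all_reduce_sum(grads)):
        print(f"worker {rank}: {result}")
```

The reduce-then-broadcast form was chosen for readability; it shows the semantics of the primitive, not the communication pattern a production library would use.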

The CUDA Toolkit offers several performance analysis tools to optimize the performance of CUDA or OpenACC applications. The Visual Profiler (nvvp) traces CUDA activities, profiles CUDA kernels, and correlates performance instrumentation with source code. Nvprof collects performance events and metrics. CUDA-MEMCHECK detects memory access issues, incorrect GPU thread synchronization, and other issues that are important to address when optimizing performance [5].
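A per-kernel timing summary, such as the one a profiler like Nvprof reports, can be post-processed to obtain the share attributed to one family of kernels. The sketch below assumes the summary has already been loaded as (kernel name, total time) pairs. The function name `kernel_time_share`, the kernel names, and the timings are hypothetical and are not the measurements behind Figure 44.

```python
from typing import Iterable, Tuple


def kernel_time_share(kernels: Iterable[Tuple[str, float]], keyword: str) -> float:
    """Return the percentage of total time spent in kernels whose name
    contains `keyword` (case-insensitive match)."""
    rows = list(kernels)
    total = sum(t for _, t in rows)
    if total <= 0:
        raise ValueError("total kernel time must be positive")
    matched = sum(t for name, t in rows if keyword.lower() in name.lower())
    return 100.0 * matched / total


if __name__ == "__main__":
    # Hypothetical summary: (kernel name, total time in milliseconds).
    summary = [
        ("allreduce_kernel", 30.0),
        ("gemm_kernel", 50.0),
        ("relu_kernel", 20.0),
    ]
    share = kernel_time_share(summary, "allreduce")
    print(f"all-reduce share of kernel time: {share:.1f}%")
```

Matching on a substring of the kernel name is a simplification; in practice the set of communication kernels would need to be identified from the actual profiler output.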
