Dell PowerEdge C4140 Deep Learning Performance Comparison - Scale-up vs. Scale - Page 51
Communication and Neural Networks Primitives
Page 51 highlights
Deep Learning Performance: Scale-up vs Scale-out

7.4.3 Communication and Neural Networks Primitives

We wanted to explore the critical kernels executed on one GPU when running the TensorFlow benchmarks, so we used the Nvidia profiling tool nvprof to analyze one TensorFlow benchmark trained with C4130-P100-PCIe-16GB in multi-node mode. Figure 44 shows the critical kernels executed. We found that the all-reduce communication primitives were called 38.4% of the time, which suggests that the GPU may spend too much time exchanging and communicating data rather than computing. This can be explored in depth in future projects.

Figure 44: Critical kernels executed when training with C4130-P100-16GB-SXM2 (8 GPUs) - multi-node

The CUDA Toolkit offers several performance analysis tools to optimize the performance of CUDA or OpenACC applications. The Visual Profiler (nvvp) traces CUDA activities, profiles CUDA kernels, and correlates performance instrumentation with source code. The nvprof tool collects performance events and metrics. CUDA-MEMCHECK detects memory access issues, incorrect GPU thread synchronization, and other important aspects to optimize performance [5].

Architectures & Technologies Dell EMC | Infrastructure Solutions Group 50
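The share of activity attributed to all-reduce kernels, as discussed for Figure 44, can be estimated from a per-kernel summary such as the one a profiler reports. The following is a minimal sketch, not the procedure used in the study: the function name `communication_share`, the substring markers, and the kernel names and timings are all hypothetical placeholders chosen for illustration.

```python
def communication_share(kernel_times_ms, markers=("allreduce", "all_reduce", "nccl")):
    """Return the percentage of total kernel time spent in communication kernels.

    kernel_times_ms maps a kernel name to its accumulated time in milliseconds.
    A kernel counts as communication if its lowercased name contains any of the
    marker substrings (an assumption; real kernel names vary by library version).
    """
    total = sum(kernel_times_ms.values())
    if total == 0:
        return 0.0
    comm = sum(
        time_ms
        for name, time_ms in kernel_times_ms.items()
        if any(marker in name.lower() for marker in markers)
    )
    return 100.0 * comm / total


# Illustrative placeholder values only, not measurements from the study.
sample = {
    "AllReduceKernel": 30.0,
    "sgemm_kernel": 50.0,
    "relu_kernel": 20.0,
}
print(f"{communication_share(sample):.1f}%")  # prints "30.0%"
```

A high percentage from a calculation like this would point to communication, rather than computation, as the area to investigate first.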
![](/manual_guide/products/dell-poweredge-c4140-deep-learning-performance-comparison-scaleup-vs-scaleout-ccc37c0/51.png)