Dell PowerEdge C4140 Deep Learning Performance Comparison: Scale-up vs. Scale-out




Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies, Dell EMC | Infrastructure Solutions Group
Figure 14 shows how, with GPUDirect RDMA, GPU memory is accessed directly by the network adapter instead of the data being copied multiple times across system components; this capability is reflected directly in the throughput performance of the server.
Figure 14: NVIDIA GPUDirect RDMA Connection. Source: https://www.sc-asia.org
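To make the copy-count argument concrete, here is a minimal back-of-the-envelope model (the bandwidth figures are illustrative assumptions, not measurements from this paper) comparing a transfer staged through host memory with a direct GPUDirect RDMA transfer:

```python
# Illustrative transfer-time model: the staged path copies the buffer through
# host memory (GPU -> host RAM -> NIC), while GPUDirect RDMA lets the NIC
# read GPU memory directly (a single traversal). Bandwidths are assumed.

GIB = 1024**3

def transfer_time(size_bytes, hop_bandwidths_gib_s):
    """Total time when the same buffer crosses each hop in sequence."""
    return sum(size_bytes / (bw * GIB) for bw in hop_bandwidths_gib_s)

size = 1 * GIB  # a 1 GiB gradient buffer

# Staged path: GPU -> host over PCIe (~12 GiB/s), then host -> NIC (~10 GiB/s)
staged = transfer_time(size, [12, 10])

# GPUDirect RDMA: NIC reads GPU memory directly over PCIe (~10 GiB/s)
direct = transfer_time(size, [10])

print(f"staged: {staged*1000:.1f} ms, direct: {direct*1000:.1f} ms")
# -> staged: 183.3 ms, direct: 100.0 ms
```

The exact numbers are hypothetical; the point is that eliminating the intermediate host copy removes a serial hop from every transfer, which is why the feature shows up directly in server throughput.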
6.2 Evaluation Platform Setup
Table 4 lists the software stack configuration used to build the environment for running the tests.
| Software Stack | PowerEdge Servers | Non-Dell EMC Servers |
|---|---|---|
| OS | Ubuntu 16.04.4 LTS | Ubuntu 16.04.3 LTS |
| Kernel | GNU/Linux 4.4.0-128-generic x86_64 | GNU/Linux 4.4.0-130-generic x86_64 |
| NVIDIA driver | 396.26 (all servers); 390.46 (R740-P40) | 384.145 |
| Open MPI | 3.0.1 | 3.0.0 |
| CUDA | 9.1.85 | 9.0.176 |
| cuDNN | 7.1.3.16 | 7.1.4 |
| NCCL | 2.2.15 | 2.2.13 |
| Docker container | NVIDIA TensorFlow Docker | NVIDIA TensorFlow Docker |
| Container image (single node) | tensorflow/tensorflow:nightly-gpu-py3 | nvcr.io/nvidia/tensorflow:18.06-py3 |
| Container image (multi node) | Horovod: latest | n/a |
| Benchmark scripts | tf_cnn_benchmarks | tf_cnn_benchmarks |
| Test date (V1) | April-June 2018 | July 2018 |
| Test date (V2) | January 2019 | n/a |

Table 4: OS & Driver Configurations
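As a sketch of how a multi-node run over this stack might be launched with the Horovod-enabled tf_cnn_benchmarks scripts: the host names, GPU counts, and script path below are placeholders, and the flags come from the public tf_cnn_benchmarks documentation rather than from this paper.

```python
# Sketch: assemble an mpirun launch command for a multi-node
# tf_cnn_benchmarks run with Horovod. Hosts and paths are placeholders.

def horovod_benchmark_cmd(hosts, gpus_per_host, model="resnet50", batch_size=128):
    n_procs = len(hosts) * gpus_per_host          # one MPI rank per GPU
    host_list = ",".join(f"{h}:{gpus_per_host}" for h in hosts)
    return [
        "mpirun", "-np", str(n_procs), "-H", host_list,
        "python", "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--batch_size={batch_size}",
        "--variable_update=horovod",   # Horovod allreduce (NCCL underneath)
        "--use_fp16",                  # mixed precision on Volta-class GPUs
    ]

print(" ".join(horovod_benchmark_cmd(["node1", "node2"], 4)))
# -> mpirun -np 8 -H node1:4,node2:4 python tf_cnn_benchmarks.py
#    --model=resnet50 --batch_size=128 --variable_update=horovod --use_fp16
```

In practice such a command would be run inside the containers listed in Table 4, with the MPI and NCCL versions from the table providing the transport layer.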