

[Figure: block diagram of the server's PCIe topology. CPU1 and CPU2 are linked by UPI; each CPU exposes three x16 PCIe ports (AB/CD lanes) that connect to three 300 W GPU-DWFL slots, an additional x16 PCIe slot, the PERC controller with SAS SSDs, and Ethernet.]
Figure 11: Dell PowerEdge R740/R740xd
6 Framework Setup Details
6.1 Distributed Horovod-TensorFlow Setup
Horovod [8][9][10] is a distributed training framework for TensorFlow, Keras, and PyTorch, initially developed by Uber. It uses bandwidth-optimal communication protocols such as RDMA [2].
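To illustrate what this looks like in practice, below is a minimal sketch of the standard Horovod additions to a tf.keras training script. It is not the exact benchmark script used in these tests: the synthetic data, the ResNet50 model choice, and the TF 2.x-style GPU pinning are our own placeholders (a TF 1.x stack of this era would pin GPUs through tf.ConfigProto instead).

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each worker process to one local GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Synthetic data stands in for the real sharded input pipeline.
data = np.random.rand(64, 224, 224, 3).astype('float32')
labels = np.random.randint(0, 1000, size=(64,))

model = tf.keras.applications.ResNet50(weights=None)

# Scale the learning rate by the worker count, then wrap the optimizer
# so gradients are averaged across workers via ring-allreduce.
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size()))

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

model.fit(
    data, labels, batch_size=8, epochs=1,
    # Broadcast rank 0's initial weights so every worker starts identical.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0)
```

The same script runs unchanged on one GPU or on many; the worker count is chosen at launch time by the MPI or horovodrun command line.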
In this section, we briefly explain the software stack configuration we used to measure multi-node throughput with distributed Horovod-TensorFlow, running over high-speed Mellanox InfiniBand ConnectX-5 network adapters at 100 Gbit/s with IPoIB and GPUDirect RDMA.
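Under the hood, Horovod hands the GPU-to-GPU allreduce to NCCL, which discovers the InfiniBand and GPUDirect RDMA paths through environment variables. The sketch below uses real NCCL variable names, but the device names (mlx5_0, ib0) are hypothetical and depend on the host:

```python
import os

# Hypothetical device names: the HCA and IPoIB interface vary per host.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")      # ConnectX-5 HCA to use for RDMA
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")  # IPoIB interface for bootstrap traffic
os.environ.setdefault("NCCL_IB_DISABLE", "0")       # keep the InfiniBand transport enabled
os.environ.setdefault("NCCL_DEBUG", "INFO")         # log which transport/RDMA paths are used
```

In practice these variables are usually exported on every node, or forwarded to all ranks with mpirun's -x option, rather than set inside the training script.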
To set up the configuration, we used as our references the configuration procedure presented by Mellanox on its community blog space [3] and the basic installation of Horovod in Docker [4].