Dell PowerEdge C4140 Deep Learning Performance Comparison - Scale-up vs. Scale - Page 14

Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies Dell EMC | Infrastructure Solutions Group

4.1.2 Long Test

The long tests were run to measure throughput and the training time needed to reach a given
accuracy convergence. We used 90 epochs for each training run. These tests were run using the
maximum number of GPUs supported by each server.
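
As a rough illustration of what a 90-epoch run implies, the sketch below converts epochs into optimizer steps and into a constant-throughput time estimate. It assumes the ILSVRC2012 training set of 1,281,167 images; the batch size and throughput passed in the example are hypothetical values, not results from this study, and the function names are illustrative only.

```python
import math

# ILSVRC2012 training-set size (assumption: the full 1,281,167-image train split is used).
ILSVRC2012_TRAIN_IMAGES = 1_281_167
EPOCHS = 90  # number of epochs stated in the text


def total_steps(global_batch_size: int, epochs: int = EPOCHS) -> int:
    """Optimizer steps needed to cover `epochs` passes over the training set."""
    steps_per_epoch = math.ceil(ILSVRC2012_TRAIN_IMAGES / global_batch_size)
    return steps_per_epoch * epochs


def estimated_hours(images_per_second: float, epochs: int = EPOCHS) -> float:
    """Wall-clock estimate if throughput stayed constant for the whole run."""
    return ILSVRC2012_TRAIN_IMAGES * epochs / images_per_second / 3600.0


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the arithmetic.
    print(total_steps(global_batch_size=1024))
    print(round(estimated_hours(images_per_second=5000.0), 2))
```

The estimate ignores evaluation passes, checkpointing, and input-pipeline stalls, so a measured long test would normally take longer than this figure.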

In the section below, we describe the setup used, and Table 1 gives an overall view of the
test configuration.

• Use Case - The benchmark tests target image classification with convolutional neural
  network (CNN) models.
• Benchmark code - TensorFlow Benchmarks scripts.
• Hardware Configuration - Each server is configured based on its maximum GPU support.
• Servers - The servers tested are PowerEdge R740, PowerEdge C4130, PowerEdge C4140, and a
  non-Dell EMC 8x NVLink GPU server.
• Frameworks - TensorFlow for single node, and TensorFlow with the Horovod library for
  distributed training.
• Performance - The performance metrics used for comparison across servers are throughput
  (images per second) and training time to reach top-5 accuracy and top-1 accuracy.
• Training tests - We conducted two types of tests:
  1. Short tests: for each test, 10 warmup steps were run and then the next 100 steps were
     averaged.
  2. Long tests: run to get the training accuracy convergence and the elapsed training time.
• Dataset - ILSVRC2012
• Software stack configuration - The benchmarks were run in a Docker container environment.
  See Table 1 for details.
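
The top-1 and top-5 accuracy metrics named in the Performance item can be sketched as follows. This is a minimal pure-Python illustration of the usual definition (the true label is among the k highest-scoring classes), not the evaluation code used by the benchmark scripts; the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += int(label in top_k)
    return hits / len(labels)


if __name__ == "__main__":
    # Toy 3-class example (hypothetical scores, not benchmark data).
    scores = [[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5]]
    labels = [1, 2, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```

With k=1 this reduces to ordinary classification accuracy; for ILSVRC2012, which has 1000 classes, k=5 gives the top-5 figure.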

4.2 Throughput Testing

Workload application and model   Image classification with convolutional neural network
                                 models (CNNs)
Benchmarks code                  TensorFlow Benchmarks scripts

Servers - Single Node            Server                          GPU
                                 PowerEdge R740                  P40
                                 PowerEdge C4140                 V100-16GB-SXM2
                                 PowerEdge C4140                 V100-32GB-SXM2
                                 Non Dell EMC 8x NVLink server   V100-16GB-SXM2

Servers - Multi Node             PowerEdge C4140-K               V100-16GB-SXM2
(2 nodes, 4 GPUs each)           PowerEdge C4140-K               V100-32GB-SXM2
                                 PowerEdge C4140-M               V100-16GB-SXM2

Frameworks                       TensorFlow for Single Mode
                                 TensorFlow with Horovod library for Distributed Mode
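
The short-test procedure described earlier (10 warmup steps, then an average over the next 100) can be sketched as below. The sketch takes a list of recorded per-step durations and assumes the average is computed as total images divided by total time over the timed window; the source does not say which averaging method was used, and the function name and sample durations are hypothetical.

```python
from typing import Sequence


def short_test_throughput(step_seconds: Sequence[float],
                          global_batch_size: int,
                          warmup_steps: int = 10,
                          timed_steps: int = 100) -> float:
    """Images/second over `timed_steps` steps, after discarding `warmup_steps` steps."""
    window = step_seconds[warmup_steps:warmup_steps + timed_steps]
    if len(window) < timed_steps:
        raise ValueError("not enough recorded steps for the timed window")
    # Assumption: average = images processed in the window / time spent in the window.
    return global_batch_size * timed_steps / sum(window)


if __name__ == "__main__":
    # Hypothetical durations: slow warmup steps followed by steady-state steps.
    durations = [1.0] * 10 + [0.25] * 100
    print(short_test_throughput(durations, global_batch_size=256))
```

Discarding the warmup steps keeps one-time costs such as graph construction and memory allocation out of the reported figure. For the multi-node rows, the global batch size would be the per-GPU batch multiplied by the total GPU count (8 for 2 nodes with 4 GPUs each).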