

7.4.1 Hyper-parameters tuning
This section lists the commands, with the tuned hyper-parameters, used to maximize throughput in the single-node and distributed-mode server implementations. Figure 41 shows the strong impact of hyper-parameter tuning on throughput:
Single Node - TensorFlow:

python3 tf_cnn_benchmarks.py --variable_update=replicated \
    --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
    --model=ResNet50 --batch_size=128 --device=gpu --num_gpus=4 \
    --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 \
    --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' \
    --weight_decay=0.0001 --optimizer=momentum --use_fp16=True \
    --local_parameter_device=gpu --all_reduce_spec=nccl --display_every=1000
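
The learning-rate schedule flag alternates rates and epoch boundaries: with '0.4;10;0.04;60;0.004', training runs at a rate of 0.4 until epoch 10, 0.04 until epoch 60, and 0.004 thereafter. A minimal Python sketch of that interpretation follows; the helper piecewise_lr is illustrative, not part of tf_cnn_benchmarks:

# Sketch of how a schedule string such as '0.4;10;0.04;60;0.004'
# maps an epoch to a learning rate (rates at even positions,
# epoch boundaries at odd positions).
def piecewise_lr(schedule, epoch):
    values = [float(v) for v in schedule.split(';')]
    rates = values[0::2]        # 0.4, 0.04, 0.004
    boundaries = values[1::2]   # 10, 60
    for boundary, rate in zip(boundaries, rates):
        if epoch < boundary:
            return rate
    return rates[-1]            # final rate after the last boundary

assert piecewise_lr('0.4;10;0.04;60;0.004', 5) == 0.4
assert piecewise_lr('0.4;10;0.04;60;0.004', 30) == 0.04
assert piecewise_lr('0.4;10;0.04;60;0.004', 75) == 0.004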
Distributed Horovod - TensorFlow:

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
    -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_SOCKET_IFNAME=ib0 \
    -x NCCL_DEBUG=INFO --bind-to none --map-by slot \
    --mca plm_rsh_args "-p 50000" \
    python tf_cnn_benchmarks.py --variable_update=horovod \
    --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
    --model=ResNet50 --batch_size=128 --num_epochs=90 --display_every=1000 \
    --device=gpu --print_training_accuracy=true --summary_verbosity=0 \
    --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' \
    --weight_decay=0.0001 --optimizer=momentum --use_fp16=True \
    --local_parameter_device=gpu --horovod_device=gpu \
    --datasets_num_private_threads=4
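
Under --variable_update=horovod, mpirun launches eight processes (four per node); each rank pins one GPU, wraps its optimizer so gradients are averaged across ranks over NCCL, and receives the initial weights broadcast from rank 0. The following is a minimal sketch of that pattern with the TF1-era horovod.tensorflow API; the toy variable and loss stand in for the ResNet-50 model and are illustrative only:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                        # one process per GPU, started by mpirun
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # pin this rank's GPU

# Toy variable and loss standing in for the real model.
global_step = tf.train.get_or_create_global_step()
x = tf.Variable(1.0)
loss = tf.square(x)

opt = tf.train.MomentumOptimizer(learning_rate=0.4, momentum=0.9)
opt = hvd.DistributedOptimizer(opt)               # allreduce-averages gradients across ranks
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),     # sync initial weights from rank 0
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)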