Data Parallelism
In this section, we show how to scale the training of a ResNet-50 model with EPL data parallelism.
EPL can transform the local ResNet-50 training program into a distributed one by adding a few lines of code:
```diff
+ import epl
+ epl.init()
+ epl.set_default_strategy(epl.replicate(device_count=1))

  ResNet50()
  training_session()
```
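For context, here is a minimal, self-contained sketch of how these lines might fit into an ordinary local training script. It assumes TensorFlow 1.x graph mode (which EPL targets), and a toy classifier stands in for ResNet-50 to keep the example short; only the first three EPL lines come from the snippet above.

```python
import epl
import tensorflow as tf  # assumes TensorFlow 1.x graph mode

# Initialize EPL before building the model graph.
epl.init()
# Replicate the whole model, one replica per GPU; launching more
# workers/GPUs then adds data-parallel replicas without code changes.
epl.set_default_strategy(epl.replicate(device_count=1))

# Everything below is an ordinary local training program.
# A toy linear classifier stands in for ResNet-50 in this sketch.
features = tf.random.uniform([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=1000, dtype=tf.int32)
logits = tf.layers.dense(tf.reduce_mean(features, axis=[1, 2]), 1000)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                   logits=logits))
train_op = tf.train.MomentumOptimizer(0.1, 0.9).minimize(loss)

with tf.train.MonitoredTrainingSession() as sess:
    for _ in range(10):
        sess.run(train_op)
```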
The following command launches a data-parallel training program with two model replicas across two GPUs.

```bash
epl-launch --num_workers 2 --gpu_per_worker 1 scripts/train_dp.sh
```
`scripts/train_dp.sh` is a local training script; `epl-launch` automatically launches the distributed training program by configuring the cluster information for each worker.
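Scaling out further is just a matter of launcher arguments. As a variation (using only the flags shown above, and assuming one replica per GPU as configured by `epl.replicate(device_count=1)`), four workers with two GPUs each would yield eight data-parallel replicas:

```bash
# Hypothetical scale-out: 4 workers x 2 GPUs per worker = 8 model
# replicas, assuming one replica per GPU as configured above.
epl-launch --num_workers 4 --gpu_per_worker 2 scripts/train_dp.sh
```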
You can refer to the EPL ResNet Example for a detailed implementation.