Data Parallelism

In this section, we show how to scale the training of a ResNet-50 model with EPL data parallelism.

EPL can transform a local ResNet-50 training program into a distributed one with a few added lines of code.

+ import epl
+ epl.init()
+ epl.set_default_strategy(epl.replicate(device_count=1))
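
To make the idea behind the replicate strategy concrete, here is a minimal conceptual sketch of one data-parallel training step. It is plain Python and does not use EPL or TensorFlow; the toy model `y = w * x`, the function names, and the learning rate are all hypothetical and chosen only for illustration. The averaging step stands in for whatever gradient synchronization the framework performs across devices.

```python
# Conceptual sketch only: plain Python, no EPL or TensorFlow involved.
# All names below are hypothetical and exist only for this illustration.

def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, batch, num_replicas, lr=0.1):
    """One SGD step with the batch split evenly across model replicas."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    shard_size = len(batch) // num_replicas
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_replicas)]
    # Each replica holds the same weight and computes a gradient on its shard.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging the local gradients stands in for cross-device synchronization.
    avg_grad = sum(local_grads) / num_replicas
    # Every replica applies the same update, so the weights stay identical.
    return w - lr * avg_grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_single = data_parallel_step(0.0, batch, num_replicas=1)
w_double = data_parallel_step(0.0, batch, num_replicas=2)
```

With equally sized shards, the averaged gradient matches the full-batch gradient, so `w_single` and `w_double` agree. This is why replicating the model changes throughput rather than the result of each step.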


The following command launches a data-parallel program with two model replicas over two GPUs.

epl-launch --num_workers 2 --gpu_per_worker 1 scripts/

scripts/ is a local training script. epl-launch automatically launches the distributed training program by configuring the cluster information.
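
The arithmetic behind "two model replicas over two GPUs" can be sketched as follows. This assumes that the number of replicas equals the total number of allocated GPUs divided by the `device_count` given to `epl.replicate`. The helper `replica_count` is hypothetical and is not part of EPL.

```python
# Hypothetical helper, not an EPL API: it only illustrates the assumed
# relationship between the launch flags and the number of model replicas.

def replica_count(num_workers, gpu_per_worker, device_count):
    """Return how many model replicas fit on the allocated GPUs."""
    total_gpus = num_workers * gpu_per_worker
    if total_gpus % device_count != 0:
        raise ValueError("total GPUs must be divisible by device_count")
    return total_gpus // device_count


# Values taken from the launch command and the replicate strategy above.
replicas = replica_count(num_workers=2, gpu_per_worker=1, device_count=1)
```

Under this assumption, changing `--num_workers` or `--gpu_per_worker` changes the data parallelism degree without any change to the training script.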

Refer to the EPL ResNet Example for the detailed implementation.