MoE Tensor Model Parallelism

This repo contains MoE (Mixture of Experts) transformer training examples with EPL.

Training setup.

The model code is based on .

Prepare dataset

The data-generation command for translate_ende_wmt32k is as follows:

t2t-datagen --data_dir=data --tmp_dir=data/original/dataset --problem=translate_ende_wmt32k

Alternatively, set FLAGS.generate_data in scripts/ to generate the dataset for the problem given by FLAGS.problem automatically.

Distributed Training

To implement MoE tensor model parallelism with EPL, you only need to change the annotations and configuration, as follows:

+ import epl
+ config = epl.Config({"cluster.colocate_split_and_replicate": True})
+ epl.init(config)
+ epl.set_default_strategy(epl.replicate(total_gpu_num))


+ with epl.split(total_gpu_num):


Refer to the EPL MoE example for the detailed implementation.
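To illustrate what splitting the expert layer across devices means, here is a minimal NumPy sketch. It is not EPL code and does not reflect how EPL implements `epl.split`. It only simulates, on one machine, a top-1 gated MoE layer whose experts are partitioned evenly across a hypothetical number of devices, and compares the result with the unsplit computation. All function and variable names (`route`, `moe_forward`, `moe_forward_split`, `num_devices`) are assumptions made for this example.

```python
import numpy as np


def route(x, gate_w):
    """Top-1 gating: return the chosen expert id and its softmax weight per token."""
    logits = x @ gate_w                                   # [tokens, num_experts]
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)                         # [tokens]
    weight = probs[np.arange(x.shape[0]), choice]         # [tokens]
    return choice, weight


def moe_forward(x, gate_w, expert_ws):
    """Reference MoE layer with all experts held in one place."""
    choice, weight = route(x, gate_w)
    out = np.zeros((x.shape[0], expert_ws.shape[2]))
    for e in range(expert_ws.shape[0]):
        mask = choice == e
        out[mask] = (x[mask] @ expert_ws[e]) * weight[mask][:, None]
    return out


def moe_forward_split(x, gate_w, expert_ws, num_devices):
    """Same layer, with experts sharded evenly across simulated devices."""
    num_experts = expert_ws.shape[0]
    assert num_experts % num_devices == 0, "experts must divide evenly"
    per_device = num_experts // num_devices
    choice, weight = route(x, gate_w)                     # gating sees every token
    out = np.zeros((x.shape[0], expert_ws.shape[2]))
    for dev in range(num_devices):
        first = dev * per_device
        local_ws = expert_ws[first:first + per_device]    # this device's shard
        for j in range(per_device):
            mask = choice == first + j                    # tokens sent to this expert
            out[mask] = (x[mask] @ local_ws[j]) * weight[mask][:, None]
    return out


rng = np.random.default_rng(0)
tokens, d_model, d_out, num_experts = 16, 8, 4, 4
x = rng.normal(size=(tokens, d_model))
gate_w = rng.normal(size=(d_model, num_experts))
expert_ws = rng.normal(size=(num_experts, d_model, d_out))

dense = moe_forward(x, gate_w, expert_ws)
sharded = moe_forward_split(x, gate_w, expert_ws, num_devices=2)
print(np.allclose(dense, sharded))
```

Each simulated device holds only its own slice of `expert_ws` and processes only the tokens routed to its experts, so per-device parameter memory shrinks with the number of devices while the output stays the same as the unsplit layer.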

The following command launches a tensor model parallelism program with two workers.

epl-launch --num_workers 2 --gpu_per_worker 1 scripts/