MoE Tensor Model Parallelism
This repo contains MoE (Mixture of Experts) transformer training examples built with EPL.
Training setup
The model code is based on https://github.com/tensorflow/tensor2tensor .
Prepare dataset
Referring to https://github.com/tensorflow/tensor2tensor#adding-a-dataset, the data-generation command for translate_ende_wmt32k is as follows:
t2t-datagen --data_dir=data --tmp_dir=data/original/dataset --problem=translate_ende_wmt32k
Alternatively, set FLAGS.generate_data in scripts/train_moe_t5.sh to generate the dataset for the problem specified by FLAGS.problem automatically.
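The following is a hypothetical sketch of how a training entry point could honor these flags; the flag definitions and the maybe_generate_data() helper are illustrative rather than the repo's actual code, and rely on the standard TF 1.x flags module and the tensor2tensor problem registry:

# Hypothetical sketch: honoring FLAGS.generate_data before training starts.
# The flag definitions and maybe_generate_data() are illustrative only.
import tensorflow as tf
from tensor2tensor import problems

flags = tf.flags
flags.DEFINE_bool("generate_data", False, "Generate the dataset before training.")
flags.DEFINE_string("problem", "translate_ende_wmt32k", "tensor2tensor problem name.")
flags.DEFINE_string("data_dir", "data", "Directory for the generated TFRecords.")
flags.DEFINE_string("tmp_dir", "data/original/dataset", "Directory for raw downloads.")
FLAGS = flags.FLAGS

def maybe_generate_data():
  # Roughly what t2t-datagen does when given the same arguments.
  if FLAGS.generate_data:
    problems.problem(FLAGS.problem).generate_data(FLAGS.data_dir, FLAGS.tmp_dir)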
Distributed Training
To implement MoE tensor model parallelism, EPL only requires changes to the annotations and configuration, as follows:
+ import epl
+ config = epl.Config({"cluster.colocate_split_and_replicate": True})
+ epl.init(config)
+ epl.set_default_strategy(epl.replicate(total_gpu_num))
AttentionAndGating()
+ with epl.split(total_gpu_num):
  MOE_Variable_Define()
  MOE_Calculation_Define()
You can refer to the EPL MOE Example for the detailed implementation.
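To make the annotation concrete, below is a minimal, self-contained sketch of an MoE feed-forward block in the same spirit: the gating network (and, in the full model, the attention layers) uses the default replicate strategy, while the expert variables and computation are defined under epl.split. The expert count, layer shapes, soft (dense) gating, and function names are illustrative assumptions rather than the repo's actual model code:

# Minimal illustrative sketch: replicated attention/gating, split experts.
import tensorflow as tf
import epl

total_gpu_num = 2  # assumed to match num_workers * gpu_per_worker at launch time

config = epl.Config({"cluster.colocate_split_and_replicate": True})
epl.init(config)
# Ops and variables defined outside an explicit scope are replicated on every GPU.
epl.set_default_strategy(epl.replicate(total_gpu_num))

def moe_ffn(x, num_experts=4, hidden_size=512):
  # x: [batch, length, hidden_size]. Replicated part: the gating network.
  gates = tf.nn.softmax(tf.layers.dense(x, num_experts, name="gating"))

  # Split part: expert parameters and computation are sharded across the GPUs.
  with epl.split(total_gpu_num):
    expert_outputs = []
    for i in range(num_experts):
      h = tf.layers.dense(x, 4 * hidden_size, tf.nn.relu, name="expert_%d_up" % i)
      expert_outputs.append(tf.layers.dense(h, hidden_size, name="expert_%d_down" % i))
    experts = tf.stack(expert_outputs, axis=2)  # [batch, length, num_experts, hidden]
    out = tf.reduce_sum(experts * tf.expand_dims(gates, -1), axis=2)
  return out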
The following command launches the tensor model parallel training job with two workers, each with one GPU.
epl-launch --num_workers 2 --gpu_per_worker 1 scripts/train_moe_t5.sh
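With this launch configuration the job sees two GPUs in total (num_workers * gpu_per_worker), so total_gpu_num in the annotation above would correspond to 2: the attention and gating part is replicated on both GPUs, while the MoE variables and computation are split across them.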