How to Train Really Large Models on Many GPUs?
[Updated on 2022-03-13: add expert choice routing.] In recent years, we are seeing better results on many NLP benchmark tasks with larger pre-trained language models. How to train large and deep neural networks is challenging, as it demands a large amount of GPU memory and a long horizon of training time. However an individual GPU worker has limited memory and the sizes of many large models have grown beyond a single GPU....