How to Train Really Large Models on Many GPUs?

[Updated on 2022-03-13: add expert choice routing.] In recent years, we have seen better results on many NLP benchmark tasks with larger pre-trained language models. Training large and deep neural networks is challenging, as it demands a large amount of GPU memory and a long horizon of training time. However, an individual GPU worker has limited memory, and the sizes of many large models have grown beyond a single GPU....

September 24, 2021 · 21 min · Lilian Weng

How to Build an Open-Domain Question Answering System?

[Updated on 2020-11-12: add an example on closed-book factual QA using OpenAI API (beta).] A model that can answer any question with regard to factual knowledge can lead to many useful and practical applications, such as working as a chatbot or an AI assistant. In this post, we will review several common approaches for building such an open-domain question answering system. Disclaimers given so many papers in the wild: Assume we have access to a powerful pretrained language model....

October 29, 2020 · 33 min · Lilian Weng

The Transformer Family

It has been almost two years since my last post on attention. Recent progress on new and enhanced versions of Transformer motivates me to write another post on this specific topic, focusing on how the vanilla Transformer can be improved for longer-term attention span, less memory and computation consumption, RL task solving and more. Notations Symbol Meaning $d$ The model size / hidden state dimension / positional encoding size....

April 7, 2020 · 25 min · Lilian Weng

Generalized Language Models

[Updated on 2019-02-14: add ULMFiT and GPT-2.] [Updated on 2020-02-29: add ALBERT.] [Updated on 2020-10-25: add RoBERTa.] [Updated on 2020-12-13: add T5.] [Updated on 2020-12-30: add GPT-3.] [Updated on 2021-11-13: add XLNet, BART and ELECTRA; also updated the Summary section.] Fig. 0. I guess they are Elmo & Bert? (Image source: here) We have seen amazing progress in NLP in 2018. Large-scale pre-trained language models like OpenAI GPT and BERT have achieved great performance on a variety of language tasks using generic model architectures....

January 31, 2019 · 36 min · Lilian Weng

Attention? Attention!

[Updated on 2018-10-28: Add Pointer Network and the link to my implementation of Transformer.] [Updated on 2018-11-06: Add a link to the implementation of Transformer model.] [Updated on 2018-11-18: Add Neural Turing Machines.] [Updated on 2019-07-18: Correct the mistake on using the term “self-attention” when introducing the show-attention-tell paper; moved it to Self-Attention section.] [Updated on 2020-04-07: A follow-up post on improved Transformer models is here.] Attention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence....

June 24, 2018 · 21 min · Lilian Weng