Models#

This page lists the available pre-trained T5 models. To use a pre-trained model, you need a Gin config file that defines the model parameters and a model checkpoint to load from. For your convenience, TensorFlow checkpoints and Gin configs for common T5 pre-trained models have been made available for use in T5X. The sections below list these pre-trained models along with their Gin and checkpoint locations.
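As a minimal sketch of how these two pieces fit together, a fine-tuning run combines a model Gin file with a run config and a pre-trained checkpoint. The command below follows the pattern in the T5X README; the run-config path, the Gin macro names (`MODEL_DIR`, `MIXTURE_OR_TASK_NAME`, `TASK_FEATURE_LENGTHS`, `TRAIN_STEPS`, `INITIAL_CHECKPOINT_PATH`), and the example mixture are assumptions to verify against your T5X checkout:

```sh
# Minimal fine-tuning sketch: model Gin file + run config + pre-trained checkpoint.
# Paths, macro names, and the example mixture follow T5X README conventions and are
# assumptions; verify them against your T5X checkout.
MODEL_DIR="gs://your-bucket/t5_1_1_small_finetune"  # hypothetical output location

python -m t5x.train \
  --gin_file=t5x/examples/t5/t5_1_1/small.gin \
  --gin_file=t5x/configs/runs/finetune.gin \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --gin.MIXTURE_OR_TASK_NAME=\"glue_v002_proportional\" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 62}" \
  --gin.TRAIN_STEPS=1020000 \
  --gin.INITIAL_CHECKPOINT_PATH=\"gs://t5-data/pretrained_models/t5x/t5_1_1_small/checkpoint_1000000\" \
  --alsologtostderr
```

To fine-tune a different model, swap the model Gin file and `INITIAL_CHECKPOINT_PATH` for any of the entries listed below.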

Selecting a model:#

Publicly Available Models:

| Model | Use Case |
| --- | --- |
| T5 1.1 | Improved T5, recommended for most research. English only. |
| T5 | The original T5 work, for reproducibility. English only. |
| T5 1.1 LM-Adapted | Trained for 100K additional steps on the LM objective, per the prompt tuning paper. |
| mT5 | Multilingual T5. Recommended for multilingual research. Note that at smaller scales (at least through XL), mT5 performance is lower than T5 on English tasks. |
| mT5 LM-Adapted | Trained for 100K additional steps on the LM objective, per the zero-shot cross-lingual generation (XGen) paper. |
| umT5 | An updated mT5 model trained with a more uniform language distribution, per the UniMax paper. |
| ByT5 | A “token-free” model that uses UTF-8 bytes for input and output. Recommended for tasks involving word-internal phenomena such as spelling, pronunciation, or morphology. |
| LongT5 | Recommended checkpoints to fine-tune for long input sequence tasks. |
| MoE | Useful for MoE experimentation. |
| Flan-T5 | General-purpose T5 checkpoints for few-shot prompting and fine-tuning. We recommend Flan-T5 over vanilla T5 and T5 LM-Adapted. |
| UL2 | Checkpoints for 20B pretrained and FLAN-based instruction-tuned models using the UL2 objective, from the UL2 paper. |
| BigScience | Checkpoints from the BigScience paper. |
| FLIP | Language-image models trained with an alternative to CLIP, presented in the FLIP paper. |
| RankGen | A 1.2B-parameter English encoder model that scores model generations given a decoding prefix, from the RankGen paper. |
| Dipper | An 11B-parameter paraphrase generation model from the Dipper paper. |

Public Research Models#

T5 Checkpoints#

These are the checkpoints used in the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. They are encoder-decoder models pre-trained on C4 with a “span corruption” denoising objective, in addition to a mixture of downstream tasks including: GLUE, SuperGLUE, CNN/Daily Mail, SQuAD, and WMT.

Vocabulary: cc_all.32000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| T5 Small | t5_small.gin | gs://t5-data/pretrained_models/t5x/t5_small/checkpoint_1000000 |
| T5 Base | t5_base.gin | gs://t5-data/pretrained_models/t5x/t5_base/checkpoint_999900 |
| T5 Large | t5_large.gin | gs://t5-data/pretrained_models/t5x/t5_large/checkpoint_1000700 |
| T5 3B | t5_3B.gin | gs://t5-data/pretrained_models/t5x/t5_3B/checkpoint_1000000 |
| T5 11B | t5_11B.gin | gs://t5-data/pretrained_models/t5x/t5_11B/checkpoint_1000000 |

T5 1.1 Checkpoints#

These are similar to the models from Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, but with the following improvements:

  • GEGLU activation in the feed-forward hidden layer, rather than ReLU (see https://arxiv.org/abs/2002.05202).

  • Dropout was turned off during pre-training (a quality win). Dropout should be re-enabled during fine-tuning (see the example below).

  • Pre-trained on C4 only, without mixing in the downstream tasks.

  • No parameter sharing between the embedding and classifier layers.

  • “xl” and “xxl” replace “3B” and “11B”. The model shapes are a bit different: larger d_model and smaller num_heads and d_ff.

For English-language, sequence-to-sequence-style tasks (ones where the goal is to map from an input text sequence to a target sequence), these are usually the best models to fine-tune.
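For instance, a fine-tuning sketch for T5 1.1 Large that re-enables dropout might look like the following. `DROPOUT_RATE` is assumed to be the macro name exposed by the T5X example configs, and the remaining flags follow the sketch at the top of this page; verify both against your checkout:

```sh
# Sketch: fine-tune T5 1.1 Large with dropout re-enabled (it is 0 during pre-training).
# DROPOUT_RATE and the other macro names are assumptions; verify against your checkout.
python -m t5x.train \
  --gin_file=t5x/examples/t5/t5_1_1/large.gin \
  --gin_file=t5x/configs/runs/finetune.gin \
  --gin.DROPOUT_RATE=0.1 \
  --gin.MODEL_DIR=\"gs://your-bucket/t5_1_1_large_finetune\" \
  --gin.MIXTURE_OR_TASK_NAME=\"glue_v002_proportional\" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 62}" \
  --gin.TRAIN_STEPS=1020000 \
  --gin.INITIAL_CHECKPOINT_PATH=\"gs://t5-data/pretrained_models/t5x/t5_1_1_large/checkpoint_1000000\"
```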

Vocabulary: cc_all.32000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| T5 1.1 Small | t5_1_1/small.gin | gs://t5-data/pretrained_models/t5x/t5_1_1_small/checkpoint_1000000 |
| T5 1.1 Base | t5_1_1/base.gin | gs://t5-data/pretrained_models/t5x/t5_1_1_base/checkpoint_1000000 |
| T5 1.1 Large | t5_1_1/large.gin | gs://t5-data/pretrained_models/t5x/t5_1_1_large/checkpoint_1000000 |
| T5 1.1 XL | t5_1_1/xl.gin | gs://t5-data/pretrained_models/t5x/t5_1_1_xl/checkpoint_1000000 |
| T5 1.1 XXL | t5_1_1/xxl.gin | gs://t5-data/pretrained_models/t5x/t5_1_1_xxl/checkpoint_1000000 |

T5 1.1 LM-Adapted Checkpoints#

These “LM-adapted” models are initialized from T5 1.1 (above) and trained for an additional 100K steps on the LM objective discussed in the T5 paper. This adaptation improves the ability of the model to be used for prompt tuning. These checkpoints were also used within the BigScience T0 project.

Vocabulary: cc_all.32000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| T5 1.1 LM-100K Small | t5_1_1_small.gin | t5_1_1_lm100k_small/checkpoint_1100000 |
| T5 1.1 LM-100K Base | t5_1_1_base.gin | t5_1_1_lm100k_base/checkpoint_1100000 |
| T5 1.1 LM-100K Large | t5_1_1_large.gin | t5_1_1_lm100k_large/checkpoint_1100000 |
| T5 1.1 LM-100K XL | t5_1_1_xl.gin | t5_1_1_lm100k_xl/checkpoint_1100000 |
| T5 1.1 LM-100K XXL | t5_1_1_xxl.gin | t5_1_1_lm100k_xxl/checkpoint_1100000 |

mT5 Checkpoints#

These are the checkpoints used in the paper mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. They are encoder-decoder models trained on multilingual C4 with a denoising objective. These are the best checkpoints to fine-tune for non-English sequence-to-sequence tasks.

Vocabulary: mc4.250000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| mT5 Small | mt5/small.gin | gs://t5-data/pretrained_models/t5x/mt5_small/checkpoint_1000000 |
| mT5 Base | mt5/base.gin | gs://t5-data/pretrained_models/t5x/mt5_base/checkpoint_1000000 |
| mT5 Large | mt5/large.gin | gs://t5-data/pretrained_models/t5x/mt5_large/checkpoint_1000000 |
| mT5 XL | mt5/xl.gin | gs://t5-data/pretrained_models/t5x/mt5_xl/checkpoint_1000000 |
| mT5 XXL | mt5/xxl.gin | gs://t5-data/pretrained_models/t5x/mt5_xxl/checkpoint_1000000 |

mT5 LM-Adapted Checkpoints#

These are the checkpoints released as part of the zero-shot cross-lingual generation (XGen) paper. These “LM-adapted” models are initialized from mT5 (above) and trained for an additional 100K steps on the LM objective discussed in the T5 paper. This adaptation improves the ability of the model to be used for prompt tuning.

Vocabulary: mc4.250000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| mT5 LM-Adapted Small | mt5/small.gin | mt5_lm_adapted/small/checkpoint_1100000 |
| mT5 LM-Adapted Base | mt5/base.gin | mt5_lm_adapted/base/checkpoint_1100000 |
| mT5 LM-Adapted Large | mt5/large.gin | mt5_lm_adapted/large/checkpoint_1100000 |
| mT5 LM-Adapted XL | mt5/xl.gin | mt5_lm_adapted/xl/checkpoint_1100000 |
| mT5 LM-Adapted XXL | mt5/xxl.gin | mt5_lm_adapted/xxl/checkpoint_1100000 |

umT5 Checkpoints#

These are the checkpoints described in the paper UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. umT5 is similar to mT5 (see above); both are multilingual encoder-decoder models ranging from 300M to 13B parameters, trained on the mC4 corpus using a denoising objective. umT5 is trained on a fresher version of the mC4 corpus (3.1.0), and with a more uniform language balancing strategy.

Vocabulary: umt5.256000

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| umT5 Small | umt5/pretrain_small.gin | umt5/small/checkpoint_1000000 |
| umT5 Base | umt5/pretrain_base.gin | umt5/base/checkpoint_1000000 |
| umT5 XL | umt5/pretrain_xl.gin | umt5/xl/checkpoint_1000000 |
| umT5 XXL | umt5/pretrain_xxl.gin | umt5/xxl/checkpoint_1000000 |

ByT5 Checkpoints#

These are the checkpoints used in the paper ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. They are similar to mT5 (above), but are “token-free”: they process text as raw UTF-8 bytes rather than using a pre-trained subword vocabulary. These models are more robust to character-level noise and outperform parameter-matched mT5 models in many settings, particularly on word-level tasks sensitive to spelling, pronunciation, or morphology. However, inference is significantly slower (up to 10x, depending on the task).

Vocabulary: None

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| ByT5 Small | byt5/small.gin | gs://t5-data/pretrained_models/t5x/byt5_small/checkpoint_1000000 |
| ByT5 Base | byt5/base.gin | gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000 |
| ByT5 Large | byt5/large.gin | gs://t5-data/pretrained_models/t5x/byt5_large/checkpoint_1000000 |
| ByT5 XL | byt5/xl.gin | gs://t5-data/pretrained_models/t5x/byt5_xl/checkpoint_1000000 |
| ByT5 XXL | byt5/xxl.gin | gs://t5-data/pretrained_models/t5x/byt5_xxl/checkpoint_1000000 |

LongT5 Checkpoints#

These are the checkpoints used in the paper LongT5: Efficient Text-to-Text Transformer for Long Sequences. They are encoder-decoder models trained on C4 using the PEGASUS Principle Sentences Generation objective. These are the recommended checkpoints to fine-tune for long input sequence tasks.

LongT5 Local Attention Checkpoints#

The checkpoints below use local attention, which uses a sliding window to reduce training time from quadratic (with respect to input length) to linear. These are the recommended checkpoints for faster training and inference.

Vocabulary: cc_all.32000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| LongT5 Local Attention Base | longt5/models/longt5_1_1_base.gin | gs://t5-data/pretrained_models/t5x/longt5/local_base/checkpoint_1000000 |
| LongT5 Local Attention Large | longt5/models/longt5_1_1_large.gin | gs://t5-data/pretrained_models/t5x/longt5/local_large/checkpoint_1000000 |

LongT5 Transient Global Attention Checkpoints#

The checkpoints below use transient global attention, which introduces global tokens at each encoder layer to allow tokens to interact with each other at longer distances. These are the recommended checkpoints to use for increased performance on long input sequence tasks.

Vocabulary: cc_all.32000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| LongT5 Base | longt5/models/longt5_1_1_transient_base.gin | gs://t5-data/pretrained_models/t5x/longt5/tglobal_base/checkpoint_1000000 |
| LongT5 Large | longt5/models/longt5_1_1_transient_large.gin | gs://t5-data/pretrained_models/t5x/longt5/tglobal_large/checkpoint_1000000 |
| LongT5 XL | longt5/models/longt5_1_1_transient_xl.gin | gs://t5-data/pretrained_models/t5x/longt5/tglobal_xl/checkpoint_1000000 |

Mixture of Experts (MoE) Checkpoints#

These MoE checkpoints need to be used with T5X MoE overrides – specifically, the MoeTrainer and the MoePjitPartitioner. For example, for fine-tuning, use the MoE fine-tune run config.
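As a rough sketch of what such a run looks like: the model and run-config Gin paths below are placeholders (the exact locations depend on your T5X checkout, so use the MoE fine-tune run config referenced above), and the checkpoint is the 8-expert Switch Base entry from the table below:

```sh
# Hypothetical sketch of fine-tuning a Switch Transformer checkpoint. The Gin file
# paths are placeholders; use the MoE fine-tune run config referenced above so the
# MoeTrainer and MoePjitPartitioner overrides are applied. Macro names follow the
# T5X README conventions and are assumptions.
python -m t5x.train \
  --gin_file=path/to/switch_base.gin \
  --gin_file=path/to/moe_finetune_run_config.gin \
  --gin.MODEL_DIR=\"gs://your-bucket/switch_base_e8_finetune\" \
  --gin.MIXTURE_OR_TASK_NAME=\"glue_v002_proportional\" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 512, 'targets': 62}" \
  --gin.TRAIN_STEPS=520100 \
  --gin.INITIAL_CHECKPOINT_PATH=\"gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e8/checkpoint_500100\"
```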

Converted Mesh Tensorflow checkpoints#

These are the checkpoints from the Switch Transformer model.

Vocabulary: cc_all.32000.100extra

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| Switch Transformer Base 8 Experts | switch_base.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e8/checkpoint_500100 |
| Switch Transformer Base 16 Experts | switch_base.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e16/checkpoint_550000 |
| Switch Transformer Base 32 Experts | switch_base.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e32/checkpoint_550000 |
| Switch Transformer Base 64 Experts | switch_base.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e64/checkpoint_550000 |
| Switch Transformer Base 128 Experts | switch_base.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e128/checkpoint_550000 |
| Switch Transformer Base 256 Experts | switch_base.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/base/e256/checkpoint_550000 |
| Switch Transformer Large 128 Experts | switch_large.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/large/e128/checkpoint_483100 |
| Switch Transformer XXL 128 Experts | switch_xxl.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/xxl/e128/checkpoint_634600 |
| Switch Transformer C 2048 Experts (1.6T) | switch_c.gin | gs://t5-data/pretrained_models/t5x/moe/switch_classic/c/e2048/checkpoint_611800 |

Flan-T5 Checkpoints#

These are the checkpoints released as part of the paper Scaling Instruction-Finetuned Language Models. They were initialized from the T5 1.1 LM-Adapted checkpoints and then instruction-finetuned.

They significantly outperform the LM-adapted checkpoints. For example, Flan-T5-XXL outperforms T5-LM-XXL by 26.6% absolute on the normalized average score. It even outperforms a much larger PaLM 62B model on BigBench Hard, a set of challenging BigBench tasks.

Unlike the vanilla T5 checkpoints, these can be directly used for few-shot prompting as well as standard finetuning. See Chung et al. 2022 for details.
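As a sketch, inference with one of these checkpoints can follow the standard T5X infer pattern. The run-config path and macro names below mirror the T5X README conventions, and the task name is a placeholder for a registered SeqIO task; verify both against your setup:

```sh
# Minimal inference sketch with Flan-T5 XL; the run-config path and macro names are
# assumptions based on the T5X README, and "my_eval_task" is a placeholder task.
python -m t5x.infer \
  --gin_file=t5x/examples/t5/t5_1_1/xl.gin \
  --gin_file=t5x/configs/runs/infer.gin \
  --gin.CHECKPOINT_PATH=\"gs://t5-data/pretrained_models/t5x/flan_t5_xl/checkpoint_1138000\" \
  --gin.INFER_OUTPUT_DIR=\"gs://your-bucket/flan_t5_xl_infer\" \
  --gin.MIXTURE_OR_TASK_NAME=\"my_eval_task\" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 1024, 'targets': 256}"
```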

| Model | Gin File Location | Checkpoint Location |
| --- | --- | --- |
| Flan-T5 Small | t5_1_1/small.gin | gs://t5-data/pretrained_models/t5x/flan_t5_small/checkpoint_1198000 |
| Flan-T5 Base | t5_1_1/base.gin | gs://t5-data/pretrained_models/t5x/flan_t5_base/checkpoint_1184000 |
| Flan-T5 Large | t5_1_1/large.gin | gs://t5-data/pretrained_models/t5x/flan_t5_large/checkpoint_1164000 |
| Flan-T5 XL | t5_1_1/xl.gin | gs://t5-data/pretrained_models/t5x/flan_t5_xl/checkpoint_1138000 |
| Flan-T5 XXL | t5_1_1/xxl.gin | gs://t5-data/pretrained_models/t5x/flan_t5_xxl/checkpoint_1114000 |

UL2 Checkpoints#

Checkpoints for the 20B pretrained model and its FLAN-based instruction-tuned variant, trained with the UL2 objective from the UL2 paper. Checkpoints are released at https://github.com/google-research/google-research/tree/master/ul2#checkpoints.

BigScience Checkpoints#

Checkpoints from the BigScience paper, released at https://github.com/bigscience-workshop/architecture-objective/tree/main#checkpoints.

FLIP Checkpoints#

Language-Image models trained with an alternative to CLIP, presented in the FLIP paper. Checkpoints are released at https://github.com/facebookresearch/flip#results-and-pre-trained-flip-models.

RankGen Checkpoints#

A 1.2B-parameter English encoder model from the RankGen paper that scores model generations given a decoding prefix. Checkpoints are released at https://github.com/google-research/google-research/tree/master/rankgen.

Dipper Checkpoints#

An 11B-parameter paraphrase generation model from the Dipper paper. Checkpoints are released at https://github.com/google-research/google-research/tree/master/dipper.