Auxiliary Job#
Introduction#
This page outlines the steps needed to use the auxiliary job capabilities available in T5X.
Overview#
There are a variety of situations in which running a single job is insufficient or suboptimal. For example, consider the following scenarios:
You want to keep track of evaluation (
infer_eval
ortrain_eval
) metrics per checkpoint, but evaluation takes a very long time due to having a large eval dataset, slow decoding, or multiple tasks to evaluate.You want to finetune every checkpoint on a downstream task as you train.
You have customized evaluation code that you want to run on every checkpoint as you train, but that does not naturally fit within a
seqio.Evaluator
framework.
In cases like these, users can make use of the auxiliary job functionality. At a
high-level, the auxiliary job will launch a new job every time a new checkpoint
is saved. This new job can either re-use the train.py
binary (e.g. for
continuous finetuning) or a different one. For example, this allows users to
perform continuous evaluation (using eval.py
) without slowing down the
training job. We will provide detailed examples showing how to use the auxiliary
job for these use-cases.
When this new job is launched, the controller will replace four gin macros:
MODEL_DIR
, MIXTURE_OR_TASK_NAME
,INITIAL_CHECKPOINT_PATH
, TRAIN_STEPS
.
The second of these is set by the user-controlled flag (more on this below), and
the third one is equal to the last checkpoint seen. Aside from this, users are
free to modify the configuration as needed. Beyond gin macros, the auxiliary job
can also have different resource requirements, priority, and even cell placement
from the train job.
Example 1: Separate evaluation job.#
Step 1: Choose a model architecture.#
Similar to pretraining, we will need some gin configuration. For this example, we will use the T5-1.1-Base model.
Step 2: Choose a SeqIO Task/Mixture for training and evaluation.#
In this example, we will use the classic task of English-French translation from
WMT14, which is conveniently available as a SeqIO task in the tasks file from
the T5 tasks under the name 'wmt_enfr14_v003'
.
Step 3: Write a Gin config.#
Unlike pretraining or finetuning, we will need two gin files for this setup: one
for the training job, and one for the auxiliary job. The train gin file will
have the same requirements as the gin file for pretraining or finetuning. The
auxiliary job gin file can leverage these gin files or be its own independent
gin file, depending on the user’s choice. For this example, we will make a new
gin which is mostly a wrapper around pretrain.gin
with some additional
hardcoded features. We will use this gin file for the train job and eval.gin
for the auxiliary job.
Step 4: Launch your experiment.#
Our sample script will be quite similar to the one used in pretraining and finetuning, but with a few additional flags which we describe below.
auxiliary_job_mixtures
: This is a comma-separated list of mixtures. A separate auxiliary job will be run for each mixture and will replace the gin macroMIXTURE_OR_TASK_NAME
. Note that you need this flag even if you are using a custom binary, which does not need a mixture since otherwise no auxiliary job will run.auxiliary_job_gin_file
: This is identical togin_file
, except it is used for the auxiliary job instead of the train job.replace_gin_file
: If True, this auxiliary launcher will not use any of the gin files from train job. This is necessary when using a binary different fromtrain.py
, since the top-level functions will not match.auxiliary_job_cell
: The cell in which to run your job. Note that this can be different from the training cell.auxiliary_job_platform
: The platform to use for the auxiliary. Note that this can be different from the one use for the train job, allowing users to use smaller configurations for evaluation than needed for training.auxiliary_job_build_target
: The binary to use for auxiliary job.final_auxiliary_job_steps
: This flag controls how many additional steps to take when using the auxiliary job for finetuning. Setting to 0 enables continuous evaluation.
We provide the sample script below.
declare -a ARGS=(
--cell=iz
--platform=jd=2x2
--final_auxiliary_job_steps=0
--replace_gin_file=True
--auxiliary_job_mixtures=wmt14_enfr_v003
--auxiliary_job_gin_file=t5x/examples/t5/t5_1_1/examples/base_wmt14enfr_eval.gin
--auxiliary_job_cell=iz
--auxiliary_job_platform=jd=2x2
--auxiliary_job_build_target_path=//t5x:eval
--gin_file=t5x/examples/t5/t5_1_1/examples/base_wmt14enfr_train.gin
)
gxm t5x/google/xm_launch.py "${ARGS[@]}"
Example 2: Continuous finetuning job.#
In this example, we will be pretraining a model on a span corruption task on the
C4 dataset, and finetuning it on the WMT’14 English-French translation task. As
before, we will launch a new auxiliary job once every checkpoint is saved.
However, instead of using the eval.py
binary, we will use the train.py
binary.
Step 1: Choose a model architecture.#
We will use the T5-1.1-Base model as in the previous example.
Step 2: Choose a SeqIO Task/Mixture for training and evaluation.#
For pretraining, we re-use the span coprruption task c4_v220_span_corruption
available in the T5 mixtures tasks.py
file.
Step 3: Write a Gin config.#
As before, we need our gin files to contain all the desired macros in them. We
thus create two new gin files: base_c4_pretrain.gin
for the train job and
base_wmtenfr14_finetune.gin
for the auxiliary job.
Step 4: Launch your experiment.#
Our script is quite similar to the first example, with the same flags as before
but with the appropiate changes. The main distinction is that we must change the
flag final_auxiliary_job_steps
to be non-zero to start finetuning. We will
settle for a modest 200 steps for the sake of demonstration (and evaluate every
100 steps), but users should use larger steps in realistic scenarios. We also
use train.py
binary instead of eval.py
.
We provide the sample script below.
declare -a ARGS=(
--cell=iz
--platform=jd=2x2
--final_auxiliary_job_steps=200
--replace_gin_file=True
--auxiliary_job_mixtures=wmt14_enfr_v003
--auxiliary_job_gin_file=t5x/examples/t5/t5_1_1/examples/base_wmt14enfr_finetune.gin
--auxiliary_job_cell=iz
--auxiliary_job_platform=jd=2x2
--auxiliary_job_build_target_path=//t5x:train
--gin_file=t5x/examples/t5/t5_1_1/examples/base_c4_pretrain.gin
)
gxm t5x/google/xm_launch.py "${ARGS[@]}"
Common Gotchas.#
We outline a few common error patterns that we have encountered.
Not passing a value for the
auxiliary_mixtures
flag. Even if you have the desired task in your gin file, or you use a differently named macro, you should still pass a value for this flag, since launch script will launch a new job per value of this flag.Not setting
replace_gin_file=True
when using a different binary from train.py. This will usually yield an error that there is notrain
function.No metrics being logged. It can be tempting to use gin files usually used for evaluation. However, one must ensure that the corresponding SeqIO evaluators still log to the tensorboard, otherwise you won’t see the metrics.
Slow
train_eval
. While the approach outlined above separates out the infer_eval job, it may be that even train_eval is too slow. In these situations, we suggest adding the metrics from train_eval into themetrics_fn
argument of the SeqIO task and have them be computed in the auxiliary job as well. To do this with teacher forcing, you will have to usetrain.py
instead ofeval.py
.Using
CHECKPOINT_PATH
ratherINITIAL_CHECKPOINT_PATH
. For legacy reasons, the auxiliary job uses the macroINITIAL_CHECKPOINT_PATH
rather thanCHECKPOINT_PATH
as found ineval.gin
. Make sure to use the latter macro building your gin scripts.Gin macros being ignored when passed through the format
gin.{MACRO}={VAL}
. In the current setup, you must include all gin macros in the gin script. Attempting to pass them as additional flags will usually not work.Not setting
final_auxiliary_job_steps=0
when performing continuous evaluation. The current parameter controller uses this as a check. When this is true, it will replace theEVAL_OUTPUT_DIR
folder with the currentMODEL_DIR
, so that the evaluation metrics are saved in the right place and the metrics are showed correctly on the tensorboard.