March 31, 2020
Posted by Jaeyoun Kim, Technical Program Manager, and Jing Li, Software Engineer
We would like to introduce an update to the Model Garden that gives TensorFlow users a centralized place to find code examples for state-of-the-art models and reusable modeling libraries for TensorFlow 2.
The Model Garden aims to demonstrate the best practices for modeling so that TensorFlow users can take full adv…
pip install tf-models-nightly

TensorFlow 2 users can get started immediately with code examples to learn the best practices for training models on GPUs and TPUs.

Training on multiple GPUs uses the tf.distribute.MirroredStrategy API. Here is an example that trains a model on ImageNet data using two GPUs. classifier_trainer.py is a new unified framework for training image classification models using TensorFlow's high-level API for building and training deep learning models (the Keras compile and fit methods).

$ python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/resnet/imagenet/gpu.yaml \
--params_override="runtime.num_gpus=2"
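When runtime.num_gpus in the command above changes, the batch size and step counts usually change with it. The sketch below is one rough way to do that arithmetic, assuming the global batch is split evenly across GPUs and borrowing the commonly used linear learning-rate scaling heuristic. The helper name scale_for_gpus and its default per-GPU batch size, base learning rate, and base batch size are illustrative assumptions, not values taken from gpu.yaml.

```python
# Rough bookkeeping for synchronous data-parallel training on one host.
# The defaults below are hypothetical and are NOT read from gpu.yaml.

IMAGENET_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1k training split


def scale_for_gpus(num_gpus, per_gpu_batch_size=128, base_lr=0.1,
                   base_batch_size=256):
    """Return global batch size, steps per epoch, and a scaled learning rate."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # In synchronous training each step consumes one batch per GPU.
    global_batch_size = per_gpu_batch_size * num_gpus
    # Drop the final partial batch when counting steps.
    steps_per_epoch = IMAGENET_TRAIN_IMAGES // global_batch_size
    # Linear scaling heuristic: learning rate grows with the global batch.
    learning_rate = base_lr * global_batch_size / base_batch_size
    return {
        "global_batch_size": global_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "learning_rate": learning_rate,
    }


if __name__ == "__main__":
    for gpus in (1, 2, 4, 8):
        print(gpus, scale_for_gpus(gpus))
```

Treat the output as a starting point only; the values that actually work best still have to be found by experiment.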
You need to specify the number of GPUs (e.g., --params_override="runtime.num_gpus=2") to run synchronous distributed training on multiple GPUs on a single GPU host. You also need to find optimal values for the batch size, number of epochs, and number of training steps as you scale your training to more GPUs. Examples of these values for the ImageNet dataset can be found in the configuration file (configs/examples/resnet/imagenet/gpu.yaml). For hyperparameter tuning, you may be interested in checking out Keras Tuner.

To train across multiple GPU hosts, set the TF_CONFIG environment variable at each GPU host to specify which tasks constitute the cluster, their addresses, and each task's role in the cluster. Here is a typical example of TF_CONFIG for the first GPU host, appointed as the chief worker:

import json
import os

# Run on the first GPU host; replace each "hostN:port" with that host's real address.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:port", "host2:port", "host3:port"]
    },
    "task": {"type": "worker", "index": 0}})
In this example, the "worker" part configures three GPU hosts (host1, host2, and host3) to run training using MultiWorkerMirroredStrategy. The "task" part specifies the role of the current task in the cluster. For the chief worker, which takes on a little more responsibility (e.g., saving checkpoints), you need to set the task type to "worker" and the task index to 0. MultiWorkerMirroredStrategy will automatically use all the available GPUs at each host.

Training on Cloud TPUs uses tf.distribute.TPUStrategy. You can use the same framework (classifier_trainer.py) to train a model using Cloud TPUs. You just need to use a different YAML configuration file for TPU and set --tpu=$TPU_NAME, where $TPU_NAME is your TPU instance name in the Cloud Console. Please see the TPU configuration file (configs/examples/resnet/imagenet/tpu.yaml).

$ python3 classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--tpu=$TPU_NAME \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=configs/examples/resnet/imagenet/tpu.yaml
For those who want to write their own training loops rather than use TensorFlow's high-level API for building and training deep learning models, please check out the custom training tutorials.

The following example runs a BERT classifier (run_classifier.py) on the MRPC task from GLUE with the mirrored distribution strategy:

export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
export MODEL_DIR=gs://some_bucket/my_output_dir
export GLUE_DIR=gs://some_bucket/datasets
export TASK=MRPC
python3 run_classifier.py \
--mode='train_and_eval' \
--input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
--train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
--eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
--bert_config_file=${BERT_BASE_DIR}/bert_config.json \
--init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
--train_batch_size=4 \
--eval_batch_size=4 \
--steps_per_loop=1 \
--learning_rate=2e-5 \
--num_train_epochs=3 \
--model_dir=${MODEL_DIR} \
--distribution_strategy=mirrored
As with TPU training for the ResNet model, users can switch to TPU for distributed training by changing the distribution strategy type to tpu and supplying the TPU information, as in the following example:

--distribution_strategy=tpu
--tpu=grpc://${TPU_IP_ADDRESS}:8470

If you want to use a pre-trained SavedModel provided by TensorFlow Hub, simply replace the init_checkpoint flag with the hub_module_url flag to specify a TensorFlow Hub module path:

--hub_module_url=https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1