Introducing the Model Garden for TensorFlow 2
March 31, 2020
Posted by Jaeyoun Kim, Technical Program Manager, and Jing Li, Software Engineer
We would like to introduce an update to the Model Garden that provides TensorFlow users a centralized place to find code examples for state-of-the-art models and reusable modeling libraries for TensorFlow 2.

The Model Garden aims to demonstrate best practices for modeling so that TensorFlow users can take full advantage of TensorFlow 2 for their research and product development. TensorFlow Hub will continue to serve as a repository where people can easily search for ready-to-use pre-trained models. We plan to make more and more state-of-the-art models from the Model Garden available in TensorFlow Hub, and TensorFlow Hub pages will link to the model implementations in the Model Garden.

We are initially making available several new code examples for distributed training to solve Computer Vision and Natural Language Processing problems. The code examples in the Model Garden Official repository will be maintained by the TensorFlow team to keep them up to date with the latest TensorFlow 2 APIs.

The Model Garden can be easily installed with pip (pip install tf-models-nightly). TensorFlow 2 users can get started immediately with code examples to learn the best practices for training models on GPUs and TPUs.

Many of the models in the Model Garden can be trained in a distributed fashion. In TensorFlow 2, you can distribute training workloads across single-host, multi-accelerator as well as multi-host, multi-accelerator configurations using the Distribution Strategy API. In this blog, we will briefly explain some common distribution strategies; a minimal sketch of the general pattern follows below. Please visit TensorFlow.org for the full list of distribution strategies available in TensorFlow 2.
The following two sections (Computer Vision and Natural Language Processing) walk through examples that show how to use distributed training.
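Every strategy follows the same pattern: create a strategy object, then build and compile your model inside its scope. Here is a minimal sketch using tf.distribute.MirroredStrategy with Keras compile and fit; the toy model and data are illustrative placeholders, not taken from the Model Garden.

import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs visible on a
# single host and keeps the replica weights synchronized.
strategy = tf.distribute.MirroredStrategy()

# Variables must be created inside the strategy scope so that each
# replica receives a mirrored copy.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])

# A toy in-memory dataset; real training would read ImageNet, MRPC, etc.
x = tf.random.normal((256, 32))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=64, epochs=1)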

Computer Vision

Image classification with ResNet

The ResNet model can classify the major object in an image into 1,000 object categories (e.g., car, soccer ball, table lamp). The Model Garden provides an example showing how to train ResNet on GPUs and TPUs. The model is written in pure TensorFlow 2 APIs, which are user-friendly and follow an object-oriented style.
Distributed training on multiple GPUs
You can train a model on multiple GPUs using the tf.distribute.MirroredStrategy API. Here is an example that trains a model on ImageNet data using two GPUs. The classifier_trainer.py script is a new unified framework for training image classification models using TensorFlow's high-level API for building and training deep learning models (the Keras compile and fit methods).
$ python3 classifier_trainer.py \
    --mode=train_and_eval \
    --model_type=resnet \
    --dataset=imagenet \
    --model_dir=$MODEL_DIR \
    --data_dir=$DATA_DIR \
    --config_file=configs/examples/resnet/imagenet/gpu.yaml \
    --params_override="runtime.num_gpus=2"
You need to specify the number of GPUs (e.g., --params_override="runtime.num_gpus=2") to run synchronous distributed training across multiple GPUs on a single GPU host. You also need to find optimal values for the batch size, number of epochs, and number of training steps as you scale your training to more GPUs. Example values for the ImageNet dataset can be found in the configuration file (configs/examples/resnet/imagenet/gpu.yaml). For hyperparameter tuning, you may be interested in checking out Keras Tuner.
Distributed training on multiple GPU hosts
For distributed training across multiple GPUs on multiple GPU hosts, you need to set the TF_CONFIG environment variable on each GPU host to specify which tasks constitute the cluster, their addresses, and each task's role in the cluster. Here is a typical TF_CONFIG for the first GPU host, which is appointed as the chief worker:
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:port", "host2:port", "host3:port"]
    },
    "task": {"type": "worker", "index": 0}})
In this example, the "worker" part configures three GPU hosts (host1, host2, and host3) to run training using MultiWorkerMirroredStrategy. The "task" part specifies the role of the current task in the cluster. For the chief worker, which takes on a little more responsibility (e.g., saving checkpoints), you set the task type to "worker" and the task index to 0. MultiWorkerMirroredStrategy automatically uses all the available GPUs on each host.
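Once TF_CONFIG is set, every host runs the same training program, and the strategy reads TF_CONFIG to discover its peers. A minimal sketch of what each worker runs (note that in earlier TensorFlow 2 releases this strategy lives under tf.distribute.experimental):

import tensorflow as tf

# Reads TF_CONFIG from the environment to discover the cluster.
# In earlier TF 2 releases, use
# tf.distribute.experimental.MultiWorkerMirroredStrategy().
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build and compile the model inside the scope, exactly as in the
    # single-host case; gradients are synchronized across all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(optimizer="sgd", loss="mse")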
Distributed training on Cloud TPUs
TensorFlow 2 makes it easy to distribute and train models on different hardware configurations without changing the model definition. Users can train a ResNet model on Cloud TPUs using tf.distribute.TPUStrategy. You can use the same framework (classifier_trainer.py) to train a model using Cloud TPUs. You just need to use a different YAML configuration file for TPU and set --tpu=$TPU_NAME, where $TPU_NAME is your TPU instance name in the Cloud Console. Please see the TPU configuration file (configs/examples/resnet/imagenet/tpu.yaml).
$ python3 classifier_trainer.py \
    --mode=train_and_eval \
    --model_type=resnet \
    --dataset=imagenet \
    --tpu=$TPU_NAME \
    --model_dir=$MODEL_DIR \
    --data_dir=$DATA_DIR \
    --config_file=configs/examples/resnet/imagenet/tpu.yaml
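If you want to set up a TPUStrategy in your own script rather than through classifier_trainer.py, a minimal sketch looks like the following (the TPU_NAME environment variable here is an illustrative assumption; in earlier TensorFlow 2 releases the strategy lives under tf.distribute.experimental.TPUStrategy):

import os
import tensorflow as tf

# Resolve the TPU from its name (or grpc:// address) in the Cloud Console.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu=os.environ["TPU_NAME"])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# In earlier TF 2 releases: tf.distribute.experimental.TPUStrategy(resolver).
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(optimizer="sgd", loss="mse")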
If you prefer to write your own training loops rather than use TensorFlow's high-level API for building and training deep learning models, please check out the custom training tutorials.
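As a taste of what such a loop looks like, here is a minimal sketch of a distributed custom training step with MirroredStrategy; the model, loss, and data are illustrative placeholders (in early TF 2 releases, strategy.run was named strategy.experimental_run_v2).

import tensorflow as tf

GLOBAL_BATCH_SIZE = 64
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    optimizer = tf.keras.optimizers.SGD()
    # No reduction here: we average over the *global* batch ourselves.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

# A toy dataset, sharded across replicas by the strategy.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((256, 32)),
     tf.random.uniform((256,), maxval=10, dtype=tf.int32))
).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        per_example_loss = loss_fn(labels, logits)
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # Runs train_step on each replica and sums the per-replica losses.
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    loss = distributed_train_step(batch)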

Natural Language Processing (NLP)

Let’s move on to our next example for NLP. The Model Garden contains TensorFlow 2 implementations for the BERT (Bidirectional Encoder Representations from Transformers) and BERT variant models (e.g., ALBERT). Here we’ll demonstrate our best practices for training a TensorFlow 2 BERT model on GPUs and TPUs.

Sentence and Sentence-pair Classification using BERT

The sentence and sentence-pair classification task is to classify a given pair of sentences as paraphrases or not paraphrases. Here is an example of fine-tuning the BERT-LARGE model using multiple GPUs on Google Cloud Platform. This task uses the Microsoft Research Paraphrase Corpus (MRPC), which contains 5,801 pairs of sentences along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
export MODEL_DIR=gs://some_bucket/my_output_dir
export GLUE_DIR=gs://some_bucket/datasets
export TASK=MRPC
 
python3 run_classifier.py \
  --mode='train_and_eval' \
  --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
  --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
  --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
  --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --train_batch_size=4 \
  --eval_batch_size=4 \
  --steps_per_loop=1 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --model_dir=${MODEL_DIR} \
  --distribution_strategy=mirrored
As with TPU training for the ResNet model, users can easily switch to TPUs for distributed training by changing the distribution strategy type to tpu and providing the TPU information, as in the following example:
--distribution_strategy=tpu
--tpu=grpc://${TPU_IP_ADDRESS}:8470
If you want to use a pre-trained SavedModel provided by TensorFlow Hub, simply replace the init_checkpoint flag with the hub_module_url flag to specify a TensorFlow Hub module path:
--hub_module_url=https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1
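If you would rather load the same Hub module directly in your own Keras model, a minimal sketch follows; it assumes the tensorflow_hub package is installed, and the input names and two-output signature follow the published usage of version 1 of this module.

import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 128  # illustrative; must match your preprocessed inputs

# Load the BERT encoder from TF Hub as a trainable Keras layer.
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1",
    trainable=True)

input_word_ids = tf.keras.layers.Input(
    shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(
    shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(
    shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

# The module returns a pooled output for classification and a
# per-token sequence output.
pooled_output, sequence_output = bert_layer(
    [input_word_ids, input_mask, segment_ids])

# Attach a classification head for the paraphrase task (2 classes).
logits = tf.keras.layers.Dense(2)(pooled_output)
model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, segment_ids], outputs=logits)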

Next Steps

Please visit the Model Garden repository to find the example code we introduced in this article.

In the coming months, we will provide more state-of-the-art canonical models and example code for building your own models. We would like to encourage AI researchers and developers to leverage the Model Garden when building their models.

We also welcome contributions to the repository to benefit the entire TensorFlow community. If you need any help, please reach out to us on GitHub.