What’s new in TensorFlow 2.4?
December 14, 2020

Posted by Goldie Gadde and Nikita Namjoshi for the TensorFlow Team


TF 2.4 is here! With increased support for distributed training and mixed precision, a new NumPy frontend, and tools for monitoring and diagnosing bottlenecks, this release is all about new features and enhancements for performance and scaling.

New Features in tf.distribute

Parameter Server Strategy

In 2.4, the tf.distribute module introduces experimental support for asynchronous training of models with ParameterServerStrategy and custom training loops. Like MultiWorkerMirroredStrategy, ParameterServerStrategy is a multi-worker data parallelism strategy; however, the gradient updates are asynchronous.

A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and then read and updated by workers during each step. The reading and updating of variables happens independently across the workers without any synchronization. Because the workers do not depend on one another, this strategy has the benefit of worker fault tolerance and is useful if you use preemptible VMs.
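
To make the asynchronous update pattern concrete, here is a minimal plain-Python sketch. It is not the tf.distribute API: the ParameterServer class, the gradient and train_async functions, the toy data (y = 3x), and the learning rate are all hypothetical. Real workers run concurrently on separate machines, whereas this simulation interleaves them round-robin, so each worker pushes a gradient computed from a value that other workers have already changed.

```python
# Conceptual simulation only; none of these names come from TensorFlow.

class ParameterServer:
    """Holds one shared variable that workers read and update independently."""

    def __init__(self, value, learning_rate):
        self.value = value
        self.learning_rate = learning_rate

    def pull(self):
        return self.value

    def push(self, grad):
        # Apply the gradient immediately, without waiting for other workers.
        self.value -= self.learning_rate * grad


def gradient(w, x, y):
    # d/dw of 0.5 * (w * x - y) ** 2
    return (w * x - y) * x


def train_async(shards, steps, learning_rate=0.05):
    ps = ParameterServer(0.0, learning_rate)
    # Every worker starts by reading the initial value.
    pulled = [ps.pull() for _ in shards]
    for step in range(steps):
        for i, shard in enumerate(shards):
            x, y = shard[step % len(shard)]
            # Worker i uses the value it pulled earlier; other workers have
            # pushed updates since then, so this gradient is stale.
            ps.push(gradient(pulled[i], x, y))
            pulled[i] = ps.pull()
    return ps.value


# Three workers, each with its own shard of (x, y) pairs drawn from y = 3x.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(2.0, 6.0), (1.0, 3.0)],
    [(1.0, 3.0), (1.0, 3.0)],
]
final_w = train_async(shards, steps=200)
print(final_w)
```

Even though no worker waits for another, the shared variable in this toy setup still settles at the weight that fits the data (3.0). If one worker were dropped from the loop, the others would simply keep pushing updates, which is the intuition behind the fault-tolerance benefit described above.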

To get started with this strategy, check out the Parameter Server Training tutorial. This tutorial shows you how to set up ParameterServerStrategy and define a training step, and explains how to use the ClusterCoordinator class to dispatch the execution of training steps to remote workers.

Multi Worker Mirrored Strategy

MultiWorkerMirroredStrategy has moved out of experimental and is now part of the stable API. Like its single-worker counterpart, MirroredStrategy, MultiWorkerMirroredStrategy implements distributed training with synchronous data parallelism. However, as the name suggests, with MultiWorkerMirroredStrategy you can train across multiple machines, each with potentially multiple GPUs.

In synchronous training, each worker computes the forward and backward passes on different slices of the input data, and the gradients are aggregated before updating the model. For this aggregation, known as an all-reduce, MultiWorkerMirroredStrategy uses CollectiveOps to keep variables in sync. A collective op is a single op in the TensorFlow graph that can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology, and tensor sizes.
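
The synchronous pattern can be sketched in plain Python as well. This is a conceptual illustration, not how CollectiveOps are implemented: all_reduce_mean is a naive stand-in that averages the gradients in one place, whereas a real all-reduce exchanges data between devices. The function names, the linear model, the toy data (y = 2x + 1), and the choice of a mean rather than a sum are assumptions made for this example.

```python
# Conceptual simulation only; none of these names come from TensorFlow.

def all_reduce_mean(per_worker_grads):
    """Naive all-reduce: every worker receives the same averaged gradient."""
    num_workers = len(per_worker_grads)
    mean = [sum(g) / num_workers for g in zip(*per_worker_grads)]
    return [list(mean) for _ in range(num_workers)]


def local_gradient(weights, batch):
    # Gradient of 0.5 * mean squared error for the model y = w0 * x + w1.
    g0 = g1 = 0.0
    for x, y in batch:
        err = weights[0] * x + weights[1] - y
        g0 += err * x
        g1 += err
    n = len(batch)
    return [g0 / n, g1 / n]


def train_sync(shards, steps, learning_rate=0.2):
    # Each worker keeps its own replica of the variables.
    replicas = [[0.0, 0.0] for _ in shards]
    for _ in range(steps):
        # 1. Each worker computes gradients on its own slice of the data.
        grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
        # 2. The gradients are aggregated across workers.
        reduced = all_reduce_mean(grads)
        # 3. Every replica applies the identical update, so they stay in sync.
        for w, g in zip(replicas, reduced):
            for j in range(len(w)):
                w[j] -= learning_rate * g[j]
    return replicas


# Two workers, each holding half of a global batch drawn from y = 2x + 1.
shards = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
replicas = train_sync(shards, steps=500)
print(replicas)
```

Because every replica starts from the same values and applies the same aggregated gradient at each step, the replicas never diverge from one another. That is the property the aggregation step exists to guarantee.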

Graph of TF GPUs and CPU

To get started with MultiWorkerMirroredStrategy, check out the Multi-worker training with Keras tutorial, which has been updated with details on dataset sharding, saving/loading models trained with a distribution strategy, and failure recovery with the BackupAndRestore callback.

If you are new to distributed training and want to learn how to get started, or you’re interested in distributed training on GCP, see this blog post for an introduction to the key concepts and steps.

Updates in Keras

Mixed Precision

In TensorFlow 2.4, the Keras mixed precision API has moved out of experimental and is now a stable API. Most TensorFlow models use the float32 dtype; however, there are lower-precision types such as float16 that use less memory. Mixed precision is the use of 16-bit and 32-bit floating point types in the same model for faster training. This API can improve model performance by 3x on GPUs and 60% on TPUs.

To make use of the mixed precision API, you must use Keras layers and optimizers, but it’s not necessary to use other Keras classes such as models or losses. If you’re curious to learn how to take advantage of this API for better performance, check out the Mixed Precision tutorial.
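
To see why a model would mix the two types rather than switch to float16 everywhere, consider what happens to small weight updates. The sketch below uses only Python's struct module to round values to 16-bit and 32-bit floats. It does not use the Keras API, and the helper names, the starting weight, and the update size are hypothetical values chosen for illustration.

```python
import struct


def to_float16(x):
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack("e", struct.pack("e", x))[0]


def to_float32(x):
    # Round a Python float to the nearest IEEE 754 single-precision value.
    return struct.unpack("f", struct.pack("f", x))[0]


def apply_updates(weight, update, steps, cast):
    # Add a small update repeatedly, storing the weight in the given type.
    for _ in range(steps):
        weight = cast(weight + update)
    return weight


# float16 uses half the memory of float32 per value.
bytes_fp16 = struct.calcsize("e")
bytes_fp32 = struct.calcsize("f")

# Near 1.0, adjacent float16 values are about 0.001 apart, so an update of
# 0.0001 is rounded away every time it is applied.
w16 = apply_updates(1.0, 1e-4, 100, to_float16)
w32 = apply_updates(1.0, 1e-4, 100, to_float32)
print(bytes_fp16, bytes_fp32, w16, w32)
```

Here the float16 weight never moves from 1.0, while the float32 weight ends up at roughly 1.01. This is the kind of problem that keeping variables in float32 avoids, while the 16-bit type is still used where its lower memory cost and faster arithmetic pay off.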

Optimizers

This release refactors the tf.keras.optimizers.Optimizer class, enabling users of model.fit or custom training loops to write training code that works with any optimizer. All built-in tf.keras.optimizers.Optimizer subclasses now accept gradient_transformers and gradient_aggregator arguments, allowing you to easily define custom gradient transformations.

With the refactor, you can now pass a loss tensor directly to Optimizer.minimize when writing custom training loops:

tape = tf.GradientTape()
with tape:
  y_pred = model(x, training=True)
  # Keras losses take the labels first: loss_fn(y_true, y_pred).
  loss = loss_fn(y_true, y_pred)

# When `loss` is a `Tensor` rather than a callable, pass the `tf.GradientTape`
# that recorded it so the optimizer can compute the gradients.
optimizer.minimize(loss, model.trainable_variables, tape=tape)

These changes are intended to make both Model.fit and custom training loops more agnostic to optimizer details, allowing you to write training code that works with any optimizer without modification.

Functional API model construction internal improvements

Lastly, TensorFlow 2.4 includes a major refactoring of the internals of the Keras Functional API, improving the memory consumption of functional model construction and simplifying triggering logic. This refactoring also ensures TensorFlowOpLayers behave predictably and work with CompositeTensor type signatures.

Introducing tf.experimental.numpy

TensorFlow 2.4 introduces experimental support for a subset of NumPy APIs, available as tf.experimental.numpy. This module enables you to run NumPy code, accelerated by TensorFlow. Because it is built on top of TensorFlow, this API interoperates seamlessly with TensorFlow, allowing access to all of TensorFlow’s APIs and providing optimized execution using compilation and auto-vectorization. For example, TensorFlow ND arrays can interoperate with NumPy functions, and similarly TensorFlow NumPy functions can accept inputs of different types including tf.Tensor and np.ndarray.

import tensorflow as tf
import tensorflow.experimental.numpy as tnp

# Use NumPy code in input pipelines

dataset = tf.data.Dataset.from_tensor_slices(
    tnp.random.randn(1000, 1024)).map(
    lambda z: z.clip(-1,1)).batch(100)

# Compute gradients through NumPy code

def grad(x, wt):
  with tf.GradientTape() as tape:
    tape.watch(wt)
    output = tnp.dot(x, wt)
    output = tf.sigmoid(output)
  return tape.gradient(tnp.sum(output), wt)

You can learn more about how to use this API in the NumPy API on TensorFlow guide.

New Profiler Tools

MultiWorker Support in TensorFlow Profiler

The TensorFlow Profiler is a suite of tools you can use to measure the training performance and resource consumption of your TensorFlow models. The TensorFlow Profiler helps you understand the hardware resource consumption of the ops in your model, diagnose bottlenecks, and ultimately train faster.

Previously, the TensorFlow Profiler supported monitoring multi-GPU, single-host training jobs. In 2.4, you can now profile MultiWorkerMirroredStrategy training jobs. For example, you can use the sampling mode API to perform on-demand profiling and connect to the same server:port in use by MultiWorkerMirroredStrategy workers:


import tensorflow as tf

# Start a profiler server before your model runs.
tf.profiler.experimental.server.start(6009)

# Model code goes here....

# E.g. your worker IP addresses are 10.0.0.2, 10.0.0.3, 10.0.0.4, and you
# would like to profile for a duration of 2 seconds. The profiling data will
# be saved to the Google Cloud Storage path "gs://your_tb_logdir".
tf.profiler.experimental.client.trace(
    'grpc://10.0.0.2:6009,grpc://10.0.0.3:6009,grpc://10.0.0.4:6009',
    'gs://your_tb_logdir',
    2000)

Alternatively, you can use the TensorBoard profile plugin by providing the worker addresses to the Capture Profile tool.

After profiling, you can use the new Pod Viewer tool to choose a training step and view its step-time category breakdown across all workers.

TensorBoard preview

For more information on how to use the TensorFlow Profiler, check out the newly released GPU Performance Guide. This guide shows common scenarios you might encounter when you profile your model training job and provides a debugging workflow to help you get better performance, whether you’re training with one GPU, multiple GPUs, or multiple machines.

TFLite Profiler

The TFLite Profiler enables tracing TFLite internals in Android to identify performance bottlenecks. The TFLite Performance Measurement Guide shows you how to add trace events, enable TFLite tracing, and capture traces with both the Android Studio CPU Profiler and the System Tracing app.

Example trace using the Android System Tracing app


New Features for GPU Support

TensorFlow 2.4 runs with CUDA 11 and cuDNN 8, enabling support for the newly available NVIDIA Ampere GPU architecture. To learn more about CUDA 11 features, check out this NVIDIA developer blog.

Additionally, support for TensorFloat-32 on Ampere-based GPUs is enabled by default. TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere GPUs that causes certain float32 ops, such as matrix multiplications and convolutions, to run much faster but with reduced precision. To learn more, see the documentation for tf.config.experimental.enable_tensor_float_32_execution.
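
To get a feel for what "reduced precision" means here: TF32 keeps float32's 8 exponent bits but only 10 of its 23 mantissa bits. The plain-Python sketch below emulates that by rounding away the low 13 mantissa bits of a float32 value. The round_to_tf32 helper is hypothetical, handles finite inputs only, and assumes round-to-nearest behavior that may not match what the hardware does internally.

```python
import struct


def round_to_tf32(x):
    """Emulate TF32 precision: float32 range, but only 10 mantissa bits."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range to round to nearest, then clear the
    # 13 low mantissa bits that TF32 does not keep.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A difference of 2**-10 relative to 1.0 survives the rounding...
kept = round_to_tf32(1.0 + 2**-10)
# ...but a difference of 2**-12 is too small and is rounded away.
lost = round_to_tf32(1.0 + 2**-12)
print(kept, lost, round_to_tf32(0.1))
```

In other words, values keep roughly three significant decimal digits in this mode. If that is not enough for your model, the function mentioned above takes a boolean, so calling it with False turns TF32 execution off.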

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub. Thank you!
