Hugging Face: State-of-the-Art Natural Language Processing in ten lines of TensorFlow 2.0
نومبر 04, 2019
A guest post by the Hugging Face team

Hugging Face is the leading NLP startup with more than a thousand companies using their library in production including Bing, Apple, Monzo.

All examples used in this tutorial are available on Colab. The links are available in the corresponding sections.

The HuggingFace Logo

Introduction

Hugging Face is an NLP-focused startup with a large open-source community, in particular around the Transformers library. 🤗/Transformers is a python-based library that exposes an API to use many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT, that obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. Those architectures come pre-trained with several sets of weights. Getting started with Transformers only requires to install the pip package:


pip install transformers

The library has seen super-fast growth in PyTorch and has recently been ported to TensorFlow 2.0, offering an API that now works with Keras’ fit API, TensorFlow Extended, and TPUs 👏. This blog post is dedicated to the use of the Transformers library using TensorFlow: using the Keras API as well as the TensorFlow TPUStrategy to fine-tune a State-of-The-Art Transformer model.

Library & Philosophy

Transformers is based around the concept of pre-trained transformer models. These transformer models come in different shapes, sizes, and architectures and have their own ways of accepting input data: via tokenization.

The library builds on three main classes: a configuration class, a tokenizer class, and a model class.
  • The configuration class: the configuration class hosts relevant information concerning the model we will be using, such as the number of layers and the number of attention heads. Below is an example of a BERT configuration file, for the pre-trained weights bert-base-cased. The configuration classes host these attributes with various I/O methods and standardized name properties.

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

  • The tokenizer class: the tokenizer class takes care of converting python string in arrays or tensors of integers which are indices in a model vocabulary. It has many handy features revolving around the tokenization of a string into tokens. This tokenization varies according to the model, therefore each model has its own tokenizer.
  • The model class: the model class holds the neural network modeling logic itself. When using a TensorFlow model, it inherits from tf.keras.layers.Layer which means it can be used very simply by the Keras’ fit API or trained using a custom training loop and GradientTape.

Joy in simplicity

The advantage of using Transformers lies in the straight-forward model-agnostic API. Loading a pre-trained model, along with its tokenizer can be done in a few lines of code. Here is an example of loading the BERT and GPT-2 TensorFlow models as well as their tokenizers:

from transformers import (TFBertModel, BertTokenizer,
                         TFGPT2Model, GPT2Tokenizer)

bert_model = TFBertModel.from_pretrained("bert-base-cased")  # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

gpt2_model = TFGPT2Model.from_pretrained("gpt2")  # Automatically loads the config
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

The weights are downloaded from HuggingFace’s S3 bucket and cached locally on your machine. The models are ready to be used for inference or finetuned if need be. Let’s see that in action.

Fine-tuning a Transformer model

Fine-tuning a model is made easy thanks to some methods available in the Transformer library. The next parts are built as such:
  • Loading text data and pre-processing it
  • Defining the hyper-parameters
  • Training (with Keras on CPU/GPU and with TPUStrategy)

Building an input pipeline

We have made an accompanying colab notebook to get you fast on track with all the code. We’ll leverage the tensorflow_datasets package for data loading. Tensorflow-dataset provides us with a tf.data.Dataset, which can be fed into our glue_convert_examples_to_features method.

This method will make use of the tokenizer to tokenize the input and add special tokens at the beginning and the end of sequences (like [SEP], [CLS], </s> or <s> for instance) if such additional tokens are required by the model. This method returns a tf.data.Dataset holding the featurized inputs.

We can then shuffle this dataset and batch it in batches of 32 units using standard tf.data.Dataset methods.

import tensorflow_datasets
from transformers import glue_convert_examples_to_features

data = tensorflow_datasets.load("glue/mrpc")

train_dataset = data["train"]
validation_dataset = data["validation"]

train_dataset = glue_convert_examples_to_features(train_dataset, bert_tokenizer, 128, 'mrpc')
validation_dataset = glue_convert_examples_to_features(validation_dataset, bert_tokenizer, 128, 'mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
validation_dataset = validation_dataset.batch(64)

Training with Keras’ fit method

Training a model using Keras’ fit method has never been simpler. Now that we have the input pipeline setup, we can define the hyperparameters, and call the Keras’ fit method with our dataset.

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

bert_history = bert_model.fit(
 bert_train_dataset, 
 epochs=2, 
 steps_per_epoch=115, 
 validation_data=bert_validation_dataset, 
 validation_steps=7
)

Training with Strategy

Training with a strategy gives you better control over what happens during the training. By switching between strategies, the user can select the distributed fashion in which the model is trained: from multi-GPUs to TPUs.

As of the time of writing, TPUStrategy is the only surefire way to train a model on a TPU using TensorFlow 2. Building a custom loop using a strategy makes even more sense in that regard, as strategies may easily be switched around and training on multi-GPU would require practically no code change.

Building a custom loop requires a bit of work to set-up, therefore the reader is advised to open the following colab notebook to have a better grasp of the subject at hand. It does not go into the detail of tokenization as the first colab has done, but it shows how to build an input pipeline that will be used by the TPUStrategy.

This makes use of Google Cloud Platform bucket as a means to host data, as TPUs are complicated to handle when using local filesystems. The colab notebook is available here.

Transformers now has access to TensorFlow APIs - So what?

The main selling point of the Transformers library is its model agnostic and simple API. Acting as a front-end to models that obtain state-of-the-art results in NLP, switching between models according to the task at hand is extremely easy.

As an example, here’s the complete script to fine-tune BERT on a language classification task(MRPC):

model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

data = tensorflow_datasets.load("glue/mrpc")
train_dataset = data["train"]
train_dataset = glue_convert_examples_to_features(train_dataset, tokenizer, 128, 'mrpc')

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.fit(train_dataset, epochs=3)

However, in a production environment, memory is scarce. You would like to use a smaller model instead; switching to DistilBERT for example. Simply change the first two lines to these two in order to do so:

model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilbertTokenizer.from_pretrained("distilbert-base-uncased")

As a platform hosting 10+ Transformer architectures, 🤗/Transformers makes it very easy to use, fine-tune and compare the models that have transfigured the deep-learning for NLP field. It serves as a backend for many downstream apps that leverage transformer models and is in use in production by many different companies. We’ll welcome any question or issue you might have on our GitHub repository.
Next post
Hugging Face: State-of-the-Art Natural Language Processing in ten lines of TensorFlow 2.0

A guest post by the Hugging Face team

Hugging Face is the leading NLP startup with more than a thousand companies using their library in production including Bing, Apple, Monzo.

All examples used in this tutorial are available on Colab. The links are available in the corresponding sections.

IntroductionHugging Face is an NLP-focused startup with a large open-source community, in particular around t…