Google Article
Lingvo: A TensorFlow Framework for Sequence Modeling
February 22, 2019
Posted by Jonathan Shen

Lingvo is the Esperanto word for “language”. The name alludes to the roots of the framework: it was developed as a general deep learning framework on top of TensorFlow, with a focus on sequence models for language-related tasks such as machine translation, speech recognition, and speech synthesis.

Internally, the framework gained traction and the number of researchers using it ballooned. As a result, there are now dozens of published papers with state-of-the-art results produced using Lingvo, with more to come. To support the research community and encourage reproducible research, we have open-sourced the framework and are starting to release the models used in our papers. Supported architectures range from traditional RNN sequence models to Transformer models and models that include VAE components.
Figure 1: An overview of the Lingvo framework, outlining how models are instantiated, trained, and exported for evaluation and serving.
Lingvo was built with collaborative research in mind, and promotes code reuse by sharing the implementation of common layers across different tasks. In addition, all layers implement the same common interface and are laid out in the same way. Not only does this produce cleaner and more understandable code, it also makes it extremely simple to apply improvements that someone else made for a different task to your own task. Enforcing this consistency does come at the cost of requiring more discipline and boilerplate, but Lingvo attempts to minimize this to ensure fast iteration time during research.
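To make the "common interface" concrete, here is a minimal sketch of the pattern Lingvo enforces: every layer exposes a `Params()` classmethod describing its hyperparameters, is instantiated from those params, and runs its forward pass through a uniformly named method. The class and attribute names below are illustrative stand-ins, not actual Lingvo code.

```python
class Params:
    """A toy hyperparameter container; Lingvo's real Params class is richer."""

    def __init__(self, cls, **defaults):
        self.cls = cls
        self.__dict__.update(defaults)

    def Set(self, **kwargs):
        # Mirror the fluent .Set(...) style used in Lingvo configs.
        self.__dict__.update(kwargs)
        return self

    def Instantiate(self):
        return self.cls(self)


class ProjectionLayer:
    """A toy layer following the shared interface."""

    @classmethod
    def Params(cls):
        return Params(cls, name='proj', input_dim=0, output_dim=0)

    def __init__(self, params):
        self.params = params

    def FProp(self, inputs):
        # Stand-in for the real matrix multiply: scale each input value so
        # the example stays dependency-free.
        p = self.params
        return [x * p.output_dim / max(p.input_dim, 1) for x in inputs]


p = ProjectionLayer.Params().Set(input_dim=4, output_dim=8)
layer = p.Instantiate()
out = layer.FProp([1.0, 2.0])  # each value scaled by 8 / 4
```

Because every layer follows this shape, swapping one layer for another in a model configuration is just a matter of pointing at a different `Params()`.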

Another aspect of collaboration is sharing reproducible results. Lingvo provides a centralized location for checked-in model hyperparameter configurations. Not only does this serve to document important experiments, it gives others an easy way to reproduce your results by training an identical model.
@classmethod
def Task(cls):
  p = model.AsrModel.Params()
  p.name = 'librispeech'

  # Initialize encoder params.
  ep = p.encoder
  # Data consists of 240-dimensional frames (80 x 3 frames), which we
  # re-interpret as individual 80-dimensional frames. See also
  # LibrispeechCommonAsrInputParams.
  ep.input_shape = [None, None, 80, 1]
  ep.lstm_cell_size = 1024
  ep.num_lstm_layers = 4
  ep.conv_filter_shapes = [(3, 3, 1, 32), (3, 3, 32, 32)]
  ep.conv_filter_strides = [(2, 2), (2, 2)]
  ep.cnn_tpl.params_init = py_utils.WeightInit.Gaussian(0.001)
  # Disable conv LSTM layers.
  ep.num_conv_lstm_layers = 0

  # Initialize decoder params.
  dp = p.decoder
  dp.rnn_cell_dim = 1024
  dp.rnn_layers = 2
  dp.source_dim = 2048
  # Use functional while based unrolling.
  dp.use_while_loop_based_unrolling = False

  tp = p.train
  tp.learning_rate = 2.5e-4
  tp.lr_schedule = lr_schedule.ContinuousLearningRateSchedule.Params().Set(
      start_step=50000, half_life_steps=100000, min=0.01)

  # Setting p.eval.samples_per_summary to a large value ensures that dev,
  # devother, test, testother are evaluated completely (since num_samples for
  # each of these sets is less than 5000), while train summaries will be
  # computed on 5000 examples.
  p.eval.samples_per_summary = 5000
  p.eval.decoder_samples_per_summary = 0

  # Use variational weight noise to prevent overfitting.
  p.vn.global_vn = True
  p.train.vn_std = 0.075
  p.train.vn_start_step = 20000

  return p
Figure 2: An example of a task configuration in Lingvo. The hyperparameters for each experiment are configured in their own class, separate from the code that builds the network, and checked into version control. Source.
While Lingvo started out with a focus on NLP, it is inherently very flexible, and models for tasks such as image segmentation and point cloud classification have been successfully implemented using the framework. Distillation, GANs, and multi-task models are also supported. At the same time, the framework does not compromise on speed, and features an optimized input pipeline and fast distributed training. Finally, Lingvo was put together with an eye towards easy productionization, and there is even a well-defined path towards porting models for mobile inference.

To jump straight into the code, check out our GitHub page and the codelab. To learn more details about Lingvo or some of the advanced features it supports, see our paper.