A look at how we built the Emoji Scavenger Hunt using TensorFlow.js
October 10, 2018
Posted by Jacques Bruwer, JK Kafalas, and Shuhei Iitsuka from Google Brand Studio

In this post we’ll discuss the inner workings of the experimental game Emoji Scavenger Hunt. We’ll show you how we used TensorFlow to train a custom model for object recognition and how we use that model on the web front-end with TensorFlow.js. We’ll also go over some of the challenges and workarounds when using browser APIs for camera access and text-to-speech. All the code for this game is open source and available on GitHub.

Introduction to the Emoji Scavenger Hunt

Emoji Scavenger Hunt is a fun experimental game in which you are shown an emoji and have a set number of seconds to find its real-world equivalent and point your phone camera at it, which extends your timer. As you play and find emojis in the real world, the emojis that follow become harder to find: the game starts with objects you are likely to have on hand, like a shoe, a book or your own hand :), and progresses to things like a banana, a candle or even a scooter.

Our goal is to showcase machine learning technology at work in a fun, interactive way.

Training a model for object recognition

The core functionality of the Emoji Scavenger Hunt game is identifying objects that your camera sees and matching that to an object (emoji) the game asked you to find. But how does the camera know what it is seeing? We needed a model that could help identify objects. Initially we started by using a pretrained model called MobileNet. This model is lightweight and optimized for mobile, but the objects in it were too specific to work well for our game. For example, dog breeds like “Golden Retriever” were identified but there was no generic object class for “dog”. We realized that training our own custom image recognition model was the right way to go.

This is where transfer learning comes in handy. Transfer learning is a technique that takes a machine learning model trained for one task and adapts it to another target task. We built our own custom model by retraining a model based on MobileNet, using a process similar to the one described in this TensorFlow retraining tutorial. We added a fully-connected layer which maps the default output logits to our desired emoji objects, such as “hand” and “keyboard”. We listed around 400 objects to recognize and gathered 100–1000 images as training data for each object. The added fully-connected layer infers these 400 objects by combining the 1000 signals from the output layer of MobileNet.
Architecture of our image recognition model. We trained a fully-connected layer added to the pretrained MobileNet.
The training script is available on the TensorFlow GitHub repository. We’ve packaged the training process as a Dockerfile so that you can train your own model by pointing it to your own image dataset.
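To make the arithmetic of the added layer concrete, here is a minimal sketch in plain TypeScript. It is not the training code: plain arrays stand in for tensors, the function names are hypothetical, and applying a softmax to the layer output is an assumption about how scores are turned into probabilities. In the real model the input would hold the 1000 MobileNet outputs and the weight matrix would have about 400 rows, one per emoji object.

```typescript
// Illustrative only: plain arrays stand in for tensors, and all names
// here are hypothetical rather than taken from the game's source.

// Fully-connected layer: weights[j][i] connects input feature i to class j.
function denseLayer(
    features: number[], weights: number[][], biases: number[]): number[] {
  return weights.map(
      (row, j) => row.reduce((acc, w, i) => acc + w * features[i], biases[j]));
}

// Numerically stable softmax (assumed here; subtracting the max avoids overflow).
function softmax(logits: number[]): number[] {
  const max = logits.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// Maps the base model's outputs to one probability per custom class.
function classify(
    features: number[], weights: number[][], biases: number[]): number[] {
  return softmax(denseLayer(features, weights, biases));
}
```

During retraining only `weights` and `biases` of this added layer would be learned, while the pretrained MobileNet that produces `features` stays fixed.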

We ran the script, feeding training image data into the model. To streamline our training process, we built the entire pipeline on Google Cloud Platform. All training data is stored in a bucket on Cloud Storage. A Cloud Storage Trigger in Cloud Functions starts a GPU instance on Compute Engine whenever a change is detected in the bucket. The GPU instance outputs the retrained model in TensorFlow SavedModel format and saves it in another bucket on Cloud Storage.
Data pipeline for model training

How we integrate with TensorFlow.js

Once the steps in the model training section above were complete, we ended up with a TensorFlow SavedModel for object recognition. In order to access and use this model in the browser via TensorFlow.js, we used TensorFlow.js converter to convert this SavedModel into a format that TensorFlow.js can load.

The act of identifying objects can be divided into two subtasks: first, grabbing pixels from the camera, and second, sending that image data to TensorFlow.js to predict what it could be, based on our previously trained model.

Camera and Model Setup

Before we can start predicting objects, we need to ensure that the camera (via MediaDevices.getUserMedia) is ready to show content and that our machine learning model is loaded and ready to start predicting. We use the following snippet to start both and perform some task setup before we can start predicting.
Promise.all([
  this.emojiScavengerMobileNet.load().then(() => this.warmUpModel()),
  camera.setupCamera().then((value: CameraDimentions) => {
    camera.setupVideoDimensions(value[0], value[1]);
  }),
]).then(values => {
  // Both the camera and model are loaded, we can start predicting
  this.predict();
}).catch(error => {
  // Some errors occurred and we need to handle them
});
Both the camera setup and the model loading resolve with a Promise once they have successfully completed. You will notice that once the model has been loaded, we call this.warmUpModel(). This function simply does a predict call passing in zeros to compile the program and upload weights to the GPU so that when we want to pass in real data to predict against, the model will be ready. This helps to make the initial prediction feel snappy.

Sending Image Data to TensorFlow.js

The following snippet (comments removed) is our predict function call which grabs data from the camera, parses it into the correct image size, sends it to our TensorFlow.js based MobileNet and uses the resultant identified objects to see if we found our emoji.
async predict() {
  if (this.isRunning) {
    const result = tfc.tidy(() => {

      const pixels = tfc.fromPixels(camera.videoElement);
      const centerHeight = pixels.shape[0] / 2;
      const beginHeight = centerHeight - (VIDEO_PIXELS / 2);
      const centerWidth = pixels.shape[1] / 2;
      const beginWidth = centerWidth - (VIDEO_PIXELS / 2);
      const pixelsCropped =
            pixels.slice([beginHeight, beginWidth, 0],
                         [VIDEO_PIXELS, VIDEO_PIXELS, 3]);

      return this.emojiScavengerMobileNet.predict(pixelsCropped);
    });

    const topK =
        await this.emojiScavengerMobileNet.getTopKClasses(result, 10);

    this.checkEmojiMatch(topK[0].label, topK[1].label);
  }
  requestAnimationFrame(() => this.predict());
}
Let’s look at this snippet in a little more detail. We wrap our whole predict code logic inside a requestAnimationFrame call to ensure the browser will perform this logic in the most efficient way possible when doing screen paint updates. We only execute the prediction logic if the game is in a running state. This way we can ensure that when doing screen animations like end and win screens we won’t be running any of the GPU intensive prediction code.

Another small but significant performance improvement is wrapping our TensorFlow.js logic inside a call to tf.tidy() (tfc.tidy() in the snippet above). This ensures that all TensorFlow.js tensors created during the execution of that logic get cleaned up afterwards, which keeps performance stable over long-running sessions. See https://js.tensorflow.org/api/latest/#tidy

The core of our prediction logic is extracting an image from the camera to send to TensorFlow.js. Instead of simply taking the whole camera image and sending that along, we slice out a portion from the center of the camera image and send that to TensorFlow.js. In our game we trained our model with reference images which are 224 x 224 pixels. Sending images to TensorFlow.js that have the same dimensions as our reference training data gives more accurate predictions. Our camera element (which is just an HTML video element) isn’t 224 x 224 pixels because we want a full-screen experience for the user, which means using CSS to stretch the camera element to 100% of the screen.
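The offsets for that center slice can be worked out with a small helper. This is a sketch, not the game's code: `centerCropBox` and `CropBox` are hypothetical names, and the `Math.floor` rounding and the size check are additions that the snippet above does not include (they keep the offsets as whole pixels when a frame dimension is odd and fail early when the frame is smaller than the crop).

```typescript
// Hypothetical helper: computes where a square crop of `cropSize` pixels
// starts so that it sits in the center of a frame.
interface CropBox {
  beginHeight: number;
  beginWidth: number;
  size: number;
}

function centerCropBox(
    frameHeight: number, frameWidth: number, cropSize: number): CropBox {
  if (frameHeight < cropSize || frameWidth < cropSize) {
    throw new RangeError(
        `Frame ${frameWidth}x${frameHeight} is smaller than crop ${cropSize}`);
  }
  return {
    // Floor so odd frame dimensions still give integer pixel offsets.
    beginHeight: Math.floor((frameHeight - cropSize) / 2),
    beginWidth: Math.floor((frameWidth - cropSize) / 2),
    size: cropSize,
  };
}
```

For a 640 x 480 video frame and a 224-pixel crop, this gives a slice starting 128 pixels from the top and 208 pixels from the left, which are the values that would be passed as the first argument to `pixels.slice`.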

The following reference image shows the slice in the top left corner which will be sent to TensorFlow.js. This isn’t implemented to be visible in the live version but simply shown here for reference.
The model then uses that image data to generate a list of the top 10 most likely items. You’ll notice we take the top 2 values and pass that along to checkEmojiMatch to determine if we found our match. We chose to use the top 2 matches instead of the topmost item simply because it made the game more fun and allowed us some leeway in matches based on our model. Having a model that was too accurate and strict caused users to get frustrated when objects weren’t being recognized.
In the image example above you can see we are currently tasked with finding a “keyboard” emoji. In this example we are also showing some debug information so you can see all 10 possible items the model is predicting based on the input image. The top two matches here are “hand” and “keyboard”, both of which are in the image, with “hand” having a slightly higher likelihood. The game detected a match even though “keyboard” is in the second spot, because we check the target against the top two matches.
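The top-two check can be sketched in a few lines of plain TypeScript. The implementations of getTopKClasses and checkEmojiMatch are not shown in this post, so the functions below (`topKClasses`, `matchesTarget`) are hypothetical stand-ins that only illustrate the idea of ranking scores and allowing some leeway.

```typescript
// Hypothetical stand-ins for illustration; not the game's actual functions.
interface Prediction {
  label: string;
  score: number;
}

// Pairs each score with its label and returns the k highest-scoring entries.
function topKClasses(
    scores: number[], labels: string[], k: number): Prediction[] {
  return scores.map((score, i) => ({label: labels[i], score}))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
}

// A target counts as found if it appears among the first `leeway` predictions.
function matchesTarget(
    target: string, predictions: Prediction[], leeway = 2): boolean {
  return predictions.slice(0, leeway).some((p) => p.label === target);
}
```

With scores where “hand” ranks first and “keyboard” second, `matchesTarget('keyboard', ...)` is true with the default leeway of 2 and false with a leeway of 1, which is the stricter behavior that felt frustrating in play testing.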

Giving our model a voice with text to speech

As a fun addition to the game, we used the SpeechSynthesis API to read aloud what the model “thinks” it’s seeing at each moment as you’re running around hunting for emoji. In Chrome on Android, this turned out to be trivial to implement with the code below:
speak(msg: string) {
  if (this.topItemGuess) {
    if ('speechSynthesis' in window) {
      let msgSpeak = new SpeechSynthesisUtterance();
      msgSpeak.voice = this.sleuthVoice['activeVoice'];

      msgSpeak.text = msg;
      speechSynthesis.speak(msgSpeak);
    }
  }
}
This API works great on-the-fly on Android, but iOS limits SpeechSynthesis calls to those directly tied to a user action (such as a click event), so we needed to find an alternative solution for that platform. We were already familiar with iOS’s requirement of tying an audio play event to a user action, which we handled for the other sounds in our game by starting all of our audio files playing when the user initially clicks the “Play” button, then immediately pausing all of them. In the end, we made an audio sprite containing all of the “success” speech lines (“Hey, you found a beer,” for instance). The drawback of this method is that the audio sprite file grows very large as more dialogue is added to it.

One approach we tried was to split the audio sprite up even further into prefixes (“Hey, you found a,” “Is that a”) and suffixes (“beer,” “banana,” etc.), but we discovered that iOS adds an inescapable one-second delay between playing a segment of one audio file, pausing, moving the playhead, then playing another segment of the same file. The gap between the prefixes and suffixes was so long that it felt jarring, and we frequently found that the speech would lag far behind the actual gameplay. We’re still investigating other options for speech improvements on iOS.

Below is our function to play an audio file, which contains additional code to handle playing a segment of an audio sprite via a start and stop timestamp:
playAudio(audio: string, loop = false, startTime = 0,
    endTime?: number) {
  let audioElement = this.audioSources[audio];
  if (loop) {
    audioElement.loop = true;
  }
  if (!this.audioIsPlaying(audio)) {
    audioElement.currentTime = startTime;
    let playPromise = audioElement.play();
    if (endTime !== undefined) {
      const timeUpdate = (e: Event) => {
        if (audioElement.currentTime >= endTime) {
          audioElement.pause();
          audioElement.removeEventListener('timeupdate', timeUpdate);
        }
      };
      audioElement.addEventListener('timeupdate', timeUpdate);
    }
    if (playPromise !== undefined) {
      playPromise.catch(error => {
        console.log('Error in playAudio: ' + error);
      });
    }
  }
}
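To call a function like this for one line of an audio sprite, you need the start and end timestamps of each line. One way to derive them is sketched below; `buildSpriteMap`, the clip names and durations, and the half-second gap between clips are all assumptions for illustration, not details of the game's actual sprite.

```typescript
// Hypothetical helper: lays clips end to end (with a silent gap between
// them) and records where each one starts and stops, in seconds.
interface Clip {
  name: string;
  duration: number;
}

interface SpriteSegment {
  startTime: number;
  endTime: number;
}

function buildSpriteMap(
    clips: Clip[], gap = 0.5): Record<string, SpriteSegment> {
  const segments: Record<string, SpriteSegment> = {};
  let cursor = 0;
  clips.forEach((clip) => {
    segments[clip.name] = {startTime: cursor, endTime: cursor + clip.duration};
    // Leave a little silence so a late 'timeupdate' does not bleed into
    // the next clip.
    cursor += clip.duration + gap;
  });
  return segments;
}

// Example durations are made up.
const successLines = buildSpriteMap([
  {name: 'beer', duration: 1.5},
  {name: 'banana', duration: 2},
]);
```

A segment could then be played with something like `playAudio('success', false, successLines['banana'].startTime, successLines['banana'].endTime)`, where `'success'` is a hypothetical key into `audioSources`. The gap is a design choice: `timeupdate` events fire at a coarse interval, so pausing exactly at `endTime` is not guaranteed.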

Adventures with camera access via getUserMedia

Emoji Scavenger Hunt relies heavily on being able to access the camera via JavaScript in the browser. We use the MediaDevices.getUserMedia API to do so. This API isn’t supported in all browsers, but recent versions of most major browsers have good support.

To access the camera via this API we use the following snippet:
if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
  const stream = await navigator.mediaDevices.getUserMedia({
    'audio': false,
    'video': {facingMode: 'environment'}
  });
  (window as any).stream = stream;
  this.videoElement.srcObject = stream;
}
This API provides a way to access both the front and rear facing camera by passing in a config object and specifying the facingMode.
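As a small illustration of that config object, the sketch below builds the constraints for either camera. The helper name `cameraConstraints` and its local types are hypothetical; the `facingMode` values `'user'` (front) and `'environment'` (rear) and the `{exact: ...}` form come from the Media Capture and Streams constraint syntax.

```typescript
// Hypothetical helper; local types are declared here so the sketch does
// not depend on DOM typings.
type FacingMode = 'user'|'environment';

interface CameraConstraints {
  audio: boolean;
  video: {facingMode: FacingMode|{exact: FacingMode}};
}

// With requireExact = false the browser treats facingMode as a preference
// and may fall back to another camera; with requireExact = true the
// request fails if no camera with that facing mode exists.
function cameraConstraints(
    facing: FacingMode, requireExact = false): CameraConstraints {
  return {
    audio: false,
    video: {facingMode: requireExact ? {exact: facing} : facing},
  };
}
```

The result can be passed straight to `navigator.mediaDevices.getUserMedia(...)`. Using the non-exact form, as in the snippet above, is the more forgiving choice on devices such as laptops that only have a front-facing camera.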

No access via UIWebViews

During testing we realized that Apple does not support the getUserMedia API in any UIWebView-based WebKit browser, meaning that any iOS app that implements its own browser, like third-party Twitter clients or Chrome on iOS, will not have access to the camera.

To resolve this issue we detect a camera initialization failure in those instances and prompt the user to open the experience in the native Safari browser.

Final Thoughts and Acknowledgements

With this experiment we wanted to create a fun and delightful game that made use of the amazing machine learning technology available in the browser today. It’s just a start, and we’d love to see what you build using the power of TensorFlow.js and TensorFlow.js converter. As mentioned, our code is available on GitHub, so please use it to kick-start your own ideas.

We’d like to thank Takashi Kawashima, Daniel Smilkov, Nikhil Thorat, and Ping Yu for all their help while building this experiment.