https://blog.tensorflow.org/2020/01/photobooth-lite-on-raspberry-pi-with-tensorflow-lite.html
Posted by Lucia Li, TensorFlow Lite Intern
Illustration of the Smart Photo Booth application running in real time.
We’re excited to showcase the experience of building a Smart Photo Booth application on Raspberry Pi with TensorFlow (we're not open-sourcing the code yet). It captures smiling faces and records them automatically, and you can interact with it using speech commands. Thanks to the TensorFlow Lite framework, the application handles smiling face detection and speech command recognition in real time.
Why should we build an application on Raspberry Pi?
Raspberry Pi is a widely used embedded platform that is also small and inexpensive. We chose TensorFlow Lite because it is designed specifically for mobile and IoT devices, which makes it a good fit for Raspberry Pi.
What do we need to build the Photo Booth App Demo?
We implemented our Photo Booth App on a Raspberry Pi 3B+ with 1GB of RAM, running a 32-bit ARMv7 operating system. The application takes both image and audio input, so we also need a camera and a microphone, plus a monitor for display. The total cost is under $100 USD. The details are listed below:
- A Raspberry Pi ($35)
‣ Parameters:
» Quad core 64-bit processor clocked at 1.4GHz.
» 1GB LPDDR2 SDRAM.
- A camera to capture images (~$15+)
- A microphone to sample audio data (~$5+)
- A 7-inch monitor (~$20+)
Our Photo Booth application relies on two key technologies. From the camera image input, we need to detect whether there is a smiling face. From the microphone audio input, we need to recognize whether a speech command, “yes” or “no”, was spoken.
How do we detect smiling faces?
Using a single model to detect faces and predict a smiling score, with both high accuracy and low latency, is difficult. We therefore detect a smiling face in three steps:
Smiling Face Detection Workflow
- Apply a face detection model to detect whether there is a face in the given image.
- If there is a face, crop it from the original image.
- With the cropped face image, apply a facial attribute classification model to measure whether it is a smiling face.
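The three steps above can be sketched as a small pipeline function. This is only an illustration of the control flow: `detect_face`, `crop_face`, and `classify_smile` are hypothetical callables standing in for the two models and the cropper, not names from our application.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def smile_score(
    image,
    detect_face: Callable[[object], Optional[Box]],   # hypothetical: returns a box or None
    crop_face: Callable[[object, Box], object],       # hypothetical: returns the cropped face
    classify_smile: Callable[[object], float],        # hypothetical: returns a score in [0, 1]
) -> Optional[float]:
    """Return a smiling probability, or None when no face is found."""
    box = detect_face(image)        # step 1: is there a face in the image?
    if box is None:
        return None                 # skip the classifier when there is no face
    face = crop_face(image, box)    # step 2: crop the face from the original image
    return classify_smile(face)     # step 3: score the cropped face
```

Returning early when no face is found means the second model only runs on frames that actually contain a face.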
We tried various options to reduce the latency for detecting a smiling face:
- To reduce memory usage and speed up execution, we leveraged the TensorFlow model optimization toolkit's post-training quantization. In this tutorial, you can see how easy it is to use in your own TensorFlow Lite model.
- We resized the original image captured from the camera while keeping its aspect ratio fixed. The compression ratio is 4 or 2, depending on the original size. We try to keep the image no larger than 160x160 (the model was originally designed for 320x320 input). Smaller inputs significantly reduce the inference time, as shown in the table below. In our application, the image captured from the camera is 640x480, so we resized it to 160x120.
- Instead of using the original image for facial attribute classification, we cropped a standard face and discarded the background. This reduces the input size while keeping the useful information.
- We used multiple threads for inference.
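As a rough illustration of the resizing step, the sketch below picks a compression ratio of 2 or 4 and computes the downscaled size. The selection rule (use 2 if the longer side already fits within 160 pixels, otherwise 4) is an assumption made for this example; the function name and the `target` parameter are hypothetical.

```python
from typing import Tuple


def downscaled_size(width: int, height: int, target: int = 160) -> Tuple[int, int, int]:
    """Return (ratio, new_width, new_height) for a compression ratio of 2 or 4."""
    # Assumed rule: prefer the milder ratio when it already fits the target size.
    for ratio in (2, 4):
        if max(width, height) // ratio <= target:
            return ratio, width // ratio, height // ratio
    # Larger inputs still get the strongest ratio mentioned in the post.
    return 4, width // 4, height // 4


ratio, new_w, new_h = downscaled_size(640, 480)
print(ratio, new_w, new_h)
```

Dividing both sides by the same ratio is what keeps the aspect ratio fixed, so a 640x480 capture ends up at 160x120.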
The table below shows the effects of the strategies we applied. We used the TensorFlow Lite benchmark_model tool to evaluate the performance of the face detection model on Raspberry Pi.
Face Detection Latency Comparison
The whole pipeline for detecting smiling faces, including the three steps mentioned above, takes 48.1ms on average with a single thread, which means we achieved real-time smiling face detection.
Face detection
Our face detection model consists of an 8-bit modified MobileNet v1 body and an SSD-Lite head with a 0.25 depth multiplier. Its size is only a little larger than 200kB. Why is this model so small? First, the TensorFlow Lite model is based on FlatBuffers, which is smaller in size than the TensorFlow model based on protobuf. Second, the model is quantized to 8 bits. Third, our modified MobileNet v1 has fewer channels than the original. Similar to most face detection models, our model outputs the position of a bounding box and 6 landmarks: the left eye, right eye, nose tip, mouth center, left ear tragion, and right ear tragion. We also apply non-maximum suppression to filter out duplicate detections of the same face. The inference time of our face detection TensorFlow Lite model is about 30ms, which means it can detect a face on Raspberry Pi in real time.
Example of the bounding box and 6 landmarks.
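For readers unfamiliar with non-maximum suppression, here is a minimal greedy version in plain Python. It is a generic textbook sketch, not the code used in our application; the box format and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: List[Box], scores: List[float], iou_threshold: float = 0.5
) -> List[int]:
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping detections of the same face collapse into the one with the higher score, while a second face elsewhere in the frame is kept.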
Face cropper
Detected faces can have various orientations and sizes. To normalize them for better classification, we rotate, crop, and resize the original image. The input of this function is the positions of the 6 landmarks from the face detection model. From the 6 landmarks, we can compute the rotation Euler angles and the resize ratio, which gives us a 128x128 standard face. The figure below shows an example of our face cropper function. The blue bounding box is the output of the face detection model, while the red bounding box is our calculated cropping bounding box. For pixels that fall outside the image, we replicate the border pixels.
Face Cropper Illustration
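To give a feel for the geometry, the sketch below derives an in-plane rotation angle and a resize ratio from the two eye landmarks only. This is a simplification: our cropper uses all 6 landmarks and full Euler angles, and the `target_eye_distance` of 48 pixels in the 128x128 output is a hypothetical value chosen for the example.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in pixels; y grows downward


def roll_and_scale(
    left_eye: Point, right_eye: Point, target_eye_distance: float = 48.0
) -> Tuple[float, float]:
    """Return (roll angle in degrees, resize ratio) from two eye landmarks.

    Assumes left_eye is the eye with the smaller x coordinate in the image.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    roll_degrees = math.degrees(math.atan2(dy, dx))  # 0 when the eyes are level
    eye_distance = math.hypot(dx, dy)
    if eye_distance == 0:
        raise ValueError("eye landmarks coincide")
    # Ratio that maps the measured eye distance onto the assumed target distance.
    return roll_degrees, target_eye_distance / eye_distance
```

Rotating the image by the negative of the roll angle levels the eyes, and the resize ratio brings faces of different sizes to a common scale before cropping.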
Face attribute classification
Our face attribute classification model is also an 8-bit quantized MobileNet model. With a 128x128 standard face as the input, the model outputs a float value from 0 to 1 that predicts the smiling probability. The model also outputs a 90-dimensional vector used to predict age from 0 to 90. Its inference time on Raspberry Pi is around 30ms.
How do we recognize speech commands?
Real-time speech command recognition can also be divided into three steps:
- Pre-processing: we use a sliding window to store the latest 1s of audio data, which advances by 512 frames on each update.
- Inference: given a 1s audio input, we can apply a speech command recognition model to get probabilities for four categories (“yes”/“no”/“silence”/“unknown”).
- Post-processing: we average the current inference result with previous ones. When the average probability of a word exceeds a threshold, we decide that a speech command has been detected.
The three steps are explained in detail below.
Pre-processing:
We use PortAudio, an open-source library, to get audio data from a microphone. The following figure shows how we store the audio data.
Audio Stream Processing
Since our model is trained on 1s audio clips sampled at 16kHz, the data buffer holds 16,000 samples. It is used as a circular buffer: each update writes 512 new frames, and we record an offset that marks the end of the last update. When a write reaches the tail of the buffer, it continues from the head. To get the audio data for inference, we start reading at the offset, wrap around, and stop at the frame just before the offset.
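The buffer behavior described above can be sketched in a few lines of Python. This is a simplified illustration rather than our PortAudio-based implementation; the class and method names are hypothetical, and it assumes one buffer entry per audio sample.

```python
class AudioRingBuffer:
    """Fixed-size circular buffer that holds the most recent audio samples."""

    def __init__(self, capacity: int = 16000):
        self.data = [0] * capacity
        self.offset = 0  # index just past the last written sample (the oldest sample)

    def write(self, frames) -> None:
        """Overwrite the oldest samples with new ones, wrapping at the tail."""
        for sample in frames:
            self.data[self.offset] = sample
            self.offset = (self.offset + 1) % len(self.data)

    def read(self) -> list:
        """Return all samples in chronological order, oldest first."""
        # Start at the offset, run to the tail, then continue from the head.
        return self.data[self.offset:] + self.data[:self.offset]
```

In use, each audio callback would call `write` with 512 new frames, and the inference thread would call `read` to get the latest 1s window in the right order.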
Speech command recognition
The speech command recognition model we used can be found publicly in many TensorFlow examples. It is composed of audio_spectrogram, MFCC, 2 convolutional layers, and 1 fully-connected layer. The input of this model is 1s of audio data with a sampling rate of 16kHz. The dataset is public, or you can collect your own and train on it. This dataset contains 30 categories of speech command data. Since we only need “yes” and “no”, we label all other categories as “unknown”. Additionally, we used other methods to reduce latency:
- We halved the number of channels. The TensorFlow Lite model size is about 1.9 MB after compression.
- We used 4 output channels in the last fully-connected layer rather than the usual 12, as we only need 4 categories.
- We used multiple threads for inference.
In training, we set the background volume to 0.3 to improve the noise tolerance of our model. We also set the silence percentage to 25% and the unknown percentage to 25% to balance the training set.
Post-processing:
Audio Stream Post-processing
Since a single 1s window may cover only part of a word, a single prediction is not very reliable. We therefore keep the previous results recorded no more than 1.5s earlier and average them with the current one. This significantly improves real-time keyword detection. The number of previous results we keep depends heavily on the inference time. For example, the inference time of our model on Raspberry Pi is about 160ms, so we can keep at most 9 previous results (1.5s / 160ms ≈ 9.4).
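The averaging step can be illustrated with a small Python class. It is a minimal sketch under stated assumptions: the class name, the 0.7 threshold, and the rule of only reporting labels listed in `commands` are hypothetical choices for this example, not values from our application.

```python
from collections import deque


class CommandSmoother:
    """Average recent inference results and report a command above a threshold."""

    def __init__(self, labels, commands=("yes", "no"), window_s=1.5, threshold=0.7):
        self.labels = list(labels)
        self.commands = set(commands)
        self.window_s = window_s      # keep results no older than this (seconds)
        self.threshold = threshold    # assumed value, tune for your model
        self.history = deque()        # (timestamp_s, probabilities)

    def update(self, timestamp_s, probabilities):
        """Add one inference result; return a detected command or None."""
        self.history.append((timestamp_s, list(probabilities)))
        # Drop results recorded more than window_s before the current one.
        while timestamp_s - self.history[0][0] > self.window_s:
            self.history.popleft()
        count = len(self.history)
        averages = [
            sum(probs[i] for _, probs in self.history) / count
            for i in range(len(self.labels))
        ]
        best = max(range(len(averages)), key=averages.__getitem__)
        if self.labels[best] in self.commands and averages[best] > self.threshold:
            return self.labels[best]
        return None
```

With results arriving roughly every 160ms, the deque naturally holds about 9 earlier entries, so one noisy prediction is not enough to trigger a command.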
What's next?
We hope to open-source the code for this example soon in the TensorFlow Lite GitHub repository. For more information about getting started with TensorFlow Lite, please see the TensorFlow Lite documentation and the other reference examples. Please let us know what you think or share your TensorFlow Lite use case with us.
Acknowledgements
Lucia Li, Renjie Liu, Tiezhen Wang, Shuangfeng Li, Lawrence Chan, Daniel Situnayake, Pete Warden.