Optical character recognition with TensorFlow Lite: A new example app
September 27, 2021

Posted by Wei Wei, TensorFlow Developer Advocate

As the old adage goes, “a picture is worth a thousand words.” Images are rich in visual information, but sometimes the key is with the text within. While it is easy for literate human beings to read words embedded in images, how do we use computer vision and machine learning to teach computers to do so?

Today, we are going to show you how to use TensorFlow Lite to extract text from images on Android devices. We will walk you through the key steps of the Optical Character Recognition (OCR) Android app that we recently open sourced here, which you can refer to for the complete code. You can see how the app extracts the product names from three Google product logos in the animation below.

Optical Character Recognition demo

The process of recognizing text from images is called Optical Character Recognition and is widely used in many domains. For example, Google Maps uses OCR technology to automatically extract information from the geo-located imagery to improve Google Maps.

Generally speaking, OCR is a pipeline with multiple steps. Usually they consist of text detection and text recognition:

  • Use a text detection model to find out bounding boxes around text;
  • Do some post-processing to transform the bounding boxes;
  • Transform the images within those bounding boxes into grayscale, so that a text recognition model can map out the words and numbers.

In our case, we are going to leverage the text detection and text recognition models from TensorFlow Hub. There are several different model versions for speed / accuracy tradeoffs; we use the float16 quantized models here. For more information on model quantization, please refer to the TensorFlow Lite quantization section. We also use OpenCV, which is a widely used computer vision library for Non-Maximum Suppression (NMS) and perspective transformation (we’ll expand on this later) to post-process detection results. In addition, we use the TFLite Support Library to grayscale and normalize the images.

OCR pipeline from text detection, perspective transformation, to recognition
OCR pipeline from text detection, perspective transformation, to recognition.

For text detection, since the detection model accepts a fixed size of 320x320, we use the TFLite Support Library to resize and normalize the input image:

val imageProcessor =
 ImageProcessor.Builder()
   .add(ResizeOp(height, width, ResizeOp.ResizeMethod.BILINEAR))
   .add(NormalizeOp(means, stds))
   .build()
var tensorImage = TensorImage(DataType.FLOAT32)

tensorImage.load(bitmapIn)
tensorImage = imageProcessor.process(tensorImage)

Then we use TFLite to run the detection model:

detectionInterpreter.runForMultipleInputsOutputs(detectionInputs, detectionOutputs)

The output of the detection model is a number of rotated bounding boxes which contain the text in the image. We run Non-Maximum Suppression to identify one bounding box for each text block with OpenCV:

NMSBoxesRotated(
  boundingBoxesMat,
  detectedConfidencesMat,
  detectionConfidenceThreshold.toFloat(),
  detectionNMSThreshold.toFloat(),
  indicesMat
)

Sometimes texts inside images are distorted (e.g., the ‘kubernetes’ sticker on my laptop) with a perspective angle:

Perspective transformation demo
Perspective transformation demo

If we just feed the raw rotated bounding box into the recognition model, the model is unlikely to correctly identify the characters. In this case, we need to use OpenCV to do perspective transformation:

val rotationMatrix = getPerspectiveTransform(srcPtsMat, targetPtsMat)

warpPerspective(
  srcBitmapMat,
  recognitionBitmapMat,
  rotationMatrix,
  Size(recognitionImageWidth.toDouble(), recognitionImageHeight.toDouble())
)

After that, we use the TFLite Support Library again to resize, grayscale, and normalize the transformed images inside the bounding boxes:

val imageProcessor =
  ImageProcessor.Builder()
    .add(ResizeOp(height, width, ResizeOp.ResizeMethod.BILINEAR))
    .add(TransformToGrayscaleOp())
    .add(NormalizeOp(mean, std))
    .build()

Finally, we run the text recognition model, map out the characters and numbers from the model output, and update the app UI:

recognitionInterpreter.run(recognitionTensorImage.buffer, recognitionResult)

var recognizedText = ""
for (k in 0 until recognitionModelOutputSize) {
  var alphabetIndex = recognitionResult.getInt(k * 8)
  if (alphabetIndex in 0..alphabets.length - 1)
    recognizedText = recognizedText + alphabets[alphabetIndex]
}
Log.d("Recognition result:", recognizedText)
if (recognizedText != "") {
  ocrResults.put(recognizedText, getRandomColor())
}

That's it. We are now able to extract text from input images using TFLite within our app.

Finally, if you just want a ready-to-use OCR SDK, Google also offers on-device OCR functionality through ML Kit, which uses TFLite underneath and should be sufficient for most OCR use cases. There are some situations where you may want to build your own OCR solution with TFLite such as:

  • You have your own text detection / recognition TFLite models that you would like to use;
  • You have special business requirements (e.g. recognizing upside-down text) and need to customize the OCR pipeline;
  • You want to support languages not covered by ML Kit;
  • Your target user devices that don’t necessarily have Google Play services installed;
  • You want to have control over hardware backends (CPU / GPU / etc.) used to run your models.


In these cases, I hope that this tutorial and our example implementation can help you get started on building your own OCR functionality in your app.

You can learn more about OCR with the resources below.

Acknowledgements

The author would like to thank Tian Lin for the helpful feedback and community contributors @Tulasi123789 and @risingsayak for their prior work on OCR using TFLite (creating and uploading the models to TF Hub, providing accompanying notebooks, and etc.).

Next post
Optical character recognition with TensorFlow Lite: A new example app

Posted by Wei Wei, TensorFlow Developer Advocate As the old adage goes, “a picture is worth a thousand words.” Images are rich in visual information, but sometimes the key is with the text within. While it is easy for literate human beings to read words embedded in images, how do we use computer vision and machine learning to teach computers to do so? Today, we are going to show you how to use …