Real World Character Recognition

Code from this post

One increasingly important task for machines to accomplish is reading text present in real-world situations. Think of self-driving cars reading road signs, or a translation app on your phone deciphering a foreign menu.

In this post, similar to my last post about LSTMs solving XOR, I’d like to walk through a (hopefully small) bit of code that lets us go from a real-world set of images of digits and characters, to actual predictions with a decent level of accuracy.

The dataset we’ll use is Chars74K. It’s pretty simple to work with. There are 62 classes overall: 26 lowercase letters, 26 uppercase, and 10 digits. We’ll standardize on images of size \[ 18 \times 12 \], and the total number of samples (7,700) is small enough to be manageable while large enough to give us nice predictive power.

Before working with the data, we have to have the data. Let’s download it all, and do a minor bit of preprocessing cleanup here.

wget http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/EnglishImg.tgz
tar -xzf EnglishImg.tgz
mv English/ chars74k
rm EnglishImg.tgz

# some of the images for sample 53 have a different format, just remove them
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00049.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00028.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00024.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00009.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00035.png

Switching to Python now. It wouldn’t be machine learning if there weren’t a million imports.

import glob
import numpy as np
import random
from cv2 import resize
from matplotlib.image import imread
from matplotlib.pyplot import imshow
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, Convolution2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential

Somehow we’ll find a use for all of this stuff. In contrast to the LSTM XOR post, note that we now have convolution and pooling layers around. Perhaps less excitingly, we also have glob and imread to read in images from an outside dataset.

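Class numbers follow the order of the characters in CHARS below. The model itself will learn to predict a number from 0 to 61, and we can translate back and forth as needed. Since the labels will come from the dataset’s folder numbering, this order has to match the folder order, so it’s worth spot-checking a few images against their labels.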

SAMPLES = 7700
CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
ROWS = 18
COLS = 12
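
Translating back and forth is just indexing into CHARS. Here’s a minimal sketch; the helper names char_to_label and label_to_char are mine for illustration and aren’t used elsewhere in this post.

```python
CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def char_to_label(char):
    """Return the class number (0 to 61) for a single character."""
    return CHARS.index(char)

def label_to_char(label):
    """Return the character for a class number (0 to 61)."""
    return CHARS[label]

print(len(CHARS))           # 62 classes in total
print(char_to_label('7'))   # digits map to themselves: 7
print(char_to_label('a'))   # first letter after the digits: 10
print(label_to_char(61))    # last class: Z
```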

Let’s set up zero-valued arrays so we don’t have to do any slow copying or appending while building up our in-memory dataset. The images will be loaded in and stored as \[ 18 \times 12 \times 3 \] arrays, where the 3 is for the red, green, and blue channels. Labels are read in as well; they’re just a long one-dimensional array of character class numbers.

images = np.zeros(shape=(SAMPLES, ROWS, COLS, 3))
labels = np.zeros(shape=(SAMPLES,), dtype=int)
sample_num = 0
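# Sample001 ... Sample062: one folder per character class
for sample in sorted(glob.glob('chars74k/Img/GoodImg/Bmp/*')):
    for image in glob.glob(sample + '/*'):
        # cv2's dsize is (width, height), hence (COLS, ROWS)
        images[sample_num] = resize(imread(image), dsize=(COLS, ROWS))
        # folder numbers start at 1, class numbers at 0
        labels[sample_num] = int(sample[-2:]) - 1
        sample_num += 1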
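
The label comes from the last two characters of the folder name, which is enough because there are only 62 folders. Pulled out on its own, that step might look like the sketch below. The function name label_from_folder is hypothetical, and the paths assume the folder layout from the shell commands earlier.

```python
import os

def label_from_folder(folder):
    """Map a folder such as '.../Sample011' to a zero-based class number."""
    name = os.path.basename(os.path.normpath(folder))  # e.g. 'Sample011'
    return int(name[-2:]) - 1  # folders are numbered from 1, classes from 0

print(label_from_folder('chars74k/Img/GoodImg/Bmp/Sample001'))  # 0
print(label_from_folder('chars74k/Img/GoodImg/Bmp/Sample062'))  # 61
```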

Here’s something new, and fairly important. If we trained and tested on the same images, we’d have no way to notice overfitting, so we’ll split off 20% of our data as a validation set that the model never trains on. Accuracy on those held-out images tells us how well the model’s predictions generalize to images it hasn’t seen.

images, images_test, labels, labels_test = train_test_split(images, labels, test_size=0.2)
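
To get a feel for what the split does, here’s a rough hand-rolled version using only the standard library. This is an illustration of the idea, not how scikit-learn implements train_test_split, and split_indices is a made-up name.

```python
import random

def split_indices(n_samples, test_fraction, seed=0):
    """Shuffle sample indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(7700, 0.2)
print(len(train_idx), len(test_idx))    # 6160 1540
print(set(train_idx) & set(test_idx))   # set(): no image lands in both
```

With 7,700 images and a 20% split, that leaves 6,160 images to train on and 1,540 held out.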

It’s always good to check that your shapes match up with your expectations. This is a quick check that saves a lot of potential headaches from mismatched dimensions.

print(f'{images.shape[0]} sample images of size {images.shape[1]}x{images.shape[2]}')
print(f'{labels.shape[0]} labels')
assert images.shape[0] == labels.shape[0]

Here’s the actual model itself! First we convolve, then we pool, then we flatten, then we have some dense layers with dropout, and finally we activate. Any questions?

layers = \
  [ Convolution2D(128, 3, 3, input_shape=(ROWS, COLS, 3), activation='relu')
  , Convolution2D(256, 2, 1, activation='relu')
  , Convolution2D(512, 2, 1, activation='relu')
  , MaxPooling2D(pool_size=(2, 2))
  , Flatten()
  , Dense(1024, activation='relu')
  , Dropout(0.5)
  , Dense(512, activation='relu')
  , Dropout(0.5)
  , Dense(62)
  , Activation('softmax')
  ]

model = Sequential()
for layer in layers:
    model.add(layer)
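
It can help to trace how the image shrinks on its way through the layers. The sketch below assumes the tensorflow.keras argument order, Conv2D(filters, kernel_size, strides), with the default 'valid' padding, so Convolution2D(128, 3, 3) is read as a 3x3 window moved 3 pixels at a time. (The older Keras 1 signature read the same two numbers as the window’s height and width, in which case the shapes would come out differently.)

```python
def valid_output(size, window, stride):
    """Output length along one axis with no padding ('valid')."""
    return (size - window) // stride + 1

rows, cols = 18, 12
# (layer name, window, stride, output channels), read as tf.keras would
steps = [
    ("conv 128", 3, 3, 128),
    ("conv 256", 2, 1, 256),
    ("conv 512", 2, 1, 512),
    ("max pool", 2, 2, 512),  # pooling stride defaults to the pool size
]
for name, window, stride, channels in steps:
    rows = valid_output(rows, window, stride)
    cols = valid_output(cols, window, stride)
    print(f"{name}: {rows} x {cols} x {channels}")

flat = rows * cols * channels
print("flatten:", flat)  # 1024 values feed the first dense layer
```

Under that reading, the 18x12 image becomes 6x4, then 5x3, then 4x2, and pooling leaves 2x1 with 512 channels. model.summary() will show the shapes your installed version actually produces.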

Alright, so that’s actually pretty complicated. Here’s a super-brief overview of each piece:

Convolution - think of this as mapping over the input data in a single sample with a small “window”, and thereby mashing up a bunch of pixel data into a form we can use more easily
Pooling - convolution can give us a ton of information. Let’s take nearby input pixels (or nearby features in general) and pool their resources together, turning lots of information into a smaller summary that’s easier to work with (max pooling, used here, keeps the largest value in each 2x2 window rather than an average)
Flatten - let’s go from a 2d image to a 1d layer that our upcoming dense neural network can work with nicely
Dense - what you probably think of when you think “layer of a neural network”
Dropout - turns out randomly forgetting information is good sometimes! Let’s randomly have some units (“neurons”) drop out and not matter. This reduces overreliance on single units, introduces redundant encodings of information, and helps our models generalize
Softmax activation - turn our neural network’s results into probabilities that our sample belongs in each class
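
A few of these pieces are small enough to write out by hand. Here’s a toy sketch of max pooling, flattening, and softmax in plain Python, on made-up numbers. It’s only meant to show the arithmetic; Keras does the same kind of thing on whole batches of multi-channel data.

```python
import math

def max_pool_2x2(grid):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]) - 1, 2)]
        for r in range(0, len(grid) - 1, 2)
    ]

def flatten(grid):
    """Concatenate the rows of a 2d grid into one list."""
    return [value for row in grid for value in row]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

features = [
    [1, 3, 0, 2],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(features)
print(pooled)            # [[4, 2], [2, 8]]
print(flatten(pooled))   # [4, 2, 2, 8]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # the largest score gets the largest share
```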

Now we train the model! Notice that we’re passing in our validation data here, so Keras reports accuracy on the held-out images after every epoch. Validation accuracy will usually come out lower than training accuracy. That’s expected: it’s measured on images the model never trained on, which makes it the more honest number, and a widening gap between the two is a sign of overfitting. Everything else is pretty standard. sparse_categorical_crossentropy means that we’re using numbered labels 0-61 directly instead of encoding them as one-hot vectors.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(images, labels, epochs=20, batch_size=128, validation_data=(images_test, labels_test))
model.summary()
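
To see why the “sparse” loss and the one-hot loss are the same thing, here’s a per-sample sketch on made-up probabilities. It’s a simplification: Keras also averages over the batch and guards against taking the log of zero, which this version skips.

```python
import math

def sparse_crossentropy(probs, label):
    """Loss when the target is a class number."""
    return -math.log(probs[label])

def onehot_crossentropy(probs, onehot):
    """Loss when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(onehot, probs))

probs = [0.1, 0.7, 0.2]   # made-up softmax output for 3 classes
label = 1                 # sparse target
onehot = [0, 1, 0]        # same target, one-hot encoded

print(sparse_crossentropy(probs, label))
print(onehot_crossentropy(probs, onehot))
```

Both lines print the same value, the negative log of 0.7. The sparse form just skips building a 62-entry vector that’s zero almost everywhere.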

Here’s my favorite part. Let’s watch the model actually attempt to predict some characters based on images alone. Depending on how lucky you get, you might see an i or 1 or l with maybe 50% confidence, or perhaps an E sitting pretty at 99.9%.

predictions = model.predict(images_test)
for _ in range(3):
    i = random.randrange(images_test.shape[0])  # randint's upper bound is inclusive and could index past the end
    print('\nselected a random image')
    imshow(images_test[i])
    print('prediction:', CHARS[np.argmax(predictions[i])])
    chance = predictions[i][np.argmax(predictions[i])]
    print('confidence: {:0.2f}%'.format(chance * 100))
    print('actual:', CHARS[labels_test[i]])

Here’s an example run. (I committed to publishing whatever the results were before actually running this, so hopefully it’s a fairly representative sample):

selected a random image
prediction: b
confidence: 49.79%
actual: 8

selected a random image
prediction: t
confidence: 99.58%
actual: t

selected a random image
prediction: a
confidence: 75.41%
actual: a
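
When the top guess is shaky, like the 49.79% one above, it can be useful to look at the runners-up as well. Here’s a small sketch; top_k is a hypothetical helper, and the probabilities are invented for illustration rather than taken from the model.

```python
CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def top_k(probs, k=3):
    """Return the k most likely (character, probability) pairs."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(CHARS[i], probs[i]) for i in ranked[:k]]

# Made-up probabilities for illustration only, not output from the model.
fake = [0.0] * len(CHARS)
fake[CHARS.index('b')] = 0.50
fake[CHARS.index('8')] = 0.40
fake[CHARS.index('B')] = 0.10

for char, prob in top_k(fake):
    print(f'{char}: {prob * 100:0.2f}%')
```

With the real model, passing predictions[i] in place of fake should work the same way, since all the helper needs is something it can index by class number.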

That’s about all it takes to recognize characters, of course assuming that you have a nice clean dataset handy. This method generalizes pretty well to other limited-set image classification tasks. You could take a bunch of pictures of any smallish collection of objects you have, and see if your classifier can tell the objects in the collection apart.

If you really wanted to go wild with this, you could look into techniques for locating characters within images in the first place, and individually cropping them out. That’s about all the prep you need before you start feeding them to this model. From there, you could start doing things like predicting street addresses from pictures of houses, or reading off book titles from cover images alone.