One increasingly important task for machines to accomplish is reading text present in real-world situations. Think of self-driving cars reading road signs, or a translation app on your phone deciphering a foreign menu.
In this post, much like my last post about LSTMs solving XOR, I’d like to walk through a (hopefully small) bit of code that takes us from a real-world set of images of letters and digits to actual predictions with a decent level of accuracy.
The dataset we’ll use is Chars74K. It’s pretty simple to work with. There are 62 classes overall: 26 lowercase letters, 26 uppercase, and 10 digits. We’ll standardize on images of size \[ 18 \times 12 \], and the total number of samples (7,700) is small enough to be manageable while large enough to give us nice predictive power.
Before working with the data, we have to have the data. Let’s download it all, and do a minor bit of preprocessing cleanup here.
wget http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/EnglishImg.tgz
tar -xzf EnglishImg.tgz
mv English/ chars74k
rm EnglishImg.tgz
# some of the images for sample 53 have a different format, just remove them
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00049.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00028.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00024.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00009.png
rm chars74k/Img/GoodImg/Bmp/Sample053/img053-00035.png
Switching to Python now. It wouldn’t be machine learning if there weren’t a million imports.
import glob
import numpy as np
import random
from cv2 import resize
from matplotlib.image import imread
from matplotlib.pyplot import imshow
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, Convolution2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential
Somehow we’ll find a use for all of this stuff. In contrast to the LSTM XOR post, note that we now have convolution and pooling layers around. Perhaps less excitingly, we also have glob and imread to read in images from an outside dataset.
Our character classes are standardized by the order in which they appear below. The model itself will learn to predict a number from 0 to 61, and we can translate back and forth as needed.
SAMPLES = 7700
CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
ROWS = 18
COLS = 12
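As a quick illustration (these two lines aren’t part of the pipeline, just a sanity check), translating between class numbers and characters is plain indexing into CHARS, with CHARS.index going the other way:

print(CHARS[10])         # 'a', since classes 0-9 are the digits
print(CHARS.index('A'))  # 36, after 10 digits and 26 lowercase letters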
Let’s set up zero-valued arrays so we don’t have to do any slow copying or appending while building up our in-memory dataset. The images will be loaded in and stored as \[ 18 \times 12 \times 3 \] matrices. The 3 there is for each piece of RGB. Labels are read in as well—they’re just a long one-dimensional array of character class numbers.
images = np.zeros(shape=(SAMPLES, ROWS, COLS, 3))
labels = np.zeros(shape=(SAMPLES,), dtype=int)

sample_num = 0
for sample in sorted(glob.glob('chars74k/Img/GoodImg/Bmp/*')):
    for image in glob.glob(sample + '/*'):
        images[sample_num] = resize(imread(image), dsize=(COLS, ROWS))
        labels[sample_num] = int(sample[-2:]) - 1
        sample_num += 1
Here’s something new, and fairly important. Rather than use all of our data to both train and test a model (and risk overfitting), we’ll split off 20% of our data to use as a validation set. This helps us check that the predictions our model comes up with generalize to images it never saw during training.
images, images_test, labels, labels_test = train_test_split(images, labels, test_size=0.2)
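One optional tweak: if some characters turn out to be much rarer than others in the dataset, train_test_split accepts a stratify argument that preserves the class proportions in both splits.

# Optional: keep per-class proportions identical in train and test splits.
images, images_test, labels, labels_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels)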
It’s always good to check that your shapes match up with your expectations; a quick assert here saves a lot of potential headaches from mismatched dimensions later.
print(f'{images.shape[0]} sample images of size {images.shape[1]}x{images.shape[2]}')
print(f'{labels.shape[0]} labels')
assert images.shape[0] == labels.shape[0]
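With 7,700 total samples and an 80/20 split, those prints should read (6,160 being 80% of 7,700):

6160 sample images of size 18x12
6160 labels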
Here’s the actual model itself! First we convolve, then we pool, then we flatten, then we have some dense layers with dropout, and finally we activate. Any questions?
layers = \
    [ Convolution2D(128, 3, 3, input_shape=(ROWS, COLS, 3), activation='relu')
    , Convolution2D(256, 2, 1, activation='relu')
    , Convolution2D(512, 2, 1, activation='relu')
    , MaxPooling2D(pool_size=(2, 2))
    , Flatten()
    , Dense(1024, activation='relu')
    , Dropout(0.5)
    , Dense(512, activation='relu')
    , Dropout(0.5)
    , Dense(62)
    , Activation('softmax')
    ]

model = Sequential()
for layer in layers:
    model.add(layer)
Alright, so that’s actually pretty complicated. Here’s a super-brief overview of each piece: Convolution2D slides small learned filters across the image to pick up local patterns like edges and strokes. MaxPooling2D downsamples by keeping only the strongest activation in each region. Flatten unrolls the remaining 3D volume into a plain vector. Dense is a classic fully-connected layer. Dropout randomly zeroes activations during training so the network can’t lean too hard on any one feature. Finally, Activation('softmax') turns the 62 raw outputs into probabilities that sum to one.
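To make the arithmetic concrete, here’s a rough trace of how the shapes evolve through the stack. This assumes tf.keras’s defaults, with 'valid' padding and the third positional argument acting as the stride; the model.summary() call below will confirm the numbers.

# Rough shape trace (assuming 'valid' padding; third positional arg = stride):
#   input:                      18 x 12 x   3
#   Convolution2D(128, 3, 3):    6 x  4 x 128
#   Convolution2D(256, 2, 1):    5 x  3 x 256
#   Convolution2D(512, 2, 1):    4 x  2 x 512
#   MaxPooling2D((2, 2)):        2 x  1 x 512
#   Flatten():                  1024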
Now we train the model! Notice that we’re using our validation data here. The validation accuracy will usually come out lower than the training accuracy, and that’s exactly the point: it’s measured on images the model never trained on, so it reflects real generalization rather than memorization. Everything else is pretty standard. sparse_categorical_crossentropy means that we’re using numbered labels 0-61 instead of encoding our categorical predictions as one-hot vectors or something.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(images, labels, epochs=20, batch_size=128, validation_data=(images_test, labels_test))
model.summary()
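If you want a single headline number once training finishes, model.evaluate reports loss and accuracy over the held-out images:

# Evaluate on the held-out 20% for one summary accuracy figure.
loss, accuracy = model.evaluate(images_test, labels_test, verbose=0)
print('test accuracy: {:0.2f}%'.format(accuracy * 100))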
Here’s my favorite part. Let’s watch the model actually attempt to predict some characters based on images alone. Depending on how lucky you get, you might see an i or 1 or l with maybe 50% confidence, or perhaps an E sitting pretty at 99.9%.
predictions = model.predict(images_test)
for _ in range(3):
    i = random.randint(0, images_test.shape[0] - 1)
    print('\nselected a random image')
    imshow(images_test[i])
    print('prediction:', CHARS[np.argmax(predictions[i])])
    chance = predictions[i][np.argmax(predictions[i])]
    print('confidence: {:0.2f}%'.format(chance * 100))
    print('actual:', CHARS[labels_test[i]])
Here’s an example run. (I committed to publishing whatever the results were before actually running this, so hopefully it’s a fairly representative sample):
selected a random image
prediction: b
confidence: 49.79%
actual: 8
selected a random image
prediction: t
confidence: 99.58%
actual: t
selected a random image
prediction: a
confidence: 75.41%
actual: a
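If you’d rather point the model at one arbitrary image file, a tiny helper (hypothetical, not part of the pipeline above) does the same resize-then-predict dance:

# Hypothetical helper: classify a single image file from disk.
# Assumes a PNG, so imread returns floats in [0, 1] like the training data.
def classify(path):
    img = resize(imread(path), dsize=(COLS, ROWS))
    probs = model.predict(img[np.newaxis, ...])[0]
    return CHARS[np.argmax(probs)], probs.max()

char, confidence = classify('my_character.png')  # substitute any image path
print(char, confidence)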
That’s about all it takes to recognize characters, assuming, of course, that you have a nice clean dataset handy. This method generalizes pretty well to other limited-set image classification tasks. You could take a bunch of pictures of any smallish collection of objects you have, and see if your classifier can tell the objects in the collection apart.
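For instance, here’s a minimal sketch (folder names hypothetical) of loading your own photos, arranged one directory per class; it mirrors the loading loop above almost exactly:

# Hypothetical layout: my_objects/mug/*.png, my_objects/pen/*.png, ...
class_dirs = sorted(glob.glob('my_objects/*'))
paths = [(p, k) for k, d in enumerate(class_dirs) for p in glob.glob(d + '/*')]
my_images = np.zeros(shape=(len(paths), ROWS, COLS, 3))
my_labels = np.zeros(shape=(len(paths),), dtype=int)
for n, (path, k) in enumerate(paths):
    my_images[n] = resize(imread(path), dsize=(COLS, ROWS))
    my_labels[n] = k

From there the split and training code carry over unchanged, apart from the final Dense(62) needing to match your new class count.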
If you really wanted to go wild with this, you could look into techniques for locating characters within images in the first place, and individually cropping them out. That’s about all the prep you need before you start feeding them to this model. From there, you could start doing things like predicting street addresses from pictures of houses, or reading off book titles from cover images alone.