How to Make Real-Time Handwritten Text Recognition With Augmentation and Deep Learning
Use a convolutional recurrent neural network to recognize handwritten line-text images without pre-segmentation into words or characters, and use the CTC loss function to train it.

What is covered here:
- Offline handwritten text recognition
- The detailed architecture of the handwritten recognition system
- How to use data augmentation to increase accuracy and enable real-time operation
Why deep learning?

Classical machine learning needs a separate feature-extraction step before classification. Deep learning, by contrast, acts as a “black box” that performs feature extraction and classification on its own.
Consider the task of classifying a given image as face or non-face. With machine learning, you must first extract features of the image, such as edges, color, and shape, and then feed them to a classifier.
Deep learning extracts features and performs classification by itself: the image is given to a convolutional neural network, each layer of the CNN learns features, and a final fully connected layer performs the classification.
The main conclusion is that deep learning self-extracts features with deep neural networks and classifies itself. Compared to traditional algorithms, its performance increases with the amount of data.
This article is all about building your own handwritten recognition system with TensorFlow. It covers detailed intuition about the architecture, how I reached the solution, and how I increased accuracy. At the end, I provide the GitHub repo link where a pre-trained model is available. Note that you can build a handwritten recognition system in any language, since the architecture remains the same; you can build one in your own language, provided you have the dataset. Let’s get started!
Basic Intuition on How It Works

First, a convolutional recurrent neural network (CRNN) extracts the important features from the handwritten text image.
- The CNN output just before the fully connected layer (512x100) is passed to a BLSTM, which models sequence dependencies and operates over timesteps. The BLSTM output is 100x80, i.e., 100 timesteps and 80 characters, including the blank.
- Then the CTC loss (Connectionist Temporal Classification, introduced by Alex Graves et al.) is used to train the RNN. It eliminates the alignment problem in handwriting, since every writer aligns characters differently. We supply only what is written in the image (the ground-truth text) and the BLSTM output; CTC then sums the probabilities of all alignment paths that collapse to the ground truth. For an (X, Y) pair of image and text, the loss is

loss(X, Y) = -log p(Y | X), where p(Y | X) = Σ over all valid paths A of Π over timesteps t of p_t(a_t | X)

and training minimizes this negative log-likelihood of the ground-truth text.
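To make the sum-over-paths concrete, here is an illustrative NumPy implementation of the standard CTC forward algorithm (not the repository's code; `blank=0` and the timesteps-by-characters log-probability layout are assumptions for this sketch):

```python
import numpy as np

def ctc_loss(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward (alpha) recursion.
    log_probs: (T, C) log-probabilities per timestep; target: label
    indices without blanks."""
    T, _ = log_probs.shape
    # Interleave blanks around the target: "ab" -> [-, a, -, b, -]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                      # stay
            if s > 0:
                a = np.logaddexp(a, alpha[t - 1, s - 1])  # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = np.logaddexp(a, alpha[t - 1, s - 2])  # skip a blank
            alpha[t, s] = a + log_probs[t, ext[s]]
    # Valid endings: last label or trailing blank
    ll = alpha[T - 1, S - 1]
    if S > 1:
        ll = np.logaddexp(ll, alpha[T - 1, S - 2])
    return -ll

# Example: 2 timesteps, uniform probs over {blank, 'a'}; the three
# alignments (aa, -a, a-) each have prob 0.25, so p(Y|X) = 0.75
# and the loss is -ln(0.75).
```

Summing over alignments is what frees us from ever annotating where each character sits in the image.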

Finally, CTC decode is used to decode the output during prediction.
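The simplest decoder is best-path (greedy) decoding: take the most likely character per timestep, collapse repeats, then drop blanks. A minimal sketch (the charset layout with the blank at index 0 is an assumption for illustration):

```python
import numpy as np

def best_path_decode(probs, charset, blank=0):
    """probs: (timesteps, num_chars) per-timestep character probabilities;
    charset maps indices 1..N to characters (index 0 is the blank)."""
    best = np.argmax(probs, axis=1)          # most likely class per timestep
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:     # collapse repeats, drop blanks
            decoded.append(charset[idx - 1])
        prev = idx
    return "".join(decoded)

# e.g. a per-timestep argmax of [a, a, blank, b] decodes to "ab"
```

Beam-search or word-beam-search decoders follow the same collapse rule but keep several candidate paths instead of one.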
Detailed Architecture

Altogether, there are three steps:
- Multi-Scale Feature Extraction → Convolutional Neural Network (7 layers)
- Sequence Labeling (BLSTM-CTC) → Recurrent Neural Network (2 layers of LSTM) with CTC
- Transcription → Decoding the output of the RNN (CTC decode)

I hope that you have now understood the basic intuition of how it works. Let’s see the code and methods to increase accuracy.
Code
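As a reference point, here is a minimal Keras sketch of the architecture described above: 7 conv layers reducing the image to a 100-timestep feature sequence, 2 bidirectional LSTMs, and an 80-class softmax (blank included). The 800x64 input size and the exact filter/pool configuration are assumptions for this sketch, not the repository's exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(img_w=800, img_h=64, num_classes=80):
    inp = layers.Input(shape=(img_h, img_w, 1))  # grayscale line image
    x = inp
    # 7 conv layers; pooling shrinks height 64 -> 1 and width 800 -> 100,
    # so each remaining width position becomes one timestep.
    conv_cfg = [(64, (2, 2)), (128, (2, 2)), (128, (2, 2)), (256, (2, 1)),
                (256, (2, 1)), (512, (2, 1)), (512, None)]
    for filters, pool in conv_cfg:
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        if pool:
            x = layers.MaxPool2D(pool_size=pool)(x)
    x = layers.Reshape((img_w // 8, 512))(x)     # (100, 512) feature sequence
    for _ in range(2):                           # 2 bidirectional LSTM layers
        x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    out = layers.Dense(num_classes, activation="softmax")(x)  # (100, 80)
    return tf.keras.Model(inp, out)
```

The (100, 80) output matrix is exactly what the CTC loss consumes during training and the decoder consumes at prediction time.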
How do I increase accuracy?
Basically, to improve accuracy on image tasks, you perform data augmentation: the same image is created in different forms and given to the model for training, so the model becomes robust to images of different types.
But for real-time handwritten recognition, we also need to create handwritten line images like the one shown below, so the system can recognize lines written with a pen as well. You may observe that the IAM dataset images have thick strokes, so we perform augmentation both on the IAM dataset and on self-created line images to make the system more robust and able to work in real time.


For this handwritten recognition task, data augmentation includes reduced line thickness, random noise, a blur filter, and random stretch. Other techniques will also work well, but I found these four techniques sufficient for now. You'll notice that I haven’t introduced random rotation to it, since rotation doesn’t preserve the original property of a handwritten image.
If you look at the IAM dataset, the line images have thick strokes. We can use a dilation technique to make the line width thinner, since real-time handwriting was found to be thin.

Each augmentation in image output is shown below:
Random Noise Added

Random Stretch

Blur Image

The complete code to augment the IAM dataset's handwritten line images is available in the repository linked below.
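As a sketch of the four augmentations discussed above, here is a pure-NumPy version (the actual pipeline would typically use OpenCV; the function names and parameter values here are illustrative assumptions):

```python
import numpy as np

def add_random_noise(img, amount=0.02, rng=None):
    """Add Gaussian noise scaled to the 0-255 range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.astype(np.float32) + rng.normal(0, 255 * amount, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_stretch(img, max_factor=0.3, rng=None):
    """Randomly stretch or squeeze the width by resampling columns."""
    rng = np.random.default_rng() if rng is None else rng
    f = 1 + rng.uniform(-max_factor, max_factor)
    new_w = max(1, int(img.shape[1] * f))
    cols = np.linspace(0, img.shape[1] - 1, new_w).astype(int)
    return img[:, cols]

def box_blur(img, k=3):
    """Simple k x k box filter as a stand-in for an OpenCV blur."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def thin_strokes(img, k=3):
    """Grayscale dilation of the white background, which thins dark ink."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = padded[y:y + k, x:x + k].max()
    return out
```

Each function takes and returns a grayscale `uint8` image, so they can be chained in any order when building the augmented training set.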
Get Code and data
- Implementation: Handwritten Line Text Recognition with Deep Learning.
- A real-time handwriting dataset, saved in IAM dataset format: https://www.kaggle.com/sushant097/english-handwritten-line-dataset
Output
Prediction Output on IAM test dataset

Prediction Output on real handwritten image

Further Improvement
- Line segmentation can be added for full-paragraph text recognition. For line segmentation, you can use the A* path-planning algorithm or a CNN model to separate a paragraph into lines.
- Better image preprocessing, such as reducing background noise, to handle real-time images more accurately.
- A better decoding approach to improve accuracy. Some of the CTC decoders can be found here.
- What about localizing handwriting on a page and recognizing it with an end-to-end approach? (Use the object-localization concept to localize and segment the handwritten text.)
Conclusion
We have discussed how a CRNN (CNN + LSTM) recognizes text in images, along with its detailed architecture. The architecture consists of 7 CNN layers and 2 LSTM layers and outputs a character-probability matrix, which is used for CTC loss calculation and decoding. We also discussed how data augmentation techniques can increase accuracy and enable a real-time handwritten recognition system, with detailed code. Finally, further improvements to this system were given.
References
[1] A. Hannun, Sequence Modeling with CTC (2017)
[2] T. Bluche, J. Louradour, and R. Messina, Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention (2016)
[3] B. Shi, X. Bai, and C. Yao, An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (2015)