Historians regard the written record as marking humanity’s transition out of prehistory. More importantly, handwriting (and accounting) enabled the further development of civilization through records of agricultural yields, livestock, births, and land ownership, which in turn led to centralized management and the rise of cities.
Given such a significant role, it is ironic that modern information processing systems cannot reliably “read” unstructured handwriting, particularly when the language is unknown or the writing is mixed with printed text and images. While Optical Character Recognition (OCR) of printed text has become robust, even routine, Handwriting Character Recognition (HCR) remains stubbornly difficult outside of controlled input conditions. The few successful applications, such as postal address reading or form scanning, require a defined input format with expected content. This article addresses the more difficult general problem of extracting and classifying handwriting of unknown location, size, color, content, and language in a document that may also contain arbitrary images and arbitrary printed text.
High value documents, such as mission plans, or intelligence reports, may be handwritten for cultural reasons or to frustrate electronic methods of surveillance. The age-old method of couriering sealed handwritten documents is impervious to modern threats of hacking and electronic attack. Most of today’s handwritten documents do not possess such levels of intrigue, but rather reflect everyday activities such as diaries, calendar notes, letters, to-do lists and other common artifacts. However, even these seemingly mundane snippets of information can shed light on an intelligence analysis problem if properly indexed and searched. Separating the wheat from the chaff is an overwhelming task given the large volume of documents which contain unstructured handwritten notes, mixed with print and images.
The first step in solving this problem is to discover the handwriting and determine the language so that the HCR algorithm can be properly initialized. This is a big data problem due to the magnitude of document datasets. It is a machine learning challenge due to the wide variability between languages, people, sensors and environmental conditions such as poor or uneven lighting. Figure 1 lists several techniques evaluated for a possible solution to this challenging task.
The simple process flow shown in Figure 2 does not reflect the combined algorithmic complexity of integrating and evaluating the various segmentation, recognition and language classification approaches. For example, considering the techniques listed in Figure 1, the total number of system configurations is the product of the number of algorithm choices at each stage. In this case there are four binarization techniques, two recognition techniques and three language classification techniques, for 24 configurations in total (4 × 2 × 3), which, when coupled with all the parameter settings of each individual algorithm, can easily stress test and evaluation capacities.
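The configuration count can be enumerated directly. A minimal sketch (the stage and algorithm names below are illustrative placeholders, not the system's actual module names):

```python
from itertools import product

# Illustrative algorithm choices per processing stage.
stages = {
    "binarization": ["otsu", "niblack", "sauvola", "modified_sauvola"],
    "recognition": ["line_cut", "cnn_page_layout"],
    "language_classification": ["surf_svm", "ngram_knn", "deep_cnn"],
}

# Every end-to-end pipeline is one choice from each stage.
configurations = list(product(*stages.values()))
print(len(configurations))  # 4 x 2 x 3 = 24 system configurations
```

Each configuration would additionally carry its own parameter grid, which is what multiplies the test and evaluation burden.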
Dividing the processing into three stages helps confine this complexity by allowing the algorithms at each stage to be optimized as an independent problem. Complexity is further managed by separating the document handling and user interfaces from the algorithm development within the evaluation architecture, as shown in Figure 3. In this way, the infrastructure can scale to the resources required to handle millions of documents in an automatic workflow, and users can direct and annotate processing results. Users can point the system to collections of scanned images and route the processed result to the appropriate language specialist. They can also mark the machine learning results as incorrect or mark missed detections for further analysis.
The goal of binarization is to convert the input document into a background, represented by 0 (zero, or logical false), and a foreground, represented by 1 (one, or logical true), which includes the objects of interest, in this case handwriting. This seemingly simple procedure often proves difficult due to variations in illumination, the condition of the paper, and other factors such as variations in the ink. However, the success of the later stages of handwriting recognition and language classification depends on good binarization, which simplifies the computerized interpretation and classification of the handwriting’s component objects.
Otsu’s Method Versus Modified Sauvola Method
Otsu’s method1 calculates a global threshold by maximizing the interclass variance between the foreground and the background. This approach can completely fail when the handwriting is a light gray, such as when using a pencil, and the rest of the image has darker interfering elements such as machine printed text or images.
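Otsu's threshold can be computed from the gray-level histogram alone. A minimal NumPy sketch of the idea (not the production implementation), exercised on a synthetic bimodal page:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level that maximizes between-class (interclass) variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                 # background class probability
    mu = np.cumsum(prob * np.arange(256))   # cumulative mean
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# A bimodal test image: dark ink (~50) on a light page (~200).
page = np.full((64, 64), 200, dtype=np.uint8)
page[20:30, 10:50] = 50          # simulated stroke
t = otsu_threshold(page)
binary = page <= t               # foreground (ink) = True
```

On a clean bimodal image like this, the single global threshold cleanly separates ink from paper; the failure mode described above arises when faint pencil strokes and darker print share one histogram.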
Niblack2 first applied an adaptive method that adjusts the binarization threshold much as a Constant False Alarm Rate (CFAR) detector does, making the threshold proportional to the local mean and standard deviation of a sliding window. Subsequent experiments by Sauvola3 showed that including a term proportional to the product of the local mean and standard deviation could provide better results. We modified Sauvola’s method to first pre-process the input image with histogram stretching, applied only in those places that have energy above a threshold. Energy is detected by dividing the document into small sub-blocks. At each sub-block position the maximum intensity difference is recorded. The resulting sub-block image is interpolated to the original image size, and morphological operations are used to select the higher energy regions for processing. Without this selective contrast stretching, the salt-and-pepper noise in non-information-bearing parts of the document is amplified, causing false detections. Selectively stretching contrast in this manner contributed to better handling of lighting variations and separation of the foreground images, as shown in Figure 4.
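The Sauvola threshold itself is T = m · (1 + k · (s/R − 1)), where m and s are the local window mean and standard deviation. A minimal sketch using SciPy's uniform filter as the sliding window (the parameters k and R are common literature defaults, and the selective contrast stretching step described above is omitted for brevity):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Adaptive Sauvola threshold: T = m * (1 + k * (s / R - 1))."""
    img = gray.astype(float)
    mean = uniform_filter(img, window)                  # local mean m
    mean_sq = uniform_filter(img ** 2, window)
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0))   # local std s
    threshold = mean * (1.0 + k * (std / R - 1.0))
    return img < threshold  # foreground (ink) = True

# Light-gray "pencil" stroke (120) on unevenly lit paper (180-220):
# a single global threshold can miss this, but the local one adapts.
rows = np.linspace(180, 220, 64)[:, None]
page = np.tile(rows, (1, 64))
page[30:34, 8:56] = 120          # simulated pencil stroke
mask = sauvola_binarize(page)
```

Computing the local statistics with separable box filters keeps the cost linear in the number of pixels, which matters at the million-document scale discussed earlier.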
Separation of handwriting on top of machine printed text is much less difficult if a color difference can be exploited. For example, handwriting is often in blue ink as depicted in Figure 5. Algorithms were created to extract these text objects by exploiting detectable color differences. The document is converted from RGB to the Lab color system in order to create color filters. The Lab color system is a mathematically described color space where L is Lightness and a and b represent the color components green–red and blue–yellow respectively.
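A blue-ink filter in Lab space keeps pixels whose b component is strongly negative (toward blue) regardless of lightness. A self-contained sketch with a hand-rolled sRGB-to-Lab conversion (D65 white point; the −30 threshold on b is an illustrative choice, not the system's tuned value):

```python
import numpy as np

def rgb_to_lab(rgb):
    """Convert an (..., 3) sRGB array in [0, 255] to CIELAB (D65 white point)."""
    c = rgb.astype(float) / 255.0
    # sRGB gamma -> linear RGB
    lin = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    # linear RGB -> XYZ (standard sRGB matrix), normalized by D65 white
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = lin @ M.T / np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def blue_ink_mask(rgb, b_threshold=-30.0):
    """True where the pixel is noticeably blue (Lab b* below threshold)."""
    return rgb_to_lab(rgb)[..., 2] < b_threshold

# Blue ink vs. black print vs. white paper: only the blue pixel passes.
pixels = np.array([[[30, 60, 200], [20, 20, 20], [240, 240, 240]]],
                  dtype=np.uint8)
mask = blue_ink_mask(pixels)
```

Because neutral grays and blacks have a ≈ b ≈ 0 in Lab, dark machine print falls outside the filter no matter how dark it is, which is exactly the separation the Figure 5 example relies on.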
The blue Arabic handwriting is easily extracted by the Lab color filter and sent downstream for language identification. In the figure there is black Arabic handwriting below the blue that was not extracted by the blue color filter. In these cases, when there is no detectable color difference, a graph-based segmentation algorithm is used to extract the handwriting.
Deep Learning CNN Page Layout Versus Line Cuts
Characteristic features that distinguish handwriting from printed text include poor alignment of characters within a word, or between words within a phrase or sentence; variations in relative character sizes; and greater variation in character spacing than occurs in printed text. In order to capture these properties, it is necessary to first properly align the suspected handwriting, and then measure the lack of printed text uniformity. Conventional methods to distinguish handwriting from machine printed text exploit these variations using a line cut method. Non-uniformities between characters and words can be measured with a horizontal line cut. Vertical line cuts can detect the uniformity of lower and upper case variation in printed text. The problem with the line cut method is that embedded images, such as those found in magazines or news articles, also have unstructured variation which can cause the algorithm to confuse images for handwriting.
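The horizontal line cut amounts to a row projection profile: summing the ink in each row yields nearly equal, regularly spaced bands for printed text and irregular bands for handwriting. A toy version measuring that irregularity (the score and the synthetic examples are illustrative, not the fielded algorithm):

```python
import numpy as np

def row_profile_irregularity(binary: np.ndarray) -> float:
    """Coefficient of variation of inked-row band heights.

    Printed text lines have nearly equal heights and spacing, so the
    variation is small; handwriting varies far more.
    """
    profile = binary.sum(axis=1)      # ink per row (horizontal line cut)
    inked = profile > 0
    # Lengths of consecutive inked-row runs (text line heights).
    runs, count = [], 0
    for flag in inked:
        if flag:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    if len(runs) < 2:
        return 0.0
    runs = np.array(runs, dtype=float)
    return float(runs.std() / runs.mean())

# Regular "printed" lines: equal 4-row bands separated by 4-row gaps.
printed = np.zeros((40, 40), dtype=bool)
for top in range(0, 40, 8):
    printed[top:top + 4, :] = True

# Irregular "handwritten" lines: band heights 2, 7, 3 and 9.
handwritten = np.zeros((40, 40), dtype=bool)
for top, h in [(1, 2), (8, 7), (20, 3), (28, 9)]:
    handwritten[top:top + h, :] = True
```

An embedded photograph produces an irregular profile too, which is precisely the confusion between images and handwriting described above.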
An alternate solution was evaluated using a deep learning convolutional neural network (CNN) to recognize handwriting, machine printed text, and images. The machine learning algorithm was first trained with a dataset of printed text, handwriting, and images. Then a page from a travel article about Washington D.C. written by the author was marked with handwriting and processed as an image by the algorithm (Figure 6). The red boxes indicate handwriting detected by the CNN and the green boxes are detected images. Since the entire page is an image, the deep learning CNN is detecting images inside the overall image. Similar to the line cut problem, handwriting is sometimes falsely detected inside the images. However, in this implementation it is a configuration option to process candidate handwriting on top of detected images.
The most difficult part of machine learning is oftentimes training set preparation. In this case, the available handwriting data consisted of a collection of handwritten documents in various languages. Our approach segmented each document into a collection of small, binarized images as shown in Figure 7.
Once the handwriting has been extracted, various versions are created using image warping routines to slant the image to the left and to the right. The middle column of Figure 8 shows the original extracted handwriting (top) and, below it, two warped versions of the image. In addition, each image is rotated left and right in 1-degree increments up to +/- 8 degrees (left and right columns).
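This augmentation step can be sketched with SciPy's ndimage transforms. Assuming one extracted binary patch, the following generates the two sheared (slanted) variants plus rotations in 1-degree steps out to ±8 degrees (the shear factor of 0.3 is an illustrative choice):

```python
import numpy as np
from scipy.ndimage import affine_transform, rotate

def augment(patch: np.ndarray, shear: float = 0.3, max_deg: int = 8):
    """Return the original patch plus slanted and rotated variants."""
    variants = [patch]
    # Horizontal shear left and right (slants the handwriting):
    # output pixel (r, c) samples input at (r, c + s*r).
    for s in (shear, -shear):
        matrix = np.array([[1.0, 0.0], [s, 1.0]])
        variants.append(affine_transform(patch.astype(float), matrix) > 0.5)
    # Rotations in 1-degree increments up to +/- max_deg.
    for deg in range(1, max_deg + 1):
        for sign in (1, -1):
            variants.append(rotate(patch.astype(float), sign * deg,
                                   reshape=False, order=1) > 0.5)
    return variants

patch = np.zeros((32, 32), dtype=bool)
patch[14:18, 4:28] = True       # a simple horizontal stroke
versions = augment(patch)       # 1 original + 2 shears + 16 rotations = 19
```

Multiplying each training sample nineteen-fold in this way makes the classifier less sensitive to the slant and skew variation natural in handwriting.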
Three methods were evaluated to classify the detected handwriting: SURF-based features with Support Vector Machine (SVM) classifier; N-gram histogram vectors with K-Nearest Neighbor classifier; and deep learning Convolutional Neural Net.
SURF-Based Features with Support Vector Machine (SVM) Classifier
Initial experiments utilized the Speeded Up Robust Features (SURF)4 algorithm to identify keypoints in the training set. The strongest keypoints were selected to create a visual “bag-of-words,” which was then used in a Support Vector Machine to distinguish the language classes.5 This approach worked surprisingly well as an “off-the-shelf” algorithm with no specific customization for handwriting language recognition, achieving accuracies of more than 75%.
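The bag-of-visual-words pipeline can be sketched end to end. SURF itself lives in opencv-contrib, so for a self-contained illustration the sketch below substitutes random 64-dimensional vectors for SURF descriptors; the vocabulary, histogram, and SVM stages are the same ones the text describes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for 64-D SURF descriptors: two "languages" whose keypoint
# descriptors cluster around different centers. (Real descriptors would
# come from a SURF detector run on the handwriting images.)
def fake_descriptors(center, n=40):
    return center + 0.1 * rng.standard_normal((n, 64))

centers = rng.standard_normal((4, 64))
docs, labels = [], []
for label, lang_centers in enumerate((centers[:2], centers[2:])):
    for _ in range(10):
        docs.append(np.vstack([fake_descriptors(c) for c in lang_centers]))
        labels.append(label)

# 1) Build the visual vocabulary by clustering all descriptors.
vocab = KMeans(n_clusters=8, n_init=10, random_state=0).fit(np.vstack(docs))

# 2) Represent each document as a histogram over the visual words.
def bow_histogram(descriptors):
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=8) / len(words)

X = np.array([bow_histogram(d) for d in docs])

# 3) Train an SVM on the bag-of-words vectors.
clf = SVC(kernel="linear").fit(X, labels)
accuracy = clf.score(X, labels)
```

The appeal of this approach, as noted above, is that nothing in it is handwriting-specific: the same vocabulary-plus-SVM machinery works off the shelf.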
N-gram histogram vectors with K-Nearest Neighbor Classifier
The first step in this method was to approach the problem in much the same way as early cryptographers did, by noting that n-gram character sets from secret writing (where n is 1, 2, or 3) exhibit statistical frequencies characteristic of the language the code conceals. By constructing sub-blocks of the text, we believed we could map their frequencies to a set of the major languages. Approximately 100 individual features were designed to capture the uniqueness of each language (Figure 9). For example, the French phrase “Où dans la forêt est le garçon étudiant naïf?” illustrates all five French accent marks: grave, circumflex, cedilla, acute and umlaut. We first detected the appearance of these marks and then encoded them as features, assigning each feature a unique number. Next, a detector was designed for each feature. Similarly, detectors were developed for other languages, such as the distinctive arrangement of circles and lines found in Korean, “제 눈에 안경이다” (“beauty is in the eye of the beholder”); the curves of Arabic, “اصبر تنل” (“be patient”); the multiple orthogonal intersections of Chinese, “一见钟情” (“love at first sight”); and so on for Japanese, Urdu, Persian, Bengali, Hindi, Portuguese, Russian, Swahili, Tamil, Telugu, and Turkish.
Once each feature was detected and encoded into a number, the language classification process began. The approach was patterned on the successful Cavnar and Trenkle technique6 used on characters (not handwriting), in which histograms of n-grams are formed to create a language profile. An n-gram, in this case, is an occurrence of two features together. The language profile vector of normalized n-gram counts is developed during training and stored for each language. During testing, n-gram profile vectors from the test document are compared to the stored profile vectors, and the closest match is the reported language.
The feature bi-gram approach takes advantage of the fact that certain handwritten strokes uniquely appear together in certain languages. For example, the letters ‘th’ are the most common character bi-gram in the English language. The circumflex found above the ê (e-circumflex) was assigned feature code 64, and the top part of the “e” is coded as a “South horseshoe” feature, code 18 (see Figure 9). The bi-gram formed by features 18 and 64 appearing together marks the occurrence of the e-circumflex found in Afrikaans, Dutch, French, Friulian, Kurdish, Portuguese, Vietnamese, and Welsh, and would be prominent in their profile vectors. A counter-example is the diacritics found in Arabic, such as the fatha and kasra, which are small diagonal marks placed above or below letters, respectively, to indicate pronunciation. Depending on the personal style of the writer, they are assigned feature codes 50, 51, or 52. The Arabic bi-gram feature profile will have many occurrences of these feature codes, which helps distinguish it from Latin-script languages. We evaluated various distance metrics: Spearman, Minkowski, Mahalanobis, Jaccard, Hamming, Euclidean, standardized Euclidean, cosine, correlation, Chebyshev and cityblock. In general, the Hamming and cosine distance measures performed best for this application.
The approach used in this study formed n-grams over the feature numbers. A profile n-gram histogram vector for each language was created during training; during testing, n-gram vectors from the test document were compared against these stored profiles, and the language of the best-matching profile was reported. The experiments showed this to be a viable technique, able to learn a language profile and match it against features extracted from never-before-seen data, achieving accuracies of up to 87%.
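The profile-matching step can be sketched directly: build a normalized bi-gram histogram over the detected feature codes for each training language, then assign a test document to the language whose profile is closest in cosine distance. The feature code streams below are invented for illustration; in the real system they come from the handcrafted detectors:

```python
import numpy as np
from collections import Counter

def bigram_profile(feature_codes, dim=100):
    """Normalized histogram of consecutive feature-code pairs."""
    vec = np.zeros(dim * dim)
    counts = Counter(zip(feature_codes, feature_codes[1:]))
    for (a, b), n in counts.items():
        vec[a * dim + b] = n
    total = vec.sum()
    return vec / total if total else vec

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def classify(test_codes, profiles):
    """Report the language whose stored profile is the closest match."""
    test_vec = bigram_profile(test_codes)
    return min(profiles,
               key=lambda lang: cosine_distance(test_vec, profiles[lang]))

# Invented feature streams: "french" documents frequently emit the
# (18, 64) South-horseshoe + circumflex pair, while "arabic" ones
# emit diacritic codes in the 50s.
french_train = [18, 64, 7, 18, 64, 12, 18, 64, 3, 18, 64] * 5
arabic_train = [50, 33, 51, 33, 52, 33, 50, 51, 52, 33] * 5
profiles = {"french": bigram_profile(french_train),
            "arabic": bigram_profile(arabic_train)}

result = classify([9, 18, 64, 2, 18, 64, 18, 64], profiles)
```

Swapping cosine distance for Hamming distance over binarized profiles is a one-line change, which is how the metric comparison described above can be run.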
The downside of this technique, however, is the complexity of coding the individual feature detectors.
Deep Learning CNN Handwriting Language Classifier
Deep learning CNNs do not require the language expertise to know which features most likely differentiate handwriting systems. Instead, presumably, the deep network learns the unique characteristics of each language, such as the accent marks in French and the inverted question marks (¿) in Spanish. The deep net, conceivably, removes the need for one or more language specialists who know the peculiarities and distinguishing features of each language, and one or more computer specialists to code the feature detectors.
This large gain in automation obtained from the deep net does not come without a price. A new kind of specialist is needed to organize and feed the algorithm a very large dataset of training examples. Moreover, in some cases of rare and/or vanishing languages it is difficult to obtain a sufficiently large set of handwritten samples for training.
The deep CNN approach automatically learns the features from the raw input training images. The performance of the deep learning CNN increases with the number of layers and the quality of the input training set. In the prototype experiment, each convolution layer had a rectified linear unit (ReLU) and batch normalization. The cross entropy7 loss function was minimized to select one of the mutually exclusive language categories, achieving accuracies of up to 95%.
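The objective being minimized, softmax cross-entropy over mutually exclusive language categories, can be written out in a few lines of NumPy (a sketch of the loss function itself, not of the network):

```python
import numpy as np

def softmax(logits):
    """Stable softmax: mutually exclusive class probabilities."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the correct language."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(logits)), true_class])

# Three language classes. The first sample's network output is confident
# and correct; the second is confident and wrong, so its loss is far higher.
logits = np.array([[4.0, 0.5, 0.2],
                   [4.0, 0.5, 0.2]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
```

Because the classes are mutually exclusive, the softmax forces the per-language probabilities to sum to one, and minimizing the loss drives probability mass onto the correct language.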
The language classification results are on a per-word basis. The evidence for declaring a language increases with the number of words evaluated. If all the handwritten words in the document are from the same language, additional accuracy can be achieved by implementing a majority voting scheme. Assume that the accuracy of each language class is p and that all the misclassification error probabilities are the same; then the Pcorrect of a majority voting scheme over n words is given by the binomial equation:

Pcorrect = Σ (for k = ⌊n/2⌋+1 to n) C(n, k) · p^k · (1 − p)^(n−k)
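The binomial sum can be evaluated directly with math.comb; this sketch reproduces the Figure 10 operating point (per-word accuracy 0.7, 15 words):

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n per-word decisions are correct.

    Assumes per-word accuracy p, independent errors, and an odd n so
    that ties cannot occur.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

accuracy = majority_vote_accuracy(0.7, 15)  # approximately 0.95
```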
The majority voting scheme can yield substantial improvement, as Figure 10 shows for per-word probabilities of 0.7 (red), 0.8 (magenta), and 0.9 (blue). A per-word classification accuracy of 0.7 converts to an overall document accuracy of 0.95 after only 15 words are input into the majority vote.
Detection, extraction, and classification of handwriting language are prerequisites to Handwriting Character Recognition (HCR). Raytheon has developed a scalable prototype system that accomplishes these tasks. Deep learning algorithms were evaluated for both the page layout and language classification tasks. The deep learning language classification performance was compared to more conventional SURF bag-of-words features with an SVM classifier, and to a novel bi-gram handwritten feature representation with a nearest neighbor classifier. The development of the custom features requires significantly more thought and programming effort, but could be useful in those cases where there is insufficient data to fully train the other machine learning approaches.
The division of the problem into manageable segments controlled the combinatorial complexity. The end-to-end performance of the system requires that each segment work as intended: handwriting language classification requires the handwriting to be detected, which requires correct page layout analysis, which in turn requires good binarization, and so forth.
The immense variety of unconstrained input document types, combined with the cascade dependencies of the processing chain, continues to make handwriting language classification on general documents a challenging problem. Future work will explore the gains of various machine learning architectures, including more layers and extended training sets, for improved machine learning of document layout, handwriting detection and handwriting language recognition. This work could also extend the techniques developed in the prototype system to detect hand-drawn maps and circuit diagrams, which could have relevance for counter-IED and other important intelligence applications.
– Dr. Darrell Young
– Dr. Kevin Holley
1N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Trans. Systems, Man, and Cybernetics, 9(1), 1979, pp. 62-66.
2W. Niblack, An Introduction to Digital Image Processing, pp. 115-116. Englewood Cliffs, N.J.: Prentice Hall, 1986.
3J. Sauvola, M. Pietikainen, “Adaptive Document Image Binarization,” Pattern Recognition, 33, 2000, pp. 225-236.
4Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346–359, 2008.
5G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual Categorization with Bags of Keypoints,” Workshop on Statistical Learning in Computer Vision, ECCV, 2004, pp. 1-22.
6W. B. Cavnar and J. M. Trenkle, “N-Gram-Based Text Categorization,” Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 161-175.
7Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.