Author: Aleksandra Zaytseva (dmitrivna@gmail.com) Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2012-05-29T16:20:36+0000
Optical character recognition (OCR) for the Bdinski sbornik project was implemented with ABBYY FineReader, version 11. The input source was Jan L. Scharpe and F. Vyncke’s 1973 Bdinski Zbornik: An Old-Slavonic Menologium of Women Saints (Ghent University Library Ms. 408, A.D. 1360) (Brugge: De Tempel), where the original typeset text looks like:
ABBYY FineReader can perform OCR on either pregenerated PDF image files or input fed directly from the scanner into the software through the TWAIN interface. The latter approach, scanning directly into the program from the flatbed scanner (Scan to Microsoft Word in the image below), yielded much better results:
Scans were performed at 300 dpi; we tried 400 dpi in the hope of resolving some recognition problems, but we found that the recognition rate at the two resolutions was comparable, and that the higher resolution yielded larger files that had no processing advantages.
The OCR process required us to select a “language” (the term is in quotation marks because it is not necessarily the same as an actual human language), train the system to map individual glyph images to individual characters (code points) in the specified language, and then convert text areas from bitmapped images to character data streams. To choose a language, first click on Tools to drop down the menu below:
From that menu, click on Language Editor … to select and edit the language or languages that the system will be asked to recognize. You can either select from a list of existing languages or create a new one (by clicking on New … at the bottom):
We initially set the recognition language to Russian (OCS was not an option) in the hope that the pretrained knowledge of modern Cyrillic that shipped with the product would improve the recognition rate. We found that the difference between the black-letter typeface in the input source (see the image above) and modern Russian typography was such that we obtained much better results starting with a completely clean slate. That is, we created a separate, custom language, which we called BdinskiSbornik. All in all, we found this more reliable (quicker to train, more accurate results) than using Russian or even Russian (Old Spelling) as a preset language.
Once we had decided to create a new language, we had to configure it by specifying the available inventory of characters, which would then be used as mapping targets during OCR training. To edit a selected or custom language in order to specify the characters that should be available, hit Properties …, which opens the Language Properties dialog:
We selected Russian (Old Style) as the Source language as a way of preselecting most of the characters that we wanted to include in our language.
At this point this selection just specifies the base character inventory; it should not be confused with selecting Russian (Old Style) as a language, which we avoided because that would have incorporated preexisting knowledge about character distribution that would have been erroneous for our purposes. Since the Source language is just the starting point for specifying the character inventory, we then selected the … button next to the Alphabet bar to customize our inventory by including and excluding individual characters:
We found it frustrating that although the Alphabet editing feature permits us to identify the characters to be recognized, certain ones cannot be excluded (e.g., Latin x) and others cannot be included (such as the entire Unicode Cyrillic Extended-B range; http://www.unicode.org/charts/PDF/UA640.pdf). We understand that the use of a pretrained language might entail a commitment to a standard character inventory, but we see no reason why a custom language setting should not make the entire Unicode Basic Multilingual Plane (BMP) available. Because of this limitation, it is not possible simply to copy and paste a new character into the program during the recognition process. What we did instead was use placeholders, e.g., я for ꙗ, and we then replaced the placeholders through global search and replace operations after the initial OCR had been completed and the document had been saved to a Microsoft Word file.
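The placeholder substitution can be sketched as follows. This is an illustrative reconstruction in Python, not part of the project’s actual workflow; the function name is ours, and the mapping table shows only the я → ꙗ pair attested above. The approach is safe only because the custom language never legitimately outputs the placeholder character, so every occurrence can be replaced.

```python
# Placeholder characters chosen during OCR, mapped to the Cyrillic
# Extended-B characters they stand in for. Only я → ꙗ is attested in
# the text above; a project would extend the table with whatever
# placeholders it actually used.
PLACEHOLDERS = {
    "я": "ꙗ",  # я stood in for iotified a, unavailable in the Alphabet editor
}

def replace_placeholders(text: str, mapping: dict[str, str] = PLACEHOLDERS) -> str:
    """Apply each global search-and-replace in turn."""
    for placeholder, target in mapping.items():
        text = text.replace(placeholder, target)
    return text
```

In practice this would run over text extracted from the saved Word file (for example, after resaving it as plain text), since the replacements are plain string operations.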
The overall recognition rate was good. We didn’t count the errors per line, but after training the system and letting it train us, we found that we could process a page of input in approximately ten minutes, which was much quicker and more accurate than keyboarding would have been. Nonetheless, the ABBYY program was not entirely consistent: sometimes it would recognize a character flawlessly for a while and then begin to make mistakes with the same character later, and processing the same page multiple times could yield different results each time.
Where the program makes a consistent mistake due to a training error, such as regularly mapping a glyph image to the same wrong character, it is possible to undo the erroneous training by using the Pattern Editor (Tools → Pattern Editor). Begin by selecting the language you’re using:
and then hit Edit …. A screen of glyph images that have been stored during training, along with the characters to which they have been mapped, will open, and you can then delete the glyphs that are being misrecognized consistently and retrain the page or document:
After scanning each page we found it most efficient first to delete or edit the green text boxes that define the areas to which OCR will be applied:
This enabled us to exclude running headers, footnotes, edition line numbers, and other areas that we did not want to retain in our output.
Because we created a new language, the system began with no ability to recognize any glyphs, and we had to train it. Training proceeded slowly, one glyph at a time, at first, but the system learned quickly, after which we could just feed it pages and let it read. The character recognition was never flawless, and we always had to read each page of output carefully and correct the errors, but overall we were satisfied with ABBYY’s ability to simplify and streamline the process of converting the printed text to character data.
To train, go to Tools → Options, and check the box that says Read with Training:
This box unchecks itself after every reading, so the selection must be repeated every time you run a training scan. If you forget (as we did frequently), ABBYY will read all your pages (or your selected page) automatically; if you have already manually corrected the mistakes in the output panel on the right side and are rereading to improve the training, it will revert to the standard reading and overwrite your corrections.
If you just hit Read, ABBYY will read the entire document, which probably isn’t what you want, both at first (because you’ll need to train the system) and later (the output required fairly extensive editing and correction, so we found it easiest to work through the book a page at a time). To read selected pages, highlight them and right-click. To read selected blocks of text, outline them with a green box and right-click.
Training itself is a fairly straightforward process. As was noted above, it is not possible to copy and paste new symbols into the Training field, and if there is no option for a certain letter, it is necessary either to edit the range of available characters in the Language Editor (if permitted) or to select a placeholder character and then replace it afterwards through a global search and replace operation on the output Word file. The system tries to isolate individual characters, but it sometimes misses, whether because the glyph is discontinuous (e.g., ы) or because the image has a faint part that makes it appear discontinuous, and it also sometimes erroneously reads what should be two separate characters as one. If the system has selected only part of a letter, or more than one letter or symbol, it is possible to adjust the recognition area with the help of the « and » buttons, which can join or unjoin what appear to be discrete glyphs (separated by white space). The training interface is illustrated below:
After training and reading, letters about which the system is uncertain are highlighted in turquoise on the right side of the screen. The user can correct these manually, but those corrections are not used to update the training. It is possible to correct any character, whether highlighted in turquoise or not, but in general the system showed fairly good awareness of where its recognition was uncertain, and it proved most efficient to concentrate on reviewing just the turquoise sections at this stage. The remainder is not error-free, but since the entire output will undergo comprehensive proofreading, correction, and editing at a later stage in the development process, we felt that any further correction should be deferred until then.
One of the trickiest and most frustrating aspects of the training and reading workflow is that correcting the output errors as described above does not feed back into the training, and therefore does not improve subsequent recognition. It is possible to retrain on a whole page, but this is tedious and sometimes counterproductive, in that it may correct one error while introducing another. Alternatively, it is possible to select just a small area of text to be trained, select Read with training, and right-click on that selected text box to read only that area. The user can then find any problematic symbol (select … next to the text box), copy it, close the trainer without saving the pattern, and delete the unnecessary text area. Such problems occurred fairly frequently with letters adorned with diacritics or superscript letters. A related problem is that it is not possible to train the system on glyphs that it incorrectly thinks it has recognized. During the training process the system will stop and query the user when it is uncertain, but if it is certain but incorrect, there is no way to select for training a glyph on which the training routine has not stopped on its own.
It is possible to save selected pages by highlighting them on the left side and right-clicking. To save the entire project, the user must select Save FineReader Document under File. Likewise, to open a project, the user must specifically select Open FineReader Document under File:
Simply hitting Open will open only an individual file, and not an entire project. Saving the project means saving not only the text, but also the training, so this is an important step if the OCR is not going to be performed all in one session.
The following recognition problems were particularly noticeable in the project:
- The letter б (b) for the digit 6 (six), the letter о for the digit 0 (zero), and the letter з (z) for the digit 3 (three). We tracked down many of these afterwards by performing a regular expression search for sequences of characters that included both letters and digits, since although the input document contained both, it was unlikely to include them in the same word. We corrected single-character errors that could not be identified in this way during the character-by-character proofreading and editing stage.
- Confusion between в (v) and б (b).
- Confusion among и, н, and п. Some of these could be corrected through global search and replace operations (e.g., of those three letters, only и can stand alone as a word, and only н occurs before а in a two-letter word) or through global search with verification before replacement (e.g., in word-initial position before о, п is overwhelmingly the most likely, н is possible, and и is impossible).
- м for iл. This was easily corrected with a global search and replace operation.
- Trouble with ы, often rendering it as ьi. Ultimately we remapped most occurrences of jery to ьї, because that is what occurs in the manuscript, so this was easily handled during post-processing.
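The regular-expression check and the global replacements described above can be sketched as follows. This is our illustrative reconstruction in Python, not the project’s actual tooling; the pattern and function names are hypothetical, and the jery replacement is shown as a blanket substitution even though the project remapped only most occurrences and reviewed the rest by hand.

```python
import re

# Flag "words" that mix Cyrillic letters and digits: a symptom of the
# б/6, о/0, and з/3 confusions described above.
MIXED_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[а-яѣѧ])\w+\b")

def find_mixed_tokens(text: str) -> list[str]:
    """Return tokens that contain both a digit and a Cyrillic letter."""
    return MIXED_TOKEN.findall(text)

def normalize_jery(text: str) -> str:
    """Remap misread (ьi) and remaining (ы) jery to the manuscript form ьї."""
    text = text.replace("ьi", "ьї")  # the OCR's two-character misreading
    text = text.replace("ы", "ьї")   # blanket remapping; exceptions reviewed by hand
    return text
```

For confusions such as и/н/п, where only context distinguishes the candidates, the same findall approach can surface suspicious contexts (e.g., word-initial и or н before о) for human verification rather than automatic replacement.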