End of the Product-Lifecycle
ABBYY FineReader XIX is a special version of the award-winning FineReader optical character recognition (OCR) software for recognising “Fraktur” or “black letter” texts from the period between 1800 and 1938. It is designed to convert scans of old documents, books, and papers into text for digital archiving and publishing, and it is the first omnifont OCR software for Fraktur.
The Solution: First Omnifont OCR for Fraktur
ABBYY FineReader XIX is the first omnifont OCR for Fraktur, giving users a solution for scanning and converting old documents with minimal training and dictionary work. This was achieved by combining extremely intelligent technology with dedicated linguistic study:
OCR systems work by analysing a text image and forming hypotheses about which letter or word each part of the image represents. The hypotheses are then analysed in context and verified against sophisticated OCR dictionaries built from Language Models (LMs), which are computer databases that describe the vocabulary of a language. The problem is that modern OCR systems do not have LMs for older typefaces and older spellings. Fraktur text recognition therefore required OCR dictionaries developed specifically for this time period, and special language models were created for five European languages.
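The hypothesise-and-verify step described above can be illustrated with a minimal Python sketch. It is not ABBYY's implementation: the function name, the scores, the bonus value, and the tiny word lists are all hypothetical, chosen only to show why a lexicon containing period spellings changes which hypothesis wins.

```python
# Illustrative sketch only: all names, scores and word lists are hypothetical.

# Tiny stand-ins for a contemporary and a period (historical) lexicon.
MODERN_LEXICON = {"wasser", "tür"}
PERIOD_LEXICON = {"wasser", "thür", "seyn"}


def pick_hypothesis(candidates, lexicon, bonus=0.3):
    """Return the best word hypothesis.

    candidates: list of (text, score) pairs, where score is an assumed
                classifier confidence between 0 and 1.
    lexicon:    set of lower-case words accepted by the language model.
    bonus:      assumed score increase for a hypothesis found in the lexicon.
    """
    def adjusted(candidate):
        text, score = candidate
        return score + (bonus if text.lower() in lexicon else 0.0)

    return max(candidates, key=adjusted)[0]


# The old spelling "Thür" is slightly less likely on image evidence alone.
candidates = [("Thür", 0.45), ("Thiir", 0.55)]

print(pick_hypothesis(candidates, MODERN_LEXICON))
print(pick_hypothesis(candidates, PERIOD_LEXICON))
```

With the contemporary word list neither hypothesis is confirmed, so the misreading with the higher raw score is kept. With the period word list the historical spelling is confirmed and selected instead.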
The Fraktur language models were created with the help of ABBYY partner ATAPY Software. During the development process, 10 different dictionaries and more than 105 books published between 1808 and 1930 were analysed. Linguists reviewed the word stock, identified words that had been phased out as the languages evolved, and determined the correct paradigm assignments so that the language models matched the grammar usage of the time period. More than 500,000 word entries were manually compared with existing FineReader dictionaries.
Grammatical paradigms and word evolutions were reviewed, and 159 historic grammar paradigms that were missing from the contemporary language models were added. The language models were then compiled and tested on a control set of test documents featuring old text.
To recognise Fraktur-style fonts, ABBYY development teams created special classifiers, or alphabets, capable of recognising the Fraktur symbols. As part of this effort, the teams collected a symbol image base with an average of 2,500 samples for each symbol, created a new alphabet pattern, and assembled a sample test base representing 31,000 pages of text from different sources. Using the sample text, the recognition engine was “fine-tuned” to work with the subtle features of the Fraktur alphabet (such as ligatures, or connected letters). The new alphabet was then added to FineReader XIX and its interface and tested extensively.