Below is a short list of reasons why OCR on historic documents is a real challenge.
Image Quality
Old documents are hard to scan, but good scan quality is important for good OCR results. Problems that you may encounter:
Curled paper
Pages stuck together
Weird layouts
Curved lines of text, because a fragile book cannot be pressed flat for scanning
Layout Detection
Historic books/documents often have a different layout structure.
Accordingly, algorithms that were designed for “modern” layouts might not deliver proper results on these layouts.
Old newspapers can also be very tricky:
Small fonts
Complex layouts
Reading order
Typefaces Used
Old typefaces are used; standard character recognisers cannot read Gothic/Fraktur fonts
The quality of the characters to be OCRed is often very poor:
Broken characters
Characters mixed with noise, dirt, or handwriting
There are characters in old documents that are not available in modern computer fonts
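Some of these characters do have Unicode code points even where common fonts lack a glyph for them, so one possible approach is to transcribe them faithfully and fold them to modern equivalents in a later step. The following is only a minimal sketch of that idea: the function name `modernise` and the small mapping table are hypothetical, and a real project would need a much larger, period-specific table.

```python
import unicodedata

# Hypothetical mapping, for illustration only: a vowel followed by
# COMBINING LATIN SMALL LETTER E (U+0364), an early way of printing umlauts.
HISTORIC_TO_MODERN = {
    "a\u0364": "\u00e4",  # a + small e above -> ä
    "o\u0364": "\u00f6",  # o + small e above -> ö
    "u\u0364": "\u00fc",  # u + small e above -> ü
}


def modernise(text: str) -> str:
    """Fold some historic characters to modern equivalents."""
    for old, new in HISTORIC_TO_MODERN.items():
        text = text.replace(old, new)
    # NFKC folds compatibility characters, for example the long s (U+017F)
    # to "s" and the ligature U+FB01 to the two letters "fi".
    return unicodedata.normalize("NFKC", text)


print(modernise("Wa\u017f\u017fer"))
print(modernise("scho\u0364n"))
```

Folding loses information, so it is usually safer to keep the faithful transcription as well and apply a step like this only to a copy used for search or indexing.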
Language Issues
Historically, spelling was not standardised, and consequently there are many spelling variants of the same word
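One common way to cope with spelling variants after OCR is to map them to a modern form with a lookup table, so that a search for the modern word also finds the historic spellings. The sketch below is an assumption about how this could look, not a description of any particular tool: `VARIANTS` and `normalise_spelling` are hypothetical names, and the three entries are illustrative examples of older English spellings.

```python
import re

# Illustrative entries only; a real lexicon would be far larger
# and specific to the language and period of the documents.
VARIANTS = {
    "shew": "show",
    "publick": "public",
    "compleat": "complete",
}


def normalise_spelling(text: str) -> str:
    """Replace known historic spellings, keeping initial capitalisation."""

    def replace_word(match: re.Match) -> str:
        word = match.group(0)
        modern = VARIANTS.get(word.lower())
        if modern is None:
            return word  # unknown word: leave untouched
        return modern.capitalize() if word[0].isupper() else modern

    return re.sub(r"\w+", replace_word, text)


print(normalise_spelling("Shew the publick a compleat list"))
```

A table like this only covers variants that someone has already recorded; unseen variants pass through unchanged, which is why the original text should be kept alongside the normalised one.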