Optical Character Recognition (OCR)

Optical character recognition is the process of translating scanned images containing text into a format that includes the actual text. For people, this process is referred to as reading. For computers, it is a complex combination of pattern recognition, artificial intelligence and computer “vision”, and as such the success of the process relies on a number of factors other than the shape of individual letters. OCR software doesn’t just recognise the shapes of individual letters; strings of letters are also compared with dictionary words and language patterns. With human interaction, the software may also be “trained” to recognise repeated variations of letter shapes.
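As a rough illustration of the dictionary comparison described above, the following Python sketch corrects a misread word by finding its closest match in a word list. The tiny DICTIONARY, the function name correct_word and the 0.8 similarity cutoff are assumptions made for this example; commercial OCR packages use far larger dictionaries and their own matching methods.

```python
import difflib

# Hypothetical mini-dictionary; real OCR packages ship far larger word lists.
DICTIONARY = ["software", "letters", "recognise", "character", "optical"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the input unchanged if none is close."""
    lowered = word.lower()
    if lowered in dictionary:
        return word
    # get_close_matches ranks candidates by a similarity ratio between 0 and 1.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("saftware"))
print(correct_word("lettters"))
```

Words with no close match are returned unchanged, which avoids "correcting" names or technical terms that are simply missing from the dictionary.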

Scanning Considerations

Images intended for OCR processing have historically required high contrast, with bitonal (black-and-white) images being the preferred image type. Newer OCR algorithms now recognise text in images with a wider range of contrast and colours, making this less of a requirement. In some cases, increasing the contrast of a greyscale image will provide a small increase in accuracy.
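The two image adjustments mentioned above can be sketched in plain Python on a greyscale image held as a list of rows of values from 0 (black) to 255 (white). The function names, the linear stretch and the fixed threshold of 128 are assumptions chosen for illustration, not the method used by any particular OCR package.

```python
def stretch_contrast(pixels):
    """Linearly rescale greyscale values so the darkest becomes 0 and the lightest 255."""
    values = [value for row in pixels for value in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [list(row) for row in pixels]
    return [[(value - lo) * 255 // (hi - lo) for value in row] for row in pixels]


def to_bitonal(pixels, threshold=128):
    """Map each greyscale value to 0 (black) or 255 (white) using a fixed threshold."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


faint_scan = [[90, 160], [100, 150]]
print(to_bitonal(stretch_contrast(faint_scan)))
```

A fixed threshold is the simplest choice; scans with uneven lighting would usually need a threshold chosen per image or per region.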

Resolution can play a key role in determining the accuracy of the recognition, but it is NOT a case of more is better. Scanning poorly printed text at higher resolutions will not necessarily increase OCR accuracy and may in fact reduce it. Resolutions of 200–400 dpi will generally provide the best results.
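To make the resolution figures concrete, this small Python sketch works out how many pixels a scan produces and how tall a line of type is in pixels at a given dpi. The function names are hypothetical, and the only facts relied on are that dpi means dots per inch and that one typographic point is 1/72 of an inch.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def type_height_px(point_size, dpi):
    """Return the nominal height in pixels of type set at point_size (1 pt = 1/72 inch)."""
    return point_size * dpi / 72


for dpi in (200, 300, 400):
    print(dpi, pixel_dimensions(8.5, 11, dpi), type_height_px(12, dpi))
```

Doubling the resolution quadruples the number of pixels per page, so file size and processing time grow much faster than any gain in recognition accuracy.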

OCR application overview

Basic overview of OCR software
Application	Cost	Accuracy	Languages	Asian Languages	Industry-specific dictionaries
Microsoft Office Document Imaging	Lowest	Lowest	3 (en, sp, fr)	No	No
Adobe Acrobat	…	…	42	Yes	No
Omnipage/FineReader	Highest	Highest	100+	Yes	Yes

Microsoft Office Document Imaging

This is not installed by default on university computers, but it can be added relatively easily. The OCR capability is quite basic and works best with black-and-white text pages, but it can be a useful tool for getting text from scanned pages.

Recognise previously scanned images
Recognise text in images embedded in Word documents
(Optional) Add OCR text for supported images to Windows indexing service

Adobe Acrobat

OCR functions are not available in Acrobat Reader; they require the purchase of either Acrobat Standard or Professional.

Layout detection
Manual suspect character correction
Read aloud function will perform OCR on image-only pages
Process multiple files at once
OCR to text files (via batch processing only)

Dedicated OCR Packages

OCR-specific packages such as Nuance Omnipage or Abbyy FineReader provide additional functionality and greater accuracy.

Improved layout detection
Barcode recognition
Output multiple file formats at once
Wide range of input formats
Watched folders for automatic processing
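The watched-folder idea in the list above can be sketched in Python as a simple polling loop. This is only an illustration of the concept, not how Omnipage or FineReader implement it: the IMAGE_SUFFIXES set, the function names and the process callback (standing in for whatever OCR step would run) are all assumptions.

```python
import time
from pathlib import Path

# Assumption: these extensions stand in for whatever input formats a package accepts.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".pdf"}


def find_new_files(folder, seen):
    """Return image files in folder that are not yet in seen, and record them."""
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        is_image = path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        if is_image and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def watch(folder, process, interval=5.0):
    """Poll folder indefinitely, passing each new file to process (a hypothetical OCR hook)."""
    seen = set()
    while True:
        for path in find_new_files(folder, seen):
            process(path)
        time.sleep(interval)
```

Calling watch("scans", print) would simply print the path of each new file dropped into a folder named scans; a real workflow would replace print with a call into the OCR software.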

Advanced business document imaging systems

Advanced business document scanning systems provide additional features:

Distributed processing, scanning and verification on different workstations
Page replacement while scanning
Data processing of OCRed text e.g. barcode verification
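As one example of what barcode verification can mean, the Python sketch below checks the check digit of an EAN-13 barcode number that has been read from a scan. EAN-13 is chosen here only as a familiar example; the source does not say which barcode types these systems verify, and the function name is hypothetical.

```python
def ean13_is_valid(code):
    """Return True if code is 13 ASCII digits whose last digit is the correct EAN-13 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(char) for char in code]
    # The first 12 digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


print(ean13_is_valid("4006381333931"))  # a correctly read code
print(ean13_is_valid("4006381338931"))  # the same code with one digit misread
```

A failed check flags the page for manual verification rather than silently accepting a misread number.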
