Optical Character Recognition (OCR)
Optical character recognition is the process of translating scanned images containing text into a format that includes the actual text. For people, this process is called reading. For computers, it is a complex combination of pattern recognition, artificial intelligence and computer “vision”, so its success depends on more than the shape of individual letters. OCR software does not just recognise the shapes of individual letters; it also compares strings of letters with dictionary words and language patterns. With human interaction, the software can also be “trained” to recognise recurring variations in letter shapes.
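The dictionary comparison described above can be illustrated with a minimal sketch. It assumes a tiny hypothetical word list and uses Python's standard `difflib` module for fuzzy matching; commercial OCR engines use far larger lexicons and their own matching methods, so this shows the idea only.

```python
import difflib

# Hypothetical mini-dictionary for illustration; a real engine would use a full lexicon.
DICTIONARY = ["optical", "character", "recognition", "software", "letters"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if nothing is close enough."""
    lowered = word.lower()
    if lowered in dictionary:
        return word
    # get_close_matches ranks candidates by similarity ratio and drops those below cutoff.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply correct_word to each whitespace-separated word."""
    return " ".join(correct_word(w) for w in text.split())
```

For example, `correct_text("OCR saftware")` returns `"OCR software"`: the misread word is close enough to a dictionary entry to be replaced, while "OCR" has no close match and is left alone.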
Scanning Considerations
Images intended for OCR processing have historically required high contrast, with bitonal (black-and-white) images being the preferred image type. Newer OCR algorithms recognise text in images with a wider range of contrast and colours, making this less of a requirement. In some cases, increasing the contrast of a greyscale image will provide a small increase in accuracy.
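One simple way to increase the contrast of a greyscale image is a linear contrast stretch. The sketch below is an assumption about how this might be done, not what any particular OCR package does; it works on a plain list of rows of 0-255 values rather than a real image file, and the function name is hypothetical.

```python
def stretch_contrast(pixels):
    """Rescale greyscale values so the darkest pixel becomes 0 and the lightest 255.

    pixels is a list of rows, each row a list of integers in the range 0-255.
    """
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely flat image has no contrast to stretch; return a copy.
        return [row[:] for row in pixels]
    # Integer arithmetic keeps the result exact and within 0-255.
    return [[(value - lo) * 255 // (hi - lo) for value in row] for row in pixels]
```

A washed-out scan whose values only span 100 to 200 would be spread across the full 0 to 255 range, which makes dark text stand out more clearly from a light background.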
Resolution can play a key role in determining the accuracy of the recognition, but more is not always better. Scanning poorly printed text at higher resolutions will not necessarily increase OCR accuracy and may in fact reduce it. Resolutions of 200 dpi to 400 dpi will generally provide the best results.
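To check whether an existing scan falls in that range, the effective resolution can be estimated from the image width in pixels and the physical width of the page. The helper names below are hypothetical, and the 200 to 400 dpi bounds are simply the guideline stated above.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Estimate scan resolution from image width in pixels and page width in inches."""
    return pixel_width / page_width_inches


def in_recommended_range(dpi, low=200, high=400):
    """Return True if dpi falls inside the guideline range (inclusive)."""
    return low <= dpi <= high
```

An image 2400 pixels wide scanned from an 8 inch wide page works out to 300 dpi, which is inside the range; the same page at 4800 pixels wide would be 600 dpi and outside it.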
OCR application overview
Application | Cost | Accuracy | Languages | Asian Languages | Industry-specific dictionaries |
---|---|---|---|---|---|
Microsoft Office Document Imaging | Lowest | Lowest | 3 (en, es, fr) | No | No |
Adobe Acrobat | … | … | 42 | Yes | No |
Omnipage/FineReader | Highest | Highest | 100+ | Yes | Yes |
Microsoft Office Document Imaging
Microsoft Office Document Imaging is not installed by default on university computers, but it can be added relatively easily. Its OCR capability is quite basic and works best with black-and-white text pages, but it can be a useful tool for getting text from scanned pages.
- Recognise previously scanned images
- Recognise text in images embedded in Word documents
- (Optional) Add OCR text for supported images to the Windows indexing service
Adobe Acrobat
OCR functions are not available in Acrobat Reader; they require the purchase of either Acrobat Standard or Acrobat Professional.
- Layout detection
- Manual suspect character correction
- Read aloud function will perform OCR on image-only pages
- Process multiple files at once
- OCR to text files (via batch processing only)
Dedicated OCR Packages
OCR-specific packages such as Nuance Omnipage or Abbyy FineReader provide additional functionality, improved features and greater accuracy.
- Improved layout detection
- Barcode recognition
- Output multiple file formats at once
- Wide range of input formats
- Watched folders for automatic processing
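The watched-folder idea in the list above can be sketched as a simple polling loop. This is an illustrative assumption about how such a feature might work, not how Omnipage or FineReader implement it; `scan_once`, `watch` and the `handler` callback are hypothetical names, and the handler stands in for whatever OCR step would be run on each file.

```python
import time
from pathlib import Path

# File types treated as scan input in this sketch (an assumption).
SCAN_SUFFIXES = (".tif", ".tiff", ".png", ".jpg", ".pdf")


def scan_once(folder, seen, handler, suffixes=SCAN_SUFFIXES):
    """Pass each not-yet-seen scan file in folder to handler; return the new paths."""
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes and path not in seen:
            handler(path)      # e.g. run OCR on the file
            seen.add(path)     # remember it so it is processed only once
            new_files.append(path)
    return new_files


def watch(folder, handler, interval=5.0):
    """Poll folder indefinitely, handing each new file to handler exactly once."""
    seen = set()
    while True:
        scan_once(folder, seen, handler)
        time.sleep(interval)
```

Calling `watch("incoming", print)` would print the path of each new scan dropped into a folder named `incoming`; replacing `print` with a function that runs OCR turns it into automatic processing.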
Advanced business document imaging systems
Advanced business document scanning systems provide additional capabilities:
- Distributed processing, scanning and verification on different workstations
- Page replacement while scanning
- Data processing of OCRed text, e.g. barcode verification
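As one possible reading of barcode verification, a system could check that a barcode number recovered by OCR is internally consistent. The sketch below assumes the number is an EAN-13 code and validates its check digit; the function name is hypothetical, and real document imaging systems may verify barcodes in other ways, such as comparing them against a database.

```python
def ean13_is_valid(code):
    """Return True if code is 13 ASCII digits with a correct EAN-13 check digit."""
    code = code.strip()
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        # OCR misreads such as the letter O for zero are rejected here.
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]
```

A single misread digit changes the weighted sum, so the check fails and the page can be flagged for manual verification.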