Command line tools for digitisation

This blog refers to a number of command line tools that the University Digitisation Centre (UDC) uses to automate digitisation processes, from manipulating images and PDFs to embedding metadata into files. This page lists those tools and briefly describes how we use them.

Primary command line digitisation tools

EXIFTool

EXIFTool is the backbone of many of UDC’s workflows. Many of the decisions made in our workflows are based on attributes of the images, so rather than just listing the files in a directory we use EXIFTool to provide a range of additional metadata. We also use it to write as much metadata as is practical and available into the files. Apart from reading and writing metadata, we also use it to move files around, rather than using copy/move commands, because:

  • it creates directory trees as required
  • we get to embed/update the metadata in the same operation
  • it will not overwrite an existing file, so it is not possible to write a command that replaces one by accident
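
As a rough illustration of the move-and-tag pattern above, here is a minimal Python sketch that builds an EXIFTool command which writes tags and sets the Directory pseudo-tag in the same call. The function names, the example tag and the paths are hypothetical, and the executable location simply follows the c:\Tools convention described under Installation; it is not necessarily how UDC's own scripts are written.

```python
import subprocess

# Assumed install location, following the c:\Tools convention used on this site.
EXIFTOOL = r"C:\Tools\EXIFTool\EXIFTool.exe"


def build_move_command(source, dest_dir, tags):
    """Return an EXIFTool argument list that writes tags and moves the file.

    Writing the Directory pseudo-tag asks EXIFTool to move the file,
    creating the destination directory tree if it does not exist yet.
    """
    args = [EXIFTOOL]
    for name, value in tags.items():
        # Each tag is passed as -TAG=VALUE (the tag names here are examples).
        args.append(f"-{name}={value}")
    args.append(f"-Directory={dest_dir}")
    args.append(source)
    return args


def move_with_metadata(source, dest_dir, tags):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_move_command(source, dest_dir, tags), check=True)
```

Calling move_with_metadata(r"c:\temp\files\in\page001.tif", r"c:\temp\files\out\item1", {"Artist": "UDC"}) would tag and move one file. Keeping the command builder separate from the call that runs it makes the arguments easy to log or inspect before anything is touched.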

PDFToolkit (PDFtk)

Our PDF workflow involves creating single page PDFs from scanned images (with or without OCR) and then merging the pages for each item into a single PDF document. PDFtk was the first utility we used for this.
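
The merge step could look something like the following Python sketch, which builds a PDFtk "cat" command from a list of single page PDFs. The function name, the file names and the install path are assumptions made for illustration.

```python
def build_pdftk_merge(page_pdfs, output_pdf, pdftk=r"C:\Tools\PDFtk\pdftk.exe"):
    """Return a PDFtk argument list that concatenates page_pdfs into output_pdf.

    The pages are merged in the order given, so the caller is responsible
    for sorting them into page order first.
    """
    if not page_pdfs:
        raise ValueError("at least one input PDF is required")
    # PDFtk syntax: pdftk in1.pdf in2.pdf ... cat output merged.pdf
    return [pdftk, *page_pdfs, "cat", "output", output_pdf]
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing arguments as a list avoids quoting problems with file names that contain spaces.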

QPDF

QPDF does essentially the same things as PDFtk, with the important difference that linearisation doesn’t strip XMP metadata. I haven’t replaced PDFtk entirely, as QPDF has some limitations on the number of pages it will merge.
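
For comparison, a QPDF merge with optional linearisation might be assembled as in this sketch. It uses QPDF's --empty, --pages and --linearize options; the function name and install path are hypothetical, and the sketch does not attempt to work around the page-count limitation mentioned above.

```python
def build_qpdf_merge(page_pdfs, output_pdf, linearize=True,
                     qpdf=r"C:\Tools\QPDF\qpdf.exe"):
    """Return a QPDF argument list that merges page_pdfs into output_pdf."""
    if not page_pdfs:
        raise ValueError("at least one input PDF is required")
    args = [qpdf]
    if linearize:
        # Linearisation is what is often called "fast web view".
        args.append("--linearize")
    # --empty starts from a blank document; --pages ... -- lists the inputs.
    args += ["--empty", "--pages", *page_pdfs, "--", output_pdf]
    return args
```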

ImageMagick

ImageMagick is an extremely powerful image processing tool that is ideal for automating processing tasks without requiring Photoshop. Our main use is for converting file formats and resizing images to suit clients’ needs. I’ve also explored automating other image processing steps to reduce the need for manual processing in Photoshop, e.g. removing black borders around scanned items.
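
A minimal sketch of both uses, assuming ImageMagick 7 (where the entry point is "magick"; version 6 uses "convert" instead). The function name and the parameter values are illustrative, and the -fuzz/-trim combination is one common way to remove a near-uniform border rather than a description of the exact options we use.

```python
def build_magick_command(source, dest, max_px=None, trim_fuzz_percent=None,
                         magick="magick"):
    """Return an ImageMagick argument list to convert, trim and/or resize.

    The output format is taken from the extension of dest.
    """
    args = [magick, source]
    if trim_fuzz_percent is not None:
        # -trim removes edges matching the corner colour; -fuzz sets the
        # tolerance and +repage resets the canvas after trimming.
        args += ["-fuzz", f"{trim_fuzz_percent}%", "-trim", "+repage"]
    if max_px is not None:
        # WxH fits the image inside the box while keeping its aspect ratio.
        args += ["-resize", f"{max_px}x{max_px}"]
    args.append(dest)
    return args
```

For example, build_magick_command("scan.tif", "scan.jpg", max_px=2000) describes a TIFF to JPEG conversion that fits within 2000 pixels on the longest side. The fuzz percentage usually needs tuning per scanner, since scanned "black" is rarely pure black.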

Ghostscript (GPL)

  • Website: http://www.ghostscript.com/
  • Primary function: Used by ImageMagick for reading/writing PDF files
  • Secondary function: Converting scanned PDFs to images for reprocessing.
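
The secondary function could be scripted along these lines. This sketch assumes the 64-bit Windows console executable (gswin64c) is on the PATH and picks the tiff24nc device and 300 dpi purely as examples; the function name is hypothetical.

```python
def build_gs_render(pdf, output_pattern, dpi=300, device="tiff24nc",
                    gs="gswin64c"):
    """Return a Ghostscript argument list that renders each page to an image.

    output_pattern should contain a printf-style counter such as %03d so
    that every page is written to its own file.
    """
    return [
        gs,
        "-dSAFER",    # restrict file access while interpreting the PDF
        "-dBATCH",    # exit when the input has been processed
        "-dNOPAUSE",  # do not prompt between pages
        f"-sDEVICE={device}",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        pdf,
    ]
```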

XPDF

A new addition (for us) to this set. I’ll be exploring the use of this tool along with Tesseract for providing additional text layout options for OCRed PDFs.

Additional digitisation applications

Abbyy Recognition Server

We have recently upgraded our OCR server to use Recognition Server with a licence that provides unlimited processing stations. All of our office PCs run as processing stations enabling OCR speeds of around 95% of file upload speeds!

The shift to a subscription licence has significantly increased the cost of this software.

Nuance Omnipage Professional

Omnipage was our original OCR software but is not stable enough to run as a server (it’s a desktop product so it’s not designed for that). We still use it for smaller OCR projects for researchers and for form data extraction from paper surveys.

Tesseract

I am currently testing command line pipelines for Tesseract. Stay tuned for more info.

PDF-PLOP

This tool was put into production from one of our business document scanners. The only function we use is digitally signing PDFs when that is required.

Miscellaneous command line tools

Windows Resource Kit

I needed a way to pause a script for a period of time, so I downloaded WRK just for “sleep.exe”, but some of the other utilities have come in handy on occasion.

7Zip

While we normally use the GUI version of 7Zip to create ZIP files for transfer to clients, we are currently working on automating the packaging process, which will use the command line version.
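
As a sketch of what the automated packaging call might look like, the function below builds a 7Zip "add" command that forces ZIP format. The function name and file names are hypothetical, and the executable is assumed to be reachable as "7z".

```python
def build_7zip_package(archive, files, sevenzip="7z"):
    """Return a 7Zip argument list that adds files to a ZIP archive."""
    if not files:
        raise ValueError("at least one file is required")
    # "a" adds files to an archive; -tzip selects ZIP rather than 7z format.
    return [sevenzip, "a", "-tzip", archive, *files]
```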

b64

Installation

Generally speaking, you can install command line tools to any folder you want. In my case, I also need to sync the command line tools across multiple computers, so I keep things fairly simple. With the exception of ImageMagick and Ghostscript, all of the tools listed above are installed into their own directory under c:\Tools, e.g. the command for EXIFTool will be “C:\Tools\EXIFTool\EXIFTool.exe”.

All of the command line scripts on this site will reference this directory path so you will need to edit them to suit your own setup.
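
One way to keep that edit in a single place is a small path helper like the sketch below. It assumes each executable has the same name as its folder, which matches the EXIFTool example above but may not hold for every tool; the names TOOLS_ROOT and tool_path are hypothetical.

```python
from pathlib import PureWindowsPath

# Change this one value to suit your own setup.
TOOLS_ROOT = PureWindowsPath(r"C:\Tools")


def tool_path(name, root=TOOLS_ROOT):
    """Return <root>\\<name>\\<name>.exe as a Windows path.

    Assumes the executable is named after its folder (an illustration only).
    """
    return root / name / f"{name}.exe"
```

With the default root, str(tool_path("EXIFTool")) gives C:\Tools\EXIFTool\EXIFTool.exe.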

Temporary files

Many of the processes described on this site require the creation of intermediate image files and metadata files. I use a separate temporary directory for scripts, as it is often necessary to delete all of the files in a directory. Scripts downloaded from this site will reference the following directories, which you may want to customise for your own purposes:

  • c:\temp
    • scripts exported by databases
    • output files from scripts for tracking script progress
  • c:\temp\meta
    • temporary metadata files for file packaging (database export)
    • metadata collected by scripts (database import)
  • c:\temp\files\in
    • individual processed files requiring further processing
  • c:\temp\files\out
    • combined PDFs
    • packaged files for delivery to clients
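
If you are setting this structure up on a new machine, a short Python sketch such as the following could create the directories and empty one of them between runs. The function names are hypothetical, and the root is passed in as a parameter so that you can point it somewhere other than c:\temp.

```python
from pathlib import Path

# Subdirectories from the list above, relative to the temporary root.
TEMP_SUBDIRS = ("meta", "files/in", "files/out")


def prepare_temp_tree(root):
    """Create the temporary root and its subdirectories; return them all."""
    root = Path(root)
    dirs = [root] + [root / sub for sub in TEMP_SUBDIRS]
    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


def clear_files(directory):
    """Delete the files directly inside directory, leaving subdirectories.

    Returns the number of files removed.
    """
    removed = 0
    for entry in Path(directory).iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed
```

For example, prepare_temp_tree(r"c:\temp") followed by clear_files(r"c:\temp\meta") resets the metadata directory before a new export. Only files are deleted, so the directory layout survives each clean-up.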