Command line tools for digitisation

This blog refers to a number of command line tools that the University Digitisation Centre (UDC) uses to automate digitisation processes, from manipulating images and PDFs to embedding metadata into files. This page lists those tools and briefly describes how we use them.

Primary digitisation tools

EXIFTool

EXIFTool is the backbone of many of UDC’s workflows. Many of the decisions made in our workflows are based on attributes of the images, so rather than just list files in a directory we use EXIFTool to provide a range of additional metadata. We also use it to write as much metadata as practical/available into the files. Apart from reading and writing metadata we also use it to move files around rather than using copy/move commands because:

  • it creates directory trees as required
  • we get to embed/update the metadata in the same operation
  • it will not overwrite an existing file, so a command cannot accidentally replace one
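
As a rough illustration of the tag-and-move approach described above, the Python sketch below builds an EXIFTool command line that writes a tag and moves the file in one call. The function name, the Title tag and the example file paths are hypothetical; writing to the Directory pseudo-tag is how EXIFTool moves files, but check the options against your own version before relying on it.

```python
def build_exiftool_move(src, dest_dir, title,
                        exiftool=r"C:\Tools\EXIFTool\EXIFTool.exe"):
    """Return an argument list that tags `src` and moves it into `dest_dir`.

    Assumption: Title is only an example tag; substitute the tags you need.
    """
    return [
        exiftool,
        f"-Title={title}",         # embed/update metadata ...
        f"-Directory={dest_dir}",  # ... and move the file in the same operation
        src,
    ]


cmd = build_exiftool_move(r"c:\temp\files\in\page001.tif",
                          r"c:\temp\files\out\item42",
                          "Item 42")
print(cmd)
```

The list can be passed to subprocess.run(cmd, check=True) once the paths match your own setup.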

PDFToolkit (PDFtk)

Our PDF workflow involves creating single page PDFs from scanned images (with or without OCR) and then merging the pages for each item into a single PDF document. PDFtk was the first utility we used for this.
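
The merge step might look something like the following sketch, which builds a PDFtk "cat" command from a list of single page PDFs. The function name, the file names and the install folder under C:\Tools are assumptions for illustration only.

```python
def build_pdftk_merge(pages, output, pdftk=r"C:\Tools\PDFtk\pdftk.exe"):
    """Return a PDFtk argument list that concatenates `pages` in the given order.

    Assumption: the default executable path is hypothetical.
    """
    if not pages:
        raise ValueError("at least one page PDF is required")
    # pdftk <inputs...> cat output <merged file>
    return [pdftk, *pages, "cat", "output", output]


pages = ["item42_001.pdf", "item42_002.pdf", "item42_003.pdf"]
print(build_pdftk_merge(pages, "item42.pdf"))
```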

ImageMagick

ImageMagick is an extremely powerful image processing tool that is ideal for automating processing tasks without requiring Photoshop. Our main use is converting file formats and resizing images to suit clients’ needs. I’ve also explored automating other image processing steps to reduce the need for manual processing in Photoshop, e.g. removing black borders around scanned items.
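
The two uses mentioned above could be sketched as follows. This assumes ImageMagick 7, where the command is "magick" (older versions use "convert"); the function names, pixel size and fuzz percentage are illustrative guesses rather than the settings we actually use.

```python
def build_resize(src, dest, max_px, magick="magick"):
    """Convert `src` to the format implied by `dest`, fitting it in a square box."""
    # The trailing ">" means "only shrink images larger than the box".
    return [magick, src, "-resize", f"{max_px}x{max_px}>", dest]


def build_trim_border(src, dest, fuzz_percent=10, magick="magick"):
    """Trim a near-uniform border (e.g. black scanner background) from `src`."""
    # -fuzz sets how far a pixel may differ from the border colour and still be trimmed;
    # +repage resets the virtual canvas after trimming.
    return [magick, src, "-fuzz", f"{fuzz_percent}%", "-trim", "+repage", dest]


print(build_resize("scan001.tif", "scan001.jpg", 2000))
print(build_trim_border("scan001.tif", "scan001_trimmed.tif"))
```

Passing the arguments as a list (rather than a single string) avoids the shell interpreting characters such as ">" and "%".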

Ghostscript (GPL)

  • Website: http://www.ghostscript.com/
  • Primary function: Used by ImageMagick for reading/writing PDF files
  • Secondary function: Converting scanned PDFs to images for reprocessing.
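
For the secondary function, a minimal sketch of a Ghostscript call that renders each page of a scanned PDF to a TIFF is shown below. It assumes the 64-bit Windows console executable (gswin64c) is on the path; the function name, the tiff24nc device and the 300 dpi resolution are example choices, not necessarily what our scripts use.

```python
def build_gs_pdf_to_tiff(pdf, out_pattern, dpi=300, gs="gswin64c"):
    """Return a Ghostscript argument list that renders `pdf` to one TIFF per page.

    `out_pattern` should contain a printf-style page counter such as %03d.
    (In a batch file the % would need to be doubled; in a Python list it does not.)
    """
    return [
        gs,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",  # run unattended and exit when done
        "-sDEVICE=tiff24nc",                # 24-bit RGB TIFF output
        f"-r{dpi}",                         # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        pdf,
    ]


print(build_gs_pdf_to_tiff("item42.pdf", "item42_%03d.tif"))
```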

VeryPDF PDF Toolbox Command Line

This application is a commercial package and was included in our toolbox primarily because it was the only (affordable) tool that could linearise our PDFs after adding metadata with EXIFTool. Both PDFtk and PDF-PLOP will linearise PDFs, but the extra metadata is deleted in the process. We also use this for producing PDFs with specific resolution/compression settings that are not available in Omnipage.

QPDF

5/10/2017: I’m currently testing this as a replacement for the commercial tool above. Linearisation doesn’t strip XMP metadata, but QPDF doesn’t overwrite the original file, so an additional write operation is required. It should also be possible to replace PDFtk with it, which would remove one file write operation.
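
The extra write step can be sketched like this: linearise to a temporary file, then swap it over the original. The function name and temporary suffix are hypothetical, and the "run" parameter exists only so the command can be stubbed out for testing; "qpdf --linearize in.pdf out.pdf" is the standard QPDF invocation.

```python
import os
import subprocess


def linearise_in_place(pdf_path, run=subprocess.run, qpdf="qpdf"):
    """Linearise `pdf_path` via a temporary file, then replace the original."""
    tmp_path = pdf_path + ".linearised.tmp"  # hypothetical temp name
    # QPDF writes to a separate output file rather than overwriting its input.
    run([qpdf, "--linearize", pdf_path, tmp_path], check=True)
    # The additional write operation: move the new file over the original.
    os.replace(tmp_path, pdf_path)
    return pdf_path
```

Because the original is only replaced after QPDF finishes without raising an error, a failed run leaves the source PDF untouched.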

Nuance Omnipage Professional

While technically not a command line utility, Omnipage Pro runs as an OCR server on one of our workstations, providing automatic OCR conversion via a number of “watched” folders. As such it is an integral part of our toolkit. In 2013 we developed a series of tools and processes for processing multiple choice surveys, using Omnipage for the raw data extraction from the forms. A customised version of these tools is now used to mark MCQ exams for a couple of departments.

Abbyy Recognition Server

We are currently looking at upgrading to Abbyy Recognition Server for our OCR processing for improved stability and capacity.

PDF-PLOP

This tool was brought into production from one of our business document scanners. The only function we use is digitally signing PDFs when required.

Miscellaneous command line tools

Windows Resource Kit

I needed a way to pause a script for a period of time, so I downloaded WRK just for “sleep.exe”, but some of the other utilities have come in handy on occasion.

7Zip

While we normally use the GUI version of 7Zip for creating ZIP files to transfer to clients, we are currently working on automating the packaging process, which will utilise the command line version.
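
A packaging step along these lines might be sketched as follows. The function name, file names and install folder are assumptions; "a" (add to archive) and "-tzip" (ZIP format) are standard 7Zip command line options.

```python
def build_7zip_package(archive, files, seven_zip=r"C:\Tools\7Zip\7z.exe"):
    """Return a 7Zip argument list that adds `files` to a ZIP `archive`.

    Assumption: the default executable path is hypothetical.
    """
    if not files:
        raise ValueError("nothing to package")
    return [seven_zip, "a", "-tzip", archive, *files]


print(build_7zip_package(r"c:\temp\files\out\item42.zip",
                         [r"c:\temp\files\out\item42.pdf"]))
```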

b64

Installation

Generally speaking, you can install command line tools to any folder you want. In my case, I also need to sync the command line tools across multiple computers, so I keep things fairly simple. With the exception of ImageMagick and Ghostscript, all of the tools listed above are installed into their own directory under C:\Tools, e.g. the command for EXIFTool will be “C:\Tools\EXIFTool\EXIFTool.exe”.

All of the command line scripts on this site will reference this directory path so you will need to edit them to suit your own setup.
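
If you wrap the tools in a scripting language, one way to keep that edit in a single place is to build every tool path from one root setting. The sketch below is only a suggestion; the names TOOLS_ROOT and tool_path are hypothetical, and ntpath is used so the Windows-style path is built the same way on any platform.

```python
import ntpath

TOOLS_ROOT = r"C:\Tools"  # change this once to suit your own setup


def tool_path(folder, exe, root=TOOLS_ROOT):
    """Return the full path of `exe` in its own folder under the tools root."""
    return ntpath.join(root, folder, exe)


print(tool_path("EXIFTool", "EXIFTool.exe"))  # prints C:\Tools\EXIFTool\EXIFTool.exe
```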

Temporary files

Many of the processes described on this site require the creation of intermediate image files and metadata files. I use a separate temporary directory for scripts as it is often necessary to delete all of the files in a directory. Scripts downloaded from this site will reference the following directories, which you may want to customise for your own purposes:

  • c:\temp
    • scripts exported by databases
    • output files from scripts for tracking script progress
  • c:\temp\meta
    • temporary metadata files for file packaging (database export)
    • metadata collected by scripts (database import)
  • c:\temp\files\in
    • individual processed files requiring further processing
  • c:\temp\files\out
    • combined PDFs
    • packaged files for delivery to clients
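
The directory layout above can be set up and cleaned with a short sketch like the one below. The function names are hypothetical and the root is passed in as a parameter so you can point it somewhere other than c:\temp.

```python
import os

# Sub-directories relative to the temp root, matching the list above.
TEMP_SUBDIRS = [(), ("meta",), ("files", "in"), ("files", "out")]


def ensure_temp_tree(root):
    """Create the temp root and its sub-directories if they do not exist."""
    for parts in TEMP_SUBDIRS:
        os.makedirs(os.path.join(root, *parts), exist_ok=True)


def clear_files(directory):
    """Delete every file directly inside `directory`, leaving sub-folders alone."""
    removed = 0
    for entry in os.scandir(directory):
        if entry.is_file():
            os.remove(entry.path)
            removed += 1
    return removed
```

For example, ensure_temp_tree(r"c:\temp") followed by clear_files(r"c:\temp\meta") would create the tree and then empty the metadata folder before a new run.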