Command line tools for digitisation
This blog refers to a number of command line tools that the University Digitisation Centre (UDC) uses to automate digitisation processes, from manipulating images and PDFs to embedding metadata into files. This page lists those tools and briefly describes how we use them.
Primary command line digitisation tools
EXIFTool
- Website: http://www.sno.phy.queensu.ca/~phil/exiftool/
- Primary function: Read and write metadata
EXIFTool is the backbone of many of UDC’s workflows. Many of the decisions made in our workflows are based on attributes of the images, so rather than just list files in a directory we use EXIFTool to provide a range of additional metadata. We also use it to write as much metadata as practical/available into the files. Apart from reading and writing metadata we also use it to move files around rather than using copy/move commands because:
- it creates directory trees as required
- we get to embed/update the metadata in the same operation
- it will not overwrite an existing file, so a move command cannot accidentally replace one
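The move-with-metadata approach listed above can be sketched in Python as follows. This is an illustrative sketch rather than a production script: it only builds the argument list. Writing the `Directory` tag and the `-d` date format are standard ExifTool features, but the use of `DateTimeOriginal` as the sort key, the `XMP-dc:Rights` value and the dated folder layout are assumptions made for the example.

```python
# Path follows the C:\Tools convention described under Installation.
EXIFTOOL = r"C:\Tools\EXIFTool\EXIFTool.exe"


def build_move_command(source_dir, dest_root, rights=None):
    """Build an ExifTool command that moves images into dated folders.

    Writing the Directory tag moves each file. ExifTool creates any
    missing folders and refuses to overwrite a file that already exists.
    """
    cmd = [EXIFTOOL]
    if rights is not None:
        # Hypothetical metadata value, written in the same pass as the move.
        cmd.append(f"-XMP-dc:Rights={rights}")
    cmd += [
        "-d", f"{dest_root}/%Y/%Y-%m",   # the date format doubles as the folder layout
        "-Directory<DateTimeOriginal",   # assumed sort key: the capture date
        str(source_dir),
    ]
    return cmd


if __name__ == "__main__":
    print(build_move_command(r"C:\temp\files\in", "C:/temp/files/out",
                             rights="Example rights statement"))
```

On a machine with ExifTool installed, the returned list can be passed to `subprocess.run(cmd, check=True)`.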
PDFToolkit (PDFtk)
- Website: https://www.pdflabs.com/tools/pdftk-server/
- Primary function: Combining and splitting PDF documents
Our PDF workflow involves creating single page PDFs from scanned images (with or without OCR) and then merging the pages for each item into a single PDF document. PDFtk was the first utility we used for that.
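As a rough illustration of the merge step, the Python sketch below builds a PDFtk `cat` command for a set of single page PDFs. The install path and the assumption that page files have zero-padded names (so that sorting by name gives page order) are not from the workflow itself.

```python
PDFTK = r"C:\Tools\PDFtk\pdftk.exe"  # assumed install location


def build_merge_command(page_pdfs, output_pdf):
    """Return a PDFtk command that concatenates single page PDFs in name order."""
    if not page_pdfs:
        raise ValueError("at least one page PDF is required")
    # Assumes zero-padded names (page_001.pdf, page_002.pdf, ...).
    ordered = sorted(str(p) for p in page_pdfs)
    return [PDFTK, *ordered, "cat", "output", str(output_pdf)]


if __name__ == "__main__":
    print(build_merge_command(["page_002.pdf", "page_001.pdf"], "item.pdf"))
```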
QPDF
QPDF does essentially the same things as PDFtk, with the important difference that linearisation doesn’t strip XMP metadata. I haven’t replaced PDFtk entirely, as QPDF has some limitations on the number of pages it will merge.
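For comparison, an equivalent merge with QPDF might look like the sketch below. The `--empty`, `--pages` and `--linearize` options are part of QPDF's command line, but the install path and the function name are assumptions, and the sketch makes no attempt to work around the page-count limitation mentioned above.

```python
QPDF = r"C:\Tools\QPDF\bin\qpdf.exe"  # assumed install location


def build_qpdf_merge_command(page_pdfs, output_pdf, linearize=True):
    """Return a QPDF command that merges PDFs, optionally linearising the result."""
    if not page_pdfs:
        raise ValueError("at least one page PDF is required")
    cmd = [QPDF]
    if linearize:
        cmd.append("--linearize")
    # --empty starts from a blank document; --pages ... -- lists the inputs in order.
    cmd += ["--empty", "--pages", *(str(p) for p in page_pdfs), "--", str(output_pdf)]
    return cmd


if __name__ == "__main__":
    print(build_qpdf_merge_command(["page_001.pdf", "page_002.pdf"], "item.pdf"))
```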
ImageMagick
- Website: http://www.imagemagick.org/
- Primary function: Creating derivative image sizes and formats
ImageMagick is an extremely powerful image processing tool that is ideal for automating processing tasks without requiring Photoshop. Our main use is for converting file formats and resizing images to suit clients’ needs. I’ve also explored automating other image processing steps to reduce the need for manual processing in Photoshop, e.g. removing black borders around scanned items.
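A derivative-image command of the kind described above could be assembled as in this sketch. It assumes ImageMagick 7 (the `magick` executable) is on the PATH. The 2000 pixel limit and 10% fuzz value are placeholder settings, and `-trim` only removes borders that are close in colour to the image corners.

```python
MAGICK = "magick"  # assumes ImageMagick 7 is on the PATH


def build_derivative_command(source, dest, max_px=2000, trim_border=False, fuzz_percent=10):
    """Return an ImageMagick command that converts and downsizes an image.

    The output format is taken from the extension of dest.
    """
    cmd = [MAGICK, str(source)]
    if trim_border:
        # Trim near-uniform edges (e.g. a black scanner border), then reset the canvas.
        cmd += ["-fuzz", f"{fuzz_percent}%", "-trim", "+repage"]
    # The trailing ">" means "only shrink images larger than this size".
    cmd += ["-resize", f"{max_px}x{max_px}>", str(dest)]
    return cmd


if __name__ == "__main__":
    print(build_derivative_command("scan_0001.tif", "scan_0001.jpg", trim_border=True))
```

Passing the list to `subprocess.run` avoids the shell, so the `>` in the resize geometry needs no extra quoting.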
Ghostscript (GPL)
- Website: http://www.ghostscript.com/
- Primary function: Used by ImageMagick for reading/writing PDF files
- Secondary function: Converting scanned PDFs to images for reprocessing.
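Converting a scanned PDF back to images could be sketched as below. The `gswin64c` executable name (the console build on 64-bit Windows), the `tiff24nc` output device, the 300 dpi default and the output file pattern are assumptions chosen for illustration.

```python
GS = "gswin64c"  # console Ghostscript on 64-bit Windows, assumed to be on the PATH


def build_rasterise_command(pdf, out_dir, dpi=300):
    """Return a Ghostscript command that writes one 24-bit TIFF per PDF page."""
    out_pattern = f"{out_dir}/page-%04d.tif"  # %04d is replaced by the page number
    return [
        GS,
        "-dNOPAUSE", "-dBATCH",   # run unattended and exit when finished
        "-sDEVICE=tiff24nc",      # 24-bit RGB TIFF output
        f"-r{dpi}",               # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf),
    ]


if __name__ == "__main__":
    print(build_rasterise_command("scan.pdf", "C:/temp/files/in"))
```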
XPDF
- Website: https://www.xpdfreader.com/
- Primary function: Extracting text from OCRed PDFs
A new addition (for us) to this set. I’ll be exploring the use of this tool along with Tesseract for providing additional text layout options for OCRed PDFs.
Additional digitisation applications
Abbyy Recognition Server
We have recently upgraded our OCR server to use Recognition Server with a licence that provides unlimited processing stations. All of our office PCs run as processing stations enabling OCR speeds of around 95% of file upload speeds!
The shift to a subscription licence has significantly increased the cost of this software.
Nuance Omnipage Professional
- Website: http://www.nuance.com/for-individuals/by-product/omnipage/index.htm
- Primary function: Optical character recognition and form data extraction
Omnipage was our original OCR software but is not stable enough to run as a server (it’s a desktop product so it’s not designed for that). We still use it for smaller OCR projects for researchers and for form data extraction from paper surveys.
Tesseract
- Website: https://github.com/UB-Mannheim/tesseract/wiki (they have a nice Windows installer)
- Primary function: Optical character recognition at an acceptable price for researchers.
I am currently testing command line pipelines for Tesseract. Stay tuned for more info.
PDF-PLOP
- Website: https://www.pdflib.com/products/plop/
- Primary function: Digitally sign PDF files
This tool was brought into production from one of our business document scanners. The only function we use is digitally signing PDFs when that is required.
Miscellaneous command line tools
Windows Resource Kit
- Website: https://www.microsoft.com/en-au/download/details.aspx?id=17657
- Primary function: Additional command line functions not available in Windows CMD
I was in need of a way to pause a script for a period of time so I downloaded WRK just for “sleep.exe” but some of the other utilities have come in handy on occasions.
7Zip
- Website: http://www.7-zip.org/
- Primary function: Scripted creation of ZIP archives
While we normally use the GUI version of 7Zip for creating ZIP files to transfer to clients, we are currently working on automating the packaging process, which will utilise the command line version.
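A scripted packaging step might build a 7-Zip command along these lines. The `a` (add) command and `-tzip` switch are standard 7-Zip usage. The install path and the file names are hypothetical.

```python
SEVENZIP = r"C:\Tools\7Zip\7z.exe"  # assumed install location


def build_zip_command(archive, files):
    """Return a 7-Zip command that adds the given files to a ZIP archive."""
    if not files:
        raise ValueError("nothing to package")
    # "a" adds files to an archive; -tzip forces ZIP rather than 7z format.
    return [SEVENZIP, "a", "-tzip", str(archive), *(str(f) for f in files)]


if __name__ == "__main__":
    print(build_zip_command(r"C:\temp\files\out\delivery.zip",
                            [r"C:\temp\files\out\item.pdf"]))
```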
b64
- Website: https://sourceforge.net/projects/base64/
- Primary function: Encode files for including in XML data exports.
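To show what encoding a file for an XML export involves, here is a small sketch that uses Python's standard `base64` module in place of the b64 utility. The `file` element and its `name` and `encoding` attributes are made up for the example and do not reflect any particular export schema.

```python
import base64
import xml.etree.ElementTree as ET


def file_element(name, data):
    """Wrap raw file bytes in a (hypothetical) XML element as base64 text."""
    elem = ET.Element("file", {"name": name, "encoding": "base64"})
    elem.text = base64.b64encode(data).decode("ascii")
    return elem


def element_bytes(elem):
    """Recover the original bytes from an element built by file_element."""
    return base64.b64decode(elem.text)


if __name__ == "__main__":
    elem = file_element("thumb.jpg", b"example bytes")
    print(ET.tostring(elem, encoding="unicode"))
```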
Installation
Generally speaking, you can install command line tools to any folder you want. In my case, I also need to sync the command line tools across multiple computers, so I keep things fairly simple. With the exception of ImageMagick and Ghostscript, all of the tools listed above are installed into their own directory under C:\Tools, e.g. the command for EXIFTool will be “C:\Tools\EXIFTool\EXIFTool.exe”.
All of the command line scripts on this site will reference this directory path so you will need to edit them to suit your own setup.
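If you drive the tools from a script of your own, one way to keep that edit in a single place is to derive every tool path from one root setting. The sketch below assumes the "one folder per tool" layout described above. Only the EXIFTool path is taken from this page, and the helper name is hypothetical.

```python
from pathlib import PureWindowsPath

TOOLS_ROOT = PureWindowsPath(r"C:\Tools")  # change this to suit your own setup


def tool_path(folder, exe=None):
    """Return the full path of a tool installed in its own folder under TOOLS_ROOT.

    By default the executable is assumed to share the folder's name.
    """
    exe = exe or f"{folder}.exe"
    return str(TOOLS_ROOT / folder / exe)


if __name__ == "__main__":
    print(tool_path("EXIFTool"))
```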
Temporary files
Many of the processes described on this site require the creation of intermediate image files and metadata files. I use a separate temporary directory for scripts as it is often necessary to delete all of the files in a directory. Scripts downloaded from this site will reference the following directories, which you may want to customise for your own purposes:
- c:\temp
  - scripts exported by databases
  - output files from scripts for tracking script progress
- c:\temp\meta
  - temporary metadata files for file packaging (database export)
  - metadata collected by scripts (database import)
- c:\temp\files\in
  - individual processed files requiring further processing
- c:\temp\files\out
  - combined PDFs
  - packaged files for delivery to clients
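The directory layout above, and the habit of clearing it between runs, could be set up with a sketch like the following. It empties and recreates only the three subdirectories and leaves files in the root alone. The function name is hypothetical, and the demonstration runs against a throwaway directory rather than c:\temp.

```python
import shutil
import tempfile
from pathlib import Path

# Subdirectories of the temporary root, matching the list above.
SUBDIRS = ["meta", "files/in", "files/out"]


def reset_temp_tree(root):
    """Delete and recreate the working subdirectories under root."""
    root = Path(root)
    folders = [root / sub for sub in SUBDIRS]
    for folder in folders:
        if folder.exists():
            shutil.rmtree(folder)      # removes the folder and everything in it
        folder.mkdir(parents=True)     # recreates it, including parent folders
    return folders


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so nothing real is deleted.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in reset_temp_tree(tmp):
            print(folder)
```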