Cooking with DOS: One [digitisation] script to rule them all!

Over the years I’ve developed a range of Windows Command Line (CMD) scripts to automate stages of our image processing, along with database tools to join these together into a managed workflow. This year I’m going to try something different and build a single CMD script that takes a bunch of images from a scanned item and turns them into a complete digital package in our archive. Regardless of whether I succeed or fail, there will be invaluable lessons to be learnt along the way. Join me for the ride as I work through each of the stages.

The workflow.

My inspiration for our archiving schema comes from the Internet Archive and the list of downloadable files available for each item, e.g. “The mammals of Australia”. The final file structure will be very similar to this and will include embedded metadata in the images and PDFs.

Embedding metadata requires some preparation of files, and to this end I’m building a spreadsheet tool for LibreOffice to remap and prepare the metadata, with the final sidecar files and folder setup being done by a small ExifTool script. We currently do this with a FileMaker Pro database, but that’s highly customised for our specific needs and not easy to share. For every item to be digitised, a folder will be created with an XMP sidecar file in it containing the metadata to be embedded.

The digitisation for this workflow will be done with a digital camera. A DNG and a TIF of each image will be produced from the camera raw images, and the images for each item will be sorted into their respective item folders. For the purposes of this project the files will be moved into their item folders manually, with no renaming required as long as the image order is reflected by the alphanumeric sort order of the filenames. Alternative sorting options will also be explored.
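Relying on the alphanumeric sort order has one well-known catch: unpadded numbers. The sketch below compares a plain sort with a "natural" sort that treats runs of digits as numbers. It is written in Python purely to illustrate the logic and is not part of the CMD script; the filenames and the `natural_key` helper are hypothetical.

```python
import re

def natural_key(name):
    """Split a filename into text and number chunks so 'page2' sorts before 'page10'."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at the odd indices
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]

# Hypothetical filenames, assumed only for this example
files = ["page10.tif", "page2.tif", "page1.tif"]

plain_order = sorted(files)                     # character-by-character comparison
natural_order = sorted(files, key=natural_key)  # digit runs compared as numbers

print(plain_order)    # page10 lands between page1 and page2
print(natural_order)  # page1, page2, page10
```

If the filenames are zero-padded to a fixed width (page001, page002, ...), the two orders agree and a plain sort is enough.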

The main script will be designed as a “drag and drop” script initially. Drag an item folder onto the script and sit back and watch … or get back to scanning.
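When a folder is dropped onto a script in Windows, its path is passed to the script as the first command-line argument (`%1` in a CMD script). The check a drag and drop script needs before doing any work looks roughly like this. It is sketched in Python for illustration only; the function name `item_folder_from_args` is hypothetical.

```python
import sys
from pathlib import Path

def item_folder_from_args(argv):
    """Return the dropped folder as a Path, or None if it is missing or not a folder."""
    if len(argv) < 2:
        return None  # script was started without anything dropped on it
    folder = Path(argv[1])
    return folder if folder.is_dir() else None

if __name__ == "__main__":
    folder = item_folder_from_args(sys.argv)
    if folder is None:
        print("Nothing to do: drop an item folder onto the script.")
    else:
        # In this sketch the folder name stands in for the item identifier
        print(f"Processing item folder: {folder.name}")
```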

The end result

Whilst our archive schema doesn’t have all of the filetypes that the Internet Archive has, I will build the script to create as many types as practical for the sake of demonstration. This will include:

Image formats
- Multipage GIF thumbnail file
- DNG (subfolder)
- TIFF (subfolder)
- JXL (subfolder, experimental alternative to TIFF)
- JPG (subfolder)
OCR formats (via Tesseract)
- Colour PDF
- Bitonal BW PDF
- hOCR HTML (gz)
- Per image ALTO XML (tar.gz)
- Per image TXT (tar.gz)
- Composite TXT (gz)
File metadata
- EXIF metadata (tab-delimited)
- File list (txt, prior to creating compressed archives)
- Checksums (xml, after creating compressed archives)
Metadata
- Dublin Core XML
- MARC XML
- XMP
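To give a feel for the checksum step in the list above, here is a minimal sketch that walks an item folder and builds an XML listing of every file with its checksum. It is written in Python for illustration only; the XML element names, the choice of SHA-256 and the function names are all assumptions, not the format the final script will use.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large image files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_manifest(item_folder, algorithm="sha256"):
    """Build an XML element listing each file under item_folder with its checksum.

    The element and attribute names are hypothetical, not a published schema.
    """
    item_folder = Path(item_folder)
    root = ET.Element("files", {"algorithm": algorithm})
    for path in sorted(p for p in item_folder.rglob("*") if p.is_file()):
        # Store paths relative to the item folder, with forward slashes
        name = path.relative_to(item_folder).as_posix()
        entry = ET.SubElement(root, "file", {"name": name})
        entry.text = file_checksum(path, algorithm)
    return root
```

Calling `ET.tostring(checksum_manifest(folder), encoding="unicode")` returns the listing as a string. Running this after the compressed archives are created, as in the list above, means the archives themselves are covered by the checksums.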

The item folder will (optionally) be moved to a final archive location in a nested folder structure based on the item identifier.
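One way such a nested structure could work is to use the leading characters of the identifier as the intermediate folder names. The sketch below, in Python for illustration only, assumes two levels of two characters each; the scheme, the example identifier and the archive root are hypothetical and may differ from what the final script does.

```python
from pathlib import PureWindowsPath

def nested_archive_path(archive_root, identifier, levels=2, width=2):
    """Return archive_root\\<chunk 1>\\<chunk 2>\\...\\<identifier>.

    Each chunk is the next `width` characters of the identifier (assumed scheme).
    """
    chunks = [identifier[i * width:(i + 1) * width] for i in range(levels)]
    chunks = [chunk for chunk in chunks if chunk]  # short identifiers give fewer levels
    return PureWindowsPath(archive_root, *chunks, identifier)

# Hypothetical identifier and archive root
print(nested_archive_path("D:/archive", "abc123"))
```

Nesting like this keeps any single folder from accumulating thousands of item folders, which helps when browsing the archive in a file manager.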

Stage by stage

I’ll link in posts for each of the individual stages as they’re completed.

Software and environment setup
Metadata preparation and setup
Passing variables to the script
File naming and folder structure
Creating derivative images
Optical Character Recognition (OCR)
Gathering file metadata

January 10, 2025

Posted by

Ben Kreunen

Digitisation Lab