Cooking with DOS: One [digitisation] script to rule them all!
Over the years I’ve developed a range of Windows Command Line (CMD) scripts to automate stages of our image processing, along with database tools to join these together into a managed workflow. This year I’m going to try something different and build a single CMD script that takes a bunch of images from a scanned item and turns them into a complete digital package in our archive. Regardless of whether I succeed or fail, there will be invaluable lessons to be learnt along the way. Join me for the ride as I work through each of the stages.
The workflow
My inspiration for our archiving schema comes from the Internet Archive and the list of downloadable files available for each item, e.g. “The mammals of Australia”. The final file structure will be very similar to this and will include embedded metadata in the images and PDFs.
Embedding metadata requires some preparation of files, and to this end I’m building a spreadsheet tool for LibreOffice to remap and prepare the metadata, with the final sidecar files and folder setup being done by a small ExifTool script. We currently do this with a FileMaker Pro database, but that’s highly customised for our specific needs and not easy to share. For every item to be digitised, a folder will be created with an XMP sidecar file in it containing the metadata to be embedded.
The digitisation for this workflow will be done with a digital camera. A DNG and a TIFF of each image will be produced from the camera raw files, and the images for each item will be sorted into their respective item folders. For the purposes of this project the files will be moved into their item folders manually. No renaming is required as long as the image order is reflected by the alphanumeric sort order of the filenames. Alternative sorting options will also be explored.
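To illustrate the sort-order requirement, here is a minimal sketch in Python (not the CMD script itself) comparing plain alphanumeric ordering with a "natural" ordering that compares digit runs as numbers. The filenames and the `natural_key` helper are hypothetical. The point is that unpadded numbers can sort out of page order, so zero-padded filenames or a natural sort are the safer options.

```python
import re

def natural_key(name):
    """Split a filename into text and integer chunks so digit runs compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

# Hypothetical filenames with unpadded page numbers.
names = ["page10.tif", "page2.tif", "page1.tif"]

alphanumeric = sorted(names)              # plain string comparison
natural = sorted(names, key=natural_key)  # digit runs compared as numbers

print(alphanumeric)  # page10.tif lands before page2.tif
print(natural)       # page1, page2, page10
```

If the camera or raw converter already writes zero-padded sequence numbers, the two orderings agree and no special handling is needed.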
The main script will be designed as a “drag and drop” script initially. Drag an item folder onto the script and sit back and watch … or get back to scanning.
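When a folder is dropped onto a script in Windows, its path is handed to the script as the first command-line argument (`%1` in a CMD script). The sketch below shows the same idea in Python purely as an illustration. The function name is hypothetical, and using the folder name as the item identifier is an assumption, not something the workflow has settled.

```python
import sys
from pathlib import Path

def item_folder_from_args(argv):
    """Return the dropped item folder, or raise ValueError with a readable reason."""
    if len(argv) < 2:
        raise ValueError("no folder given: drag an item folder onto the script")
    folder = Path(argv[1])
    if not folder.is_dir():
        raise ValueError(f"not a folder: {folder}")
    return folder

if __name__ == "__main__":
    try:
        item = item_folder_from_args(sys.argv)
    except ValueError as err:
        print(err)
    else:
        # Assumption: the item identifier is simply the folder name.
        print(f"Processing item {item.name} in {item}")
```

Checking the argument up front means a stray file or an empty launch fails with a clear message before any images are touched.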
The end result
Whilst our archive schema doesn’t have all of the filetypes that the Internet Archive has, I will build the script to create as many types as practical for the sake of demonstration. This will include:
- Image formats
  - Multipage GIF thumbnail file
  - DNG (subfolder)
  - TIFF (subfolder)
  - JXL (subfolder, experimental alternative to TIFF)
  - JPG (subfolder)
- OCR formats (via Tesseract)
  - Colour PDF
  - Bitonal BW PDF
  - hOCR HTML (gz)
  - Per image ALTO XML (tar.gz)
  - Per image TXT (tar.gz)
  - Composite TXT (gz)
- File metadata
  - EXIF metadata (tab-delimited)
  - File list (txt, prior to creating compressed archives)
  - Checksums (xml, after creating compressed archives)
- Metadata
  - Dublin Core XML
  - MARC XML
  - XMP
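As a rough illustration of the file list and checksum outputs, the Python sketch below walks an item folder, lists every file by relative path, and builds a small XML document with one checksum per file. It is not the CMD script. The helper names, the `<files>`/`<file>`/`<sha256>` element layout, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def list_files(item_folder):
    """Return relative POSIX-style paths of every file under item_folder, sorted."""
    root = Path(item_folder)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large TIFFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksums_xml(item_folder):
    """Build a <files> element with one <file> child per file (hypothetical layout)."""
    files = ET.Element("files")
    for rel in list_files(item_folder):
        entry = ET.SubElement(files, "file", name=rel)
        ET.SubElement(entry, "sha256").text = sha256_of(Path(item_folder) / rel)
    return files

# Small demonstration on a throwaway folder with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "TIFF").mkdir()
    (Path(tmp) / "TIFF" / "0001.tif").write_bytes(b"example")
    (Path(tmp) / "item.txt").write_text("hello", encoding="utf-8")
    file_list = list_files(tmp)
    xml_text = ET.tostring(checksums_xml(tmp), encoding="unicode")

print(file_list)
print(xml_text)
```

Running the checksum step after the compressed archives are created, as the list specifies, means the `.gz` and `.tar.gz` files get checksums too.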
The item folder will (optionally) be moved to a final archive location in a nested folder structure based on the item identifier.
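One way such a nested structure could be derived is sketched below in Python. The scheme shown (two levels, each named after two characters of the identifier) is purely an assumption for illustration, as are the function name, the example identifier, and the archive root.

```python
from pathlib import PureWindowsPath

def archive_path(archive_root, identifier, depth=2, width=2):
    """Nest an item under `depth` levels, each named by `width` characters of the identifier."""
    if len(identifier) < depth * width:
        raise ValueError(f"identifier too short to nest: {identifier!r}")
    levels = [identifier[i * width:(i + 1) * width] for i in range(depth)]
    return PureWindowsPath(archive_root, *levels, identifier)

# Hypothetical identifier and archive root.
print(archive_path(r"D:\archive", "AB123456"))  # D:\archive\AB\12\AB123456
```

Nesting by identifier prefix keeps any single folder from accumulating thousands of item folders, while the path can still be worked out from the identifier alone.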
Stage by stage
I’ll link in posts for each of the individual stages as they’re completed.
- Software and environment setup
- Metadata preparation and setup
- Passing variables to the script
- File naming and folder structure
- Creating derivative images
- Optical Character Recognition (OCR)
- Gathering file metadata