Cooking with DOS: Reading metadata
Having discovered how simple it is to write metadata to a file, it’s now time to look at how that metadata can be reused by other people and applications.
Windows Explorer search
Windows Explorer can search the contents of any file for any text string. Embedding XMP metadata in an image is basically strapping an RDF packet to the file. RDF is text. Download the small image below and open it in a text editor to see for yourself.
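If you don’t want to open the file by hand, a quick way to confirm the packet is there is to search the image for the XMP marker from the command line (a minimal sketch; sample.jpg is just a placeholder name):
findstr /m /c:"<x:xmpmeta" sample.jpg
findstr is quite happy searching binary files for text, so it will print the file name if an XMP packet is embedded.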
But Windows is not set to search the file contents of images by default. For each file type that you embed metadata into (and that you want to search), you will need to set the indexing options to index both file properties and file contents (see this tutorial). While we’re talking about searching the contents of image files, as a slight digression you may also want to turn on the “TIFF IFilter”. Windows has had the capability to OCR TIFF images for search for quite some time, but it is disabled by default.
Listing files
Creating a list of files is an important part of many image management processes. There are numerous ways of doing this and which one is most suitable depends largely on the level of detail you require.
Dir
Getting a list of just the filenames can be done using the standard dir command with its output redirected to a text file. The following is a drag-and-drop script that lists all of the files in a directory (just the filenames) and then opens the list in your text editor.
dir /b "%~1\*.*" > list.txt
list.txt
This is simple and quick BUT the output is not guaranteed to be sorted. This can be an important consideration if you are listing files on a CIFS share (i.e. most network file shares), where entries may come back in whatever order the server supplies them.
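If a sorted list matters, a minor variation (same drag-and-drop usage) is to ask dir to sort by name:
dir /b /o:n "%~1\*.*" > list.txt
list.txt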
Robocopy
Robocopy can be used to list files using its logging options without actually copying any files. The main difference from the previous method is that the full path is included. I use robocopy for quick file lists with file sizes (in bytes) because it is quick, sorted alphabetically and formatted as tab-delimited text. The following command lists all files recursively with just the file path and file size in the list.
robocopy %1 null *.* /S /L /NDL /NC /TEE /NJH /NJS /NODD /BYTES /LOG:"LIST.txt"
list.txt
- Breakdown
/S: include subdirectories
/L: list files only (don't copy/move)
/NDL: don't list directories (i.e. only files)
/NC: don't log file classes
/TEE: display the output on screen as well as writing to the log file (so you can see that something's happening)
/NJH: don't include a job header
/NJS: don't include a job summary
/NODD: no destination directory is specified
/BYTES: write file sizes in bytes (the alternative is a number with units: KB, MB, GB etc…)
/LOG: the name of the log file
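If you also want modified timestamps in the list (handy later when matching files to item records by date), robocopy's /TS switch includes the source file time stamps in the output. A variation on the command above:
robocopy %1 null *.* /S /L /NDL /NC /TS /TEE /NJH /NJS /NODD /BYTES /LOG:"LIST.txt"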
EXIFTool: Single file
For image processing you often need to know much more about a file if you are trying to automate decisions based on different properties of the image. The range of metadata includes other image characteristics (image dimensions, bit depth, compression methods etc…) as well as identifiers of the equipment and software used to create and process the image (make, model, serial number, creation software etc…).
The Windows version of EXIFTool includes some of the command line options in the file name (it is distributed as “exiftool(-k).exe”, where the “-k” option makes it pause before closing). This makes it a handy drag-and-drop metadata reader: drag and drop a file onto it and it will open a Windows CMD window and display the metadata it reads from the file. I usually keep a copy of this file with a shortcut in the SendTo menu so that I can view the metadata for a file via the right-click context menu.
For use in scripts you should rename (a copy of) it to just “exiftool.exe” so that it doesn’t pause waiting for a key press when it finishes.
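If you prefer to keep the renamed exiftool.exe but still want a drag-and-drop reader, a small batch wrapper does the same job (a minimal sketch; it assumes exiftool.exe is on the PATH or sits next to the script):
@echo off
exiftool.exe %1
pause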
Jeffrey Friedl has an excellent online version of this tool that can be used to look at the metadata for any online image. To see this in action on a website that actually embeds metadata, check out Getty Images, copy the URL of ANY thumbnail image in the banner at the top of the page (or elsewhere on the site) and paste it into the tool.
EXIFTool: File lists
To customise the format of the metadata EXIFTool displays, it is easiest to set up a “print format” file to specify the fields to be retrieved and the structure of the document. A drag-and-drop script to recursively list all of the files in a directory (with an error log) looks something like this:
exiftool.exe -m -s -r -q -p TEMPLATE.txt %1 1> LIST.txt 2> ERROR.txt
- Breakdown and tweaks
-m: ignore minor errors and warnings
-s: use short tag names in the output
-r: recurse into subdirectories
-q: quiet mode (suppress informational messages)
-p TEMPLATE.txt: print output using the format file TEMPLATE.txt
%1: the directory that was dragged and dropped onto the script
1> LIST.txt: redirect standard output (the file list) to LIST.txt
2> ERROR.txt: redirect error messages to ERROR.txt
… with TEMPLATE.txt containing the specific tags and formatting you want to collect. Outputting errors can be useful for detecting file corruption or file naming issues (e.g. CSV files with an .XLS extension). Our basic output format file looks something like this:
#[HEAD]Directory Filename FileSize ImageWidth ImageHeight BitsPerSample PageCount Make Model
#[BODY]$Directory $Filename $FileSize# $ImageWidth $ImageHeight $BitsPerSample $PageCount $Make $Model
I’ve used spaces here to separate terms to save space but my production file uses tabs. Any lines prefixed with “#[HEAD]” will be written at the start of the file, each line with “#[BODY]” at the start is repeated for every file, and “#[TAIL]” lines are written at the end of the file. As of version 10.41, folder-level text can be specified using “#[SECT]” and “#[ENDS]”. Tag names to be written into the file are prefixed with “$”. Adding a number sign (#) at the end of FileSize returns the size in bytes, which is easier to use in calculations.
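As a small illustration of the folder-level lines (a sketch only; adjust the tags and layout to suit), a format file that prints a sub-heading for each directory could look like this:
#[SECT]== $Directory ==
#[BODY]$Filename $FileSize#
#[ENDS]
A new section starts whenever the evaluated “#[SECT]” line changes, so grouping on $Directory gives one block per folder.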
The full list of tags used for our workflows includes:
- Directory and FileName: The full path must be less than 290 characters. Verbal diarrhoea in file/folder names is to be discouraged.
- Filesize#: Used for stats, storage planning and checking against file size restrictions
- Compression: Check compression method used for images from other sources
- CreateDate, ModificationDate: Time-based matching with item records for large batches of items with small image counts.
- ImageWidth,ImageHeight: Our maximum image size for OCR = 8,000px, used for/with cropping coordinates of art prints
- XResolution, ResolutionUnit: Checking image resolution (assuming X and Y resolution are the same)
- BitsPerSample: Automate image processing steps dependent on bit depth (different for 1, 8 and 16)
- PageCount: Verify PDF page counts match image counts
- Make, Model, Artist, CreatorTool, SerialNumber: Determine the scanning device (Zeutschel use Artist instead of CreatorTool for their capture software name) and distinguish between the left/right cameras in a two-camera rig.
- JobRefID: Our job item ID gets embedded after scanning. Used to cross reference with archive identifiers and locate copies of our files on network shares even if the files have been renamed.
- Label, Rating: Some of our manual QA steps use Label and/or Rating to manually classify images during visual inspection in Adobe Bridge and then automate per-file processing for each classification.
- Title, Author: Primary metadata check. Also handy for checking a range of documents on file shares.
From this list you can see that there is a lot of information within files that can be used to guide and even automate image processing steps. Reading and writing metadata to images and PDFs provides the foundation for improving high volume, repetitive processes. UDC’s workflows were initially set up with the primary goal of embedding metadata in images and PDFs. The establishment of the data structures, processes and tools required to achieve this has made it relatively easy to add automation to many other image processing and management tasks.
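As a small example of that kind of automation (a sketch only; the output file name is just a placeholder), EXIFTool’s -if option can be used to flag files that exceed our 8,000px OCR limit before they enter the workflow:
exiftool.exe -m -r -q -if "$ImageWidth > 8000 or $ImageHeight > 8000" -p "$Directory/$Filename" %1 > OVERSIZE.txt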
EXIFTool: Reformatting metadata
Given that the output of EXIFTool is text and you have complete freedom over how to format the output file, it’s relatively straightforward to produce CSV, tab-delimited or XML files from embedded metadata. As an example, you could create a CSV file like the “automap” test file in the CSVImportPlus plugin for Omeka using this format file:
#[HEAD]Dublin Core:Title, Dublin Core:Creator, Dublin Core:Description, tags,Filenames
#[BODY]"$Title","$Creator","$Description","$keywords","$Directory/$Filename"
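Running it is the same as for the listing template above (a sketch; OMEKA.txt and import.csv are just placeholder names):
exiftool.exe -m -r -q -p OMEKA.txt %1 > import.csv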
If the folder structure used to store the files were the same as their online location, then all you’d need to do is substitute the local part of the path with the website URL. While this might sound appealing to some, the level of organisation required to embed the metadata in the first place would enable the CSV to be created more easily from the source metadata… and everyone has their metadata sorted out before they upload files to Omeka, don’t they?
References
- EXIFTool
- Command line options
- Tag names
- Geotagging and Inverse geotagging (examples of reading, writing and reformatting metadata)