Naming files and folders for digitisation

Naming files and folders is a fundamental part of managing any sort of digital data and yet 90% of people who come to use our scanners don’t have a naming schema for their image files. This post explores how I arrived at the file naming schemas used by the University Digitisation Centre (UDC) for digitisation workflows and how and why these often differ from typical research data management training.

A clever file naming schema on its own
does not constitute data management

TL;DR

What’s in a name?

The primary purpose of a name is to distinguish one thing from another. File paths (the combination of folder and file names) should therefore be unique. Since we are talking about going from an unorganised state to an organised state it helps if file names are also unique.

  1. From a data perspective, the filename is also metadata. It identifies what the file is and what it contains. The filename is an identifier.
  2. Digitisation is the process of creating a digital form of a non-digital item. The file name provides a link between the digital and non-digital form. The file name, as an identifier, is a link between the metadata of the digital and non-digital form.
  3. Digitisation is typically a high volume, highly repetitive process with very few variations. There are ample opportunities for automating file management processes. File names should be compatible with any automated file management processes.

So there you have it… 3 key principles which I use for our file naming schemas.

File naming and process automation

UDC do not have any large proprietary data management systems and yet in 2017 our small team processed ~2,000,000 images. We do this by automating as many of our file management processes as possible. This, in turn, is largely done by managing and reusing metadata of the original items and the digitised files to create command line scripts to do the processing. Command line tools offer the greatest flexibility to automate tasks at very low cost (although they require expertise to set up).

Many of the command line tools we use were originally built on Linux/Unix computers and then re-worked to work with Windows. As a result of this, the Windows versions of these tools may interpret commands in scripts using the syntax of both operating systems. The rules for formatting text in command line scripts can be thought of as grammar for computers.  In simple terms then, to avoid problems with automating stuff via command line scripts file and folder names should obey a few simple grammar rules.

Do

  1. Only use the following characters in file/folder names.
    • alphabetic characters (a-z, A-Z)
    • numbers (0-9)
    • hyphen (-)
    • underscore (_)
  2. If you can’t Tweet it, it’s too long. Keep file paths (the complete folder path and file name) shorter than 256 characters. The exact limit varies between applications but this number provides a reasonable length whilst maximising compatibility.

That’s it.
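
As a rough illustration, the two "Do" rules can be checked with a few lines of Python. This is a minimal sketch rather than the script UDC uses: the function name `check_path` is hypothetical, and it assumes the last part of the path is a file that may carry a single extension after a dot (the dot itself is not in the list above, so that allowance is my assumption).

```python
import re

ALLOWED = re.compile(r"[A-Za-z0-9_-]+")   # the four character classes from the "Do" list
DRIVE = re.compile(r"[A-Za-z]:")          # e.g. "C:" at the start of a Windows path
MAX_PATH_LENGTH = 256                     # full paths should be shorter than this


def check_path(path):
    """Return a list of problems with a file path; an empty list means it passes."""
    problems = []
    if len(path) >= MAX_PATH_LENGTH:
        problems.append(f"path is {len(path)} characters, limit is {MAX_PATH_LENGTH - 1}")

    # Split on either separator, dropping empty parts from leading or doubled slashes.
    parts = [p for p in re.split(r"[\\/]", path) if p]
    if parts and DRIVE.fullmatch(parts[0]):
        parts = parts[1:]                 # ignore the drive letter

    for index, part in enumerate(parts):
        name = part
        if index == len(parts) - 1 and "." in part:
            # Assumption: the last part is a file and may carry one extension.
            name, extension = part.rsplit(".", 1)
            if not ALLOWED.fullmatch(extension):
                problems.append(f"bad extension in {part!r}")
        if not ALLOWED.fullmatch(name):
            problems.append(f"disallowed characters in {part!r}")
    return problems


print(check_path(r"C:\scans\b1\23\45\67\b1234567_0001.tif"))
print(check_path("scans/xxx - yyy.tif"))
```

Running a check like this over a whole folder tree before starting a batch job is usually cheaper than finding the offending file halfway through.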

Don’t

  1. Break the “Do” rules
  2. Allow exceptions. It can be difficult to think of everything you’ll need in advance but changing file names for a few items can break automated processes. Either manage the exceptions with your metadata or change your naming schema and apply it to ALL of your existing files.

Yes it is possible to use spaces and other characters in file names and your computer won’t complain but don’t do it. If your file names make it impossible to run automated processes it’s your own fault and your only alternatives will require more of your time.

As an example: ImageMagick may interpret a space followed by a hyphen (e.g. “xxx – yyy.tif”) as a command option instead of part of the file name, even if the file path is enclosed in double quotes. As a result the command will fail to do anything. There is no workaround other than to rename the file or process it manually in another application.
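
If renaming is the only way out, it helps to generate the new names consistently. The sketch below proposes a replacement name by collapsing every run of disallowed characters into one underscore. The function name `safe_name` is hypothetical, it leaves the extension untouched (an assumption), and it only returns a suggestion; it does not rename anything on disk.

```python
import re


def safe_name(filename):
    """Propose a script-friendly replacement for a file name."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:                      # no dot found, so there is no extension
        stem, extension = filename, ""
    # Collapse every run of disallowed characters into a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return f"{stem}.{extension}" if extension else stem


print(safe_name("xxx – yyy.tif"))
```

Two different original names can end up with the same suggested name, so check for collisions before renaming, and record the old and new names in your metadata.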

Context in file names

The only context you really need is a link
between the physical and digital forms

Advice on this varies… this is my version.

Many people include some context of the original item in the file names of the digitised files. e.g. title, author or year of publication. In the bigger picture of data management this is usually unnecessary and irrelevant. The premise for this is that it makes finding files by browsing folders easier. While this is true, the concept is fundamentally flawed and doesn’t scale well.

  1. Context (descriptive metadata) belongs in the embedded metadata of the file.
  2. You should be managing metadata for the original items prior to (or as a part of) creating the digital version. The only context you really need is a link between the physical and digital forms. This is critical to managing your data and without it any file naming schema will be a band aid solution.
  3. If you’re finding files by browsing folders you’re doing something wrong, and doomed to inefficient processes. If you have a register of the original physical items it should be possible to automate accessing and processing the digitised versions.

The only context that should be included in the names of digitised files is context that facilitates automated file processing. Some institutions, for example, use a keyword for the format of the original (map, photo, book etc…) to automatically route a file to the relevant image processing workflow. This depends entirely on the mechanisms and systems you use to automate your processing.

Typical problems with using context in file names include:

  • the context may only be meaningful to the person who creates the schema
  • there are always exceptions which often result in inconsistent additions to some file names
  • adding many different bits of context adds more time to naming files
  • file and folder names tend to be longer which can cause problems

Folder names: think big, then think small

One consideration when planning a folder structure and folder names is to keep the number of items in any one folder down to a reasonable number. The more items there are in a folder, the longer it will take the computer to list them all when browsing, and eventually there will come a point where you will not be able to see anything within the folder (although the items will still be there). Normally I aim to have fewer than 1,000 items in a folder, with an absolute maximum of 10,000. To avoid unforeseen problems in the future, design your folder naming schema so that the maximum possible number of items in a folder is within your chosen limit.

For example, I use the bibliographic record ID from our library catalogue as part of our file identifier. This consists of a “b” and 7 digits, representing up to 10,000,000 possible records. The folder structure derived from this splits the identifier up into 2 character components. “b1234567” would result in a directory structure of “\b1\23\45\67\”, resulting in no more than 100 items in each folder.
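
Splitting an identifier this way is easy to automate. A minimal sketch, assuming the identifier length is a multiple of the chunk width (the function name `folder_parts` is hypothetical):

```python
def folder_parts(identifier, width=2):
    """Split an identifier such as 'b1234567' into fixed-width folder names."""
    if len(identifier) % width:
        raise ValueError(f"{identifier!r} does not divide into {width}-character parts")
    return [identifier[i:i + width] for i in range(0, len(identifier), width)]


parts = folder_parts("b1234567")
print(parts)                 # ['b1', '23', '45', '67']
print("\\".join(parts))      # b1\23\45\67
```

Because each level below the first is two digits, no folder can ever hold more than 100 subfolders ("00" to "99"), whatever the catalogue grows to.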

Collections, groups, projects and items

Putting files into groups based on their content (e.g. about a person, by genre etc…) is adding context to the file path. As mentioned earlier though, you should be managing metadata for the original items prior to (or as a part of) creating the digital version. The only context you really need is a link between the physical and digital forms. Using folder and file names for context only allows you to “organise” files with a limited number of contexts. Managing metadata about the files and original items in a relational database allows you to have as many contexts as you want and makes it easy to add additional contexts at any time.
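
To make the database point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample values are all hypothetical and are not UDC's actual schema. The idea is only that the item identifier is the single link, and any number of contexts hang off it.

```python
import sqlite3

# In-memory database for illustration; a real register would live in a file or server.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item (identifier TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE context (
        item_identifier TEXT REFERENCES item(identifier),
        kind  TEXT,   -- e.g. 'genre', 'person', 'project'
        value TEXT
    );
""")
db.execute("INSERT INTO item VALUES (?, ?)", ("b1234567", "Example title"))
db.executemany("INSERT INTO context VALUES (?, ?, ?)", [
    ("b1234567", "genre", "botany"),
    ("b1234567", "person", "Example Author"),
])


def find_items(kind, value):
    """Return the identifiers of all items tagged with the given context."""
    rows = db.execute(
        "SELECT item_identifier FROM context "
        "WHERE kind = ? AND value = ? ORDER BY item_identifier",
        (kind, value),
    )
    return [row[0] for row in rows]


print(find_items("genre", "botany"))
```

Adding a new kind of context later is just another row in the context table; no file or folder has to be renamed.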

Our archives, then, are merely collections of individual items (aka intellectual entities). The only additional context that we use is at the collection level but only because each collection has its own naming schema and access permissions.

Additional data types

So you’ve digitised stuff. Apart from your original digitised files you may also have:

  • the original scans
  • a composite file  (e.g. PDF)
  • an access derivative (low resolution JPEGs)
  • thumbnails
  • OCRed text, xml

Apart from these typical data types there are other types of data/metadata that could also be stored with these files to facilitate various file management processes.

Screen grab of a file listing from the Internet Archive

UDC’s archive structure is based on that used by the Internet Archive. If you look at the files for a single item you will see a number of files all named with the item identifier and the type of data that the file contains. These include various file lists and metadata for administrative purposes, digital preservation and online viewing. Image file types include the original scans, cropped derivatives for online viewing, BW and colour PDFs, DJVU, ePub etc… Apart from metadata embedded in the files there are also metadata files describing what the images are of, so that even in the absence of a data management system you can still make sense of the files in this folder. In this example there is also additional data relating to biological names extracted from the OCR data and additional metadata for use in the Biodiversity Heritage Library.

Getting back onto the topic of file naming, the file name in this example is loosely based on the title of the book but does not make sense to a person if read on its own. The file name is merely a unique identifier that provides the link to the rest of the metadata in management systems and as such the only requirement is that it must be unique. The file name is, for human purposes, irrelevant. The context of the item digitised is stored as metadata with the digitised files and can be interpreted by anyone accessing the files.

This data model is not only technologically simple, it’s also extremely FAIR… but that’s a topic for another day.

<Update: 10/09/2019>

When to include context in file names

Creating machine readable file names is great if you have some form of data management system to find and retrieve files but it fails to consider situations where, for whatever reason, people persist without file management tools. If the files are accessible by many people you may want to consider a mix of machine and human readable filenames to respectively allow for future data management tools and to “discourage” people from renaming the files.  Try to be as consistent as possible and pad numbers so that they will sort alphanumerically.

e.g. ID1234_JournalXXX_1987_vol001_num018.pdf
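
Padding is easiest to get right if the names are generated rather than typed. A minimal sketch that reproduces the example above; the function name `issue_filename` and its parameters are hypothetical:

```python
def issue_filename(item_id, journal, year, volume, number):
    """Build a mixed machine/human readable name with zero-padded numbers."""
    return f"ID{item_id}_{journal}_{year}_vol{volume:03d}_num{number:03d}.pdf"


print(issue_filename(1234, "JournalXXX", 1987, 1, 18))

# Padded names sort in issue order; unpadded ones do not.
padded = sorted(issue_filename(1234, "JournalXXX", 1987, 1, n) for n in (18, 2, 100))
unpadded = sorted(f"num{n}" for n in (18, 2, 100))
print(padded)
print(unpadded)
```

Choose the padding width from the largest number you could ever need, for the same reason the folder limits are chosen up front.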

Too Long : Didn’t Read

The file name is, for human purposes, irrelevant

  • Use file names compatible with command line scripting
  • Minimise the context included in file names
  • Manage your files by managing the metadata
  • Managing metadata is the key to automating processes
  • A clever file naming schema on its own does not constitute data management
  • In the absence of data management processes and systems a compromise needs to be struck to create names that can be accessed both programmatically and manually.
