Taking care with context

Take care, particularly with secondary data

A basic lesson in data recording and management is to describe the variables in a data file carefully and accurately.  This includes describing the measurement scale used and the units for numerical variables. As far as possible, variable names or labels should provide this information directly.  In using secondary data (data originally collected by someone else) it is equally important to ensure that you understand the measurements made.  You may need to do some digging, and even go back to the original source, to make sure the details are correct.  Otherwise, errors in understanding the data can propagate.

An early study of Vitamin C

A concern during the Second World War was the provision of vitamin C to soldiers, and in this broad context the effects of ascorbic acid and orange juice were studied in animals.  One such study was:

  • Crampton E.W. (1947). The growth of the odontoblast of the incisor teeth as a criterion of vitamin C intake of the guinea pig. The Journal of Nutrition, 33(5), 491–504.

In Crampton’s study, 60 guinea pigs were given a dietary supplement of vitamin C in one of three doses (0.5, 1 or 2 mg/day) delivered in one of two ways (as ascorbic acid or orange juice).  The diet commenced when the guinea pigs were 28 days old. After 42 days on the diet, the guinea pigs were sacrificed; incisors were removed and sectioned to obtain measurements on the length of the odontoblasts — cells that are important to tooth development. There could be multiple measurements (odontoblasts) per animal, so the lengths were averaged.  The outcome of interest is the average length of the incisor odontoblasts.

Crampton’s study was described in:

  • Bliss C.I. (1952). The Statistics of Bioassay. Academic Press.

Bliss used Crampton’s data to illustrate statistical methods for the analysis of numerical outcomes in experiments involving two factors. In Bliss’ analysis of the outcome, the two factors of interest are dose and method of delivery.

The ToothGrowth dataset

The software package R includes a number of datasets that anyone can use.  These datasets are very often used in online posts to illustrate the use of R and in answers to queries from users.  The most popular dataset for this purpose is ‘mtcars’.

‘ToothGrowth’ is another such R dataset.  The source of the ‘ToothGrowth’ dataset is Bliss (1952), referenced above; Crampton (1947) is also listed as a relevant reference in the R documentation for this dataset.

R bloggers and others use the ‘ToothGrowth’ dataset to illustrate methods of visualisation and two-way analysis of variance.

The R documentation about the variables in the ‘ToothGrowth’ dataset is as follows:

  [,1]  len   numeric  Tooth length
  [,2]  supp  factor   Supplement type (VC or OJ).
  [,3]  dose  numeric  Dose in milligrams/day

The abbreviations VC and OJ are defined but there are three important issues in the description of the numerical outcome.

  • The outcome is described as ‘Tooth length’; it is not the length of the tooth. As described above, the outcome for Crampton’s study was the (average) odontoblast length. An odontoblast is a cell. To label the variable as ‘Tooth length’ is quite misleading.
  • The units of the numeric variable are not provided.  Bliss and Crampton indicate that the average length of odontoblasts is in microns.
  • The data in ‘ToothGrowth’, as sourced from Table XXVIII on page 500 of Bliss, is the average length of odontoblasts minus 20, not the original values.  This can be confirmed by examining Figure 4 on page 501 of Bliss, which plots the means in microns (rather than microns minus 20).
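
These three issues can be addressed in a few lines of R when the data are prepared for analysis. The following is a minimal sketch, assuming that 'len' holds microns minus 20 as described above. The column name 'odontoblast_len_microns' and the 'label' attribute are choices made here for illustration; they are not part of the dataset or its documentation.

```r
# ToothGrowth ships with R (package 'datasets'), so nothing needs downloading.
tg <- ToothGrowth

# Assumption: 'len' is stored as microns minus 20, so adding 20
# restores the original measurement scale.
# 'odontoblast_len_microns' is a hypothetical, self-describing column name.
tg$odontoblast_len_microns <- tg$len + 20

# Attach a plain-language description so it travels with the data.
attr(tg$odontoblast_len_microns, "label") <-
  "Average odontoblast length (microns)"

# Inspect the structure and the restored values.
str(tg)
summary(tg$odontoblast_len_microns)
```

Naming the variable for what was measured, and including the units, means the information is visible wherever the column is used, rather than only in separate documentation.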

Examples of online use of the ‘ToothGrowth’ dataset pay very limited attention to these issues.  Few users refer to the length of odontoblasts rather than tooth length; in one example the definition is first noted correctly but the outcome is subsequently labelled ‘teeth growth’. Some users note that the measurement units for the outcome variable are not specified, but go no further.  In another example the units are assumed to be mm, an implausible scale for odontoblasts (cells) and also for tooth length (many of the observed values are too large for guinea pig incisor lengths).

You might wonder why Bliss tabulated the data as ‘microns minus 20’.  Bliss was demonstrating the computational procedures for analysis of these data at a time when analysis was carried out ‘by hand’, without the aid of a computer. It’s likely that Bliss rescaled the data to make the computation less onerous.  However, his visualisation of the data is on the original scale; he did not lose sight of the original measurement scale.

Here is a plot of the data on the original scale:
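
A plot of this kind could be drawn with a short R sketch such as the following. It assumes that adding 20 to 'len' restores microns, as discussed above, and it plots group means against dose for each delivery method. The name 'len_microns', the axis labels and the choice of an interaction plot are illustrative choices made here, not a reproduction of any particular figure.

```r
# ToothGrowth is built into R (package 'datasets').
tg <- ToothGrowth

# Assumption: undo the 'minus 20' rescaling to get back to microns.
# 'len_microns' is a hypothetical column name used for this sketch.
tg$len_microns <- tg$len + 20

# Mean odontoblast length for each supplement-by-dose combination.
cell_means <- tapply(tg$len_microns, list(tg$supp, tg$dose), mean)
print(cell_means)

# Plot the means against dose, one line per delivery method,
# with axis labels that state what was measured and in what units.
with(tg, interaction.plot(
  x.factor     = factor(dose),
  trace.factor = supp,
  response     = len_microns,
  type         = "b",
  pch          = c(1, 16),
  xlab         = "Dose of vitamin C (mg/day)",
  ylab         = "Mean odontoblast length (microns)",
  trace.label  = "Supplement"
))
```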

What’s the fuss about?

You may be thinking — why does this matter?  The data is only being used as an example.  People won’t even worry about the units. People won’t realise that the data have been rescaled.

Here are some consequences of losing sight of these details.

  • It suggests that details about measurement don’t matter. They do. Crampton’s work was careful, time consuming and difficult.
  • It discourages a focus on the practical interpretation of the results in terms of the quantities estimated and their uncertainty.
  • It makes it harder to check if the results make intuitive sense.  (What are typical lengths of guinea pig teeth? Could the scale be mm?)
  • It encourages a reliance on hypothesis testing in drawing statistical inferences.  If inferences can’t be readily described in terms of the mean differences on a known scale (such as microns), analysts may fall back on generic claims of statistical significance.
  • It fails to encourage care and attention to detail more generally.
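
To illustrate reporting results as quantities on a known scale, the following minimal sketch estimates the mean odontoblast length in microns for each group, and the difference between delivery methods at the lowest dose with a 95% confidence interval. It assumes that adding 20 to 'len' restores microns. The name 'len_microns' and the choice of a Welch two-sample interval (the default of 't.test') are made here for illustration; this is not the analysis used by Bliss.

```r
# ToothGrowth is built into R (package 'datasets').
tg <- ToothGrowth

# Assumption: 'len' is microns minus 20; 'len_microns' is a hypothetical name.
tg$len_microns <- tg$len + 20

# Group means on the original scale, for each delivery method and dose.
aggregate(len_microns ~ supp + dose, data = tg, FUN = mean)

# Compare delivery methods at the lowest dose (0.5 mg/day).
low <- subset(tg, dose == 0.5)
fit <- t.test(len_microns ~ supp, data = low)  # Welch interval by default

# Estimated group means and a 95% confidence interval for the
# difference in means (OJ minus VC), both in microns.
fit$estimate
fit$conf.int
```

Adding a constant leaves the estimated difference and its interval unchanged, but the group means themselves are only interpretable once the 20 microns are restored and the units are stated.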