Missing data
The term ‘missing data’ refers to data which was intended to have been collected but was not. Missing data occurs commonly across a range of quantitative disciplines. Analysing datasets that have missing data requires extra care and consideration to produce correct results.
An analysis which uses only the cases with no missing data on any of the variables of interest is called a complete case analysis. If there is a substantial amount of missing data, this can substantially reduce the number of cases available for analysis.
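To get a feel for how quickly complete cases can disappear, here is a minimal sketch. It assumes, purely for illustration, that each variable is missing independently with the same probability; the function name and the numbers are hypothetical, and real missingness is rarely this tidy.

```python
def complete_case_fraction(p_missing: float, n_variables: int) -> float:
    """Expected fraction of cases with no missing values.

    Assumes each variable is missing independently with probability
    p_missing (an illustrative simplification, not a general result).
    """
    return (1.0 - p_missing) ** n_variables


# Hypothetical study: 10 variables, each missing for 5% of cases.
fraction = complete_case_fraction(0.05, 10)
print(f"Expected complete cases: {fraction:.1%}")
```

Under these assumptions, a modest 5% of missing values per variable leaves only about 60% of cases for a complete case analysis.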
An historical approach to this problem was to replace the missing values of a numerical variable with, for example, the mean of the observed values. Replacing every missing value of a variable with the same value is not a good approach; for one thing, it artificially reduces the variation in the final dataset, which is then unlikely to reflect the real underlying variation.
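The loss of variation under mean imputation can be seen in a small simulation. This is a sketch on made-up data: the distribution, sample size and 40% missingness rate are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)

# Simulated "true" values; in a real study these would not all be known.
true_values = [random.gauss(50, 10) for _ in range(2000)]

# Remove roughly 40% of values completely at random (None marks missing).
observed = [v if random.random() > 0.4 else None for v in true_values]
seen = [v for v in observed if v is not None]

# Mean imputation: every missing value is replaced by the observed mean.
fill_value = statistics.mean(seen)
imputed = [fill_value if v is None else v for v in observed]

sd_observed = statistics.stdev(seen)
sd_imputed = statistics.stdev(imputed)
print(f"SD of observed values:      {sd_observed:.2f}")
print(f"SD after mean imputation:   {sd_imputed:.2f}")
```

Because every filled-in value sits exactly at the observed mean, the mean is unchanged while the standard deviation shrinks.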
The best method for dealing with missing data depends on the underlying process causing the missingness. There is no single approach that is always the best, and the terminology commonly used to describe missingness doesn’t directly relate to the methods that perform best. One class of modern methods for dealing with missing data is known as multiple imputation.
There are common classifications for missing data mechanisms. These can be confusing, so let’s consider them next.
MCAR, MAR, MNAR?
The most common classification for missing data mechanisms is that data is either:
- Missing Completely at Random (MCAR);
- Missing at Random (MAR); or
- Missing Not at Random (MNAR).
This classification is based on concepts first introduced in Rubin (1976). These terms are widespread, but can be confusing when first encountered because they do not mean quite what you might expect them to mean.
Data is considered to be Missing Completely at Random (MCAR) if the probability of being missing does not depend on the values of any of the variables in the data, whether those values are missing or observed (Little & Rubin, 2014). In other words, the missingness cannot be related to the research question of interest (Lee et al., 2021).
Data is Missing at Random (MAR) if the probability of being missing depends only on data that was observed (Little & Rubin, 2014). Most commonly this refers to a situation where the probability that an incomplete variable is missing depends on completely observed variables, but other situations fit this definition too. This term is misleading if encountered without context, as a plain-language interpretation suggests it may mean something more like MCAR. However, MCAR is a stricter requirement than MAR: if data is MCAR, it is also considered MAR.
Data is Missing Not at Random (MNAR) if it’s not Missing at Random (Little & Rubin, 2014). At least that’s relatively straightforward. In other words, the probability of data being missing is related to what the missing values would have been, had we observed them. It is sometimes implied that MNAR data is a lost cause for statistical analysis, but we will see specific examples of MNAR mechanisms where common statistical procedures produce valid results for common questions.
Strictly speaking, these classifications refer to the entire dataset collectively, usually consisting of many variables, some of which may have missing values and some of which may not. When there are multiple variables with missing values (a common situation in real datasets), these classifications are sometimes informally applied to specific variables with missing data. The classifications also depend on the type of analysis and the set of available variables. For example, if additional variables (sometimes called auxiliary variables) that are related to the probability of the partially observed variables being missing are added to the dataset, the missing data mechanism may change from MNAR to MAR.
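The three definitions above can be made concrete with a small simulation. This is a sketch only: the variables x and y, the sample size and the missingness probabilities are all invented for illustration. Here x is always observed, y may be missing, and the two are correlated.

```python
import random
import statistics

random.seed(42)
N = 20000


def observed_mean(values, is_missing):
    """Mean of the values that were not flagged as missing."""
    return statistics.mean([v for v, m in zip(values, is_missing) if not m])


# x is always observed; y is the variable that may be missing.
x = [random.gauss(0, 1) for _ in range(N)]
y = [xi + random.gauss(0, 1) for xi in x]

# MCAR: the same missingness probability for everyone.
mcar = [random.random() < 0.35 for _ in range(N)]
# MAR: missingness in y depends only on the fully observed x.
mar = [random.random() < (0.6 if xi > 0 else 0.1) for xi in x]
# MNAR: missingness in y depends on the value of y itself.
mnar = [random.random() < (0.6 if yi > 0 else 0.1) for yi in y]

full_mean = statistics.mean(y)
mcar_mean = observed_mean(y, mcar)
mar_mean = observed_mean(y, mar)
mnar_mean = observed_mean(y, mnar)

print(f"Mean of y, no missing data: {full_mean:.3f}")
print(f"Observed mean under MCAR:   {mcar_mean:.3f}")
print(f"Observed mean under MAR:    {mar_mean:.3f}")
print(f"Observed mean under MNAR:   {mnar_mean:.3f}")
```

In this setup the mean of the observed y values stays close to the full-data mean under MCAR, but is pulled downwards under both MAR and MNAR, because larger values of y are more likely to be missing. The difference between the last two is that under MAR the variable driving the missingness (x) is available to the analyst.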
How do I know what kind of missing data mechanism I have?
Unfortunately, there’s no way to determine the missing data mechanism purely by looking at the data; you need to think about the process generating the data and why some of it is missing. Often there will be different reasons for missing data in the same variable. A causal directed acyclic graph (DAG) is a good way to reason about the relationships between the variables and their reasons for being missing, and can help you decide what analysis to use. Rather than using the DAG to determine whether your data is MCAR, MAR, or MNAR, and then using those classifications to choose an analysis, you can use the DAG directly to guide your choice of analysis method.
You can read more about this in an extended blog: Understanding missing data mechanisms using causal DAGs
Some examples of missing data mechanisms
To make these definitions more concrete, here are some examples of missing data mechanisms that may occur in real studies.
Example 1: Planned missing data
Sometimes missing data arises from a study design where some data is not collected on some participants. For example, consider a psychometric instrument with a large number of items. The items could be split up into sets which are asked at different times; for example, half at baseline and half at follow-up. Ideally this process would be randomised so each participant receives a different random split. The items which were not asked at a particular time point are missing values, and because the missingness was deliberately introduced by the experimenter using randomisation, we know that the probability of a particular item being missing is unrelated to any other variable in the study.
This is one of the few cases where we can be sure the data is MCAR.
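The randomised split described in this example could be generated along the following lines. This is a sketch under assumptions: the item and participant identifiers, the 20-item instrument and the function name are hypothetical.

```python
import random


def assign_item_split(item_ids, rng):
    """Randomly split items into a baseline half and a follow-up half."""
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "baseline": sorted(shuffled[:half]),
        "follow_up": sorted(shuffled[half:]),
    }


# A seeded generator makes the allocation reproducible and auditable.
rng = random.Random(2024)
items = [f"item_{i:02d}" for i in range(1, 21)]
participants = ["p001", "p002", "p003"]

# Each participant gets their own independent random split.
splits = {pid: assign_item_split(items, rng) for pid in participants}

for pid, split in splits.items():
    print(pid, "baseline:", ", ".join(split["baseline"]))
```

Because the allocation depends only on the random number generator, which items are missing at each time point cannot be related to any participant characteristic.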
Example 2: Missingness related to confounders only
Consider a longitudinal study where participants need to regularly travel to a site, for example a hospital, in order for their data to be recorded. Participants living in rural or remote areas may find the travel more inconvenient if they need to visit a city to participate in the study, and thus be more likely to have missing values in many variables.
If we know the location of all of the participants, this missingness mechanism would be MAR.
Example 3: Missingness related to confounders and exposure
This example is drawn from Moreno-Betancur et al. (2018). Suppose we want to investigate the relationship between maternal mental illness during childhood (the exposure) and child behaviour in subsequent years (the outcome). To estimate the causal effect of this exposure, we must control for confounding variables, such as maternal alcohol consumption, smoking, and other variables relating to the child’s behaviour.
It was considered likely that missingness in all variables was related to both maternal mental illness and the confounding variables. Since child behaviour was measured at a later time than the exposure and confounding variables, it was considered unlikely to have affected missingness in the confounders or exposure.
It was uncertain whether missingness in child behaviour could be related to the child behaviour itself — so there are two plausible missing data mechanisms to consider in this example. Both of those plausible mechanisms are MNAR.
If I know my missing data mechanism, what analysis should I do?
It depends! These are the standard results about what missing data methods are appropriate under different processes:
A complete case analysis is unbiased if the data is MCAR. This is a sufficient condition, not a necessary condition (Little, 2021): there are situations where complete case analysis may be valid, or at least provide a valid estimate of a particular quantity of interest, under MAR or MNAR. Depending on the amount of missing data, complete case analysis may be inefficient (i.e., have lower power than other methods) because only cases where no variables have missing values are used in the analysis.
Other methods of dealing with missing data include multiple imputation, inverse probability weighting, likelihood-based methods and full Bayesian methods; these are unbiased if the data is MAR or MCAR (Little, 2021). Multiple imputation is approximately equivalent to a maximum-likelihood method if the underlying model is the same (Collins et al., 2001).
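As an illustration of one of these methods, here is a minimal sketch of inverse probability weighting for estimating a mean when missingness depends only on a fully observed variable. The data are simulated and loosely modelled on the rural and urban example above; the group sizes, outcome distributions and response probabilities are all assumptions.

```python
import random
import statistics

random.seed(7)
N = 20000

# Each record: (is_rural, outcome, outcome_observed). All values are simulated.
records = []
for _ in range(N):
    rural = random.random() < 0.3
    outcome = random.gauss(12.0 if rural else 10.0, 2.0)
    # MAR: the chance of being observed depends only on location.
    observed = random.random() < (0.4 if rural else 0.9)
    records.append((rural, outcome, observed))

# Full-data mean: only knowable here because the data are simulated.
true_mean = statistics.mean([out for _, out, _ in records])

# Complete case estimate: ignores who is missing.
cc_mean = statistics.mean([out for _, out, obs in records if obs])

# Estimate the probability of being observed within each location group.
p_observed = {}
for group in (True, False):
    flags = [obs for rural, _, obs in records if rural == group]
    p_observed[group] = sum(flags) / len(flags)

# Weight each observed case by the inverse of that probability.
weighted_sum = sum(out / p_observed[rural] for rural, out, obs in records if obs)
weight_total = sum(1 / p_observed[rural] for rural, _, obs in records if obs)
ipw_mean = weighted_sum / weight_total

print(f"Full-data mean:     {true_mean:.2f}")
print(f"Complete case mean: {cc_mean:.2f}")
print(f"IPW mean:           {ipw_mean:.2f}")
```

In this setup the complete case mean is too low because rural participants, who have higher outcomes, are under-represented among the observed cases. Weighting the observed rural cases more heavily restores their share and brings the estimate back close to the full-data mean.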
In some specific MNAR situations, some of the above methods may still be valid, depending on what you’re trying to estimate (White & Carlin, 2010).
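One such situation can be sketched with a simulation: a covariate whose missingness depends on its own value (so the mechanism is MNAR) but not on the outcome. The model, coefficients and missingness probabilities below are invented for illustration, and the slope is computed by hand to keep the example self-contained.

```python
import random
import statistics

random.seed(3)
N = 20000
TRUE_SLOPE = 0.5


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return sxy / sxx


x = [random.gauss(0, 1) for _ in range(N)]
y = [2.0 + TRUE_SLOPE * xi + random.gauss(0, 1) for xi in x]

# MNAR: x is more likely to be missing when x itself is large.
x_missing = [random.random() < (0.7 if xi > 0 else 0.1) for xi in x]

# Complete cases: keep only the rows where x was observed.
cc_x = [xi for xi, m in zip(x, x_missing) if not m]
cc_y = [yi for yi, m in zip(y, x_missing) if not m]

cc_slope = ols_slope(cc_x, cc_y)
cc_mean_x = statistics.mean(cc_x)

print(f"True slope:               {TRUE_SLOPE:.2f}")
print(f"Complete case slope:      {cc_slope:.2f}")
print(f"Complete case mean of x:  {cc_mean_x:.2f} (full-data mean is about 0)")
```

Here the complete case estimate of the regression slope stays close to the true value, even though the complete case mean of x is clearly biased. Whether complete case analysis is valid therefore depends on the quantity being estimated, not just on the label attached to the mechanism.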
I have missing data – What do I actually do?
Dealing with missing data can be tricky, as the considerations outlined here suggest. Modern statistical analysis packages include methods for dealing with missing data. For example, the mice package in R is often used for multiple imputation. If you have missing data, it is best to consult with an experienced statistician. A lot can go wrong with imputing missing data, as discussed in this article:
Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., … & Carpenter, J. R. (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ, 338.
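To give a sense of what multiple imputation software does at the end of the process, here is a minimal sketch of the pooling step, commonly known as Rubin's rules, which combines the results from each imputed dataset into one estimate. The estimates and variances below are hypothetical numbers, not output from a real analysis, and the sketch omits the degrees-of-freedom calculation that full implementations include.

```python
import statistics


def pool_estimates(estimates, variances):
    """Combine per-imputation results using Rubin's rules.

    estimates: the point estimate from each imputed dataset.
    variances: the squared standard error from each imputed dataset.
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # variation across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled_estimate, total_variance = pool_estimates(estimates, variances)
print(f"Pooled estimate: {pooled_estimate:.2f}")
print(f"Pooled standard error: {total_variance ** 0.5:.3f}")
```

The between-imputation term is what distinguishes this from simply analysing one filled-in dataset: it carries the extra uncertainty due to the missing values into the final standard error.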
An interesting historical aside
Rubin (1976) is often cited as the source of this classification system, but didn’t actually introduce the terms Missing Completely at Random or Missing Not at Random, only Missing at Random. Rubin originally defined an additional condition, Observed at Random. Data which is both Missing at Random and Observed at Random is what we would now commonly refer to as Missing Completely at Random. The term Missing Completely at Random came later, in Marini et al. (1980). This history is given in Little (2021) and was also confirmed in a Tweet by Raphael Nishimura: ‘I was curious about that too and did some digging with the authors. Rod said that “Rubin’s 1976 Biometrika paper defines MAR and OAR (observed at random) but he may not have put the two together.” Don confirmed it and added that “MCAR was first formally defined in a joint paper with Marini and Olsen, I think in 1980, in a more applied paper.”’
Ignorable and non-ignorable missingness
Sometimes you may hear missing data described as “ignorable” or “non-ignorable”. These terms are also potentially misleading. “Ignorable” missing data doesn’t mean that you can just ignore the fact that you have missing data when doing an analysis. This term was defined in Rubin (1976) to mean that (1) the data is MAR, and (2) the likelihood can be factorised into a part relating to the missingness probability and a part relating to the distribution of the underlying data. Together, these provide a sufficient condition for missing data to be dealt with using likelihood-based methods.
References
Collins, L. M., Schafer, J. L., & Kam, C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.
Lee, K. J., Tilling, K. M., Cornish, R. P., Little, R. J. A., Bell, M. L., Goetghebeur, E., Hogan, J. W., & Carpenter, J. R. (2021). Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. Journal of Clinical Epidemiology, 134, 79–88.
Little, R. J. A. (2021). Missing Data Assumptions. Annual Review of Statistics and Its Application, 8(1), 89–107.
Little, R. J. A., & Rubin, D. B. (2014). Statistical analysis with missing data (Second edition). John Wiley & Sons.
Moreno-Betancur, M., Lee, K. J., Leacy, F. P., White, I. R., Simpson, J. A., & Carlin, J. B. (2018). Canonical Causal Diagrams to Guide the Treatment of Missing Data in Epidemiologic Studies. American Journal of Epidemiology, 187(12), 2705–2715.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
White, I. R., & Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine, 29(28), 2920–2931.