Some simple inferences in R

Navigating R can be daunting with the vast number of packages and options for obtaining analyses.

Here we have curated a few options for some simple inferential procedures – one and two sample inference for means – to help you get started. We assume you have some basic knowledge of R, and are providing this as a quick reference for some common, simple methods of statistical inference.

Remember to first set your working directory.

Load packages and read in data

library(ggplot2)
library(summarytools)
library(tidyverse)

In this script, the file is called MYDATA.RData. Replace this with the name of the file you wish to use.

load("MYDATA.RData")

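Alternatively, you might read in data from, for example, an Excel file using the read_excel function. This function is in the readxl package, which is installed with the tidyverse but is not attached by library(tidyverse), so call it with the readxl:: prefix or load it first with library(readxl).

MYDATA <- readxl::read_excel("MYDATA.xlsx")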

One sample inference for the mean

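In all of the code below, you will need to replace MYDATA with the name of your data frame.
You will also need to replace the placeholder variable names (such as NUMERICAL_VARIABLE and CATEGORICAL_VARIABLE) with the names of your own variables.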

Start with an appropriate plot:

ggplot(MYDATA, aes(x=NUMERICAL_VARIABLE)) +
  geom_histogram() +
  labs(x="NICE AXIS LABEL")

ggplot(MYDATA, aes(x=NUMERICAL_VARIABLE)) +
  geom_dotplot() +
  labs(x="NICE AXIS LABEL") +
  scale_y_continuous(breaks=NULL)

ggplot(MYDATA, aes(y=NUMERICAL_VARIABLE)) +
  geom_boxplot() +
  labs(y="NICE AXIS LABEL") +
  coord_flip() +
  scale_x_continuous(breaks=NULL)

Obtain the summary statistics:

descr(MYDATA, NUMERICAL_VARIABLE, stats = c("n.valid", "mean", "sd"))

A simple confidence interval for a population mean:

t.test(MYDATA$NUMERICAL_VARIABLE, conf.level=0.95)

This assumes that the data arise from a random sample from a Normal distribution.
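To see what t.test is reporting, the interval can be rebuilt from the summary statistics as the sample mean plus or minus a t quantile times the standard error. The following is a minimal sketch in base R; the built-in mtcars data set and its mpg column are stand-ins chosen only for illustration, and should be replaced with your own data frame and variable.

```r
# Stand-in data for illustration only: replace with MYDATA$NUMERICAL_VARIABLE.
x <- mtcars$mpg

n    <- sum(!is.na(x))          # number of non-missing observations
xbar <- mean(x, na.rm = TRUE)   # sample mean
s    <- sd(x, na.rm = TRUE)     # sample standard deviation

# 95% interval: mean +/- t quantile (n - 1 degrees of freedom) * standard error
t_crit     <- qt(0.975, df = n - 1)
ci_by_hand <- xbar + c(-1, 1) * t_crit * s / sqrt(n)

# The same interval as reported by t.test()
ci_t_test <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_by_hand)
print(ci_t_test)
```

The two printed intervals should agree, which is a quick check that the summary statistics and the interval describe the same data.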

Inference for the mean difference: Paired samples

If the differences are stored in the data frame, the code for one sample inference for a mean can be used.
Appropriate graphs are based on the difference scores.

If the paired data are instead stored in two separate columns, the inference can be carried out using the following code:

t.test(MYDATA$Column1, MYDATA$Column2, paired = TRUE, conf.level=0.95)

This assumes that the differences are a random sample from a Normal distribution.
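The two approaches described in this section give the same answer: a paired t-test on two columns is a one sample t-test on their differences. The sketch below illustrates this in base R, using the built-in sleep data set as a stand-in for two paired columns; the names drug1, drug2 and diffs are chosen only for this example.

```r
# Stand-in paired data for illustration only: the built-in 'sleep' data set
# records extra sleep for the same 10 subjects under two drugs.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired analysis using the two columns directly
paired_result <- t.test(drug1, drug2, paired = TRUE, conf.level = 0.95)

# One sample analysis of the differences
diffs <- drug1 - drug2
one_sample_result <- t.test(diffs, conf.level = 0.95)

# Both calls should report the same interval and test statistic
print(paired_result$conf.int)
print(one_sample_result$conf.int)
```

Computing the differences explicitly is also convenient for plotting, since the appropriate graphs are based on the difference scores.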

Inference for the difference of means: Independent samples

Start with an appropriate plot:

ggplot(MYDATA, aes(x = CATEGORICAL_VARIABLE, y = NUMERICAL_VARIABLE)) +
  geom_dotplot(binaxis = 'y', dotsize = 0.5) +
  labs(y ="NICE Y-AXIS LABEL", x = "NICE X-AXIS LABEL") +
  coord_flip()

ggplot(MYDATA, aes(x = CATEGORICAL_VARIABLE, y = NUMERICAL_VARIABLE)) +
  geom_boxplot(width = 0.4) +
  labs(y ="NICE Y-AXIS LABEL", x = "NICE X-AXIS LABEL") +
  coord_flip()

Use the following code if the numerical variable is stored in one column of the data frame and the grouping variable in a second column.

Summary statistics:

MYDATA %>%
  group_by(CATEGORICAL_VARIABLE) %>%
  descr(NUMERICAL_VARIABLE, stats = c("n.valid", "mean", "sd"))

Confidence interval and t-test, without assuming equal variances:

t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = MYDATA, conf.level=0.95)

Confidence interval and t-test, assuming equal variances:

t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = MYDATA, var.equal=TRUE, conf.level=0.95)

Both methods assume that the two groups are independent and that the data in each group arise from a random sample from a Normal distribution.
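The two calls above can be compared side by side. The following sketch uses the built-in mtcars data set as a stand-in, with mpg as the numerical variable and am (transmission type) as the grouping variable; the data frame name cars and the group labels are chosen only for this example.

```r
# Stand-in data for illustration only: replace with your own data frame.
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Without assuming equal variances (Welch), the default in t.test()
welch <- t.test(mpg ~ am, data = cars, conf.level = 0.95)

# Assuming equal variances (pooled)
pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE, conf.level = 0.95)

# Degrees of freedom: the pooled version uses n1 + n2 - 2,
# the Welch version uses an approximation that is usually not a whole number.
print(welch$parameter)
print(pooled$parameter)

# Confidence intervals for the difference in means
print(welch$conf.int)
print(pooled$conf.int)
```

When the two group standard deviations are similar, the two intervals will be close; when they differ noticeably, the version that does not assume equal variances is the safer choice.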