RStudio uses a “Working Directory”.

An R Markdown file is saved in a directory; when you knit the file, RStudio will automatically use this as the Working Directory.

Any data file you use should be in the same directory as the R Markdown file that reads in the data. In SRW, we have arranged this.

Similarly, any output files you create will be saved in the same directory.
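
If you want to check which directory R is using when you run code yourself in the console, the following commands can be used; the path shown is only a hypothetical example.

# Check the current working directory
getwd()
# Change it if necessary (replace the hypothetical path with your own)
# setwd("C:/Users/yourname/Documents/SRW")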

R uses “packages” to carry out many operations. Packages need to be installed only once, but must be loaded in every R session in which you use them, with the library() command; see below.

If you try to use a package that you have not installed, RStudio gives you a warning and an opportunity to install it.
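
If you prefer to install the packages in advance, the following can be run once in the console (not left in the R Markdown file, or the packages will be reinstalled every time you knit); the package list matches the library() calls below.

# Install the packages used in this document (needed only once per computer)
install.packages(c("tidyverse", "readxl", "gt", "broom"))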

library(tidyverse)
library(readxl)
library(gt)
library(broom)

Reading in the data

In this R Markdown file, the data are read from the Excel file analytic_data.xlsx into a data frame called EXAMPLE_DATA. Replace the file name with the name of the file you wish to use, and choose a suitable name for your data frame.

EXAMPLE_DATA <- read_excel("analytic_data.xlsx")
EXAMPLE_DATA <- EXAMPLE_DATA %>% 
  mutate_if(is.character,as.factor)

In all of the code below, you will need to replace EXAMPLE_DATA with the name of your data frame, and replace the placeholder variable names (such as NUMERICAL_VARIABLE1) with the appropriate variable names from your data.
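
For example, if (hypothetically) your file were called blood_pressure.xlsx and you wanted to call the data frame bp_data, the reading-in code would become:

bp_data <- read_excel("blood_pressure.xlsx")
bp_data <- bp_data %>% 
  mutate_if(is.character, as.factor)

You would then write bp_data wherever EXAMPLE_DATA appears below, and your own variable names in place of NUMERICAL_VARIABLE1 and so on.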

One sample inference for a mean

Start with an appropriate plot (choose one):

ggplot(EXAMPLE_DATA, aes(x=NUMERICAL_VARIABLE1)) +
  geom_histogram() +
  labs(x="NICE AXIS LABEL")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(EXAMPLE_DATA, aes(x=NUMERICAL_VARIABLE1)) + 
  geom_dotplot() +
  labs(x="NICE AXIS LABEL") +
  scale_y_continuous(breaks=NULL)
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

ggplot(EXAMPLE_DATA, aes(x=NUMERICAL_VARIABLE1)) + 
  geom_boxplot() + 
  labs(x="NICE AXIS LABEL") +
  scale_x_continuous(breaks=NULL)
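
The messages shown after the histogram and dot plot are reminders rather than errors. They can be avoided by setting a bin width that suits your data; the value 0.5 below is only an illustration.

ggplot(EXAMPLE_DATA, aes(x = NUMERICAL_VARIABLE1)) +
  geom_histogram(binwidth = 0.5) +   # choose a bin width appropriate for your data
  labs(x = "NICE AXIS LABEL")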

Summary statistics

EXAMPLE_DATA %>%
  summarise(Mean = mean(NUMERICAL_VARIABLE1),
            "Standard deviation" = sd(NUMERICAL_VARIABLE1),
            n = n()) %>%
  gt() %>%
  fmt_number(c(Mean, "Standard deviation"),
             decimals = 1)
Mean   Standard deviation    n
 4.3                  2.3   20
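
If NUMERICAL_VARIABLE1 contains missing values, mean() and sd() return NA. In that case (a sketch of standard R behaviour, not a feature of the example data) add na.rm = TRUE and count only the non-missing values:

EXAMPLE_DATA %>%
  summarise(Mean = mean(NUMERICAL_VARIABLE1, na.rm = TRUE),
            "Standard deviation" = sd(NUMERICAL_VARIABLE1, na.rm = TRUE),
            n = sum(!is.na(NUMERICAL_VARIABLE1)))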

Confidence interval and hypothesis test for a population mean

t.test(EXAMPLE_DATA$NUMERICAL_VARIABLE1, conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  EXAMPLE_DATA$NUMERICAL_VARIABLE1
## t = 8.5764, df = 19, p-value = 5.873e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.280852 5.399148
## sample estimates:
## mean of x 
##      4.34

Here is a way of creating a neater summary table of the analysis. We need the tidy() function from the broom package.

onesample <- t.test(EXAMPLE_DATA$NUMERICAL_VARIABLE1, conf.level=0.95)
tidy(onesample) %>%
    select(estimate, conf.low, conf.high, p.value) %>%
  gt() %>%
  fmt_number(c(estimate, conf.low, conf.high),
             decimals = 2) %>%
  fmt_number(p.value, decimals = 3) %>%
  cols_merge_range(conf.low, conf.high, sep = " to ") %>%
  cols_align("center", everything()) %>%
  cols_label(estimate = "Mean", conf.low = "95% CI",
              p.value = "P-value") 
Mean   95% CI          P-value
4.34   3.28 to 5.40      0.000

The \(P\)-value calculated and shown here is (formally) testing the null hypothesis that the true mean is equal to zero. For inference on a single sample, this is usually not relevant; it is included here for completeness. (A \(P\)-value displayed as 0.000 is simply very small, here about \(6 \times 10^{-8}\), and would usually be reported as \(P < 0.001\).) Sometimes a different null hypothesis value is of interest and can be tested; see the example at the end of this section.

Hence a realistic presentation of the inference here would be:

tidy(onesample) %>%
    select(estimate, conf.low, conf.high) %>%
  gt() %>%
  fmt_number(c(estimate, conf.low, conf.high),
             decimals = 2) %>%
  cols_merge_range(conf.low, conf.high, sep = " to ") %>%
  cols_align("center", everything()) %>%
  cols_label(estimate = "Mean", conf.low = "95% CI") 
Mean   95% CI
4.34   3.28 to 5.40
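
As noted above, a null hypothesis value other than zero is sometimes of interest. It can be specified with the mu argument of t.test(); the value 5 used here is purely illustrative.

t.test(EXAMPLE_DATA$NUMERICAL_VARIABLE1, mu = 5, conf.level = 0.95)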

Inference for a difference of means

Paired samples

If the differences are stored in the data file, the code for one sample inference for the mean can be used. Appropriate graphs are based on the difference scores.
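
If the differences are not stored but the two paired measurements are in separate columns, a difference variable can be created first and then plotted and analysed with the one-sample code above; a minimal sketch, assuming the columns are called NUMERICAL_VARIABLE1 and NUMERICAL_VARIABLE2:

EXAMPLE_DATA <- EXAMPLE_DATA %>%
  mutate(DIFFERENCE = NUMERICAL_VARIABLE1 - NUMERICAL_VARIABLE2)

ggplot(EXAMPLE_DATA, aes(x = DIFFERENCE)) +
  geom_histogram(binwidth = 0.5) +   # adjust the bin width to suit your data
  labs(x = "NICE AXIS LABEL (DIFFERENCES)")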

Alternatively, the statistical inference can be carried out directly when the paired data are in two separate columns, using the following code.

t.test(EXAMPLE_DATA$NUMERICAL_VARIABLE1, EXAMPLE_DATA$NUMERICAL_VARIABLE2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  EXAMPLE_DATA$NUMERICAL_VARIABLE1 and EXAMPLE_DATA$NUMERICAL_VARIABLE2
## t = 0.40192, df = 19, p-value = 0.6922
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.82048  1.21048
## sample estimates:
## mean difference 
##           0.195

Neater output:

pairedsamples <- t.test(EXAMPLE_DATA$NUMERICAL_VARIABLE1, EXAMPLE_DATA$NUMERICAL_VARIABLE2, paired = TRUE)
tidy(pairedsamples) %>%
    select(estimate, conf.low, conf.high, p.value) %>%
  gt() %>%
  fmt_number(c(estimate, conf.low, conf.high),
             decimals = 2) %>%
  fmt_number(p.value, decimals = 3) %>%
  cols_merge_range(conf.low, conf.high, sep = " to ") %>%
  cols_align("center", everything()) %>%
  cols_label(estimate = "Estimate of the mean difference", conf.low = "95% CI ",
              p.value = "P-value") 
Estimate of the mean difference   95% CI          P-value
                           0.19   −0.82 to 1.21     0.692

Note that the analysis of paired data is essentially a one-sample analysis of the differences. For this application, the \(P\)-value tests the null hypothesis that the true mean difference is zero, and this \(P\)-value is relevant to consider.
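
To see this equivalence directly, a one-sample t-test on the differences gives the same result as the paired t-test above; a sketch, using the DIFFERENCE column created in the sketch earlier in this section:

t.test(EXAMPLE_DATA$DIFFERENCE, conf.level = 0.95)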

Independent samples

Start with an appropriate plot (choose one):

ggplot(EXAMPLE_DATA, aes(x = CATEGORICAL_VARIABLE1, y = NUMERICAL_VARIABLE1)) + 
  geom_dotplot(binaxis = 'y', dotsize = 0.5) +
  coord_flip() +
  labs(x ="NICE Y-AXIS LABEL", y = "NICE X-AXIS LABEL") 
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

ggplot(EXAMPLE_DATA, aes(x = CATEGORICAL_VARIABLE1, y = NUMERICAL_VARIABLE1)) + 
  geom_dotplot(binaxis = 'y', stackdir='center', dotsize = 0.5) +
    coord_flip() +
  labs(x ="NICE Y-AXIS LABEL", y = "NICE X-AXIS LABEL")
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

ggplot(EXAMPLE_DATA, aes(y = CATEGORICAL_VARIABLE1, x = NUMERICAL_VARIABLE1)) + 
  geom_boxplot(width = 0.4) +
  labs(y ="NICE Y-AXIS LABEL", x = "NICE X-AXIS LABEL")

Use the following code to obtain summary statistics if the numerical variable is stored in one column (vector) and the grouping variable in a second column (factor).

EXAMPLE_DATA %>%
  group_by(CATEGORICAL_VARIABLE1) %>%
  summarise(Mean = mean(NUMERICAL_VARIABLE1),
            "Standard deviation" = sd(NUMERICAL_VARIABLE1),
            n = n()) %>%
  gt() %>%
  cols_align("center", everything()) %>%
  fmt_number(c(Mean, "Standard deviation"),
             decimals = 1)
CATEGORICAL_VARIABLE1   Mean   Standard deviation    n
A                        4.0                  1.4   10
B                        4.7                  2.9   10

The relevant inference about the difference between the two means can then be carried out.

Without assuming equal variances:

t.test(NUMERICAL_VARIABLE1 ~ CATEGORICAL_VARIABLE1, data = EXAMPLE_DATA, conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  NUMERICAL_VARIABLE1 by CATEGORICAL_VARIABLE1
## t = -0.7219, df = 12.804, p-value = 0.4833
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -2.957998  1.477998
## sample estimates:
## mean in group A mean in group B 
##            3.97            4.71

This is one of the (somewhat incredible!) standard R outputs where the single most important quantity of interest, namely the difference between the two sample means, is not provided directly in the output. However, we can obtain it using the tidy() function; it is simply the difference between the group means, 3.97 − 4.71 = −0.74.

independentsamples <- t.test(NUMERICAL_VARIABLE1 ~ CATEGORICAL_VARIABLE1, data = EXAMPLE_DATA, conf.level=0.95)
tidy(independentsamples) %>%
  select(estimate)
## # A tibble: 1 × 1
##   estimate
##      <dbl>
## 1    -0.74

Assuming equal variances:

t.test(NUMERICAL_VARIABLE1 ~ CATEGORICAL_VARIABLE1, data = EXAMPLE_DATA, var.equal=TRUE, conf.level=0.95)
## 
##  Two Sample t-test
## 
## data:  NUMERICAL_VARIABLE1 by CATEGORICAL_VARIABLE1
## t = -0.7219, df = 18, p-value = 0.4796
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -2.893602  1.413602
## sample estimates:
## mean in group A mean in group B 
##            3.97            4.71

Example of neater output, using the unequal variances assumption:

independentsamples <- t.test(NUMERICAL_VARIABLE1 ~ CATEGORICAL_VARIABLE1, data = EXAMPLE_DATA, conf.level=0.95)
tidy(independentsamples) %>%
    select(estimate, conf.low, conf.high, p.value) %>%
  gt() %>%
  fmt_number(c(estimate, conf.low, conf.high),
             decimals = 2) %>%
  fmt_number(p.value, decimals = 3) %>%
  cols_merge_range(conf.low, conf.high, sep = " to ") %>%
  cols_align("center", everything()) %>%
  cols_label(estimate = "Estimate of the difference of means", conf.low = "95% CI ",
              p.value = "P-value") 
Estimate of the difference of means   95% CI          P-value
                              −0.74   −2.96 to 1.48     0.483

© Statistical Consulting Centre, University of Melbourne, 2023