RStudio uses a “Working Directory”.

An R Markdown file is saved in a directory; when you knit the file, RStudio will automatically use this as the Working Directory.

Any data file you use should be in the same directory as the R Markdown file that reads in the data. In SRW, we have arranged this.

Similarly, any output files you create will be saved in the same directory.

R uses “packages” to carry out many operations. Packages need to be installed once, but loaded every time you use them, with the library() command; see below.

If you try to use a package that you have not installed, RStudio gives you a warning and an opportunity to install it.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.3.1
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.1
library(gt)
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.3.1

Reading in the data

In this R Markdown file, the data is called EXAMPLE_DATA. Replace this with the name of the file you wish to use.

EXAMPLE_DATA <- read_excel("analytic_data.xlsx")
EXAMPLE_DATA <- EXAMPLE_DATA %>% 
  mutate_if(is.character,as.factor)

In all of the code below, you will need to replace EXAMPLE_DATA with the name of your data frame. You will need to use the appropriate variable names.

Logistic regression analysis

Fit logistic regression with one categorical explanatory variable:

summary(MY.MODEL1 <- glm(BINARY_VARIABLE ~ CATEGORICAL_VARIABLE3, family="binomial", data=EXAMPLE_DATA))
## 
## Call:
## glm(formula = BINARY_VARIABLE ~ CATEGORICAL_VARIABLE3, family = "binomial", 
##     data = EXAMPLE_DATA)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                 -0.4055     0.6455  -0.628     0.53
## CATEGORICAL_VARIABLE3Small  -0.4418     0.9449  -0.468     0.64
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.898  on 19  degrees of freedom
## Residual deviance: 25.678  on 18  degrees of freedom
## AIC: 29.678
## 
## Number of Fisher Scoring iterations: 4

Obtain the coefficients and confidence intervals:

coefficients(MY.MODEL1)
##                (Intercept) CATEGORICAL_VARIABLE3Small 
##                 -0.4054651                 -0.4418328

One way to get the 95% confidence interval for the log(odds ratio) is to use a Normal approximation, and hence find estimate \(\pm\) (1.96 \(\times\) standard error). This is not the approach used here; rather, a “profile likelihood” approach is used. There is evidence that this may be a better a method generally.

confint(MY.MODEL1)
## Waiting for profiling to be done...
##                                2.5 %    97.5 %
## (Intercept)                -1.769409 0.8475084
## CATEGORICAL_VARIABLE3Small -2.380978 1.4119697

Obtain the coefficients and confidence intervals on the odds ratio scale.

or.estimate <- exp(coefficients(MY.MODEL1))
print(or.estimate)
##                (Intercept) CATEGORICAL_VARIABLE3Small 
##                  0.6666667                  0.6428571
or.inference <- exp(confint(MY.MODEL1))
## Waiting for profiling to be done...
print(or.inference)
##                                2.5 %   97.5 %
## (Intercept)                0.1704336 2.333825
## CATEGORICAL_VARIABLE3Small 0.0924601 4.104031

You can use tbl_regression to get a summary of the results on the odds ratio scale. This will provide comparisons with a baseline category.

tbl_regression(MY.MODEL1, pvalue_fun = ~ style_pvalue(.x, digits = 3), estimate_fun = ~ style_number(.x, digits = 2), exponentiate = TRUE)  %>%
 modify_header(label ~ "**Variable**", estimate ~ "**Odds ratio**") %>%
 modify_footnote(everything() ~ NA, abbreviation = TRUE)
Variable Odds ratio 95% CI p-value
CATEGORICAL_VARIABLE3
    Large
    Small 0.64 0.09, 4.10 0.640

You can also obtain the pairwise comparisons on the odds ratio scale from the emmeans package. This will provide all pairwise comparisons. In this example, the output will be the same as that above, as there are only two levels in the explanatory variable. The standard output for a categorical explanatory variable compares levels 2, 3, …, k with level 1, according to the ordered levels of the variable.

Unlike the output above, the `emmeans’ package does use the Normal approximation to get confidence intervals on the odds ratio scale, which are then transformed to the odds ratio scale. This is why you will see slight differences in the results of the two approaches.

emm.1 <- emmeans(MY.MODEL1, "CATEGORICAL_VARIABLE3", type = "response")
pair.comparisons <- pairs(emm.1, reverse = TRUE, infer=TRUE, adjust = "none")
print(pair.comparisons)
##  contrast      odds.ratio    SE  df asymp.LCL asymp.UCL null z.ratio p.value
##  Small / Large      0.643 0.607 Inf     0.101       4.1    1  -0.468  0.6401
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log odds ratio scale 
## Tests are performed on the log odds ratio scale

Neater output:

pair.comparisons %>%
  as_tibble() %>%
  select(contrast,odds.ratio, asymp.LCL, asymp.UCL, p.value) %>%
  gt() %>%
  fmt_number(c(odds.ratio, asymp.LCL, asymp.UCL),
             decimals = 2) %>%
   fmt_number(c(p.value),
             decimals = 3) %>%
  cols_merge_range(asymp.LCL, asymp.UCL, sep = " to ") %>%
  cols_align("center", everything()) %>%
  cols_label(contrast = "Contrast", odds.ratio = "Odds ratio", asymp.LCL = "95% CI ", p.value = "P-value") 
Contrast Odds ratio 95% CI P-value
Small / Large 0.64 0.10 to 4.10 0.640

Fit logistic regression with one numerical explanatory variable

summary(MY.MODEL2 <- glm(BINARY_VARIABLE ~ NUMERICAL_VARIABLE1, family="binomial", data=EXAMPLE_DATA))
## 
## Call:
## glm(formula = BINARY_VARIABLE ~ NUMERICAL_VARIABLE1, family = "binomial", 
##     data = EXAMPLE_DATA)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -0.08956    1.06236  -0.084    0.933
## NUMERICAL_VARIABLE1 -0.12489    0.23026  -0.542    0.588
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.898  on 19  degrees of freedom
## Residual deviance: 25.584  on 18  degrees of freedom
## AIC: 29.584
## 
## Number of Fisher Scoring iterations: 4

Get the estimated coefficients. The tbl_regression function in the gtsummary package is quite useful here too.

coefficients(MY.MODEL2)
##         (Intercept) NUMERICAL_VARIABLE1 
##         -0.08956265         -0.12489125

Obtain the estimate and confidence interval on the odds ratio scale

exp(coefficients(MY.MODEL2))
##         (Intercept) NUMERICAL_VARIABLE1 
##           0.9143310           0.8825929
exp(confint(MY.MODEL2))
## Waiting for profiling to be done...
##                         2.5 %   97.5 %
## (Intercept)         0.1121228 8.444078
## NUMERICAL_VARIABLE1 0.5169669 1.347382

The tbl_regression function in the gtsummary package is quite useful here too.

tbl_regression(MY.MODEL2, pvalue_fun = ~ style_pvalue(.x, digits = 3), estimate_fun = ~ style_number(.x, digits = 2), exponentiate = TRUE)  %>%
 modify_header(label ~ "**Variable**", estimate ~ "**Odds ratio**") %>%
 modify_footnote(everything() ~ NA, abbreviation = TRUE)
Variable Odds ratio 95% CI p-value
NUMERICAL_VARIABLE1 0.88 0.52, 1.35 0.588

Multiple logistic regression (add as many predictor variables as you need)

MY.MODEL3 <- glm(BINARY_VARIABLE ~ NUMERICAL_VARIABLE1 + CATEGORICAL_VARIABLE3, family="binomial", data=EXAMPLE_DATA)
summary(MY.MODEL3)
## 
## Call:
## glm(formula = BINARY_VARIABLE ~ NUMERICAL_VARIABLE1 + CATEGORICAL_VARIABLE3, 
##     family = "binomial", data = EXAMPLE_DATA)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                  0.1340     1.1784   0.114    0.909
## NUMERICAL_VARIABLE1         -0.1270     0.2347  -0.541    0.588
## CATEGORICAL_VARIABLE3Small  -0.4452     0.9526  -0.467    0.640
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.898  on 19  degrees of freedom
## Residual deviance: 25.364  on 17  degrees of freedom
## AIC: 31.364
## 
## Number of Fisher Scoring iterations: 4
coefficients(MY.MODEL3)
##                (Intercept)        NUMERICAL_VARIABLE1 
##                  0.1339841                 -0.1270463 
## CATEGORICAL_VARIABLE3Small 
##                 -0.4451727
confint(MY.MODEL3)
## Waiting for profiling to be done...
##                                 2.5 %    97.5 %
## (Intercept)                -2.1541445 2.6643463
## NUMERICAL_VARIABLE1        -0.6749361 0.3010609
## CATEGORICAL_VARIABLE3Small -2.4036262 1.4230628

The confidence intervals obtained so far are on the log odds scale. To obtain them on the odds ratio scale:

exp(coefficients(MY.MODEL3))
##                (Intercept)        NUMERICAL_VARIABLE1 
##                  1.1433746                  0.8806929 
## CATEGORICAL_VARIABLE3Small 
##                  0.6407136
exp(confint(MY.MODEL3))
## Waiting for profiling to be done...
##                                 2.5 %    97.5 %
## (Intercept)                0.11600239 14.358561
## NUMERICAL_VARIABLE1        0.50918897  1.351292
## CATEGORICAL_VARIABLE3Small 0.09038959  4.149811

The standard output for a categorical explanatory variable compares levels 2, 3, …, k with level 1, according to the ordered levels of the variable.

Sometimes we want all the pairwise comparisons for a categorical explanatory variable; usually we want this on the odds ratio scale. In this example, the categorical explanatory variable only takes two values, so there is only one pairwise comparison; in general, for \(k\) levels there will be \(k \choose 2\) pairwise comparisons.

emm.3 <- emmeans(MY.MODEL3, "CATEGORICAL_VARIABLE3", type = "response")
pair.comparisons <- pairs(emm.3, reverse=TRUE, infer=TRUE, adjust= "none")
pair.comparisons
##  contrast      odds.ratio   SE  df asymp.LCL asymp.UCL null z.ratio p.value
##  Small / Large      0.641 0.61 Inf     0.099      4.15    1  -0.467  0.6403
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log odds ratio scale 
## Tests are performed on the log odds ratio scale

You can use tbl_regression to get a summary of the results on the odds ratio scale. This will provide comparisons with a baseline category.

tbl_regression(MY.MODEL3, pvalue_fun = ~ style_pvalue(.x, digits = 3), estimate_fun = ~ style_number(.x, digits = 2), exponentiate = TRUE)  %>%
 modify_header(label ~ "**Variable**", estimate ~ "**Odds ratio**") %>%
 modify_footnote(everything() ~ NA, abbreviation = TRUE)
Variable Odds ratio 95% CI p-value
NUMERICAL_VARIABLE1 0.88 0.51, 1.35 0.588
CATEGORICAL_VARIABLE3
    Large
    Small 0.64 0.09, 4.15 0.640

© Statistical Consulting Centre, University of Melbourne, 2023