RStudio uses a “Working Directory”.
An R Markdown file is saved in a directory; when you knit the file, RStudio will automatically use this as the Working Directory.
Any data file you use should be in the same directory as the R Markdown file that reads in the data. In SRW, we have arranged this.
Similarly, any output files you create will be saved in the same directory.
R uses “packages” to carry out many operations. Packages need to be installed once, but loaded every time you use them, with the library() command; see below.
If you try to use a package that you have not installed, RStudio gives you a warning and an opportunity to install it.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.3.1
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.1
library(gt)
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.3.1
In this R Markdown file, the data is called EXAMPLE_DATA. Replace this with the name of the file you wish to use.
EXAMPLE_DATA <- read_excel("analytic_data.xlsx")
EXAMPLE_DATA <- EXAMPLE_DATA %>%
mutate_if(is.character,as.factor)
In all of the code below, you will need to replace EXAMPLE_DATA with the name of your data frame. You will need to use the appropriate variable names.
Fit logistic regression with one categorical explanatory variable:
summary(MY.MODEL1 <- glm(BINARY_VARIABLE ~ CATEGORICAL_VARIABLE3, family="binomial", data=EXAMPLE_DATA))
##
## Call:
## glm(formula = BINARY_VARIABLE ~ CATEGORICAL_VARIABLE3, family = "binomial",
## data = EXAMPLE_DATA)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4055 0.6455 -0.628 0.53
## CATEGORICAL_VARIABLE3Small -0.4418 0.9449 -0.468 0.64
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25.898 on 19 degrees of freedom
## Residual deviance: 25.678 on 18 degrees of freedom
## AIC: 29.678
##
## Number of Fisher Scoring iterations: 4
Obtain the coefficients and confidence intervals:
coefficients(MY.MODEL1)
## (Intercept) CATEGORICAL_VARIABLE3Small
## -0.4054651 -0.4418328
One way to get the 95% confidence interval for the log(odds ratio) is to use a Normal approximation, and hence find estimate \(\pm\) (1.96 \(\times\) standard error). This is not the approach used here; rather, a “profile likelihood” approach is used. There is evidence that this may be a better a method generally.
confint(MY.MODEL1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.769409 0.8475084
## CATEGORICAL_VARIABLE3Small -2.380978 1.4119697
Obtain the coefficients and confidence intervals on the odds ratio scale.
or.estimate <- exp(coefficients(MY.MODEL1))
print(or.estimate)
## (Intercept) CATEGORICAL_VARIABLE3Small
## 0.6666667 0.6428571
or.inference <- exp(confint(MY.MODEL1))
## Waiting for profiling to be done...
print(or.inference)
## 2.5 % 97.5 %
## (Intercept) 0.1704336 2.333825
## CATEGORICAL_VARIABLE3Small 0.0924601 4.104031
You can use tbl_regression
to get a summary of the
results on the odds ratio scale. This will provide comparisons with a
baseline category.
tbl_regression(MY.MODEL1, pvalue_fun = ~ style_pvalue(.x, digits = 3), estimate_fun = ~ style_number(.x, digits = 2), exponentiate = TRUE) %>%
modify_header(label ~ "**Variable**", estimate ~ "**Odds ratio**") %>%
modify_footnote(everything() ~ NA, abbreviation = TRUE)
Variable | Odds ratio | 95% CI | p-value |
---|---|---|---|
CATEGORICAL_VARIABLE3 | |||
Large | — | — | |
Small | 0.64 | 0.09, 4.10 | 0.640 |
You can also obtain the pairwise comparisons on the odds ratio scale
from the emmeans
package. This will provide all pairwise
comparisons. In this example, the output will be the same as that above,
as there are only two levels in the explanatory variable. The standard
output for a categorical explanatory variable compares levels 2, 3, …, k
with level 1, according to the ordered levels of the variable.
Unlike the output above, the `emmeans’ package does use the Normal approximation to get confidence intervals on the odds ratio scale, which are then transformed to the odds ratio scale. This is why you will see slight differences in the results of the two approaches.
emm.1 <- emmeans(MY.MODEL1, "CATEGORICAL_VARIABLE3", type = "response")
pair.comparisons <- pairs(emm.1, reverse = TRUE, infer=TRUE, adjust = "none")
print(pair.comparisons)
## contrast odds.ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
## Small / Large 0.643 0.607 Inf 0.101 4.1 1 -0.468 0.6401
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log odds ratio scale
## Tests are performed on the log odds ratio scale
Neater output:
pair.comparisons %>%
as_tibble() %>%
select(contrast,odds.ratio, asymp.LCL, asymp.UCL, p.value) %>%
gt() %>%
fmt_number(c(odds.ratio, asymp.LCL, asymp.UCL),
decimals = 2) %>%
fmt_number(c(p.value),
decimals = 3) %>%
cols_merge_range(asymp.LCL, asymp.UCL, sep = " to ") %>%
cols_align("center", everything()) %>%
cols_label(contrast = "Contrast", odds.ratio = "Odds ratio", asymp.LCL = "95% CI ", p.value = "P-value")
Contrast | Odds ratio | 95% CI | P-value |
---|---|---|---|
Small / Large | 0.64 | 0.10 to 4.10 | 0.640 |
Fit logistic regression with one numerical explanatory variable
summary(MY.MODEL2 <- glm(BINARY_VARIABLE ~ NUMERICAL_VARIABLE1, family="binomial", data=EXAMPLE_DATA))
##
## Call:
## glm(formula = BINARY_VARIABLE ~ NUMERICAL_VARIABLE1, family = "binomial",
## data = EXAMPLE_DATA)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08956 1.06236 -0.084 0.933
## NUMERICAL_VARIABLE1 -0.12489 0.23026 -0.542 0.588
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25.898 on 19 degrees of freedom
## Residual deviance: 25.584 on 18 degrees of freedom
## AIC: 29.584
##
## Number of Fisher Scoring iterations: 4
Get the estimated coefficients. The tbl_regression
function in the gtsummary
package is quite useful here
too.
coefficients(MY.MODEL2)
## (Intercept) NUMERICAL_VARIABLE1
## -0.08956265 -0.12489125
Obtain the estimate and confidence interval on the odds ratio scale
exp(coefficients(MY.MODEL2))
## (Intercept) NUMERICAL_VARIABLE1
## 0.9143310 0.8825929
exp(confint(MY.MODEL2))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.1121228 8.444078
## NUMERICAL_VARIABLE1 0.5169669 1.347382
The tbl_regression
function in the
gtsummary
package is quite useful here too.
tbl_regression(MY.MODEL2, pvalue_fun = ~ style_pvalue(.x, digits = 3), estimate_fun = ~ style_number(.x, digits = 2), exponentiate = TRUE) %>%
modify_header(label ~ "**Variable**", estimate ~ "**Odds ratio**") %>%
modify_footnote(everything() ~ NA, abbreviation = TRUE)
Variable | Odds ratio | 95% CI | p-value |
---|---|---|---|
NUMERICAL_VARIABLE1 | 0.88 | 0.52, 1.35 | 0.588 |
Multiple logistic regression (add as many predictor variables as you need)
MY.MODEL3 <- glm(BINARY_VARIABLE ~ NUMERICAL_VARIABLE1 + CATEGORICAL_VARIABLE3, family="binomial", data=EXAMPLE_DATA)
summary(MY.MODEL3)
##
## Call:
## glm(formula = BINARY_VARIABLE ~ NUMERICAL_VARIABLE1 + CATEGORICAL_VARIABLE3,
## family = "binomial", data = EXAMPLE_DATA)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1340 1.1784 0.114 0.909
## NUMERICAL_VARIABLE1 -0.1270 0.2347 -0.541 0.588
## CATEGORICAL_VARIABLE3Small -0.4452 0.9526 -0.467 0.640
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25.898 on 19 degrees of freedom
## Residual deviance: 25.364 on 17 degrees of freedom
## AIC: 31.364
##
## Number of Fisher Scoring iterations: 4
coefficients(MY.MODEL3)
## (Intercept) NUMERICAL_VARIABLE1
## 0.1339841 -0.1270463
## CATEGORICAL_VARIABLE3Small
## -0.4451727
confint(MY.MODEL3)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.1541445 2.6643463
## NUMERICAL_VARIABLE1 -0.6749361 0.3010609
## CATEGORICAL_VARIABLE3Small -2.4036262 1.4230628
The confidence intervals obtained so far are on the log odds scale. To obtain them on the odds ratio scale:
exp(coefficients(MY.MODEL3))
## (Intercept) NUMERICAL_VARIABLE1
## 1.1433746 0.8806929
## CATEGORICAL_VARIABLE3Small
## 0.6407136
exp(confint(MY.MODEL3))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.11600239 14.358561
## NUMERICAL_VARIABLE1 0.50918897 1.351292
## CATEGORICAL_VARIABLE3Small 0.09038959 4.149811
The standard output for a categorical explanatory variable compares levels 2, 3, …, k with level 1, according to the ordered levels of the variable.
Sometimes we want all the pairwise comparisons for a categorical explanatory variable; usually we want this on the odds ratio scale. In this example, the categorical explanatory variable only takes two values, so there is only one pairwise comparison; in general, for \(k\) levels there will be \(k \choose 2\) pairwise comparisons.
emm.3 <- emmeans(MY.MODEL3, "CATEGORICAL_VARIABLE3", type = "response")
pair.comparisons <- pairs(emm.3, reverse=TRUE, infer=TRUE, adjust= "none")
pair.comparisons
## contrast odds.ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
## Small / Large 0.641 0.61 Inf 0.099 4.15 1 -0.467 0.6403
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log odds ratio scale
## Tests are performed on the log odds ratio scale
You can use tbl_regression
to get a summary of the
results on the odds ratio scale. This will provide comparisons with a
baseline category.
tbl_regression(MY.MODEL3, pvalue_fun = ~ style_pvalue(.x, digits = 3), estimate_fun = ~ style_number(.x, digits = 2), exponentiate = TRUE) %>%
modify_header(label ~ "**Variable**", estimate ~ "**Odds ratio**") %>%
modify_footnote(everything() ~ NA, abbreviation = TRUE)
Variable | Odds ratio | 95% CI | p-value |
---|---|---|---|
NUMERICAL_VARIABLE1 | 0.88 | 0.51, 1.35 | 0.588 |
CATEGORICAL_VARIABLE3 | |||
Large | — | — | |
Small | 0.64 | 0.09, 4.15 | 0.640 |
© Statistical Consulting Centre, University of Melbourne, 2023