Logistic Regression


To illustrate logistic regression, we will build a spam filter from email data. Today’s data represent incoming emails to David Diez’s Gmail account (Diez is one of the authors of the OpenIntro textbooks) for the first three months of 2012. All personally identifiable information has been removed.

email <- read_csv("https://st511-01.github.io/data/email.csv") |>
  mutate(spam = factor(spam))
Rows: 3890 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): winner, number
dbl  (18): spam, to_multiple, from, cc, sent_email, image, attach, dollar, i...
dttm  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(email)
Rows: 3,890
Columns: 21
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time         <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59, 2012-01-01 16:…
$ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks  <dbl> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format       <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number       <chr> "big", "small", "small", "small", "none", "none", "big", …

The variables we’ll use in this analysis are spam (an indicator for whether the email is spam) and exclaim_mess (the number of exclamation points in the email message).

Fit the model

spam_model <- glm(spam ~ exclaim_mess, data = email, family = binomial)

summary(spam_model)

Call:
glm(formula = spam ~ exclaim_mess, family = binomial, data = email)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.91139    0.06404 -29.846  < 2e-16 ***
exclaim_mess -0.16836    0.02398  -7.021 2.21e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2417.5  on 3889  degrees of freedom
Residual deviance: 2318.5  on 3888  degrees of freedom
AIC: 2322.5

Number of Fisher Scoring iterations: 7
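The coefficient table above implies the fitted equation log(p/(1 − p)) = −1.911 − 0.168 × exclaim_mess, so in these data more exclamation points are associated with a lower predicted probability of spam. As a quick sketch (using the estimates printed above), plogis() converts log-odds to a probability:

```r
# coefficient estimates copied from the summary output above
b0 <- -1.91139
b1 <- -0.16836

# predicted probability of spam for an email with 0 exclamation points
plogis(b0)            # about 0.13

# predicted probability of spam for an email with 10 exclamation points
plogis(b0 + b1 * 10)  # about 0.03
```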

Evaluate your models

Testing vs Training

Let’s play around with the code below. In the last activity, we made our own testing and training data by hand. Now let’s introduce initial_split() from the rsample package (loaded with tidymodels), which does this for us.

set.seed(1234)

spam_split <- initial_split(email, prop = .8)

spam_split
<Training/Testing/Total>
<3112/778/3890>
spam_train <- training(spam_split)
spam_test <- testing(spam_split)
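As a small sketch of the same pattern on a built-in dataset (the seed and proportion here are illustrative, not part of the spam analysis), initial_split() randomly reserves a proportion of rows for training and the rest for testing:

```r
library(rsample) # part of tidymodels

set.seed(1234)
car_split <- initial_split(mtcars, prop = 0.8) # randomly reserve ~80% for training

car_train <- training(car_split) # about 80% of the 32 rows
car_test  <- testing(car_split)  # the remaining rows

# every row lands in exactly one of the two sets
nrow(car_train) + nrow(car_test) == nrow(mtcars)
```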

Threshold

The threshold is something that you get to justify as the researcher, and it will be used to evaluate the model.

thresh <- .05 # if p >= .05, we will classify as spam

We tend to look at sensitivity and specificity with logistic regression models:

Sensitivity - the probability of predicting a success for a true success (predicted successes over total true successes)

Specificity - the probability of predicting a failure for a true failure (predicted failures over total true failures)

Fit the model on the Training Data + Evaluate it

Comment through the code below. Next, calculate sensitivity and specificity.

#glm is base R
#below is the tidymodels way

model1 <- logistic_reg() |>  #specify a logistic regression model
  set_engine("glm") |>       #fit it as a generalized linear model
  fit(spam ~ exclaim_mess, data = spam_train)

#note the training data is spam_train
#the glm engine sets family = binomial for us
#(the Bernoulli distribution is a special case of the binomial)

tab <- predict(model1, new_data = spam_test, type = "prob") |> #predicted probabilities on the test set
  bind_cols(spam_test) |> 
  select(.pred_0, .pred_1, spam) |>
  mutate(spam_pred = if_else(.pred_1 <= thresh, 0, 1)) |> #classify using our threshold
  group_by(spam, spam_pred) |>
  summarize(n = n(), .groups = "drop") #count each truth/prediction combination
# 0 = failure = not spam 
# 1 = success = spam 

Thought Exercise: Use a tidyr function to take the summary table above and make it look like a 2×2 table.

pivot_wider(tab,
            id_cols = spam_pred,#ids that hang out on left side of table
            names_from = spam,
            values_from = n, names_prefix = "truth_") 
# A tibble: 2 × 3
  spam_pred truth_0 truth_1
      <dbl>   <int>   <int>
1         0     133       5
2         1     568      72
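From the 2×2 table above we can read off the counts and compute sensitivity and specificity directly (a sketch, treating 1 = spam as the “success”):

```r
# counts from the 2x2 table above
TN <- 133 # predicted 0, truly 0
FN <- 5   # predicted 0, truly 1
FP <- 568 # predicted 1, truly 0
TP <- 72  # predicted 1, truly 1

sensitivity <- TP / (TP + FN) # 72 / 77, about 0.94
specificity <- TN / (TN + FP) # 133 / 701, about 0.19
```

With such a low threshold (0.05), we flag most emails as spam, so we catch nearly all of the true spam (high sensitivity) but misclassify most legitimate email (low specificity).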