Logistic Regression
To illustrate logistic regression, we will build a spam filter from email data. Today’s data represent incoming emails in the Gmail account of David Diez (one of the authors of the OpenIntro textbooks) for the first three months of 2012. All personally identifiable information has been removed.
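The readr messages just below come from loading the data. Here is a minimal sketch of that step, assuming the file lives at data/email.csv (the path and the library() lines are assumptions; spam is converted to a factor to match the glimpse() output further down):
library(tidyverse)   # read_csv() and the dplyr verbs used throughout
library(tidymodels)  # initial_split(), logistic_reg(), etc. used later on
email <- read_csv("data/email.csv") |>    # file path is an assumption
  mutate(spam = as.factor(spam))          # treat the 0/1 spam indicator as categorical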
Rows: 3890 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): winner, number
dbl (18): spam, to_multiple, from, cc, sent_email, image, attach, dollar, i...
dttm (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(email)
Rows: 3,890
Columns: 21
$ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59, 2012-01-01 16:…
$ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks <dbl> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number <chr> "big", "small", "small", "small", "none", "none", "big", …
The variables we’ll use in this analysis are:
- spam: 1 if the email is spam, 0 otherwise
- exclaim_mess: the number of exclamation points in the email message
Fit the model
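The output below is the summary of a base-R glm() fit of spam on exclaim_mess using the full data set. A sketch of the code that produces it (the object name spam_fit is my choice):
spam_fit <- glm(spam ~ exclaim_mess, family = binomial, data = email)
summary(spam_fit)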
Call:
glm(formula = spam ~ exclaim_mess, family = binomial, data = email)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.91139 0.06404 -29.846 < 2e-16 ***
exclaim_mess -0.16836 0.02398 -7.021 2.21e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2417.5 on 3889 degrees of freedom
Residual deviance: 2318.5 on 3888 degrees of freedom
AIC: 2322.5
Number of Fisher Scoring iterations: 7
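Reading the coefficients, the fitted model is log-odds(spam) = -1.91 - 0.17 × exclaim_mess, so more exclamation points are associated with a lower chance of spam in these data. A quick worked example (the choice of 10 exclamation points is just an illustration):
log_odds <- -1.91139 - 0.16836 * 10   # predicted log-odds for an email with 10 exclamation points
plogis(log_odds)                      # inverse logit, exp(x) / (1 + exp(x)); about 0.027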
Evaluate your models
Testing vs Training
Let’s play around with the code below. In the last activity, we made our own testing and training data; this time, let’s introduce a function that will do it for us.
# initial_split() comes from rsample (part of tidymodels)
set.seed(1234)
spam_split <- initial_split(email, prop = .8)  # reserve 80% of the emails for training
spam_split
<Training/Testing/Total>
<3112/778/3890>
spam_train <- training(spam_split)  # the 3,112 training emails
spam_test <- testing(spam_split)    # the 778 held-out test emails
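An optional sanity check that the two pieces line up with the <3112/778/3890> counts printed above:
nrow(spam_train)  # should be 3112
nrow(spam_test)   # should be 778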
Threshold
The classification threshold is something that you get to justify as the researcher, and it will be used for model evaluation.
thresh <- .05 # if the predicted probability of spam is above .05, we will classify the email as spam
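To see the rule in action, here are two hypothetical predicted probabilities (the values .12 and .03 are made up for illustration):
if_else(c(.12, .03) <= thresh, 0, 1)  # returns 1 0: .12 is classified as spam, .03 is not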
We tend to look at sensitivity and specificity with logistic regression models:
- Sensitivity: the probability of correctly predicting a success (predicted successes over total true successes)
- Specificity: the probability of correctly predicting a failure (predicted failures over total true failures)
Fit the model on the Training Data + Evaluate it
Comment through the code below. Next, calculate sensitivity and specificity.
# glm() is base R; below is the tidymodels way
model1 <- logistic_reg() |>        # specify a logistic regression model
  set_engine("glm") |>             # fit it as a generalized linear model
  fit(spam ~ exclaim_mess, data = spam_train,  # note our data is now spam_train
      family = binomial)           # Bernoulli is a special case of the binomial
tab <- predict(model1, data.frame(exclaim_mess = spam_test$exclaim_mess), type = "prob") |>
  bind_cols(spam_test) |>                                   # attach the test set to its predicted probabilities
  select(.pred_0, .pred_1, spam) |>                         # keep the predictions and the true class
  mutate(spam_pred = if_else(.pred_1 <= thresh, 0, 1)) |>   # apply the threshold to classify each email
  group_by(spam, spam_pred) |>
  summarize(n = n())                                        # count each truth/prediction combination
`summarise()` has grouped output by 'spam'. You can override using the
`.groups` argument.
# 0 = failure = not spam
# 1 = success = spam
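Since tab is still grouped by spam (see the message above), dividing each count by its group total gives the two rates directly. A sketch of that calculation:
tab |>
  mutate(prop = n / sum(n))  # within each true class: the spam = 1, spam_pred = 1 row is sensitivity;
                             # the spam = 0, spam_pred = 0 row is specificity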
Thought Exercise: Use a tidyr function to take the summary table above and make it look like a 2x2 table.
pivot_wider(tab,
            id_cols = spam_pred,       # ids that hang out on the left side of the table
            names_from = spam,         # the true spam values become the column names
            values_from = n,
            names_prefix = "truth_")
# A tibble: 2 × 3
spam_pred truth_0 truth_1
<dbl> <int> <int>
1 0 133 5
2 1 568 72
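The same two rates can be read off the 2x2 table by hand (a sketch of the arithmetic, treating spam = 1 as the success):
sensitivity <- 72 / (72 + 5)      # about 0.94: almost all true spam is flagged at this low threshold
specificity <- 133 / (133 + 568)  # about 0.19: but most non-spam mail also gets flagged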