dplyr: wrangling + joins

Lecture 8

Dr. Elijah Meyer

NC State University
ST 295 - Spring 2025

2025-02-04

Checklist

– Have you cloned today’s AE repo?

– Are you keeping up with the prepare material?

– Homework-1 is graded!

– Homework-2 is live! Due Feb 10th at 11:59

  -- Late window Feb 11th at 11:59

Homework-1 Recap

It went really well! Nice work! Here are some things to think about…

Naming objects

Why?

penguins_new <- penguins |>
  group_by(island) |>
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))

penguins_new

# A tibble: 3 × 2
  island    mean_bill
  <fct>         <dbl>
1 Biscoe         45.3
2 Dream          44.2
3 Torgersen      39.0

Single Pipeline

Unless otherwise asked, please practice writing code in a single pipeline!

penguins |>
  group_by(island) |>
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))

# A tibble: 3 × 2
  island    mean_bill
  <fct>         <dbl>
1 Biscoe         45.3
2 Dream          44.2
3 Torgersen      39.0

Legends

What’s wrong with the code below?

penguins |>
  ggplot(
    aes(x = bill_length_mm, y = bill_depth_mm, color = island)
  ) + 
  geom_point() + 
  labs(x = "bill length (mm)",
       y = "bill depth (mm)",
       z = "Island")

Legends

To change the legend title, the argument name in labs() must match the aesthetic mapped in aes(). Here the legend comes from color = island, so the label belongs in color = "Island", not z = "Island".
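A corrected version of the plot might look like the sketch below. It assumes `penguins` comes from the palmerpenguins package, which the slides do not state explicitly. The plot is stored in `p` only so it can be inspected afterwards.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the `penguins` data

p <- penguins |>
  ggplot(
    aes(x = bill_length_mm, y = bill_depth_mm, color = island)
  ) +
  geom_point() +
  labs(x = "bill length (mm)",
       y = "bill depth (mm)",
       color = "Island")  # same aesthetic name as color = island in aes()

p
```

Because there is no z aesthetic in this plot, a z label has nothing to attach to and the legend keeps its default title.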

Warm up

Read these logical operators as a sentence:

– x <= y

– x == y

– x & y

– x | y

– is.na(x)

– x %in% y

Warm up

– x <= y: x is less than or equal to y

– x == y: x is exactly equal to y

– x & y: x and y

– x | y: x or y

– is.na(x): is x NA?

– x %in% y: x is in y
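One way to check these readings is to run each operator on small vectors. The values of `x`, `y`, `a`, and `b` below are made up for illustration and are not from the AE.

```r
x <- c(1, 5, NA)
y <- c(3, 5, 7)

x <= y      # element by element: is x less than or equal to y?
x == y      # element by element: is x exactly equal to y?
is.na(x)    # is each value of x missing?
x %in% y    # does each value of x appear anywhere in y?

a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b       # TRUE only where both a and b are TRUE
a | b       # TRUE where at least one of a or b is TRUE
```

Note that comparisons with a missing value return NA, while is.na() and %in% always return TRUE or FALSE.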

`==` vs `%in%`

Change %in% to ==. What happens?
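The code this slide refers to is not reproduced here, so the sketch below uses a made-up vector of island names to show how the two operators can differ when the right-hand side has more than one value.

```r
# Hypothetical values for illustration
island <- c("Biscoe", "Dream", "Torgersen", "Dream")
keep   <- c("Dream", "Biscoe")

# %in% asks, for each island, whether it appears anywhere in keep
in_result <- island %in% keep
in_result
# TRUE TRUE FALSE TRUE

# == compares position by position, recycling keep to
# "Dream", "Biscoe", "Dream", "Biscoe"
eq_result <- island == keep
eq_result
# FALSE FALSE FALSE FALSE
```

With ==, a row only matches if it happens to line up with the recycled value in the same position, which is rarely what you want inside filter().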

What do the following functions do?

filter()

mutate()

count()

summarise()
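As a reminder of what each verb does, here is a minimal sketch on a small made-up tibble. The `pets` data and its column names are hypothetical.

```r
library(dplyr)

# Made-up data for illustration
pets <- tibble(
  name      = c("Ava", "Bo", "Cy", "Di"),
  species   = c("cat", "dog", "dog", "cat"),
  weight_kg = c(4, 20, 30, 5)
)

# filter(): keep the rows that meet a condition
dogs <- pets |>
  filter(species == "dog")

# mutate(): add or change columns, keeping every row
pets_grams <- pets |>
  mutate(weight_g = weight_kg * 1000)

# count(): number of rows in each group
species_n <- pets |>
  count(species)

# summarise(): collapse each group into summary values
mean_wt <- pets |>
  group_by(species) |>
  summarise(mean_weight = mean(weight_kg))
```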

AE

Data are messy

Messy data

– The sheer volume of information is sometimes referred to as “messy” data, because it’s hard to make sense of it all.

Messy data

How?

Joining datasets

Data merging is the process of combining two or more data sets into a single data set. Most often, this process is necessary when you have raw data stored in multiple files, worksheets, or data tables, that you want to analyze together.

Joining datasets

– Left Join

– Inner Join

– Right Join

– Full Join
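The four joins differ in which rows they keep when the key does not match. The sketch below uses two made-up tables, `students` and `scores`, that share a key column `id`. These names are hypothetical and not from the AE.

```r
library(dplyr)

# Made-up tables: id 1 appears only in students, id 4 only in scores
students <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cal"))
scores   <- tibble(id = c(2, 3, 4), score = c(90, 85, 70))

# Left join: every row of students, with NA where no score matches
left <- left_join(students, scores, by = "id")

# Inner join: only ids found in both tables
inner <- inner_join(students, scores, by = "id")

# Right join: every row of scores, with NA where no student matches
right <- right_join(students, scores, by = "id")

# Full join: every row from either table
full <- full_join(students, scores, by = "id")
```

Here the left and right joins each return 3 rows, the inner join 2, and the full join 4, so the choice of join changes both the number of rows and where missing values appear.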

Joining datasets

AE

– Joining Data

– Recreate:

Recap of AE

– This is important! Data are messy!

– Think carefully about the join you use
