Dplyr and Joins

Solutions

Packages

library(tidyverse)
library(nycflights13)
library(palmerpenguins)

As a reminder, the dataset name is flights.

`rename()`

rename() changes the names of individual variables using new_name = old_name

Let’s practice using this function be renaming dep_time to departure_time.

flights |>
  rename(departure_time = dep_time) |>
  select(departure_time) # but we typically don't like spaces in names!

# A tibble: 336,776 × 1
   departure_time
            <int>
 1            517
 2            533
 3            542
 4            544
 5            554
 6            554
 7            555
 8            557
 9            557
10            558
# ℹ 336,766 more rows

flights |>
  rename("Departure Time" = dep_time) |>
  select("Departure Time") # but we typically don't like spaces in names!

# A tibble: 336,776 × 1
   `Departure Time`
              <int>
 1              517
 2              533
 3              542
 4              544
 5              554
 6              554
 7              555
 8              557
 9              557
10              558
# ℹ 336,766 more rows

Practice

Create a new dataset that only contains flights that do not have a missing departure time. Include the columns year, month, day, dep_time, dep_delay, and dep_delay_hours (the departure delay divided by 60).

new.data <- flights |>
  filter(!is.na(dep_time)) |>
  mutate(dep_delay_hours = dep_delay / 60) |>
  select(month, day, dep_time, dep_delay, dep_delay_hours)
  
new.data

# A tibble: 328,521 × 5
   month   day dep_time dep_delay dep_delay_hours
   <int> <int>    <int>     <dbl>           <dbl>
 1     1     1      517         2          0.0333
 2     1     1      533         4          0.0667
 3     1     1      542         2          0.0333
 4     1     1      544        -1         -0.0167
 5     1     1      554        -6         -0.1   
 6     1     1      554        -4         -0.0667
 7     1     1      555        -5         -0.0833
 8     1     1      557        -3         -0.05  
 9     1     1      557        -3         -0.05  
10     1     1      558        -2         -0.0333
# ℹ 328,511 more rows

Putting it all together

We want to look at the relationship between bill length and flipper length of penguins only from the Chinstrap and Gentoo species. Further, we want to look at the relationship between bill length and flipper length after they have been scaled (divided by) body mass. On the scatter plot, fit two linear trend lines. Turn off the standard error bars.

penguins |>
  filter(species %in% c("Chinstrap", "Gentoo")) |>
  mutate(bm = bill_length_mm / body_mass_g,
         fm = flipper_length_mm / body_mass_g) |>
  ggplot(
    aes(x = bm, y = fm, colour = species)
  ) + 
  geom_point() +
  geom_smooth(se = F, method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Suppose now that we want to calculate how many large Adelie penguins were observed on the Torgersen island. We consider a penguin to be large if they are larger than 4000 grams.

Hint: First, remove all the na’s from body mass.

Hint: Look up if_else()

penguins |>
  filter(species == "Adelie",
         island == "Torgersen",
         !is.na(body_mass_g)) |>
  mutate(size = if_else(body_mass_g > 4000,"large","small")) |>
  select(body_mass_g, size)

# A tibble: 51 × 2
   body_mass_g size 
         <int> <chr>
 1        3750 small
 2        3800 small
 3        3250 small
 4        3450 small
 5        3650 small
 6        3625 small
 7        4675 large
 8        3475 small
 9        4250 large
10        3300 small
# ℹ 41 more rows

Back to the slides to introduce joins!

Joins

Working with multiple data frames

Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.

There are many possible types of joins. All have the format something_join(x, y).

x <- tibble(
  value = c(1, 2, 3),
  xcol = c("x1", "x2", "x3")
  )

y <- tibble(
  value = c(1, 2, 4),
  ycol = c("y1", "y2", "y4")
  )

x

# A tibble: 3 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3

# A tibble: 3 × 2
  value ycol 
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y4

We will demonstrate each of the joins on these small, toy datasets.

Note: These functions below know to join x and y by value because each dataset has value as a column. See for yourself!

names(x)

[1] "value" "xcol"

names(y)

[1] "value" "ycol"

`inner_join()`

inner_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 2 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2

joined data based on value numbers that appear in both

`left_join()`

left_join(x , y)

Joining with `by = join_by(value)`

# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA>

keeps all in x and joins y

`right_join()`

right_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y4

keeps all in y and joins x

`full_join()`

full_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 4 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4

keeps all x and y

`semi_join()`

semi_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 2 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2

x that matches with y

`anti_join()`

anti_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 1 × 2
  value xcol 
  <dbl> <chr>
1     3 x3

stuff in x that is not in y