Dplyr, Joins, and Pivots

Solutions

Packages

if_else vs case_when

Pull up the help files for if_else vs case_when. What’s the difference?

case_when is a flexible version of if_else. We do not need to be in the situation where it’s just condition, true, false

We are going to practice on the mtcars dataset. Refresh your memory by pulling up the help file for mtcars.

We notice that the vs variable is not very informative, coding 0 and 1. Let’s change the values to be more descriptive! Should we use if_else or case_when?

if_else

Now, make a new variable called vs_cat that has “straight” for the value of 1, and “v-shaped” for the value of 0.

mtcars |>
  mutate(vs_cat = if_else(vs == 1, "straight", "v-shaped")) |>
  select(vs, vs_cat)
                    vs   vs_cat
Mazda RX4            0 v-shaped
Mazda RX4 Wag        0 v-shaped
Datsun 710           1 straight
Hornet 4 Drive       1 straight
Hornet Sportabout    0 v-shaped
Valiant              1 straight
Duster 360           0 v-shaped
Merc 240D            1 straight
Merc 230             1 straight
Merc 280             1 straight
Merc 280C            1 straight
Merc 450SE           0 v-shaped
Merc 450SL           0 v-shaped
Merc 450SLC          0 v-shaped
Cadillac Fleetwood   0 v-shaped
Lincoln Continental  0 v-shaped
Chrysler Imperial    0 v-shaped
Fiat 128             1 straight
Honda Civic          1 straight
Toyota Corolla       1 straight
Toyota Corona        1 straight
Dodge Challenger     0 v-shaped
AMC Javelin          0 v-shaped
Camaro Z28           0 v-shaped
Pontiac Firebird     0 v-shaped
Fiat X1-9            1 straight
Porsche 914-2        0 v-shaped
Lotus Europa         1 straight
Ford Pantera L       0 v-shaped
Ferrari Dino         0 v-shaped
Maserati Bora        0 v-shaped
Volvo 142E           1 straight

Now, suppose we want to change wt from a quantitative variable to a categorical variable. Specifically, we want to categorize weight by low (< 2), med (> 2, < 4), and heavy (> 4). Let’s walk through the code below!

#syntax case_when is condition ~ "true",
mtcars |> # and then 
  mutate(wt_cat = case_when( # making a new variable with case_when
    wt < 2 ~ "low", # condition 1; output is "low"
    wt > 2 & wt < 4 ~ "med",
    wt > 4 ~ "high"
  )) |>
  select(wt, wt_cat)
                       wt wt_cat
Mazda RX4           2.620    med
Mazda RX4 Wag       2.875    med
Datsun 710          2.320    med
Hornet 4 Drive      3.215    med
Hornet Sportabout   3.440    med
Valiant             3.460    med
Duster 360          3.570    med
Merc 240D           3.190    med
Merc 230            3.150    med
Merc 280            3.440    med
Merc 280C           3.440    med
Merc 450SE          4.070   high
Merc 450SL          3.730    med
Merc 450SLC         3.780    med
Cadillac Fleetwood  5.250   high
Lincoln Continental 5.424   high
Chrysler Imperial   5.345   high
Fiat 128            2.200    med
Honda Civic         1.615    low
Toyota Corolla      1.835    low
Toyota Corona       2.465    med
Dodge Challenger    3.520    med
AMC Javelin         3.435    med
Camaro Z28          3.840    med
Pontiac Firebird    3.845    med
Fiat X1-9           1.935    low
Porsche 914-2       2.140    med
Lotus Europa        1.513    low
Ford Pantera L      3.170    med
Ferrari Dino        2.770    med
Maserati Bora       3.570    med
Volvo 142E          2.780    med

Joining by more than one variable

x2 <- tibble(
  value = c(1, 2, 3),
  value2 = c(4,5,6),
  xcol = c("x1", "x2", "x3")
  )

y2<- tibble(
  value = c(1, 2, 4), 
  value2 = c(4,5,7),
  ycol = c("y1", "y2", "y4")
  )

x2
# A tibble: 3 × 3
  value value2 xcol 
  <dbl>  <dbl> <chr>
1     1      4 x1   
2     2      5 x2   
3     3      6 x3   
y2
# A tibble: 3 × 3
  value value2 ycol 
  <dbl>  <dbl> <chr>
1     1      4 y1   
2     2      5 y2   
3     4      7 y4   

inner_join()

inner_join(x2, y2, by = c("value"))
# A tibble: 2 × 5
  value value2.x xcol  value2.y ycol 
  <dbl>    <dbl> <chr>    <dbl> <chr>
1     1        4 x1           4 y1   
2     2        5 x2           5 y2   
inner_join(x2, y2 , by = c("value" , "value2"))
# A tibble: 2 × 4
  value value2 xcol  ycol 
  <dbl>  <dbl> <chr> <chr>
1     1      4 x1    y1   
2     2      5 x2    y2   

left_join()

left_join(x2 , y2, by = c("value"))
# A tibble: 3 × 5
  value value2.x xcol  value2.y ycol 
  <dbl>    <dbl> <chr>    <dbl> <chr>
1     1        4 x1           4 y1   
2     2        5 x2           5 y2   
3     3        6 x3          NA <NA> 
left_join(x2 , y2)
Joining with `by = join_by(value, value2)`
# A tibble: 3 × 4
  value value2 xcol  ycol 
  <dbl>  <dbl> <chr> <chr>
1     1      4 x1    y1   
2     2      5 x2    y2   
3     3      6 x3    <NA> 

Real world example

The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations collects data on fisheries production of countries. Our goal is to combine these data in order to calculate the mean, minimum and maximum aquaculture proportion for continents in the fisheries data.

Data

fisheries <- read_csv("data/fisheries.csv")
continents <- read_csv("data/continents.csv")

names(fisheries)
[1] "country"     "capture"     "aquaculture" "total"      
names(continents)
[1] "country"   "continent"

Step 1: Which variable(s) will we use to join the fisheries and continents data frames?

We will join by country

We want to keep all rows and columns from fisheries and add a column for corresponding continents. Which join function should we use?

Demo: Join the two data frames and name assign the joined data frame back to fisheries_cont.

fisheries_cont <- left_join(fisheries, continents)
Joining with `by = join_by(country)`

Take a look at the updated fisheries_cont data frame. There are some countries that were not in continents. First, identify which countries these are (they will have NA values for continent).

fisheries_cont |>
  filter(is.na(continent))
# A tibble: 3 × 5
  country                          capture aquaculture   total continent
  <chr>                              <dbl>       <dbl>   <dbl> <chr>    
1 Democratic Republic of the Congo  237372        3161  240533 <NA>     
2 Hong Kong                         142775        4258  147033 <NA>     
3 Myanmar                          2072390     1017644 3090034 <NA>     

Then, manually update the continent information for these countries using the case_when function. Finally, check that these updates have been made as intended and no countries are left without continent information.

Comment through the following code below:

fisheries_cont_complete <- fisheries_cont |> 
  mutate(
    continent = case_when(
    country == "Democratic Republic of the Congo" ~ "Africa",
    country == "Hong Kong" ~ "Asia",
    country == "Myanmar" ~ "Asia", 
    TRUE ~ continent # for everything else (TRUE) ~ leave it as is
    )
  )

fisheries_cont_complete |>
  filter(is.na(continent))
# A tibble: 0 × 5
# ℹ 5 variables: country <chr>, capture <dbl>, aquaculture <dbl>, total <dbl>,
#   continent <chr>

Demo: Add a new column to the complete data frame called aq_prop. We will calculate it as aquaculture / total. In the same pipeline, calculate the mean, minimum, and maximum aquaculture proportion for continent in our complete data set.

fisheries_cont_complete |>
  mutate(aq_prop = aquaculture / total) |>
  group_by(continent) |>
  summarise(
    min_aq_prop = min(aq_prop),
    max_aq_prop = max(aq_prop),
    mean_aq_prop = mean(aq_prop)
  )
# A tibble: 5 × 4
  continent min_aq_prop max_aq_prop mean_aq_prop
  <chr>           <dbl>       <dbl>        <dbl>
1 Africa        0             0.803       0.0943
2 Americas      0             0.529       0.192 
3 Asia          0             0.782       0.367 
4 Europe        0.00682       0.618       0.165 
5 Oceania       0.0197        0.357       0.150 

Pivots

Let’s go back to the slides, and introduce pivots!

Pivot Practice

x <- tibble(
  state = rep(c("MT", "NC" , "SC"),2),
  group = c(rep("C", 3), rep("D", 3)),
  obs = c(1:6)
  )

x
# A tibble: 6 × 3
  state group   obs
  <chr> <chr> <int>
1 MT    C         1
2 NC    C         2
3 SC    C         3
4 MT    D         4
5 NC    D         5
6 SC    D         6

Pivot these data so that the data are wide. i.e. Each state should be it’s own unique observation (row). Save this new data set as y.

y <- x |>
  pivot_wider(
    names_from = group,
    values_from = obs
  )