More dplyr

solutions

Packages

  • Your turn (5 minutes): Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?
flights |>
  select(tailnum, carrier, dep_delay) |>
  arrange(desc(dep_delay)) |>
  slice(1)
# A tibble: 1 × 3
  tailnum carrier dep_delay
  <chr>   <chr>       <dbl>
1 N384HA  HA           1301

filter()

  • Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |> # dataset and then 
    filter(dest == "RDU") #subset dataframe where dest equals RDU 
# A tibble: 8,163 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1240           1235         5     1415           1415
 9  2013     1     1     1317           1325        -8     1454           1505
10  2013     1     1     1449           1450        -1     1651           1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Now, run the following code with one equals sign instead of two. Does it still work?

Explanation:

(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.

  • Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0. As we’ve learned, conditions within functions are separated by a ,.
flights |>
  filter(dest == "RDU", 
         arr_delay < 0) 
# A tibble: 4,232 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1505           1510        -5     1654           1655
10  2013     1     1     1800           1800         0     1945           1951
# ℹ 4,222 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

We can do more complex tasks using logical operators:

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?
x & y is x AND y?
x | y is x OR y?
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

  • Your turn (4 minutes): Describe what the code is doing in words.
flights |> #and then
  filter(dest %in% "RDU", # look for RDU in dest
         arr_delay < 0 | dep_delay < 0 # or
         ) 
# A tibble: 5,308 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 5,298 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • What if we want to look at destinations of RDU and GSO? How does the below code change?
flights |>
  filter(dest %in% c("RDU","GSO"), #look for RDU or GSO 
         arr_delay < 0 | dep_delay < 0
         )
# A tibble: 6,203 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Why c? because we need to combine character strings

What happens if we change %in% to == ? bad things it goes through the data set looking for RDU in row 1, GSO in row 2, and repeat

count()

  • Demo: Create a frequency table of the destination (dest) locations for flights from New York.
flights |>
  count(dest)
# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     265
 3 ALB     439
 4 ANC       8
 5 ATL   17215
 6 AUS    2439
 7 AVL     275
 8 BDL     443
 9 BGR     375
10 BHM     297
# ℹ 95 more rows
  • Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type ?min into the console and use the filter() function!
flights |> 
  count(month) |>
  arrange(n) |>
  slice(1)
# A tibble: 1 × 2
  month     n
  <int> <int>
1     2 24951
flights |>
  count(month) |>
  filter(n == min(n)) #look at n and find min value
# A tibble: 1 × 2
  month     n
  <int> <int>
1     2 24951

On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.

flights |> 
  count(month,day) |> # freq table
  filter(n == max(n)) # finds the max
# A tibble: 1 × 3
  month   day     n
  <int> <int> <int>
1    11    27  1014

mutate()

Use mutate() to create a new variable.

  • Demo: In the code chunk below, air_time (minutes in the air) is converted to hours, and then new variable mph is created, corresponding to the miles per hour of the flight. Comment each line of code below.
data_flights <- flights |> 
  mutate(hours = air_time / 60, # var name = stuff
         mph = distance / hours) |> # make mph
  select(air_time, distance, hours, mph) # choose columns
  • Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
flights |> 
  count(month) |>
  mutate(perc = n / sum(n) * 100) |>
  kbl(digits = 2)
month n perc
1 27004 8.02
2 24951 7.41
3 28834 8.56
4 28330 8.41
5 28796 8.55
6 28243 8.39
7 29425 8.74
8 29327 8.71
9 27574 8.19
10 28889 8.58
11 27268 8.10
12 28135 8.35

changing variable type

Suppose we wanted to make year a factor…

flights |>
  mutate(year = as.numeric(year)) |> #makes var a factor
  select(year)
# A tibble: 336,776 × 1
    year
   <dbl>
 1  2013
 2  2013
 3  2013
 4  2013
 5  2013
 6  2013
 7  2013
 8  2013
 9  2013
10  2013
# ℹ 336,766 more rows

group_by() + summarise() review

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.

Find the mean and standard deviation of departure delay for each month.

flights |>
  group_by(month) |>
  summarise(mean_dep_delay = mean(dep_delay, na.rm = T),
            sd_dep_delay = sd(dep_delay, na.rm = T)) 
# A tibble: 12 × 3
   month mean_dep_delay sd_dep_delay
   <int>          <dbl>        <dbl>
 1     1          10.0          36.4
 2     2          10.8          36.3
 3     3          13.2          40.1
 4     4          13.9          43.0
 5     5          13.0          39.4
 6     6          20.8          51.5
 7     7          21.7          51.6
 8     8          12.6          37.7
 9     9           6.72         35.6
10    10           6.24         29.7
11    11           5.44         27.6
12    12          16.6          41.9