Wrangling flights

Solutions

To demonstrate data wrangling we will use flights, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.

Note: As we go through the AE, practice thinking in steps and reading your code as sentences.

Write a sentence here that explains how many rows and columns are in the data set using in-line code

This dataset has 336776 rows and 19 columns.
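One way to produce that sentence is sketched below, assuming the document is rendered with Quarto or R Markdown. The object names n_rows and n_cols are illustrative and not part of the original activity.

```r
library(nycflights13)

# Compute the counts from the data so the sentence never goes stale
n_rows <- nrow(flights)
n_cols <- ncol(flights)

# In the document text, the inline code would then look like:
#   This dataset has `r n_rows` rows and `r n_cols` columns.
```

Because the numbers come from the data itself, the sentence updates automatically if the data set changes.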

Now, let’s take a glimpse of the data set! Also, pull up the help file for the flights dataset to see the data index.

glimpse(flights)

As a reminder, we can use the names() function to quickly list the names of the variables in our data set. We typically do not include this type of code in a professional document, but we include it here so you can reference the function!

names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     

We can also get a more traditional view of the data set using functions like head() and tail(). We can also View() our data set, but will NOT include View() in our rendered document. Let’s show why.

head(flights) # shows top 6 rows of our data
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
tail(flights) # shows bottom 6 rows of our data
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1842        NA       NA           2019
2  2013     9    30       NA           1455        NA       NA           1634
3  2013     9    30       NA           2200        NA       NA           2312
4  2013     9    30       NA           1210        NA       NA           1330
5  2013     9    30       NA           1159        NA       NA           1344
6  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are many differences between a tibble and a data frame. The main one is…

When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

Other differences include:

– it never changes the type of the inputs (e.g. it never converts strings to factors!)

– it never changes the names of variables

– it never creates row names

– it can have column names that are not valid R variable names
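A small sketch of the naming difference, using a made-up column name (flight number) that is not a valid R variable name because it contains a space:

```r
library(tibble)

# data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(`flight number` = 1:2)
names(df)
# → "flight.number"

# tibble() keeps the name exactly as written
tb <- tibble(`flight number` = 1:2)
names(tb)
# → "flight number"
```

To refer to such a column later, wrap its name in backticks, as in the calls above.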

You can read more about the difference here: Tibble Reading

Let’s look at the differences in the output when we type flights (tibble) in the console versus typing mtcars (data frame) in the console.

Then, use as_tibble() on the mtcars data set. Save this new data set as mtcars_tibble. Check your environment to see if it worked!

mtcars_tibble <- as_tibble(mtcars)
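Besides checking the environment pane, you can check the result in code. The sketch below also shows one consequence of the "never creates row names" rule: the car models stored as row names in mtcars are dropped unless you ask for them. The column name model is an illustrative choice, not something the activity requires.

```r
library(tibble)

mtcars_tibble <- as_tibble(mtcars)

class(mtcars)         # a plain data frame
class(mtcars_tibble)  # includes "tbl_df", so it is a tibble

# as_tibble() drops row names by default, so the car models are lost.
# The rownames argument moves them into a regular column instead.
mtcars_named <- as_tibble(mtcars, rownames = "model")
names(mtcars_named)[1]
# → "model"
```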

Making your own tibbles

There are two different ways we can make our own tibbles in R. The first way is to use the tibble() function. Comment on what each line of code is doing below.

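data <- tibble(
  x = 1:5,         # the integers 1 through 5, one per row
  y = 1,           # a single value, recycled to every row
  z = x ^ 2 + y    # computed from the columns defined above
)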

tribble() stands for transposed tibble. It’s more intuitive to make data sets this way (in my opinion), because the code is laid out row by row, just like the resulting table. Run the code below and see the results.

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here to download the dplyr cheat sheet.

Quick summary of key dplyr functions [1]:

Rows:

  • filter(): chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows.
  • sample_n(): takes a random subset of the rows (superseded by slice_sample() in current versions of dplyr).

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarise(): collapses a group into a single row.
  • count(): counts the unique values of one or more variables.
  • group_by(): groups the data so that later calculations are performed separately for each value of a variable.
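Several of these functions are not demonstrated in the activities below, so here is a minimal sketch of how they can combine. The new names delay_by_origin, dep_delay_hours, mean_delay_hours, and n_flights are made up for illustration.

```r
library(dplyr)
library(nycflights13)

delay_by_origin <- flights |>
  filter(!is.na(dep_delay)) |>                 # keep rows with a recorded delay
  mutate(dep_delay_hours = dep_delay / 60) |>  # new column: delay in hours
  group_by(origin) |>                          # one group per departure airport
  summarise(
    mean_delay_hours = mean(dep_delay_hours),  # collapse each group to one row
    n_flights = n()                            # number of flights in the group
  )

delay_by_origin
```

The result has one row per origin airport, because summarise() collapses each group created by group_by() into a single row.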

The pipe (reminder)

Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.
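To see what the pipe buys us, compare a nested call with its piped equivalent. This is a sketch that assumes R 4.1 or later, where the native pipe |> is available. The object names nested and piped are illustrative.

```r
library(dplyr)
library(nycflights13)

# Nested form: has to be read from the inside out
nested <- head(select(flights, dep_delay, arr_delay), 3)

# Piped form: take flights, and then select two columns, and then keep 3 rows
piped <- flights |>
  select(dep_delay, arr_delay) |>
  head(3)

identical(nested, piped)
# → TRUE
```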

Activities

select()

  • Demo: Make a data frame that only contains the variables dep_delay and arr_delay.
flights |>
  select(dep_delay, arr_delay)
# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# ℹ 336,766 more rows
  • Demo: Make a data frame that keeps every variable except dep_delay. Call the new data frame new.data
new.data <- flights |>
  select(-dep_delay)
new.data # print the new data frame
# A tibble: 336,776 × 18
    year month   day dep_time sched_dep_time arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515      830            819        11
 2  2013     1     1      533            529      850            830        20
 3  2013     1     1      542            540      923            850        33
 4  2013     1     1      544            545     1004           1022       -18
 5  2013     1     1      554            600      812            837       -25
 6  2013     1     1      554            558      740            728        12
 7  2013     1     1      555            600      913            854        19
 8  2013     1     1      557            600      709            723       -14
 9  2013     1     1      557            600      838            846        -8
10  2013     1     1      558            600      753            745         8
# ℹ 336,766 more rows
# ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
  • Think about the tibble creation above when we typed 1:5. We can use this idea as a tool to subset data! Instead of numbers, let’s do this with names.

  • Demo: Make a data frame that includes all variables from year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.

flights_subset <- flights |>
  select(year:dep_delay)
  • Demo: Use select() and contains() to make a data frame that includes the variables associated with the arrival, i.e., those that contain the string "arr_" in the name. Reminder: Thinking about code as sentences can help make nesting functions more intuitive.

Note: There should not be a backslash after arr. Quarto puts a backslash (sometimes) to indicate that the underscore is just text.

flights |>
  select(contains("arr_"))
# A tibble: 336,776 × 3
   arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>
 1      830            819        11
 2      850            830        20
 3      923            850        33
 4     1004           1022       -18
 5      812            837       -25
 6      740            728        12
 7      913            854        19
 8      709            723       -14
 9      838            846        -8
10      753            745         8
# ℹ 336,766 more rows
  • Why is arr_ in quotes?

Because we are looking for a character string inside the variable names, not referring to a variable called arr_!
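contains() has sibling helpers that match at a specific position in the name. A short sketch follows. The object names arr_names and delay_names are illustrative.

```r
library(dplyr)
library(nycflights13)

# starts_with() matches only at the beginning of the name,
# so sched_arr_time is no longer selected
arr_names <- flights |>
  select(starts_with("arr_")) |>
  names()
arr_names
# → "arr_time" "arr_delay"

# ends_with() matches only at the end of the name
delay_names <- flights |>
  select(ends_with("_delay")) |>
  names()
delay_names
# → "dep_delay" "arr_delay"
```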

slice()

  • Demo: Display the first five rows of the flights data frame.
flights |>
  slice(1:5)
# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • Demo: Display the last two rows of the flights data frame. Hint: n() returns the number of rows in the data set, which is also the position of the last row.

Solve this problem with n(). Then solve it with slice_tail().

flights |>
  slice((n()-1):n())
# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1159        NA       NA           1344
2  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  slice_tail(n = 2)
# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1159        NA       NA           1344
2  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
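slice_tail() belongs to a family of slice helpers. As a sketch, slice_head() mirrors it for the top of the data, and slice_sample() draws random rows (the current replacement for sample_n()). The seed value and object names are arbitrary choices for illustration.

```r
library(dplyr)
library(nycflights13)

# Same rows as slice(1:5)
first_five <- flights |>
  slice_head(n = 5)

# Five rows chosen at random; set.seed() makes the draw repeatable
set.seed(2013)
random_five <- flights |>
  slice_sample(n = 5)
```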

arrange()

  • Demo: Arrange the flights data set using the arrange() function, by dep_delay. How are the data ordered?
flights |>
  arrange(dep_delay) |>
  select(dep_delay, arr_time)
# A tibble: 336,776 × 2
   dep_delay arr_time
       <dbl>    <int>
 1       -43       40
 2       -33     2240
 3       -32     1549
 4       -30     2233
 5       -27     1947
 6       -26     1002
 7       -25     2143
 8       -25     2213
 9       -24     1601
10       -24     1225
# ℹ 336,766 more rows
  • Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top. Hint: run ?desc in the console.
flights |>
  arrange(desc(dep_delay)) |>
  select(dep_delay)
# A tibble: 336,776 × 1
   dep_delay
       <dbl>
 1      1301
 2      1137
 3      1126
 4      1014
 5      1005
 6       960
 7       911
 8       899
 9       898
10       896
# ℹ 336,766 more rows
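To answer the ordering question: arrange() sorts in ascending order by default, and missing values are placed at the end whether or not desc() is used. The toy data below is hypothetical and exists only to make that behavior easy to see.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: three delays, one of them missing
toy <- tibble(flight = c("A", "B", "C"), dep_delay = c(15, NA, -3))

# Ascending (the default): smallest delay first, NA last
ascending <- toy |> arrange(dep_delay)
ascending$flight
# → "C" "A" "B"

# Descending: largest delay first, but NA is still last
descending <- toy |> arrange(desc(dep_delay))
descending$flight
# → "A" "C" "B"
```

This is why the flights with a missing dep_delay never appear at the top of either output above.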

Footnotes

  1. From the dplyr vignette.