Wrangling flights
To demonstrate data wrangling, we will use flights, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.
Note: As we go through the AE, practice thinking in steps and reading your code as sentences.
Write a sentence here that explains how many rows and columns are in the data set, using inline code.
This data set has 336776 rows and 19 columns.
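The counts in the sentence above can be computed rather than typed by hand. The sketch below assumes the nycflights13 package is installed; the object names n_rows and n_cols are illustrative only. In a Quarto document, the same values can be placed in a sentence with an inline expression such as `r nrow(flights)`.

```r
library(nycflights13)

# Count the rows and columns of the flights tibble
n_rows <- nrow(flights)
n_cols <- ncol(flights)

# dim() returns both counts at once: rows first, then columns
dim(flights)
```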
Now, let’s take a glimpse of the data set! Also, pull up the help file for the flights data set to see the data dictionary.
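A minimal sketch of both steps, assuming dplyr (which provides glimpse()) and nycflights13 are installed:

```r
library(dplyr)         # provides glimpse()
library(nycflights13)  # provides the flights tibble

# One line per column: name, type, and the first few values
glimpse(flights)

# Open the help file (best run in the console rather than the rendered document)
?flights
```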
As a reminder, we can use the names() function to quickly list the names of the variables in our data set. We typically do not include this type of code in a professional document. However, we include it this time so you can reference this function!
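The call itself is not shown in the text; the listing that follows is consistent with something like this sketch (assuming nycflights13 is loaded):

```r
library(nycflights13)

# Returns a character vector with one entry per column
names(flights)
```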
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
We can also get a more traditional view of the data set using functions like head() and tail(). We can also View() our data set, but will NOT include View() in our rendered document. Let’s show why.
head(flights) # shows top 6 rows of our data
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
tail(flights) # shows bottom 6 rows of our data
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1842 NA NA 2019
2 2013 9 30 NA 1455 NA NA 1634
3 2013 9 30 NA 2200 NA NA 2312
4 2013 9 30 NA 1210 NA NA 1330
5 2013 9 30 NA 1159 NA NA 1344
6 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Tibble vs. data frame
A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are many differences between a tibble and a data frame. The main one is…
When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.
Other differences include:
– it never changes the type of the inputs (e.g. it never converts strings to factors!)
– it never changes the names of variables
– it never creates row names
– it can have column names that are not valid R variable names
You can read more about the difference here: Tibble Reading
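Two of the differences listed above can be seen directly. This is a small sketch assuming the tibble package is installed; the column name "flight number" is made up for illustration.

```r
library(tibble)

# data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(`flight number` = 1:2)
names(df)  # "flight.number"

# tibble() keeps the name exactly as written
tb <- tibble(`flight number` = 1:2)
names(tb)  # "flight number"

# mtcars stores car models as row names; converting to a tibble drops them
has_rownames(mtcars)
has_rownames(as_tibble(mtcars))
```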
Let’s look at the differences in the output when we type flights (tibble) in the console versus typing mtcars (data frame) in the console.
Then, use as_tibble() on the mtcars data set. Save this new data set as mtcars_tibble. Check your environment to see if it worked!
mtcars_tibble <- as_tibble(mtcars)
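Besides looking in the environment pane, the conversion can be checked in code. The sketch below also shows one way to keep the car names, which the conversion otherwise drops; the column name "model" and the object name mtcars_with_model are illustrative choices, not part of the exercise.

```r
library(tibble)

mtcars_tibble <- as_tibble(mtcars)

# TRUE if the conversion worked
is_tibble(mtcars_tibble)

# The car names lived in the row names, which tibbles do not keep.
# The rownames argument stores them in a regular column instead.
mtcars_with_model <- as_tibble(mtcars, rownames = "model")
```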
Making your own tibbles
There are two different ways we can make our own tibbles in R. The first way is to use the tibble() function. Comment on what each line of code is doing below.
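data <- tibble(
  x = 1:5,        # integers 1 through 5
  y = 1,          # a single value, recycled to every row
  z = x ^ 2 + y   # computed from the columns defined above
)
The second way is to use the tribble() function. tribble stands for transposed tibble. It’s more intuitive to make data sets this way (in my opinion). Run the code below and see the results.
tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)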
# A tibble: 2 × 3
x y z
<chr> <dbl> <dbl>
1 a 2 3.6
2 b 1 8.5
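To see that the two approaches describe the same data, the row-wise layout above can be rebuilt column by column. This is a sketch; the object names by_row and by_column are hypothetical.

```r
library(tibble)

# Row-wise: each line of code is one row of data
by_row <- tribble(
  ~x,  ~y, ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)

# Column-wise: each argument is one column of data
by_column <- tibble(
  x = c("a", "b"),
  y = c(2, 1),
  z = c(3.6, 8.5)
)

all.equal(by_row, by_column)
```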
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. See the dplyr reference page for documentation and the dplyr cheat sheet for a quick overview.
Quick summary of key dplyr functions¹:
Rows:
- filter(): chooses rows based on column values.
- slice(): chooses rows based on location.
- arrange(): changes the order of the rows.
- slice_sample(): takes a random subset of the rows.
Columns:
- select(): changes whether or not a column is included.
- rename(): changes the names of columns.
- mutate(): changes the values of columns and creates new columns.
Groups of rows:
- summarise(): collapses a group into a single row.
- count(): counts unique values of one or more variables.
- group_by(): performs calculations separately for each value of a variable.
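As a rough illustration of how row, column, and group functions combine, the sketch below computes the mean departure delay per carrier for flights leaving JFK. The question and the names jfk_delays and mean_dep_delay are made up for this example, not taken from the exercise.

```r
library(dplyr)
library(nycflights13)

jfk_delays <- flights |>
  filter(origin == "JFK") |>      # rows: keep JFK departures
  select(carrier, dep_delay) |>   # columns: keep two variables
  group_by(carrier) |>            # groups: one per carrier
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(mean_dep_delay)         # rows: shortest mean delay first

jfk_delays
```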
The pipe (reminder)
Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
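A small sketch of what "first input of the next line" means in practice: the nested and piped versions below should give the same result. The native pipe requires R 4.1 or later; the object names nested and piped are illustrative.

```r
library(dplyr)
library(nycflights13)

# Nested call: read from the inside out
nested <- head(select(flights, dep_delay, arr_delay), 3)

# Piped call: take flights, and then select two columns, and then keep 3 rows
piped <- flights |>
  select(dep_delay, arr_delay) |>
  head(3)

identical(nested, piped)
```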
- Demo: Make a data frame that only contains the variables dep_delay and arr_delay.
flights |>
  select(dep_delay, arr_delay)
# A tibble: 336,776 × 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# ℹ 336,766 more rows
- Demo: Make a data frame that keeps every variable except dep_delay. Call the new data frame new.data.
flights |>
  select(-dep_delay)
# A tibble: 336,776 × 18
year month day dep_time sched_dep_time arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 830 819 11
2 2013 1 1 533 529 850 830 20
3 2013 1 1 542 540 923 850 33
4 2013 1 1 544 545 1004 1022 -18
5 2013 1 1 554 600 812 837 -25
6 2013 1 1 554 558 740 728 12
7 2013 1 1 555 600 913 854 19
8 2013 1 1 557 600 709 723 -14
9 2013 1 1 557 600 838 846 -8
10 2013 1 1 558 600 753 745 8
# ℹ 336,766 more rows
# ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Think about the tibble creation above when we typed 1:5. We can use this idea as a tool to subset data! Instead of numbers, let’s do this with names.
- Demo: Make a data frame that includes all variables between year and dep_delay (inclusive). These are all variables that provide information about the departure of each flight.
flights_subset <- flights |>
  select(year:dep_delay)
- Demo: Use select() with contains() to make a data frame that includes the variables associated with the arrival, i.e., those whose names contain the string "arr_". Reminder: Thinking about code as sentences can help make nesting functions more intuitive.
Note: There should not be a backslash after arr. Quarto puts a backslash (sometimes) to indicate that the underscore is just text.
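The code for this demo is not shown in the text. One call consistent with the output that follows is sketched here; read it as "take flights, and then select the columns whose names contain arr_". The object name arrival_vars is hypothetical.

```r
library(dplyr)
library(nycflights13)

# contains() is nested inside select() and matches on column names
arrival_vars <- flights |>
  select(contains("arr_"))

arrival_vars
```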
# A tibble: 336,776 × 3
arr_time sched_arr_time arr_delay
<int> <int> <dbl>
1 830 819 11
2 850 830 20
3 923 850 33
4 1004 1022 -18
5 812 837 -25
6 740 728 12
7 913 854 19
8 709 723 -14
9 838 846 -8
10 753 745 8
# ℹ 336,766 more rows
- Why is arr_ in quotes?
Because we are looking for a character string, not a variable!
- Demo: Display the first five rows of the flights data frame.
flights |>
  slice(1:5)
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Demo: Display the last two rows of the flights data frame. Hint: n() produces the number of the last row in the data set.
Solve this problem first with n(), then with slice_tail().
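The n() version is not shown in the text. A sketch that should select the same two rows, assuming n() is used inside slice(); the object name last_two is hypothetical:

```r
library(dplyr)
library(nycflights13)

# n() is the total number of rows, so (n() - 1):n() is the last two positions
last_two <- flights |>
  slice((n() - 1):n())

last_two
```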
# A tibble: 2 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1159 NA NA 1344
2 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  slice_tail(n = 2)
# A tibble: 2 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1159 NA NA 1344
2 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Demo: Arrange the flights data set by dep_delay using the arrange() function. How are the data ordered?
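The code for this demo is not shown in the text. Since the output that follows has only dep_delay and arr_time, a pipeline along these lines would be consistent with it (an assumption, not necessarily the original code). By default arrange() sorts in ascending order and places missing values last. The object name by_delay is hypothetical.

```r
library(dplyr)
library(nycflights13)

by_delay <- flights |>
  select(dep_delay, arr_time) |>  # keep the output narrow
  arrange(dep_delay)              # smallest (most negative) delay first

by_delay
```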
# A tibble: 336,776 × 2
dep_delay arr_time
<dbl> <int>
1 -43 40
2 -33 2240
3 -32 1549
4 -30 2233
5 -27 1947
6 -26 1002
7 -25 2143
8 -25 2213
9 -24 1601
10 -24 1225
# ℹ 336,766 more rows
- Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top. Hint: run ?desc in the console.
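A minimal sketch of one way to do this, assuming desc() is wrapped around the sorting variable inside arrange(); the object name longest_first is hypothetical:

```r
library(dplyr)
library(nycflights13)

longest_first <- flights |>
  select(dep_delay, arr_time) |>
  arrange(desc(dep_delay))  # largest delay first; missing values still go last

longest_first
```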
