Visualizing time series data I

Packages

Warm up

sales <- read_excel("data/sales.xlsx",
                    skip = 3,
                    col_names = c("id" , "n"))

Now, write the code below to recreate the data set seen on the slides.

Hint1: Think about making a new column with a logical indicator

Hint2: We are going to use the function fill() to answer this question. Pull up this help file and think about how this could help us.

sales |>
  mutate(is_brand_name = str_detect(id, "Brand"),
         brand = if_else(is_brand_name, id, NA)) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n)
# A tibble: 7 × 3
  brand   id    n    
  <chr>   <chr> <chr>
1 Brand 1 1234  8    
2 Brand 1 8721  2    
3 Brand 1 1822  3    
4 Brand 2 3333  1    
5 Brand 2 2156  3    
6 Brand 2 3987  6    
7 Brand 2 3216  5    

Reading in data with dates

Below, we are going to practice reading in date data. Note that data1 and data2 have the year represented by 4 digits. Let’s first read data1.csv in without thinking too much about the data type. Write code below to read the data in as we’ve done before. What type of variable is var1?

data1 <- read_csv("data/data1.csv")
Rows: 1 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): var1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

var 1 is a character until we do something about it!

data1

Now, let’s make this a date variable upon reading the data in! In read_csv(), we can specify the column type using col_types = cols(). Let’s read in var1 appropriately.

data1 is in month day year format

data1 <- read_csv("data/data1.csv", col_types = cols(var1 = col_date("%m/%d/%Y")))

data1
# A tibble: 1 × 1
  var1      
  <date>    
1 1993-03-17

What happens if we mess up the col_types code?

We typically get a NA! But we may get some data where R is guessing. We need to be carful

data3

Now, let’s read in data3.

data3 is in year month day format. Note: The year 1993 is abbreviated to 93

data3 <- read_csv("data/data3.csv", col_types = cols(var1 = col_date("%y/%m/%d")))

data3
# A tibble: 1 × 1
  var1      
  <date>    
1 1993-03-17

Takeaway: No matter how you specify the date format, it’s always displayed the same way once you get it into R.

Flights

Warning: package 'nycflights13' was built under R version 4.4.2

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data. Let’s show this. Take the flights data set, and select() year, month, day, hour, and minute. We just want to show these variables are in the data set.

flights |>
  select(year, month, day, hour, minute)
# A tibble: 336,776 × 5
    year month   day  hour minute
   <int> <int> <int> <dbl>  <dbl>
 1  2013     1     1     5     15
 2  2013     1     1     5     29
 3  2013     1     1     5     40
 4  2013     1     1     5     45
 5  2013     1     1     6      0
 6  2013     1     1     5     58
 7  2013     1     1     6      0
 8  2013     1     1     6      0
 9  2013     1     1     6      0
10  2013     1     1     6      0
# ℹ 336,766 more rows

Now, we are going to add to our pipeline from above! We want to create a date/time variable from the existing data. To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times! Below let’s make a departure variable. Before we begin, let’s pull up the help file for make_datetime()

flights_dt <- flights |>
  select(year, month, day, hour, minute) |>
  mutate(departure = make_datetime(year, month, day, hour, minute))

Note: If you are having trouble with this code, the answer is at the bottom of the document under the Answer section. Run that code so you can continue with the AE!

We have explored many geoms in this course so far. Let’s introduce another one! We are going to use geom_freqpoly() to plot these time series data. Pull up the help file below, and describe this geom.

it’s a lot like geom_histogram, but will create a line instead of bars!

Now, let’s create a frequency plot using geom_freqpoly() only for the day Jan 1st, 2013.

Let’s walk through the code together below.

Note: we can specify a date structure by using ymd, followed by the year, month, day, as one numeric string. This will be useful when filtering!

Note: we tend to use <, >, or functions such as between() when working with complex data structures. Common filtering methods (ex. %in%) do not work with such a complex data structure.

flights_dt |> 
  filter(departure < ymd(20130102)) |> 
  ggplot(aes(x = departure)) + 
  geom_freqpoly(binwidth = 600) 

Takeaway: that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day.

For dates, 1 means 1 day.

=========================== Answer

flights_dt <- flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = make_datetime(year, month, day, hour, minute))