Pivots, Data Types

Packages

library(tidyverse)
library(scales)
library(readxl)

Pivot Practice

x <- tibble(
  state = rep(c("MT", "NC", "SC"), 2),
  group = c(rep("C", 3), rep("D", 3)),
  obs = 1:6
)

x

# A tibble: 6 × 3
  state group   obs
  <chr> <chr> <int>
1 MT    C         1
2 NC    C         2
3 SC    C         3
4 MT    D         4
5 NC    D         5
6 SC    D         6

(Done in the slides): Pivot these data so that they are wide, i.e., each state should be its own unique observation (row). Save this new data set as y.

y <- x |>
  pivot_wider(names_from = group,
              values_from = obs)

y

# A tibble: 3 × 3
  state     C     D
  <chr> <int> <int>
1 MT        1     4
2 NC        2     5
3 SC        3     6

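Now, let’s change it back. Introducing pivot_longer. There are three things we need to consider with pivot_longer: which columns to pivot (cols), the name of the new column that will hold the old column names (names_to), and the name of the new column that will hold the values (values_to).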

y |>
  pivot_longer(cols = !state,
               names_to = "group",
               values_to = "obs")

# A tibble: 6 × 3
  state group   obs
  <chr> <chr> <int>
1 MT    C         1
2 MT    D         4
3 NC    C         2
4 NC    D         5
5 SC    C         3
6 SC    D         6
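As a quick check that the two pivots undo each other, the sketch below rebuilds x and y from above and compares the round trip against the original. The name x_back and the use of arrange() and all.equal() for the comparison are illustrative choices, not part of the original exercise.

```r
library(tidyr)
library(dplyr)

x <- tibble(
  state = rep(c("MT", "NC", "SC"), 2),
  group = c(rep("C", 3), rep("D", 3)),
  obs = 1:6
)

# Wide: one row per state, one column per group
y <- x |>
  pivot_wider(names_from = group, values_from = obs)

# Long again: one row per state-group pair
x_back <- y |>
  pivot_longer(cols = !state, names_to = "group", values_to = "obs")

# The rows come back in a different order, so sort both before comparing
round_trip_ok <- isTRUE(all.equal(
  arrange(x, state, group),
  arrange(x_back, state, group)
))
round_trip_ok
```

The values survive the round trip; only the row order changes, which is why both tibbles are sorted first.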

Pivot Practice 2

Let’s try this on a real data set.

The Portland Trail Blazers are a National Basketball Association (NBA) sports team. These data reflect the points scored by 9 Portland Trail Blazers players across the first 10 games of the 2021-2022 NBA season.

trailblazer <- read_csv("data/trailblazer21.csv")

Take a slice of the data above. Are these data in wide or long format?

trailblazer |>
  slice(1:3)

# A tibble: 3 × 11
  Player       Game1_Home Game2_Home Game3_Away Game4_Home Game5_Home Game6_Away
  <chr>             <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 Damian Lill…         20         19         12         20         25         14
2 CJ McCollum          24         28         20         25         14         25
3 Norman Powe…         14         16         NA         NA         12         14
# ℹ 4 more variables: Game7_Away <dbl>, Game8_Away <dbl>, Game9_Home <dbl>,
#   Game10_Home <dbl>

Wide format: no player is repeated in the Player column, and each game has its own column.

Pivot the data so that you have columns for Player, Game, Points. Save this as a new data set called new.blazer.

new.blazer <- trailblazer |>
  pivot_longer(cols = -Player,
               names_to = "Game",
               values_to = "Points")

—————————————– Answer Below

But don’t look until you’ve tried it on your own above!

Suppose now that you are asked to split the game information into two separate columns: one to represent Game and one to represent Location. Let’s make this happen below. Save your new data set as new.blazer.

Comment through the code. Let’s introduce ourselves to a new function separate_wider_delim(). Open up the help file and read about the different functions we can use!

new.blazer <- trailblazer |>  # save object as new.blazer
  pivot_longer( # make data longer
    cols = -Player, # pivot every column except Player
    names_to = "Game", # old column names go into a column called Game
    values_to = "Points" # cell values go into a column called Points
  ) |>
  separate_wider_delim( # split Game at the underscore into two columns
    Game,
    delim = "_",
    names = c("Game", "Location")
  )

new.blazer

# A tibble: 90 × 4
   Player         Game   Location Points
   <chr>          <chr>  <chr>     <dbl>
 1 Damian Lillard Game1  Home         20
 2 Damian Lillard Game2  Home         19
 3 Damian Lillard Game3  Away         12
 4 Damian Lillard Game4  Home         20
 5 Damian Lillard Game5  Home         25
 6 Damian Lillard Game6  Away         14
 7 Damian Lillard Game7  Away         20
 8 Damian Lillard Game8  Away         26
 9 Damian Lillard Game9  Home          4
10 Damian Lillard Game10 Home         25
# ℹ 80 more rows
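The same split can also be done inside pivot_longer() itself, by giving names_to two names and telling it where to split with names_sep. The sketch below is an alternative, not the approach used above. Because the CSV file is not included here, it uses a small stand-in tibble called mini (a made-up name) holding two players and three games, with values copied from the slice shown earlier.

```r
library(tidyr)
library(dplyr)

# Stand-in for trailblazer: same column naming pattern, fewer rows and columns
mini <- tibble(
  Player = c("Damian Lillard", "CJ McCollum"),
  Game1_Home = c(20, 24),
  Game2_Home = c(19, 28),
  Game3_Away = c(12, 20)
)

mini_long <- mini |>
  pivot_longer(
    cols = -Player,
    names_to = c("Game", "Location"), # two new columns from each old name
    names_sep = "_",                  # split each old name at the underscore
    values_to = "Points"
  )

mini_long
```

Both routes should give the same four columns. separate_wider_delim() is handy when the data are already long, while names_sep saves a step when you are pivoting anyway.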

Now, use pivot_wider to reshape the new.blazer data frame such that you have a 90 x 4 tibble with columns Player, Game, Home, Away.

new.blazer |>
  pivot_wider(names_from = Location, 
              values_from = Points)

# A tibble: 90 × 4
   Player         Game    Home  Away
   <chr>          <chr>  <dbl> <dbl>
 1 Damian Lillard Game1     20    NA
 2 Damian Lillard Game2     19    NA
 3 Damian Lillard Game3     NA    12
 4 Damian Lillard Game4     20    NA
 5 Damian Lillard Game5     25    NA
 6 Damian Lillard Game6     NA    14
 7 Damian Lillard Game7     NA    20
 8 Damian Lillard Game8     NA    26
 9 Damian Lillard Game9      4    NA
10 Damian Lillard Game10    25    NA
# ℹ 80 more rows

Data Types

Why is understanding data types important?

Functions expect particular data types. For example, a scatterplot needs numeric variables on both axes, so a character variable such as Player is not a sensible choice for the x axis.

Type coercion

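Type coercion is the conversion of data from one type to another. R sometimes does this automatically, and we can also request it ourselves with functions such as as.numeric().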

Demo: Determine the type of the following vector. And then, change the type to numeric.

x <- c("1", "2", "3")
typeof(x)

[1] "character"

as.numeric(x) # to change something to a number

[1] 1 2 3

Let’s try another example…

y <- c("a", "b", "c")
typeof(y)

[1] "character"

as.numeric(y) # gives NAs because letters cannot be read as numbers

Warning: NAs introduced by coercion

[1] NA NA NA

z <- c("1", "2", "three")
typeof(z)

[1] "character"

as.numeric(z)

Warning: NAs introduced by coercion

[1]  1  2 NA
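Before coercing, it can help to see which entries will fail to convert. A minimal base R sketch, reusing z from above; the name bad is made up for this illustration.

```r
z <- c("1", "2", "three")

# TRUE where the string cannot be read as a number
# (suppressWarnings() hides the coercion warning we already expect)
bad <- is.na(suppressWarnings(as.numeric(z)))

z[bad]
```

Here z[bad] returns "three". Note that this check would also flag any entries that were already NA before the conversion.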

Survey Results

survey_results <- tibble(cars = c(1, 2, "three"))

survey_results

# A tibble: 3 × 1
  cars 
  <chr>
1 1    
2 2    
3 three

This is annoying because of that third survey taker who just had to go and type out the number instead of providing it as a numeric value. So now you need to update the cars variable to be numeric. You do the following:

survey_results |>
  mutate(cars = as.numeric(cars))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `cars = as.numeric(cars)`.
Caused by warning:
! NAs introduced by coercion

# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3    NA

What warning comes out?

NAs introduced by coercion

This is because the string "three" is not a number written in digits, so as.numeric() cannot convert it and returns NA instead.

And now things are even more annoying: you get the warning “NAs introduced by coercion” from computing cars = as.numeric(cars), and the response from the third survey taker is now NA (you lost their data). Let’s fix this.

survey_results |> 
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = as.numeric(cars)
  )

# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3

Note: there are many ways to fix or replace problem values, depending on the situation. We can use case_when() to recode several values at once, or replace_na() to fill in values that are already NA.
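A minimal sketch of both approaches from the note, using a made-up survey with more than one problem response. The tibble survey_more and its values are hypothetical, and filling a genuinely missing response with 0 is shown only to illustrate replace_na(); whether that is appropriate depends on the data.

```r
library(dplyr)
library(tidyr)

# Hypothetical responses: two written-out numbers and one missing answer
survey_more <- tibble(cars = c("1", "two", "three", NA))

survey_fixed <- survey_more |>
  mutate(
    # Recode the written-out numbers while cars is still character
    cars = case_when(
      cars == "two" ~ "2",
      cars == "three" ~ "3",
      TRUE ~ cars # leave everything else as is
    ),
    cars = as.numeric(cars),   # no warning: every non-missing string is now digits
    cars = replace_na(cars, 0) # illustrative choice for the missing answer
  )

survey_fixed
```

Recoding before calling as.numeric() is what avoids the coercion warning. replace_na() only touches values that are already NA.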

Characters vs Factors

In short….

Character = text

Factor = categories

Technically, factor is a data “class”. A class is a higher-level categorization of an object that defines its structure and the methods that can be applied to it. Other classes include things like a list and a data frame. A character can also be a class. We won’t spend too much time diving into the details here. Instead, we will focus on the functionality around common situations of type coercion.

The categories of a factor are stored as integers behind the scenes. This is important to think about when exploring type coercion.
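To see this storage directly, we can inspect a small factor in base R. The vector f and its values are made up for illustration.

```r
f <- factor(c("low", "high", "low"))  # hypothetical example values

class(f)       # "factor"
typeof(f)      # "integer": the underlying storage type
levels(f)      # "high" "low" (alphabetical by default)
as.integer(f)  # 2 1 2: the position of each value in levels(f)
```

The integers are positions in the sorted set of levels, so they say nothing about what the labels mean.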

Survey Results (Factor Edition)

survey_results_F <- tibble(cars = factor(c(1, 2, "three")))

survey_results_F

# A tibble: 3 × 1
  cars 
  <fct>
1 1    
2 2    
3 three

survey_results_F |>
  mutate(cars = as.numeric(cars))

# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3

How is this different than before?

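There is no warning and no NA this time, but not because R understood the word “three”. as.numeric() on a factor returns the integer code of each level. The levels sort as "1", "2", "three", so “three” is the third level and gets the code 3.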
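One way to see what as.numeric() does with a factor is to use labels that are not 1, 2, 3. In the sketch below (the vector f is made up for illustration), converting directly returns the level codes, while going through as.character() first recovers the numbers written in the labels.

```r
f <- factor(c("10", "20", "10"))  # hypothetical responses stored as a factor

as.numeric(f)                # 1 2 1: level codes, not the labels
as.numeric(as.character(f))  # 10 20 10: the values written in the labels
```

In the survey example the codes and the intended values happen to coincide, which makes the result look correct when it is really a coincidence of level order.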

Your turn!

Your turn: First, check the type of each element in the vector below. Then, guess the type of the entire vector. Then, check if you guessed right by running typeof() on the vector.

Different responses

v1 <- c(1, 1L, "C")

# to help you guess
typeof(1)

[1] "double"

typeof(1L)

[1] "integer"

typeof("C")

[1] "character"

# to check after you guess
typeof(v1)

[1] "character"

v2 <- c(1L , 0, Inf, TRUE)

# to help you guess
typeof(Inf)

[1] "double"

typeof(0)

[1] "double"

typeof(1L)

[1] "integer"

typeof(TRUE)

[1] "logical"

# to check after you guess
typeof(v2)

[1] "double"
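Both results follow the same rule: a vector can hold only one type, so c() converts every element to the most flexible type present, in the order logical, integer, double, character. A short base R sketch of that ordering, with values chosen only for illustration:

```r
typeof(c(TRUE, 1L))            # "integer": logical is promoted to integer
typeof(c(TRUE, 1L, 2.5))       # "double": integer is promoted to double
typeof(c(TRUE, 1L, 2.5, "a"))  # "character": everything becomes text

# The same promotion lets us do arithmetic on logicals
sum(c(TRUE, FALSE, TRUE))      # 2, because TRUE counts as 1 and FALSE as 0
```

This is why v1 ends up as character (one "C" is enough) and v2 ends up as double (0 and Inf are doubles, and nothing in it is character).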