Pivots, Data Types
Packages
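The package-loading chunk is not shown in this handout. A minimal sketch, assuming the tidyverse is installed (it provides tibble(), read_csv(), the pivot functions, and the dplyr verbs used below):

```r
# Assumption: the tidyverse meta-package is installed.
# Attaching it loads tibble, tidyr, dplyr, and readr, among others.
library(tidyverse)
```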
Pivot Practice
x <- tibble(
  state = rep(c("MT", "NC", "SC"), 2),
  group = c(rep("C", 3), rep("D", 3)),
  obs = c(1:6)
)
x
# A tibble: 6 × 3
state group obs
<chr> <chr> <int>
1 MT C 1
2 NC C 2
3 SC C 3
4 MT D 4
5 NC D 5
6 SC D 6
(Done in the slides): Pivot these data so that the data are wide, i.e., each state should be its own unique observation (row). Save this new data set as y.
y <- x |>
  pivot_wider(names_from = group,
              values_from = obs)
y
# A tibble: 3 × 3
state C D
<chr> <int> <int>
1 MT 1 4
2 NC 2 5
3 SC 3 6
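Now, let’s change it back. Introducing pivot_longer. There are three things we need to consider with pivot_longer: which columns to pivot (cols), what to call the column that will hold the old column names (names_to), and what to call the column that will hold the values (values_to).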
y |>
  pivot_longer(cols = !state,
               names_to = "group",
               values_to = "obs")
# A tibble: 6 × 3
state group obs
<chr> <chr> <int>
1 MT C 1
2 MT D 4
3 NC C 2
4 NC D 5
5 SC C 3
6 SC D 6
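As a quick check on the two pivots above, the sketch below rebuilds x, widens it, lengthens it again, and compares the result with the original. The name round_trip is ours, and the arrange() step is there only because pivot_longer() returns the rows grouped by state rather than in the original order.

```r
library(tidyr)
library(dplyr)
library(tibble)

x <- tibble(
  state = rep(c("MT", "NC", "SC"), 2),
  group = c(rep("C", 3), rep("D", 3)),
  obs   = 1:6
)

round_trip <- x |>
  pivot_wider(names_from = group, values_from = obs) |>                 # long -> wide
  pivot_longer(cols = !state, names_to = "group", values_to = "obs") |> # wide -> long
  arrange(group, state)                                                 # restore x's row order

# Compare the columns of the round-tripped data with the original
identical(as.list(round_trip), as.list(x))
```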
Pivot Practice 2
Let’s try this on a real data set.
The Portland Trail Blazers are a National Basketball Association (NBA) team. These data reflect the points scored by 9 Trail Blazers players across the first 10 games of the 2021-2022 NBA season.
trailblazer <- read_csv("data/trailblazer21.csv")
Take a slice of the data above. Are these data in wide or long format?
trailblazer |>
  slice(1:3)
# A tibble: 3 × 11
Player Game1_Home Game2_Home Game3_Away Game4_Home Game5_Home Game6_Away
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Damian Lill… 20 19 12 20 25 14
2 CJ McCollum 24 28 20 25 14 25
3 Norman Powe… 14 16 NA NA 12 14
# ℹ 4 more variables: Game7_Away <dbl>, Game8_Away <dbl>, Game9_Home <dbl>,
# Game10_Home <dbl>
Wide format: each player appears only once in the Player column, and each game has its own column.
Pivot the data so that you have columns for Player, Game, Points. Save this as a new data set called new.blazer.
new.blazer <- trailblazer |>
  pivot_longer(cols = -Player,
               names_to = "Game",
               values_to = "Points")
Answer below. But don’t look until you’ve tried it on your own above!
Suppose now that you are asked to have two separate columns within these data: one to represent Game and one to represent Location. Let’s make this happen below. Save your new data set as new.blazer.
Comment through the code. Let’s introduce ourselves to a new function, separate_wider_delim(). Open up the help file and read about the different functions we can use!
new.blazer <- trailblazer |>            # save object as new.blazer
  pivot_longer(                         # make data longer
    cols = -Player,                     # bring all cols but Player down
    names_to = "Game",                  # name the new names col Game
    values_to = "Points"                # name the values col Points
  ) |>
  separate_wider_delim(                 # split Game at the underscore
    Game,
    delim = "_",
    names = c("Game", "Location")       # e.g. "Game1_Home" -> "Game1", "Home"
  )
new.blazer
# A tibble: 90 × 4
Player Game Location Points
<chr> <chr> <chr> <dbl>
1 Damian Lillard Game1 Home 20
2 Damian Lillard Game2 Home 19
3 Damian Lillard Game3 Away 12
4 Damian Lillard Game4 Home 20
5 Damian Lillard Game5 Home 25
6 Damian Lillard Game6 Away 14
7 Damian Lillard Game7 Away 20
8 Damian Lillard Game8 Away 26
9 Damian Lillard Game9 Home 4
10 Damian Lillard Game10 Home 25
# ℹ 80 more rows
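pivot_longer() can also split the column names while it pivots, through its names_sep argument, which folds the separate_wider_delim() step into the pivot. A hedged sketch on a small stand-in tibble: mini_blazer and mini_long are hypothetical names, and the values are copied from the first three games of the first two players shown above, since the CSV itself is not reproduced here.

```r
library(tidyr)
library(tibble)

# Stand-in for the first two players and first three games of trailblazer
mini_blazer <- tibble(
  Player     = c("Damian Lillard", "CJ McCollum"),
  Game1_Home = c(20, 24),
  Game2_Home = c(19, 28),
  Game3_Away = c(12, 20)
)

mini_long <- mini_blazer |>
  pivot_longer(
    cols      = -Player,
    names_to  = c("Game", "Location"),  # one new column per piece of the old name
    names_sep = "_",                    # split each column name at the underscore
    values_to = "Points"
  )

mini_long
```

Whether to split during the pivot or afterwards is a matter of taste; the two-step version in the answer above makes each step easier to comment and inspect.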
Now, use pivot_wider to reshape the new.blazer data frame such that you have a 90 x 4 tibble with columns Player, Game, Home, Away.
new.blazer |>
  pivot_wider(names_from = Location,
              values_from = Points)
# A tibble: 90 × 4
Player Game Home Away
<chr> <chr> <dbl> <dbl>
1 Damian Lillard Game1 20 NA
2 Damian Lillard Game2 19 NA
3 Damian Lillard Game3 NA 12
4 Damian Lillard Game4 20 NA
5 Damian Lillard Game5 25 NA
6 Damian Lillard Game6 NA 14
7 Damian Lillard Game7 NA 20
8 Damian Lillard Game8 NA 26
9 Damian Lillard Game9 4 NA
10 Damian Lillard Game10 25 NA
# ℹ 80 more rows
Data Types
Why is understanding data types important?
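Functions are written for particular data types. For example, a scatterplot expects numeric variables on both axes, so putting a character variable like Player on the x axis won’t give a meaningful scatterplot.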
Type coercion
Type coercion is the conversion of data from one type to another. R sometimes does this automatically, and we can also request it explicitly with functions like as.numeric().
Demo: Determine the type of the following vector. Then change the type to numeric.
as.numeric(x) # to change something to a number
[1] 1 2 3
Let’s try another example…
as.numeric(y) # gives us NAs because the character strings are letters, not numbers
Warning: NAs introduced by coercion
[1] NA NA NA
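The vectors used in the demo are not shown above, so the sketch below uses hypothetical stand-ins (digit_strings and letter_strings). It illustrates both kinds of coercion: the automatic kind that c() applies when types are mixed, and the explicit kind requested with as.numeric().

```r
# Hypothetical stand-ins for the demo vectors
digit_strings  <- c("1", "2", "3")
letter_strings <- c("a", "b", "c")

typeof(digit_strings)        # "character"
as.numeric(digit_strings)    # 1 2 3

# Strings that are not numbers become NA (the warning is silenced here)
suppressWarnings(as.numeric(letter_strings))   # NA NA NA

# Automatic coercion: c() converts mixed inputs to the most flexible type
typeof(c(1, "a"))     # "character"
typeof(c(TRUE, 2L))   # "integer"
```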
Survey Results
# A tibble: 3 × 1
cars
<chr>
1 1
2 2
3 three
This is annoying because of that third survey taker who just had to go and type out the number instead of providing it as a numeric value. So now you need to update the cars variable to be numeric. You do the following:
survey_results |>
  mutate(cars = as.numeric(cars))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `cars = as.numeric(cars)`.
Caused by warning:
! NAs introduced by coercion
# A tibble: 3 × 1
cars
<dbl>
1 1
2 2
3 NA
What warning comes out?
NAs introduced by coercion
This is because some of the character strings aren’t properly structured numbers and so can’t be converted to the numeric class.
And now things are even more annoying: the warning NAs introduced by coercion appeared while computing cars = as.numeric(cars), and the response from the third survey taker is now an NA (you lost their data). Let’s fix this.
survey_results |>
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = as.numeric(cars)
  )
# A tibble: 3 × 1
cars
<dbl>
1 1
2 2
3 3
Note: there are many ways to replace NA, depending on the situation. We can use case_when() for replacing multiple NAs. We can use replace_na(). Etc.
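Two of the alternatives mentioned in the note, sketched on a rebuilt copy of the survey data. The tibble is reconstructed from the printed output above; word_to_digit and na_filled are hypothetical names. The replace_na() version is reasonable only because we already know the lost response was 3.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuilt from the printed survey results above
survey_results <- tibble(cars = c("1", "2", "three"))

# case_when(): recode several spelled-out answers, then convert
word_to_digit <- survey_results |>
  mutate(
    cars = case_when(
      cars == "one"   ~ "1",
      cars == "two"   ~ "2",
      cars == "three" ~ "3",
      TRUE            ~ cars   # leave every other value unchanged
    ),
    cars = as.numeric(cars)
  )

# replace_na(): convert first, then fill the resulting NA with a known value
na_filled <- survey_results |>
  mutate(
    cars = suppressWarnings(as.numeric(cars)),  # "three" becomes NA here
    cars = replace_na(cars, 3)
  )
```

Recoding before converting (the case_when() route) is usually safer, because it never discards the original response.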
Characters vs Factors
In short:
Character = text
Factor = categories
Technically, a factor is a data “class”: a higher-level categorization of an object that defines its structure and the methods that can be applied to it. Other classes include lists and data frames, and character is a class too. We won’t spend too much time on the details here. Instead, we will focus on common situations involving type coercion.
A factor’s categories are stored as integer codes behind the scenes. This is important to think about when exploring type coercion.
Survey Results (Factor Edition)
# A tibble: 3 × 1
cars
<fct>
1 1
2 2
3 three
survey_results_F |>
  mutate(cars = as.numeric(cars))
# A tibble: 3 × 1
cars
<dbl>
1 1
2 2
3 3
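How is this different from before?
There is no warning and no NA, but that is not because R understood the word “three”. as.numeric() returned the factor’s underlying integer codes, and “three” happens to be the third level, so it is stored as 3.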
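To see that as.numeric() on a factor returns level codes rather than the printed labels, it helps to use labels that do not happen to match their codes. A small sketch with a hypothetical factor cars_f (not the survey_results_F data above):

```r
# Hypothetical factor whose labels differ from their integer codes
cars_f <- factor(c("10", "20", "three"))

levels(cars_f)       # "10" "20" "three"
as.numeric(cars_f)   # 1 2 3: the codes, not 10 and 20

# To convert the labels themselves, go through character first;
# "three" still becomes NA (the warning is silenced here)
suppressWarnings(as.numeric(as.character(cars_f)))   # 10 20 NA
```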
Your turn!
Your turn: First, check the type of each element in the vector below. Then, guess the type of the entire vector. Then, check if you guessed right by running typeof() on the vector.
Different responses
# to check after you guess
typeof(v1)
[1] "character"
# to check after you guess
typeof(v2)
[1] "double"
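The vectors v1 and v2 are not reproduced in this handout. As an illustration of the same rule, here are two hypothetical vectors (mixed_with_text and mixed_numbers) whose types match the answers above. An atomic vector holds a single type, so R coerces every element to the most flexible type present, in the order logical, integer, double, character.

```r
# Hypothetical examples, not the v1 and v2 from the exercise
mixed_with_text <- c(1L, "two", TRUE)   # integer, character, logical
mixed_numbers   <- c(1L, 2.5, TRUE)     # integer, double, logical

typeof(mixed_with_text)   # "character"
typeof(mixed_numbers)     # "double"

# Every element has been converted to the common type
mixed_with_text           # "1" "two" "TRUE"
mixed_numbers             # 1.0 2.5 1.0
```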