Midterm - In class - Solutions

STA 295 - Spring 2025

Name: ___________________________________

I hereby state that I have not communicated with or gained information in any way from my classmates, or any external resources during this exam, and that all work is my own.

Signature: _________________________________

Any potential violation of NC State’s policy on academic integrity will be reported. All work on this exam must be your own.

You have 75 minutes to complete the exam.
You are allowed one \(8\frac{1}{2}" \times 11"\) sheet of notes (cheat sheet) with writing on both sides and a pen or pencil, and you may ask the professor questions.
You are not allowed a cell phone (even to check the time), a music device or headphones, notes (other than your cheat sheet), books, or other resources, and you may not communicate with anyone other than the professor during the exam.
For multiple choice questions, please circle your answer.
This is a 60 point exam. Point values per question can be found next to the question name.

Good luck!

Question 1 (4 points)

In 1-2 sentences, explain the difference between R and RStudio.

R is a statistical programming language (the car engine), while RStudio is an integrated development environment (IDE) for working with R (the car dashboard).

Question 2 (4 points)

In 2-4 sentences, explain how GitHub can be a useful tool for statisticians, data scientists, and researchers.

GitHub is critical for reproducible and transparent work, both of which we need to strive for as a statistics and research community. It is also an important tool for collaboration.

Statistical Science majors

The data for this question is on Statistical Science majors at Duke over the years. The Department of Statistical Science offers two majors – Bachelor of Science (BS) and Bachelor of Arts (AB). Students who have Statistical Science as their first major are coded as BS and AB, those who have it as their second major are coded as BS2 and AB2. The data frame, statsci, is shown below. The question is on the next page.

# A tibble: 56 × 3
   year  degree     n
   <chr> <chr>  <dbl>
 1 2011  AB         2
 2 2011  AB2        0
 3 2011  BS         5
 4 2011  BS2        2
 5 2012  AB         2
 6 2012  AB2        1
 7 2012  BS         9
 8 2012  BS2        6
 9 2013  AB         4
10 2013  AB2        0
# ℹ 46 more rows

Question 3 (3 points)

Your teammate made the following plot and took a screenshot of it, but they haven’t saved the code in your Quarto document.

Now the whole team needs to work backwards from the plot and figure out the code that generated it. Which of the following produces the plot?

ggplot(statsci, aes(x = year, y = n, fill = degree)) + 
  geom_col()

ggplot(statsci, aes(x = degree, y = n, fill = year)) + 
  geom_col()

ggplot(statsci, aes(x = year, y = n, fill = degree)) + 
  geom_bar()

ggplot(statsci, aes(x = degree, y = n, fill = year)) + 
  geom_bar()

Question 4a (3 points)

Currently, the above plot is hard to read. In 1-2 sentences, explain why it would be better to display this information using relative proportions rather than counts.

The above plot is hard to read because the sample sizes differ greatly across years, so the mix of degrees is hard to compare from one year to the next. Relative proportions put every year on the same scale.

Question 4b (3 points)

Identify the argument that would go inside the appropriate geom to address the concern in Question 4a.

position = "stack"

position = "fill"

position = "prop"

position = "size"

Question 5 (2 points)

True or False: ggplot(statsci,...) is equivalent to statsci |> ggplot(...)

a. True

b. False
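As a quick check of why the two forms behave the same, here is a minimal sketch. R's native pipe is rewritten by the parser, so `x |> f(y)` becomes `f(x, y)` before anything is evaluated. The snippet compares the two parsed expressions with `quote()`, which does not evaluate them, so neither ggplot2 nor the `statsci` data needs to be loaded. It assumes R 4.1 or later, where `|>` is available; the `aes()` mapping shown is only an illustration.

```r
# Capture both forms as unevaluated expressions.
piped  <- quote(statsci |> ggplot(aes(x = year, y = n)))
nested <- quote(ggplot(statsci, aes(x = year, y = n)))

# The parser turns the piped form into the nested call,
# so the two expressions are the same object.
same_call <- identical(piped, nested)
print(same_call)
```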

Question 5b (2 points)

Choose one: aes() is a …

a. function

b. argument

Question 6 (3 points)

Another teammate has made the following table for your report. But they also haven’t saved their code for generating this table.

degree	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
AB	2	2	4	1	3	6	3	4	4	1	0	0	2	1
AB2	0	1	0	0	4	4	1	0	0	1	2	0	3	1
BS	5	9	4	13	10	17	24	21	26	27	35	33	53	41
BS2	2	6	1	0	5	6	6	8	8	17	16	15	29	19

Which of the following is the correct combination of values that should go into the blanks in the code cell below to get statsci into the right shape for this table? Note: You can revisit the original data set on page 2.

statsci |>
  pivot_wider(
    names_from = _BLANK_1_,
    values_from = _BLANK_2_
  )

	`BLANK_1`	`BLANK_2`
a.	`year`	`n`

b.	`degree`	`n`

c.	`degree`	`year`

d.	`n`	`degree`

e.	`n`	`year`
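To see how the two arguments shape the result, here is a small sketch that rebuilds only the first eight rows of statsci from the printed tibble (the 2011 and 2012 rows) and pivots them. The name `statsci_head` is made up for this illustration. `names_from` picks the column whose values become the new column names, and `values_from` picks the column that fills the cells.

```r
library(tibble)
library(tidyr)

# First eight rows of statsci, copied from the printed tibble.
statsci_head <- tibble(
  year   = rep(c("2011", "2012"), each = 4),
  degree = rep(c("AB", "AB2", "BS", "BS2"), times = 2),
  n      = c(2, 0, 5, 2, 2, 1, 9, 6)
)

# Column names come from year; cell values come from n.
statsci_wide <- statsci_head |>
  pivot_wider(names_from = year, values_from = n)

print(statsci_wide)
```

The result has one row per degree and one column per year, which matches the first two year columns of the table above.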

Gerrymandering

The gerrymander dataset includes information on Congressional Districts. For each Congressional District the dataset provides some results for the 2016 election (winning party, % votes for Clinton, % votes for Trump, whether a Democrat won the House election, name of election winner), some results for the 2018 election (winning party, whether a Democrat won the 2018 House election), whether the seat for that Congressional District was flipped between 2016 and 2018 elections (from Democrat to Republican or from Republican to Democrat), and prevalence of gerrymandering in the state the district is located in (low, mid, and high).

Below is a display of the gerrymander data frame.

# A tibble: 435 × 12
   first_name last_name district flip18 gerry party16 clinton16 trump16 dem16
   <chr>      <chr>     <chr>     <dbl> <fct> <chr>       <dbl>   <dbl> <dbl>
 1 Don        Young     AK-AL         0 mid   R            37.6    52.8     0
 2 Bradley    Byrne     AL-01         0 high  R            34.1    63.5     0
 3 Martha     Roby      AL-02         0 high  R            33      64.9     0
 4 Mike D.    Rogers    AL-03         0 high  R            32.3    65.3     0
 5 Rob        Aderholt  AL-04         0 high  R            17.4    80.4     0
 6 Mo         Brooks    AL-05         0 high  R            31.3    64.7     0
 7 Gary       Palmer    AL-06         0 high  R            26.1    70.8     0
 8 Terri      Sewell    AL-07         0 high  D            69.8    28.6     1
 9 Rick       Crawford  AR-01         0 mid   R            30.2    65       0
10 French     Hill      AR-02         0 mid   R            41.7    52.4     0
# ℹ 425 more rows
# ℹ 3 more variables: state <chr>, party18 <chr>, dem18 <dbl>

Question 7 (5 points)

Based on this output alone, which of the following must be true about the gerrymander data frame? Select all that apply.

a. There is no missing data in the gerrymander data frame.
b. The gerrymander data frame is a tibble.
c. The gerrymander data frame has 9 columns.
d. The gerrymander data frame has 10 rows.
e. The gerry variable in the gerrymander data frame is a factor variable.

b, e

Question 8 (6 points)

Comment on what each line of code is doing. Be specific.

tb1 <- gerrymander |>
  group_by(party18, gerry) |>
  summarise(n = n()) 

tb1

# A tibble: 6 × 3
# Groups:   party18 [2]
  party18 gerry     n
  <chr>   <fct> <int>
1 D       low      37
2 D       mid     139
3 D       high     51
4 R       low      25
5 R       mid     131
6 R       high     52

Line 1: take the gerrymander tibble and assign the result of the pipeline that follows to the name tb1

Line 2: group the gerrymander tibble by 2 variables (party18, gerry)

Line 3: calculate a summary statistic named n. That summary statistic is the number of rows (districts) in each party18 and gerry combination.

Question 8b (2 points)

Suppose you ran the following code. What would be the corresponding output for the gerry column? Write it below. For this question, you can assume the ordering of gerry is low, mid, high.

tb1 |>
  mutate(gerry = as.numeric(gerry))

gerry 1 2 3 1 2 3

Question 8c (2 points)

Suppose you ran the following code. What would be the corresponding output for the party18 column? Write it below.

tb1 |>
  mutate(party18 = as.numeric(party18))

party18 NA NA NA NA NA NA
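This answer and the one to Question 8b both follow from how `as.numeric()` treats the two column types. The sketch below uses base R only and rebuilds the two columns by hand from the printed tb1: a factor is stored as integer codes in level order, so converting it returns those codes, while a character vector of non-numeric strings cannot be parsed and becomes NA.

```r
# A factor stores integer codes for its levels, in level order.
gerry <- factor(
  c("low", "mid", "high", "low", "mid", "high"),
  levels = c("low", "mid", "high")
)
gerry_num <- as.numeric(gerry)

# A character vector has no codes; non-numeric strings become NA.
# R also emits a coercion warning, which is suppressed here.
party18 <- c("D", "D", "D", "R", "R", "R")
party18_num <- suppressWarnings(as.numeric(party18))

print(gerry_num)
print(party18_num)
```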

Question 8d (4 points)

Please see the following code below.

tb1 |>
  mutate(new_var = if_else(party18 == "D" | n > 50, "1", "0"))

Now, write out the new column created by the code above, including the column values. Use the appropriate column name, and include the correct column data type.

# A tibble: 6 × 4
# Groups:   party18 [2]
  party18 gerry     n new_var
  <chr>   <fct> <int> <chr>  
1 D       low      37 1      
2 D       mid     139 1      
3 D       high     51 1      
4 R       low      25 0      
5 R       mid     131 1      
6 R       high     52 1
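The output above can be traced one condition at a time. In this sketch, tb1 is rebuilt by hand from the printed output (ungrouped, for simplicity), and the helper columns `is_dem` and `over_50` are made-up names added only to show each half of the `|` (OR) condition. Because the two return values are the strings "1" and "0", new_var is a character column, not a numeric one.

```r
library(dplyr)
library(tibble)

# tb1 rebuilt from the printed output; grouping is omitted here.
tb1 <- tibble(
  party18 = c("D", "D", "D", "R", "R", "R"),
  gerry   = factor(rep(c("low", "mid", "high"), times = 2),
                   levels = c("low", "mid", "high")),
  n       = c(37L, 139L, 51L, 25L, 131L, 52L)
)

tb1_flagged <- tb1 |>
  mutate(
    is_dem  = party18 == "D",  # TRUE for rows 1 to 3
    over_50 = n > 50,          # TRUE for rows 2, 3, 5 and 6
    # "1" when either condition holds, "0" only when both fail (row 4).
    new_var = if_else(is_dem | over_50, "1", "0")
  )

print(tb1_flagged)
```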

Countries and populations

We have a small dataset of countries and their populations:

population

# A tibble: 5 × 3
  country_name         country_code pop_2023
  <chr>                <chr>           <dbl>
1 Hong Kong SAR, China HKG           7536100
2 Ireland              IRL           5262382
3 Kiribati             KIR            133515
4 Nicaragua            NIC           7046310
5 Slovenia             SVN           2120937

And another small dataset of countries and the continent they are in:

continents

# A tibble: 5 × 2
  entity       continent    
  <chr>        <chr>        
1 Ireland      Europe       
2 Kiribati     Oceania      
3 Nicaragua    North America
4 Sierra Leone Africa       
5 Slovenia     Europe

You join the two datasets with the following:

population |>
  inner_join(continents, by = join_by(country_name == entity))

Question 9 (3 points)

How many rows will the resulting data frame have?

c. 4
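The count can be verified with a short sketch that rebuilds both tibbles from the printed output. `inner_join()` keeps only rows whose key appears in both tables, and `anti_join()` is used here to list the rows each side loses. The names `joined`, `only_population`, and `only_continents` are chosen for this illustration.

```r
library(dplyr)
library(tibble)

population <- tibble(
  country_name = c("Hong Kong SAR, China", "Ireland", "Kiribati",
                   "Nicaragua", "Slovenia"),
  country_code = c("HKG", "IRL", "KIR", "NIC", "SVN"),
  pop_2023     = c(7536100, 5262382, 133515, 7046310, 2120937)
)

continents <- tibble(
  entity    = c("Ireland", "Kiribati", "Nicaragua", "Sierra Leone", "Slovenia"),
  continent = c("Europe", "Oceania", "North America", "Africa", "Europe")
)

# Keep only countries present in both tables.
joined <- population |>
  inner_join(continents, by = join_by(country_name == entity))

# Rows with no match on the other side, which inner_join() drops.
only_population <- population |>
  anti_join(continents, by = join_by(country_name == entity))
only_continents <- continents |>
  anti_join(population, by = join_by(entity == country_name))

print(nrow(joined))
```

Hong Kong SAR, China has no match in continents and Sierra Leone has no match in population, so only the four shared countries remain.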

Question 10 (5 points)

Using the population data set, your fellow researcher made the following plot.

You are not impressed. List 5 specific things wrong with the plot above.

no title
poor y-axis label
poor x-axis label
poor z-axis label
could comment on color…
unnecessary legend

Question 10b (3 points)

We want to calculate the mean of the pop_2023 column. Please write out the appropriate functions that should take the place of A1 and A2.

population |>
  A1(mean_pop = A2(pop_2023))

A1: summarise

A2: mean
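As a minimal sketch of the filled-in code, the block below rebuilds the relevant columns of population from the printed tibble and runs the pipeline. `summarise()` collapses the five rows into a single row holding the mean; the name `mean_tbl` is made up for this illustration.

```r
library(dplyr)
library(tibble)

# Columns copied from the printed population tibble.
population <- tibble(
  country_code = c("HKG", "IRL", "KIR", "NIC", "SVN"),
  pop_2023     = c(7536100, 5262382, 133515, 7046310, 2120937)
)

# A1 = summarise, A2 = mean: one row, one column named mean_pop.
mean_tbl <- population |>
  summarise(mean_pop = mean(pop_2023))

print(mean_tbl)
```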

Question 11

Aside: Classical music historians might disagree on the precise counts, but the values below are widely accepted.

composers

# A tibble: 6 × 3
  composer  genre    count
  <chr>     <chr>    <dbl>
1 Beethoven symphony     9
2 Beethoven opera        1
3 Beethoven concerto     9
4 Mozart    symphony    41
5 Mozart    opera       22
6 Mozart    concerto    37

Question 11 (3 points)

Just like addition and subtraction or multiplication and division, pivot_wider() and pivot_longer() are inverses. They undo each other.

Imagine we pivot the data frame composers given above, get it to the following format, and name it composers_pivoted:

composers_pivoted

# A tibble: 2 × 4
  composer  symphony opera concerto
  <chr>        <dbl> <dbl>    <dbl>
1 Beethoven        9     1        9
2 Mozart          41    22       37

See next page for the rest of the question.

Which of the following will undo this transformation and give us back the original, composers, exactly?

composers_pivoted |>
  pivot_longer(
    cols = !composer, 
    names_to = c("symphony", "opera", "concerto"), 
    values_to = "count"
  )

composers_pivoted |>
  pivot_longer(
    cols = !composer, 
    names_to = "count", 
    values_to = c("symphony", "opera", "concerto")
  )

composers_pivoted |>
  pivot_longer(
    cols = !composer, 
    names_to = "count", 
    values_to = "genre"
  )

composers_pivoted |>
  pivot_longer(
    cols = !composer, 
    names_to = "genre", 
    values_to = "count"
  )
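The round trip can be checked with a short sketch that rebuilds composers from the printed tibble, widens it, and lengthens it again. In `pivot_longer()`, `names_to` is the name of the new column that receives the old column names (the genres), and `values_to` is the name of the new column that receives the cell values (the counts). The names `round_trip` and `round_trip_ok` are chosen for this illustration.

```r
library(tibble)
library(tidyr)

# composers, copied from the printed tibble.
composers <- tibble(
  composer = rep(c("Beethoven", "Mozart"), each = 3),
  genre    = rep(c("symphony", "opera", "concerto"), times = 2),
  count    = c(9, 1, 9, 41, 22, 37)
)

# Widen: one column per genre.
composers_pivoted <- composers |>
  pivot_wider(names_from = genre, values_from = count)

# Lengthen again: old column names go to genre, cell values go to count.
round_trip <- composers_pivoted |>
  pivot_longer(
    cols = !composer,
    names_to = "genre",
    values_to = "count"
  )

# Compare as plain data frames to check that the original is recovered.
round_trip_ok <- isTRUE(all.equal(as.data.frame(round_trip),
                                  as.data.frame(composers)))
print(round_trip_ok)
```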

End.