Qual Data Exploration

Packages

You will need to run install.packages() for wordcloud and tidytext before this activity

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'friends' was built under R version 4.4.3
Warning: package 'wordcloud' was built under R version 4.4.2
Loading required package: RColorBrewer
Warning: package 'tidytext' was built under R version 4.4.2

Coffee with Friends

We are going to explore which character loves (said?) coffee the most!

For this, we are going to use the function str_detect(). Specifically, we are going to keep only the rows whose utterance contains the word coffee. Let’s save this as a new R object, friends_coffee.

friends_coffee <- friends |>
  filter(str_detect(text, fixed("coffee", ignore_case = TRUE)))
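As a quick sanity check of how fixed() behaves (a toy sketch on made-up lines, not the friends data), note that the match is literal and case-insensitive:

```r
library(stringr)

lines <- c("I'll get the coffee.", "COFFEE!", "A cup of tea.")
str_detect(lines, fixed("coffee", ignore_case = TRUE))
# TRUE TRUE FALSE
```

Because of ignore_case = TRUE, both "coffee" and "COFFEE" match, so every capitalization of the word survives the filter.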

We are going to narrow the scope of our exploration to just the main 6 characters (so we can continue to practice with fun new functions). Below, you are given a cast R object to use.

At the same time, our goal is to now visualize these data using a bar plot. To start, let’s make a contingency table that counts the number of times each speaker says coffee. Arrange this table in descending order.

cast <- c("Ross", "Rachel", "Chandler", "Phoebe", "Joey", "Monica")

friends_coffee |>
  filter(str_detect(speaker, paste(cast, collapse = "|"))) |> # keep the main six
  mutate(speaker = fct_infreq(speaker)) |> # order factor levels by frequency
  group_by(speaker) |>
  summarize(n = n()) |>
  arrange(desc(n)) |>
  slice(1:6) |>
  ggplot(
    aes(x = speaker, y = n)
  ) +
  geom_col(fill = "brown") +
  labs(title = "Coffee characters", 
       y = "", 
       x = "cast") +
  geom_text(aes(label = n), nudge_y = 2) # value on top of each bar

Now that we have our table, let’s subset the rows to only the speakers that appear in the values of our cast object. Note: We can subset our rows by more than one string as long as the patterns are separated by a |, which means “or” in a regular expression. Let’s go through a clever way to build that pattern. Run the following code:

paste(cast, collapse = "|")
[1] "Ross|Rachel|Chandler|Phoebe|Joey|Monica"

It’s good practice to put the bars in descending order, and we can see that the bars didn’t keep the order of the table we gave them. There are, again, many ways to accomplish this. Let’s do it with fct_infreq(). Pull up the help file, and alter your code above so the bars are in descending order.
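To see what fct_infreq() does on a small scale (a toy vector, not the friends data), notice that it orders the factor levels by how often each value occurs, most frequent first:

```r
library(forcats)

x <- c("Joey", "Ross", "Joey", "Joey", "Ross", "Monica")
levels(fct_infreq(x))
# "Joey" "Ross" "Monica"
```

Because ggplot2 plots a discrete axis in factor-level order, reordering the levels is what reorders the bars.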

Word Clouds

To make word clouds, we are going to need to install tidytext and wordcloud. We did that above!

Thought exercise: What is a word cloud? What do we think the data should look like in order to make a word cloud plot?

It would be difficult to count word frequencies when each value is an entire sentence. The data would be easier to work with if each word were in its own row.

And we can start doing that by using unnest_tokens()! Comment through the code below.

tidy <- friends |>
  unnest_tokens(word, text) # one row per word; lowercased, punctuation removed
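To see what unnest_tokens() does on a small scale (a made-up two-row tibble, not the real data), note that it lowercases the text, strips punctuation, and produces one row per word while keeping the other columns:

```r
library(tibble)
library(tidytext)

demo <- tibble(speaker = c("Ross", "Rachel"),
               text    = c("We were on a break!", "No we weren't!"))

demo |>
  unnest_tokens(word, text)
# 8 rows: we, were, on, a, break, no, we, weren't (speaker column repeated)
```

The first argument to unnest_tokens() names the new column (word) and the second names the column being split apart (text).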

Do you care about words like and?

No

The tidytext package has a database of just over a thousand of these words, including ‘and’:

print(stop_words, n = 30)
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
11 afterwards  SMART  
12 again       SMART  
13 against     SMART  
14 ain't       SMART  
15 all         SMART  
16 allow       SMART  
17 allows      SMART  
18 almost      SMART  
19 alone       SMART  
20 along       SMART  
21 already     SMART  
22 also        SMART  
23 although    SMART  
24 always      SMART  
25 am          SMART  
26 among       SMART  
27 amongst     SMART  
28 an          SMART  
29 and         SMART  
30 another     SMART  
# ℹ 1,119 more rows

We don’t want “stop words” to be in our plot.

  1. Make a contingency table using count() to display how many times words appear

  2. Remove all rows from tidy that match with stop_words (hint: think about how we have put two data sets together in the past)

  3. Save this as an R object called clean_words

clean_words <- tidy |>
  count(word, sort = TRUE) |> # frequency table, most common words first
  anti_join(stop_words, by = "word") # drop every row that matches a stop word

Note: The answer is at the bottom of the document!

Let’s build a word cloud using the function wordcloud(). Comment through the code below.

wordcloud(clean_words$word, # the words on the plot
          freq = clean_words$n, # word frequency, which sets the size
          min.freq = 300, # minimum number of times a word must appear to be plotted
          random.order = FALSE, # plot the most frequent words in the center
          rot.per = .3, # proportion of words rotated 90 degrees
          colors = brewer.pal(4, "Dark2")) # color palette (number of colors and palette name)

Note: Here is a resource on brewer.pal
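A quick way to explore what brewer.pal() is handing to wordcloud() (a small sketch; RColorBrewer was already loaded above):

```r
library(RColorBrewer)

brewer.pal(4, "Dark2")   # four hex color codes drawn from the "Dark2" palette
display.brewer.all()     # plot a preview of every available palette
```

wordcloud() maps these colors to words by frequency, so more colors means finer frequency bands.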

wordcloud2

Resource

Optional

geom_text() can be a really useful geom to enhance data visualizations. You used geom_text() on your take-home exam. How could we use geom_text() to enhance the bar plot created above?

We could make the plot more precise by displaying each count on top of its bar! We did this above.

Let’s pull up the help file for geom_text() and scroll down to the aesthetics.
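As a small illustration of those aesthetics (toy data standing in for the coffee counts; the values are made up), vjust is an alternative to the nudge_y approach used above for lifting labels off the bars:

```r
library(ggplot2)

# Hypothetical counts, only for demonstrating geom_text()
df <- data.frame(speaker = c("Ross", "Rachel"), n = c(10, 14))

ggplot(df, aes(x = speaker, y = n)) +
  geom_col(fill = "brown") +
  geom_text(aes(label = n), vjust = -0.5) # vjust < 0 places labels above the bars
```

Unlike nudge_y, which shifts labels by a fixed data-unit amount, vjust shifts them relative to the text height, so it adapts when the axis scale changes.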

Here is an additional resource

Data Set Solutions

clean_words <- tidy |>
  count(word, sort = TRUE) |>
  anti_join(stop_words, by = "word")