PA 9: COVID-19 Infections and Impact on States

Author

Instructions

This task is complex. It requires many different types of abilities. Everyone will be good at some of these abilities but nobody will be good at all of them. In order to solve this puzzle, you will need to use the skills of each member of your group.

Groupwork Protocols

During the Practice Activity, you and your partner will alternate between two roles—Computer and Coder.

When you are the Computer, you will type into the Quarto document in RStudio. However, you do not type your own ideas. Instead, you type what the Coder tells you to type. You are permitted to ask the Coder clarifying questions, and, if both of you have a question, you are permitted to ask the professor. You are expected to run the code provided by the Coder and, if necessary, to work with the Coder to debug the code. Once the code runs, you are expected to collaborate with the Coder to write code comments that describe the actions taken by your code.

When you are the Coder, you are responsible for reading the instructions / prompts and directing the Computer what to type in the Quarto document. You are responsible for managing the resources your group has available to you (e.g., cheatsheet, textbook). If necessary, you should work with the Computer to debug the code you specified. Once the code runs, you are expected to collaborate with the Computer to write code comments that describe the actions taken by your code.

Here are more details of the Pair Programming Protocols

Note

The partner who has the most relaxing spring break plans (this is of course relative) will start as the Computer (typing and listening to instructions from the Coder).

Group Norms

Remember, your group is expected to adhere to the following norms:

Be curious. Don’t correct.
Be open minded.
Ask questions rather than contribute.
Respect each other.
Allow each teammate to contribute to the activity through their role.
Do not divide the work.
No cross talk with other groups.
Communicate with each other!

Goals for the Activity

Use the dplyr verbs to transform your data
Create new data sets through the joining of data from various sources
Use forcats to reorder values on a graph for easier comparison

THROUGHOUT THE ACTIVITY, be sure to follow the Style Guide by doing the following:

load the appropriate packages at the beginning of the Quarto document
use proper spacing
add labels to all code chunks
comment at least once in each code chunk to describe why you made your coding decisions
add appropriate labels to all graphic axes

Setting up your Project

Your project should have the following components:

completed pa-9-covid-factors-activity.qmd
rendered file as .html
data sub-folder that contains
- created data
- us-pop-2019.csv

Important

The original Computer should submit the zip file for the Canvas quiz. The original Coder can just submit the rendered .html file.

Computer - Be sure to share the final .qmd and .html file with the original Coder.

Data Description - United States COVID-19 Cases and Deaths

Starting in January 2020, the New York Times began reporting on COVID-19 infections in the United States and eventually created a GitHub repository of the data they used and reported on in their stories (a practice called “data journalism”). They ended their data collection in March 2023 and switched to using data from national reporting systems.

We will use their data to evaluate COVID-19 and how it varied across different states. Here are the NY Times data on cumulative cases by date and state (including territories).

# load the tidyverse, which provides read_csv() and the dplyr verbs used below
library(tidyverse)
# data comes from NY Times GitHub Repository - ends March 2023
# not evaluating this code chunk because we will clean the data and use the clean data set
cases <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

Note that both cases and deaths are cumulative by date. We want the new cases reported each day so we can compute various totals across the states.

To extract the daily cases, we can use the following code, which calculates the difference between each day's cumulative cases and the previous day's using the diff() function from base R. Because diff() returns one fewer value than it is given, the first cumulative value is placed in front as the first day's count.

Warning

Your team should create a data subfolder in your project to hold both the created data and the population data.

# calculates the unique new daily cases and deaths for each state
# eval is false since we created a new data set we will use going forward
cases |> 
  group_by(state) |> 
  arrange(state, date) |> 
  mutate(cases_daily = c(cases[1], diff(cases)),
         deaths_daily = c(deaths[1], diff(deaths))) |> 
  ungroup() |> 
  write_csv("data/covid-cases-us.csv")
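
To see what the c(x[1], diff(x)) pattern does, here is a minimal sketch on a single made-up vector. The numbers are hypothetical and are not taken from the NY Times data.

```r
# Hypothetical cumulative counts for one state over five days (made-up numbers)
cumulative <- c(2, 5, 5, 9, 14)

# diff() returns the day-to-day changes, so it is one element shorter than its input
changes <- diff(cumulative)

# Putting the first cumulative value in front keeps one value per day
daily <- c(cumulative[1], changes)

daily
# 2 3 0 4 5

# The daily values add back up to the final cumulative count
sum(daily) == cumulative[length(cumulative)]
# TRUE
```

In the pipeline above, group_by(state) makes this calculation restart for each state, so one state's first day is never differenced against another state's last day.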

Now that our data is in the format we need (sort of), we will start our analysis.

Part 1: Greatest Number of Cases

First, we need to read in our clean data.

Note - there may be an error in the code below.

cases <- read_csv("covid_cases_us.csv")

Which 10 US States/Territories Had the Most COVID-19 Cases from January 2020 to March 2023?

We want to identify the states with the most cases (top 10). To do this, we will need to modify our data to find the total number of cases for each state. The code below does this but it has errors!

Correct the errors and then assign the resulting data to states_totals.

states_totals <- cases |> 
  groupby(state) |> 
  sum(total = sum(case_daily)) |> 
  ungroup() # this is useful to do after a group_by calculation is complete so you do not continue grouping by those variables

Now use states_totals to identify the top ten states/territories with the most cases between January 2020 and March 2023. Then make a graph ordered from the greatest number of cases to the least (Hint: use forcats functions to help you).
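
As a hint for these two steps, here is a minimal sketch on made-up data. The fruit_counts table and its columns are hypothetical and only stand in for the activity data. It assumes the dplyr and forcats packages are installed.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data (made-up counts, unrelated to the COVID-19 data)
fruit_counts <- tibble(
  fruit = c("apple", "banana", "cherry", "date"),
  count = c(12, 40, 7, 25)
)

top_three <- fruit_counts |>
  # keep the three rows with the largest values of count
  slice_max(count, n = 3) |>
  # turn fruit into a factor whose levels are sorted by count (smallest first)
  mutate(fruit = fct_reorder(fruit, count))

levels(top_three$fruit)
# "apple" "date" "banana"
```

When a factor like this is mapped to the y axis of a bar chart, ggplot2 draws the first level at the bottom, so the largest bar ends up at the top.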

states_totals |> 
  ## extract just the top ten (Hint - look at the `slice_*()` functions)
  ## create a bar chart with x as the totals and y as the state names
  ## order the states by greatest to smallest total on the graph
  ## label, color, update as necessary to provide all relevant information

Canvas Quiz Question 1

Which state had the most COVID-19 cases between January 2020 and March 2023?

Part 2: Updating our Data Source

Hopefully you realized that there is a flaw in our comparison. Many of the states with the most cases are also some of the biggest states in terms of population size! So is there a better way to make comparisons between groups with differing group sizes at baseline? YES! Here are a few options:

Calculate the Proportion
Calculate the Percentage
Calculate a Rate, such as Rate per 100,000 people
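
The three options differ only in how the same ratio is scaled. A small worked sketch, using made-up numbers for a single hypothetical region:

```r
# Hypothetical values for one made-up region (not real data)
total_cases <- 50000
population <- 2000000

# Proportion: the share of the population, between 0 and 1
proportion <- total_cases / population

# Percentage: the same share expressed out of 100
percentage <- 100 * total_cases / population

# Rate per 100,000: the same share expressed out of 100,000 people
rate_per_100k <- 100000 * total_cases / population

c(proportion = proportion, percentage = percentage, rate_per_100k = rate_per_100k)
```

Here the proportion is 0.025, the percentage is 2.5, and the rate is 2,500 cases per 100,000 people. Because all three are the same ratio on different scales, they rank states in the same order.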

To do any of the above calculations, we need the population size of each state. Here is data from the US Census Bureau on each state's estimated population size in 2019.

Note - if this code does not work, check that your project folders are organized as described in Setting up your Project.

us_pop <- read_csv("data/us-pop-2019.csv", show_col_types = FALSE)

Calculate a Relative Rate for Comparison

Choose one of the statistics above to compare states/territories. Then recreate your graph from above using the new metric in place of the total number of cases.
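
The join step can be sketched on toy data as follows. All table and column names here (totals, pops, region, area_name, population) are hypothetical, so inspect us_pop to find the actual column names before writing your own join. The sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy tables (made-up values); the key columns have different names
totals <- tibble(
  region = c("North", "South", "East"),
  total = c(300, 150, 90)
)
pops <- tibble(
  area_name = c("North", "South", "West"),
  population = c(10000, 3000, 500)
)

joined <- totals |>
  # keep every row of totals and match region against area_name
  left_join(pops, by = c("region" = "area_name")) |>
  # scale the ratio to a rate per 100,000 people
  mutate(rate_per_100k = 100000 * total / population)

joined
```

North has the larger total (300 versus 150), but South has the higher rate (5,000 versus 3,000 per 100,000), which is the kind of change in ranking this part of the activity looks for. East has no match in pops, so its population and rate are NA; rows like that are worth checking for after a join.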

states_totals |> 
  ## join in column with state population sizes  
  ## calculate your chosen relative rate (e.g. proportion, percentage, rate)
  ## extract just the top ten (Hint - look at the `slice_*()` functions)
  ## create a bar chart with x as the relative rate and y as the state names
  ## order the states by greatest to smallest relative rate on the graph
  ## label, color, update as necessary to provide all relevant information

Canvas Quiz Question 2

Which state had the highest number of COVID-19 cases relative to population size between January 2020 and March 2023?

Other Questions

What are some possible issues with our analysis? Consider the following:

Data sources and reliability of the data
Time scale and impact of time

What other questions would you like to try to answer using this data? Would that require other data sources? What information would you need to answer your questions?
