colleges <- read_csv("https://raw.githubusercontent.com/statistical-computing-r/spring-2026/refs/heads/main/group-activities/week-6/data/colleges.csv")PA 5: Identify the Mystery College
Data Wrangling with dplyr
Today you will use the dplyr package to clean some data. We will then use that cleaned data to figure out what college Monte has been accepted to.
This task is complex. It requires many different types of abilities. Everyone will be good at some of these abilities but nobody will be good at all of them. In order to solve this puzzle, you will need to use the skills of each member of your group.
Groupwork Protocols
During the Practice Activity, you and your partner will alternate between two roles—Computer and Coder.
When you are the Computer, you will type into the Quarto document in RStudio. However, you do not type your own ideas. Instead, you type what the Coder tells you to type. You are permitted to ask the Coder clarifying questions, and, if both of you have a question, you are permitted to ask the professor. You are expected to run the code provided by the Coder and, if necessary, to work with the Coder to debug the code. Once the code runs, you are expected to collaborate with the Coder to write code comments that describe the actions taken by your code.
When you are the Coder, you are responsible for reading the instructions / prompts and directing the Computer what to type in the Quarto document. You are responsible for managing the resources your group has available to you (e.g., cheatsheet, textbook). If necessary, you should work with the Computer to debug the code you specified. Once the code runs, you are expected to collaborate with the Computer to write code comments that describe the actions taken by your code.
Here are more details of the Pair Programming Protocols
The partner whose birthday is closest to January 1st starts as the Computer (typing and listening to instructions from the Coder).
Group Norms
Remember, your group is expected to adhere to the following norms:
- Be curious. Don’t correct.
- Be open minded.
- Ask questions rather than contribute.
- Respect each other.
- Allow each teammate to contribute to the activity through their role.
- Do not divide the work.
- No cross talk with other groups.
- Communicate with each other!
Goals for the Activity
- Solve issues in your data as you read it in specifying NA values and cleaning up variable names
- Apply data verbs from
dplyrto extract specific information from a large data set.
THROUGHOUT the Activity be sure to follow the Style Guide by doing the following:
- load the appropriate packages at the beginning of the Quarto doc
- use proper spacing
- name all code chunks
- comment at least once in each code chunk to describe why you made your coding decisions
- add appropriate labels to all graphic axes
Setting up your Project
Your project should have the following components:
datathat will eventually hold your cleaned up data
- completed pa-5-dplyr-reshaping-activity.qmd
- rendered file as
.html
Part One: Data Import and Cleaning
This section will clean today’s data so that you can use it more easily in Part Two.
Data Download & Package Loading
First, we declare our package dependencies and load the data.
The data loading function read_csv() will give you an outpouring of helpful information about the dataset. If you do not see the word “error”, there is nothing to be concerned about.
Take a look at the variables in your downloaded data by running the following code. This tells reports the data type for each column in the dataset.
Data Cleaning
Now we will clean the data. Alas, each of the R chunks in this section will cause an error and / or do the desired task incorrectly. Even the chunks that run without error are not correct! You will need to find the mistake and correct it to complete the intended action.
Step 1: Instead of a blank, they have “NULL” as the missing value. Add an argument to the data import code above to specify the missing values as “NULL”.
Step 2 Clean the variable names to be snake_case (Hint: use janitor) by adding an additional argument in the data import code chunk.
Step 3 Write the cleaned data set to a sub-folder called data (you will need to make this in your project first.)
Practice with dplyr Verbs
In each of the following actions there are errors in the code - fix the errors to run the code.
Choosing variables with select
There are too many variables in this data set. We don’t need all of them. Narrow your data set down to only:
instmname of the institutioncitycity,stabbrstate, andzipZIP code of the institutionadm_rateadmissions ratesat_avgaverage SAT scoreugdsnumber of undergraduate studentstuitionfee_inin- andtuitionfee_outout-of-state tuitioncontrolWhether the school is public or private- `region`` region of the school.
Choosing rows with filter
Remove the schools that are private (category 2) and for-profit (category 3).
Learn a new function to create a variable
Create a new variable called tuition_diff which contains the difference between out-of-state and in-state costs. Look at the cheatsheet to identify the new function you need to use.
Learn a new function to drop missing values
Identify the right function name that removes every row with missing data as the function provided drop.na() is incorrect.
Combine everything into one pipe
At this point we’ve made six different datasets, but notice each of these steps started with colleges_clean <- colleges_clean |>. That is pretty redundant! Instead, we could have performed all these tasks as one long “pipeline.”
Combine your (fixed) code chunks into a single code chunk that carries out all of the steps necessary to clean the data.
Think about coding efficiency – you should not have multiple calls to the same function! You should always start by selecting just the variables you need.
# A tibble: 5 × 12
instnm city stabbr zip adm_rate sat_avg ugds tuitionfee_in tuitionfee_out
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 DigiPe… Redm… WA 98052 0.539 1274 1019 29000 29000
2 Waldor… Fore… IA 5043… 0.718 1027 2053 21664 21664
3 School… New … NY 10010 0.724 1139 3559 38000 38000
4 Bob Jo… Gree… SC 2961… 0.862 1146 2367 16500 16500
5 LIM Co… New … NY 1002… 0.929 1032 1427 26395 26395
# ℹ 3 more variables: control <dbl>, region <dbl>, tuition_diff <dbl>
Part Two: Identify the Mystery College
Wow! Your best friend Monte has been accepted to his top choice college! Unfortunately, Monte is an otterly mysterious person, and he won’t tell you directly which college this is. You’ll have to use his clues to figure out which school is his dream school.
Clues:
This college is located in Region 8.
This college’s admission rate is less than 85% (0.85).
This college does not charge the same for in- and out-of-state tuition. You will need the code that created the new variable
tuition_difffrom above.The average SAT score of this college is an even number. (Hint: an even number modulo (
%%) 2 is equal to 0).
This college is located in the contiguous 48 states..
Less than 15,000 undergraduates attend this college.
Of the five options remaining at this step, Monte will attend the college that has the lowest admission rate (organize the data so it is the first listed of the five).
Think about coding efficiency – you should not have multiple calls to the same function! In addition, you should only start with the variables you need (Hint: see select code above).
Submit the college Monte will attend to the Canvas Quiz.