TDM 10100: R Project 3 — 2024
Motivation: Now that we are comfortable with the table
command in R, we can learn about the tapply
command. The tapply
will apply a function to values in one column, which are grouped according to another column. This sounds abstract, but once you see some examples, it makes a lot of sense.
Context: tapply
takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column.
Scope: tapply
in R
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/election/itcont1980.txt
-
/anvil/projects/tdm/data/icecream/combined/products.csv
-
/anvil/projects/tdm/data/flights/subset/1990.csv
-
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
-
/anvil/projects/tdm/data/olympics/athlete_events.csv
Questions
As before, please use the |
Three examples of the tapply
function:
Example 1 Using the 1980 election data, we can find the amount of money donated in each state.
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
We take the money for the donations (in the TRANSACTION_AMT
column), group the data according to the state (in the STATE
column), and sum up the donation amounts:
tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum)
Example 2 Using the ice cream products data, we can find the average rating for each brand.
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
We take the rating (in the rating
column), group the data according to the brand (in the brand
column), and take an average of these reviews:
tapply(myDF$rating, myDF$brand, mean)
Example 3 Using the 1990 airport data, we can find the average departure delay for flights from each airport.
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
We take the departure delays (in the DepDelay
column), group the data according to airport where the flights depart (in the Origin
column), and take an average of these departure delays:
tapply(myDF$DepDelay, myDF$Origin, mean)
The values show up as "NA" (not available) because some values are missing, so R cannot take an average. In such a case, we can give R the fourth parameter na.rm=TRUE
so that it ignores missing values, and we try again:
tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=TRUE)
For Dr Ward, using the Firefox browser, 1 core was enough for this entire project, but Dr Ward met one student who demonstrated that he needed 2 cores, even in Firefox. So if you cannot load the 1990 flights subset with 1 core, then you might want to try it with 2 cores. Please make sure that you are using Firefox. |
Question 1 (2 pts)
Using the 1990 airport data, find the average arrival delay for flights arriving to each airport.
The arrival delays are in the |
In the three examples at the start of the project (before Question 1), we used:
to load the data. I recommend that you use the |
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Find the sum of the amount spent (in the SPEND
column) at each of the store regions (the STORE_R
column).
Question 3 (2 pts)
In the grocery store file (same file from question 2):
Find the total amount of money spent in 2016 altogether, and the total amount of money spent in 2017 altogether. (You can use the tapply
to do this with just one cell.)
Question 4 (2 pts)
In the Olympics file /anvil/projects/tdm/data/olympics/athlete_events.csv
Find the average height of the athletes in each country (the country is the NOC
column).
Remember to use |
Question 5 (2 pts)
In the Olympics file (same file from question 4):
Find the average height of the athletes in each sport (the sport is the Sport
column, of course!). After finding these average heights, please sort your results. In which sport are the athletes the tallest (on average)? Does this make sense intuitively, i.e., is height an advantage in this sport?
Again, remember to use |
Submitting your Work
We only learned about tapply
in this project because it is a short week, but it is powerful! As always, please ask any questions you have, on Piazza, or in office hours. We hope you have a nice Labor Day weekend!
-
firstname_lastname_project3.ipynb
You must double check your You will not receive full credit if your |