TDM 20100: Project 5 — More practice with pipelines
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/flights/subset/
(airplane data) -
/anvil/projects/tdm/data/election
(election data) -
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
(grocery store data) -
/anvil/projects/tdm/data/beer/breweries.csv
(breweries data) -
/anvil/projects/tdm/data/icecream/combined/reviews.csv
(ice cream review data)
Questions
Question 1 (2 pts)
For this question, use the flight files found here:
/anvil/projects/tdm/data/flights/subset/[12]*.csv
-
Make a list of all of the values that appear in column 9, namely, the UniqueCarrier values. (These are abbreviations for the airlines.) For each such airline, list the number of flights on each airline. The list is short enough that you can display the full list of airlines and the number of flights on each airline.
-
Make a list of all of the TailNum values that appear in column 11 BUT do not print this list! Just print the number of (distinct) TailNum values.
You can just use |
There are more than 10,000 TailNum values, so please do not print the full list of TailNum values. You only need to print the total number of (unique) TailNum values that occur. |
Note: If you ever looked at an airplane (or a picture of an airplane), you might have noticed that there is one TailNum painted onto the tail of each airplane, which uniquely identifies that airplane. So, in part b, you are printing the number of airplanes that have flown in the United States! |
-
a. Print a list of all of all UniqueCarrier values and how many times that each UniqueCarrier appears.
-
b. Print the number of (distinct) TailNum values that occur (do not print the list itself; there are more than 10,000 values!)
Question 2 (2 pts)
In the grocery store data:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
how many (distinct) PRODUCT_NUM values appear?
You only need to print the number of distinct PRODUCT_NUM values. Please do NOT print the entire list. There are more than 100,000 such values! |
-
Print the number of (distinct) PRODUCT_NUM values that appear. (Do not print the list itself!)
Question 3 (2 pts)
Consider the election files:
/anvil/projects/tdm/data/election/itcont*.txt
There are more than 200 million donations altogether (i.e., 200 million lines of data; one donation per line). But the number of committee IDs (which are in column 1) is much smaller.
-
How many (distinct) committee IDs appear altogether?
-
Which committee ID received the largest number of donations? How many donations did this committee ID receive? (Do not worry about the monetary amounts. Just keep track of the number of donations. There is one donation per line.)
-
a. The number of (distinct) committee IDs that appear.
-
b. The committee ID that received the largest number of donations, and how many donations that this committee ID received.
Question 4 (2 pts)
In the breweries data in this file:
/anvil/projects/tdm/data/beer/breweries.csv
use the grep
command to print all 22 lines of data (everything on each line) corresponding to Lafayette or West Lafayette (Indiana).
-
The full contents of the lines of data corresponding to Lafayette or West Lafayette (Indiana). (Please print the whole line of data each time, so you are printing 22 lines of data!)
Question 5 (2 pts)
In this ice cream reviews file:
/anvil/projects/tdm/data/icecream/combined/reviews.csv
change each space into a newline, for instance, using any of the methods here:
It might be easiest, for instance, to embed:
tr ' ' '\n'
into your pipeline, which turns each space into a newline.
Then sort the resulting lines, which will now have only 1 word per line, and find the 25 words that occur most often in the file, along with the number of occurrences of each such word. Hint: The 5 most popular words are:
18688 ice
19480 a
26848 and
29090 I
34937 the
-
The 25 words that occur most often in the file, along with the number of occurrences of each such word.
Submitting your Work
You are now very familiar with bash pipelines! BUT you can still ask us questions anytime, if you need advice or help!
-
firstname-lastname-project5.ipynb
You must double check your You will not receive full credit if your |