TDM 20100: Project 3 — Pipelines
Motivation: In the previous project, at each stage in our analysis, we saved the output to a file. A more efficient method is to take the output from one command and use it as the input to the next command. This is called a pipeline of bash commands.
Context: Once we learn how to write bash commands in a pipeline, we can (more easily) use several bash commands in tandem.
Scope: Pipelines in Bash
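As a quick illustration of the idea (not one of the graded questions): the `|` symbol sends the output of one command directly into the next command, with no intermediate file. For example, the short pipeline `ls /anvil/projects/tdm/data/flights/subset/ | wc -l` lists the files in the flights subset directory and then immediately counts how many lines (i.e., how many files) `ls` printed.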
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/flights/subset/ (airplane data)
- /anvil/projects/tdm/data/election (election data)
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)
Questions
Question 1 (2 pts)
In the previous project, on this collection of files:
/anvil/projects/tdm/data/flights/subset/[12]*.csv
we ran the following commands in bash:
`cat` (to print the files; we did this in Project 2, Question 1)
`cut -d, -f17,18` (to extract the 17th and 18th fields, for the Origin and Destination columns)
`sort` (to get all of the same flight paths next to each other in the file)
`uniq -c` (to discover how many times each flight path occurs)
`sort -n` (to numerically sort the counts of the flight paths)
`tail` (to get the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths)
Now we can do all of this together, in one long line:
`cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | cut -d, -f17,18 | sort | uniq -c | sort -n | tail`
(To simplify things, we are not removing the head of each file.)
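(If you did want to exclude those header rows, one optional approach, offered here only as a hedged sketch, is to add a `grep` step that drops them before the `cut`: `grep -hv '^Year,' /anvil/projects/tdm/data/flights/subset/[12]*.csv | cut -d, -f17,18 | sort | uniq -c | sort -n | tail`. This assumes each file's header line begins with `Year,`, which you should verify yourself, for instance with `head -1` on one of the files.)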
Please use 3 or 4 cores when working on this question.
- Show the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths.
Question 2 (2 pts)
In the previous project, from this file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
we discovered how many times each of the `STORE_R` values appears in the file, using the following commands in bash:
`cut -d, -f7` (to extract the `STORE_R` values from this file)
`sort` (to get all of the same `STORE_R` values next to each other in the file)
`uniq -c` (to discover how many times each `STORE_R` value occurs)
`sort -n` (to numerically sort the counts of the `STORE_R` values)
Now we can do all of this together, in one long line:
`cut -d, -f7 /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv | sort | uniq -c | sort -n`
- List the number of times each of the `STORE_R` values appears in the file.
Question 3 (2 pts)
Use a pipeline to discover the 10 states in which the largest number of donations have been made (and the number of donations from each of these states), using the data stored in:
/anvil/projects/tdm/data/election/itcont*.txt
The data can be extracted from the 10th field of the files. The symbol `|` is the delimiter, so the `cut` command should look like: `cut -d'|' -f10`
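As a non-authoritative sketch that you should verify and adapt yourself, the pipeline can follow the same pattern as in Question 1, with only the `cut` step changed: `cut -d'|' -f10 /anvil/projects/tdm/data/election/itcont*.txt | sort | uniq -c | sort -n | tail`. As before, `tail` keeps the last 10 lines, which (after the numeric sort) are the 10 largest counts.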
- The 10 states in which the largest number of donations have been made (and the number of donations from each of these states).
Question 4 (2 pts)
Modify your solution to Question 3 so that you extract both the city and the state (simultaneously) for each donation. In this way, you can discover the 10 city-and-state pairs in which the largest number of donations have been made (and the number of donations from each of these city-and-state pairs).
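As a hedged sketch of just the change from Question 3: if the city is stored in the field immediately before the state (field 9 in these files, an assumption you should verify by inspecting a few lines with `head` or by consulting the dataset documentation), then asking `cut` for both fields at once would look like: `cut -d'|' -f9,10 /anvil/projects/tdm/data/election/itcont*.txt | sort | uniq -c | sort -n | tail`. The rest of the pipeline is unchanged, because `sort`, `uniq -c`, and `tail` operate on whole lines regardless of how many fields each line contains.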
- The 10 city-and-state pairs in which the largest number of donations have been made (and the number of donations from each of these city-and-state pairs).
Question 5 (2 pts)
Return to the analysis of the airline data. Modify your solution to Question 1 so that, instead of extracting the Origin and Destination airports, you extract three columns: the year, the month, and the day of the flights. In this way, you can discover the 10 most popular days to fly from 1987 to 2008, i.e., the 10 dates on which the most flights occurred (and the number of flights on each of those 10 dates).
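As a rough sketch, under the assumption that the year, month, and day are the first three comma-separated fields of the flight files (worth verifying with `head -1` on one of the files), the modification is again confined to the `cut` step: `cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | cut -d, -f1,2,3 | sort | uniq -c | sort -n | tail`.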
- The 10 dates on which the most flights occurred (and the number of flights on each of those 10 dates).
Submitting your Work
Congratulations! With your understanding of pipelines, you are ready to leverage the strength of many bash commands in sequence. Please feel encouraged to refer back to previous projects and to ask questions (anytime) about how you can use bash for powerful and easy data analysis!
- firstname-lastname-project3.ipynb
You must double check your submitted .ipynb file. You will not receive full credit if your .ipynb file does not contain all of your work and output.