TDM 20100: Project 4 — Pattern matching in pipelines
Motivation: We have begun to learn how to build pipelines in bash. Now we will integrate pattern matching into bash pipelines.
Context: Pattern matching, used as part of a pipeline, is a very powerful technique.
Scope: Pipelines and pattern matching in Bash
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/flights/subset/
(airplane data) -
/anvil/projects/tdm/data/election
(election data) -
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
(grocery store data) -
/anvil/projects/tdm/data/beer/breweries.csv
(breweries data) -
/anvil/projects/tdm/data/icecream/combined/reviews.csv
(ice cream review data)
Questions
Question 1 (2 pts)
In Project 1, question 4, we learned how to find lines of the file that contain the pattern "IND" (these are the flights that have origin or destination at the Indianapolis airport).
In Project 3, question 1, we extracted all of the origin and destination airports, using one pipeline of bash commands.
Revisit Project 3, question 1, but this time, use grep IND
early in your bash pipeline, so that your results only show flights that either have origin or destination at Indianapolis. The goal is to show the 10 most popular flight paths to-or-from Indianapolis (during the years 1987 to 2008), and the number of times that airplanes flew on each of these flight paths.
You want the top 10 most popular flight paths that are either to Indianapolis or from Indianapolis. The top 2 most popular flight paths that are either to Indianapolis or from Indianapolis (which should be the last two lines of your ten lines of output) are:
76554 ORD,IND
77720 IND,ORD
In this question, you will find all 10 such flight paths. Besides ORD (which is Chicago O’Hare), the other popular flight paths include these airports: ATL,DFW,DTW,MSP,STL.
-
Show the 10 most popular flight paths to-or-from Indianapolis (during the years 1987 to 2008), and the number of times that airplanes flew on each of these flight paths. (We already gave you 2 of these 10 in the hint!)
Question 2 (2 pts)
Revisit Project 3, Question 2, to study the grocery store data, from this file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
This time, limit your analysis to only the EAST
stores, using the grep
command. Use the cut
command to extract the PURCHASE_
date, and see how many times each PURCHASE_
date occurs. What are the 10 most "popular days" for shopping at the EAST
stores? "Popular days" are measured by how many times that each date appears in the file. (You can ignore the SPEND
and UNITS
columns in this question; only pay attention to how many times that each date occur at the EAST
stores.) Be sure to list the number of times that each date appears.
The two most popular dates at the EAST
stores are:
8730 23-DEC-16
8889 23-DEC-17
-
What are the 10 most "popular days" for shopping at the
EAST
stores?
Question 3 (2 pts)
Consider the election files:
/anvil/projects/tdm/data/election/itcont*.txt
Using the cut
command with '|' as the delimiter, cut out the 8th field, which are the names of the donors. (Be sure to cautiously use head
whenever you look at the output from a pipeline, so that you do not print millions of rows of output.)
Now add to the pipeline another cut
command with a comma as the delimiter, cut out the 1st field, which will be the family name of the donors. (Again, be sure to carefully use head
whenever you look at pipeline output.)
Finally, finish the pipeline, to get the 10 most popular last names of donors (be sure to print how many times each of these 10 most popular last names occur).
You do not need to use |
The two most popular last names of donors are:
1225266 JOHNSON
1654933 SMITH
-
The 10 most popular last names of donors (and the number of times each of these 10 most popular last names occurs).
Question 4 (2 pts)
In Project 3, Question 4, you extracted the city and state from the donation data. In this question, instead of studying the election data, consider (instead) the breweries data in this file:
/anvil/projects/tdm/data/beer/breweries.csv
Discover the 10 city-and-state pairs in which the largest number of breweries are located (and the number of breweries in each of these city-and-state pairs).
You do not need to use |
The two most popular city-and-state pairs for breweries are:
499 Philadelphia,PA
512 Chicago,IL
-
The 10 city-and-state pairs in which the largest number of breweries are located (and the number of breweries in each of these city-and-state pairs).
Question 5 (2 pts)
In this ice cream reviews file:
/anvil/projects/tdm/data/icecream/combined/reviews.csv
We can use grep salty
to see that 335 lines of this file have the word "salty". By default, grep
pays attention to the case, so (for instance) these 335 occurrences do not include "Salty" or "SALTY" or "saLTy". If we want to search without paying attention to the case, we will get more occurrences of the pattern. In this case, grep -i salty
allows us to see that 350 lines of this file have the word "salty" when we do not pay attention to the case. The "-i" stands for a case-insensitive search.
Similarly, there are 1972 lines that include the exact pattern "sweet", but if use a case-insensitive search, there are 2080 lines that include the pattern "sweet" without paying attention to case.
How many lines of the file include the exact pattern "chocolate"?
How many lines of the file include the pattern "chocolate" as a case-insensitive search, in other words, without paying attention to the case?
-
The number of lines of the file that include the exact pattern "chocolate".
-
The number of lines of the file that include the pattern "chocolate" as a case-insensitive search, in other words, without paying attention to the case.
Submitting your Work
You now have some experience using pattern matching inside pipelines of bash commands! Your skills from one project to the next are growing! Please refer back to previous projects, and ask questions anytime that you need advice or help!
-
firstname-lastname-project4.ipynb
You must double check your You will not receive full credit if your |