TDM 20100: Project 4 — Pattern matching in pipelines

Motivation: We have begun to learn how to build pipelines in bash. Now we will integrate pattern matching into bash pipelines.

Context: Pattern matching, used as part of a pipeline, is a very powerful technique.

Scope: Pipelines and pattern matching in Bash

Learning Objectives:

Learn about using pattern matching in bash pipelines.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

/anvil/projects/tdm/data/flights/subset/ (airplane data)
/anvil/projects/tdm/data/election (election data)
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)
/anvil/projects/tdm/data/beer/breweries.csv (breweries data)
/anvil/projects/tdm/data/icecream/combined/reviews.csv (ice cream review data)

Questions

Question 1 (2 pts)

In Project 1, question 4, we learned how to find lines of the file that contain the pattern "IND" (these are the flights that have origin or destination at the Indianapolis airport).

In Project 3, question 1, we extracted all of the origin and destination airports, using one pipeline of bash commands.

Revisit Project 3, question 1, but this time, use grep IND early in your bash pipeline, so that your results show only flights whose origin or destination is Indianapolis. The goal is to show the 10 most popular flight paths to-or-from Indianapolis (during the years 1987 to 2008), and the number of times that airplanes flew on each of these flight paths.

You want the top 10 most popular flight paths that are either to Indianapolis or from Indianapolis. The top 2 most popular flight paths that are either to Indianapolis or from Indianapolis (which should be the last two lines of your ten lines of output) are:

  76554 ORD,IND
  77720 IND,ORD

In this question, you will find all 10 such flight paths. Besides ORD (which is Chicago O’Hare), the other popular flight paths include these airports: ATL,DFW,DTW,MSP,STL.
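The shape of such a pipeline can be sketched on a tiny made-up file. This is only an illustration under stated assumptions: the three-column layout below is hypothetical, and in the real flight files the origin and destination sit in different field numbers, which you will need to look up yourself.

```bash
# Hypothetical miniature data set (NOT the real flight file layout).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
Year,Origin,Dest
1990,IND,ORD
1990,ORD,IND
1991,IND,ORD
1991,ATL,DFW
1992,IND,ATL
1992,IND,ORD
1993,ORD,IND
EOF

# 1. grep IND      keep only lines that contain the pattern IND
# 2. cut -d, -f2,3 extract the two airport fields (field numbers are assumptions)
# 3. sort | uniq -c   count how many times each pair occurs
# 4. sort -n | tail   order by count and keep the largest ones
result=$(grep IND "$tmpfile" | cut -d, -f2,3 | sort | uniq -c | sort -n | tail -n 2)
echo "$result"

rm -f "$tmpfile"
```

Placing grep first means the later commands only process the lines that matched, and tail leaves the most frequent pairs at the bottom of the output, in the same order as the two lines shown above.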

Deliverables

Show the 10 most popular flight paths to-or-from Indianapolis (during the years 1987 to 2008), and the number of times that airplanes flew on each of these flight paths. (We already gave you 2 of these 10 in the hint!)

Question 2 (2 pts)

Revisit Project 3, Question 2, to study the grocery store data, from this file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

This time, limit your analysis to only the EAST stores, using the grep command. Use the cut command to extract the PURCHASE_ date, and see how many times each PURCHASE_ date occurs. What are the 10 most "popular days" for shopping at the EAST stores? "Popular days" are measured by how many times each date appears in the file. (You can ignore the SPEND and UNITS columns in this question; only pay attention to how many times each date occurs at the EAST stores.) Be sure to list the number of times that each date appears.

The two most popular dates at the EAST stores are:

   8730 23-DEC-16
   8889 23-DEC-17
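Before cutting a column, it helps to know its field number. The sketch below shows one way to list the header fields with numbers and then filter and count, using a small made-up file. The column names ID and REGION and the field positions are assumptions for illustration only; check the header of the real file for the actual layout.

```bash
# Hypothetical sample; the real file has more columns in a different order.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
ID,PURCHASE_,SPEND,UNITS,REGION
1,23-DEC-17,4.50,1,EAST
2,23-DEC-17,2.00,1,WEST
3,24-DEC-17,3.25,2,EAST
4,23-DEC-17,1.10,1,EAST
EOF

# Print the header one field per line, numbered, to find the date column.
header=$(head -n 1 "$tmpfile" | tr ',' '\n' | cat -n)
echo "$header"

# Keep lines containing EAST, extract the date (field 2 in this sample only),
# then count each date and show the most frequent dates last.
dates=$(grep EAST "$tmpfile" | cut -d, -f2 | sort | uniq -c | sort -n | tail -n 10)
echo "$dates"

rm -f "$tmpfile"
```

Note that grep EAST keeps any line containing the text EAST anywhere on the line, not only in one particular column.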

Deliverables

What are the 10 most "popular days" for shopping at the EAST stores?

Question 3 (2 pts)

Consider the election files:

/anvil/projects/tdm/data/election/itcont*.txt

Using the cut command with '|' as the delimiter, cut out the 8th field, which contains the names of the donors. (Be sure to cautiously use head whenever you look at the output from a pipeline, so that you do not print millions of rows of output.)

Now add another cut command to the pipeline, with a comma as the delimiter, to cut out the 1st field, which will be the family name of the donors. (Again, be sure to carefully use head whenever you look at pipeline output.)

Finally, finish the pipeline, to get the 10 most popular last names of donors (be sure to print how many times each of these 10 most popular last names occurs).

You do not need to use grep on this question.

The two most popular last names of donors are:

1225266 JOHNSON
1654933 SMITH
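Chaining two cut commands with different delimiters can be sketched as follows. The sample records are invented, and the name is placed in field 3 here purely for brevity; in the election files described above it is the 8th field.

```bash
# Hypothetical pipe-delimited sample (names in field 3 for this sketch only).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
C001|N|SMITH, JOHN|CHICAGO
C002|N|JOHNSON, MARY|DENVER
C003|N|SMITH, ANNA|BOSTON
C004|N|LEE, KIM|AUSTIN
C005|N|SMITH, JOHN|CHICAGO
C006|N|JOHNSON, PAUL|MIAMI
EOF

# Inspect intermediate output with head so only a few rows are printed.
cut -d'|' -f3 "$tmpfile" | head -n 3

# First cut splits on '|' to get the full name; the second splits that
# name on ',' and keeps the part before the comma (the family name).
names=$(cut -d'|' -f3 "$tmpfile" | cut -d, -f1 | sort | uniq -c | sort -n | tail -n 2)
echo "$names"

rm -f "$tmpfile"
```

The '|' delimiter is quoted because an unquoted | would be read by the shell as a pipe.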

Deliverables

The 10 most popular last names of donors (and the number of times each of these 10 most popular last names occurs).

Question 4 (2 pts)

In Project 3, Question 4, you extracted the city and state from the donation data. In this question, instead of studying the election data, consider the breweries data in this file:

/anvil/projects/tdm/data/beer/breweries.csv

Discover the 10 city-and-state pairs in which the largest number of breweries are located (and the number of breweries in each of these city-and-state pairs).

You do not need to use grep on this question.

The two most popular city-and-state pairs for breweries are:

    499 Philadelphia,PA
    512 Chicago,IL

Deliverables

The 10 city-and-state pairs in which the largest number of breweries are located (and the number of breweries in each of these city-and-state pairs).

Question 5 (2 pts)

In this ice cream reviews file:

/anvil/projects/tdm/data/icecream/combined/reviews.csv

We can use grep salty to see that 335 lines of this file have the word "salty". By default, grep pays attention to the case, so (for instance) these 335 occurrences do not include "Salty" or "SALTY" or "saLTy". If we want to search without paying attention to the case, we will get more occurrences of the pattern. In this case, grep -i salty allows us to see that 350 lines of this file have the word "salty" when we do not pay attention to the case. The "-i" stands for a case-insensitive search.

Similarly, there are 1972 lines that include the exact pattern "sweet", but if we use a case-insensitive search, there are 2080 lines that include the pattern "sweet" without paying attention to case.
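The difference between the two searches can be seen on a tiny invented file. This is a sketch only; the five review lines below are made up and are not taken from the ice cream data. It uses grep -c, which prints the number of matching lines, as one possible way to count (piping grep into wc -l is another).

```bash
# Hypothetical five-line sample file.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
a salty caramel swirl
Salty and sweet
too SALTY for me
plain vanilla
salty, salty, salty
EOF

# Case-sensitive: only lines containing lowercase "salty" are counted.
exact=$(grep -c salty "$tmpfile")

# Case-insensitive: "Salty" and "SALTY" are counted as well.
anycase=$(grep -ci salty "$tmpfile")

# Same count as the first search, written as a pipeline.
piped=$(grep salty "$tmpfile" | wc -l)

echo "exact: $exact"
echo "case-insensitive: $anycase"

rm -f "$tmpfile"
```

Both approaches count lines rather than occurrences, so the last sample line, which contains "salty" three times, is counted once.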

How many lines of the file include the exact pattern "chocolate"?

How many lines of the file include the pattern "chocolate" as a case-insensitive search, in other words, without paying attention to the case?

Deliverables

The number of lines of the file that include the exact pattern "chocolate".
The number of lines of the file that include the pattern "chocolate" as a case-insensitive search, in other words, without paying attention to the case.

Submitting your Work

You now have some experience using pattern matching inside pipelines of bash commands! Your skills from one project to the next are growing! Please refer back to previous projects, and ask questions anytime that you need advice or help!

Items to submit

firstname-lastname-project4.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.