TDM 20100: Project 7 — Awk

Motivation: Now we will learn about associative arrays in awk, which allow you to (for instance) add entries in one column, grouped according to the entries in another column.

Context: We learn how to use associative arrays in awk.

Scope: associative arrays in awk

Learning Objectives:
  • Use associative arrays in awk

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt (Iowa liquor sales data)

  • /anvil/projects/tdm/data/election/itcont1980.txt (election data)

  • /anvil/projects/tdm/data/beer/reviews_sample.csv (beer reviews data)

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)

  • /anvil/projects/tdm/data/flights/subset/ (airplane data)

Example 1:

Example 2:

Example 3:

Questions

Question 1 (2 pts)

In Project 6, Question 1, we added all of the sales amounts in dollars from column 22 of this file of liquor sales in Iowa:

/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Now, we can (instead) use associative arrays in awk to add the sales amounts in dollars for each store (from column 4): Grouping the sales amounts according to the store, add the sale amounts for each store. For your output, on each line, print the dollar amount for the sales of each store, and the store too. Sort the output in numerical order. It is OK to just print the tail of the result, so that you will print 10 lines of output.

The bottom 4 lines are given below.

Because these numbers are really large, instead of using sort -n you need to use sort -g which can handle numbers in scientific notation. These numbers are tens and hundreds of millions.

4.13889e+07 at SAM'S CLUB 8162 / CEDAR RAPIDS
4.87686e+07 at HY-VEE WINE AND SPIRITS / IOWA CITY
1.08328e+08 at CENTRAL CITY 2
1.23313e+08 at HY-VEE #3 / BDI / DES MOINES
Deliverables
  • Print the 10 largest sales amounts according to the stores. (The 4 largest are shown above.)

Question 2 (2 pts)

Back in Project 6, Question 2, we added all of the donation dollar amounts from the 1980 election data:

/anvil/projects/tdm/data/election/itcont1980.txt

Now, instead, let’s add the donation dollar amounts in each state: Grouping the transaction amounts according to the state, add the donation dollar amounts for each state. For your output, on each line, print the dollar amount and the state, and sort the output in numerical order. It is OK to just print the tail of the result, so that you will print 10 lines of output.

The bottom 4 lines are given below.

17916669
18673301 NY
24085171 CA
24472610 TX

The line 17916669 without a state corresponds to the sum of the donation amounts where the state was blank!

Deliverables
  • Print the 10 largest total dollar amounts of donations, and the analogous states where those donations were made. (The 4th largest one is a blank state and that is OK.)

Question 3 (2 pts)

In this sample file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv

Consider the mean beer scores on each date.

Find the three dates on which the mean score is a 5.

This week, we are using only the sample file: /anvil/projects/tdm/data/beer/reviews_sample.csv

You DO NOT need to use the (much larger) file from last week: /anvil/projects/tdm/data/beer/reviews.csv

A mean score of "5" is a perfect score, so you can add up the scores on each date, and the number of reviews on each date, and print the ratio of the sum of scores and the number of reviews, and then sort the results numerically, and print only the tail.

The date is in the 3rd field and the score is the last field on each line, i.e., the score is stored in $NF.

Deliverables
  • In the reviews sample file, show the three dates on which the mean score is a 5.

Question 4 (2 pts)

Consider the data in the file /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Solve the same question from Project 6, Question 4, again, but this time use associative arrays. By using associative arrays, you can solve this question with just 1 line of awk. You should just use awk one time (not 4 times). By using associative arrays, you can add the total amounts of the values in the SPEND column, grouping the values according to the STORE_R column, and print the results for all 4 regions using awk just one time.

Deliverables
  • Print the sum of the SPEND column values corresponding to each of the four store regions. Use awk only one time (by using associative arrays).

Question 5 (2 pts)

Find the average DepDelay from each Origin airport in 1990, i.e., using the data in the file /anvil/projects/tdm/data/flights/subset/1990.csv

You do not need to print the output for all of the Origin airports. Instead, it is OK to include:

grep 'EWR\|JFK\|LGA'

at the end of your pipeline, so that you are only displaying the average departure delays for the three huge Origin airports in New York City.

Deliverables
  • Print the average departure delays for the three biggest airports located in New York City, namely, the average departure delay from EWR, the average departure delay from JFK, and the average departure delay from LGA.

Submitting your Work

Please let us know (anytime!) if you need help as you are learning about associative arrays in awk.

Items to submit
  • firstname-lastname-project7.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.