TDM 20100: Project 7 — Awk
Motivation: Now we will learn about associative arrays in awk, which allow you to (for instance) add entries in one column, grouped according to the entries in another column.
Context: We learn how to use associative arrays in awk.
Scope: associative arrays in awk
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
(Iowa liquor sales data) -
/anvil/projects/tdm/data/election/itcont1980.txt
(election data) -
/anvil/projects/tdm/data/beer/reviews_sample.csv
(beer reviews data) -
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
(grocery store data) -
/anvil/projects/tdm/data/flights/subset/
(airplane data)
Example 1:
Example 2:
Example 3:
Questions
Question 1 (2 pts)
In Project 6, Question 1, we added all of the sales amounts in dollars from column 22 of this file of liquor sales in Iowa:
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Now, we can (instead) use associative arrays in awk to add the sales amounts in dollars for each store (from column 4): Grouping the sales amounts according to the store, add the sale amounts for each store. For your output, on each line, print the dollar amount for the sales of each store, and the store too. Sort the output in numerical order. It is OK to just print the tail of the result, so that you will print 10 lines of output.
The bottom 4 lines are given below. |
Because these numbers are really large, instead of using |
4.13889e+07 at SAM'S CLUB 8162 / CEDAR RAPIDS
4.87686e+07 at HY-VEE WINE AND SPIRITS / IOWA CITY
1.08328e+08 at CENTRAL CITY 2
1.23313e+08 at HY-VEE #3 / BDI / DES MOINES
-
Print the 10 largest sales amounts according to the stores. (The 4 largest are shown above.)
Question 2 (2 pts)
Back in Project 6, Question 2, we added all of the donation dollar amounts from the 1980 election data:
/anvil/projects/tdm/data/election/itcont1980.txt
Now, instead, let’s add the donation dollar amounts in each state: Grouping the transaction amounts according to the state, add the donation dollar amounts for each state. For your output, on each line, print the dollar amount and the state, and sort the output in numerical order. It is OK to just print the tail of the result, so that you will print 10 lines of output.
The bottom 4 lines are given below. |
17916669
18673301 NY
24085171 CA
24472610 TX
The line 17916669 without a state corresponds to the sum of the donation amounts where the state was blank! |
-
Print the 10 largest total dollar amounts of donations, and the analogous states where those donations were made. (The 4th largest one is a blank state and that is OK.)
Question 3 (2 pts)
In this sample file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv
Consider the mean beer scores on each date.
Find the three dates on which the mean score is a 5.
This week, we are using only the sample file: You DO NOT need to use the (much larger) file from last week: |
A mean |
The date is in the 3rd field and the score is the last field on each line, i.e., the score is stored in |
-
In the reviews sample file, show the three dates on which the mean
score
is a 5.
Question 4 (2 pts)
Consider the data in the file /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Solve the same question from Project 6, Question 4, again, but this time use associative arrays. By using associative arrays, you can solve this question with just 1 line of awk. You should just use awk one time (not 4 times). By using associative arrays, you can add the total amounts of the values in the SPEND
column, grouping the values according to the STORE_R
column, and print the results for all 4 regions using awk just one time.
-
Print the sum of the
SPEND
column values corresponding to each of the four store regions. Useawk
only one time (by using associative arrays).
Question 5 (2 pts)
Find the average DepDelay
from each Origin
airport in 1990, i.e., using the data in the file /anvil/projects/tdm/data/flights/subset/1990.csv
You do not need to print the output for all of the Origin
airports. Instead, it is OK to include:
grep 'EWR\|JFK\|LGA'
at the end of your pipeline, so that you are only displaying the average departure delays for the three huge Origin
airports in New York City.
-
Print the average departure delays for the three biggest airports located in New York City, namely, the average departure delay from EWR, the average departure delay from JFK, and the average departure delay from LGA.
Submitting your Work
Please let us know (anytime!) if you need help as you are learning about associative arrays in awk.
-
firstname-lastname-project7.ipynb
You must double check your You will not receive full credit if your |