TDM 20100: Project 6 — Awk
Motivation: As you’ve seen so far, bash
has a wide variety of commands that enable us to do different things, and we can use pipes to connect those commands and perform whole loads of data processing in one big step. However, conciseness is a virtue. In this project we’ll start learning about awk
. By the end of the next few weeks, you will be able to do entire pipelines worth of work with just one awk
!
Context: This project will relate awk
concepts back to previously learned commands, and at the very least a basic knowledge of filesystem navigation and regex will be needed.
Scope: awk
, data processing, Bash, GNU, CLI
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
(Iowa liquor sales data) -
/anvil/projects/tdm/data/election
(election data) -
/anvil/projects/tdm/data/beer/reviews.csv
(beer reviews data) -
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
(grocery store data) -
/anvil/projects/tdm/data/flights/subset/
(airplane data)
Example 1:
Example 2:
Example 3:
Example 4:
Example 5:
Example 6:
Questions
Question 1 (2 pts)
To begin learning about awk
, it is important to conceptualize the two fundamental units that awk
operates on: records and fields.
You can think of a record as one row of data in a file.
A field can be thought of as a singular element of data (within a row of data).
Simple awk files that run on comma-separated data look this this:
awk -F, 'BEGIN{action to run before processing the data} {action to perform on each row of data} END{action to run after processing the data}' mydatafile.txt
but many awk programs do not have the beginning or ending section, i.e., they just look like this:
awk -F, '{action to perform on each row of data}' mydatafile.txt
The action to run before processing the data only runs once, before awk
ever touches the data. It is helpful, for example, if you want to print a header before running your output. (The BEGIN section is often omitted.)
The action to run after processing the data will print after the data is all processed. It is helpful if you want to print out some calculations that you ran on the data set. (The END section is often omitted.)
The main action of an awk program will run on each and every line of the data.
If the data is not comma separated, but (instead) is tab-separated, then we use -F\t
instead of -F,
(as an example). Or if the data has |
between the pieces of data, then we use -F'|'
instead. Or if the data has ;
between the pieces of data, then we use -F';'
instead.
The Sales values (given in Dollars) are available in field 22 of this data set:
head /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f22
Use awk
(without using cut
!) to add the total Sales values from the entire file.
The total Sales values are almost 4 billion dollars. The |
This question might take 2 or 3 minutes to run. |
-
Print the total Sales values from the entire file.
Question 2 (2 pts)
2a. Very similarly to question 1, use awk
(without using cut
!) to sum the total dollar amounts of the donations (altogether) given in column 15 of this file: /anvil/projects/tdm/data/election/itcont1980.txt
2b. Now sum the total dollar amounts of the donations (altogether) given in column 15 of all of the itcont*.txt
files (altogether).
2a. The total for your answer should be a little more than 200 million dollars. 2b. The total for your answer should be a little more than 62 billion dollars. |
Question 2b might take 30 to 60 minutes to run. |
-
a. Print the sum of the total dollar amounts of the donations in the 1980 election data.
-
b. Print the sum of the total dollar amounts of the donations in all of the election data files of the form
itcont*.txt
.
Question 3 (2 pts)
Consider the data in the file /anvil/projects/tdm/data/beer/reviews.csv
Notice that the number of columns on each line varies, because each line has a varied number of commas. Also note that the number of fields on each line is NF
and therefore the last field on each line is $NF
. Use this information to add all of the values in the score
column, in a variable called totalscores
. Also, at the same time, add the number of lines, in a variable called totallines
. Finally, at the end, print the ratio of totalscores
and totallines
, so that we have the overall average score across the entire data set.
-
Print the overall average score across the entire data set.
Question 4 (2 pts)
Consider the data in the file /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Use grep SOUTH
and also awk
(not cut
) to sum the total amount of values in the SPEND
column (corresponding to lines with SOUTH
for the STORE_R
value). Then do this again (in a separate bash pipeline) for the EAST
stores, and then do it again (in a third bash pipeline) for the WEST
stores, and finally in a fourth bash pipeline for the CENTRAL
stores.
In the future, we will learn how to do all of this with one line of |
-
Print the sum of the
SPEND
column values corresponding to each of the four store regions. This will take four separate bash pipelines, one Jupyter Lab cell each.
Question 5 (2 pts)
Consider the data in the file /anvil/projects/tdm/data/flights/subset/1990.csv
Use awk
for formatted output, like this:
awk -F, '{print "flights from "$17" to "$18;}'
incorporated into a pipeline (with sort | uniq -c | sort -n | tail
) from the previous projects, to find the 10 most popular flight paths in 1990 and the number of flights on those paths. Hint: The top two flight paths should be:
25779 flights from LAX to SFO
26134 flights from SFO to LAX
-
Print the 10 most popular flight paths in 1990 and the number of flights on those paths, with the nice formatting described above.
Submitting your Work
We are just starting to get familiar with awk
so please feel welcome to ask for clarifications and help anytime. This is a powerful tool that will enable you to (pre-)process data and to analyze data very, very quickly. It is also a wonderful tool to incorporate in bash
pipelines.
-
firstname-lastname-project6.ipynb
You must double check your You will not receive full credit if your |