TDM 40100: Project 01 - Intro to ML - Using Anvil

Project Objectives

We remind ourselves how to use the Anvil platform and how to run Python code in Jupyter Lab. We also remind ourselves how to use the pandas library. This project is intended to be a light start to the fall semester.

Learning Objectives
  • Create and use Anvil sessions

  • Create Jupyter notebooks

  • Load a dataset with pandas

  • Perform basic data manipulation with pandas

Dataset

This project will use the following dataset:

  • /anvil/projects/tdm/data/iris/Iris.csv

Questions

Question 1 (2 points)

Let’s start out by starting a new Anvil session. If you do not remember how to do this, please read through Project 1 at the introductory TDM 10100 level.

Once you have started a new Anvil session, download the project template and upload it to Anvil. Then, open this template in Jupyter Lab. Save it as a new file with the following naming convention: lastname_firstname_project#.ipynb. For example, doe_jane_project1.ipynb.

You may be prompted to select a kernel when opening the notebook. We will use the seminar kernel (not the seminar-r kernel) for TDM 40100 projects. You are able to change the kernel by clicking on the kernel dropdown menu and selecting the appropriate kernel if needed.

To make sure everything is working, run the following code cell:

print("Hello, world!")

Your output should be Hello, world!. If you see this, you are ready to move on to the next question.

Although Question 1 is trivially easy, we still want you to (please) get into the habit of commenting on the work in each question. It would be helpful to write (in a separate cell) something like, "We are reminding ourselves how to use Anvil and how to print a line of output."

Deliverables
  • Output of running the code cell

  • Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 points)

Now that we have our Jupyter Lab notebook set up, let’s begin working with the pandas library.

Pandas is a Python library that allows us to work with datasets in tabular form. There are functions for loading datasets, manipulating data, etc.

To start out with, let’s load the Iris dataset that is located at /anvil/projects/tdm/data/iris/Iris.csv.

To do this, you will need to import the pandas library and use the read_csv function to load the dataset.

Run the following code cell to load the dataset:

import pandas as pd

myDF = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv')

In the provided code, pandas is imported as pd for brevity. This is a common convention in the Python community. Similarly, myDF (short for "my dataframe") is often used as a variable for pandas dataframes. It is not required for you to follow either of these conventions, but it is good practice to do so.

Now that our dataset is loaded, let’s take a look at the first 5 rows of the dataset. To do this, run the following code cell:

myDF.head()

The head function is used to display the first n rows of the dataset. By default, n is set to 5. You can change this by passing an integer to the function. For example, myDF.head(10) will display the first 10 rows of the dataset. This function is useful for quickly inspecting the dataframe to see what the data looks like.
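For instance, running the following cell displays the first 10 rows, exactly as described above:

# display the first 10 rows instead of the default 5
myDF.head(10)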

Deliverables
  • Output of running the code cell

  • Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 points)

An important aspect of our dataframe for machine learning is the shape (rows, columns). As you will learn later, the shape will help us determine what kind of machine learning model will be the best fit, as well as how complex it may be.

To get the shape of the dataframe, run the following code cell:

myDF.shape

The shape attribute returns a tuple in the form (rows, columns). There are other ways to get these counts: len(myDF.index) gives the number of rows, and len(myDF.columns) gives the number of columns in a DataFrame. The shape attribute is commonly preferred because it is more concise and returns both counts in a single call.
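For example, the tuple can be unpacked into separate variables:

# shape returns (rows, columns), which we can unpack directly
num_rows, num_cols = myDF.shape
print(num_rows, num_cols)

# equivalent, but more verbose
print(len(myDF.index), len(myDF.columns))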

Deliverables
  • How many rows are in the dataframe?

  • How many columns are in the dataframe?

  • Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 points)

Now that we have loaded the dataset, let’s investigate how we can manipulate the data.

One common operation is to select a subset of the data. This is done using the iloc function, which allows us to index the dataframe by row and column numbers.

The iloc function is extremely powerful. It can be used in way too many ways to list here. For a more comprehensive list of how to use iloc, please refer to the official pandas iloc documentation.

To select the first n rows of the dataframe, we can use the iloc function with a slice: myDF.iloc[:n].

Write code to select the first 10 rows of the dataframe from Question 3 into a new dataframe called myDF_subset. Print the shape of myDF_subset to verify that you have selected the correct number of rows.

We can also use the iloc function to select specific columns. To select specific columns, we can also use a slice; however, we must specify the rows we want first. To select all rows, we simply pass a colon :. For example, to select the first 10 rows and the first 3 columns, we could use the following code: myDF.iloc[:10, :3]. To select columns that are not adjacent, we can pass a list of column positions instead of a slice, as shown in the sketch below.
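Here is a minimal sketch of selecting non-adjacent columns by position (the positions 0 and 2 are purely illustrative, not the answer to the question below):

# select the first 10 rows and the columns at positions 0 and 2
# remember that iloc positions are 0-based, and slice ends are exclusive
myDF.iloc[:10, [0, 2]]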

Write code to select the 40th through 50th rows (inclusive) and the 2nd and 4th columns of the dataframe from Question 3 into a new dataframe called myDF_subset2. Print the shape of myDF_subset2 to verify that you have selected the correct number of rows and columns.

The loc function can also be used to filter rows based on a condition. For example, if we wanted all rows where the PetalWidthCm is greater than 1.5, we could use the following code: myDF.loc[myDF['PetalWidthCm'] > 1.5, :].

Write code to select all rows where SepalLengthCm is less than 5.0 into a new dataframe called myDF_subset3. How many rows are in this dataframe?

Deliverables
  • Output of printing the shape of myDF_subset

  • Output of printing the shape of myDF_subset2

  • How many rows are in the myDF_subset3 dataframe?

  • Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 points)

Another common operation is to remove column(s) from the dataframe. This is done using the drop function.

Similarly to the iloc function, the drop function is extremely powerful. For a more comprehensive list of how to use drop, please refer to the official pandas drop documentation.

The most readable way to drop a column is by dropping it by name. To drop column(s) by name, you can use the following syntax: myDF.drop(['column1_name', 'column2_name', …], axis=1). The axis=1 argument tells pandas to drop columns, not rows.
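As a quick illustration (using a measurement column here, not the column the question below asks about):

# drop returns a new dataframe; myDF itself is unchanged
smaller_df = myDF.drop(['SepalLengthCm'], axis=1)
print(smaller_df.shape)  # one fewer column than myDF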

Write code to drop the Id column from myDF_subset, storing the result in a new dataframe called myDF_without_id. Print the shape of the dataframe to verify that the column has been removed.

Additionally, we can extract columns from a dataframe into a new dataframe. Extracting a column is very simple: myDF['column_name'] will return a pandas series containing the values of the column. To extract multiple columns, you can pass a list of column names: myDF[['column1_name', 'column2_name', …]]. To then store a series in a new dataframe, we can simply convert the series into a dataframe: pd.DataFrame(myDF['column_name']).
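For example, extracting a single measurement column (illustrative only, not the columns the question below asks about):

# single brackets return a pandas Series
petal_lengths = myDF['PetalLengthCm']

# converting the series produces a one-column dataframe
petal_df = pd.DataFrame(petal_lengths)
print(type(petal_lengths))
print(type(petal_df))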

Write code to extract the Species and SepalWidthCm columns from the myDF_without_id dataframe into a new dataframe called myDF_species. Print the shape of the dataframe to verify that the columns have been extracted. Print the first 5 rows of the dataframe to verify that the columns have been extracted correctly.

Deliverables
  • Output of printing the shape of the dataframe after dropping the Id column

  • Output of printing the first 5 rows of the dataframe after extracting the Species and SepalWidthCm columns

  • Be sure to document your work from Question 5, using some comments and insights about your work.

Question 6 (2 points)

We briefly touched on filtering rows based on a condition in Question 4. In this case, we simply filtered by one condition. It is fine to simply filter by one condition repeatedly until you have performed all the filtering you need. However, it is also possible to filter by multiple conditions in a single operation.

Pandas allows us to use logical operators to combine multiple conditions into a boolean expression. The logical operators are & for "and", | for "or", and ~ for "not". We can use these operators along with conditionals to filter rows based on multiple conditions.

We can store each condition in a variable like so:

# column, value1, and value2 are placeholders for a real column name and comparison values
condition1 = myDF[column] > value1
condition2 = myDF[column] == value2

Once we have these conditions, we can combine them with the loc function like so:

# both conditions must be true
myDF.loc[condition1 & condition2]
# condition1 must be false or condition2 must be true
myDF.loc[~condition1 | condition2]

Write code to filter the myDF dataframe to only include rows where SepalLengthCm is greater than 5.0, PetalWidthCm is less than 1.5, and Species is not Iris-setosa. Store the filtered dataframe in a new variable called myDF_filtered. How many rows meet these conditions?

Deliverables
  • How many rows meet the conditions?

  • Be sure to document your work from Question 6, using some comments and insights about your work.

Question 7 (2 points)

One of the most common operations in data analysis is to calculate summary statistics. This includes things like the mean, median, standard deviation, etc.

Pandas has many operations to calculate these statistics. To calculate the mean of a column, we can simply write myDF[column].mean(). Similarly, to calculate the median, we can write myDF[column].median(). For standard deviation, we can write myDF[column].std().
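For example, the three statistics for a single column across the whole dataframe (using a different column than the question below asks about):

# summary statistics for one column of the full dataframe
print(myDF['PetalLengthCm'].mean())
print(myDF['PetalLengthCm'].median())
print(myDF['PetalLengthCm'].std())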

Additionally, we can find all unique values in a column by using the unique function. myDF[column].unique() will return a list containing all unique values in the column.

It may be beneficial to convert the result of unique to a list, for example with list(myDF[column].unique()), to make it easier to work with.
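Here is a minimal sketch of one possible pattern for per-group statistics, combining unique with the filtering from Question 4 (the column and statistic shown are illustrative, not the full answer to the question below):

# loop over each unique species, filter to that group, and compute one statistic
for species in myDF['Species'].unique():
    group = myDF.loc[myDF['Species'] == species, :]
    print(species, group['PetalLengthCm'].mean())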

Write code to determine the mean, median, and standard deviation of the SepalLengthCm column in the myDF dataframe, for each unique species in the Species column. Please note that there are 3 unique species in the Species column, so you should have 3 sets of statistics.

Deliverables
  • Output of the mean, median, and standard deviation of the SepalLengthCm column for each unique species

  • Be sure to document your work from Question 7, using some comments and insights about your work.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • lastname_firstname_project1.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.