TDM 30100: Project 01 - Intro to ML - Using Anvil
Project Objectives
We remind ourselves how to use the Anvil platform and how to run Python code in Jupyter Lab. We also remind ourselves about using the Pandas library. This project is intended to be a light start to the fall semester.
Questions
Question 1 (2 points)
Let’s begin by starting a new Anvil session. If you do not remember how to do this, please read through Project 1 at the introductory TDM 10100 level.
Once you have started a new Anvil session, download the project template and upload it. Then, open this template in Jupyter notebook. Save it as a new file with the following naming convention: lastname_firstname_project#.ipynb. For example, doe_jane_project1.ipynb.
You may be prompted to select a kernel when opening the notebook. We will use the |
To make sure everything is working, run the following code cell:
print("Hello, world!")
Your output should be Hello, world!. If you see this, you are ready to move on to the next question.
Although question 1 is trivially easy, we still want you to (please) get into the habit of commenting on the work in each question. So (please) it would be helpful to write (in a separate cell) something like, "We are reminding ourselves how to use Anvil and how to print a line of output."
- Output of running the code cell
- Be sure to document your work from Question 1, using some comments and insights about your work.
Question 2 (2 points)
Now that we have our Jupyter Lab notebook set up, let’s begin working with the pandas library.
Pandas is a Python library that allows us to work with datasets in tabular form. There are functions for loading datasets, manipulating data, etc.
To start out with, let’s load the Iris dataset that is located at /anvil/projects/tdm/data/iris/Iris.csv.
To do this, you will need to import the pandas library and use the read_csv function to load the dataset.
Run the following code cell to load the dataset:
import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv')
In the provided code, pandas is imported as |
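If you want to try read_csv away from Anvil, the sketch below may help. It assumes a small made-up CSV string (the three rows and the name demo_df are hypothetical and are not taken from the Iris file) and relies on the fact that read_csv accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for a CSV file; these values are made up
# for illustration and are NOT taken from the Iris dataset on Anvil.
csv_text = """Id,SepalLengthCm,Species
1,5.1,Iris-setosa
2,4.9,Iris-setosa
3,7.0,Iris-versicolor
"""

# read_csv accepts a file path or a file-like object such as StringIO.
demo_df = pd.read_csv(io.StringIO(csv_text))

print(type(demo_df))             # the result is a pandas DataFrame
print(demo_df.columns.tolist())  # column names come from the header row
```

On Anvil itself, the path-based call shown in the project is the one to use; the in-memory version is only a convenience for experimenting.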
Now that our dataset is loaded, let’s take a look at the first 5 rows of the dataset. To do this, run the following code cell:
myDF.head()
The head function is used to display the first n rows of the dataset. By default, n is set to 5. You can change this by passing an integer to the function. For example, |
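As a minimal sketch of how the argument to head changes the result, consider the toy dataframe below. The name toy_df and its contents are hypothetical and stand in for any dataframe.

```python
import pandas as pd

# Hypothetical toy data with 8 rows; not the Iris dataset.
toy_df = pd.DataFrame({"x": range(8), "y": list("abcdefgh")})

default_rows = toy_df.head()   # no argument: the first 5 rows
three_rows = toy_df.head(3)    # explicit n: the first 3 rows

print(len(default_rows), len(three_rows))
```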
- Output of running the code cell
- Be sure to document your work from Question 2, using some comments and insights about your work.
Question 3 (2 points)
An important aspect of our dataframe for machine learning is the shape (rows, columns). As you will learn later, the shape will help us determine what kind of machine learning model will be the best fit, as well as how complex it may be.
To get the shape of the dataframe, run the following code cell:
myDF.shape
There are multiple ways to get the number of rows and columns in a DataFrame. |
This returns a tuple in the form (rows, columns).
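The sketch below shows a few equivalent ways to read off the row and column counts. It uses a hypothetical toy dataframe (toy_df), so the numbers are not the answers for the Iris data.

```python
import pandas as pd

# Hypothetical toy data: 4 rows and 3 columns.
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
    "c": list("wxyz"),
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple,
# so it can be unpacked directly into two variables.
n_rows, n_cols = toy_df.shape

print(toy_df.shape)         # (4, 3)
print(len(toy_df))          # len of a dataframe is its number of rows
print(len(toy_df.columns))  # len of the column index is the number of columns
```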
- How many rows are in the dataframe?
- How many columns are in the dataframe?
- Be sure to document your work from Question 3, using some comments and insights about your work.
Question 4 (2 points)
Now that we have loaded the dataset, let’s investigate how we can manipulate the data.
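One common operation is to select a subset of the data. This is done using the iloc indexer, which allows us to index the dataframe by row and column positions. Note that iloc is used with square brackets rather than parentheses, and that positions start at 0.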
The |
To select the first n rows of the dataframe, we can use iloc with a slice: myDF.iloc[:n].
Write code to select the first 10 rows of the dataframe from Question 3 into a new dataframe called myDF_subset. Print the shape of myDF_subset to verify that you have selected the correct number of rows.
We can also use iloc to select specific columns. Columns can be selected with a slice as well, but the rows must be specified first. To select all rows, we simply pass a colon (:) in the row position. For example, to select the first 10 rows and the first 3 columns, we could use the following code: myDF.iloc[:10, :3].
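The positional rules above can be sketched on a small example. The dataframe toy_df and the variable names below are hypothetical; the point is only that iloc positions start at 0, that a slice excludes its end position, and that a list of positions can pick out non-adjacent columns.

```python
import pandas as pd

# Hypothetical toy data: 12 rows and 4 columns.
toy_df = pd.DataFrame({
    "a": range(100, 112),
    "b": range(200, 212),
    "c": range(300, 312),
    "d": range(400, 412),
})

# Rows at positions 0, 1, 2, 3 (the first four rows), all columns.
first_four = toy_df.iloc[:4]

# Rows at positions 2..5 (the 3rd through 6th rows) and the first two columns.
block = toy_df.iloc[2:6, :2]

# Same rows, but columns chosen by a list of positions (1st and 4th columns).
picked = toy_df.iloc[2:6, [0, 3]]

print(first_four.shape, block.shape, picked.shape)
```

Because the end of a slice is excluded, "the 3rd through 6th rows, inclusive" corresponds to positions 2:6, not 3:6.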
Write code to select the 40th through 50th rows (inclusive) and the 2nd and 4th columns of the dataframe from Question 3 into a new dataframe called myDF_subset2. Print the shape of myDF_subset2 to verify that you have selected the correct number of rows and columns.
The related loc indexer can be used to filter rows based on a condition. For example, if we wanted all rows where the PetalWidthCm is greater than 1.5, we could use the following code: myDF.loc[myDF['PetalWidthCm'] > 1.5, :].
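To see what happens inside such a filter, here is a minimal sketch on hypothetical toy data (toy_df, with made-up columns height and label). The comparison produces a boolean Series, and loc keeps only the rows where that Series is True.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy_df = pd.DataFrame({
    "height": [1.2, 3.4, 0.8, 2.9, 3.4],
    "label": list("abcde"),
})

# The comparison is evaluated row by row and yields a boolean Series.
mask = toy_df["height"] > 2.0

# loc keeps the rows where the mask is True; ":" keeps every column.
tall = toy_df.loc[mask, :]

# Indexing the dataframe directly with the mask is an equivalent shorthand.
tall_short = toy_df[mask]

print(mask.tolist())
print(tall.shape)
```

The first element of the resulting shape is the number of rows that satisfied the condition.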
Write code to select all rows where SepalLengthCm is less than 5.0 into a new dataframe called myDF_subset3. How many rows are in this dataframe?
- Output of printing the shape of myDF_subset
- Output of printing the shape of myDF_subset2
- How many rows are in the myDF_subset3 dataframe?
- Be sure to document your work from Question 4, using some comments and insights about your work.
Question 5 (2 points)
Another common operation is to remove column(s) from the dataframe. This is done using the drop method.
Similarly to the |
The most readable way to drop a column is by dropping it by name. To drop column(s) by name, you can use the following syntax: myDF.drop(['column1_name', 'column2_name', …], axis=1). The axis=1 argument tells pandas to drop columns, not rows.
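A minimal sketch of dropping by name, using a hypothetical toy dataframe (toy_df with made-up columns key, u, and v). By default, drop returns a new dataframe and leaves the original unchanged, so the result needs to be assigned to a variable.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy_df = pd.DataFrame({
    "key": [1, 2, 3],
    "u": [0.1, 0.2, 0.3],
    "v": list("xyz"),
})

# axis=1 tells pandas that the listed labels are column names.
without_key = toy_df.drop(["key"], axis=1)

# The columns= keyword is an equivalent spelling that avoids the axis argument.
same_result = toy_df.drop(columns=["key"])

print(toy_df.shape)       # the original still has all of its columns
print(without_key.shape)  # the new dataframe has one column fewer
```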
Write code to drop the Id column from myDF_subset into a new dataframe called myDF_without_id. Print the shape of the dataframe to verify that the column has been removed.
Additionally, we can extract columns from a dataframe into a new dataframe. Extracting a single column is very simple: myDF['column_name'] will return a pandas Series containing the values of the column. To extract multiple columns, you can pass a list of column names: myDF[['column1_name', 'column2_name', …]]. Selecting with a list of names already returns a dataframe.
To store a single extracted Series as a new dataframe, we can simply pass it to the DataFrame constructor: pd.DataFrame(myDF['column_name']).
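The difference between the two kinds of selection can be checked directly. The sketch below assumes a hypothetical toy dataframe (toy_df with made-up columns u, v, and w) and compares the types returned by each form.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy_df = pd.DataFrame({
    "u": [0.1, 0.2, 0.3],
    "v": list("xyz"),
    "w": [7, 8, 9],
})

# A single name in single brackets returns a one-dimensional Series.
one_col = toy_df["v"]

# Passing the Series to the DataFrame constructor gives a one-column dataframe.
as_frame = pd.DataFrame(one_col)

# Series.to_frame() is an equivalent way to do the same conversion.
also_frame = one_col.to_frame()

# A list of names in double brackets returns a dataframe directly,
# with the columns in the order they were listed.
two_cols = toy_df[["v", "u"]]

print(type(one_col).__name__, type(as_frame).__name__, type(two_cols).__name__)
print(as_frame.shape, two_cols.shape)
```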
Write code to extract the Species and SepalWidthCm columns from the myDF_without_id dataframe into a new dataframe called myDF_species. Print the shape of the dataframe to verify that the columns have been extracted. Print the first 5 rows of the dataframe to verify that the columns have been extracted correctly.
- Output of printing the shape of the dataframe after dropping the Id column
- Output of printing the first 5 rows of the dataframe after extracting the Species and SepalWidthCm columns
- Be sure to document your work from Question 5, using some comments and insights about your work.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project1.ipynb
You must double check your You will not receive full credit if your |