TDM 40100: Project 03 - Intro to ML - Data Preprocessing

Project Objectives

Learn how to preprocess data for machine learning models. This includes one-hot encoding, scaling, and train-validation-test splitting.

Learning Objectives
  • Learn how to encode categorical variables

  • Learn why scaling data is important

  • Learn how to split data into training, validation, and test sets

Dataset

  • /anvil/projects/tdm/data/fips/fips.csv

  • /anvil/projects/tdm/data/boston_housing/boston.csv

Questions

Question 1 (2 points)

The accuracy of a machine learning model depends heavily on the quality of the dataset used to train it. There are several issues you may encounter if you feed raw data into a model. We explore some of these issues in this project, as well as other necessary steps to format data for machine learning.

The first step in preprocessing data for supervised learning is to split the dataset into input features and target variable(s). As you learned in Project 2, supervised learning models require a dataset of input features and their corresponding target variables. By separating the dataset into these components, we ensure that the model learns the relationship between the correct columns.

Write code to load the fips dataset into a variable called 'fips_df' using pandas, and separate it into two dataframes: one containing the input features and the other containing the target variable. Let’s use the CLASSFP column as the target variable and the rest of the columns as input features.

Typically, the dataset storing the target variable is denoted by y and the dataset storing the input features is denoted by X. This is not required, but it is a common convention that we recommend following.

To confirm your dataframes are correct, print the columns of each dataframe.
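If you have not done this kind of split before, the general pattern is sketched below. The file name and column name are placeholders, not the answer to this question.

import pandas as pd

# load a CSV file into a dataframe (placeholder path)
df = pd.read_csv('some_dataset.csv')

# the target variable is a single column...
y = df[['TARGET_COLUMN']]

# ...and the input features are every other column
X = df.drop(columns=['TARGET_COLUMN'])

print(X.columns)
print(y.columns)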

Deliverables
  • Load the fips dataset using pandas

  • Separate the dataset into input features and target variable

  • Print the column names of each dataframe

Question 2 (2 points)

Label encoding is a technique for converting categorical variables into a numerical format that machine learning models can understand. This is necessary for models that require numerical input features (which is often the case). Another benefit of label encoding is that it can decrease the memory usage of the dataset (a single integer value instead of a string).

The basic idea is that if a column contains n unique category labels, label encoding assigns each label a unique integer from 0 to n-1.

For example, if we have several colors we can encode them as follows:

Color     Encoded Value
Red       0
Green     1
Blue      2
Yellow    3

where we have four (n=4) unique colors, so their encoded values range from 0 to 3 (n-1).
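If you want to see this in action, here is a minimal sketch using scikit-learn's LabelEncoder on the color example. Note that LabelEncoder assigns integers to labels in sorted (alphabetical) order, so the exact numbers may differ from the table above:

from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue', 'Yellow']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

# sorted order means Blue=0, Green=1, Red=2, Yellow=3
print(dict(zip(colors, encoded)))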

Label encoding can lead the model to interpret the encoded values as having an order or ranking. In some cases, this is a benefit, such as encoding 'small', 'medium', and 'large' as 0, 1, and 2. However, this can sometimes lead to ordering that is not intended (such as our color example above). This is something to think about when deciding if label encoding is the right choice for a column or dataset.
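When the ordering is intentional, one option is to define the mapping yourself rather than letting an encoder choose it. Here is a minimal sketch using a hypothetical 'size' column:

import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# an explicit mapping preserves the intended order: small < medium < large
size_order = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_order)

print(df)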

Print the first 5 rows of the fips dataset. As you can see, the COUNTYNAME and STATE columns are categorical variables. If we were to use this dataset for a machine learning model, we would likely need to encode these columns into a numerical format.

In this question, you will use the LabelEncoder class from the scikit-learn library to label encode the COUNTYNAME column from the dataset.

Fill in and run the following code to label encode the input features that need to be encoded. (This code assumes your input features are stored in a variable called X.)

from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder object
encoder = LabelEncoder()

# create a copy of the input features to separate the encoded columns
X_label_encoded = X.copy()

X_label_encoded['COUNTYNAME'] = encoder.fit_transform(X_label_encoded['COUNTYNAME'])

Now that you have encoded the COUNTYNAME column, print the first 5 rows of the X_label_encoded dataset to see the changes. What is the largest encoded value in the COUNTYNAME column, and how does it relate to the number of unique county names?

You are not required to use the same variable names (X, X_label_encoded, etc.), but following this convention is strongly recommended.
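If you are unsure how to find the largest encoded value, one way (assuming the variable names above) is:

print(X_label_encoded.head())

# largest integer assigned by the encoder
print(X_label_encoded['COUNTYNAME'].max())

# number of distinct county names, for comparison
print(X_label_encoded['COUNTYNAME'].nunique())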

Deliverables
  • Print the first 5 rows of the X dataset before encoding

  • Label encode the COUNTYNAME column in the fips dataset

  • Print the first 5 rows of the X_label_encoded dataset after encoding

  • Largest encoded value in the COUNTYNAME column

Question 3 (2 points)

As we mentioned in the last question, label encoding can sometimes lead to unintended hierarchies or orderings in the model. A different encoding approach that avoids this issue is one-hot encoding. Instead of assigning a unique integer to each label, one-hot encoding creates a new binary column for each category label. The value in a binary column is 1 if the row has that category label, and 0 otherwise. This way, the model does not interpret the encoded values as related; it treats them as completely separate features.

To give an example, let’s look at how we would use one-hot encoding for the color example in the previous question:

Color     Red  Green  Blue  Yellow
Red       1    0      0     0
Green     0    1      0     0
Blue      0    0      1     0
Yellow    0    0      0     1

We have four unique colors, so one-hot encoding gives us four new columns to represent these colors.
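pandas can also produce this kind of table directly with its get_dummies function, which is a convenient way to see the idea before switching to scikit-learn. A minimal sketch on the color example:

import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Yellow']})

# one indicator column per unique color (columns appear in sorted order)
one_hot = pd.get_dummies(colors['Color'], prefix='Color', dtype=int)

print(one_hot)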

The scikit-learn library also provides a OneHotEncoder class that can be used to one-hot encode categorical variables. In this question, you will use this class to one-hot encode the STATE column from the dataset.

First, print the dimensions of the X dataset to see how many rows and columns are in the dataset before one-hot encoding.

Run the following code to one-hot encode the input features that need to be encoded. (This code assumes your input features are stored in a variable called X.)

from sklearn.preprocessing import OneHotEncoder

# create a OneHotEncoder object
encoder = OneHotEncoder()

# create a copy of the input features to separate the encoded columns
X_encoded = X.copy()

# fit and transform the 'STATE' column (OneHotEncoder expects 2D input, hence the double brackets)
# additionally, convert the sparse output to an array and then cast it to a DataFrame
encoded_columns = pd.DataFrame(encoder.fit_transform(X[['STATE']]).toarray(), index=X.index)

# drop the original column from the dataset
X_encoded = X_encoded.drop(['STATE'], axis=1)

# concatenate the encoded columns
X_encoded = pd.concat([X_encoded, encoded_columns], axis=1)

Now that you have one-hot encoded the STATE column, print the dimensions of the X_encoded dataset to see the changes. You should see the same number of rows as the original dataset, but a large number of additional columns for the one-hot encoded variables. Are there any concerns with how many columns were created (hint: think about memory size and the curse of dimensionality)?
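A quick way to compare the two versions (assuming the variable names from the code above) is to look at their shapes and approximate memory footprints:

# shape before and after one-hot encoding
print(X.shape)
print(X_encoded.shape)

# approximate memory usage of each version, in bytes
print(X.memory_usage(deep=True).sum())
print(X_encoded.memory_usage(deep=True).sum())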

Deliverables
  • How many rows and columns are in the X_encoded dataset after one-hot encoding?

  • How many columns were created during one-hot encoding?

  • What are some disadvantages of one-hot encoding?

  • When would you use one-hot encoding over label encoding?

Question 4 (2 points)

For this question, let’s switch over to the Boston Housing dataset. Load the dataset into a variable called boston_df. Print the first 5 rows of the CRIM, CHAS, AGE, and TAX columns. Then, write code to find the mean and range of values for each of these columns.

You can use the max and min functions to find the maximum and minimum values in a column, respectively. For example, boston_df['AGE'].max() will return the maximum value in the AGE column.
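A minimal sketch of that calculation (assuming boston_df has been loaded as described above):

for col in ['CRIM', 'CHAS', 'AGE', 'TAX']:
    col_mean = boston_df[col].mean()
    col_range = boston_df[col].max() - boston_df[col].min()
    print(col, col_mean, col_range)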

Scaling is another important preprocessing step when working with machine learning models. There are many approaches, but the goal is the same: ensure that all features are on a similar scale. Two common techniques are normalization and standardization. Normalization rescales a feature so that all of its values fall between 0 and 1. Standardization rescales a feature to a set mean (typically 0) and standard deviation (typically 1). Scaling matters because many machine learning models are sensitive to the scale of the input features: if the features are on very different scales, the model may give more weight to features with larger values, which can lead to poor performance.
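To make the two definitions concrete, here is a minimal sketch that applies both formulas by hand to a single column (assuming boston_df is already loaded). In practice, scikit-learn provides classes that do this for you.

age = boston_df['AGE']

# normalization (min-max scaling): values fall between 0 and 1
age_normalized = (age - age.min()) / (age.max() - age.min())

# standardization: mean of roughly 0, standard deviation of roughly 1
age_standardized = (age - age.mean()) / age.std()

print(age_normalized.min(), age_normalized.max())
print(age_standardized.mean(), age_standardized.std())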

As you may guess from the previous 2 questions, the scikit-learn library provides a StandardScaler class that can be used to scale input features. This class standardizes features to a mean of 0 and a standard deviation of 1.

Run the following code to scale the columns in the Boston dataset. (This code assumes your dataframe is stored in a variable called boston_df.)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# scale the CRIM, CHAS, AGE, and TAX columns
X_scaled = scaler.fit_transform(boston_df[['CRIM', 'CHAS', 'AGE', 'TAX']])

# convert X_scaled back into a dataframe
X_scaled = pd.DataFrame(X_scaled, index=boston_df.index, columns=['CRIM', 'CHAS', 'AGE', 'TAX'])

Now that you have scaled the input features, print the mean and range of values for the four columns after scaling. You should see that the range of values for each column is now similar, and the mean is close to 0.
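One compact way to check this (assuming X_scaled from the code above):

# mean, minimum, maximum, and range of each scaled column
summary = X_scaled.agg(['mean', 'min', 'max'])
summary.loc['range'] = summary.loc['max'] - summary.loc['min']
print(summary)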

Deliverables
  • Mean and range of values for the CRIM, CHAS, AGE, and TAX columns before scaling.

  • Mean and range of values for the CRIM, CHAS, AGE, and TAX columns after scaling.

  • How did scaling the input features affect the mean and range of values?

Question 5 (2 points)

The final step in preprocessing data for machine learning is to split the dataset into training and testing sets. The training set is the data used to train the model, and the testing set is used to evaluate the model’s performance after training.

A validation set is often also created to help tune the parameters of the model. This is not required for this project, but you may encounter it in other machine learning projects.

Again, scikit-learn provides everything we need. The train_test_split function can be used to split the dataset into training and testing sets.

This function takes in the input features and target variable(s), along with the test size, and randomly splits the dataset into training and testing sets. The test size is the fraction of the dataset that will be used for testing. We can also set a random state to ensure reproducibility.

If we withhold too much data for testing, the model may not have enough data to learn from. However, if we withhold too little, the testing set may be too small to give a reliable estimate of the model’s performance (and may fail to reveal overfitting to the training data). Typically, a test size of 10-30% is used.

Using our y dataframe from Question 1, and the X_encoded dataframe from Question 3, split the dataset into training and testing sets. Run the following code to split the dataset.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

If we wanted to create a validation set, we could use the same function again to split the X_train and y_train datasets into smaller training and validation sets.
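For reference, a second call on the training portion carves out a validation set (optional for this project). A minimal sketch, assuming the variables from the split above:

# hold out 25% of the training data (20% of the original data) for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))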

Now that you have split the dataset, print the number of rows in the training and testing sets to confirm the split was successful.

Deliverables
  • Number of rows in the training and testing sets

Question 6 (2 points)

A common issue with datasets is missing or incomplete data. Perhaps a row is missing information in one column (or in several). Missing values can cause serious issues if they end up in the training data, so it is important to handle them before we train our model.

One way to deal with missing data is simply to remove the rows that contain it. This is a very simple approach, but it is effective when the amount of missing data is small.

We can check whether a row has a missing value in a specific column using the isnull() function. For example:

missing_data = df['column_name'].isnull()

will return a boolean series with True for rows that have missing data, and False for rows that do not.

We can also simply use the dropna function to remove rows with missing data, and specify to only look in a subset of columns with the subset option. For example:

df = df.dropna(subset=['column_name'])

will remove rows with missing data in the column_name column.

For this question, we will modify the Boston dataset to have missing data, and then you will remove the rows with missing data.

First, run the following code to load the dataset and insert missing data:

import random
import numpy as np

boston_df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv')

random.seed(30)
for col in ['CRIM', 'CHAS', 'AGE', 'TAX']:
    # for each row, blank out this column's value roughly 10% of the time
    for i in range(len(boston_df)):
        if random.random() < 0.1:
            boston_df.loc[i, col] = np.nan

Now, given what you’ve learned, write code to answer the deliverables below.
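If you are unsure where to start, the pattern for a single column is sketched below; you can repeat it for the other columns and then use dropna as shown earlier.

# count missing values in one column
print(boston_df['CRIM'].isnull().sum())

# remove rows missing data in any of the four columns, then count what remains
boston_clean = boston_df.dropna(subset=['CRIM', 'CHAS', 'AGE', 'TAX'])
print(len(boston_clean))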

Deliverables
  • Number of rows missing data in the CRIM column

  • Number of rows missing data in the CHAS column

  • Number of rows missing data in the AGE column

  • Number of rows missing data in the TAX column

  • Number of rows left in the dataset after removing missing data

  • Is it always a good idea to remove rows with missing data (think curse of dimensionality)? Why or why not? Can you think of other ways to handle missing data?

Submitting your Work

Items to submit
  • firstname_lastname_project3.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.