Predicting House Sale Prices


In this project, we are going to work with housing data for the city of Ames, Iowa, United States, from 2006 to 2010. You can read more about why the data was collected here. You can also read about the different columns in the data here.

We are going to start by setting up a pipeline of functions that will let us quickly iterate on different models.

train => transform_features() => select_features() => train_and_test()
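A minimal skeleton of that pipeline might look like the following sketch. The 1460-row holdout split and the starter feature `Gr Liv Area` are assumptions for illustration; the column names come from the Ames data dictionary.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def transform_features(df):
    # Placeholder: cleaning and feature engineering will go here.
    return df

def select_features(df):
    # Placeholder: start with one obviously useful numeric column.
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    # Simple holdout split (assumed): first 1460 rows train, rest test.
    train = df[:1460]
    test = df[1460:]
    features = df.columns.drop('SalePrice')
    lr = LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    predictions = lr.predict(test[features])
    # Report root mean squared error on the holdout set.
    return mean_squared_error(test['SalePrice'], predictions) ** 0.5
```

Keeping each stage as its own function lets us swap in new cleaning or selection logic and re-run a single call to compare RMSE scores.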

Setting up the environment


Feature Engineering


This is the experimentation part before the steps are applied to the transform_features function.

The goal of this function is to:

- remove features that we don't want to use in the model, just based on the number of missing values or leakage
- transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
- create new features by combining other features

Null Values Ratio


Drop any column that has more than 25% missing values.

If a column is missing fewer than 5% of its values, fill the gaps with the column's most common value. That leaves 9 columns that have missing values in fewer than 5% of their rows.

The following columns still have missing values. Let's leave them out of the model for now, since there is no clear benefit to using them.
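The two rules above can be sketched as one helper (a sketch; the function name and the final `dropna` for the remaining columns are assumptions):

```python
import pandas as pd

def handle_missing(df):
    # Rule 1: drop columns where more than 25% of the values are missing.
    missing_ratio = df.isnull().sum() / len(df)
    df = df.drop(missing_ratio[missing_ratio > 0.25].index, axis=1)

    # Rule 2: if a column is missing fewer than 5% of its values,
    # fill the gaps with the column's most common value (the mode).
    low_missing = missing_ratio[(missing_ratio > 0) & (missing_ratio < 0.05)].index
    for col in low_missing:
        if col in df.columns:
            df[col] = df[col].fillna(df[col].mode()[0])

    # Columns that still have missing values are dropped rather than imputed.
    return df.dropna(axis=1)
```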

New Features


- How long before the sale was the property remodeled?
- How long did the remodel take?
- How old was the property when it was sold?
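The three questions above map to simple year arithmetic on existing columns (the source columns are from the Ames data dictionary; the new feature names, apart from years_remod_before_sale mentioned below, are assumptions):

```python
import pandas as pd

def add_age_features(df):
    out = df.copy()
    # Years between the last remodel and the sale.
    out['years_remod_before_sale'] = out['Yr Sold'] - out['Year Remod/Add']
    # Years between construction and the last remodel.
    out['years_to_remodel'] = out['Year Remod/Add'] - out['Year Built']
    # Age of the property at the time of sale.
    out['years_before_sale'] = out['Yr Sold'] - out['Year Built']
    return out
```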


It is odd that some of the years_remod_before_sale values came back negative. That means the properties were either remodeled after the sale, in which case the remodel is irrelevant to the final price, or the dates contain human error. Either way, let's remove those records: there are only a few, and they would distort the model.
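Removing those rows is a one-line filter (a sketch; the helper name is an assumption):

```python
import pandas as pd

def drop_negative_remodels(df):
    # Keep only rows whose remodel happened on or before the sale year.
    return df[df['years_remod_before_sale'] >= 0]
```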

For simplicity, we are only going to use numeric features for now. 'Yr Sold', 'Mo Sold', 'Sale Type', and 'Sale Condition' are going to be dropped because they can potentially leak data about the final sale.

'Order' and 'PID' are also going to be dropped because they are identifiers; they don't really help the model.

'Year Built' and 'Year Remod/Add' are also going to be dropped because they were already used to create the new features.
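Those drops, plus the restriction to numeric columns, can be sketched as one step (the helper name is an assumption; the column list matches the discussion above):

```python
import pandas as pd

def drop_leaky_and_id_columns(df):
    # Leakage columns, identifiers, and the raw year columns that were
    # replaced by the new age features.
    drop_cols = ['Yr Sold', 'Mo Sold', 'Sale Type', 'Sale Condition',
                 'Order', 'PID', 'Year Built', 'Year Remod/Add']
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Keep only numeric columns for now.
    return df.select_dtypes(include=['number'])
```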

Let's update the functions and see how our model has been improved.


The RMSE score improved by about 1500.

Feature Selection


Let's see how each column correlates with the final sale price.


There is quite a big jump between BsmtFin SF 1 and Lot Frontage. Let's cut off the columns whose absolute correlation coefficient is lower than 0.4.
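The cutoff can be applied like this (a sketch; the function name and signature are assumptions, and it assumes the DataFrame is already all-numeric):

```python
import pandas as pd

def select_by_correlation(df, target='SalePrice', cutoff=0.4):
    # Absolute correlation of every column with the target price.
    corrs = df.corr()[target].abs()
    # Keep only columns whose |correlation| clears the cutoff
    # (the target itself always survives, with correlation 1).
    keep = corrs[corrs > cutoff].index
    return df[keep]
```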


It looks like the following columns are strongly correlated with each other.

  • Total Bsmt SF and 1st Flr SF
  • TotRms AbvGrd and Gr Liv Area
  • Garage Cars and Garage Area

It might be a good idea to drop one of the correlated columns.
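One possible implementation of that drop is below. Which member of each pair to remove isn't shown in the original; this sketch assumes we keep the member with the stronger relationship to living area and garage size as commonly reported for this dataset, so treat the exact choices as assumptions.

```python
import pandas as pd

def drop_collinear(df):
    # Assumed choice: drop one member of each strongly correlated pair
    # ('1st Flr SF' vs 'Total Bsmt SF', 'TotRms AbvGrd' vs 'Gr Liv Area',
    #  'Garage Cars' vs 'Garage Area').
    to_drop = ['1st Flr SF', 'TotRms AbvGrd', 'Garage Cars']
    return df.drop(columns=[c for c in to_drop if c in df.columns])
```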


Now we are left with the following numeric features. Let's test how they do.


Categorical Columns


Lot Shape and Config


It's generally believed that corner lots are more expensive than others, and it's probably safe to assume that the shape of the lot matters too. Let's explore that.


There are indeed some differences in average price depending on the lot shape and configuration. They can't be used directly in the model, because a linear regression model only takes numerical values as features. Let's turn them into dummy columns.
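The conversion is a standard one-hot encoding with pandas (a sketch; the helper name is an assumption):

```python
import pandas as pd

def add_lot_dummies(df):
    # One-hot encode the two nominal lot columns so linear regression
    # can consume them as 0/1 indicator features.
    dummies = pd.get_dummies(df[['Lot Shape', 'Lot Config']])
    return pd.concat([df.drop(columns=['Lot Shape', 'Lot Config']), dummies],
                     axis=1)
```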


Let's test whether that improves the model.


With the lot configuration features in, the model has improved. There are other nominal features we could explore, but for now we will move on to a different validation method and leave them for another time.

K-Fold Validation
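A k-fold variant of train_and_test might look like this sketch: instead of one fixed holdout split, the data is divided into k folds, each fold serves as the test set once, and the RMSEs are averaged (the function name, k=4 default, and random_state are assumptions).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def train_and_test_kfold(df, k=4):
    features = df.columns.drop('SalePrice')
    kf = KFold(n_splits=k, shuffle=True, random_state=1)
    rmses = []
    for train_idx, test_idx in kf.split(df):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        lr = LinearRegression()
        lr.fit(train[features], train['SalePrice'])
        preds = lr.predict(test[features])
        rmses.append(mean_squared_error(test['SalePrice'], preds) ** 0.5)
    # Average RMSE across folds is a more stable estimate than one holdout.
    return np.mean(rmses)
```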