In this project, we are going to work with housing data for the city of Ames, Iowa, United States, from 2006 to 2010. You can read more about why the data was collected here. You can also read about the different columns in the data here.
We are going to start by setting up a pipeline of functions that will let us quickly iterate on different models.
train => transform_features() => select_features() => train_and_test()
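A minimal sketch of that pipeline might look like the following. The function bodies are placeholders to be filled in as we iterate, and the toy data frame stands in for the real Ames data; the column names `Gr Liv Area` and `SalePrice` match the dataset, but the simple first-half/second-half holdout split is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def transform_features(df):
    # placeholder: will later fill missing values, engineer features,
    # and drop leaky or identifier columns
    return df

def select_features(df):
    # placeholder: will later select numeric, well-correlated features
    return df[["Gr Liv Area", "SalePrice"]]

def train_and_test(df):
    # simple holdout split: first half trains, second half tests
    train = df.iloc[:len(df) // 2]
    test = df.iloc[len(df) // 2:]
    features = train.columns.drop("SalePrice")
    model = LinearRegression().fit(train[features], train["SalePrice"])
    predictions = model.predict(test[features])
    return mean_squared_error(test["SalePrice"], predictions) ** 0.5

# tiny synthetic stand-in for the Ames data, just to exercise the pipeline
data = pd.DataFrame({
    "Gr Liv Area": [850, 1200, 1500, 1800, 2100, 2400],
    "SalePrice": [100000, 140000, 170000, 200000, 230000, 260000],
})
rmse = train_and_test(select_features(transform_features(data)))
```

Keeping the three stages as separate functions makes it cheap to swap one stage out and re-run the whole experiment.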
This is the experimentation part before applying it to the transform_features function.
The goal of this function is to:
- remove features we don't want to use in the model, based on the number of missing values or data leakage
- transform features into the proper format (numerical to categorical, scaling numerical values, filling in missing values, etc.)
- create new features by combining other features
Fill missing values with the most common value in the column if the column has less than 5% missing values. Nine columns fall below that 5% threshold while still containing missing values.
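The fill might be done along these lines. This is an illustrative snippet on a toy frame, not the real dataset: the 25-row `Garage Cars` column is constructed so that its single missing value sits under the 5% cutoff.

```python
import pandas as pd
import numpy as np

# toy frame: 25 rows, one missing value in "Garage Cars" (4%, under the 5% cutoff)
df = pd.DataFrame({"Garage Cars": [2.0] * 15 + [1.0] * 9 + [np.nan]})

# find columns with some missing values, but fewer than 5% of rows missing
missing_pct = df.isnull().sum() / len(df)
fixable = missing_pct[(missing_pct > 0) & (missing_pct < 0.05)].index

# fill each such column with its mode (most common value)
for col in fixable:
    df[col] = df[col].fillna(df[col].mode()[0])
```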
The following columns still have missing values. Let's leave them out of the model for now, because there is no clear benefit to using them.
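Dropping those remaining columns could be sketched like this; the toy frame and the choice of `Lot Frontage` as the example column are assumptions for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 70.0, np.nan],  # still has missing values
    "Gr Liv Area":  [850, 1200, 1500, 1800],
})

# drop any column that still contains missing values
still_missing = df.columns[df.isnull().any()]
df = df.drop(columns=still_missing)
```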
How long ago was the property remodeled when it was sold?
How long did it take to remodel?
How old was the property when it was sold?
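The three questions above can each be answered by subtracting year columns. A sketch on toy data follows; the new column names (and the reading of "how long did it take to remodel" as the gap between build year and remodel year) are my assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Yr Sold":        [2008, 2009, 2010],
    "Year Built":     [1960, 2005, 1995],
    "Year Remod/Add": [1985, 2006, 1995],
})

# how long ago the property was remodeled when it was sold
df["years_remod_before_sale"] = df["Yr Sold"] - df["Year Remod/Add"]
# how long after being built the property was remodeled
df["years_until_remod"] = df["Year Remod/Add"] - df["Year Built"]
# how old the property was when it was sold
df["years_before_sale"] = df["Yr Sold"] - df["Year Built"]
```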
It's odd that some of the years_remod_before_sale values came back negative. That means either the properties were remodeled after the sale, in which case the remodel is irrelevant to the final price, or it's human error. Either way, let's remove those records: they would distort the model, and there are only a few of them.
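The removal is a simple boolean filter; the toy frame below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "years_remod_before_sale": [23, -1, 15, 4],  # -1: remodel recorded after the sale
    "SalePrice": [150000, 180000, 200000, 220000],
})

# keep only records where the remodel happened before (or in) the sale year
df = df[df["years_remod_before_sale"] >= 0]
```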
For simplicity, we are going to use only numeric features for now. 'Yr Sold', 'Mo Sold', 'Sale Type', and 'Sale Condition' are going to be dropped because they can potentially leak data about the final sale.
'Order' and 'PID' are also going to be dropped: they are identifiers and don't really help the model.
'Year Built' and 'Year Remod/Add' are also going to be dropped because they were already used to create the new features.
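Taken together, the drops described above amount to one list of column names; the single-row toy frame is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Order": [1], "PID": [526301100],
    "Yr Sold": [2008], "Mo Sold": [5],
    "Sale Type": ["WD"], "Sale Condition": ["Normal"],
    "Year Built": [1960], "Year Remod/Add": [1985],
    "Gr Liv Area": [1656], "SalePrice": [215000],
})

# leakage risks, identifiers, and the raw year columns already
# folded into the new engineered features
to_drop = ["Yr Sold", "Mo Sold", "Sale Type", "Sale Condition",
           "Order", "PID", "Year Built", "Year Remod/Add"]
df = df.drop(columns=to_drop)
```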
Let's update the functions and see how the model improves.
RMSE score improved by about 1500.
Let's see how each column correlates with the final sale price.
There is quite a big jump between BsmtFin SF 1 and Lot Frontage. Let's cut off the columns whose absolute correlation coefficient is lower than 0.4.
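The cutoff might be applied like this. The toy frame is constructed so that `Gr Liv Area` correlates strongly with `SalePrice` and `Misc Val` does not; the real correlation values will of course differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Gr Liv Area": [850, 1200, 1500, 1800, 2100],
    "Misc Val":    [0, 700, 0, 500, 0],
    "SalePrice":   [100000, 145000, 172000, 198000, 231000],
})

# absolute correlation of each numeric column with the sale price
corr = df.corr()["SalePrice"].abs().sort_values(ascending=False)

# keep only columns above the 0.4 cutoff (excluding SalePrice itself)
keep = corr[corr > 0.4].index.drop("SalePrice")
```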
It looks like the following columns are strongly related to each other.
It might be a good idea to drop one of the correlated columns.
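For example, garage size in cars and garage size in square feet tend to move together, so one of the pair can be dropped without losing much information. The toy numbers below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Garage Cars": [1, 2, 2, 3, 1],
    "Garage Area": [280, 520, 510, 760, 300],
    "SalePrice":   [120000, 175000, 170000, 240000, 125000],
})

# the two garage columns are almost perfectly correlated
pair_corr = df["Garage Cars"].corr(df["Garage Area"])

# keep one member of the pair and drop the redundant partner
df = df.drop(columns=["Garage Area"])
```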
Now we are left with the following numeric features. Let's test how they are doing.
It's generally believed that corner lots are more expensive than others. It's probably safe to assume that the shape of the lot matters too. Let's explore that.
There are indeed some differences in average price depending on the lot shape and configuration. They can't be used directly in the model because a linear regression model only takes numerical values as features. Let's turn them into dummy columns.
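With pandas, `get_dummies` handles the conversion; the category values in this toy frame are drawn from the dataset's coding, and the loop pattern is one reasonable way to do it.

```python
import pandas as pd

df = pd.DataFrame({
    "Lot Shape":  ["Reg", "IR1", "Reg", "IR2"],
    "Lot Config": ["Inside", "Corner", "Inside", "CulDSac"],
    "SalePrice":  [150000, 185000, 152000, 210000],
})

# turn each nominal column into 0/1 indicator columns the model can use,
# then drop the original text column
for col in ["Lot Shape", "Lot Config"]:
    dummies = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, dummies], axis=1).drop(columns=col)
```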
Let's test whether that improves the model.
With the lot configuration features in, the model has improved. There are other nominal features we could explore, but for now we will move on to a different validation method and leave the other features for another time.