Data Analysis for Machine Learning


We'll work with a Kaggle dataset: House Sales in King County, USA.

These are the features of the dataset:

  • id: a unique identifier for a house
  • date: date the house was sold
  • price: sale price (the prediction target)
  • bedrooms: number of bedrooms in the house
  • bathrooms: number of bathrooms in the house
  • sqft_living: square footage of the home
  • sqft_lot: square footage of the lot
  • floors: total floors (levels) in the house
  • waterfront: whether the house has a view of the waterfront
  • view: has been viewed
  • condition: overall condition of the house
  • grade: overall grade given to the housing unit, based on the King County grading system
  • sqft_above: square footage of the house excluding the basement
  • sqft_basement: square footage of the basement
  • yr_built: year the house was built
  • yr_renovated: year the house was renovated
  • zipcode: ZIP code
  • lat: latitude coordinate
  • long: longitude coordinate
  • sqft_living15: living room area in 2015 (implies some renovations); this may or may not have affected the lot size
  • sqft_lot15: lot size in 2015 (implies some renovations)

Importing the required libraries:
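
The import cell itself is not part of this text. A plausible set, assuming the steps below rely on pandas, NumPy, Matplotlib and scikit-learn (the exact list is an assumption):

```python
# Assumed imports; the original import cell is not shown in this text.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
```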

Exploratory Data Analysis


Loading the dataframe:
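
A minimal sketch of the load step. The file name is an assumption (the Kaggle download is commonly saved as kc_house_data.csv), and `load_houses` is a hypothetical helper name:

```python
import pandas as pd

# Assumed file name; point this at wherever your copy of the CSV lives.
CSV_PATH = "kc_house_data.csv"


def load_houses(path: str = CSV_PATH) -> pd.DataFrame:
    """Read the house sales CSV into a DataFrame."""
    return pd.read_csv(path)
```

Typical use would be `df = load_houses()` followed by `df.head()` to eyeball the first rows.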

Step 1: 5-minute health check


The first step when analyzing data is cleaning: checking that we've loaded the data correctly and that the values are valid. This process involves multiple steps, but for now we start with our 5-minute check:
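
A small sketch of that first look, assuming the DataFrame is already loaded. `quick_look` is a hypothetical helper; it only wraps `shape` and `head`:

```python
import pandas as pd


def quick_look(df: pd.DataFrame, n: int = 5):
    """Return the (rows, columns) shape and the first n rows."""
    return df.shape, df.head(n)
```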

shape tells us that there are 21,613 rows and 21 columns (features). Let's check for red flags on those features:
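
`df.info()` prints its report straight to the screen. As a sketch, the same facts can be collected into a table that is easier to filter; `column_summary` is a hypothetical helper, not a pandas function:

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype plus non-null and missing counts, similar to df.info()."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
    })
```

Any column with `missing > 0`, or a numeric feature that shows up with an `object` dtype, is worth a closer look.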

info gives you a quick summary of both the type and the non-null count for each column. In this case the data looks correct: there are no missing values and the types are as expected.

Step 2: High level Feature Selection


Our objective is to predict the price of a house based on the features that we know about it. For example, we know that a larger surface area and more bedrooms are associated with a higher price. But what about the id of the house? It's probably just an internal ID and does not affect the actual price.

That is feature selection: understanding which features are important to the ML model.

With pandas it is extremely simple to exclude columns:
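
A sketch of the exclusion, using `DataFrame.drop`. The wrapper name `drop_columns` is hypothetical:

```python
import pandas as pd


def drop_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a new DataFrame without the given columns (df is not modified)."""
    return df.drop(columns=columns)
```

For example, `df = drop_columns(df, ["id"])` removes the identifier column.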

What other variables would you exclude? For this workshop, we'll exclude date, lat and long. A more thorough analysis could make use of lat and long, but zipcode is probably enough.

Step 3: Correlation between variables


Some variables will have a higher (positive or negative) correlation with the price. We know that the surface area of a house is positively correlated with its price: the larger the house, the higher the price. But what about the others? We can build a simple correlation table to better understand the relationships between variables:
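
A minimal sketch of the correlation table, assuming Pearson correlation over the numeric columns only (the original cell is not shown, so the exact call is an assumption):

```python
import pandas as pd


def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation between the numeric columns of df."""
    return df.select_dtypes("number").corr()
```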

So, for example, we can see that sqft_living is highly correlated with the price:
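
To rank features by how strongly they move with the target, one option is to take the price column of the correlation table and sort it. `correlation_with` is a hypothetical helper:

```python
import pandas as pd


def correlation_with(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Correlation of every numeric column with `target`, strongest positive first."""
    corr = df.select_dtypes("number").corr()[target]
    # Drop the target itself: its correlation with itself is always 1.
    return corr.drop(target).sort_values(ascending=False)
```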

We'll use a simple visualization to get a visual cue about these variables and their correlation:
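
The original plot is not shown. One common choice is a heatmap of the correlation matrix; the sketch below draws it with plain Matplotlib, and the colormap and figure size are arbitrary assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_correlation_heatmap(df: pd.DataFrame):
    """Draw the numeric correlation matrix as a color-coded grid."""
    corr = df.select_dtypes("number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    # Fix the color range to [-1, 1] so 0 always maps to the neutral color.
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(list(corr.columns), rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(list(corr.index))
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return ax
```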

We see some strange patterns, such as the apparent "negative" correlation between zipcode and price, which doesn't make any sense. We'll talk more about this when we explore zipcode as a categorical feature later.

Once we identify correlation between different variables, we can explore how they're correlated. For example, we saw sqft_living and price:
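
A scatter plot is a simple way to look at one such pair. This sketch uses pandas' built-in plotting; `scatter_against_price` is a hypothetical helper and the transparency value is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_against_price(df: pd.DataFrame, feature: str, target: str = "price"):
    """Scatter plot of `feature` (x axis) against `target` (y axis)."""
    # A low alpha keeps dense regions readable when there are many points.
    ax = df.plot.scatter(x=feature, y=target, alpha=0.3)
    ax.set_xlabel(feature)
    ax.set_ylabel(target)
    return ax
```

For example, `scatter_against_price(df, "sqft_living")`.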

What about grade and price?

They also seem strongly correlated, but are they linearly correlated?

It doesn't seem so, or at least it's not as clear as with sqft_living. There seems to be some sort of polynomial relationship. We can use a logarithmic y axis to test:
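
A sketch of that test, in two parts: a scatter plot with a log-scaled y axis, and a numeric check that compares the correlation of the feature with the raw price against its correlation with log(price). Both helper names are hypothetical, and the numeric check is an addition that is not described in the text:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_log_y(df: pd.DataFrame, feature: str = "grade", target: str = "price"):
    """Scatter plot of feature vs target with a logarithmic y axis."""
    ax = df.plot.scatter(x=feature, y=target, alpha=0.3)
    ax.set_yscale("log")
    return ax


def raw_vs_log_correlation(df: pd.DataFrame, feature: str = "grade",
                           target: str = "price"):
    """Pearson correlation of feature with target, and with log(target)."""
    raw = df[feature].corr(df[target])
    logged = df[feature].corr(np.log(df[target]))
    return raw, logged
```

If the second number is clearly higher than the first, the relationship is closer to linear on the log scale.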

It now looks a little bit better. We can use these relationships we've identified to improve our model later.

Step 4: More cleaning, identifying outliers


Linear regression (along with other ML models) will be really sensitive to outliers:
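
A quick way to spot suspicious extremes is to look at the minimum and maximum of each numeric column. This is a sketch built on `describe`; `value_ranges` is a hypothetical helper:

```python
import pandas as pd


def value_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Minimum and maximum of each numeric column, one row per column."""
    return df.describe().loc[["min", "max"]].T
```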

🤔 A house with 33 bedrooms? There's something going on here:
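
One way to see how unusual a value is: count how many houses have each number of bedrooms. A sketch, with a hypothetical helper name:

```python
import pandas as pd


def value_distribution(df: pd.DataFrame, column: str = "bedrooms") -> pd.Series:
    """Number of rows per distinct value of `column`, ordered by value."""
    return df[column].value_counts().sort_index()
```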

It makes sense for a (really expensive) house to have, let's say 10 bedrooms, but 33 seems like an error.
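
To inspect the offending rows, a boolean filter is enough. The threshold of 10 echoes the figure in the sentence above and is otherwise arbitrary; `suspicious_bedrooms` is a hypothetical helper:

```python
import pandas as pd


def suspicious_bedrooms(df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Rows whose bedroom count is strictly greater than `threshold`."""
    return df[df["bedrooms"] > threshold]
```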

33 bedrooms and only 1.75 bathrooms? 😅 clearly an error.

Now, what about those properties without bathrooms? That is strange, let's take a look:

Now that we look at it, it makes a little more sense. Maybe those are just warehouses or some other type of storage facility? Without more information it is difficult to make a decision. This is an important lesson: domain expertise is fundamental when analyzing data.

I won't remove any houses for now.

How are other variables doing?

This probably requires a little bit more analysis, but let's proceed.

Step 5: Dummy variables


The zipcode feature poses a problem. Machine learning models don't understand "human" features like zipcode. For an ML algorithm, a value of 98178 in zipcode is "greater" than 98125, even though for us, knowing the area, the zipcode 98125 might have more expensive houses. These are the zipcodes in our dataset:
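
A sketch for listing the distinct zipcodes and counting them. `zipcode_overview` is a hypothetical helper:

```python
import pandas as pd


def zipcode_overview(df: pd.DataFrame, column: str = "zipcode"):
    """Return the sorted distinct values of `column` and how many there are."""
    values = sorted(df[column].unique())
    return values, df[column].nunique()
```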

Only 70 zipcodes:

Introducing "Dummy Variables":
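
A minimal sketch using `pandas.get_dummies`, which turns one categorical column into one indicator column per distinct value. The `zipcode_` prefix and the helper name are assumptions:

```python
import pandas as pd


def zipcode_dummies(df: pd.DataFrame, column: str = "zipcode") -> pd.DataFrame:
    """One indicator column per distinct value of `column`."""
    return pd.get_dummies(df[column], prefix=column)
```

Each row has exactly one indicator set, so the model no longer sees any ordering between zipcodes.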

Dummy variables are the standard way to feed a categorical feature to an ML model. We'll see how to combine these later.

Step 6: Feature scaling and normalization


There's a final IMPORTANT point to discuss, and that is "scaling" and "normalizing" features. It has a mathematical explanation, but basically, what we DON'T want is to have features that are in completely different units. For example:

The values here are on very different scales, which makes some algorithms perform poorly and run more slowly. We'll "scale" these features to remove the unit. Read more here: Importance of Feature Scaling
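
A sketch of the scaling step with scikit-learn's `StandardScaler`, which subtracts each column's mean and divides by its standard deviation. Which scaler the original cell used is not shown, so this choice is an assumption, as is the helper name:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_features(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Standardize the given columns to zero mean and unit variance."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df[columns])
    # Wrap the NumPy result so column names and the row index are kept.
    return pd.DataFrame(scaled, columns=columns, index=df.index)
```

For example, `scale_features(df, ["sqft_living", "bedrooms"])` puts both features on a comparable scale.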

Step 7: Putting it all together


We'll now use a really convenient package called sklearn-pandas that will let us scale our features and also create the Dummy zip variables:
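
The sklearn-pandas cell is not shown in this text. As a stand-in, the sketch below uses scikit-learn's own `ColumnTransformer`, a different tool that covers the same two jobs: scaling the numeric columns and one-hot encoding zipcode. The helper name and the column lists passed to it are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_columns, categorical_columns=("zipcode",)):
    """Scale numeric columns and one-hot encode categorical ones."""
    return ColumnTransformer([
        ("scale", StandardScaler(), list(numeric_columns)),
        # Ignore zipcodes at prediction time that were not seen during fit.
        ("dummies", OneHotEncoder(handle_unknown="ignore"),
         list(categorical_columns)),
    ])
```

Columns that are not listed are dropped, which is `ColumnTransformer`'s default behavior.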

Step 8: Profit


Let's see now how our Linear Regression is performing with these simple modifications:
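
A sketch of the final fit, under several assumptions: a held-out split of 20%, scikit-learn's `ColumnTransformer` in place of sklearn-pandas, and `score` as the metric. For regressors, `score` returns the coefficient of determination R²; whether the figure quoted below was computed this way is an assumption. `fit_and_score` is a hypothetical helper:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def fit_and_score(df: pd.DataFrame, numeric_columns, target: str = "price",
                  test_size: float = 0.2, seed: int = 0):
    """Fit scaled + dummy-encoded linear regression; return (model, R^2)."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    preprocess = ColumnTransformer([
        ("scale", StandardScaler(), list(numeric_columns)),
        ("dummies", OneHotEncoder(handle_unknown="ignore"), ["zipcode"]),
    ])
    model = make_pipeline(preprocess, LinearRegression())
    # Fitting inside the pipeline keeps test rows out of the scaler statistics.
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```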

0.79! Much better, right? This is just an introduction to how important a good data analysis process is when applied to Machine Learning.