Simple machine learning model to predict daily Airbnb prices in Austin, Texas


We're looking at this process in terms of 4 steps:

  • Obtaining the data
  • Cleaning the data
  • Modelling the data
  • Interpreting the data

Step 1: Obtaining the data


We're going to work with Airbnb data for Austin, which we downloaded in CSV format from this site: http://insideairbnb.com/get-the-data.html

There's a lot of data on this website, including reviews and calendar availability data, but given the time available we're focusing on the "listings.csv" data. Of course, things like the reviews would have a big influence on what price a host can charge.

We're going to be using a Python stack, with Jupyter notebooks, pandas and scikit-learn, as that's a good way to iteratively develop models before getting them ready for production.
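As a runnable sketch of this loading step, here's what reading the listings into pandas looks like. Since the real "listings.csv" isn't bundled here, a tiny inline CSV with a few of the actual Inside Airbnb column names stands in for the download:

```python
import io
import pandas as pd

# In the notebook this would be pd.read_csv("listings.csv"); the inline CSV
# below is a stand-in so the snippet runs without the Inside Airbnb download.
raw = io.StringIO(
    "id,neighbourhood,accommodates,price\n"
    "1,Downtown,4,$150.00\n"
    "2,East Downtown,2,$95.00\n"
)
listings = pd.read_csv(raw)

print(listings.shape)           # (2, 4)
print(list(listings.columns))   # ['id', 'neighbourhood', 'accommodates', 'price']
```

The raw export has around a hundred columns, so the first thing we print in practice is the shape and the column list.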

Step 2: Cleaning the data


There's A LOT of information for each listing. Since we don't have much time to build a model, let's look at the information we have and keep only the most relevant data. In this step we'll delete information we don't care about for each listing, information we don't have time to process properly, like complex text (which could be further work for a future project), and data that overlaps heavily with other fields. We'll also check for missing values and convert each field to the format we need for training a model.

First let's look at the listings.csv and see what information it gives us about a listing.


Looking at this further, we can also drop 'city' and 'state' since we're only looking at Austin. 'amenities' needs more text processing than we can deal with right now, 'maximum_minimum_nights' is awkward to deal with, 'calendar_updated' is free text, 'calendar_last_scraped' is not relevant, and 'jurisdiction_names' is not useful. We can also see we have some NaN and non-existent values to deal with later.
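Dropping those columns is a one-liner in pandas. A minimal sketch, using a toy frame with a couple of the real column names (the full drop list would include every column mentioned above):

```python
import pandas as pd

# Toy frame with a few of the Inside Airbnb column names for illustration.
listings = pd.DataFrame({
    "id": [1, 2],
    "city": ["Austin", "Austin"],
    "state": ["TX", "TX"],
    "amenities": ["{Wifi,Kitchen}", "{Pool}"],
    "price": ["$150.00", "$95.00"],
})

cols_to_drop = ["city", "state", "amenities",
                "maximum_minimum_nights", "calendar_updated",
                "calendar_last_scraped", "jurisdiction_names"]
# errors="ignore" skips any name not present, which keeps this robust
# across different scrapes of the dataset.
listings = listings.drop(columns=cols_to_drop, errors="ignore")
print(list(listings.columns))  # ['id', 'price']
```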


'host_listings_count' looks identical to 'host_total_listings_count', so drop one of them.


Next, let's look at columns with a lot of missing data.


Let's delete features with a lot of empty data; even if we tried to estimate and impute the values, they would probably be wrong.
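One way to do this is to compute the fraction of missing values per column and drop anything above a threshold. A sketch on toy data (the 0.5 cutoff is a judgment call, not a rule):

```python
import numpy as np
import pandas as pd

listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "square_feet": [np.nan, np.nan, np.nan, 600.0],  # mostly empty
    "accommodates": [2, 4, 3, 2],
})

# Fraction of missing values in each column
missing_frac = listings.isna().mean()
# Keep only columns that are at most half empty
listings = listings.loc[:, missing_frac <= 0.5]
print(list(listings.columns))  # ['id', 'accommodates']
```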


Convert all numerical values to floating point.
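The price column in this dataset is a string like "$1,500.00", so it needs the currency symbols stripped before casting. A sketch of the conversion:

```python
import pandas as pd

listings = pd.DataFrame({"price": ["$1,500.00", "$95.00"],
                         "accommodates": ["4", "2"]})

# Strip '$' and ',' from the price strings, then cast to float
listings["price"] = (listings["price"]
                     .str.replace(r"[$,]", "", regex=True)
                     .astype(float))
listings["accommodates"] = listings["accommodates"].astype(float)
print(listings["price"].tolist())  # [1500.0, 95.0]
```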


Look at the features that are left: is there anything else we need to delete?


We can guess that neighbourhood, and location in general, is an important feature, but how many listings are in each neighbourhood?
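This is a simple `value_counts` call; a sketch with a handful of made-up rows:

```python
import pandas as pd

listings = pd.DataFrame({
    "neighbourhood": ["Downtown"] * 3 + ["East Downtown"] * 2 + ["Hyde Park"],
})

# How many listings fall in each neighbourhood, most common first
counts = listings["neighbourhood"].value_counts()
print(counts)
```

Neighbourhoods with only a handful of listings won't give the model much signal, which is worth keeping in mind when encoding this feature.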


Drop 'neighbourhood_cleansed' as it's the same as the zipcode for Austin (an assumption we could examine further), and also trim the zipcode.
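A sketch of both operations, with the trim keeping only the five-digit prefix so extended "ZIP+4" codes collapse to their base zipcode:

```python
import pandas as pd

listings = pd.DataFrame({
    "neighbourhood_cleansed": ["78701", "78702"],
    "zipcode": ["78701-1234", "78702"],
})

listings = listings.drop(columns=["neighbourhood_cleansed"])
# Keep only the first 5 characters of the zipcode
listings["zipcode"] = listings["zipcode"].str[:5]
print(listings["zipcode"].tolist())  # ['78701', '78702']
```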

Let's look at any high correlation between the features to see if there's any more features we can drop.
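The correlation check is `DataFrame.corr()`; in the notebook this would be rendered as a heatmap. A sketch where 'beds' is (by construction) perfectly correlated with 'accommodates', making it a drop candidate:

```python
import pandas as pd

listings = pd.DataFrame({
    "accommodates": [2, 4, 6, 8],
    "beds": [1, 2, 3, 4],  # exactly half of accommodates in this toy data
    "price": [80.0, 150.0, 220.0, 300.0],
})

corr = listings.corr()
print(corr.round(2))
# For any pair with absolute correlation above roughly 0.9,
# one of the two features can usually be dropped.
```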


Let's get everything ready to train the model.
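Getting ready typically means one-hot encoding the remaining categorical columns and splitting into train and test sets. A sketch on toy data (the column names mirror the real dataset, the values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

listings = pd.DataFrame({
    "accommodates": [2, 4, 6, 2, 4, 6, 3, 5],
    "room_type": ["Entire home/apt", "Private room"] * 4,
    "price": [100.0, 80.0, 220.0, 95.0, 150.0, 240.0, 120.0, 180.0],
})

# One-hot encode the categorical column so the models can consume it
X = pd.get_dummies(listings.drop(columns=["price"]), columns=["room_type"])
y = listings["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # (6, 3) (2, 3)
```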


Step 3: Modelling the data

Let's train a few different models and see how they perform.
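A baseline linear regression is the natural first model. A sketch on synthetic data standing in for the cleaned listings (price roughly scales with how many people a place accommodates, plus noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features and prices
rng = np.random.default_rng(0)
X = rng.integers(1, 10, size=(200, 1)).astype(float)
y = 40.0 * X[:, 0] + rng.normal(0, 20, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
```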


This isn't very good at all; let's try some ensemble methods to see if they work better.
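Random forests and gradient boosting are the usual candidates. A sketch on a synthetic non-linear target, the kind of relationship a plain linear regression would miss:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
# Non-linear target that a linear model would struggle with
y = X[:, 0] ** 2 + 5 * X[:, 1] + rng.normal(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
for model in (RandomForestRegressor(random_state=1),
              GradientBoostingRegressor(random_state=1)):
    score = model.fit(X_train, y_train).score(X_test, y_test)  # R^2
    print(type(model).__name__, round(score, 3))
```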


Let's try tuning the parameters a bit; this improves the results further.
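One standard way to tune is a small grid search with cross-validation. A sketch over the random forest parameters that usually matter most, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 2, size=200)

# Small grid; each combination is scored with 3-fold cross-validation
grid = GridSearchCV(
    RandomForestRegressor(random_state=2),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```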

We could try more models and optimizing them with more time, but let's take a look at what features seem to be the most important for predicting the price.
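Tree ensembles expose this directly through `feature_importances_`. A sketch with toy columns named after real listing features; 'accommodates' drives the synthetic price here, so it should rank first:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 10, 300),
    "host_listings_count": rng.integers(1, 50, 300),
    "minimum_nights": rng.integers(1, 7, 300),
})
# accommodates dominates the target by construction
y = 40 * X["accommodates"] + 2 * X["host_listings_count"] + rng.normal(0, 10, 300)

model = RandomForestRegressor(random_state=3).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```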


Conclusions (interpreting the data):


In conclusion, the features that seem to correlate the most with the price of an Airbnb are what you'd expect if you've ever booked one:

  • The number of people it can accommodate: people are willing to pay more for a big house that fits a group of friends, or two couples travelling together, for example.
  • The host's listings count: hosts managing multiple listings are perhaps more professional about it and can charge more than someone renting just one property or a room in their house.
  • Renting an entire house/apartment rather than a room, which makes sense.
  • A neighbourhood of Downtown or East Downtown, which probably correlates with a higher price since that's where most tourist locations are in Austin; this would of course vary for different cities.

There's a lot we could improve on this model that we couldn't do in the time allowed. It's not very accurate yet, but it still shows a reasonable process of getting data, cleaning it, examining it, transforming it and using it to create a simple predictive model. With more time we could do further feature engineering on textual data like the reviews and the amenities: we can guess positive reviews would let a host charge a higher price, and amenities like a full kitchen, a pool or a balcony would as well. We didn't deal very well with missing values (during testing this didn't seem to affect the accuracy much, but it could be looked at more closely), we could reject the low-importance features seen above, and we could try different algorithms and tune the ones we have better.