We're breaking this process into four steps:
We're going to work with Airbnb data for Austin, which we downloaded in CSV format from this site: http://insideairbnb.com/get-the-data.html
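Loading the data is a one-liner with pandas. A sketch of what that looks like, using a tiny inline sample (the "listings.csv" filename and the column names shown are assumptions based on the Inside Airbnb download):

```python
import io
import pandas as pd

# A real run would read the downloaded file, e.g.:
#   listings = pd.read_csv("listings.csv")
# For illustration, parse a tiny inline sample with the same shape:
sample = io.StringIO(
    "id,price,neighbourhood_cleansed,accommodates\n"
    '1,"$150.00",78704,4\n'
    '2,"$89.00",78702,2\n'
)
listings = pd.read_csv(sample)
print(listings.shape)
```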
There's a lot of data on this website, including reviews and calendar availability data, but given the time available we're focusing on the "listings.csv" data. Of course, things like the reviews would have a big influence on what price a host can charge.
We're going to be using a Python stack, with Jupyter notebooks, pandas and scikit-learn, as that's a good way to iteratively develop models before getting them ready for production.
There's A LOT of information for each listing. Since we don't have much time to build a model, let's look at the information we have and keep only the most relevant data. In this step we'll drop fields we don't care about, as well as fields we don't have time to process properly, such as complex text (which could be further work for a future project) and fields with a lot of overlap with other fields. We'll also check for missing values and convert each field to the format we need for training a model.
First let's look at listings.csv and see what information it gives us about each listing.
Looking at this further, we can also drop 'city' and 'state' since we're only looking at Austin. 'amenities' needs more text processing than we can deal with right now, 'maximum minimum nights' is annoying to deal with, 'calendar_updated' is free text, 'calendar_last_scraped' is not relevant, and 'jurisdiction_names' is not useful. We can also see some NaN and non-existent values that we'll deal with later.
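Dropping those columns is one `drop` call. A minimal sketch on a toy frame (the column names are assumptions matching the Inside Airbnb schema, and the values are made up):

```python
import pandas as pd

# Toy stand-in for the full listings table:
listings = pd.DataFrame({
    "price": ["$150.00", "$89.00"],
    "city": ["Austin", "Austin"],
    "state": ["TX", "TX"],
    "amenities": ['{"Wifi","Kitchen"}', '{"Pool"}'],
    "calendar_updated": ["2 weeks ago", "today"],
    "calendar_last_scraped": ["2019-11-01", "2019-11-01"],
    "jurisdiction_names": ["AUSTIN", "AUSTIN"],
})

drop_cols = ["city", "state", "amenities", "calendar_updated",
             "calendar_last_scraped", "jurisdiction_names"]
# errors="ignore" makes the drop safe to re-run in a notebook
listings = listings.drop(columns=drop_cols, errors="ignore")
print(list(listings.columns))
```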
'host_listings_count' looks identical to 'host_total_listings_count', so drop one of them.
Next, let's look at columns with a lot of missing data.
Let's delete features with a lot of empty data; even if we tried to impute them, the results would probably be wrong.
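One way to do both steps at once: compute the per-column missing fraction, then drop anything above a cutoff. A sketch on made-up data (the 0.5 threshold and the column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column:
listings = pd.DataFrame({
    "price": [150.0, 89.0, 210.0, 75.0],
    "square_feet": [np.nan, np.nan, np.nan, 1200.0],
    "review_scores_rating": [95.0, np.nan, 88.0, 99.0],
})

# Fraction of missing values per column, worst first
missing = listings.isnull().mean().sort_values(ascending=False)
print(missing)

# Drop anything more than half empty (assumed cutoff)
too_empty = missing[missing > 0.5].index
listings = listings.drop(columns=too_empty)
print(list(listings.columns))
```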
Convert all numerical values to floating point
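The price field in particular arrives as a string like "$1,500.00", so it needs the currency formatting stripped before the cast. A sketch on toy values (the raw string formats are assumptions based on the CSV):

```python
import pandas as pd

listings = pd.DataFrame({
    "price": ["$1,500.00", "$89.00"],
    "accommodates": ["4", "2"],
})

# Strip currency formatting from price, then cast numeric fields to float
listings["price"] = (listings["price"]
                     .str.replace(r"[$,]", "", regex=True)
                     .astype(float))
listings["accommodates"] = listings["accommodates"].astype(float)
print(listings.dtypes)
```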
Look at the features that are left: is there anything else we need to delete?
We can guess that neighbourhood, and location in general, is an important feature, but how many listings are in each neighbourhood?
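`value_counts` answers that question directly. A sketch on made-up neighbourhood values:

```python
import pandas as pd

# Toy neighbourhood column (values made up for illustration):
listings = pd.DataFrame({
    "neighbourhood": ["Downtown", "Downtown", "Downtown",
                      "East Downtown", "East Downtown", "Zilker"],
})
counts = listings["neighbourhood"].value_counts()
print(counts)
```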
Drop 'neighbourhood_cleansed' as it's the same as zipcode for Austin (making an assumption; could examine further), and also trim the zipcode.
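Both operations in a minimal sketch (the ZIP+4 value below is made up to show the trim):

```python
import pandas as pd

listings = pd.DataFrame({
    "neighbourhood_cleansed": ["78704", "78702"],
    "zipcode": ["78704-1234", "78702"],
})

listings = listings.drop(columns=["neighbourhood_cleansed"])
# Keep only the 5-digit prefix of the zipcode
listings["zipcode"] = listings["zipcode"].str[:5]
print(listings["zipcode"].tolist())
```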
Let's train a few different models and see how they perform
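The basic loop looks like this: hold out a test set, fit each candidate, and compare scores. A sketch using synthetic data in place of the cleaned listings (the shapes, coefficients, and model choices are assumptions for illustration, not the notebook's actual setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and prices:
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([120.0, 30.0, 0.0, 15.0, 5.0]) + rng.randn(200) * 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in [("linear", LinearRegression()), ("ridge", Ridge())]:
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
print(scores)
```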
None of these perform very well, so let's try some ensemble methods to see if they work better.
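Swapping in ensembles is a drop-in change to the same loop. A sketch on the same kind of synthetic stand-in data (again, the data and model settings are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and prices:
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([120.0, 30.0, 0.0, 15.0, 5.0]) + rng.randn(200) * 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in [
    ("random_forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gradient_boosting", GradientBoostingRegressor(random_state=0)),
]:
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
print(scores)
```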
Let's try tuning the parameters a bit; this improves things further.
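`GridSearchCV` is the standard way to do this tuning with cross-validation. A sketch on a small made-up dataset (the parameter grid shown is an example, not the grid actually used here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the listings features:
rng = np.random.RandomState(0)
X = rng.rand(150, 4)
y = X[:, 0] * 100 + rng.randn(150) * 5

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```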
In conclusion, we can see that the features which seem to correlate most with the price of an Airbnb are what you'd expect if you've ever booked one. The number of people it can accommodate matters: people are willing to pay more for a big house that fits a group of friends, or two couples travelling together. The host's listings count might indicate that hosts managing multiple listings are perhaps more professional about it and can charge more than someone renting just one property or a room in their house. People pay more for an entire house/apartment, which makes sense. And a neighbourhood of Downtown or East Downtown probably correlates with a higher price, since that's where most tourist locations are.
There's a lot we could improve upon in this model that we couldn't do in the time allowed. It's not very accurate yet, but it still shows a reasonable process of getting data, cleaning it, examining it, transforming it, and using it to create a simple predictive model. For example, with more time we could do further feature engineering on textual data like the reviews and the amenities: we can guess positive reviews would allow a host to charge a higher price, and amenities like a full kitchen, a pool or a balcony would as well. We didn't deal very well with missing values (I tried during testing and it didn't seem to affect the accuracy much, but this could be looked at more closely), we could do more feature selection and reject the low-value features we can see above, and we could also try different algorithms and tune the ones we have better.