Classification of prediction correctness, a.k.a. VIRDI-indikator


Here I'm running some first experiments with creating a classifier for predicting whether we can trust a predicted price or not. I have downloaded the estimates_for_sold from prod and combined it with the address and market transactions database that I have locally.

Note that I have also taken some code from the change-io-method to ease the pre-processing steps, and that I have made this notebook stand-alone by adding code for index and double sale.

Note: The above processing includes only one estimate for each dwelling.

Reading in our dataset


Here I have already defined the types for each column in data_types.yaml.

Pre-processing of the data


Now, get information on the median APE and the count of dwellings that have been sold nearby. Again, this is slightly cheating because information from the APE is also being stored in the test data.
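As a rough illustration of the step above, here is a minimal sketch using only the standard library. It assumes that APE means absolute percentage error and that "nearby" can be approximated by a shared area identifier; the keys `area_id` and `ape` and the function `neighbourhood_stats` are hypothetical names, not the ones used in this notebook.

```python
from collections import defaultdict
from statistics import median


def neighbourhood_stats(sales, area_key="area_id", ape_key="ape"):
    """Return the median APE and number of sold dwellings per area.

    `sales` is an iterable of dicts; the key names are assumptions
    made for this sketch only.
    """
    grouped = defaultdict(list)
    for sale in sales:
        grouped[sale[area_key]].append(sale[ape_key])
    return {
        area: {"median_ape": median(apes), "n_sold": len(apes)}
        for area, apes in grouped.items()
    }


# Toy data, invented for the example.
sales = [
    {"area_id": "A", "ape": 0.04},
    {"area_id": "A", "ape": 0.10},
    {"area_id": "A", "ape": 0.07},
    {"area_id": "B", "ape": 0.30},
]
stats = neighbourhood_stats(sales)
print(stats)
```

Computing these statistics from the training rows only, and then looking them up for the test rows, would be one way to avoid the leakage mentioned above.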

Split dataset


Check that each feature is distributed equally in both the training and test datasets
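One possible way to make this check numeric is a two-sample Kolmogorov-Smirnov statistic per feature, which is the largest gap between the two empirical CDFs (0 means identical, 1 means no overlap). The sketch below is a plain-Python version and not necessarily what this notebook does; the feature names `living_area` and `build_year` and the toy values are made up.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(
            bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted)
        )
        for x in a_sorted + b_sorted
    )


# Hypothetical features and values, for illustration only.
train = {
    "living_area": [50, 62, 75, 90, 120],
    "build_year": [1960, 1975, 1988, 2001, 2015],
}
test = {
    "living_area": [48, 65, 80, 95, 118],
    "build_year": [1962, 1970, 1990, 2005, 2012],
}
for feature in train:
    print(feature, round(ks_statistic(train[feature], test[feature]), 3))
```

A small value for every feature suggests that the split did not skew any of them; what counts as "small" depends on the sample size.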

Train model and run classification


Evaluate the results


First, let's check the accuracy when we allow the sale to be misplaced by one or two categories
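A minimal sketch of such a tolerance-based accuracy, assuming the categories are ordered integers so that "misplaced by one category" means the predicted and true labels differ by at most 1. The function name `accuracy_within` and the toy labels are hypothetical.

```python
def accuracy_within(y_true, y_pred, tolerance=0):
    """Share of predictions that land within `tolerance` categories of the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy labels, invented for the example.
y_true = [1, 2, 3, 7, 5, 4]
y_pred = [1, 3, 5, 7, 2, 4]

for tolerance in (0, 1, 2):
    print(tolerance, accuracy_within(y_true, y_pred, tolerance))
```

With `tolerance=0` this reduces to ordinary accuracy, so the three numbers can be compared directly.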

There is no getting around that this is a really good result, and more than good enough for the intended usage. Let's have a closer look at how the classifications are spread.


From the above plot we can observe that the number of estimates in category 1 is overestimated, while the opposite is true for categories 2 and 3. Of note is also that dwellings with >25% error (category 7) are very well predicted. Now let's have a look at the original distribution of the data.
