scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?
What are the categorical features in our dataset?
For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables:
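The dummy-encoding step can be sketched with `pandas.get_dummies`. The `season` column and its values here are illustrative, not taken from the actual dataset:

```python
import pandas as pd

# Hypothetical data with a 'season' column coded 1-4
df = pd.DataFrame({"season": [1, 2, 3, 4, 2]})
df["season"] = df["season"].map({1: "spring", 2: "summer", 3: "fall", 4: "winter"})

# drop_first=True drops one level (the baseline) to avoid redundant dummies
dummies = pd.get_dummies(df["season"], prefix="season", drop_first=True)
print(dummies.columns.tolist())
```

With `drop_first=True`, "fall" (the first category alphabetically) becomes the baseline, and the remaining three seasons each get their own 0/1 column.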
The β values are called the model coefficients:
In the diagram above:
pred_ln_sales_in_thousands = -2.5129 - 0.0558 * Price_in_thousands + 0.3565 * Engine_size + 0.0426 * Wheelbase + 0.1010 * Fuel_efficiency - 0.7385 * Vehicle_type_Passenger
pred_sales_in_thousands = exp(pred_ln_sales_in_thousands)
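Plugging the fitted coefficients into a small function makes the two-step prediction (linear prediction on the log scale, then back-transform with `exp`) concrete. The input values below are hypothetical, not from the dataset:

```python
import math

# Coefficients taken from the fitted equation above
def predict_sales(price, engine_size, wheelbase, fuel_eff, is_passenger):
    ln_sales = (-2.5129
                - 0.0558 * price
                + 0.3565 * engine_size
                + 0.0426 * wheelbase
                + 0.1010 * fuel_eff
                - 0.7385 * is_passenger)
    return math.exp(ln_sales)  # back-transform from log scale to thousands of units

# Hypothetical vehicle: price 20 (thousands), 2.0L engine, 100" wheelbase,
# fuel efficiency 25, passenger type (dummy = 1)
print(predict_sales(20, 2.0, 100, 25, 1))
```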
How do we choose which features to include in the model? We're going to use train/test split (and eventually cross-validation).
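One way to compare feature subsets is cross-validated RMSE: fit a model on each subset and keep the one with the lower error. The sketch below uses synthetic data from `make_regression` as a stand-in for the vehicle dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix (5 informative features)
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=1)

def cv_rmse(X, y):
    # neg_mean_squared_error is negated so that higher is better;
    # flip the sign back and take the square root to get RMSE
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    return np.sqrt(-scores).mean()

print("all 5 features:", cv_rmse(X, y))
print("first 3 only: ", cv_rmse(X[:, :3], y))
```

Dropping informative features raises the cross-validated RMSE, so the full feature set would be preferred here.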
Why not use p-values or R-squared for feature selection?
R-squared is a statistical measure of how close the data are to the fitted regression line. It represents the percentage of the variation in the response variable that is explained by the model.
R-squared varies between 0% and 100%. 0% means the model explains none of the variability of the response, while 100% means it explains all of it. The closer R-squared is to 100%, the better the model fits the data.
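As a quick numeric illustration (with made-up values), scikit-learn's `r2_score` computes R-squared directly from true and predicted values:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.2, 7.1, 8.9]  # predictions close to the truth -> R^2 near 1

# R^2 = 1 - SS_res / SS_tot = 1 - 0.1 / 20 = 0.995
print(r2_score(y_true, y_pred))
```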
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
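All three metrics can be computed with scikit-learn (the example values are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]  # absolute errors: 10, 0, 20, 10

mae = mean_absolute_error(y_true, y_pred)  # (10 + 0 + 20 + 10) / 4 = 10
mse = mean_squared_error(y_true, y_pred)   # (100 + 0 + 400 + 100) / 4 = 150
rmse = np.sqrt(mse)                        # sqrt(150) ~ 12.25

print(mae, mse, rmse)
```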
Comparing these metrics:
All of these are loss functions, because we want to minimize them.
Here's an additional example, to demonstrate how MSE/RMSE punish larger errors:
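A minimal sketch with made-up numbers: both prediction sets below have the same total absolute error (and hence the same MAE), but concentrating the error in one big mistake inflates MSE and RMSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [10, 10, 10, 10]
pred_small_errors = [8, 8, 12, 12]    # four errors of size 2
pred_one_big_error = [10, 10, 10, 2]  # one error of size 8

# Same MAE for both: (2+2+2+2)/4 = 8/4 = 2
print(mean_absolute_error(y_true, pred_small_errors))   # 2.0
print(mean_absolute_error(y_true, pred_one_big_error))  # 2.0

# RMSE differs: sqrt(16/4) = 2 vs sqrt(64/4) = 4
print(np.sqrt(mean_squared_error(y_true, pred_small_errors)))   # 2.0
print(np.sqrt(mean_squared_error(y_true, pred_one_big_error)))  # 4.0
```

Squaring the errors before averaging is what makes MSE/RMSE penalize the single large error more heavily.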
The sklearn library has a comprehensive set of APIs to split datasets, build models, test them, and calculate accuracy metrics.
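The full split-fit-evaluate workflow can be sketched as follows, again using synthetic `make_regression` data in place of the vehicle dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in for the real feature matrix
X, y = make_regression(n_samples=200, n_features=4, noise=5, random_state=42)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)          # learn the coefficients on the training set
y_pred = model.predict(X_test)       # predict on unseen data

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("coefficients:", model.coef_)
print("test RMSE:", rmse)
```

Evaluating on the held-out test set, rather than the training data, gives an honest estimate of how the model will perform on new observations.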
The residuals are randomly distributed, with no visible pattern, so the model can be assumed to be adequate.
As the p-values are less than 5%, the variables are significant in the regression equation.
Advantages of linear regression:
Disadvantages of linear regression: