• Define the business problem - A major automobile company wants to design a new product that can improve sales. To define the product, it wants to identify the drivers of sales (which factors drive sales) and to predict the sales of different car models given those factors.
• Convert the business problem into a statistical problem - sales = F(sales attributes, product features, marketing info, etc.)
• Find the right technique - Since we are predicting a continuous value (a regression problem), OLS is one suitable technique. Other options include decision trees, ensemble learning, KNN, SVM, ANN, etc.
• Data collection (Y, X) - Identify the sources of information and collect the data.
• Consolidate the data - Aggregate and consolidate the data at the model, customer, or store level, depending on the business problem.
• Data preparation for modeling (create a data audit report to identify the steps to perform as part of data preparation): a. missing value treatment, b. outlier treatment, c. dummy variable creation
• Variable creation - Create new variables through transformations and derived variables.
• Check the basic assumptions (normality, linearity, no outliers, homoscedasticity, no pattern in residuals, no autocorrelation, etc.)
• Variable reduction techniques (removing multicollinearity with the help of FA/PCA, correlation matrices, VIF)
• Create dev and validation data sets (50:50 if you have plenty of data, otherwise 70:30 or 80:20)
• Modeling on the dev data set (identify significant variables, interpret the model, check the signs and magnitudes of the coefficients, check for multicollinearity, measure goodness of fit, write the final mathematical equation, etc.)
• Validating on the validation data set (check the stability of the model, scoring, decile analysis, cross-validation, etc.)
• Output interpretation and deriving insights (understand the limitations of the model and define an implementation strategy)
• Convert the statistical solution into a business solution (implementation, model monitoring, etc.)
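As an illustration of the VIF check mentioned in the variable reduction step, here is a minimal sketch using NumPy. The helper name `vif` and the synthetic columns `x1`, `x2`, `x3` are assumptions made for this example; they are not part of the car sales data. Each column is regressed on the others, and VIF = 1 / (1 - R²) of that regression.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of every column of X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic data (illustrative only): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of multicollinearity; in this sketch `x1` and `x2` should be flagged while `x3` stays close to 1.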

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

• Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3)
• Unordered categories: use dummy encoding (0/1)
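For the ordered case, the mapping can be as simple as a dictionary lookup. The sketch below uses the small/medium/large example from the list; the sample values are made up for illustration.

```python
# Ordered category: the numeric codes preserve the order small < medium < large.
size_order = {"small": 1, "medium": 2, "large": 3}

# Hypothetical sample values, for illustration only.
sizes = ["medium", "small", "large", "small"]
encoded = [size_order[s] for s in sizes]
print(encoded)  # [2, 1, 3, 1]
```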

What are the categorical features in our dataset?

• Ordered categories: weather (already encoded with sensible numeric values)
• Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we represent it with multiple 0/1 dummy variables.
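A minimal sketch of dummy encoding with pandas, assuming a DataFrame named `bikes` with a `season` column (the six rows here are made up). Dropping the first dummy is a choice made in this example: with an intercept in the model, three dummies are enough to describe four seasons, and the dropped season (spring) becomes the baseline.

```python
import pandas as pd

# Hypothetical mini-dataset; season codes follow the text (1 = spring ... 4 = winter).
bikes = pd.DataFrame({"season": [1, 2, 3, 4, 1, 3]})

# Create 0/1 columns for the seasons; drop_first=True omits season_1 (the baseline).
season_dummies = pd.get_dummies(
    bikes["season"], prefix="season", drop_first=True, dtype=int
)

# Attach the dummy columns to the original data.
bikes = pd.concat([bikes, season_dummies], axis=1)
print(bikes)
```

A row with all three dummies equal to 0 is spring; each coefficient on a dummy is then interpreted relative to spring.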

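Linear regression models the response as a linear function of the features:

$y = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n$

• $y$ is the response
• $\beta_0$ is the intercept
• $\beta_1$ is the coefficient for $x_1$ (the first feature)
• $\beta_n$ is the coefficient for $x_n$ (the nth feature)

The $\beta$ values are called the model coefficients: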

• These values are estimated (or "learned") during the model fitting process using the least squares criterion.
• Specifically, we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").
• Once we've learned these coefficients, we can use the model to predict the response.

In a plot of the fitted line against the data:

• The black dots are the observed values of x and y.
• The blue line is our least squares line.
• The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.
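The least squares criterion can be sketched in a few lines of NumPy. The five data points are made up for illustration. The code fits an intercept and slope with `np.linalg.lstsq`, then checks that nudging either coefficient increases the sum of squared residuals.

```python
import numpy as np

# Made-up observations of x and y, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

best = sse(intercept, slope)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSE={best:.4f}")

# Any other line has a larger sum of squared residuals.
print(sse(intercept + 0.1, slope) > best, sse(intercept, slope + 0.1) > best)
```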

pred_ln_sales_in_thousands = -2.5129 - 0.0558 * Price_in_thousands + 0.3565 * Engine_size + 0.0426 * Wheelbase + 0.1010 * Fuel_efficiency - 0.7385 * Vehicle_type_Passenger

pred_sales_in_thousands = exp(pred_ln_sales_in_thousands)
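As a sketch of how this equation is used for scoring, the function below applies the coefficients listed above and then undoes the log transform. The function name and the example car are hypothetical; only the coefficients come from the equation.

```python
import math

# Coefficients copied from the fitted equation above.
INTERCEPT = -2.5129
COEFFICIENTS = {
    "Price_in_thousands": -0.0558,
    "Engine_size": 0.3565,
    "Wheelbase": 0.0426,
    "Fuel_efficiency": 0.1010,
    "Vehicle_type_Passenger": -0.7385,
}

def predict_sales_in_thousands(car):
    """Predict sales (in thousands) from a dict of feature values."""
    ln_sales = INTERCEPT + sum(
        coef * car[name] for name, coef in COEFFICIENTS.items()
    )
    # The model was fitted on ln(sales), so exponentiate to get sales.
    return math.exp(ln_sales)

# Hypothetical car, values chosen for illustration only.
car = {
    "Price_in_thousands": 20.0,
    "Engine_size": 2.0,
    "Wheelbase": 105.0,
    "Fuel_efficiency": 25.0,
    "Vehicle_type_Passenger": 1,
}
print(round(predict_sales_in_thousands(car), 2))
```

Because the response is on the log scale, each coefficient acts multiplicatively: raising the price by one thousand multiplies predicted sales by exp(-0.0558), about a 5.4% decrease, with the other features held fixed.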

How do we choose which features to include in the model? We're going to use train/test split (and eventually cross-validation).

Why not use p-values or R-squared for feature selection?

• Linear models rely upon a lot of assumptions (such as the features being independent), and if those assumptions are violated, p-values and R-squared are less reliable. Train/test split relies on fewer assumptions.
• Features that are unrelated to the response can still have significant p-values.
• Adding features to your model that are unrelated to the response will never decrease the R-squared value on the training data (and will almost always increase it), and adjusted R-squared does not sufficiently account for this.
• p-values and R-squared are proxies for our goal of generalization, whereas train/test split and cross-validation attempt to directly estimate how well the model will generalize to out-of-sample data.
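The point about R-squared can be illustrated with a small experiment on synthetic data (everything below is made up for the demonstration). A feature that is pure noise is added to a model; the training R-squared cannot go down, while the test R-squared gives an honest view of whether the feature helps.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)          # unrelated to the response by construction
y = 3.0 + 2.0 * x + rng.normal(size=n)

# 70:30 train/test split; the rows are already in random order.
train = np.arange(n) < 140
test = ~train

def fit_and_score(features):
    """Fit OLS on the training rows; return (train R-squared, test R-squared)."""
    X = np.column_stack([np.ones(n)] + features)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

    def r2(mask):
        resid = y[mask] - X[mask] @ beta
        total = np.sum((y[mask] - y[mask].mean()) ** 2)
        return 1.0 - np.sum(resid ** 2) / total

    return r2(train), r2(test)

train_a, test_a = fit_and_score([x])
train_b, test_b = fit_and_score([x, noise_feature])
print(f"x only:    train R2={train_a:.4f}, test R2={test_a:.4f}")
print(f"x + noise: train R2={train_b:.4f}, test R2={test_b:.4f}")
```

The training R-squared of the larger model is at least as high as that of the smaller one by construction, so only the test column says anything about generalization.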

More generally:

• There are different methodologies that can be used for solving any given data science problem, and this course follows a machine learning methodology.
• This course focuses on general purpose approaches that can be applied to any model, rather than model-specific approaches.

R-squared is a statistical measure of how close the data are to the fitted regression line.

R-squared is the percentage of the variation in the response variable that is explained by the model.

• R-squared = Explained variation / Total variation
• Total variation is the variation of the response variable around its mean.

R-squared varies between 0% and 100%. 0% means that the model explains none of the variability of the response, while 100% means that it explains all of it. The closer R-squared is to 100%, the better the model fits the data.
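A minimal sketch of the calculation in plain Python, with made-up observed and predicted values. It uses the equivalent form R-squared = 1 - (unexplained variation / total variation), which matches the explained/total definition for a least squares fit with an intercept.

```python
def r_squared(y_true, y_pred):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    total = sum((y - mean_y) ** 2 for y in y_true)
    unexplained = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - unexplained / total

# Made-up values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.3, 6.9, 9.0]

print(r_squared(y_true, y_pred))            # close to 1: most variation explained
print(r_squared(y_true, y_true))            # perfect predictions
print(r_squared(y_true, [6.0] * 4))         # always predicting the mean
```

Perfect predictions give 1.0 (100%), and always predicting the mean of the response gives 0.0 (0%).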

## Other Evaluation metrics for regression problems

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

• MAE is the easiest to understand, because it's the average absolute error.
• MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
• RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

Here's an additional example, to demonstrate how MSE/RMSE punish larger errors:
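A minimal sketch with made-up values: both sets of predictions below have the same MAE, but the second concentrates all of its error in one observation, so its MSE and RMSE are much larger. The helper functions are written out in plain Python for clarity.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

# Made-up values, for illustration only.
y_true = [100, 50, 30, 20]
pred_spread = [90, 50, 50, 30]   # errors of 10, 0, 20, 10
pred_single = [60, 50, 30, 20]   # one large error of 40

for name, pred in [("spread", pred_spread), ("single", pred_single)]:
    print(name, mae(y_true, pred), mse(y_true, pred), round(rmse(y_true, pred), 3))
```

Both prediction sets have an MAE of 10, yet the MSE rises from 150 to 400 and the RMSE from about 12.2 to 20 when the error is concentrated in a single observation.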

The sklearn library has a comprehensive set of APIs for splitting datasets, building models, testing models, and calculating accuracy metrics.
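A short sketch of those four steps using scikit-learn, on synthetic data generated only for this example (the true coefficients 1.5, 2.0 and -1.0 are assumptions of the simulation, not results from the car sales model).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: two features and a noisy linear response.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 1.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# 1. Split the dataset (70:30).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# 2. Build the model on the training data.
model = LinearRegression().fit(X_train, y_train)

# 3. Test the model on the held-out data.
y_pred = model.predict(X_test)

# 4. Calculate evaluation metrics.
mae = mean_absolute_error(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print(f"intercept={model.intercept_:.2f}, coefficients={model.coef_.round(2)}")
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```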

The residuals are randomly distributed, with no visible pattern, so the linear model specification appears adequate.

As the p-values are below 5%, the variables are statistically significant in the regression equation.

## Comparing linear regression with other models

Advantages of linear regression:

• Simple to explain
• Highly interpretable
• Model training and prediction are fast
• No tuning is required (excluding regularization)
• Features don't need scaling
• Can perform well with a small number of observations
• Well-understood