Let's start with linear regression, the most basic tool of statistical learning.

- For brevity, everything here is assumed to be in 1D.
- Mathematical and implementation details are not discussed here, though basic implementations are provided anyway.

Please consult textbooks or Google for details.

- The main reference of this notebook is "The Elements of Statistical Learning" (ESL) 2E by Hastie et al.

Linear regression seeks the best straight line to explain the data.

In other words, linear regression is like this: I look at the scatterplot, hold a big ruler, and simply draw a straight line that intuitively suits the data.

@@0@@

The solution -- or the estimate -- @@1@@ of the above problem will be more or less similar to my straight line.
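As a sketch of this line-fitting idea, here is a minimal 1D ordinary least squares fit in NumPy (the data below is synthetic, invented purely for illustration):

```python
import numpy as np

# Synthetic 1D data: a true line y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Closed-form OLS: stack a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # [intercept, slope]
```

The recovered `beta` should land close to the true intercept and slope, which is exactly the "ruler line" intuition made precise.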


The response variable becomes discrete, and the "classification problem" pursues an answer to the question: where does this item belong?

@@0@@

But linear regression doesn't care much anyway.


One popular choice is to introduce the indicator (or dummy) variables @@1@@ instead of @@2@@. Then we can equivalently represent @@3@@ as

@@4@@

If linear regression is applied to each of @@5@@ separately, we get 3 fitted values, and it is natural to take the highest one as our answer.
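A sketch of this one-regression-per-class recipe, on synthetic 1D data with three groups (the class means are chosen arbitrarily for illustration):

```python
import numpy as np

# Three 1D classes centered at -3, 0, and 3 (synthetic data).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(m, 1.0, 40) for m in (-3.0, 0.0, 3.0)])
labels = np.repeat([0, 1, 2], 40)

# Indicator matrix Y: column k is 1 exactly where the label is k.
Y = (labels[:, None] == np.arange(3)).astype(float)

X = np.column_stack([np.ones_like(x), x])
B = np.linalg.lstsq(X, Y, rcond=None)[0]  # one linear fit per class column
fitted = X @ B                            # n x 3 fitted values
pred = fitted.argmax(axis=1)              # the highest fitted value wins
```

The two outer classes are recovered well; with three or more classes this approach can mask the middle class, one of the drawbacks discussed in ESL chapter 4.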

Using the indicator variables, we may intuitively regard the fitted values as scores, or probabilities, of belonging to each group. Theoretically this is also justified.

With an indicator variable @@0@@,

@@1@@

But the fitted values that linear regression provides do not seem suitable as probabilities. So here come better approaches like linear discriminant analysis (LDA) and logistic regression (LDA is not covered here).

In a nutshell, logistic regression tries to make the regression function satisfy the properties of probability.

@@0@@

It is monotone and continuous, and so is its inverse (called the logistic function).

@@1@@

Therefore it seems well suited for converting a real number into a probability.
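As a small sketch, the logit and its inverse written out in NumPy (the function names here are mine):

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to the whole real line."""
    return np.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: map any real number back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The two are inverses of each other, and both are monotone.
p = np.array([0.1, 0.5, 0.9])
roundtrip_ok = np.allclose(logistic(logit(p)), p)
```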


Using indicator variables, what we have done is

@@0@@

But if we want more probability-like estimates, the monotone transformation comes into play:

@@1@@

where we arrive at the formulation of logistic regression.
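To make the formulation concrete, here is a minimal sketch that fits a binary 1D logistic regression by plain gradient ascent on the log-likelihood (synthetic data and a hand-rolled loop; in practice one would use an off-the-shelf solver):

```python
import numpy as np

# Two overlapping 1D classes (synthetic data).
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1, 1, 60), rng.normal(1, 1, 60)])
y = np.repeat([0.0, 1.0], 60)
X = np.column_stack([np.ones_like(x), x])  # intercept + slope

beta = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # logistic of the linear predictor
    beta += 0.1 * X.T @ (y - p) / len(y)   # averaged score (gradient) step

pred = (1.0 / (1.0 + np.exp(-X @ beta)) > 0.5).astype(int)
```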


Logistic regression preserves the decision rule (or decision boundary) as a linear one, which looks definitely better.

A monotone transform does not affect the order of the original input values.

Therefore, since -- when using the indicator variables -- we take the maximum fitted value as the estimate, the logit transform does no harm.
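A one-line check of this argmax-preservation claim, with toy numbers:

```python
import numpy as np

scores = np.array([0.2, 1.4, -0.3])           # arbitrary fitted values
probs = 1.0 / (1.0 + np.exp(-scores))         # monotone (logistic) transform
same = np.argmax(scores) == np.argmax(probs)  # ordering, hence argmax, unchanged
```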

Logistic regression makes the same decision as the linear regression with indicator variables. What's the difference then?

- Logistic regression keeps the decision boundary linear as the linear regression does.
- Moreover, it makes the solution more sensible at test time, i.e., in real applications.

Suppose we fit the linear regression model with indicator variables. Then we should test the model. What if a test point lies outside the range (or convex hull) of the training data?

This is a consequence of the rigid nature of linear regression, especially if we make predictions outside the hull of the training data. These violations (of probabilistic properties) in themselves do not guarantee that this approach will not work. (page 104 of ESL, @@0@@ 4.2)
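The extrapolation problem quoted above is easy to reproduce: a linear fit of a 0/1 indicator happily returns values above 1 (or below 0) outside the training range (synthetic data for illustration):

```python
import numpy as np

# Two 1D classes, response coded as a 0/1 indicator (synthetic data).
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
y = np.repeat([0.0, 1.0], 50)
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Predict at a point far outside the training range.
x_new = 10.0
fit = beta[0] + beta[1] * x_new  # exceeds 1: not a valid probability
```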

The logistic regression model arises from the desire to model the posterior probabilities of the @@1@@ classes via linear functions in @@2@@, while at the same time ensuring that they sum to one and remain in @@3@@. (page 119 of ESL, @@4@@ 4.4)

As you may see in an actual implementation of logistic regression, it usually requires maximum likelihood estimation with the multinomial distribution, and the solution is not in closed form, so we solve it via an iterative procedure.
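For the binary case, that iterative procedure is typically Newton-Raphson (equivalently, iteratively reweighted least squares, as in ESL section 4.4); a compact NumPy sketch on synthetic data:

```python
import numpy as np

# Two overlapping 1D classes (synthetic data).
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
y = np.repeat([0.0, 1.0], 50)
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
for _ in range(25):                  # Newton-Raphson (a.k.a. IRLS)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)                # weights: variance of each Bernoulli
    grad = X.T @ (y - p)             # score vector
    H = X.T @ (X * W[:, None])       # observed information matrix
    beta += np.linalg.solve(H, grad)

p_hat = 1.0 / (1.0 + np.exp(-X @ beta))  # fitted probabilities, all in (0, 1)
```

Unlike the fitted values of indicator-variable linear regression, `p_hat` is guaranteed to stay inside (0, 1).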

Therefore logistic regression is called a linear method because the decision boundary it provides is linear, not because we can obtain its estimates by solving a linear equation.

End of the story.