Author: Don Kim (github.com/dgkim5360)
Originally uploaded to the GitHub repository.

This notebook aims to convey the gist of neural networks, following the basic flow of a machine learning course step by step. I try to deliver the idea of "what neural networks do" while minimizing the discussion of "how to compute/implement" them. The main reference is the famous textbook "The Elements of Statistical Learning", second edition, by Hastie et al.

However, this notebook contains implementations anyway, and they are not always explained. For example, the Newton-Raphson algorithm for logistic regression is used but not explained here. Please consult any textbook for the algorithmic and implementation details.

• Supervised Learning, Regression
• Linear Regression
• Basis Expansion
• Supervised Learning, Classification
• Linear Regression, Again
• Logistic Regression
• Finally, Neural Networks
• Make It Simple, Stupid
• Activation vs. Basis Expansion
• Implementation with PyTorch

In the supervised learning problem, we would like to learn a function @@0@@, which predicts the output @@1@@ from the input @@2@@.

@@3@@

The @@4@@ is referred to as the regression function, and this supervised learning problem is also known as function approximation in the field of mathematics.

The supervised learning typically proceeds as follows: Compute @@5@@!

Since, of course, it is extremely difficult to do such a task without a clue, we simplify the situation by endowing @@6@@ with a special property.

The most drastic assumption is that @@0@@ is linear.

With @@1@@, we can write the linear regression function as

@@2@@

where, for brevity, @@3@@ and @@4@@ are @@5@@-vectors that both include the intercept term:

• @@6@@
• @@7@@

Now estimating @@8@@ becomes equivalent to finding @@9@@. Loosely speaking, linear regression means looking at the point cloud and drawing the straight line that fits it best. The linear regression fit @@10@@ is computed by least squares, which is not covered here.
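Although least squares is not derived here, a minimal numpy sketch may help fix the idea. The toy data, the seed, and the helper name `predict` are assumptions made for illustration only; they are not the notebook's original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy straight line y = 1 + 2x.
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the intercept is part of the coefficient vector.
X = np.column_stack([np.ones_like(x), x])

# Least squares solution of X @ beta ~ y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict(x_new):
    """Evaluate the fitted straight line at new inputs."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    return np.column_stack([np.ones_like(x_new), x_new]) @ beta_hat
```

Here `beta_hat[0]` plays the role of the intercept and `beta_hat[1]` the slope.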

Here goes the simplest example.

@@0@@

From the above picture, the computed straight line looks both good and bad, depending on the question.

1. Does the line explain the original data? Bad. We actually picked the wrong model, since we saw curvy data but fit it with a straight line. (High bias)
2. How well does the line predict new data? Good. (Low variance)

The straight line is simple, and simple is good, because

• it makes the computation easier,
• it makes the interpretation easier, and
• it works well with scarce data.

Furthermore, we can exploit the linear regression beyond the straight line.

A reason why many textbooks primarily deal with linear regression is that it allows us to fit more complex shapes with the same method.

The idea is simple. Assume we want to fit a quadratic curve instead of a straight line. Then we change the original linear model

@@0@@

to

@@1@@

To get the estimate of @@2@@, the corresponding training data are transformed in the same way

@@3@@

and the final computation is done with the plain linear regression.

We can fit higher-order polynomials in this way, still with plain linear regression (fitting the sine wave is still hard though).
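As a rough illustration of this trick, the sketch below fits a quadratic curve by adding a squared column to the design matrix and reusing ordinary least squares. The data-generating coefficients and the seed are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical curvy data: y = 0.5 - x + 3x^2 plus noise.
x = rng.uniform(-1.0, 1.0, size=100)
y = 0.5 - 1.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=100)

# Expanded design matrix with columns 1, x, x^2.
X_quad = np.column_stack([np.ones_like(x), x, x**2])

# The solver is unchanged: it still only looks for linear coefficients.
beta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
```

The model is quadratic in `x` but still linear in the coefficients, which is why the same solver applies.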

In summary, whatever variables come in (e.g. @@0@@ or @@1@@), the linear regression only aims to compute their linear coefficients. This is the idea of the basis expansion.

Let's name the transformation which takes input @@0@@:

@@1@@

Then the linear basis expansion model is written as

@@2@@

Below are some examples of the basis expansion for easy understanding.

• If @@3@@ for all @@4@@, it comes down to the linear regression model.
• Additional higher-order variables make the polynomial regression model.
• You can add some nonlinear ones like @@5@@ or @@6@@ if you wish.

We can add variables not only by introducing new data, but also by restricting the existing data.

• Setting @@7@@ with some constants @@8@@ and @@9@@ allows the model to deal with the data only within some windows or intervals.

And we can mix all such variables.

For your information, @@10@@ is called the indicator function, returning 0 or 1 depending on the conditional expression.

@@11@@
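The examples above can be mixed freely. The sketch below, with an arbitrarily chosen list of basis functions (intercept, linear term, a sine, and an interval indicator), shows one possible way to build the expanded design matrix and fit it with plain least squares. The function name `expand` and the data are hypothetical.

```python
import numpy as np

# Hypothetical basis functions; any list of callables works.
basis = [
    lambda x: np.ones_like(x),                          # intercept
    lambda x: x,                                        # linear term
    lambda x: np.sin(2.0 * np.pi * x),                  # nonlinear term
    lambda x: ((x >= 0.5) & (x < 1.0)).astype(float),   # indicator of [0.5, 1)
]


def expand(x):
    """Stack the basis function values as columns of a design matrix."""
    return np.column_stack([h(x) for h in basis])


rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2.0 * np.pi * x) + 0.5 * (x >= 0.5) + rng.normal(scale=0.1, size=200)

# Linear regression on the expanded variables.
beta_hat, *_ = np.linalg.lstsq(expand(x), y, rcond=None)
y_fit = expand(x) @ beta_hat
```

Adding or removing an entry of `basis` changes the model while the fitting step stays untouched.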

Here we briefly check the natural cubic spline, which is quite a popular method of basis expansion. As the name suggests, the model is a 3rd-order polynomial, with additional properties:

1. Split the domain of @@0@@, and fit each split region with a cubic curve. The splitting points are called knots.
2. The function values at the knots should coincide. The first and second derivatives at the knots should also coincide.
3. Assume the fit is linear outside of the boundary knots.

The natural cubic spline imposes these constraints in order to make the fit "natural". Also note that the linearity assumption outside the boundary knots is due to the difficulty of fitting near the boundaries.

Although it is worth learning what the natural cubic spline is, it is more important to understand the idea of basis expansion, which ultimately boils the model down to solving a linear regression problem with whatever variables we compose.

Above is a gorgeous estimation fit.

There is one difficulty with the basis expansion: we must decide which @@0@@'s to use. Note that the natural cubic spline resolves this difficulty by fixing the model as a cubic polynomial, but one still has to decide the number and location of the knots.
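For readers who want to see the spline as a basis expansion, here is a minimal sketch using the truncated-power form of the natural cubic spline basis found in the reference textbook. The function name `natural_cubic_basis`, the knot locations, and the sine-wave data are assumptions for illustration, not the notebook's original code.

```python
import numpy as np


def natural_cubic_basis(x, knots):
    """Natural cubic spline basis: K knots give K columns."""
    x = np.asarray(x, dtype=float)
    knots = np.sort(np.asarray(knots, dtype=float))
    K = len(knots)

    def pos_cube(t):
        # Truncated power function: t^3 where t > 0, else 0.
        return np.maximum(t, 0.0) ** 3

    def d(k):
        num = pos_cube(x - knots[k]) - pos_cube(x - knots[-1])
        return num / (knots[-1] - knots[k])

    columns = [np.ones_like(x), x]
    # The differences d(k) - d(K - 2) cancel the quadratic and cubic
    # terms beyond the last knot, so the fit is linear there.
    columns += [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(columns)


rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, size=100))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=100)

knots = np.linspace(0.1, 0.9, 5)   # hypothetical choice of 5 knots
N = natural_cubic_basis(x, knots)

# Once the basis is built, the fit is again plain least squares.
beta_hat, *_ = np.linalg.lstsq(N, y, rcond=None)
y_fit = N @ beta_hat
```

Changing `knots` is the only modelling decision left, which matches the point made above.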

The regression problem is basically a good way to predict a continuous output @@0@@. And it is OK to use a discrete @@1@@ for the regression problem.

A discrete @@2@@ turns the problem into classification. For example, "whether @@3@@ is 0 or 1" can come from "whether @@4@@ is a human or an animal" or "whether @@5@@ has a disease or not" or any other scenario.

Below is a 2D example of the binary @@0@@. The lengthy code is due to the data generation and visualization in 2D and 3D, but the code for the actual computation is quite short.

• The 3D plane is the actual linear regression fit, and
• the black line is the collection of the fitted values equal to 0.5. We set this line as a decision boundary to split the regions of BLUE and ORANGE.

Is it perfect to set the decision boundary as the line of 0.5 fitted values? I don't know, but at least it seems natural to use the middle value of 0 and 1 as the split point. However we need to consider more cases.

• As seen in the 3D plane fit, the region with fitted value @@0@@ is way narrower than the other regions!

To resolve such issues, indicator variables are introduced and they consist of binary @@1@@'s.

@@2@@

Since there is no change in the training data @@3@@, fitting all the indicator variables takes little extra work (in fact, one matrix computation is enough).

Then for some data point, we would get the estimate @@4@@ and pick the category with the largest value among them.
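The "one matrix computation" can be sketched as follows: the class labels are turned into an indicator matrix, all columns are fit in a single least squares call, and each point is assigned to the class with the largest fitted value. The three-class blob data and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 2D data with 3 classes, each a Gaussian blob.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(scale=0.7, size=(300, 2))

# Indicator response matrix: one column per class, a single 1 per row.
Y = np.eye(3)[labels]

# One least squares solve fits all the indicator columns at once.
X1 = np.column_stack([np.ones(len(X)), X])
B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# Classify each point by the largest fitted value.
fitted = X1 @ B_hat
predicted = fitted.argmax(axis=1)
accuracy = (predicted == labels).mean()
```

The fitted values in each row sum to one, yet individual entries can fall below 0 or above 1, so they are not proper probabilities.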

Indicator variables bring the concept of probability in.

@@0@@

However the fitted values computed by the previous linear model do not satisfy the properties of probability (see the 3D plane fit above). Logistic regression resolves this issue.

Here goes the logit transform.

@@0@@

Since the logit transform is monotone and continuous, so is its inverse (logistic function).

@@1@@

The graph below shows that the logistic transform is suitable for turning a real value into a probability-like value.
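As a small numerical check, the two transforms can be written in a few lines. This sketch assumes the standard definitions (the log-odds for the logit and its inverse for the logistic function); the function names are chosen for the example.

```python
import numpy as np


def logit(p):
    """Map a probability in (0, 1) to the real line (log-odds)."""
    return np.log(p / (1.0 - p))


def logistic(t):
    """Inverse of logit: map a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))


t = np.linspace(-6.0, 6.0, 13)
p = logistic(t)          # every value lies strictly between 0 and 1
t_back = logit(p)        # recovers the original real values
```

Applying `logit` after `logistic` returns the input, and `p` increases monotonically with `t`.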

We can write the linear regression problem with indicator variables as follows.

@@0@@

As mentioned previously, these linear combinations do not satisfy the properties of probability. So the logistic transform steps in here to help.

@@1@@

This is the logistic regression problem.

Now the procedure to compute the estimate of @@2@@ gets more complicated because a nonlinear function is introduced into the problem. The implementation below uses the Newton-Raphson method to minimize the loss iteratively.

We can see that the logistic regression computes a beautiful fit within the range between 0 and 1 for the binary classification problem (which is what we are currently dealing with). Then the decision boundary of the 0.5 split becomes much more plausible. Also note that the linearity of the decision boundary is preserved.
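A minimal sketch of such a Newton-Raphson iteration for the binary case is given here. It is not the notebook's original code: the function name `fit_logistic_newton`, the stopping rule, and the overlapping two-blob data are assumptions, and the sketch presumes the classes are not perfectly separable (otherwise the iteration does not settle).

```python
import numpy as np


def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression by Newton-Raphson.

    X must already contain a column of ones; y holds 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # current fitted probabilities
        gradient = X.T @ (y - p)               # gradient of the log-likelihood
        w = p * (1.0 - p)                      # diagonal of the weight matrix
        hessian = X.T @ (X * w[:, None])       # X^T W X (negative Hessian)
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(6)
n = 200
# Hypothetical overlapping classes in 2D.
X = np.vstack([rng.normal(size=(n, 2)), rng.normal(size=(n, 2)) + 1.5])
y = np.concatenate([np.zeros(n), np.ones(n)])
X1 = np.column_stack([np.ones(len(X)), X])

beta_hat = fit_logistic_newton(X1, y)
p_hat = 1.0 / (1.0 + np.exp(-X1 @ beta_hat))
accuracy = ((p_hat > 0.5) == (y == 1.0)).mean()
```

Each iteration solves a weighted least squares problem, which is why the method is also known as iteratively reweighted least squares.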

For more detail, you may check the other article.

Finally it's time for neural networks. Let's start with the simplest one, the single layer perceptron model.

@@0@@

where

• @@1@@ and @@2@@ are vectors of @@3@@'s and @@4@@'s, respectively:
@@5@@,
@@6@@
• @@7@@ is called activation function,
• @@8@@ is called output function.
• The hidden layer @@9@@ is between the input layer @@10@@ and the output layer @@11@@.
• Each node of a layer is called a unit.

Below is the famous figure of the single layer perceptron.

• Only one unit is enough for the output layer if we want to solve the regression problem.
• If it is the @@12@@-class classification problem, the output layer contains @@13@@ indicator variables @@14@@ representing the probability of each class.
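The flow from the input layer through the hidden layer to the output layer can be sketched as a short forward pass. The parameter names (`alpha0`, `alpha`, `beta0`, `beta`) follow the reference textbook's notation as an assumption, and the layer sizes and random weights are purely illustrative.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def identity(t):
    return t


def forward(X, alpha0, alpha, beta0, beta, activation=sigmoid, output=identity):
    """Single hidden layer network: input -> hidden units -> output."""
    Z = activation(alpha0 + X @ alpha)   # hidden layer, shape (n, M)
    T = beta0 + Z @ beta                 # linear combination of hidden units
    return output(T), Z


rng = np.random.default_rng(7)
p, M, K = 2, 6, 1                        # hypothetical layer sizes
alpha0, alpha = rng.normal(size=M), rng.normal(size=(p, M))
beta0, beta = rng.normal(size=K), rng.normal(size=(M, K))

X = rng.normal(size=(5, p))
y_regression, Z = forward(X, alpha0, alpha, beta0, beta)
y_probability, _ = forward(X, alpha0, alpha, beta0, beta, output=sigmoid)
```

With the identity output the single unit returns a real value for regression; swapping in a sigmoid output squeezes the same value into (0, 1).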

From now on, we study what the spider-web-like network does step-by-step.

It is quite difficult, at least for me, to comprehend the complete flow of the neural network from the above figure and the model formula. It helps a lot to reduce the problem to simpler ones.

We first throw away the @@1@@.

@@2@@

Then the neural network model is much simplified to a linear model.
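This collapse can be verified numerically. In the sketch below, with hypothetical layer sizes and random weights, a network whose hidden layer has no activation produces exactly the same outputs as a single linear model whose coefficients are the product of the two weight matrices.

```python
import numpy as np

rng = np.random.default_rng(5)
p, M, K = 3, 4, 2   # hypothetical layer sizes

alpha0, alpha = rng.normal(size=M), rng.normal(size=(p, M))
beta0, beta = rng.normal(size=K), rng.normal(size=(M, K))
X = rng.normal(size=(10, p))

# Network with the activation removed (identity in the hidden layer).
T_network = beta0 + (alpha0 + X @ alpha) @ beta

# Equivalent single linear model.
intercept = beta0 + alpha0 @ beta
coef = alpha @ beta
T_linear = intercept + X @ coef
```

However many hidden units are used, the composed map is still one matrix `coef` and one vector `intercept`.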

The output function may confuse us into thinking that the reduced model is not linear, because the output function @@3@@ is nonlinear. Note that, as the name suggests, the output function transforms the output, not the input (consider why logistic regression is classified as a linear method).

Here we make a bold move to remove the hidden layer. Why not? Also it would be worth thinking about how the structure of the network changes.

• No hidden layer, no @@5@@.
• Leave only one unit in the output layer, @@6@@.
• No output function, @@7@@.

Then the network exactly comes down to a linear regression model.

@@8@@

Now adding the output function @@10@@ as a logistic function yields the logistic regression model for the binary classification problem.

@@11@@

Note that in the context of neural networks, the logistic function is called the sigmoid function.

If we raise @@12@@, the logistic regression model can solve the @@13@@-class classification problem.

@@14@@

I would like to just mention that the softmax output function can do almost the same job.
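To make "almost the same job" slightly more concrete, here is a small sketch of the softmax function. It also checks the two-class case, where softmax over the pair (t, 0) gives the sigmoid of t for the first class. The function names and the example scores are chosen for illustration.

```python
import numpy as np


def softmax(t):
    """Softmax over the last axis, shifted for numerical stability."""
    t = np.asarray(t, dtype=float)
    e = np.exp(t - t.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


# Each row of scores becomes a vector of class probabilities summing to 1.
scores = np.array([[2.0, 1.0, -1.0], [0.0, 0.0, 0.0]])
probs = softmax(scores)

# Two-class check: softmax over (t, 0) equals sigmoid(t) for the first class.
t = np.linspace(-3.0, 3.0, 7)
two_class = softmax(np.column_stack([t, np.zeros_like(t)]))
```

Equal scores give equal probabilities, and larger scores get larger shares.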

We have confirmed that, without activation functions, the neural networks are simplified to linear models. Now we check how it goes if the activation is alive. To focus on the activation functions, we still minimize the remaining settings.

• The activation function @@1@@ and the number of hidden units @@2@@ come in.
• Assume @@3@@ and no output function, @@4@@.

@@5@@

Except for the fact that the same activation function @@6@@ is used for all @@7@@, this network resembles the basis expansion model. In other words, the hidden layer units chase the nonlinearity based on the input units just like the basis functions do (so @@8@@ is also called a derived feature).

If we intentionally set each activation function to a custom function like

@@0@@

then the network is the same as the basis expansion model. The basis expansion model fixes everything in advance. That is, from the perspective of neural networks, all the activations @@1@@ and all the values of the coefficients @@2@@ are completely fixed from the start.

In the analogy of parenting styles, the basis expansion is the helicopter mom, while the neural network mom has a hands-off parenting style. What should we decide in order to construct the neural network model?

• Activation function @@3@@,
• The number of hidden units @@4@@,
• Output function @@5@@

The neural network mom seems to care about her children more than we thought (what a mother's love). With @@6@@ hidden units decided, these units grow on their own as time goes on. In other words, the values of @@7@@ keep being adjusted to fit the training data in an iterative manner (the iterative optimization is inevitable due to the nonlinearities of the model).

In the meantime, we still need to decide the fine settings of the model like below:

• which activation and output functions to use,
• how the information propagates between layers (e.g. setting neighborhoods within which the information propagates only).

We can see from the famous MNIST example that the results are totally different depending on how the network is constructed.

So if you can pick, which mom do you prefer?

No, this is not the appropriate question. Since we here have to assess the model, the proper question is "which parenting style is better?"

Suppose that their children succeed, in whatever sense.

The helicopter mom can say that the success came from her strategy. It is due to her plans and schedules for her children.

• She decides the variables and their transformations by hand.
• The reasoning behind such decisions comes from domain knowledge or insight into the data and the problem. It is very hard to get right.
• But the resulting model has a clear meaning and is easy to interpret.
• Therefore the power of the model is mainly due to her strategy. In other words, she gets clear feedback from the result (good or bad).

How about the other side? The other mom may feel happy, but without knowing why the success was achieved (if she cares). Maybe her child was born to succeed.

• The units of hidden layers automatically develop themselves, adjusting from the given data.
• The training procedure is relatively easier than the helicopter mom's. For instance, we may compare testing various @@8@@ with testing many more combinations of basis functions.
• Since the core units grow on their own, the model is opaque and has no clear meaning to humans.

Compared to statistical learning models, which are based on definite relations, the black-box nature of neural networks is an obvious disadvantage. On the other hand, the flexibility of neural networks is an equally obvious advantage.

As mentioned from the start, this notebook does not address how to train the neural network model. To do this, the concept of loss should be introduced much earlier, and that explanation may result in another article.

The code goes on though. Here the single layer perceptron is trained for the previous 2D example. This notebook uses PyTorch, one of the famous deep learning libraries.

The reasons why this notebook tries to use numpy/scipy only are

1. to minimize the burden of learning external frameworks,
2. to show more clearly how the learning procedure goes.

However, in the case of neural networks, we would need to implement every training algorithm from scratch, depending on the activation and output functions. If we wanted to make such a procedure (computing gradients) work more generally, the resulting implementation would be more or less the same as PyTorch.

PyTorch's core modules are torch, torch.autograd, and torch.nn.

• torch provides its basic data type Tensor. This is a numpy.array-like tensor library with GPU support.
If we are capable, it is possible to implement the neural network model only with torch.Tensor. What we need to do is to compute the gradient of the loss.
• torch.autograd does the job -- computing gradients.
If we are capable, it is again possible to implement the model with torch.Tensor and torch.autograd. Then the implementation of various activations, output functions, and loss functions is required.
• torch.nn does the job. It contains such mathematics and provides the blocks to build neural networks.
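To give a feel for the division of labor, the sketch below uses only torch.Tensor and autograd to obtain the gradient of a squared-error loss, and compares it with the gradient written out by hand. The data, the seed, and the choice of loss are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical data for a linear model without intercept.
X = torch.randn(20, 2)
y = torch.randn(20)

# Coefficients tracked by autograd.
beta = torch.zeros(2, requires_grad=True)

# Mean squared error; backward() fills beta.grad with d(loss)/d(beta).
loss = ((X @ beta - y) ** 2).mean()
loss.backward()

# The same gradient derived by hand, for comparison.
manual_grad = 2.0 * X.T @ (X @ beta.detach() - y) / len(y)
```

With autograd, changing the loss or the model requires no new derivation; only the forward computation has to be rewritten.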

Here this notebook exploits everything the framework provides to implement the examples below:

1. A linear regression model without hidden layers
2. A logistic regression model without hidden layers
3. A neural network model with 1 hidden layer, 6 hidden units, and sigmoid activation + sigmoid output
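One possible way to set up these three models with torch.nn is sketched below. The layer structure follows the description above, while the toy data, the loss functions (mean squared error and binary cross-entropy), the Adam optimizer, and its settings are assumptions rather than the notebook's original choices.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical 2D binary data: two Gaussian blobs.
n = 200
X = torch.cat([torch.randn(n, 2), torch.randn(n, 2) + 2.0])
y = torch.cat([torch.zeros(n, 1), torch.ones(n, 1)])

models = {
    # 1. Linear regression: no hidden layer, no output function.
    "linear": nn.Linear(2, 1),
    # 2. Logistic regression: no hidden layer, sigmoid output.
    "logistic": nn.Sequential(nn.Linear(2, 1), nn.Sigmoid()),
    # 3. One hidden layer with 6 units, sigmoid activation and output.
    "one_hidden_layer": nn.Sequential(
        nn.Linear(2, 6), nn.Sigmoid(), nn.Linear(6, 1), nn.Sigmoid()
    ),
}
# Assumed losses: squared error for regression, cross-entropy otherwise.
loss_functions = {
    "linear": nn.MSELoss(),
    "logistic": nn.BCELoss(),
    "one_hidden_layer": nn.BCELoss(),
}


def train(model, loss_fn, n_steps=300, lr=0.05):
    """Plain full-batch training loop; returns the final loss value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()


accuracy = {}
for name, model in models.items():
    train(model, loss_functions[name])
    with torch.no_grad():
        predicted = (model(X) > 0.5).float()   # 0.5 split as decision boundary
        accuracy[name] = (predicted == y).float().mean().item()
```

The three models differ only in how the layers are stacked; the training loop is shared, which is the convenience the framework buys.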