This notebook aims to convey the gist of neural networks, following the basic flow of a machine learning course step by step. I try to deliver the idea of "what neural networks do" while minimizing the discussion of "how to compute/implement". The main reference is the famous textbook "The Elements of Statistical Learning", second edition, by Hastie et al.
However, this notebook contains implementations anyway, even though their details are not explained. For example, the Newton-Raphson algorithm for logistic regression appears in code but is not explained here. Please consult any textbook for the algorithmic/implementational details.
In the supervised learning problem, we would like to learn a function @@0@@, which predicts the output @@1@@ from the input @@2@@.
The @@4@@ is referred to as the regression function, and this supervised learning problem is also known as function approximation in the field of mathematics.
Supervised learning typically proceeds as follows: compute @@5@@!
Since, of course, it is extremely difficult to do such a task without a clue, we simplify the situation by endowing @@6@@ with a special property.
The boldest assumption is that @@0@@ is linear.
With @@1@@, we can write the linear regression function as
where, for brevity, @@3@@ and @@4@@ are @@5@@-vectors that absorb the intercept term.
Now estimating @@8@@ becomes equivalent to finding @@9@@. To exaggerate a little, linear regression means looking at the point cloud and drawing your own proper straight line. The linear regression fit @@10@@ is computed by least squares, which is not covered here.
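Least squares itself is not covered in the text, but for the curious, a minimal numpy sketch (with hypothetical data, not the notebook's) looks like this:

```python
import numpy as np

# Hypothetical 1D training data: a straight line plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# Least squares: prepend a column of ones so beta[0] is the intercept.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta  # fitted values on the straight line
```

With this data, `beta` lands close to the true intercept 2 and slope 3.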
From the above picture, the computed straight line looks both good and bad.
1. Does the line explain the original data? Bad. We actually went wrong in picking the model, since we saw the curvy data but fit a straight line. (High bias)
1. How well does the line predict new data? Good. (Low variance)
The straight line is simple, and simple is good, because
Furthermore, we can exploit the linear regression beyond the straight line.
A reason why many textbooks primarily deal with linear regression is that it allows us to fit in more complex ways with the same method.
The idea is simple. Assume we want to fit a quadratic curve instead of a straight line. Then we change the original linear model
to the quadratic model by adding the quadratic term as
To get the estimate of @@2@@, the corresponding training data change in the same way
and the final computation is done with the plain linear regression.
We can fit higher-order polynomials in this way, with plain linear regression (fitting a sine wave is still hard though).
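The recipe above can be sketched in numpy; the data below are hypothetical stand-ins for the curvy example:

```python
import numpy as np

# Hypothetical curvy data; a quadratic (or higher-order) fit is still
# plain linear regression on the expanded inputs [1, x, x^2, ...].
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.05, size=x.size)

degree = 2
X = np.column_stack([x**d for d in range(degree + 1)])  # columns [1, x, x^2]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)            # same least squares as before
y_hat = X @ beta
```

Raising `degree` gives higher-order polynomial fits with no other change.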
In summary, whatever variables come in (e.g. @@0@@ or @@1@@), the linear regression only aims to compute their linear coefficients. This is the idea of the basis expansion.
Let's name the transformation which takes input @@0@@:
Then the linear basis expansion model is written as
Below are some examples of the basis expansion for easy understanding.
We can add variables not only by introducing new data, but also by restricting the existing data.
And we can mix all such variables.
For your information, @@10@@ is called the indicator function; it returns 0 or 1 depending on the conditional expression.
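As a quick illustration (with made-up basis functions, not necessarily the ones used above), composing such variables in numpy might look like:

```python
import numpy as np

# A few example basis functions h_m(x): the identity, a square, a log,
# and an indicator that is 1 on an interval and 0 elsewhere.
def indicator(cond):
    return cond.astype(float)

x = np.linspace(0.1, 3.0, 5)
H = np.column_stack([
    x,                                   # h_1(x) = x
    x**2,                                # h_2(x) = x^2
    np.log(x),                           # h_3(x) = log(x)
    indicator((1.0 <= x) & (x < 2.0)),   # h_4(x) = I(1 <= x < 2)
])
# H is then handed to plain linear regression as the design matrix.
```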
Here we briefly check the natural cubic spline, which is quite a popular method of basis expansion. As the name suggests, the model is a 3rd-order polynomial with additional properties:
1. Split the domain of @@0@@, and fit each split region with a cubic curve. The splitting points are called knots.
1. The function values at the knots should coincide. The first and second derivatives at the knots should also coincide.
1. Assume the fit is linear outside of the boundary knots.
The natural cubic spline imposes these restrictions in order to make the fit "natural". Also note that the linearity assumption beyond the boundary knots is due to the difficulty of fitting near the boundaries.
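A sketch of the natural cubic spline basis, following the truncated-power form given in the ESL textbook (the knot locations and data here are arbitrary assumptions):

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis in truncated-power form (ESL eq. 5.4-5.5).
    Returns a matrix with columns [1, x, N_3(x), ..., N_K(x)]."""
    knots = np.asarray(knots, dtype=float)
    K = knots.size

    def d(k):  # helper d_k(x) from the ESL formulation
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - knots[-1], 0.0) ** 3
        return num / (knots[-1] - knots[k])

    cols = [np.ones_like(x), x]
    for k in range(K - 2):
        cols.append(d(k) - d(K - 2))
    return np.column_stack(cols)

# The fit is, again, plain least squares on this basis matrix.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
N = natural_cubic_basis(x, knots=[1.0, 3.0, 5.0, 7.0, 9.0])
beta, *_ = np.linalg.lstsq(N, y, rcond=None)
y_hat = N @ beta
```

Note that K knots yield only K basis functions, thanks to the boundary-linearity restriction.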
Although it is worth learning what the natural cubic spline is, it is more important to understand the idea of basis expansion, which ultimately boils the model down to a linear regression problem with whatever variables we compose.
Above is a gorgeous estimation fit.
There is one difficulty with the basis expansion: we must decide which @@0@@'s to use. Note that the natural cubic spline resolves this difficulty by fixing the model as a cubic polynomial, but one still has to decide the location and number of the knots.
Regression is basically a good way to predict a continuous output @@0@@, and it is also OK to use a discrete @@1@@ for the regression problem.
A discrete @@2@@ turns the problem into classification. For example, "whether @@3@@ is 0 or 1" can come from "whether @@4@@ is a human or an animal", "whether @@5@@ has a disease or not", or any other scenario.
Below is a 2D example of the binary @@0@@. The lengthy code is due to the data generation and visualization in 2D and 3D, but the code for the actual computation is quite short.
Is it proper to set the decision boundary at the line of 0.5 fitted values? I don't know, but at least it seems natural to use the middle value between 0 and 1 as the split point. However, we need to consider more cases.
To resolve such issues, indicator variables are introduced; they consist of binary @@1@@'s.
Since there is no change in the training data @@3@@, the fit is computed as before (in fact, one matrix computation is enough).
Then, for a new data point, we get the estimates @@4@@ and pick the category with the largest value among them.
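A minimal sketch of this "one matrix computation" on hypothetical two-class data:

```python
import numpy as np

# Linear regression on an indicator response matrix Y: one column per class,
# with a 1 marking each observation's class. A single least-squares solve
# fits the coefficients for all classes at once.
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
labels = (X[:, 1] + X[:, 2] > 0).astype(int)   # hypothetical binary labels

Y = np.eye(2)[labels]                          # indicator matrix, shape (n, 2)
B, *_ = np.linalg.lstsq(X, Y, rcond=None)      # one matrix computation

Y_hat = X @ B
pred = Y_hat.argmax(axis=1)                    # pick the largest fitted value
```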
Indicator variables bring the concept of probability in.
However the fitted values computed by the previous linear model do not satisfy the properties of probability (see the 3D plane fit above). Logistic regression resolves this issue.
Here goes the logit transform.
Since the logit transform is monotone and continuous, so is its inverse (logistic function).
The below graph shows that the logistic transform is well suited to mapping a real value to a probability-like value.
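The two transforms are one-liners in numpy; here is a quick sanity check on a few arbitrary values:

```python
import numpy as np

# Logit maps (0, 1) to the real line; its inverse, the logistic function,
# maps any real value back into (0, 1) -- a probability-like value.
def logit(p):
    return np.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
p = logistic(z)   # squeezed into (0, 1); logistic(0) = 0.5
```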
We can write the linear regression problem with indicator variables as follows.
And, as mentioned previously, these linear combinations do not satisfy the probabilistic properties. So the logistic transform cuts in here to help.
This is the logistic regression problem.
Now the procedure to compute the estimate of @@2@@ gets more complicated, because a nonlinear function is introduced into the problem. The implementation below uses the Newton-Raphson method to minimize the loss in an iterative manner.
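As a rough sketch (not the notebook's exact implementation), the Newton-Raphson update for two-class logistic regression on hypothetical data can be written as:

```python
import numpy as np

def fit_logistic(X, y, n_iter=20):
    """Newton-Raphson (IRLS) for 2-class logistic regression.
    Assumes X already contains an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current probabilities
        W = p * (1.0 - p)                     # diagonal of the weight matrix
        # Newton step: beta <- beta + (X^T W X)^{-1} X^T (y - p)
        H = X.T @ (X * W[:, None])
        g = X.T @ (y - p)
        beta = beta + np.linalg.solve(H, g)
    return beta

# Hypothetical data drawn from a true logistic model.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([0.5, 2.0, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat = fit_logistic(X, y)
```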
We can see that logistic regression computes a beautiful fit within the range between 0 and 1 for the binary classification problem (which is what we are currently dealing with). The decision boundary at the 0.5 split then becomes much more plausible. Also note that the linearity of the decision boundary is preserved.
For more detail, you may check the other article.
Finally it's time for neural networks. Let's start with the simplest one, the single layer perceptron model.
Below is the famous figure of the single layer perceptron.
From now on, we study step by step what the spider-web-like network does.
It is quite difficult, at least for me, to comprehend the complete flow of the neural network by the above figure and the model formula. It helps a lot to reduce the problem into the simpler ones.
We first throw away the @@1@@.
Then the neural network model is much simplified to a linear model.
The output function may confuse us into thinking that the reduced model is not linear, because the output function @@3@@ is nonlinear. Note that, as the name tells, the output function transforms the output, not the input (consider why logistic regression is classified as a linear method).
Here we make a bold move and remove the hidden layer. Why not? It is also worth thinking about how the structure of the network changes.
Then the network exactly comes down to a linear regression model.
Now adding the output function @@10@@ as a logistic function yields the logistic regression model for the binary classification problem.
Note that in the context of neural networks world, the logistic function is called the sigmoid function.
If we raise @@12@@, the logistic regression model can solve the @@13@@-class classification problem.
I would like to just mention that the softmax output function can do almost the same job.
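A minimal softmax sketch; with two classes it reduces to the logistic function applied to the score difference:

```python
import numpy as np

# Softmax turns K real scores into K positive values summing to 1.
def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical class scores
probs = softmax(scores)              # probability-like outputs
```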
We have confirmed that, without activation functions, the neural networks are simplified to linear models. Now we check how it goes if the activation is alive. To focus on the activation functions, we still minimize the remaining settings.
Except for the fact that the same activation function @@6@@ is used universally for all @@7@@, this network resembles the basis expansion model. In other words, the hidden layer units capture nonlinearity from the input units just as the basis functions do (so @@8@@ is also called a derived feature).
If we intentionally set each activation function to a custom function like
then the network is the same as the basis expansion model. The basis expansion model makes every setting compulsory: in the perspective of neural networks, all the activations @@1@@ and all the values of the coefficients @@2@@ are completely fixed from the start.
In the analogy of parenting styles, the basis expansion is the helicopter mom, while the neural network mom has a hands-off parenting style. What should we decide in order to construct the neural network model?
The neural network mom seems to care for her children more than we thought (what a mother's love). With @@6@@ hidden units decided, these units grow on their own as time goes on. In other words, the values of @@7@@ keep being adjusted to fit the training data in an iterative manner (the iterative optimization is inevitable due to the nonlinearities of the model).
In the meantime, we still need to decide the fine settings of the model like below:
We can see from the famous MNIST example that the results are totally different depending on how the network is composed.
So if you can pick, which mom do you prefer?
No, this is not the appropriate question. Since we have to assess the model here, the proper question is "which parenting style is better?"
Suppose that their children succeed, in whatever sense.
The helicopter mom can say that the success came from her strategy. It is due to her plans and schedules for her children.
How about the other side? The other mom may feel happy, but without knowing why the success was achieved (if she cares at all). Maybe her child was born to succeed.
Compared to the statistical learning models, which are based on definite relations, the black-box character of neural networks is an obvious disadvantage. On the other hand, the flexibility of neural networks is an equally obvious advantage.
As mentioned from the start, this notebook does not address how to train the neural network model. To do this, the concept of loss should be introduced much earlier, and that explanation may result in another article.
The code goes on though. Here the single layer perceptron is trained on the previous 2D example. This notebook uses PyTorch, one of the famous deep-learning libraries.
The reason why this notebook has tried to use numpy/scipy only is
1. to minimize the load of learning the usage of external frameworks,
1. to show more clearly how the learning procedure goes.
However, in the case of neural networks, we would need to implement from scratch a training algorithm for every combination of activation functions and output functions. If we want to make such a procedure (computing gradients) possible more universally, the resulting implementation would be more or less the same as PyTorch.
PyTorch contains a few core modules.
1. `torch` provides the basic data type `torch.Tensor`, a `numpy.array`-like tensor library with GPU support.
1. What we need to do is compute the gradient of the loss; `torch.autograd` does the job -- computing gradients.
1. Then the implementation of various activations, output functions, and loss functions is required; `torch.nn` does the job. It contains such mathematics and provides the blocks to build neural networks.
Here this notebook exploits all that the framework provides to implement the examples below:
1. A linear regression model without hidden layers
1. A logistic regression model without hidden layers
1. A neural network model with 1 hidden layer, 6 hidden units, and sigmoid activation + sigmoid output
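A sketch of the third model under assumed sizes (2 input units, 6 sigmoid hidden units, 1 sigmoid output) and made-up data; the notebook's actual code and data may differ:

```python
import torch
import torch.nn as nn

# 1 hidden layer, 6 hidden units, sigmoid activation + sigmoid output.
model = nn.Sequential(
    nn.Linear(2, 6),   # 2 input units -> 6 hidden units
    nn.Sigmoid(),      # activation function
    nn.Linear(6, 1),   # hidden units -> 1 output unit
    nn.Sigmoid(),      # output function, squeezes into (0, 1)
)

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

# Hypothetical 2D binary data, not the notebook's real example.
torch.manual_seed(0)
X = torch.randn(200, 2)
y = ((X[:, 0] + X[:, 1]) > 0).float().unsqueeze(1)

for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # forward pass + loss
    loss.backward()               # torch.autograd computes the gradients
    optimizer.step()              # one step on all parameters
```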
The neural network model gave us an awesome decision boundary that we haven't seen before. However, note that fitting the training data well and performing well on test data are two entirely different stories. There are a lot of theoretical and practical studies on this. Maybe a future article will continue with the philosophy of training and testing (i.e., model assessment).
Here ends the journey to understand the basics of neural networks. This is the tip of the iceberg, and real-world applications use much more complex networks like CNNs and RNNs (especially LSTM). Keep studying and practicing!