LSTM (Long Short-Term Memory) networks have gained a lot of recognition in the recent past. LSTMs are an interesting type of deep learning network, used in some fairly complex problem domains such as language translation, automatic image captioning and text generation. LSTMs are designed specifically for sequence prediction problems. This post starts with an introduction to LSTMs, their importance, their mechanics and LSTM architectures, and closes with getting the most out of LSTM models.

MLPs and neural networks in general are appealing for time series forecasting and sequence prediction: they are robust to noise, nonlinear by nature, and can have multivariate inputs and outputs. Applying an MLP to sequence prediction requires that the input sequence be divided into smaller overlapping subsequences, each used to generate a prediction. The time steps of the input sequence become input features. The subsequences overlap, simulating a window slid along the sequence in order to generate the output.
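A minimal sketch of that windowing step (the function name and window size are illustrative, not from the original code):

```python
import numpy as np

def sliding_windows(series, window):
    """Split a 1-D series into overlapping input windows and
    one-step-ahead targets, as described above."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # window of lagged observations
        y.append(series[i + window])     # next value to predict
    return np.array(X), np.array(y)

series = np.arange(10)           # toy series 0..9
X, y = sliding_windows(series, window=3)
print(X.shape, y.shape)          # (7, 3) (7,)
print(X[0], y[0])                # [0 1 2] 3
```

Each window of 3 lagged observations becomes one row of input features, which is exactly how the time steps are flattened into features for an MLP.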

This can work well, but there are limitations:

1. Stateless - MLPs learn a fixed function approximation. Any outputs that are conditional on the context of the input sequence must be generalized and frozen into the network weights.

2. Unaware of temporal structure - Time steps are modeled as input features, meaning the network has no explicit handling or understanding of the temporal structure or order between observations.

3. Messy scaling - For problems that require modeling multiple parallel input sequences, the number of input features increases by a factor of the size of the sliding window, without any explicit separation of the time steps of each series.

4. Fixed-size inputs - The size of the sliding window is fixed and must be imposed on all inputs to the network.

5. Fixed-size outputs - The size of the output is fixed, and any outputs that do not conform must be forced.

RNNs to the rescue. An LSTM network is a type of RNN, a special type of neural network designed specifically for sequence problems. Given a standard feed-forward network, an RNN can be thought of as the addition of loops to the architecture.

For example, in a given layer each neuron may pass its signal sideways in addition to forward to the next layer. The output of the network may feed back as an input to the network along with the next input vector, and so on.


The recurrent connections add state, or memory, to the network and allow it to learn and harness the ordered nature of observations within input sequences.

An RNN contains cycles that feed the network activations from a previous time step back in as inputs, influencing predictions at the current time step. These activations are stored in the internal state of the network, which can in principle hold long-term temporal contextual information. This mechanism allows an RNN to exploit a dynamically changing contextual window over the input sequence history.

The addition of sequence is a new dimension to the function being approximated. Instead of mapping inputs to outputs alone, the network is capable of learning a mapping function for inputs over time to an output. The internal memory means outputs can be conditional on the recent context in the input sequence, not just on what is currently presented as input to the network. In a sense, this capability unlocks time series for neural networks.

LSTMs are able to solve many time series problems unsolvable by feed-forward networks using fixed-size time windows, because they can learn and harness the temporal dependence in the data.

LSTMs have an internal state: they are explicitly aware of the temporal structure in the inputs, are able to model multiple parallel input series separately, and can step through input sequences of varied length to produce variable-length output sequences, one observation at a time.

Like other RNNs, LSTMs have recurrent connections, so that the activations of the neurons from the previous time step are used as context in forming the output. But unlike other RNNs, LSTMs have a different formulation that allows them to avoid the problems that prevent the training and scaling of other RNNs.

A key technical challenge in the history of RNNs is how to train them effectively. Experiments showed that the weight update procedure resulted in weight changes that quickly became either so small as to have no effect (vanishing gradients) or so large as to cause very large changes or even overflow (exploding gradients).

LSTMs overcome this by design. A standard RNN is limited in the range of contextual information it can access: the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up as it cycles around the network's recurrent connections.
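A toy scalar sketch of that effect: backpropagation through time multiplies the gradient by a recurrent factor once per time step, so over long sequences the result either decays toward zero or explodes, depending on that factor. The function below is purely illustrative.

```python
def gradient_after(steps, recurrent_factor):
    """Repeatedly scale a unit gradient, mimicking what happens to an
    error signal as it cycles through a recurrent connection."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_factor
    return grad

print(gradient_after(100, 0.9))   # ~2.7e-05 -> vanishing gradient
print(gradient_after(100, 1.1))   # ~1.4e+04 -> exploding gradient
```

A factor only slightly below 1 wipes out the signal over 100 steps, while a factor slightly above 1 blows it up, which is why naive RNN training struggles with long-range context.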

For the complete working of the LSTM, refer to this link.
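As a rough sketch of those inner workings, the following NumPy code implements the standard LSTM gate equations for a single time step (input, forget and output gates plus a candidate update); the sizes and random weights are illustrative only, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step, with weights for the four gates stacked
    row-wise in W (input), U (recurrent) and b (bias)."""
    z = W @ x + U @ h_prev + b          # shape (4 * hidden,)
    n = h_prev.shape[0]
    i = sigmoid(z[0:n])                 # input gate
    f = sigmoid(z[n:2*n])               # forget gate
    o = sigmoid(z[2*n:3*n])             # output gate
    g = np.tanh(z[3*n:4*n])             # candidate cell update
    c = f * c_prev + i * g              # new cell state (the "memory")
    h = o * np.tanh(c)                  # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for t in range(10):                     # run a 10-step input sequence
    x = rng.normal(size=n_in)
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (5,)
```

The forget gate is the key: because the cell state is carried forward additively (`f * c_prev + i * g`) rather than squashed through the recurrent weights at every step, the gradient path through the cell avoids the decay described above.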

Applications of LSTMs

Automatic image caption generation - a sequence generation problem. Automatic image captioning is the task where, given an image, the system generates a caption describing it. A CNN is used to detect the objects in the image, and an LSTM then turns the labels into a coherent sentence.

Automatic translation of text - given text in one language, translate it into another language. The model must learn the translation of words and the context in which a translation is modified, and must support input and output sequences that vary in length, both generally and with regard to each other. This is a classic sequence-to-sequence problem.

Automatic handwriting generation - a sequence generation problem. The task: given a corpus of handwriting examples, generate new handwriting for a given word or phrase.

Starting out with the Vanilla LSTM, defined as:

1. Input layer
2. Fully connected LSTM hidden layer
3. Fully connected output layer

Properties of the Vanilla LSTM:

1. Sequence classification conditional on multiple distributed input time steps
2. Memory of precise input observations over thousands of time steps
3. Sequence prediction as a function of prior time steps
4. Robust to the insertion of random time steps into the input sequences
5. Robust to the placement of signal data in the input sequence

Basic LSTM (Vanilla)
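Since the original notebook cells are not reproduced here, below is a minimal sketch of how the input data for an "echo"-style Vanilla LSTM example might be framed. The names `encoded`, `out_index` and `n_features` follow the note below; the alphabet size, sequence length and one-hot encoding are illustrative assumptions.

```python
import numpy as np

# Hypothetical data framing for a vanilla LSTM "echo" task: the
# network sees a sequence of one-hot-encoded values and must echo
# back the value observed at position out_index.
n_features = 10                               # size of the value alphabet
seq_length = 5
out_index = 2                                 # which time step to echo

sequence = np.random.randint(0, n_features, seq_length)
encoded = np.eye(n_features)[sequence]        # one-hot encode: (5, 10)

# LSTM layers expect 3-D input: (samples, time steps, features)
X = encoded.reshape(1, seq_length, n_features)
y = encoded[out_index].reshape(1, n_features) # target, as in the note below
print(X.shape, y.shape)                       # (1, 5, 10) (1, 10)
```

With data in this shape, a single LSTM hidden layer followed by a fully connected output layer can be trained to classify which value appeared at the chosen time step.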


Above is a simple example of a Vanilla LSTM. A point to note: the value of y is set with y = encoded[out_index].reshape(1, n_features), where out_index determines how far ahead or back the target value y sits.

The next example deals with a more complex LSTM. What has been covered so far is a single LSTM; a more realistic situation will require more than one LSTM layer. This introduces the concept of the Stacked LSTM. Before delving into that, let's take another example with some real data. The example below uses a dataset of air passenger counts over time; the idea is to build an LSTM to predict the number of air passengers in the future. This example uses a single-layer, or vanilla, LSTM.
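The dataset itself is not reproduced here, so the sketch below frames a synthetic stand-in series (a trend plus seasonality, loosely resembling passenger counts) for one-step-ahead LSTM prediction; the `look_back` value and min-max scaling are illustrative choices, not the original notebook's code.

```python
import numpy as np

# Synthetic stand-in for the air passenger series: upward trend
# plus a yearly seasonal cycle over 144 monthly observations.
t = np.arange(144)
passengers = 100 + 2.0 * t + 20 * np.sin(2 * np.pi * t / 12)

# Scale to [0, 1] -- LSTMs train more reliably on normalized inputs.
scaled = (passengers - passengers.min()) / (passengers.max() - passengers.min())

# Frame as supervised learning: look_back lagged values -> next value.
look_back = 3
X = np.array([scaled[i:i + look_back] for i in range(len(scaled) - look_back)])
y = scaled[look_back:]

# Reshape to the 3-D (samples, time steps, features) layout an LSTM expects.
X = X.reshape(X.shape[0], look_back, 1)
print(X.shape, y.shape)   # (141, 3, 1) (141,)
```

After training a single-LSTM-layer model on pairs like these, predictions would be inverse-scaled back to passenger counts for evaluation.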


Stacked LSTM


So far we have seen the vanilla LSTM, in which we set the value of y to a look-back in the x sequence by some number of steps, and the LSTM layer used 4 memory units. It's time to graduate to multi-layer LSTMs, so let's focus on developing a Stacked LSTM and use it to solve the damped sine wave problem. A damped sine wave is a sinusoidal function whose amplitude approaches zero as time increases.

First, let's establish why we need stacked LSTMs. Stacking LSTM layers forms a deeper network. Deep neural networks are essentially a massive pipeline in which each layer solves some part of the problem and passes its result on to the next layer. Stacked LSTMs are essentially multiple LSTM layers arranged along the lines of a deep network. The nature of the damped sine wave, with structure at several time scales, makes it a good example for a stacked LSTM.

A bit on the architecture of stacked LSTMs. An LSTM operates on sequence data, and the addition of a new layer adds a level of abstraction over the input observations through time, in effect chunking observations over time, or representing the problem at different time scales. A Stacked LSTM architecture can be defined as an LSTM model comprised of multiple LSTM layers. An LSTM layer can provide a sequence output, specifically one output per input time step, rather than a single output for the whole sequence. Let's dig into the implementation.
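As a sketch of the stacking idea (not the Keras implementation), the NumPy code below runs two LSTM layers over a damped sine wave: the first layer emits one output per time step, which the second layer consumes as its input sequence, mirroring what `return_sequences=True` does in Keras. All weights are random and untrained, purely to illustrate the data flow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(inputs, W, U, b, n_hidden):
    """Run one LSTM layer over a whole sequence and return the output
    at every time step (the equivalent of return_sequences=True)."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in inputs:
        z = W @ x + U @ h + b                      # stacked gate pre-activations
        i, f, o = (sigmoid(z[k*n_hidden:(k+1)*n_hidden]) for k in range(3))
        g = np.tanh(z[3*n_hidden:4*n_hidden])
        c = f * c + i * g                          # cell state update
        h = o * np.tanh(c)                         # per-step output
        outputs.append(h)
    return np.array(outputs)

# Damped sine wave: amplitude decays toward zero as time increases.
t = np.linspace(0, 10, 50)
wave = np.exp(-0.3 * t) * np.sin(2 * np.pi * t)
inputs = wave.reshape(-1, 1)                       # (50 time steps, 1 feature)

rng = np.random.default_rng(1)
def init(n_in, n_hidden):
    """Random (untrained) weights for one layer's four gates."""
    return (rng.normal(scale=0.5, size=(4*n_hidden, n_in)),
            rng.normal(scale=0.5, size=(4*n_hidden, n_hidden)),
            np.zeros(4*n_hidden))

# Stack two layers: layer 1's per-step outputs are layer 2's inputs.
n_hidden = 8
seq1 = lstm_layer(inputs, *init(1, n_hidden), n_hidden)
seq2 = lstm_layer(seq1, *init(n_hidden, n_hidden), n_hidden)
print(seq1.shape, seq2.shape)   # (50, 8) (50, 8)
```

Because layer 1 hands layer 2 a full sequence of hidden states rather than a single vector, layer 2 can build abstractions over layer 1's per-step representations, which is the point of stacking.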