This article gives a jargon-free and mathematics-free introduction to RNNs. It also includes a practical demonstration of how to use an RNN for Natural Language Processing and sentiment analysis.
My previous articles about Deep Learning have focused upon traditional Feed Forward architectures, where incoming data travels in a single direction from the input layer to the output layer.
In this article I will introduce Recurrent Neural Networks (RNNs) which have a somewhat different architecture. RNNs have attributes that have made them very popular for tasks where data must be handled in a sequential manner, such as Natural Language Processing (NLP).
I'll start by presenting some theory about RNNs before running through an RNN implementation for sentiment analysis of IMDB movie reviews.
If you are new to Deep Learning then I would recommend that you first read my article An Introduction to Deep Learning as it gives a background for some of the concepts mentioned in this article.
A potential weakness with Feed Forward architectures can be highlighted by the following animation and the question "which way is the arrow moving":
To answer this question we need to process a sequence of images while maintaining state across them. Feed Forward architectures can struggle with this type of sequential task as they do not maintain an internal state between separate inputs that arrive in a given order.
A similar issue becomes apparent when processing text:
“The trailers were the best part of the whole movie.”
To understand this sentence we again need to process the data in a given sequence - interpreting each word in the context of the words that have come before it. Once again we see that a lack of internal state could make this use case difficult to implement.
RNNs support processing of sequential data by the addition of a loop. This loop allows the network to step through sequential input data whilst persisting the state of nodes in the Hidden Layer between steps - a sort of working memory. The following image gives a conceptual representation of how this works.
On the left hand side we see a simple RNN with input, hidden and output nodes. It takes in a sequential input with four elements (i1 to i4), loops through these and outputs a value o4. In each iteration the Hidden Layer inherits the working memory from the previous iteration, as indicated by the red arrow.
On the right hand side we see an “unrolled” representation of the same RNN, which presents the RNN's loops as a chain of identical Feed Forward ANNs, each sharing the same structure, weights and activation functions. Once again the red arrows show how the working memory for each iteration is passed on to the next.
The following image zooms in on the hidden node and shows how it concatenates its current input with the working memory from the previous iteration, before passing the result on to an activation function.
The output from the activation function is then both sent onwards to the output layer and forwarded on to the next iteration of the RNN, as the working memory of the node.
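To make the loop concrete, here is a minimal NumPy sketch of a single RNN step. It is illustrative only (Keras implements all of this for us), and the variable names and dimensions are made up for the example.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # Combine the current input with the previous working memory,
    # then pass the result through an activation function (tanh here).
    return np.tanh(np.dot(W_x, x_t) + np.dot(W_h, h_prev) + b)

# Toy dimensions: 3 input features, 4 hidden nodes
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights (shared across steps)
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights (shared across steps)
b = np.zeros(4)

h = np.zeros(4)                                     # initial working memory
sequence = [rng.normal(size=3) for _ in range(4)]   # i1 to i4

for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)

print(h)  # the final hidden state, which would feed the output layer (o4)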
In the next section we'll see how "unrolling" an RNN is an important part of the RNN learning process.
Back-propagation is the most popular technique for ANN learning. It adjusts an ANN's weights in order to reduce loss (the difference between actual and expected output), thereby optimising the ANN's performance.
Back-propagation uses Gradient Descent to find the optimal weight values.
Check out my Introduction to Deep Learning if you need a refresher on back-propagation and Gradient Descent.
The cyclical nature of RNNs adds an extra dimension of complexity to back-propagation. Each loop of the RNN has its own input and output pair, while all loops share the same weights. So how and when do we update an RNN's weights?
This problem is addressed by back-propagation through time (BPTT). BPTT takes place after forward-propagation of a sequence of inputs and updates the weights of an RNN as follows:
This is a greatly simplified explanation of BPTT. A deeper understanding requires a mathematical run-through, which is outside the scope of this article. Note also that a lot of the back-propagation functionality is hidden by the Keras / TensorFlow APIs.
The "working memory" of standard RNNs struggles to retain longer term dependencies. The image below illustrates this problem:
This behaviour is due to the Vanishing Gradient problem, and can cause problems when early parts of the input sequence contain important contextual information.
The Vanishing Gradient problem is a well known issue with back-propagation and Gradient Descent. It can be experienced when training RNNs and also feed-forward ANNs:
With both ANNs and RNNs the Vanishing Gradient problem results in a loss of information or "working memory" as one updates weights further back in the network.
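A toy calculation makes the problem easy to see. During BPTT the gradient reaching an early time step is a product of many per-step factors; if those factors are smaller than 1, the product shrinks towards zero. The numbers below are invented purely for illustration:

# A toy illustration of the Vanishing Gradient problem (not real training code).
per_step_factor = 0.5
gradient = 1.0
for step in range(1, 21):
    gradient *= per_step_factor
    if step in (1, 5, 10, 20):
        print(f"after {step:2d} steps back in time: gradient = {gradient:.6f}")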
There are multiple RNN variants that aim to solve the Vanishing Gradient problem, of which Long Short Term Memory networks (LSTMs) are currently the most popular.
As with standard RNNs, LSTMs loop through sequences of data, persisting and aggregating the working memory over multiple iterations. LSTMs also share weights and activation functions across iterations, with weights being optimised via BPTT.
However LSTMs add some extra components to the mix, as the below diagram shows:
Note that the working memory as shown in this diagram acts in pretty much the same way as it would in a normal RNN.
Information coming into the node includes:
The new components on show include:
Output from this node includes:
We can see that the working memory and Cell State (long term memory) interact in this type of solution. Really delving into the details of LSTMs would require an article of its own, so please check out Christopher Olah's definitive blog article on Understanding LSTMs if you would like more detailed information about how these work.
I think that is enough theory for now. Let's move on to the practical part of this article!
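For those who want a peek under the hood, here is a compact NumPy sketch of the standard LSTM gate equations as described in Olah's article. It is purely illustrative; Keras hides all of this behind its LSTM layers, and the variable names and dimensions are my own.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # Each weight matrix holds the parameters for the four gates stacked together.
    z = np.dot(W, x_t) + np.dot(U, h_prev) + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)            # forget gate: what to drop from the cell state
    i = sigmoid(i)            # input gate: what new information to store
    o = sigmoid(o)            # output gate: what to expose as working memory
    g = np.tanh(g)            # candidate values for the cell state
    c_t = f * c_prev + i * g  # updated cell state (long term memory)
    h_t = o * np.tanh(c_t)    # updated working memory
    return h_t, c_t

# Toy dimensions: 3 input features, 2 hidden nodes (so 8 stacked gate rows)
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
U = rng.normal(size=(8, 2))
b = np.zeros(8)
h, c = np.zeros(2), np.zeros(2)
for x_t in [rng.normal(size=3) for _ in range(4)]:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)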
I will be using the Keras IMDB Movie Reviews Dataset. This contains 50,000 IMDB reviews, each one classified as a "positive" or "negative" review. The data is split as follows:
The Keras distribution of this dataset is somewhat different to the original version as it has been pre-processed:
Having such a setup allows one to easily filter the dataset by, for example, only including the 10,000 most common words, or even eliminating the top 20. The Keras distribution also includes a codex for mapping the numbers back to words.
Here is an example positive review directly from the dataset.
[1, 13, 296, 4, 20, 11, 6, 4435, 5, 13, 66, 447, 12, 4, 177, 9, 321, 5, 4, 114, 9, 518, 427, 642, 160, 2468, 7, 4, 20, 9, 407, 4, 228, 63, 2363, 80, 30, 626, 515, 13, 386, 12, 8, 316, 37, 1232, 4, 698, 1285, 5, 262, 8, 32, 5247, 140, 5, 67, 45, 87]
And this is how it looks once it has been translated using the codex.
<START> i watched the movie in a preview and i really loved it the cast is excellent and the plot is sometimes absolutely hilarious another highlight of the movie is definitely the music which hopefully will be released soon i recommend it to everyone who likes the british humour and especially to all musicians go and see it's great
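If you want to perform this translation yourself, the following sketch shows one way to build the mapping using the word index that Keras provides. The offset of 3 accounts for the reserved indices that Keras uses by default for padding, start and unknown markers.

import tensorflow as tf

# Build a number-to-word mapping from the Keras word index.
word_index = tf.keras.datasets.imdb.get_word_index()
index_to_word = {index + 3: word for word, index in word_index.items()}
index_to_word[1] = "<START>"
index_to_word[2] = "<UNKNOWN>"

def decode_review(encoded_review):
    return " ".join(index_to_word.get(i, "?") for i in encoded_review)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data()
print(decode_review(x_train[0]))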
I have made a couple of changes to the dataset:
These changes are clearly marked in my code and you can experiment with different values to see how they affect the end result!
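The exact changes are listed in the notebook, but as a rough sketch, here is the kind of preprocessing that matches the layer parameters used later in this article (a vocabulary of the 10,000 most common words and a fixed review length of 500 - my assumption, not a copy of the notebook code):

import tensorflow as tf

vocabulary_size = 10000
max_review_length = 500

# Keep only the 10,000 most common words when loading the data.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=vocabulary_size)

# Pad / truncate every review to a fixed length of 500 words.
x_train = tf.keras.preprocessing.sequence.pad_sequences(
    x_train, maxlen=max_review_length)
x_test = tf.keras.preprocessing.sequence.pad_sequences(
    x_test, maxlen=max_review_length)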
Our RNN/LSTM model is going to be based on word embedding.
Word embedding is a popular technique for modelling word relationships by plotting them in a multi-dimensional space. The following GIF shows this concept in practice.
In this example a trained word embedding model has been queried for words that share the most context or meaning with the word "bus". The grey dots represent other words in the model.
Word embedding works by placing words in a multi-dimensional space. The closer the words are, the stronger the similarity. By increasing the dimensions of our model we can capture different linguistic relationships between words. It is not unusual to encounter word embedding models with hundreds of dimensions.
One of the most exciting aspects of word embedding is that trained models have been shown to correctly interpret words that they have never seen before, by assigning the new word coordinates that match the locations of similar words in the model's vocabulary.
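As a toy illustration of the idea, the sketch below compares hand-made (completely hypothetical) word vectors using cosine similarity, one common way of quantifying "closeness" in an embedding space:

import numpy as np

# Hypothetical 3-dimensional word vectors, invented for illustration only.
embeddings = {
    "bus":   np.array([0.9, 0.1, 0.3]),
    "train": np.array([0.8, 0.2, 0.4]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["bus"], embeddings["train"]))  # high (similar words)
print(cosine_similarity(embeddings["bus"], embeddings["apple"]))  # lower (less similar)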
There are many pre-trained word embedding models available. These use algorithms such as GloVe and can be inserted into your network via transfer learning. Our Deep Learning model is going to be based on word embedding. Rather than use an existing word embedding model we will create our own.
I have used the following tools to create my LSTM:
The code for the example RNN/LSTM is available on my GitHub, but you may want to use this link for an executable copy in Google Colab!
Note that the code is meant as a starting point, so feel free to experiment with the parameters and layers to see how your changes affect the network's performance!
The goal is to create a Deep Learning Model that, after training, correctly predicts the overall sentiment (positive or negative) of IMDB movie reviews that it has never seen before.
To address this I have created the following LSTM architecture.
The key components here are the Embedding Layer and the CuDNNLSTM Layer:
Let's step through the architecture in full to see what is happening in the rest of the layers.
The Input Layer is automatically generated based on the input parameters of the Embedding Layer, which accepts each IMDB review as a 1D list of integer values.
The Embedding Layer accepts a single review as a 1D list of integer values.
The Embedding Layer will learn a Word Embedding for all words in the IMDB review training set. Word embedding is implemented as a lookup table mapping between a given "word" and its coordinates in the multi-dimensional word embedding space. The more dimensions you use, the “wider” the lookup table becomes.
During training, the coordinates in the lookup table are treated as weights and are updated via back-propagation.
Here is the Keras code for implementing our Embedding Layer.
model.add(
    tf.keras.layers.Embedding(
        input_dim = 10000,   # The size of our vocabulary
        output_dim = 16,     # Dimensions for each word
        input_length = 500   # Length of reviews
    )
)
A quick explanation of these parameters:
The output from the Embedding Layer is a 2D array: each of the 500 integers in the original review is replaced by the 16 word embedding coordinates for that word, giving a 500 × 16 array per review.
The Dropout Layer fights overfitting and forces the model to learn multiple representations of the same data by randomly disabling a given proportion of nodes during the learning phase. The code for this is as follows:
model.add( tf.keras.layers.Dropout( rate=0.25 ) )
Here we disable 25% of the nodes that carry the output from the Embedding Layer to our LSTM Layer. Otherwise this is simply a pass-through layer.
This layer learns how to differentiate between positive and negative reviews based on a sequential set of words, each represented by word vector from the Embedding Layer.
Creating an LSTM Layer in Keras is trivial, as Keras hides the complexity (including activation functions) behind a simple API. Note that LSTM nodes commonly use both Sigmoid and Tanh Activation Functions in the different gates. For an explanation of this please check out Christopher Olah's blog article on Understanding LSTMs.
In order to speed up training I am using a Fast LSTM implementation backed by cuDNN. If you do not have a GPU available then you can use tf.keras.layers.LSTM, but this will take a little longer.
model.add( tf.keras.layers.CuDNNLSTM( units=32 ) )
There are few guidelines for choosing the number of nodes in an LSTM Layer, so my choice of 32 was guided by trial and error. I found that 32 nodes gave good results and a very short training time.
Our second dropout layer randomly disables 25% of the nodes between the LSTM Layer and the final Dense Layer, once again to fight overfitting. The code is exactly the same as the first Dropout Layer.
model.add( tf.keras.layers.Dropout( rate=0.25 ) )
The Dense Layer represents the final layer in our LSTM. Its role is to map the 32 nodes in the LSTM Layer to a single node that will predict the sentiment of a given IMDB movie review.
model.add( tf.keras.layers.Dense( units=1, activation='sigmoid' ) )
Here we are using a Sigmoid Activation Function to calculate our final prediction. This is because it will give us a probability between 0 and 1. For our use case a prediction closer to 0 will indicate a negative review, while a probability closer to 1 will indicate a positive review.
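As a quick illustration of how such predictions would be read, the snippet below thresholds a few invented example probabilities (not real model output) at 0.5:

# Interpreting sigmoid outputs as sentiment (example values only).
for probability in (0.07, 0.48, 0.93):
    sentiment = "positive" if probability >= 0.5 else "negative"
    print(f"{probability:.2f} -> {sentiment}")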
In addition to specifying the layers of our LSTM we also need to specify which loss and optimisation functions we want to use in the training of our model.
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)
Three things are being specified in this code:
For more background about back-propagation, Loss and Optimiser Functions refer to my Introduction to Deep Learning.
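Once the model is compiled it can be trained and evaluated. The snippet below is a minimal sketch, assuming x_train/x_test hold the padded reviews and y_train/y_test the 0/1 labels; the epoch count and batch size are illustrative choices rather than values taken from the article.

# Train on the training reviews and validate on the held-out test set.
model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=3,          # illustrative value
    batch_size=128     # illustrative value
)

loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")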
The code for this LSTM implementation (along with additional code for further exploring the use case and data) is available in Google Colaboratory for you to play around with in your browser! The code is also available in static form via GitHub.
In this article we have implemented a LSTM Neural Network with the help of TensorFlow and Keras. What have we learned along the way?
It is not coincidental that we are focusing on NLP use cases in this article, as RNNs have shown most promise in this area. Other NLP use cases for RNN type networks include:
As a final note, there has been recent interest in using Convolutional Neural Networks (CNNs) for some NLP tasks. Only time will tell if CNNs will take over, but it will be exciting to follow the ongoing development of Artificial Neural Networks and Deep Learning!
I hope that you enjoyed this Introduction to Recurrent Neural Networks! Feel free to check out the other articles in this series:
Mark West leads the Data Science team at Bouvet Oslo.