#hide_input
import torch
import torch.nn as nn
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.metrics import accuracy_score
In this post we will go over the intuition, theory and implementation of a gated RNN, specifically the LSTM. This is the sequel to my previous post about RNNs.
Note: This post builds on my previous explanation of RNNs. The dataset used in this post is the IMDB dataset of 50,000 movie reviews, used for sentiment classification.
RNN vs LSTM
LSTM, short for Long Short-Term Memory, is a recurrent architecture for processing sequences, just like the vanilla RNN. Compared to the LSTM, the vanilla RNN is rarely used in practice, as it has some notable limitations that the LSTM architecture addresses. These include:
- Exploding/Vanishing Activations
- Memory Limitations
Exploding/Vanishing Activations
The problem of exploding/vanishing gradients occurs due to the way the vanilla RNN backpropagates gradients. If you consider the unrolled representation of an RNN and its backpropagation mechanism, you will quickly notice that for long sequences a lot of repeated multiplication is needed for gradients to travel from the last layer back to the initial one. This is exactly why we might experience exploding/vanishing gradients. Let’s explore this problem in more detail by simulating the repeated multiplication that happens during backpropagation.
Let’s say we have a sequence of length 50. We will repeatedly multiply matrices initialized with random numbers from a normal distribution.
x = torch.randn(50, 50)
for i in range(50):
    x = x @ torch.randn(50, 50)
x
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]])
This is an instance of exploding gradients. Basically, our numbers got so large that they caused numerical overflow and produced NaNs. What if we try to mitigate this by scaling our matrices by a factor of 0.01?
x = torch.randn(50, 50)
for i in range(50):
    x = x @ torch.randn(50, 50) * 0.01
x
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Now, this produced vanishing gradients, i.e. numerical underflow. The numbers became so small that the computer simply represents them as zeros. If we had a weight matrix like this in any part of our neural network, it would break the model’s ability to learn, which is why this is a delicate but important problem to solve.
Solutions to this include:
- Identity initialization of weights paired with ReLU activation
- Gradient clipping
- Skip connections
- Model alteration
LSTM is an instance of the model-alteration approach to this problem. It mitigates the vanishing/exploding gradients problem by introducing the concept of gates, which change the way gradients flow.
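As a quick aside, gradient clipping is the simplest of these fixes to try. Here is a minimal sketch of where it fits in a training step (a standalone toy example with a throwaway nn.RNN and a dummy loss, not part of the model we build below):

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.1)

x = torch.randn(4, 50, 10)   # batch of 4 sequences, 50 timesteps, 10 features
out, _ = rnn(x)
loss = out.pow(2).mean()     # dummy loss, just to produce gradients

opt.zero_grad()
loss.backward()
# Rescale gradients whose global norm exceeds 1.0 before the update
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
opt.step()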
Memory Limitations
If you experiment with RNNs, e.g. for next-word prediction (a language model), you will notice that they struggle with the long-term memory needed in sentences such as this:
“I grew up in France and had a wonderful childhood, therefore I speak fluent French” - The language model would need to retain the information about “France” until it becomes useful for predicting the word “French”.
LSTMs do a much better job with long-term memory for several reasons:
- An additional state called the cell state which enables accumulation of information over a long duration.
- The concept of memory management depending on the current timestep’s input and hidden state.
Cell State
LSTMs have an additional state called the cell state, which is passed along with the hidden state to each cell. Unlike the hidden state, the cell state does not pass through any linear layers, which enables easier information flow over a longer duration. This is what gives an LSTM its long-term memory. The cell state is only influenced by element-wise operations controlled by gates, which we will look at shortly.
Memory Management
The term “gated RNN” comes from the fact that the cell state is gated (protected) by so-called gates. These gates are linear layers responsible for managing the cell state by extracting relevant information from the current timestep’s input and hidden state. The idea is that at each timestep the cell state information we no longer need should be forgotten and new valuable information should be stored. Since these gates are learnable layers, we delegate this mechanism to the neural network itself instead of managing it manually.
Architecture
This is what the LSTM Cell looks like:
There are four gates and each of them is fed the stacked tensor of the current timestep’s hidden state and input. In summary, what each gate does for a sequence at a particular timestep is the following:
- Forget gate: what information to keep and what to forget in the long memory
- Input gate: what information needs to be updated in the long memory
- Cell gate: how the information will be updated in the long memory
- Output gate: what part of the long memory is relevant for the short memory
Forget Gate
The forget gate decides what to forget/eliminate from the cell state. It is followed by a sigmoid function, mapping values to the (0, 1) range. We then multiply this output element-wise with the cell state. If the value at some position of the sigmoid output is close to 0, the value at the same position in the cell state is mostly erased; the opposite holds for values close to 1.
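Here is a tiny numeric illustration of this element-wise gating (a toy example, separate from the implementation below):

cell_state = torch.tensor([2.0, -1.0, 0.5])
forget_out = torch.sigmoid(torch.tensor([-4.0, 4.0, 0.0]))  # ≈ [0.02, 0.98, 0.50]
print(cell_state * forget_out)                              # ≈ [0.04, -0.98, 0.25]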
Input Gate and Cell Gate
The input and cell gates together decide what to update/store in the cell state. The input gate decides what needs to be updated and to what degree, dictated by the sigmoid. The cell gate decides what the updated values for the positions chosen by the input gate should be. The cell gate is followed by the tanh function, mapping values to the (-1, 1) range. The outputs of the input and cell gates are multiplied element-wise and then added element-wise to the cell state.
Output Gate
The output gate decides which information from the cell state is relevant for the next hidden state. Its output is fed through the sigmoid and then multiplied element-wise with the tanh of the updated cell state.
Implementation
Below is a PyTorch implementation of the LSTM Cell we described.
class LSTMCell(nn.Module):
    def __init__(self, n_input, n_hidden):
        super(LSTMCell, self).__init__()
        self.forget_gate = nn.Linear(n_input+n_hidden, n_hidden)
        self.input_gate = nn.Linear(n_input+n_hidden, n_hidden)
        self.cell_gate = nn.Linear(n_input+n_hidden, n_hidden)
        self.output_gate = nn.Linear(n_input+n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden, n_hidden)

    def forward(self, x, state):
        h, c = state

        # Stack the hidden state and the current input
        h = torch.cat([h, x], dim=1)

        forget_gate_out = torch.sigmoid(self.forget_gate(h))
        input_gate_out = torch.sigmoid(self.input_gate(h))
        cell_gate_out = torch.tanh(self.cell_gate(h))
        output_gate_out = torch.sigmoid(self.output_gate(h))

        # Forget, then store new information in the cell state
        c = c * forget_gate_out
        c = c + (input_gate_out * cell_gate_out)

        # New hidden state
        h = output_gate_out * torch.tanh(c)

        out = torch.relu(self.out(h))

        return out, (h, c)
Note: This particular implementation is not efficient, it’s only a toy example. For an efficient implementation, merge the four gate matrix multiplications into one.
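For reference, the merged version might look roughly like this (a sketch following the standard LSTM equations; it drops the extra self.out layer used in the toy cell above):

class FusedLSTMCell(nn.Module):
    def __init__(self, n_input, n_hidden):
        super().__init__()
        # A single linear layer computes all four gates at once
        self.gates = nn.Linear(n_input + n_hidden, 4 * n_hidden)

    def forward(self, x, state):
        h, c = state
        stacked = torch.cat([h, x], dim=1)
        f, i, g, o = self.gates(stacked).chunk(4, dim=1)   # split into the four gates
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g            # forget, then store new information
        h = o * torch.tanh(c)        # new hidden state
        return h, (h, c)

One big matrix multiplication keeps the hardware busier than four small ones, and it is also how the built-in nn.LSTM organizes its weights.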
Hopefully, all parts of an LSTM are now clear. For an even better understanding, read on to see it applied to sentiment analysis.
Performance
We will be testing LSTM’s performance on the IMDB dataset for sentiment analysis.
Preparing the dataset
# Loading data
data = pd.read_csv('./imdb.csv')
data = data.sample(frac=1)
print(data.head())
review sentiment
49390 I want to add to the praise for the production... positive
33574 A crackling and magnificent thriller about a c... positive
44801 Like most everyone who views this movie, I did... negative
379 I'm a big fan of surrealist art, but this film... negative
5305 This movie was the worst i've ever seen.<br />... negative
# Splitting data
train = data[:25000]
xtrain = train['review'].values.tolist()
ytrain = train['sentiment'].values

val = data[25000:]
xval = val['review'].values.tolist()
yval = val['sentiment'].values
# Preprocessing (Tokenization and padding)
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=3000)

tokenizer.fit_on_texts(xtrain)

xtrain_pro = tokenizer.texts_to_sequences(xtrain)
xtrain_pro = tf.keras.preprocessing.sequence.pad_sequences(xtrain_pro, maxlen=128)

xval_pro = tokenizer.texts_to_sequences(xval)
xval_pro = tf.keras.preprocessing.sequence.pad_sequences(xval_pro, maxlen=128)

ytrain = [1 if y=='positive' else 0 for y in ytrain]
yval = [1 if y=='positive' else 0 for y in yval]
Now we create our dataset class and our datasets.
class IMDBDataset:
    def __init__(self, reviews, targets):
        self.reviews = reviews
        self.targets = targets

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = self.reviews[idx, :]
        target = self.targets[idx]

        review = torch.tensor(review, dtype=torch.long)
        target = torch.tensor(target, dtype=torch.float)

        item = (review, target)

        return item
train_ds = IMDBDataset(xtrain_pro, ytrain)
val_ds = IMDBDataset(xval_pro, yval)
Lastly, we create our dataloaders.
train_dl = torch.utils.data.DataLoader(
    train_ds,
    batch_size=128,
)

val_dl = torch.utils.data.DataLoader(
    val_ds,
    batch_size=128,
)
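And a quick shape check on one batch:

xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # torch.Size([128, 128]) torch.Size([128])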
Constructing the SentimentClassifier Model with our LSTM
To apply the LSTMCell to a full sequence, we wrap it in an LSTM class that loops over the timesteps.
class LSTM(nn.Module):
    def __init__(self, n_input, n_hidden):
        super(LSTM, self).__init__()
        self.n_input = n_input
        self.n_hidden = n_hidden
        self.LSTMCell = LSTMCell(n_input, n_hidden)

    def forward(self, input, state=None):
        # Input dims are (batch_size, seq_length, timestep_features)
        sequence_length = input.size()[1]

        # Initialize hidden and cell state if not provided
        if state is None:
            h, c = (torch.zeros((input.size()[0], self.n_hidden), device=input.device),
                    torch.zeros((input.size()[0], self.n_hidden), device=input.device))
        else:
            h, c = state

        outs = torch.tensor([], device=input.device)
        for i in range(sequence_length):
            # Features of the current timestep, shape (batch_size, timestep_features)
            x_timestep_features = torch.squeeze(input[:, i, :], dim=1)

            out, (h, c) = self.LSTMCell(x_timestep_features, (h, c))

            out = torch.unsqueeze(out, dim=1)
            outs = torch.cat((outs, out), dim=1)

        return outs
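A quick shape check with dummy data: the LSTM returns an output for every timestep of every sequence in the batch.

lstm = LSTM(100, 100)
print(lstm(torch.randn(4, 128, 100)).shape)  # torch.Size([4, 128, 100])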
Finally, we package our LSTM into a SentimentClassifier class, which adds an embedding layer and a linear layer.
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super(SentimentClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = LSTM(n_hidden, n_hidden)
        self.linear = nn.Linear(n_hidden*2, 1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)

        # Using the avg and max pool of all RNN outputs
        avg_pool = torch.mean(x, dim=1)
        max_pool, _ = torch.max(x, 1)

        # We concatenate them (hidden size before the linear layer is multiplied by 2)
        out = torch.cat((avg_pool, max_pool), dim=1)
        out = self.linear(out)

        return torch.squeeze(out, dim=1)
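And a final shape check with dummy token indices: the model produces one logit per review in the batch.

clf = SentimentClassifier(3000, 100)
dummy = torch.randint(0, 3000, (4, 128))
print(clf(dummy).shape)  # torch.Size([4])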
Constructing the training loop
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
'cuda'
model = SentimentClassifier(3000, 100)
model = model.to(device)
loss_func = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
#hide
sigmoid = lambda x: 1 / (1+np.exp(-x))
epochs = 5
for epoch in range(epochs):
    # Training
    ys = []
    y_preds = []
    for x, y in train_dl:
        x = x.to(device)
        y = y.to(device)

        y_pred = model(x)
        loss = loss_func(y_pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ys.extend(y.detach().cpu().numpy().tolist())
        y_preds.extend(y_pred.detach().cpu().numpy().tolist())

    # Measuring Training accuracy
    y_preds_final = sigmoid(np.array(y_preds)) > 0.5
    accuracy = accuracy_score(y_preds_final, ys)
    print(f"Epoch: {epoch} | Training accuracy: {accuracy:.2f}")

    # Validation
    ys = []
    y_preds = []
    with torch.no_grad():
        for x, y in val_dl:
            x = x.to(device)
            y = y.to(device)

            y_pred = model(x)

            ys.extend(y.detach().cpu().numpy().tolist())
            y_preds.extend(y_pred.detach().cpu().numpy().tolist())

    # Measuring Validation accuracy
    y_preds_final = sigmoid(np.array(y_preds)) > 0.5
    accuracy = accuracy_score(y_preds_final, ys)
    print(f"Epoch: {epoch} | Validation accuracy: {accuracy:.2f}")
    print("-"*40)
Epoch: 0 | Training accuracy: 0.66
Epoch: 0 | Validation accuracy: 0.75
----------------------------------------
Epoch: 1 | Training accuracy: 0.80
Epoch: 1 | Validation accuracy: 0.81
----------------------------------------
Epoch: 2 | Training accuracy: 0.85
Epoch: 2 | Validation accuracy: 0.84
----------------------------------------
Epoch: 3 | Training accuracy: 0.87
Epoch: 3 | Validation accuracy: 0.85
----------------------------------------
Epoch: 4 | Training accuracy: 0.89
Epoch: 4 | Validation accuracy: 0.85
----------------------------------------
To wrap up, we can test our model on arbitrary input.
#collapse-hide
def predict(model, text):
    x = tokenizer.texts_to_sequences(text)
    x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=128)
    x = torch.tensor(x, device='cuda')

    model.eval()
    with torch.no_grad():
        logit = model(x)
        prob = sigmoid(logit.cpu().numpy())

    print(f"Output: {str(prob)} | ", end='')
    if prob >= 0.5:
        print('Sentiment: positive')
    else:
        print('Sentiment: negative')
text = ['This guy made this blog about LSTMs and provided the implementation without explaining each line of the code!']
predict(model, text)
Output: [0.2402171] | Sentiment: negative