Photo by John Cameron on Unsplash

Fixing vanishing and exploding gradients in RNNs

Optimize My Day Job
3 min read · Jun 3, 2021


Any neural network struggles with vanishing or exploding gradients when its computational graph becomes too deep. This happens in traditional feed-forward networks when the number of layers is very large, and in Recurrent Neural Networks (RNNs), such as LSTMs or GRUs, when the input sequences are very long. For deep RNNs the problem stems from the repeated multiplication by the same parameters, in particular the weights, at every time step: the products shrink when the weights are smaller than one in magnitude and blow up when they are larger than one.

“Suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ) V^{-1}. In this simple case, it is straightforward to see that

W^t = (V diag(λ) V^{-1})^t = V diag(λ)^t V^{-1}.

Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude.” (Goodfellow et al. 2016, p. 290)
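To see this concretely, here is a small NumPy sketch (my illustration, not from the book or the article): with an eigenvalue slightly above 1 the repeated product explodes, and with one slightly below 1 it vanishes.

import numpy as np

# Repeatedly multiply a vector by a matrix whose only eigenvalue is lam.
for lam in (1.1, 0.9):
    W = np.diag([lam, lam])      # simplest possible W: both eigenvalues equal lam
    h = np.ones(2)               # stand-in for a state or gradient vector
    for t in range(50):          # 50 repeated multiplications, i.e. W^50 applied to h
        h = W @ h
    print(f"eigenvalue {lam}: norm after 50 steps = {np.linalg.norm(h):.4f}")

# eigenvalue 1.1 grows to a norm of roughly 166 (explodes),
# eigenvalue 0.9 shrinks to a norm of roughly 0.007 (vanishes).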

Approaches for mitigating vanishing & exploding gradients

Approaches for mitigating vanishing and exploding gradients include gradient clipping, skip connections, and the use of gated architectures such as LSTM and GRU.

Gradient clipping is an effective method for mitigating exploding gradients. The idea is very simple: if a gradient becomes too large, it is rescaled to a smaller value. The clipping rule works as in the equation below, where g is the gradient, c is a hyperparameter (the clipping threshold) and ||g|| is the norm of g (Goodfellow et al. 2016, p. 413).

if ||g|| > c:  g ← c · g / ||g||

Gradient clipping algorithm
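Written out as a plain NumPy sketch (my illustration, not the article's code), the rule looks like this:

import numpy as np

def clip_gradient(g, c):
    norm = np.linalg.norm(g)
    if norm > c:              # only rescale when the norm exceeds the threshold c
        g = c * g / norm      # keep the direction of g, shrink its norm to c
    return g

g = np.array([3.0, 4.0])          # ||g|| = 5
print(clip_gradient(g, c=1.0))    # -> [0.6 0.8], norm 1.0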

To implement this in Keras, just add clipnorm=xx or clipvalue=xx when specifying your optimizer, e.g. SGD or Adam.
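For example (a minimal sketch using the TensorFlow/Keras optimizers; the thresholds 1.0 and 0.5 are placeholder values you would tune for your own model):

from tensorflow.keras.optimizers import Adam, SGD

opt_norm = Adam(learning_rate=1e-3, clipnorm=1.0)    # rescale any gradient whose L2 norm exceeds 1.0
opt_value = SGD(learning_rate=1e-2, clipvalue=0.5)   # clip each gradient element to the range [-0.5, 0.5]

# model.compile(optimizer=opt_norm, loss='mse')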

Skip connections let the network take a shortcut past some of its layers if that turns out to be beneficial during training. In practice this allows the signal to bypass parts of the network where gradients would otherwise vanish: activations from earlier layers are passed directly to later layers, and during backpropagation the gradients flow back along the same shortcut without being attenuated.

The idea of skip connections was introduced in (Lin et al. 1996) and used in the winning contribution (ResNet) of the ImageNet 2015 competition (He et al. 2015). The premise of (He et al. 2015) is that increasing the depth of the network should reduce the loss, i.e. make the model better at predicting. In practice, however, vanishing gradients meant that increasing the depth of a plain network actually increased the loss, even though the deeper network had more learning capacity. The solution proposed in (He et al. 2015) is the skip connection (also known as a residual block, or residual learning), which lets the network pass the identity of the input x around the weight layers of a block and add it to the block's output before the activation function is applied.

The image below illustrates the original skip connection presented in (He et al. 2015), where a block of two layers f(x) can be skipped by passing the identity of the input x around the block and adding it to the block's output, giving f(x) + x.

Skip Connection
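As a bridge between the figure and the larger example further down, here is a minimal sketch of just the residual block shown above (the Dense layers and their sizes are hypothetical placeholders, not part of the original example):

from tensorflow.keras.layers import Input, Dense, Add, Activation
from tensorflow.keras.models import Model

x_in = Input(shape=(64,))
f = Dense(64, activation='relu')(x_in)        # first of the two layers in f(x)
f = Dense(64)(f)                              # second layer; activation is applied after the add
out = Activation('relu')(Add()([f, x_in]))    # f(x) + x, then the activation
residual_block = Model(x_in, out)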

To implement this in Keras, use the functional API to link the layers. Below is an example that uses skip connections.

# num_x_signals / num_y_signals: the number of input / output features, defined by your data
from tensorflow.keras.layers import Input, LSTM, Add, Conv1D
from tensorflow.keras.models import Model

input = Input(shape=(None, num_x_signals,))
a = LSTM(64, return_sequences=True, dropout=0.6)(input)

x = LSTM(128, return_sequences=True, dropout=0.5)(a)  # main branch
a = LSTM(64, return_sequences=True, dropout=0.5)(a)   # skip branch

x = LSTM(64, return_sequences=True)(x)  # main branch
x = LSTM(32, return_sequences=True)(x)  # main branch
x = LSTM(64, return_sequences=True)(x)  # main branch, back to 64 units so shapes match the skip branch

b = Add()([a, x])  # skip connection: add the skip branch to the main branch (shapes must match)

x = Conv1D(64, kernel_size=5, padding='same', activation='relu')(b)
x = Conv1D(num_y_signals, kernel_size=5, padding='same', activation='tanh')(x)

model = Model(input, x)


Optimize My Day Job

A programming amateur from Denmark who tries to make his and your life easier with code.