# The review of Recurrant Neural Network - LSTM & BRNN [2/3]

| Tags rnn  deeplearning

## Modern RNN architectures (LSTM & BRNNs)

The most successful RNN architecture for sequence learning stem from two work published in 1997. The first, Long Short-Term Memory, introduces the memory cell, a unit of computation that replaces traditional nodes in the hidden layer of a network. With these memory cells, networks are able to overcome difficulties with training encounterred by earlier recurrent network. The second network, Bidirectional Recurrent Neural Network, introduces an architecture in which information from both the future and the past are used to determine the output at any point in the sequence.

Long short-term memory (LSTM)

The LSTM model is introduced primarily in order to overcome the problem of vanishing gradients. This model resembles a standard recurrent neural network with a hidden layer, but each oridinary node in the hidden layer is replaced by a memory cell. Each memory cell contains a node with a self-connected recurrent edge of fixed weight one, ensuring that the gradient can pass across many time steps without vanishing or exploding. To distinguish references to a memory cell and not an ordinary node, we use the subscript $c$.

The term “long short-term memory” comes from the following intuition. Simple recurrent neural network have long-term memory in the form of weights. The weights change slowly during training, encoding general knowledge about the data. They also have short-term memory in the form of ephemeral activations, which pass from each node to successive nodes. The LSTM mdoel introduces an intermediate type of storge via the memory cell. A memory cell is a composite unit, built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes. All elements of the LSTM cell are enumerated and described below.

• Input node: This unit, labeled $g_c$, is a node that takes activation in standard way from the input layer $\mathbf{x}^{(t)}$ at the current time step and (along recurrent edge) from the hidden layer at the previous time step $\mathbf{h}^{(t-1)}$. Typically, the summed weighted input is run through a tanh activation function, although in the original LSTM paper, the activation function is sigmoid.

• Input gate: Gates are a distinctive feature of the LSTM approach. A gate is a sigmoidal unit that, like the input node, takes activation from the current data point $\mathbf{x}^{(t)}$ as well as from the hidden layer at the previous time step. A gate is so-called because its value is used to multiply the value of another node. It is a gate in the sense that if its value is zero, then flow from the other node is cut off. If the value of the gate is one, all flow is passed through. The value of the input gate $i_c$ multiplies the value of the input node.

• Internal state: At the heart of each memory cell is a node $s_c$ with linear activation, which is referred to in the original paper as the “internal state” of the cell. The internal state $s_c$ has a self-connected recurrent edge with fixed unit weight. Because this edge spans adjacent time steps with constant weight, error can flow across time steps without vanishing or exploding. This edge is often called the constrant error carousel. In vector notation, the update for the internal state is $\mathbf{s}^{(t)} = \mathbf{g}^{(t)} \odot \mathbf{i}^{(t)} + \mathbf{s}^{(t-1)}$ where $\odot$ is pointwise multiplication.

• Forget gate: These gates $f_c$ were introduced by Gers et al.. They provide a method by which the network can learn to flush the contents of the internal state. This is especially useful in continuously running networks. With forget gates, the equation to calculate the internal state on the forward pass is

$$\mathbf{s}^{(t)} = \mathbf{g}^{(t)} \odot \mathbf{i}^{(t)} + \mathbf{f}^{(t)} \odot \mathbf{s}^{(t-1)}$$

• Output gate: the value $v_c$ ultimately produced by a memory cell is the value of the internal state $s_c$ multiplied by the value of the output gate $o_c$. It is customary that the internal state first be run through a tanh activation function, as this gives the output of each cell the same dynamic range as an ordinary tanh hidden unit. However, in other neural network research, rectified linear units, which have a greater dynamic range, are easier to train. Thus it seems plausible that the nolinear function on the internal state might be ommited.

Bidirectional recurrent neural networks (BRNNs)

Along with the LSTM, one of the most used RNN architecture is the bidirectinal recurrent neural network (BRNN). In this architecture, there are two layers of hidden nodes. Both hidden layers are connected to input and output. The two hidden layers are differentiated in that the first has reccurent connections from the past time steps while in the second the direction of recurrent of connections is flipped, passing activation backwards along the sequence. Given an input sequence and target sequence, the BRNN can be trained by ordinary backpropagation after unfolding across time. The following three equations describes a BRNN:

$$\mathbf{h}^{(t)} = \sigma(W^{hx} \mathbf{x}^{(t)} + W^{hh} \mathbf{h}^{(t-1)} + \mathbf{b}_h)$$

$$\mathbf{z}^{(t)} = \sigma(W^{(zx)} \mathbf{x}^{(t)} + W^{zz} \mathbf{z}^{(t+1)} + \mathbf{b}_z)$$

$$\hat{\mathbf{y}}^{(t)} = softmax(W^{yh} \mathbf{h}^{(t)} + W^{yt} \mathbf{z}^{(t)} + \mathbf{b}_y)$$

where $\mathbf{h}^{(t)}$ and $\mathbf{z}^{(t)}$ are the values of the hidden layers in the forwards and backwards directions respectively.

One limitation of the BRTT is that cannot run continuously, as it requires a fixed endpoint in both the future and in the past. Further, it is not an appropriate machine learning algorithm for the online setting, as it is implausible to receive information from the future, i.e., to know sequence elements that have not beeen observed. But for prediction over a sequence of fixed length, it is often sensible to take into account both past and future seuqnce elements. Consider the natural laguange task of part-of-speech tagging. Given any word in a sequence, information about both the words which precede and those which follow it is useful for predicting that word’s part-of-speech.