The best and most comprehensive article explaining RNNs I've found so far is this article by a researcher at UCSD. The following contents are drawn from that article.

## Introduction

Neural networks are powerful learning models that achieve state-of-the-art performance in a wide range of machine learning tasks. They are especially well suited for *machine perception tasks*, where the raw underlying features are not individually interpretable. This success is attributed to their ability to **learn hierarchical representations**, unlike traditional methods that rely upon hand-engineered features.

Recurrent neural networks (RNNs) are *connectionist models* with the ability to selectively pass information across sequence steps while processing sequential data one element at a time. *Recurrent neural networks (RNNs), particularly those using Long Short-Term Memory (LSTM) hidden units, are powerful and increasingly popular models for learning from sequence data. They effectively model varying-length sequences and capture long-range dependencies.*

**RNNs vs Markov models?**

Markov model approaches are limited because their states must be drawn from a modestly sized **discrete state space**, and each hidden state in a Markov model can depend only on the immediately previous state. **RNNs**, in contrast, can capture *long-range time dependencies*, overcoming the chief limitation of Markov models: the hidden state at any step can contain information from a nearly arbitrarily long context window.

## Background

**Notations**

- An input sequence: $(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(T)})$, where each data point $\mathbf{x}^{(t)}$ is a real-valued vector.
- A target sequence: $(\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \dots, \mathbf{y}^{(T)})$.

Using temporal terminology, an input sequence consists of data points $\mathbf{x}^{(t)}$ that arrive in a discrete sequence of *time steps* indexed by $t$. When a model predicts data points, these are labeled $\hat{\mathbf{y}}^{(t)}$.

**Neural networks**

Generally, a neural network consists of a set of *artificial neurons*, commonly referred to as *nodes* or *units*, and a set of directed edges between them, which intuitively represent the *synapses* in a biological neural network. Associated with each neuron $j$ is an activation function $l_j(\cdot)$ (e.g., the sigmoid function), which is sometimes called a link function. The values of the hidden nodes in a network, taken together as a vector, are commonly denoted $\mathbf{h}$.

Associated with each edge from node $j'$ to node $j$ is a weight $w_{jj'}$ (denoting the "to-from" weight). The value $v_j$ of each neuron $j$ is calculated by applying its activation function to a weighted sum of the values of its input nodes:

\begin{equation} v_j = l_j\Big(\sum_{j'} w_{jj'} v_{j'}\Big) \end{equation}

For convenience, we term the weighted sum the *incoming activation* and denote it $a_j = \sum_{j'} w_{jj'} v_{j'}$.

Common choices for the activation function include the *sigmoid* $\sigma(z) = 1/(1 + e^{-z})$ and the *tanh* function $\phi(z) = (e^z - e^{-z})/(e^z + e^{-z})$. Another activation function which has become prominent is the *rectified linear unit* (ReLU), whose formula is $l(z) = \max(0, z)$.
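These three activation functions can be sketched in a few lines of plain Python (a minimal illustration, not tied to any particular library):

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}); squashes inputs into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # phi(z) = (e^z - e^{-z}) / (e^z + e^{-z}); squashes inputs into (-1, 1)
    return math.tanh(z)

def relu(z):
    # max(0, z): zero for negative inputs, identity otherwise
    return max(0.0, z)
```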

The activation function at the output nodes depends upon the task. For multiclass classification with $K$ alternative classes, we apply a softmax nonlinearity in an output layer of $K$ nodes. The softmax function calculates

\begin{equation}
\hat{y}_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}
\end{equation}

For multilabel classification, the activation function is simply a point-wise sigmoid, and for regression we typically have linear output.
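A direct Python implementation of the softmax formula above (with the standard max-subtraction trick for numerical stability, which cancels in the ratio and leaves the result unchanged):

```python
import math

def softmax(a):
    # Subtract the max logit before exponentiating; this avoids overflow
    # and does not change the resulting probabilities.
    m = max(a)
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are nonnegative and sum to one, so they can be read directly as class probabilities.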

**Feedforward networks and backpropagation**

In a feedforward network, all nodes can be arranged into layers, and the values of the nodes in each layer are computed successively as a function of the prior layers. The input to a feedforward network is provided by setting the values of the lowest layer. Each higher layer is then successively computed until output $\hat{\mathbf{y}}$ is generated at the topmost layer. Learning is accomplished by iteratively updating each of the weights to minimize a loss function, $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$, which penalizes the distance between the output $\hat{\mathbf{y}}$ and the target $\mathbf{y}$.

The most successful algorithm for training neural networks is **backpropagation**. Backpropagation uses the chain rule to calculate the derivative of the loss function with respect to each parameter in the network. The weights are then adjusted by gradient descent. Because the loss surface is non-convex, there is no assurance that backpropagation will reach a global minimum.

Nowadays, neural networks are usually trained with *stochastic gradient descent* (SGD) using mini-batches. With a batch size equal to one, the stochastic gradient update equation is

\begin{equation} \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} F_i \end{equation}

where $\eta$ is the learning rate and $\nabla_{\mathbf{w}} F_i$ is the gradient of the objective function with respect to the parameters $\mathbf{w}$, as calculated on a single example $i$.

To calculate the gradient in a feedforward neural network, backpropagation proceeds as follows. First, an example is propagated forward through the network to produce a value $v_j$ at each node and outputs $\hat{y}_k$ at the topmost layer. Then, a loss function value $\mathcal{L}(\hat{y}_k, y_k)$ is computed at each output node $k$. Subsequently, for each output node $k$, we calculate

\begin{equation} \delta_k = \frac{\partial \mathcal{L}(\hat{y}_k, y_k)}{\partial \hat{y}_k} \cdot l'_k(a_k) \end{equation}

Given these values $\delta_k$, for each node $j$ in the immediately prior layer we calculate

\begin{equation} \delta_j = l'(a_j)\sum_{k} \delta_k \cdot w_{kj} \end{equation}

This calculation is performed successively for each lower layer to yield $\delta_j$ for every node $j$, given the $\delta$ values for each node connected to $j$ by an outgoing edge. Each value $\delta_j$ represents the derivative of the total loss function with respect to that node's incoming activation $a_j$. Given the values $v_{j'}$ calculated during the forward pass and the values $\delta_j$ calculated during the backward pass, the derivative of the loss with respect to a given parameter $w_{jj'}$ is

\begin{equation} \frac{\partial\mathcal{L}}{\partial w_{jj'}} = \delta_j v_{j'} \end{equation}
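The backward-pass equations can be checked end to end on a tiny network. The sketch below is a minimal pure-Python illustration with a shape of my choosing: a sigmoid hidden layer, a single linear output node, and squared loss $\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2$. It computes the $\delta$ values and weight gradients exactly as in the equations above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    # Incoming activations a_j and values v_j = sigmoid(a_j) for the hidden layer.
    a = [sum(W1[j][i] * x[i] for i in range(len(x))) for j in range(len(W1))]
    v = [sigmoid(a_j) for a_j in a]
    # Single linear output node.
    y_hat = sum(w2[j] * v[j] for j in range(len(v)))
    return v, y_hat

def backward(x, y, W1, w2):
    v, y_hat = forward(x, W1, w2)
    # Output delta: dL/dy_hat * l'(a); for a linear output, l'(a) = 1.
    delta_out = y_hat - y
    # Hidden deltas: l'(a_j) * sum_k delta_k w_kj, with sigmoid'(a_j) = v_j (1 - v_j).
    delta_h = [v[j] * (1.0 - v[j]) * delta_out * w2[j] for j in range(len(v))]
    # Gradients follow dL/dw_jj' = delta_j * v_j'.
    grad_w2 = [delta_out * v[j] for j in range(len(v))]
    grad_W1 = [[delta_h[j] * x[i] for i in range(len(x))] for j in range(len(v))]
    return grad_W1, grad_w2
```

A finite-difference check (perturbing one weight slightly and re-running the forward pass) is a quick way to confirm that the analytic gradients match.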

One open question in neural network research is *how to exploit sparsity in training*. In a neural network with *sigmoid* or *tanh* activation functions, the nodes in each layer never take the value exactly zero. Thus, even if the inputs are sparse, the nodes at each hidden layer are not. However, rectified linear units (ReLUs) introduce sparsity to hidden layers. In this setting, a promising path may be to store the sparsity pattern when computing each layer's values and use it to speed up computation of the next layer in the network. Some recent work shows that given sparse inputs to a linear model with a standard regularizer, sparsity can be fully exploited even if the regularization makes the gradient not sparse.

## Recurrent neural networks

At time $t$, nodes with recurrent edges receive input from the current data point $\mathbf{x}^{(t)}$ and also from hidden node values $\mathbf{h}^{(t-1)}$ in the network's previous state. The output $\hat{\mathbf{y}}^{(t)}$ at each time $t$ is calculated given the hidden node values $\mathbf{h}^{(t)}$ at time $t$. Input $\mathbf{x}^{(t-1)}$ at time $t-1$ can influence the output at time $t$ and later by way of the recurrent connections.

Two equations specify all calculations necessary for computation at each time step on the forward pass in a simple recurrent neural network:

\begin{equation} \mathbf{h}^{(t)} = \sigma(W^{hx}\mathbf{x}^{(t)} + W^{hh}\mathbf{h}^{(t-1)} + \mathbf{b}_h) \end{equation}

\begin{equation} \hat{\mathbf{y}}^{(t)} = \mathrm{softmax}(W^{yh} \mathbf{h}^{(t)} + \mathbf{b}_y) \end{equation}

where $W^{hx}$ is the matrix of conventional weights between the input and the hidden layer and $W^{hh}$ is the matrix of recurrent weights between the hidden layer and itself at adjacent time steps. The vectors $\mathbf{b}_h$ and $\mathbf{b}_y$ are bias parameters which allow each node to learn an offset.

The network can be interpreted as a **deep network** with one layer per time step and shared weights across time steps. It is then clear that the network can be trained across many time steps using backpropagation. This algorithm is called **backpropagation through time** (BPTT).
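The two forward-pass equations translate directly into code. The sketch below is a pure-Python illustration, using tanh as a stand-in for the hidden nonlinearity $\sigma$ and a small `matvec` helper; the dimensions in the usage are chosen arbitrarily:

```python
import math

def matvec(W, v):
    # Matrix-vector product for nested-list matrices.
    return [sum(W[r][c] * v[c] for c in range(len(v))) for r in range(len(W))]

def softmax(a):
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    s = sum(exps)
    return [e / s for e in exps]

def rnn_forward(xs, W_hx, W_hh, W_yh, b_h, b_y):
    h = [0.0] * len(b_h)  # hidden state h^{(0)} initialized to zeros
    outputs = []
    for x in xs:
        # h^{(t)} = sigma(W^{hx} x^{(t)} + W^{hh} h^{(t-1)} + b_h)
        pre = [i + r + b for i, r, b in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]
        # y_hat^{(t)} = softmax(W^{yh} h^{(t)} + b_y)
        logits = [o + b for o, b in zip(matvec(W_yh, h), b_y)]
        outputs.append(softmax(logits))
    return outputs, h
```

Note how the same weight matrices are reused at every step; this weight tying across time is what lets the unrolled network be viewed as a deep network with shared weights.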

**Training recurrent networks**

Learning with recurrent networks has long been considered especially challenging due to the difficulty of learning long-range dependencies. The problems of *vanishing* and *exploding* gradients occur when backpropagating errors across many time steps. As a toy example, consider a network with a single input node, a single output node, and a single recurrent hidden node. Now consider an input passed to the network at time $t$ and an error calculated at time $t + \tau$, assuming input of zero in the intervening time steps. The tying of weights across time steps means that the recurrent edge at the hidden node always has the same weight. Therefore, the contribution of the input at time $t$ to the output at time $t + \tau$ will either explode or approach zero, exponentially fast, as $\tau$ grows large. Hence the derivative of the error with respect to the input will either explode or vanish.

Which of the two phenomena occurs depends on whether the weight $w$ of the recurrent edge satisfies $|w| > 1$ or $|w| < 1$, and on the activation function in the hidden node. Given a sigmoid activation function, the vanishing gradient problem is more pressing, but with a rectified linear unit $\max(0, z)$, it is easier to imagine exploding gradients. Truncated backpropagation through time (TBPTT) is one solution to the exploding gradient problem for continuously running networks. With TBPTT, some maximum number of time steps is set along which error can be propagated. While TBPTT with a small cutoff can be used to alleviate the exploding gradient problem, it requires that one sacrifice the ability to learn long-range dependencies. The LSTM architecture described below uses carefully designed nodes with a recurrent edge of fixed unit weight as a solution to the vanishing gradient problem.
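The single-node toy example makes the exponential behavior easy to see: along the unrolled chain of $\tau$ steps, the input's contribution is scaled by one factor of the recurrent weight (times the activation derivative) per step. A minimal numeric sketch, with the weight values chosen arbitrarily for illustration:

```python
def contribution(w_recurrent, tau, activation_deriv=1.0):
    # Product of tau identical factors along the unrolled chain:
    # (w * l'(a))^tau, using a constant stand-in for the derivative.
    return (w_recurrent * activation_deriv) ** tau

# |w| < 1: the contribution vanishes exponentially fast.
vanishing = contribution(0.5, tau=50)   # on the order of 1e-15
# |w| > 1: the contribution explodes exponentially fast.
exploding = contribution(1.5, tau=50)   # on the order of 1e8
```

The LSTM's fixed unit-weight recurrent edge corresponds to the knife-edge case $w = 1$, where this product neither shrinks nor grows.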

The issue of local optima is an obstacle to effective training that cannot be dealt with simply by modifying the network architecture. However, recent empirical and theoretical studies suggest that in practice, the issue may not be as important as once thought.