# The review of Recurrant Neural Network - Background [1/3]

| Tags rnn  deeplearning

The best and most comprehensive article explaining RNNs I’v found so far is this article by researcher at UCSD. The following contents are drawn from this article.

## Introduction

Neural networks are powerful learning models that archive state-of-the-art performance in a wide range of machine learning tasks. They are suited especially well for machine perception tasks, where the raw underlying features are not individually interpretable. This success is attributed to their ability to learn hierarchical representations, unlike traidtional methods that rely upon hand-engineered features.

Recurrent neural networks (RNNs) are connectionist models with the ability to selectively pass information across sequence steps, while processing sequential data one element at a time. Recurrent Neural Networks (RNNs), particularly those using Long Short-Term Memory (LSTM) hidden units, are powerful and increasingly popular models for learning from sequence data. They effectively model varying length sequences and capture long range dependencies.

RNNs vs Markov models?

Markov model approaches are limited because their states must be drawn from a modestly sized descrete state space. And each hidden state in Markov model can depend only on the immediately previous state. While, RNNs is able to capture long-range time dependencies, overcoming the chief limitation of Markov models (the hidden state at any step can contain information from a nearly arbitraily long context window).

## Background

Notations

• An input sequence: $(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(T)})$, where each data point $\mathbf{x}^{(t)}$ is a real-valued vector.
• A target sequence: $(\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(T)})$.

Using temporal terminalogy, an input sequence consists of data points $\mathbf{x}^{(t)}$ that arrive in a discrete sequence of time steps indexed by $t$. When a model predicted data points, these are labeled $\hat{\mathbf{y}}^{(t)}$.

Nerual networks

Generally, a neural network consits of a set of artifical neurons, commonly referred to as nodes or units, and a set of directed edges between them, which intuitively represent the synapses in a bilogical neural network. Associated with each nuron $j$ is an activation function $l_j(*)$ (e.g., $\sigma$ sigmoid function), which is sometimes called a link function. The values of the hidden nodes in a network which, as a vector, is commonly noted $\mathbf{h}$.

Associated with each edge from node $j'$ to $j$ is a weight $w_{jj'}$ (denotes the “to-from” weight). The value $v_j$ of each neuron $j$ is calculated by applying its activation function to a weighted sumo of the values of its input nodes:

$$v_j = l_j(\sum_{j’} w_{jj’} v_{j’})$$

For convenience, we term the weighted sum as incoming activation and note it as $a_j$.

Common choices for the activation function include the sigmoid $\sigma(z) = 1/(1 + \exp^{-z})$ and the tanh function $\phi(z) = (\exp^z - \exp^{-z})/(\exp^z + \exp^{-z})$. Another activation function which has become prominent is the rectified linear unit (ReLU) whose formula is $l_j(z) = max(0, z)$.

The activation function at the output nodes depends onpon the task. For multiclass classification with $K$ aleternative classes, we apply a softmax nolinearity in an output layer of $K$ nodes, The softmax function calculates

\hat{y}k = \frac{\exp^{a_k}}{\sum{k’=1}^{K} \exp^{a_{k’}}}

For multilabel classification, the activation function is simply a point-wise sigmoid, and for regression we typically have linear output.

Feedforward networks and backpropagation

In feedforward network, all nodes can be arranged into layers, and the values of the nodes in each layer are computed successively as a function of the prior layers. The input $\mathbf{x}$ to a feedforward network is provided by setting the vaues of the lowest layer. Each higher layer is then successively computed until output is generated at the topmost layer $\hat{\mathbf{y}}$. Learning is accompolished byb iteratively updating each of the weights to minimize a loss function, $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$, which penalizes the distance between the output $\hat{\mathbf{y}}$ and the target $\mathbf{y}$.

The most successful algorithm for training neural networks is backpropagation. Backpropagation uses the chain rule to calculate the derivation of the loss function $\mathcal{L}$ with respect to each parameter in the network. The weights are then adjusted by gradient descent. Because the loss surface is non-convex, there is no assurance that backpropagation will reach a global minimum.

Nowdays, neural networks are usully trained with stochastic gradient descent (SGD) using mini-batches. With batch size equal to one, the stochastic gradient update equation is

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \Delta_{\mathbf{w}} F_i$$

where $\eta$ is the learning rate and $\Delta_{\mathbf{w}}F_i$ is the gradient of the objective function with respect to the parameters $\mathbf{w}$ as calculated on a single example $(x_i, y_i)$.

To calculate the gradient in a feedforward neural network, backpropagation proceeds as follows. First, an example is propagated forward through the network to produce a value $v_j$ at each node and outpus $\hat{\mathbf{y}}$ at the topmost layer. Then a loss function value $\mathcal{L}(\hat{y}_k, y_k)$ is computed at each output node $k$. Subsequently, for each output node $k$, we calculate

$$\delta_k = \frac{\partial \mathcal{L}(\hat{y}_k, y_k)}{\partial \hat{y}_k} \cdot {l’}_k(a_k)$$

Given these values $\delta_k$, for each node in the immediately prior layer we calculate

$$\delta_j = {l’}(a_j)\sum_{k} \delta_k \cdot w_{kj}$$

This calculation is performanced successively for each lower layer to yield $\delta_j$ for every node $j$ given the $\delta$ values for each node connected to $j$ by an outgoing edge. Each value $\delta_j$ represents the derivative ${\partial\mathcal{L}}/{\partial a_j}$ of the total loss function with respect to that node’s incoming activation. Given the values $v_j$ calculated during the forward pass, and the values $\delta_j$ calculated during the backward pass, the derivative of the loss $\mathcal{L}$ with respect a given parameter $w_{jj'}$ is

$$\frac{\partial\mathcal{L}}{\partial w_{jj’}} = \delta_j v_{j’}$$

One open question in neural network research is how to exploit sparsity in training. In a neural network with sigmoid or tanh activation functions, the nodes in each layer never takes value exactly zero. Thus, even if the inputs are sparse, the nodes at each hidden layers are not. However, rectified linear units (ReLUs) introduce sparsity to hidden layers. In this setting, a promising path may be to store the sparsity pattern when computing each layer’s values and use it to speed up computation of the next layer in the network. Some recent work shows that given sparse inputs to a linear model with a standard regularizer, sparsity can be fully exploited even if regularization makes the gradient be not sparse.

## Recurrent neural networks

At time $t$, nodes with recurrent edges receive input from the current data $\mathbf{x}^{(t)}$ and also from hidden node values $\mathbf{h}^{(t-1)}$ in the network’s previous state. The output $\hat{\mathbf{y}}^{(t)}$ at each time $t$ is calculated given the hidden node value $\mathbf{h}^{(t)}$ at time $t$. Input $\mathbf{x}^{(t-1)}$ at time $t-1$ can influence the output $\hat{\mathbf{y}}^{(t)}$ at time $t$ and later by way of the recurrent connection.

Two equations specify all calculations necessary for computation at each time step on the forward pass in a simple recurrent neural network:

$$\mathbf{h}^{(t)} = \sigma(W^{hx}\mathbf{x}^{(t)} + W^{hh}\mathbf{h}^{(t-1)} + \mathbf{b}_h)$$

$$\hat{\mathbf{y}}^{(t)} = softmax(W^{yh} \mathbf{h}^{(t)} + \mathbf{b}_y)$$

where $W^{hx}$ is the matrix of conventional weights between the input and the hidden layer and $W^{hh}$ is the matrix of recurrent weights between the hidden layer and iteself at adjacent time steps. The vectors $\mathbf{b}_h$ and $\mathbf{b}_y$ are bias parameters which allow each node to learn an offset.

The network can be interpreted as a deep network with one layer per time step and shared weights across time steps. It is then clear that the network can be trained across many time steps using backpropagation. This algorithm, called backpropagation through time (BPTT).

Training recurrent networks

Learning with recurrent networks has long been considerred to be especially challenging due to the difficulty of learning long-range dependencies. The problems of vanishing and exploding gradients occur when backpropagating errors across many time steps. As a toy example, consider a network with a single input node, a single output node, and a single recurrent hidden node. Now consider an input passed to the network at time $\tau$ and an error calculated at time $t$, assuming input of zero in the intervening time steps. The tying of weights across time steps means that the recurrent edge at the hidden node $j$ always has the same weight. Therefore, the contribution of the input at time $\tau$ to the output at time $t$ will eighter explode or approach to zero, exponentially fast, as $t-\tau$ grows large. Hence the derivative of the error with respect to the input will eighter explode or vanish.

Which of the two phenomena occurs depends on whether the weight of the recurrent edge $\vert w_{jj} \vert > 1$ or $% $ and on the activation function in the hidden node. Given a sigmoid activation function, the vanishing gradient problem is more pressing, but with a rectified linear unit $max(0, x)$, it is easier to imagine the exploding gradients. Truncated backpropagation through time (TBPTT) is one solution to the exploding gradient problem for continuously running networks. With TBPTT, some maximum number of time steps is set along which error can be propagated. While TBPTT with small cutoff can be used to alleviate the exploding gradient problem, it requires that one sacrifice the ability to learn long-range dependencies. The LSTM architecture described below uses carefully designed nodes with recurrent edge with fixed unit weight as a solution to the vanishing gradient problem.

The issue of local optima is an obstacle to effective training that cannot be dealt with simply by modifying the network architecture. However, recent empirical and theoretical studies suggest that in practice, the issue may not be as important as once thought.