Representations of natural language inputs and output
When words are output at each time step, generally the output consists of a softmax vector where is the size of the vocabulary. A softmax layer is an element-wise logistic function that is normalized so that all of its components sum to one. Intuitively, these outputs correspond to the probabilities that each word is the correct output at that time step.
For application where an input consists of a sequence of words, typically the words are fed to the network one at a time in consecutive time steps. In these cases, the simplest way to represent words is a one-hot encoding, using binary vectors with a length equal to the size of the vocabulary, so “1000” and “0100” would represent the first and the second words in the vocabulary respectively. However, this encoding is inefficient, requiring as many bits as the vocabulary is large. Further it offers no direct way to capture different aspects of similarity between words in the encoding itself. Thus it is common now to model words with a distributed representation using a meaning vector. In some cases, these meanings for words are learned given a large corpus of supervised data, but it is more usual to initialize the meaning vectors using an embedding based on word co-occurrence statistics. Freely available code to produce word vectors from these satistics include GloVe and word2vec.