
Neural Networks Algorithm for predicting the fourth word in a sentence

I have an assignment where we're required to analyze a learning algorithm for predicting the fourth word in a sentence. It is a 4-gram model, with the first three words given. The parameters we can change are d = the number of dimensions used to represent a word, and numHid = the number of hidden units in the hidden layer (here we're using a single hidden layer). So I trained the algorithm with a different d and a different numHid each time; the algorithm stops automatically when the validation error starts increasing. My questions are: What does the number of epochs represent? Is it better for the number of epochs to be minimal, provided that the learning rate is kept constant throughout training? Should I use the parameters that give me the minimum cross-entropy error?
Thanks

Answers (1)

Greg Heath on 29 Jan 2012
>I have an assignment where we're required to analyze a learning algorithm for predicting the fourth word in a sentence. It is a 4-gram model, with the first three words given. The parameters we can change are d = the number of dimensions used to represent a word,
How, exactly, are words represented? How many 4-word combinations do you have? Are there multiple combinations that have the same 4th word?
>and numHid = the number of hidden units in the hidden layer (here we're using a single hidden layer). So I trained the algorithm
What kind of algorithm? What is its name? Are you using the NN Toolbox?
> with a different d and a different numHid each time; the algorithm stops automatically when the validation error starts increasing. My question is: What does the number of epochs represent?
The interval between successive weight-update stages is an epoch; in batch training, that is one complete pass through the training set.
> Is it better for the number of epochs to be minimal, provided that the learning rate is kept constant throughout training?
Regardless of learning rate, the ultimate goal is to minimize the performance error on nondesign data. Speed is of secondary importance.
>Should I use the parameters that give me the minimum cross-entropy error?
Use the parameters that optimize YOUR measure of performance. From my point of view you have a classification problem and should try to minimize the rate of failure to choose the correct 4th word. However, classification error rate is not continuous. Therefore, it is much better to use a continuous objective function like mean-square-error or cross-entropy.
In the words of Confucius: "Try both, choose best."
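For instance, here is a minimal MATLAB sketch (my own illustration, not from the assignment code) of the difference between the two measures: classification error only changes when the argmax changes, while cross-entropy responds to every change in the predicted probabilities.
% Illustrative only: compare per-case cross-entropy with classification
% error rate for a batch of softmax outputs.
nCases = 5; nWords = 250;
probs = rand(nCases, nWords);                    % stand-in predictions
probs = bsxfun(@rdivide, probs, sum(probs, 2));  % rows sum to 1
targets = randi(nWords, nCases, 1);              % correct 4th-word indices
idx = sub2ind(size(probs), (1:nCases)', targets);
crossEntropyPerCase = -mean(log(probs(idx)))     % continuous objective
[~, predicted] = max(probs, [], 2);
errorRate = mean(predicted ~= targets)           % discontinuous measure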
Hope this helps.
Greg
  2 Comments
Mohamed Temraz on 29 Jan 2012
Thanks Greg. Here is a detailed explanation of the assignment.
In this assignment, you will run code that trains a simple neural language model on a dataset of sentences that were culled from a large corpus of newspaper articles so that the culled sentences would have a highly restricted vocabulary of only 250 words.
The model you will train on these data produces a distribution over the next word given the previous three words as input. Since the neural network will be trained on 4-grams extracted from isolated sentences, this means that it will never be asked to predict any of the first three words in a sentence. The neural network learns a d-dimensional embedding for the words in the vocabulary and has a hidden layer with numHid hidden units fully connected to the single output softmax unit. If we are so inclined, we can view the embedding as another (earlier) hidden layer with weights shared across the three words.
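To make that architecture concrete, here is a rough sketch of the forward pass for one 4-gram (variable names and the logistic hidden activation are my assumptions; the actual train script may differ):
% Forward pass for one case: three context words -> softmax over vocab.
d = 8; numHid = 64; vocabSize = 250;
wordReps = 0.1*randn(vocabSize, d);      % one d-dimensional row per word
repToHid = 0.1*randn(3*d, numHid);
hidToOut = 0.1*randn(numHid, vocabSize);
hidBias  = zeros(1, numHid);
outBias  = zeros(1, vocabSize);
w1 = 17; w2 = 42; w3 = 5;                % indices of the three context words
embed  = [wordReps(w1,:), wordReps(w2,:), wordReps(w3,:)];  % shared embedding
hid    = 1 ./ (1 + exp(-(embed*repToHid + hidBias)));       % logistic (assumed)
logits = hid*hidToOut + outBias;
p      = exp(logits - max(logits));      % softmax over the 250 words
p      = p / sum(p);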
After you have loaded the data, you can set d and numHid like so:
>> d = 8; numHid = 64;
and then run the main training script,
>> train;
which will train the neural network using the embedding dimensionality and number of hidden units specified.
The training script monitors the cross entropy error on the validation data and uses that information to decide when to stop training. Training stops as soon as the validation error increases, and the final weights are set to be the weights from immediately before this increase. This procedure is a form of "early stopping" and is a common method for avoiding overfitting in neural network training.
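The early-stopping rule just described amounts to something like the following sketch (the script's internals aren't shown in the thread, so initWeights, trainOneEpoch, and validationCE are hypothetical stand-ins):
% Stop as soon as validation CE rises; keep the weights from just before.
weights = initWeights();                    % hypothetical initializer
bestValidCE = Inf;
epoch = 0;
while true
    weights = trainOneEpoch(weights, trainData);  % one pass over training data
    validCE = validationCE(weights, validData);
    if validCE > bestValidCE
        break;                              % validation error went up: stop
    end
    bestValidCE = validCE;
    bestWeights = weights;                  % weights from before the uptick
    epoch = epoch + 1;
end
epochsBeforeVErrUptick = epoch;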
Here is a list of the variables of interest that the train script puts in the workspace:
wordRepsFinal - the learned word embedding
repToHidFinal - the learned embedding-to-hidden unit weights
hidToOutFinal - the learned hidden-to-output unit weights
hidBiasFinal - the learned hidden biases
outBiasFinal - the learned output biases
epochsBeforeVErrUptick - the number of epochs of training before validation error increased; in other words, the number of training epochs used to produce the final weights above
finalTrainCEPerCase - the per-case cross entropy error on the training set of the weights with the best validation error
finalValidCEPerCase - the per-case cross entropy error on the validation set of the final weights (this will always be the best validation error the training script has seen)
finalTestCEPerCase - the per-case cross entropy error on the test set of the final weights
You must train the model four times, trying all possible combinations of d=8, d=32 and numHid=64, numHid=256. You must record the final cross entropy error on the training, validation, and test sets (stored in the appropriate variables mentioned above) for each of these runs. You must also record the number of epochs before a validation error increase (stored in epochsBeforeVErrUptick) for each of the runs.
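One way to keep the bookkeeping straight is a small driver loop over the four configurations; this assumes, as described above, that train reads d and numHid from the workspace and leaves the final* variables behind:
% Run all four (d, numHid) combinations and tabulate the results.
results = [];
for dVal = [8 32]
    for numHidVal = [64 256]
        d = dVal; numHid = numHidVal;
        train;
        results = [results; d, numHid, epochsBeforeVErrUptick, ...
                   finalTrainCEPerCase, finalValidCEPerCase, finalTestCEPerCase];
    end
end
disp(results)   % columns: d, numHid, epochs, train CE, valid CE, test CE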
Select the best configuration that you ran. The function wordDistance has been provided for you so that you can compute the distance between the learned representations of two words. The wordDistance function takes two strings, the wordReps matrix that you learned (use wordRepsFinal unless you have a good reason not to), and the vocabulary. For example, if you wanted to compute the distance between the words "and" and "but" you would do the following (after training the model, of course):
>> wordDistance('and', 'but', wordRepsFinal, vocab)
The wordDistance function simply takes the feature vector corresponding to each word and computes the L2 norm of the difference vector. Because of this, you can only meaningfully compare the relative distances between two pairs of words and discover things like "the word 'and' is closer to the word 'but' than it is to the word 'or' in the learned embedding." If you are especially enterprising, you can compare the distance between two words to the average distance to each of those words. Remember that if you want to enter a string that contains the single quote character in MATLAB you must escape it with another single quote. So the string apostrophe-s, which is in the vocabulary, would have to be entered as
>> '''s'
in MATLAB.
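Based on that description, wordDistance presumably does something like this sketch (named wordDistanceSketch here to make clear it is a guess at the provided function; it assumes vocab is a cell array of strings and wordReps stores one word per row):
function dist = wordDistanceSketch(word1, word2, wordReps, vocab)
% Look up each word's learned feature vector and return the L2 norm
% of the difference vector, as the assignment describes.
i1 = find(strcmp(word1, vocab));
i2 = find(strcmp(word2, vocab));
dist = norm(wordReps(i1,:) - wordReps(i2,:));   % Euclidean (L2) distance
end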
Compute the distances between a few words and look for patterns. See if you can discover a few interesting things about the learned word embedding by looking at the distances between various pairs of words. What words would you expect to be close together? Are they? Think about what factors contribute to words being given nearby feature vectors. You can access the vocabulary of words from the 'vocab' variable and the raw sentences from the file rawSentences.txt.gz.
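For example, one might probe the embedding along these lines (the word choices are guesses at the 250-word vocabulary, and this again assumes wordRepsFinal stores one row per word):
>> wordDistance('and', 'but', wordRepsFinal, vocab)  % similar function words: close?
>> wordDistance('and', 'or', wordRepsFinal, vocab)
>> % average distance from 'and' to every word, for context:
>> andVec = wordRepsFinal(strcmp('and', vocab), :);
>> diffs = bsxfun(@minus, wordRepsFinal, andVec);
>> avgDistFromAnd = mean(sqrt(sum(diffs.^2, 2)))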
Greg Heath on 20 Feb 2015
I don't remember this 3-year-old post. However, it sure would have been useful to see a few inputs and the corresponding targets.

