In this lesson, we will learn about Hidden Markov Models to tag sentences with words ‘Nouns’, ‘Verbs’, etc…
‘Part of Speech’ Tagging
Suppose that we have a sentence that we want to label each word by parts of speech (nouns, verbs, adjectives, determinate, etc).
For simplicity, we will only consider 3 parts of speech:
- Noun: John, Mary, people
- Modal verb: would, can, will, might
- Verb: run, speak, write, feel
The easiest tagging algorithm would be the Lookup Table. Using a dataset, we create a table that counts how many times a word was tagged each parts of speech, and choose the parts of speech that was tagged the most.
Although it works for simple examples, the lookup table algorithm fails when words have multiple valid tags. If the word ‘Will’ can act both as a modal verb and as a noun, but using the lookup table, all ‘Will’ will be tagged the same parts of speech without context.
To give words more context, we can use Bigrams method. Instead of using individual words, we tag adjacent pairs of words with pairs of parts of speech. This way, if the word ‘will’ comes after ‘mary’, ‘jane’, or other nouns, it will be tagged as a modal verb, but if the word ‘will’ comes after verbs like ‘see’, it will be tagged as a noun. Note that we can use $n$ words instead of 2. We call this the $n$-grams method.
There are many variants of Bigrams algorithm on how you use the table to tag the words. For example, given the sentence “Mary, Will and John ate lunch”, it is possible that the tags for pair “Mary Will” and “Will and” to tag the word “Will” differently. You could either preserve or overwrite the previous tag. Variants should be chosen based on what fits the data best.
The problem with bigrams (and $n$-grams) is that certain bigrams in the sentence might not appear in the data. If the sentence you want to tag has words not often used, there is a good chance this could happen.
Hidden Markov Models (HMM)
Hidden Markov Models is a method using probabilities to tag the sentence. From all possible tagging schema for the sentence, we choose the one with highest probability. This is also known as maximum likelihood.
The probability of the tagging schema is calculated in two ways. First, we calculate the probabilities of each word that given its tag. Then, we calculate the probability of one part of speech transitioning to the next part of speech. These are known as the emission probabilities and transition probabilities.
To calculate the emission probabilities, we simple look at the dataset to compute the probability of each word given a tag. For example, in the 4 sentences above, a noun is Mary 4 times, Jane 2 times, Spot 2 times and Will 1 times, so we each give them probabilities $4/9, 2/9, 2/9, 1/9$.
To calculate the transition probabilities, we first add a state in each end of the sentence. Then we count the number of times a tag follows another tag and normalize it to sum to 1.
We combine the two probabilities to create the hidden markov model.
In general, we call the words observations and the parts of speech the hidden states of the model. Like the name suggests, observations are states we observe, and the hidden states are the states that we have to infer from the observations.
Probability from HMM
Now, given a tagged schema, we can find the probability of the schema by multiplying the individual emission and transition probabilities. Suppose we have a sentence “Jane will spot Will.” There are $3^4 = 81$ possible tagging, and one of them is NMVN. We compute the probability of this tagging to be $0.0003858$.
An incorrect tagging like NNNN gives a much smaller probability of $0.0000002788$.
In general, we use the maximum likelihood principle and pick the tagging schema with the highest probability.
We can draw all possible tagging schemas using a Trellis diagram:
For the example above, we needed to check 81 possible taggings to find the most likely tagging. This is not too bad, but in reality we have longer sentences and many more parts of speech. Thus, we need a better way to find the most likely schema.
One improvement is removing emission or transition probabilities of zero. The product of anything with zero is zero, so if the tagging schema contains one probability of zero, it cannot be the most likely.
Deleting vertices and edges could make some vertices unable to reach the terminal state. For example, the tag M for word Will cannot go to the terminal state. This means that all possible ways of reaching the terminal state had probability 0, so we can delete this state.
With this simple algorithm, we removed the number of possible schemas to 4.
The ‘removing zeros’ algorithm was effective in this example, but it is not effective when there are not many probabilities of 0. A more general algorithm is the Viterbi algorithm. In the Viterbi algorithm, for each hidden state, we only remember the path with the highest probability.
For example, in the diagram above, we can reach the hidden state N for word “spot” two different ways. The path NMN has higher probability than NNN. Then, we can ignore the path NNN since no matter what hidden states it goes through afterwards, we can get a better probability by simply replacing the path NNN to NMN.
Let’s use the Viterbi algorithm with the full trellis diagram. For the first observation, we calculate the possibility of all hidden states.
Then, for the next observation, we calculate the probability for all possible transitions from the previous hidden states. For each hidden state, we choose the largest probability and discard the rest of the probabilities.
We repeat the step continuously to get until we reach the terminal state.
After finishing the Viterbi algorithm, we can find the most likely tagging by simply looking back from the terminal state.