# Where to find Viterbi algorithm transition values for natural language processing?

I just watched a video in which the Viterbi algorithm was used to determine whether certain words in a sentence are intended as nouns, verbs, adjectives, etc. It relied on transition and emission probabilities: for example, the probability of the word 'Time' being used as a verb (emission) and the probability of a noun being followed by a verb (transition).

http://www.youtube.com/watch?v=O_q82UMtjoM&feature=relmfu (The video)

How can I find a good dataset of transition and emission probabilities for this use-case?

Or even just a single example with all the probabilities displayed would help; I want to use realistic numbers in a demonstration.

## Answers

Usually, implementations of **Hidden Markov Models** (HMMs) can not only run the Viterbi algorithm for tagging, but also an algorithm used to *train* the model (e.g. the Baum-Welch algorithm). The way to obtain the model (i.e. the set of transition and emission probabilities) is then to **run the training algorithm** on a suitable training corpus (such as the Penn Treebank).
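To make the training step concrete, here is a minimal sketch in Python. It assumes the corpus is already tagged, in which case the probabilities can be estimated by simple relative-frequency counting rather than Baum-Welch (which is typically used when the tags are not observed). The three-sentence `TOY_CORPUS`, the tag names, the `<s>` start symbol, and the helper names `estimate_hmm` and `viterbi` are all invented for illustration; the resulting numbers are not realistic and would need a real corpus to become so.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, tag) pairs, invented for illustration only.
TOY_CORPUS = [
    [("time", "NOUN"), ("flies", "VERB"), ("fast", "ADV")],
    [("fruit", "NOUN"), ("flies", "NOUN"), ("like", "VERB"), ("bananas", "NOUN")],
    [("time", "VERB"), ("the", "DET"), ("race", "NOUN")],
]

START = "<s>"  # assumed pseudo-tag marking the beginning of a sentence


def estimate_hmm(corpus):
    """Estimate transition P(tag | previous tag) and emission P(word | tag)
    by relative-frequency counting over tagged sentences."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in corpus:
        prev = START
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counts):
        # Turn each row of counts into a probability distribution.
        return {
            key: {item: c / sum(row.values()) for item, c in row.items()}
            for key, row in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


def viterbi(words, transition, emission, tags):
    """Return the most probable tag sequence for a non-empty list of words.
    Unseen transitions and emissions get probability 0 (no smoothing)."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{}]
    for tag in tags:
        p = transition.get(START, {}).get(tag, 0.0)
        p *= emission.get(tag, {}).get(words[0], 0.0)
        best[0][tag] = (p, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = emission.get(tag, {}).get(words[i], 0.0)
            best[i][tag] = max(
                (best[i - 1][prev][0] * transition.get(prev, {}).get(tag, 0.0) * emit, prev)
                for prev in tags
            )
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


transition, emission = estimate_hmm(TOY_CORPUS)
tags = sorted(emission)

print("P(VERB | NOUN) =", transition["NOUN"]["VERB"])
print("P(time | VERB) =", emission["VERB"]["time"])
print(viterbi(["time", "flies", "fast"], transition, emission, tags))
```

Replacing `TOY_CORPUS` with sentences read from a real tagged corpus would yield the kind of tables the question asks for. In practice one would also add smoothing so that unseen words and tag pairs do not get probability zero, and work with log-probabilities to avoid underflow on long sentences.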

I am not aware of any freely available, off-the-shelf HMM-based implementation of a POS tagger that comes with a pre-trained model that can be readily inspected. However, an approach that is in many ways similar to an HMM is the **Conditional Random Field** (CRF). The CRFTagger created at Tohoku University, Japan, appears to come with a pre-trained model for English (see the file model/model.txt after downloading and unpacking). The file is human-readable, but to understand the details of the format you might have to contact the authors.