N-grams
• Counting words
• Using observations to make predictions
Unigram
• “how’s the weather out there?”
• [how’s, the, weather, out, there]
Unigram
• How many words are there?
Unigram
• How many times does “weather” occur?
Unigram
• Prob “weather” = occurrences of “weather” / total number of words
Unigram
• P(“weather”) = c(“weather”) / c(total)
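The unigram formula above can be sketched in a few lines of Python. This is a minimal illustration, not a full tokenizer: the function name `unigram_probability` and the simple lowercase-and-split tokenization are assumptions made for the example.

```python
from collections import Counter

def unigram_probability(tokens, word):
    """Return c(word) / c(total) for a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return counts[word] / total if total else 0.0

# Assumed tokenization: lowercase, drop the question mark, split on spaces.
sentence = "how's the weather out there?"
tokens = sentence.lower().replace("?", "").split()

# "weather" occurs once among five tokens.
p_weather = unigram_probability(tokens, "weather")
print(tokens, p_weather)
```

With a single five-word sentence, every word gets the same probability; the counts only become informative on a larger corpus.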
Bigram
• “the storm swept through the land”
• [(the, storm), (storm, swept), (swept, through), (through, the), (the, land)]
Bigram
• How many times does “storm” follow “the”?
Bigram
• How many times does the word “the” occur?
Bigram
• Prob “storm” given “the” = occurrences of “the storm” / occurrences of “the”
Bigram
• Prob “storm” given “the” = occurrences of “the storm” / occurrences of “the”
• In general: P(word n | word n-1)
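As a rough sketch, the bigram estimate can be computed directly from the example sentence. The helper names `bigrams` and `bigram_probability` are hypothetical, chosen only for this illustration.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

def bigram_probability(tokens, prev, word):
    """Estimate P(word | prev) as c(prev word) / c(prev)."""
    pair_counts = Counter(bigrams(tokens))
    prev_count = Counter(tokens)[prev]
    return pair_counts[(prev, word)] / prev_count if prev_count else 0.0

tokens = "the storm swept through the land".split()

# "the" occurs twice and is followed by "storm" once.
p_storm_given_the = bigram_probability(tokens, "the", "storm")
print(bigrams(tokens), p_storm_given_the)
```

Here “the” is followed once by “storm” and once by “land”, so each continuation gets probability 0.5.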
Markov Assumption
• The assumption that the probability of a word depends only on the previous word, or on a fixed number of previous words, rather than on everything that came before
• P(“land” | “the”)
• P(“land” | “the storm swept through the”)
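One way to read the two probabilities above: under a bigram model, the long history is replaced by just the previous word. The symbols w_1 ... w_n for the words of a sentence are notation assumed here, not taken from the slides.

```latex
P(\text{land} \mid \text{the storm swept through the}) \approx P(\text{land} \mid \text{the})

P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-1})
```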
N-gram
• Extends the bigram model to condition on the previous N-1 words
Maximum Likelihood Estimation
• N-gram probability estimated from corpus counts
• P(word n | word n-1) = count of word n-1 followed by word n / count of all occurrences of word n-1
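The same count ratio works for any history length. Below is a minimal sketch, assuming the corpus is already a flat list of tokens; `ngrams` and `mle_probability` are hypothetical names for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_probability(tokens, history, word):
    """Estimate P(word | history) as c(history + word) / c(history)."""
    history = tuple(history)
    if not history:
        # No history: fall back to the unigram estimate c(word) / c(total).
        return tokens.count(word) / len(tokens) if tokens else 0.0
    n = len(history) + 1
    ngram_counts = Counter(ngrams(tokens, n))
    history_counts = Counter(ngrams(tokens, n - 1))
    denom = history_counts[history]
    return ngram_counts[history + (word,)] / denom if denom else 0.0

tokens = "the storm swept through the land".split()
p_bigram = mle_probability(tokens, ["the"], "land")              # c(the land) / c(the)
p_trigram = mle_probability(tokens, ["through", "the"], "land")  # c(through the land) / c(through the)
print(p_bigram, p_trigram)
```

Note that any word sequence never seen in the corpus gets probability zero under this estimate.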
Trigram
• “the quick red fox jumped the quick black bear. The quick red fox hopped away.”
• [(the, quick, red), (quick, red, fox), (red, fox, jumped), (fox, jumped, the), (jumped, the, quick), (the, quick, black), (quick, black, bear), (the, quick, red), (quick, red, fox), (red, fox, hopped), (fox, hopped, away)]
Trigram
• How many times does “the quick red” occur?
Trigram
• How many times does “the quick” occur?
Trigram
• Prob “red” given “the quick” = occurrences of “the quick red” / occurrences of “the quick”
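The trigram counts for the example text can be checked with a short script. This sketch assumes sentences are split on periods and lowercased, so that no trigram crosses a sentence boundary (matching the list of trigrams above).

```python
from collections import Counter

text = ("the quick red fox jumped the quick black bear. "
        "The quick red fox hopped away.")

# Assumed preprocessing: split on periods, lowercase, split on spaces.
sentences = [s.lower().split() for s in text.split(".") if s.strip()]

trigram_counts = Counter()
bigram_counts = Counter()
for tokens in sentences:
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts.update(zip(tokens, tokens[1:]))

count_tqr = trigram_counts[("the", "quick", "red")]  # "the quick red"
count_tq = bigram_counts[("the", "quick")]           # "the quick"
p_red = count_tqr / count_tq
print(count_tqr, count_tq, p_red)
```

“the quick red” occurs twice and “the quick” three times, so the estimate for “red” after “the quick” is 2/3; the remaining 1/3 goes to “black”.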
Test it in Google
• Google “the weather”
• How many results?
Test it in Google
• Google “the weather is”
• How many results?
Test it in Google
• Google “the weather out”
• How many results?
Test it in Google
• Google “weather the out”
• How many results?
Test it in Google
• Prob “out” given “the weather” = count of “the weather out” / count of “the weather”
Test it in Google
• Why so few results for “weather the out”?
Training and Testing
• Training set: the larger portion, e.g. 80-90% of the data
• Testing set: the smaller portion, e.g. 10-20% of the data
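A split like this could be sketched as follows. The function name `train_test_split`, the shuffle with a fixed seed, and the placeholder corpus are all assumptions for illustration; the slides do not say how the split is made.

```python
import random

def train_test_split(sentences, test_fraction=0.1, seed=0):
    """Shuffle a copy of the sentences and hold out a fraction for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder corpus of ten "sentences"; a real corpus would be much larger.
corpus = [f"sentence {i}" for i in range(10)]
train, test = train_test_split(corpus, test_fraction=0.2)
print(len(train), len(test))
```

N-gram counts would then be collected from `train` only, and the resulting probabilities evaluated on `test`.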