Learning Bit by Bit, Class 3 – Stemming and Tokenization
Posted on 20-Dec-2015
Morphology
• The study of the way words are constructed from smaller components
• Stems – “talk”
• Affixes – “ing”
Parsing
• Morphological Parsing – decomposing a word into its constituent morphemes
• Foxes -> fox + es
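The decomposition above can be sketched in a few lines of Python. This is only a toy illustration: the stem set and suffix list are hypothetical, and a real morphological parser would also need a full dictionary and spelling rules.

```python
# Hypothetical toy lexicon and suffix list, chosen only for illustration.
STEMS = {"fox", "cat", "talk", "walk"}
SUFFIXES = ["es", "s", "ing", "ed"]

def parse(word):
    """Split a word into [stem, suffix] using the toy lexicon.

    Returns [word] if the word is itself a known stem,
    and None if no decomposition is found.
    """
    word = word.lower()
    if word in STEMS:
        return [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in STEMS:
                return [stem, suffix]
    return None

print(parse("Foxes"))    # ['fox', 'es']
print(parse("talking"))  # ['talk', 'ing']
```

Longer suffixes are listed before shorter ones so that "es" is tried before "s".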
Morphological Parsing
• Must recognize proper words: “spelling”
• Must not recognize improper words: “computering”
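One way to meet both requirements is to tag each stem with a part of speech and allow only the suffixes that fit it. The sketch below assumes a hypothetical mini-lexicon and suffix table; it ignores spelling changes and is not how any particular library does it.

```python
# Hypothetical mini-lexicon: each stem is tagged with a part of speech.
LEXICON = {"spell": "verb", "talk": "verb", "computer": "noun", "fox": "noun"}

# Toy assumption: the suffixes each part of speech accepts
# ("" means the bare stem is a word on its own).
ALLOWED = {
    "verb": {"", "s", "ing", "ed"},
    "noun": {"", "s", "es"},
}

def is_word(word):
    """Return True if the word is a known stem plus a suffix its category allows."""
    word = word.lower()
    # Try every split point, longest candidate stem first.
    for cut in range(len(word), 0, -1):
        stem, suffix = word[:cut], word[cut:]
        pos = LEXICON.get(stem)
        if pos is not None and suffix in ALLOWED[pos]:
            return True
    return False

print(is_word("spelling"))     # True:  spell (verb) + ing
print(is_word("computering"))  # False: computer is a noun, so "ing" is rejected
```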
Morphological Parsing – Applications
• Web search
• Spell check, grammar check
• Machine translation
• Sentiment analysis
Porter Stemmer
• Returns the stem of each word
• Input: cats, output: cat
• Input: positivity, output: posit (a stem need not be a dictionary word)
• Input: pitted, output: pit
Porter Stemmer
• ATIONAL : ATE (relational -> relate)
• ING : ε (motoring -> motor)
• SSES : SS (grasses -> grass)
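The three rewrite rules above can be tried out with a small rule table. This is a simplified sketch, not the full Porter algorithm: it applies only these three rules in a single pass, treats only a, e, i, o, u as vowels, and keeps just one of Porter's conditions (the stem before "ing" must contain a vowel). The names `RULES` and `apply_rules` are made up for the example.

```python
VOWELS = set("aeiou")  # simplification: Porter also treats some "y"s as vowels

# (suffix, replacement, stem_must_contain_vowel)
RULES = [
    ("ational", "ate", False),
    ("sses", "ss", False),
    ("ing", "", True),
]

def apply_rules(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement, needs_vowel in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if needs_vowel and not any(ch in VOWELS for ch in stem):
                continue  # e.g. "sing": the stem "s" has no vowel, so keep the word
            return stem + replacement
    return word

for w in ["relational", "motoring", "grasses", "sing"]:
    print(w, "->", apply_rules(w))
# relational -> relate
# motoring -> motor
# grasses -> grass
# sing -> sing
```

The full stemmer runs several such rule groups in sequence, so its final output can be shorter than what a single rule produces.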
Tokenization
• IndoEuropean Tokenizer
• General purpose alphabetic
• Token = letters + numbers
• Splits on whitespace, punctuation, special characters
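The behavior described in these bullets can be approximated with a regular expression. This is a rough sketch that assumes ASCII letters and digits only; it is not the IndoEuropean tokenizer itself, which may treat punctuation and mixed tokens differently.

```python
import re

# A token is a maximal run of ASCII letters and digits; everything else
# (whitespace, punctuation, special characters) acts as a separator.
TOKEN = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Return the list of alphanumeric tokens in the text."""
    return TOKEN.findall(text)

print(tokenize("Mr. Smith's 2 foxes ran, didn't they?"))
# ['Mr', 'Smith', 's', '2', 'foxes', 'ran', 'didn', 't', 'they']
```

Note how splitting on the apostrophe leaves fragments such as "s" and "t" as separate tokens.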
Stop List
• EnglishStopTokenizerFactory:
• “a, be, had, it, only, she, was, about, because, has, its, of, some, we, after, been, have, last, on, such, were, all, but, he, more, one, than, when, also, by, her, most, or, that, which, an, can, his, mr, other, the, who, any, co, if, mrs, out, their, will, and, corp, in, ms, over, there, with, are, could, inc, mz, s, they, would, as, for, into, no, so, this, up, at, from, is, not, says, to”
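Filtering against this stop list is a simple set lookup. The sketch below copies the words from the list above; lowercasing each token before the lookup is an assumption made here, and the function name `remove_stop_words` is hypothetical.

```python
# Stop words copied from the list above.
STOP_WORDS = set("""
a be had it only she was about because has its of some we after been have
last on such were all but he more one than when also by her most or that
which an can his mr other the who any co if mrs out their will and corp in
ms over there with are could inc mz s they would as for into no so this up
at from is not says to
""".split())

def remove_stop_words(tokens):
    """Drop tokens whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "fox", "was", "in", "the", "box"]))
# ['fox', 'box']
```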