CIS 192 Python Programming
Intro to Natural Language Processing
Raymond Yin
University of Pennsylvania
November 2, 2016
Raymond Yin (University of Pennsylvania) CIS 192 November 2, 2016 1 / 27
Outline
1 Updates
    Final Project Proposal
2 Natural Language Processing (NLP)
    Motivation
    Tokenization
    Counting Word Frequencies
    Generating Random Sentences
    Part of Speech Tagging
    Free Word Association
    Sentiment Analysis
    Next Steps
Final Project Proposal
- Can work individually or with a partner
- ~10 hours of work per person
- Email me and the TAs a 150-400 word description and team members by this Sunday
- Demos during CIS Project Fair (Reading Days)
Natural Language Processing
source: researchperspectives.org
Language is Hard
How can a computer:
- Recognize parts of speech of sentences?
- Tell whether a sentence is positive or negative?
- Figure out what words are most commonly used?
- Summarize text?
Some Applications of NLP
- Sentiment analysis
- Spam filtering
- Plagiarism detection
- Document categorization
- Summarization
- Text search
- Much more...
Natural Language Tool Kit (NLTK)
- NLP toolkit for English in Python
- Developed at Penn in 2001!
- nltk.org
Terminology
- Corpus: a body of text
- Token: each meaningful "entity" in a string
    - Depending on context, tokens can be words, sentences, paragraphs
- Part of speech: categories that words are assigned to
    - noun, verb, adjective, ...
- Stopwords: most common words in a language, filtered out before NLP tasks
    - the, is, at, which, on, ...
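To make the stopword idea concrete, here is a minimal sketch of filtering stopwords out of a token list. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names made up for this example; the set is a tiny hand-picked sample, not a real stopword list.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# real stopword lists are much longer.
STOPWORDS = {"the", "is", "at", "which", "on", "of", "a", "an"}

def remove_stopwords(tokens):
    """Return the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "mitochondria", "is", "the", "powerhouse", "of", "the", "cell"]
content_words = remove_stopwords(tokens)
print(content_words)  # ['mitochondria', 'powerhouse', 'cell']
```

Lowercasing before the lookup means "The" and "the" are treated the same way.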
Word Tokenization
>>> import nltk
>>> nltk.word_tokenize('The mitochondria is the powerhouse of the cell.')
['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']
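For intuition about what a word tokenizer has to do, a rough approximation can be written with a regular expression. This is only a sketch, not how NLTK implements `word_tokenize`; `simple_word_tokenize` is a hypothetical helper name.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("The mitochondria is the powerhouse of the cell.")
print(tokens)
# ['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']
```

On this sentence the result matches the NLTK output shown on the slide, but the regex is much cruder in general: a contraction such as "wasn't" comes out as three pieces ('wasn', "'", 't').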
Sentence Tokenization
>>> sentences = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
...              "It was a great class...and I wasn't able to get off "
...              "the waitlist for CIS 677.")
>>> nltk.sent_tokenize(sentences)
['Prof. Sanjeev Khanna taught CIS 320 last spring.',
 "It was a great class...and I wasn't able to get off the waitlist for CIS 677."]
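Sentence tokenization is harder than splitting on periods. The sketch below applies a naive rule (end a sentence at '.', '!' or '?' followed by whitespace) to the same text; the rule is an assumption chosen for illustration and is not what `sent_tokenize` does.

```python
import re

text = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
        "It was a great class...and I wasn't able to get off "
        "the waitlist for CIS 677.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule yields three pieces, because it wrongly treats the abbreviation "Prof." as the end of a sentence. Handling abbreviations like this is part of what a real sentence tokenizer has to get right.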
Counting Words in a Corpus
Before today:
>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in words:
...     counts[word] += 1
Better:
>>> from nltk import FreqDist
>>> counts = FreqDist(words)
>>> counts.most_common(10)  # => [('the', 49), ...]
Neat!
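The same counting pattern is available in the standard library. `FreqDist` in NLTK 3 is built on `collections.Counter`, so a small sketch with `Counter` shows the idea without needing NLTK installed. The sample sentence is made up for this example.

```python
from collections import Counter

# Toy corpus, already split into word tokens.
words = "the cat sat on the mat and the dog sat too".split()

counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```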
Creating "random" sentences from a corpus
- In probability theory, Markov chains are "memoryless"
- "Future state depends on current state only"
- To create a "random" sentence:
    1. Take your current word
    2. Add a new word that typically appears after your current word
    3. Repeat!
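The three steps above can be sketched as a word-level Markov chain. This is one possible implementation under simple assumptions: the corpus is already tokenized, only the single previous word is used as state, and the helper names `build_chain` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10, rng=random):
    """Walk the chain from `start`, picking each next word at random."""
    words = [start]
    while len(words) < max_words and words[-1] in chain:
        # Repeated followers appear multiple times in the list, so
        # frequent successors are proportionally more likely to be picked.
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Generation stops early if it reaches a word that never had a successor in the corpus ("rug" here). Passing a seeded `random.Random` makes the output repeatable.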
Part of Speech Tagging
- Use nltk.pos_tag(list_of_tokens) to identify part of speech tags
- nltk.help.upenn_tagset() shows what each tag code means
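Tagging produces a list of (token, tag) pairs. The toy sketch below mimics that shape with a hand-written lookup table so it runs without NLTK or its data files. It is not how `nltk.pos_tag` works, since that relies on a trained model; `LEXICON` and `lookup_tag` are hypothetical names. The tag codes used (DT determiner, NN singular noun, VBZ third-person singular present verb) come from the Penn Treebank tagset.

```python
# Hypothetical mini-lexicon mapping lowercase words to Penn Treebank tags.
LEXICON = {"the": "DT", "dog": "NN", "cat": "NN", "chases": "VBZ", ".": "."}

def lookup_tag(tokens, default="NN"):
    """Tag each token from LEXICON, falling back to `default` if unknown."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tagged = lookup_tag(["The", "dog", "chases", "the", "cat", "."])
print(tagged)
# [('The', 'DT'), ('dog', 'NN'), ('chases', 'VBZ'), ('the', 'DT'), ('cat', 'NN'), ('.', '.')]
```

A lookup table cannot handle words whose tag depends on context (for example "run" as a noun or a verb), which is why real taggers look at the surrounding words.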
Free Word Association
- After hearing a word, what's the word that immediately comes to mind?
- "dog" → "cat" → "meow" → ...
- How would we do this?
- Simple way:
    - For each token in our corpus, count the occurrences of surrounding tokens
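The counting approach can be sketched as follows. The window size, the toy corpus, and the helper names `build_associations` and `associate` are assumptions made for this example.

```python
from collections import Counter, defaultdict

def build_associations(tokens, window=1):
    """For each token, count the tokens within `window` positions of it."""
    neighbors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbors[token][tokens[j]] += 1
    return neighbors

def associate(neighbors, word):
    """Return the most frequent neighbor of `word`, or None if unseen."""
    if word not in neighbors:
        return None
    return neighbors[word].most_common(1)[0][0]

tokens = "dog cat meow cat meow cat dog".split()
neighbors = build_associations(tokens)
print(associate(neighbors, "dog"))  # cat
print(associate(neighbors, "cat"))  # meow
```

On real text it usually helps to drop stopwords first, since otherwise words like "the" tend to dominate every neighbor count.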
Sentiment Analysis
- Is a particular text positive or negative (and to what degree)?
- How might we do this?
- Machine learning (last two lectures)
    - Try to "learn" the sentiment-relevant features of text
    - Needs lots of training data
    - Data-driven approach
- Rule-based methods
    - "Rule of thumb": uses heuristics to determine sentiment
    - Needs little training data
    - Good for production: fast, but harder to initially create
    - VADER: popular rule-based model aimed at social media
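To illustrate the rule-based idea, here is a deliberately tiny sketch: a word-score lexicon plus one heuristic that flips the sign of a word directly after a negation. It is not VADER and does not use VADER's lexicon or rules; the scores, `LEXICON`, `NEGATIONS`, and `sentiment_score` are all made up for this example.

```python
# Hypothetical word scores for illustration; real lexicons are far
# larger and carefully validated.
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(tokens):
    """Sum word scores, flipping the sign of a word right after a negation."""
    score = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token.lower(), 0.0)
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            value = -value
        score += value
    return score

print(sentiment_score("It was a great class".split()))       # 2.0
print(sentiment_score("The waitlist was not good".split()))  # -1.0
```

The sign of the score gives the polarity and its magnitude gives a rough degree, with no training data involved. The cost is that every rule and score has to be written by hand.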
Next Steps
- CIS 526: Machine Translation
- CIS 530: Computational Linguistics
- Kaggle for large datasets, competitions
- awesome-nlp: curated list of NLP resources on GitHub