CIS192 Python Programming

Intro to Natural Language Processing

Raymond Yin

University of Pennsylvania

November 2, 2016

Raymond Yin (University of Pennsylvania) CIS 192 November 2, 2016 1 / 27

Outline

1. Updates
    - Final Project Proposal

2. Natural Language Processing (NLP)
    - Motivation
    - Tokenization
    - Counting Word Frequencies
    - Generating Random Sentences
    - Part of Speech Tagging
    - Free Word Association
    - Sentiment Analysis
    - Next Steps

Final Project Proposal

- Can work individually or with a partner
- ~10 hours of work per person
- Email me and the TAs a 150-400 word description and team members by this Sunday
- Demos during CIS Project Fair (Reading Days)

Natural Language Processing

[Image; source: researchperspectives.org]

Language is Hard

How can a computer:
- Recognize parts of speech of sentences?
- Tell whether a sentence is positive or negative?
- Figure out what words are most commonly used?
- Summarize text?

Some Applications of NLP

- Sentiment analysis
- Spam filtering
- Plagiarism detection
- Document categorization
- Summarization
- Text search
- Much more...

Natural Language Tool Kit (NLTK)

- NLP toolkit for English in Python
- Developed at Penn in 2001!
- nltk.org

Terminology

- Corpus: a body of text
- Token: each meaningful "entity" in a string
    - Depending on context, tokens can be words, sentences, or paragraphs
- Part of Speech: categories that words are assigned to
    - noun, verb, adjective, ...
- Stopwords: most common words in a language, filtered out before NLP tasks
    - the, is, at, which, on, ...
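
As a rough illustration of tokens and stopwords, the sketch below splits a string into lowercase word tokens with a regular expression and then drops stopwords. The names `STOPWORDS`, `simple_tokens`, and `remove_stopwords` are made up for this example, and the stopword set is a tiny hand-picked one, not the list NLTK ships.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# real stopword lists are much longer.
STOPWORDS = {"the", "is", "at", "which", "on", "of", "a", "an", "and"}


def simple_tokens(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


tokens = simple_tokens("The mitochondria is the powerhouse of the cell.")
content_words = remove_stopwords(tokens)
print(content_words)
```

Filtering this way leaves only the words that carry most of the meaning, which is usually what frequency counts and similar analyses care about.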

Word Tokenization

>>> import nltk
>>> nltk.word_tokenize('The mitochondria is the powerhouse of the cell.')
['The', 'mitochondria', 'is', 'the', 'powerhouse', 'of', 'the', 'cell', '.']

Sentence Tokenization

>>> sentences = ("Prof. Sanjeev Khanna taught CIS 320 last spring. "
...              "It was a great class...and I wasn't able to get "
...              "off the waitlist for CIS 677.")
>>> nltk.sent_tokenize(sentences)
['Prof. Sanjeev Khanna taught CIS 320 last spring.',
 "It was a great class...and I wasn't able to get off the waitlist for CIS 677."]

Counting Words in a Corpus

Before today:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word in words:
...     counts[word] += 1

Better:

>>> from nltk import FreqDist
>>> counts = FreqDist(words)
>>> counts.most_common(10)  #=> [('the', 49), ...]

Neat!

Creating "random" sentences from a corpus

- In probability theory, Markov chains are "memoryless"
    - "Future state depends on current state only"
- To create a "random" sentence:
    - Take your current word
    - Add a new word that typically appears after your current word
    - Repeat!
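
The three steps above can be sketched in plain Python as follows. This is a minimal sketch, not the code used in lecture: the function names `build_chain` and `generate`, the `max_words` cutoff, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    """Map each token to the list of tokens that follow it in the corpus."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=random):
    """Walk the chain from `start`, picking each next word at random."""
    words = [start]
    while len(words) < max_words:
        followers = chain.get(words[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        words.append(rng.choice(followers))
    return " ".join(words)


# Toy corpus (hypothetical); a real corpus would be far larger.
corpus = "the cat sat on the mat and the cat ate the fish".split()
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Because followers are stored with repeats, a word that follows the current word more often is proportionally more likely to be picked, which is what "typically appears after" amounts to here. Only the current word is consulted at each step, matching the memoryless property.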

Part of Speech Tagging

- Use nltk.pos_tag(list_of_tokens) to identify part of speech tags
- nltk.help.upenn_tagset() shows what each tag code means
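
`nltk.pos_tag` returns a list of `(token, tag)` pairs, so a common follow-up is to filter tokens by tag. The sketch below assumes that shape and works on a hand-written sample rather than real tagger output; the helper name `tokens_with_tag_prefix` is hypothetical. In the Penn Treebank tagset, noun tags start with `NN` and verb tags with `VB`.

```python
# Hand-written sample in the (token, tag) shape that nltk.pos_tag returns.
# These tags are illustrative, not actual tagger output.
tagged = [
    ("The", "DT"), ("dog", "NN"), ("chased", "VBD"),
    ("two", "CD"), ("cats", "NNS"), (".", "."),
]


def tokens_with_tag_prefix(tagged_tokens, prefix):
    """Return the tokens whose tag starts with `prefix` (e.g. "NN")."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]


nouns = tokens_with_tag_prefix(tagged, "NN")
verbs = tokens_with_tag_prefix(tagged, "VB")
print(nouns, verbs)
```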

Free Word Association

- After hearing a word, what’s the word that immediately comes to mind?
    - "dog" → "cat" → "meow" → ...
- How would we do this?

- Simple way:
    - For each token in our corpus, count the occurrences of surrounding tokens
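
One way to read "count the occurrences of surrounding tokens" is a windowed co-occurrence count. The sketch below is an assumption about how that might look, not the lecture's implementation: `cooccurrence_counts`, `associate`, the `window` parameter, and the toy token list are all made up for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """For each token, count the tokens within `window` positions of it."""
    neighbors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbors[token][tokens[j]] += 1
    return neighbors


def associate(neighbors, word):
    """Return the most frequent neighbor of `word`, or None if unseen."""
    if word not in neighbors or not neighbors[word]:
        return None
    return neighbors[word].most_common(1)[0][0]


# Toy token list (hypothetical), already free of stopwords.
tokens = "dog cat meow cat dog bark dog cat".split()
neighbors = cooccurrence_counts(tokens, window=1)
print(associate(neighbors, "dog"))
```

On a real corpus, stopwords such as "the" would dominate every neighbor count, so they would normally be filtered out before counting.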

Sentiment Analysis

- Is a particular text positive or negative (and to what degree)?
- How might we do this?

- Machine learning (last two lectures)
    - Try to "learn" the sentiment-relevant features of text
    - Needs lots of training data
    - Data-driven approach
- Rule-based methods
    - "Rule of thumb": uses heuristics to determine sentiment
    - Needs little training data
    - Good for production: fast, but harder to create initially
    - VADER: popular rule-based model aimed at social media
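
To make the rule-based idea concrete, here is a toy scorer with a word lexicon and a single negation rule. It is a deliberately simplified sketch and is not VADER: the `LEXICON` values, the `NEGATIONS` set, and the function `sentiment_score` are all invented for this example.

```python
import re

# Hypothetical mini-lexicon for illustration; real rule-based models
# use far larger, human-rated lexicons and many more rules.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum word scores, flipping the sign of a word right after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


print(sentiment_score("I love this class, it is great"))
print(sentiment_score("The food was not good"))
```

A positive total suggests positive sentiment and a negative total the opposite, with the magnitude giving a crude sense of degree. No training data is involved, but every rule has to be written and tuned by hand, which is the trade-off the slide describes.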

Next Steps

- CIS 526: Machine Translation
- CIS 530: Computational Linguistics
- Kaggle for large datasets, competitions
- awesome-nlp: curated list of NLP resources on GitHub
