Words, words, words: Reading Shakespeare with Python

Upload: adam-palay

Post on 16-Jul-2015

TRANSCRIPT

Page 1: Words, Words, Words: Reading Shakespeare with Python

Words, words, words: Reading Shakespeare with Python

Prologue

Motivation

How can we use Python to supplement our reading of Shakespeare?

How can we get Python to read for us?

Act I

Why Shakespeare?

Polonius: What do you read, my lord?
Hamlet: Words, words, words.
P: What is the matter, my lord?
H: Between who?
P: I mean, the matter that you read, my lord.
--II.2.184

Shakespeare XML

Challenges

• Language, especially English, is messy

• Texts are usually unstructured

• Pronunciation is not standard

• Reading is pretty hard!

Humans and Computers

Humans are good at:

• Nuance
• Ambiguity
• Close reading

Computers are good at:

• Counting
• Repetitive tasks
• Making graphs

Act II

(leveraging metadata)

Who is the main character in _______?
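One way to answer this from the XML metadata is to count lines per speaker. The sketch below is a minimal illustration, not the talk's actual code. It assumes a markup with SPEECH, SPEAKER and LINE elements (an assumption; the slides do not show the schema), and SAMPLE_XML and count_lines_by_speaker are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hypothetical sample; the element names are an assumed schema.
SAMPLE_XML = """
<PLAY>
  <SPEECH><SPEAKER>POLONIUS</SPEAKER>
    <LINE>What do you read, my lord?</LINE></SPEECH>
  <SPEECH><SPEAKER>HAMLET</SPEAKER>
    <LINE>Words, words, words.</LINE></SPEECH>
  <SPEECH><SPEAKER>POLONIUS</SPEAKER>
    <LINE>What is the matter, my lord?</LINE></SPEECH>
  <SPEECH><SPEAKER>HAMLET</SPEAKER>
    <LINE>Between who?</LINE></SPEECH>
  <SPEECH><SPEAKER>POLONIUS</SPEAKER>
    <LINE>I mean, the matter that you read, my lord.</LINE></SPEECH>
</PLAY>
"""


def count_lines_by_speaker(xml_text):
    """Return a Counter mapping each speaker to the number of lines spoken."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for speech in root.iter("SPEECH"):
        n_lines = len(speech.findall("LINE"))
        # A speech may list several speakers; credit each of them.
        for speaker in speech.findall("SPEAKER"):
            if speaker.text:
                counts[speaker.text.strip()] += n_lines
    return counts


line_counts = count_lines_by_speaker(SAMPLE_XML)
print(line_counts.most_common())  # [('POLONIUS', 3), ('HAMLET', 2)]
```

Run over a whole play rather than this five-line excerpt, the speaker at the top of most_common() is the candidate for "main character".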

Who is the main character in Hamlet?

[Chart: Number of Lines]

Who is the main character in King Lear?

[Chart: Number of Lines]

Who is the main character in Macbeth?

[Chart: Number of Lines]

Who is the main character in Othello?

[Chart: Number of Lines]

Iago and Othello, Detail

[Chart: Number of Lines]

Obligatory Social Network

Act III

First steps with natural language processing (NLP)

What are Shakespeare’s most interesting rhymes?

Shakespeare’s Sonnets

• A sonnet is a 14-line poem
• There are many different rhyme schemes a sonnet can have; Shakespeare was unusual in consistently choosing one
• This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis

Sonnet 18

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
  So long as men can breathe or eyes can see,
  So long lives this, and this gives life to thee.

Rhyme scheme: ababcdcdefefgg

http://www.poetryfoundation.org/poem/174354
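Because the rhyme scheme is fixed, rhyming words can be paired up by position alone, with no pronunciation data. A minimal sketch, assuming one sonnet per string and that the rhyme word is simply the last word of each line; last_word and rhyme_pairs are hypothetical helper names, not the talk's code.

```python
import string

RHYME_SCHEME = "ababcdcdefefgg"

SONNET_18 = """Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow'st;
Nor shall death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st:
  So long as men can breathe or eyes can see,
  So long lives this, and this gives life to thee."""


def last_word(line):
    """Return the final word of a line, lowercased, without edge punctuation."""
    return line.split()[-1].strip(string.punctuation).lower()


def rhyme_pairs(sonnet, scheme=RHYME_SCHEME):
    """Group line-ending words by the letter assigned to each line."""
    lines = [line for line in sonnet.splitlines() if line.strip()]
    if len(lines) != len(scheme):
        raise ValueError("expected %d lines, got %d" % (len(scheme), len(lines)))
    groups = {}
    for letter, line in zip(scheme, lines):
        groups.setdefault(letter, []).append(last_word(line))
    # One tuple per rhyme letter, in alphabetical order of the letters.
    return [tuple(words) for _, words in sorted(groups.items())]


pairs = rhyme_pairs(SONNET_18)
print(pairs[0])   # ('day', 'may')
print(pairs[-1])  # ('see', 'thee')
```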

Rhyme Distribution

Frequency Distribution

• Most common rhymes
• nltk.FreqDist

Conditional Frequency Distribution

• Given a word, what is the frequency distribution of the words that rhyme with it?
• nltk.ConditionalFreqDist
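The two kinds of distribution can be sketched with the standard library alone. Here collections.Counter and a dict of Counters stand in for NLTK's frequency distribution and conditional frequency distribution classes (a swapped-in technique, chosen so the example has no dependencies). The rhyme pairs are a small hypothetical sample, not counts from the sonnets.

```python
from collections import Counter, defaultdict

# Hypothetical sample of rhyme pairs, as (word, word) tuples.
sample_pairs = [
    ("me", "thee"),
    ("thee", "me"),
    ("thee", "usury"),
    ("day", "may"),
    ("see", "thee"),
]

# Frequency distribution: how often does each rhyme occur?
# Sorting each pair makes ("thee", "me") count as the same rhyme as ("me", "thee").
rhyme_freq = Counter(tuple(sorted(pair)) for pair in sample_pairs)

# Conditional frequency distribution: given a word, which words rhyme with it?
rhymes_given_word = defaultdict(Counter)
for first, second in sample_pairs:
    rhymes_given_word[first][second] += 1
    rhymes_given_word[second][first] += 1

print(rhyme_freq.most_common(1))                 # [(('me', 'thee'), 2)]
print(rhymes_given_word["thee"].most_common(1))  # [('me', 2)]
print(dict(rhymes_given_word["usury"]))          # {'thee': 1}
```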

[Charts: Rhyme Distribution]

Interesting Rhymes?

1) “Boring” rhymes: “me” and “thee”
2) “Lopsided” rhymes: “thee” and “usury”

Act IV

Classifiers 101

Writing code that reads

Our Classifier

Can we write code to tell if a given speech is from a tragedy or comedy?

Classifiers: overview

● Requires labeled text
    ○ (in this case, speeches labeled by genre)
    ○ [(<speech>, <genre>), ...]
● Requires “training”
● Predicts labels of text

Classifiers: ingredients

● Classifier
● Vectorizer, or Feature Extractor
● Classifiers only interact with features, not the text itself

Vectorizers (or Feature Extractors)

● A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
● Theoretically, these features could be anything, e.g.:
    ○ How many capital letters does the text contain?
    ○ Does the text end with an exclamation point?
● In practice, a common model is “Bag of Words”.
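The two example features above can be written as a tiny hand-rolled feature extractor. This is only a sketch; extract_features and its dictionary keys are hypothetical names chosen for illustration.

```python
def extract_features(text):
    """Turn a text into a small dictionary of quantifiable features."""
    return {
        # How many capital letters does the text contain?
        "capital_letters": sum(1 for ch in text if ch.isupper()),
        # Does the text end with an exclamation point (ignoring trailing spaces)?
        "ends_with_exclamation": text.rstrip().endswith("!"),
    }


features = extract_features("Hello, Will!")
print(features)  # {'capital_letters': 2, 'ends_with_exclamation': True}
```

A classifier would then see only this dictionary, never the sentence itself.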

Bag of Words

Bag of Words is a kind of feature extraction where:

● The set of features is the set of all words in the texts you’re analyzing
● A single text is represented by how many times each word appears in it

Bag of Words: Simple Example

Two texts:

● “Hello, Will!”
● “Hello, Globe!”

Bag: [“Hello”, “Will”, “Globe”]

                  “Hello”   “Will”   “Globe”
“Hello, Will”        1         1         0
“Hello, Globe”       1         0         1

“Hello, Will” → a text that contains one instance of the word “Hello”, contains one instance of the word “Will”, and does not contain the word “Globe”. (Less readable for us, more readable for computers!)
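The same example can be reproduced in a few lines of plain Python. This is a hand-rolled sketch for illustration; tokenize, build_bag and vectorize are hypothetical helpers, and the simple regular expression is an assumption about what counts as a word.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into words, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)


def build_bag(texts):
    """Collect every distinct word across the texts, in first-seen order."""
    bag = []
    for text in texts:
        for word in tokenize(text):
            if word not in bag:
                bag.append(word)
    return bag


def vectorize(text, bag):
    """Represent a text as the count of each bag word it contains."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in bag]


texts = ["Hello, Will!", "Hello, Globe!"]
bag = build_bag(texts)
vectors = [vectorize(text, bag) for text in texts]

print(bag)      # ['Hello', 'Will', 'Globe']
print(vectors)  # [[1, 1, 0], [1, 0, 1]]
```

scikit-learn's CountVectorizer does the same job, but by default it lowercases the words and orders the vocabulary alphabetically, so its columns would not appear in this first-seen order.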

Live Vectorizer:

Why are these called “Vectorizers”?

text_1 = "words, words, words"
text_2 = "words, words, birds"

[Plot: text_1 and text_2 drawn as points; the axes are # times “words” is used and # times “birds” is used]
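Each text becomes a point whose coordinates are its word counts, which is what makes it a vector. A minimal sketch of the two points, assuming the vocabulary is fixed to the two words on the axes; count_vector is a hypothetical helper.

```python
import re
from collections import Counter

text_1 = "words, words, words"
text_2 = "words, words, birds"


def count_vector(text, vocabulary):
    """Return the count of each vocabulary word in the text, as a tuple."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return tuple(counts[word] for word in vocabulary)


vocabulary = ("words", "birds")
point_1 = count_vector(text_1, vocabulary)
point_2 = count_vector(text_2, vocabulary)

print(point_1)  # (3, 0): three uses of "words", none of "birds"
print(point_2)  # (2, 1): two uses of "words", one of "birds"
```

With a real vocabulary the points have thousands of coordinates instead of two, but the idea is the same.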

Act V

Putting it all Together

Classifier Workflow

Classification: Steps

1) Split pre-labeled text into training and testing sets
2) Vectorize text (extract features)
3) Train classifier
4) Test classifier

Text → Features → Labels
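Step 1 can be sketched with the standard library. This assumes the labeled data is a list of (speech, genre) pairs; split_labeled, the 25% test fraction and the placeholder speeches are all assumptions made for illustration, not the talk's actual data or split.

```python
import random


def split_labeled(labeled, test_fraction=0.25, seed=0):
    """Shuffle (speech, label) pairs and split them into train and test lists."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    train_speeches = [speech for speech, _ in train]
    train_labels = [label for _, label in train]
    test_speeches = [speech for speech, _ in test]
    test_labels = [label for _, label in test]
    return train_speeches, train_labels, test_speeches, test_labels


# Placeholder data: real input would be speeches taken from the plays.
labeled = [("placeholder tragedy speech %d" % i, "tragedy") for i in range(4)]
labeled += [("placeholder comedy speech %d" % i, "comedy") for i in range(4)]

train_speeches, train_labels, test_speeches, test_labels = split_labeled(labeled)
print(len(train_speeches), len(test_speeches))  # 6 2
```

Shuffling before splitting matters here: without it, the test set would contain only whichever genre happens to come first in the list.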

Training

Classifier Training

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Learn the vocabulary (the "bag") from the training speeches
vectorizer = CountVectorizer()
vectorizer.fit(train_speeches)

# Turn each training speech into a vector of word counts
train_features = vectorizer.transform(train_speeches)

# Train a Naive Bayes classifier on the features and their genre labels
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)

Testing

Classifier Testing

test_speech = test_speeches[0]
print(test_speech)

Farewell, Andronicus, my noble father,
The woefull'st man that ever liv'd in Rome.
Farewell, proud Rome, till Lucius come again;
He loves his pledges dearer than his life.
...

(From Titus Andronicus, III.1.288-300)

Classifier Testing

test_speech = test_speeches[0]
test_label = test_labels[0]

# Vectorize with the vocabulary learned during training
test_features = vectorizer.transform([test_speech])
prediction = classifier.predict(test_features)[0]

print(prediction)
# prints: tragedy
print(test_label)
# prints: tragedy

Classifier Testing

# Vectorize every test speech and measure accuracy on the whole test set
test_features = vectorizer.transform(test_speeches)
print(classifier.score(test_features, test_labels))
# prints: 0.75427682737169521

Critiques

• "Bag of Words" assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
• Beware of highly disproportionate training data.
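One way to see why disproportionate data is a problem: a "classifier" that ignores the text and always predicts the most common training label can still score well. The sketch below uses made-up label counts purely for illustration (they are not the talk's data), and majority_baseline_accuracy is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Score a baseline that always predicts the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / float(len(test_labels))


# Hypothetical, deliberately lopsided label counts.
train_labels = ["tragedy"] * 90 + ["comedy"] * 10
test_labels = ["tragedy"] * 9 + ["comedy"] * 1

label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(label, accuracy)  # tragedy 0.9
```

Comparing a real classifier's score against this kind of baseline shows how much of its accuracy comes from the words and how much from the label imbalance.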

Epilogue