Words, Words, Words: Reading Shakespeare with Python


Posted on 16-Jul-2015







  • Words, Words, Words: Reading Shakespeare with Python

  • Prologue

  • Motivation

    How can we use Python to supplement our reading of Shakespeare? How can we get Python to read for us?

  • Act I

  • Why Shakespeare?

    Polonius: What do you read, my lord?
    Hamlet: Words, words, words.
    P: What is the matter, my lord?
    H: Between who?
    P: I mean, the matter that you read, my lord. --II.2.184

  • Why Shakespeare? (Also the XML)

    (thank you, https://github.com/severdia/PlayShakespeare.com-XML !!!)

  • Shakespeare XML

  • Shakespeare XML

  • Challenges

    Language, especially English, is messy. Texts are usually unstructured. Pronunciation is not standard. Reading is pretty hard!

  • Humans and Computers



    Humans are good at: close reading
    Computers are good at: repetitive tasks, making graphs

  • Act II

  • (leveraging metadata)

    Who is the main character in _______?

  • Who is the main character in Hamlet?

    (Bar chart: number of lines per character)

  • Who is the main character in King Lear?

    (Bar chart: number of lines per character)

  • Who is the main character in Macbeth?

    (Bar chart: number of lines per character)

  • Who is the main character in Othello?

    (Bar chart: number of lines per character)

  • Iago and Othello, Detail

    (Bar chart: number of lines per character, detail)
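    The line counts behind these charts can be pulled straight out of the play XML. A minimal sketch, using a toy stand-in document: the real PlayShakespeare.com-XML schema may name its elements differently, so the tag names here (`speech`, `speaker`, `line`) are assumptions for illustration.

    ```python
    from collections import Counter
    from xml.etree import ElementTree as ET

    # Toy stand-in for a PlayShakespeare-style play file; the real schema
    # (github.com/severdia/PlayShakespeare.com-XML) may differ in detail.
    SAMPLE = """
    <play>
      <speech><speaker>HAMLET</speaker>
        <line>Words, words, words.</line>
      </speech>
      <speech><speaker>POLONIUS</speaker>
        <line>What is the matter, my lord?</line>
      </speech>
      <speech><speaker>HAMLET</speaker>
        <line>Between who?</line>
        <line>I mean nothing, my lord.</line>
      </speech>
    </play>
    """

    def lines_per_speaker(xml_text):
        """Count <line> elements under each <speech>, keyed by speaker."""
        root = ET.fromstring(xml_text)
        counts = Counter()
        for speech in root.iter("speech"):
            counts[speech.findtext("speaker")] += len(speech.findall("line"))
        return counts

    # The speaker with the most lines is a decent proxy for "main character".
    print(lines_per_speaker(SAMPLE).most_common())
    ```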

  • Obligatory Social Network

  • Act III

  • First steps with natural language processing (NLP)

    What are Shakespeare's most interesting rhymes?

  • Shakespeare's Sonnets

    A sonnet is a 14-line poem. There are many different rhyme schemes a sonnet can have; Shakespeare was pretty unique in choosing one. This is a huge win for us, since we can hard-code his rhyme scheme in our analysis.

  • Sonnet 18

    Shall I compare thee to a summer's day?
    Thou art more lovely and more temperate:
    Rough winds do shake the darling buds of May,
    And summer's lease hath all too short a date;
    Sometime too hot the eye of heaven shines,
    And often is his gold complexion dimm'd;
    And every fair from fair sometime declines,
    By chance or nature's changing course untrimm'd;
    But thy eternal summer shall not fade,
    Nor lose possession of that fair thou ow'st;
    Nor shall death brag thou wander'st in his shade,
    When in eternal lines to time thou grow'st:
      So long as men can breathe or eyes can see,
      So long lives this, and this gives life to thee.
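    Because the Shakespearean scheme is fixed (ABAB CDCD EFEF GG), the pairs of lines that should rhyme can be read straight off the line indices. A minimal sketch: it only extracts end words, and deciding whether two words actually rhyme would still need pronunciation data, which this sketch skips.

    ```python
    # Shakespearean sonnets rhyme ABAB CDCD EFEF GG, so the rhyming line
    # positions are fixed and can be hard-coded (0-indexed pairs):
    RHYME_PAIRS = [(0, 2), (1, 3), (4, 6), (5, 7), (8, 10), (9, 11), (12, 13)]

    SONNET_18 = """\
    Shall I compare thee to a summer's day?
    Thou art more lovely and more temperate:
    Rough winds do shake the darling buds of May,
    And summer's lease hath all too short a date;
    Sometime too hot the eye of heaven shines,
    And often is his gold complexion dimm'd;
    And every fair from fair sometime declines,
    By chance or nature's changing course untrimm'd;
    But thy eternal summer shall not fade,
    Nor lose possession of that fair thou ow'st;
    Nor shall death brag thou wander'st in his shade,
    When in eternal lines to time thou grow'st:
    So long as men can breathe or eyes can see,
    So long lives this, and this gives life to thee."""

    def rhyme_words(sonnet):
        """Return the (word, word) pairs the rhyme scheme says should rhyme."""
        lines = sonnet.strip().splitlines()
        last = [ln.strip().rstrip(".,;:?!").split()[-1].lower() for ln in lines]
        return [(last[i], last[j]) for i, j in RHYME_PAIRS]

    for a, b in rhyme_words(SONNET_18):
        print(a, "/", b)   # day / may, temperate / date, ...
    ```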

  • Rhyme Distribution

    Frequency Distribution: most common rhymes (nltk.FreqDist)

    Conditional Frequency Distribution: given a word, what is the frequency distribution of the words that rhyme with it? (nltk.ConditionalFreqDist)
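    Both distributions can be emulated with the standard library, which keeps the idea visible without pulling in nltk: nltk's FreqDist behaves essentially like a Counter, and ConditionalFreqDist like a dict of Counters. The rhyme pairs below are a toy sample, not real sonnet output.

    ```python
    from collections import Counter, defaultdict

    # Toy (word, rhyming word) pairs, standing in for real sonnet data.
    pairs = [("thee", "me"), ("thee", "usury"), ("day", "may"),
             ("me", "thee"), ("day", "away")]

    # nltk.FreqDist is essentially a Counter over observations.
    freq = Counter(word for pair in pairs for word in pair)
    print(freq.most_common(2))

    # nltk.ConditionalFreqDist is essentially a dict of Counters:
    # condition (a word) -> distribution of the words that rhyme with it.
    cond_freq = defaultdict(Counter)
    for word, rhyme in pairs:
        cond_freq[word][rhyme] += 1
    print(cond_freq["thee"].most_common())
    ```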


  • Interesting Rhymes?

    1) Boring rhymes: me and thee
    2) Lopsided rhymes: thee and usury

  • Act IV

  • Classifiers 101

    Writing code that reads

  • Our Classifier

    Can we write code to tell if a given speech is from a tragedy or comedy?

  • Classifiers: overview

    Requires labeled text (in this case, speeches labeled by genre): [(speech, label), ...]
    Requires training
    Predicts labels of text

  • Classifiers: ingredients

    Ingredients: a Vectorizer (or Feature Extractor) and a Classifier.

    Classifiers only interact with features, not the text itself.

  • Vectorizers (or Feature Extractors)

    A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.

    Theoretically, these features could be anything, e.g.: How many capital letters does the text contain? Does the text end with an exclamation point?

    In practice, a common model is Bag of Words.

  • Bag of Words

    Bag of Words is a kind of feature extraction where:
    the set of features is the set of all words in the text you're analyzing, and
    a single text is represented by how many times each word appears in it.


  • Bag of Words: Simple Example

    Two texts: "Hello, Will!" and "Hello, Globe!"

    Bag: [Hello, Will, Globe]

                   Hello  Will  Globe
    Hello, Will      1     1     0
    Hello, Globe     1     0     1

    "Hello, Will": a text that contains one instance of the word Hello, one instance of the word Will, and no instances of the word Globe. (Less readable for us, more readable for computers!)
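    The table above can be reproduced in a few lines of pure Python; sklearn's CountVectorizer, which appears later in the talk, does the same job with tokenization options and sparse output. A minimal sketch:

    ```python
    import re

    texts = ["Hello, Will!", "Hello, Globe!"]

    def tokenize(text):
        """Lowercased words only; punctuation is dropped."""
        return re.findall(r"[a-z']+", text.lower())

    # The "bag": every word seen across all texts, in first-seen order.
    vocab = []
    for text in texts:
        for word in tokenize(text):
            if word not in vocab:
                vocab.append(word)

    # Each text becomes a vector of word counts over the vocabulary.
    vectors = [[tokenize(text).count(word) for word in vocab] for text in texts]

    print(vocab)    # ['hello', 'will', 'globe']
    print(vectors)  # [[1, 1, 0], [1, 0, 1]]
    ```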

  • Live Vectorizer:

  • Why are these called Vectorizers?

    text_1 = "words, words, words"
    text_2 = "words, words, birds"

    (Scatter plot: each text plotted as a point, one axis counting # times "words" is used, the other # times "birds" is used)

  • Act V

  • Putting it all Together

    Classifier Workflow

  • Classification: Steps

    1) Split pre-labeled text into training and testing sets

    2) Vectorize text (extract features)
    3) Train classifier
    4) Test classifier

    (Pipeline: Text → Features → Labels)
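    Step 1 can be done with the standard library alone; a minimal sketch with toy data (in the talk these would be speeches labeled by genre, and sklearn's train_test_split is the usual shortcut). The variable names train_speeches, train_labels, etc. match the training slides that follow.

    ```python
    import random

    # Toy labeled data standing in for (speech, genre) pairs.
    labeled = [(f"speech {i}", "tragedy" if i % 2 else "comedy")
               for i in range(10)]

    random.seed(0)           # deterministic shuffle for the example
    random.shuffle(labeled)  # avoid ordering bias before splitting

    split = int(len(labeled) * 0.8)  # 80/20 train/test split
    train, test = labeled[:split], labeled[split:]

    train_speeches, train_labels = zip(*train)
    test_speeches, test_labels = zip(*test)

    print(len(train_speeches), len(test_speeches))  # 8 2
    ```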

  • Training

  • Classifier Training

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = CountVectorizer()
    vectorizer.fit(train_speeches)
    train_features = vectorizer.transform(train_speeches)

    classifier = MultinomialNB()
    classifier.fit(train_features, train_labels)

  • Testing

  • Classifier Testing

    test_speech = test_speeches[0]
    print(test_speech)

    Farewell, Andronicus, my noble father,
    The woefull'st man that ever liv'd in Rome.
    Farewell, proud Rome, till Lucius come again;
    He loves his pledges dearer than his life.
    ...
    (From Titus Andronicus, III.1.288-300)

  • Classifier Testing

    test_speech = test_speeches[0]
    test_label = test_labels[0]
    test_features = vectorizer.transform([test_speech])
    prediction = classifier.predict(test_features)[0]

    print(prediction)
    >>> 'tragedy'
    print(test_label)
    >>> 'tragedy'

  • Classifier Testing

    test_features = vectorizer.transform(test_speeches)
    print(classifier.score(test_features, test_labels))
    >>> 0.75427682737169521

  • Critiques

    "Bag of Words" assumes a correlation between word use and label. This correlation is stronger in some cases than in others. Beware of highly-disproportionate training data.

  • Epilogue

  • adampalay@gmail.com
    @adampalay


    Thank you!

