Intro to Practical Natural Language Processing – Wharton Data Camp Sessions 8

Upload: gaenor

Post on 05-Jan-2016


TRANSCRIPT

Page 1: Intro to Practical Natural Language Processing – Wharton Data Camp Sessions 8

Intro to Practical Natural Language Processing

Wharton Data Camp Sessions 8

Agenda:
1) Tasks in NLP
2) Use NLTK

Page 2

Quick Overview of Resources For Machine Learning, NLP, and Econometrics:

Wharton Specific and Books

Page 3

Machine Learning/ NLP classes

• CIS 520 Machine Learning
• CIS 530 NLP
• CIS 630 Machine Learning for NLP
• There are more classes for theory in STAT/CIS

Page 4

Awesome Machine Learning Books!

• The Elements of Statistical Learning – Hastie and Tibshirani – ML bible 1
• Pattern Recognition and Machine Learning – Chris Bishop – ML bible 2

IF YOU WANT A DEEP UNDERSTANDING OF THE MATERIAL:
• Statistical Learning Theory – theory-of-ML bible 1
• A Probabilistic Theory of Pattern Recognition – theory-of-ML bible 2

Page 5

If you are going to do any sort of empirical work

• STAT 500 – if you have never taken a course in econometrics
• STAT 520 – basic econometrics
• STAT 521 – use of R for applied econometrics (this course went through a major change)
• STAT 541 – Andreas Buja on multivariate statistics and writing (there is one and only one required textbook in this course, and it’s a writing book)
• STAT 542 – Shane Jensen, Bayesian statistics (Jensen is the man)
• STAT 921 – Dylan Small, observational studies (required if you are doing any empirical work)
• ECON 705-706 for theory

Page 6

Subjective Econometrics Book Recommendations

• William H. Greene is great
• “Mostly Harmless Econometrics” is great
• Edward Frees’ Longitudinal and Panel Data: Analysis and Applications in the Social Sciences is one of my favorite econometrics books
• Lots more depending on usage, but ask me separately

Page 7

ML in Business & Combining the two

• Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (for intro & overview) http://data-science-for-biz.com/
  – Foster Provost: great researcher in IS at NYU
  – 72 reviews on Amazon – 4.7 average!
• Targeted Learning – Springer Series in Statistics (AKA the serious series)
  – Incorporates machine learning into causal inference
  – UC Berkeley statisticians

Page 8

Good Quick Cook-book style NLP books

• http://www.nltk.org/
• http://nltk.org/book/ – FREE book online
• Jurafsky & Martin, “Speech and Language Processing”, for deep theory
• Bing Liu’s two books: http://www.cs.uic.edu/~liub/

Page 9

There are many tasks that NLP can do and many are hard

• Machine translation – very hard
  – http://translationparty.com/ – funny
  – Hilarious video (Fresh Prince of Bel-Air theme after it was translated several times into different languages): http://www.youtube.com/watch?v=LMkJuDVJdTw
• Sentiment detection
• Automatic summarization
• Etc.

Page 10

Today

• Supervised Learning + NLP
  – Identifying certain content (this is what we will probably use the most): content-coding
  – A research example
  – Sentiment analysis example

Page 11

Supervised Learning + NLP

Given:
– a set of texts (a corpus),
– and labels (together comprising the training set)
– A label can be:
  • certain content exists
  • negative/positive sentiment
  • etc.

Goal:
– create an algorithm that mimics the labels

Page 12

Imagine a task

• You are an NSA agent, OR you are a hacker
• You are given a job, OR you are on a mission and are looking for fellow hackers
• Train an NLP algorithm to tell whether a sentence or short text on the internet contains any planning of hacking/DDoS attacks
• “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”

Page 13

What do we, as humans, do to recognize that the content is present?

• “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”

• Keywords: target, movement, anons, RIAA, not, tolerate
• Bigrams: new target, our movement, against RIAA, not tolerate
• Use of upper case and “!”
• Etc.
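These human cues translate directly into machine-readable features. A minimal pure-Python sketch (the keyword watch-list and feature names here are made up for this toy task; a real project would use NLTK and a much richer feature set):

```python
import re
from collections import Counter

def extract_features(text):
    """Turn a short text into the kinds of attributes a human notices:
    keyword hits, bigrams, upper-case usage, and exclamation marks."""
    # Hypothetical watch-list of keywords for this toy task
    keywords = {"target", "movement", "anons", "riaa", "not", "tolerate"}
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    bigrams = list(zip(lowered, lowered[1:]))
    return {
        "keyword_hits": sum(1 for t in lowered if t in keywords),
        "bigrams": Counter(bigrams),
        "all_caps_tokens": sum(1 for t in tokens if t.isupper() and len(t) > 1),
        "exclamations": text.count("!"),
    }

msg = ("Greetings, fellow anons, we have a new target in our movement "
       "against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!")
feats = extract_features(msg)
```

Each entry of `feats` becomes one x-variable for the classifier later on.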

Page 14

Narrow and Specific NLP Example

• I can only show you one very specific example of NLP today

• You need to take at least a machine learning course and an NLP course to be able to do this type of processing comfortably – two courses will probably suffice for applying ML + NLP in your research

Page 15

Overview of 1 Example in NLP: Identifying certain content in text

e.g., positive/negative sentiment

1. Find text data (short text or a sentence – a review, for example)
2. Break the sentence down into basic building blocks using NLP techniques (I’ll show how – the outcome is an ordered list of building blocks)
3. Process the ordered list of building blocks and come up with many sentence-level patterns – these will be the x-variables or sentence-level attributes (e.g., content = “positive review or not”)
   – Count the occurrences of the word “great”: X1
   – Count the number of laudatory words: X2
   – Etc. Recording certain patterns: Xn
4. Obtain text data with labels (positive or not): this is called the gold set and comes with y-variable {positive, negative} tags
5. Use machine learning techniques on the gold set to learn the relationship between the X-variables from step 3 and the Y-variable from step 4. This part is training the machine learning algorithm.

Page 16

Basic idea in NLP: identifying certain content in text

1. Find text data
2. Breaking Sentence: break the sentence down into understandable building blocks (e.g., words or lemmas)
3. Sentence Attribute Generation: identify different sentence attributes just as humans do when reading (many to be explained)
4. Gold Set Generation: obtain a set of training sentences with labels identifying whether the sentences do or do not contain certain content, from a reliable source (gold data set)
5. Training: use statistical tools to infer which sentence attributes are correlated with certain content outcomes, thereby “learning” to identify content in sentences.
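The five steps above can be sketched end to end in a few lines of plain Python. The gold-set sentences and the counting-based “training” are illustrative stand-ins for a real corpus and a real classifier, not the method any particular paper used:

```python
import re
from collections import Counter

# Step 2: break a sentence into lowercase word tokens (toy tokenizer)
def break_sentence(text):
    return re.findall(r"[a-z']+", text.lower())

# Step 3: generate sentence attributes (here: a counted bag of words)
def attributes(tokens):
    return Counter(tokens)

# Step 4: a tiny hand-labeled gold set (hypothetical examples)
gold = [
    ("This camera is great, simply great", "positive"),
    ("Great battery and a great screen", "positive"),
    ("Terrible service, terrible product", "negative"),
    ("The screen is terrible", "negative"),
]

# Step 5: "training" in its simplest form -- count how often each word
# appears under each label, then score new text by those counts
word_label_counts = {"positive": Counter(), "negative": Counter()}
for text, label in gold:
    word_label_counts[label].update(attributes(break_sentence(text)))

def predict(text):
    tokens = break_sentence(text)
    scores = {label: sum(counts[t] for t in tokens)
              for label, counts in word_label_counts.items()}
    return max(scores, key=scores.get)
```

`predict("A great phone")` leans positive because “great” dominates the positive counts; real systems replace the scoring rule with a proper statistical classifier.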

Page 17

NLP uses machine learning

• Machine Learning (Classification)
  – Supervised Learning – given training data (x-vars and y-vars), infer a function f with y = f(x). Curve fitting is a basic form of supervised learning. You need labeled training data, i.e., X-Y pairs.
  – Unsupervised Learning – the problem of finding hidden structure in unlabeled data (just x-vars), e.g., clustering.
  – NLP uses both; in our context it’s supervised learning
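Curve fitting really is supervised learning in miniature: labeled (x, y) pairs in, an inferred f out. A closed-form ordinary least-squares fit of y = a·x + b, shown here as a self-contained sketch:

```python
# Curve fitting as the simplest supervised learner: given (x, y) pairs,
# infer f(x) = a*x + b by ordinary least squares (closed form).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # exactly y = 2x + 1
a, b = fit_line(xs, ys)
```

The text-classification setting is the same idea with many x-variables (sentence attributes) and a categorical y.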

Page 18

Supervised Learning

Taken from nltk.org

Page 19

Breaking Sentence

• Stop-words removal: removing punctuation and words with low information, such as the definite article “the”

• Tokenizing: the process of breaking a sentence into words, phrases, and symbols or “tokens”

• Stemming: the process of reducing inflected words to their root form, e.g., “playing” to “play”

• Part-of-speech tagging: determining part-of-speech such as noun

• etc
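A rough sketch of these breaking steps in plain Python. The stop-word list and the suffix-stripping “stemmer” are deliberately crude toys; NLTK’s `word_tokenize`, its stopwords corpus, and `PorterStemmer` do this properly:

```python
import re

# toy stop-word list (NLTK ships a much larger one)
STOPWORDS = {"the", "a", "an", "and", "for", "to", "of", "as", "it", "with"}

def tokenize(sentence):
    # crude tokenizer: letters only; NLTK's word_tokenize is far smarter
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # toy suffix-stripper, NOT the real Porter algorithm
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

sentence = "The hurricane grounded flights scheduled for today"
tokens = remove_stopwords(tokenize(sentence))
stems = [stem(t) for t in tokens]
```

The output of this stage – an ordered list of cleaned building blocks – is exactly what the attribute-generation stage consumes next.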

Page 20

Sentence Attribute Generation
• Bag of words: collect the words
• Counted bag of words: the words plus their occurrence counts
• Bigram: a bigram is formed by two adjacent words (e.g., “Bigram is”, “is formed” are bigrams)
• N-gram: self-explanatory
• Specific keywords (“like”, “love”, “bad”)
• Frequency count of certain parts of speech
• Location of certain words
• Count of the use of !, ?, etc.
• SO MANY MORE!
• In big projects, engineers develop algorithms to automatically generate attributes!
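The first few attribute types can be sketched in a few lines (the example sentence is the one from the slide; a real pipeline would run this over the token lists produced by the breaking stage):

```python
from collections import Counter

def ngrams(tokens, n):
    # all runs of n adjacent tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a bigram is formed by two adjacent words".split()
bag = set(tokens)                 # bag of words
counted = Counter(tokens)         # counted bag of words
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```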

Page 21

Gold Set Generation

• Get example sentences or text data
  – You tag them
  – Or get RAs
  – Or use Amazon Mechanical Turk
• Or there may be an existing database
  – Online tagged corpora
• Speaking of databases for NLP: not used in this context, but there are great resources
  – Check out WordNet and FrameNet + more

Page 22

Training the classifiers

• You are done breaking the sentences and generating sentence attributes – these are the x-variables
• The y-variables are the tags you obtained
• Use your favorite ML algorithm, or combinations:
  – Regular GLM
  – SVM
  – Naïve Bayes
  – Neural Network
  – Decision Tree
  – Conditional Random Field
  – Ensemble Learning: Boosting and Bagging
  – Etc.
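To make one of these concrete, here is a from-scratch multinomial Naïve Bayes with Laplace smoothing, trained on a made-up toy gold set. In practice you would reach for NLTK’s `NaiveBayesClassifier` or scikit-learn rather than hand-rolling it:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (token_list, label) pairs from the gold set."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def classify(tokens, model):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)   # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace smoothing so unseen words don't zero out the score
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [
    (["great", "love", "great"], "pos"),
    (["love", "this", "phone"], "pos"),
    (["bad", "hate", "awful"], "neg"),
    (["awful", "battery"], "neg"),
]
model = train_nb(train)
```

Despite its independence assumption, Naïve Bayes is a strong baseline for text because the counted-bag-of-words attributes map directly onto its per-word likelihoods.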

Page 23

Let’s go deeper into each stage. First: Breaking Sentence

Page 24

Natural Language Processing Tasks

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

(Bloomberg article on Sandy)

Slides Taken from: Bommarito Consulting

Page 25

What kind of questions can we ask?
• Basic
  – What is the structure of the text?
    • Paragraphs
    • Sentences
    • Tokens/words
  – What are the words that appear in this text?
    • Nouns
      – Subjects
      – Direct objects
    • Verbs
• Advanced
  – What are the concepts that appear in this text?
  – How does this text compare to other texts?

Natural Language Processing Tasks


Page 26

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

• Segment types:
  – Paragraphs
  – Sentences
  – Tokens

Natural Language Processing Tasks


Page 27

Segmentation and Tokenization – but how does it work?
• Paragraphs
  – Two consecutive line breaks
  – A hard line break followed by an indent
• Sentences
  – Period, except abbreviations, ellipses within quotations, etc.
• Tokens and Words
  – Whitespace
  – Punctuation
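Taking those heuristics literally gives a workable first-pass segmenter. This is only a sketch: real sentence tokenizers (e.g., NLTK’s punkt) handle the abbreviations and quotations that these regexes will get wrong:

```python
import re

def segment(text):
    # paragraphs: split on blank lines (two consecutive line breaks)
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # sentences: split after a period followed by whitespace (naive!)
    sentences = [s for s in re.split(r"(?<=\.)\s+", text.replace("\n", " "))
                 if s.strip()]
    # tokens: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return paragraphs, sentences, tokens

text = "Hurricane Sandy grounded 3,200 flights.\n\nThe storm headed north."
paragraphs, sentences, tokens = segment(text)
```

Note how even this short example exposes a tokenizer design choice: the naive pattern splits “3,200” into three tokens, whereas NLTK’s tokenizers keep numbers like that intact.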

Natural Language Processing Tasks


Page 28

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

• Paragraphs: 2
• Sentences: 2
• Words: 561
  – ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

Natural Language Processing Tasks


Page 29

What kind of questions can we ask?

We now have an ordered list of tokens.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

Natural Language Processing Tasks


Page 30

Stop Words Removal

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain.

System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.


Natural Language Processing Tasks

Page 31

Natural Language Processing Tasks

Stop Words Removal + Stemming

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain.

System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.


Page 32

Natural Language Processing Tasks

Part-of-Speech Tagging

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

[('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]

NNP: Proper Noun, Singular; NNS: Noun, Plural; VBD: Verb, Past Tense; VBN: Verb, Past Participle; CD: Cardinal Number; IN: Preposition/Subordinating Conjunction; etc.
For more: http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html

Page 33

Let’s go deeper into each stage. Second: Sentence Attribute Generation

Page 34

Remember one thing

• When you read sentences yourself, what do you notice about what you notice?
• Make those into attributes!
• The goal is to mimic what we humans do

Page 35

Let’s go deeper into each stage. Third: Gold Set Generation

Page 36

Resources for Gold Set Generation

• Yourself
• RAs: pretty expensive
• AMT: Amazon Mechanical Turk
• Obtain multiple tags per item, and check inter-rater agreement to be robust
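One standard inter-rater agreement statistic is Cohen’s kappa, which corrects raw agreement for chance. A two-rater version in plain Python (the example tags are made up; NLTK’s `nltk.metrics.agreement` module covers the general multi-rater case):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on the same items, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # chance agreement: probability both raters pick the same label at random
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "pos", "pos", "neg"]
kappa = cohens_kappa(a, b)
```

Kappa of 1 means perfect agreement, 0 means no better than chance; disagreement-heavy items are the ones to re-tag or drop.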

Page 37

Research Example: ssrn.com/abstract=2290802

Page 38


Research Question

What content attributes of social media messages elicit greater consumer response & engagement?

E.g.,
1. What’s the comparative effect of informative advertising (product, price information, etc.) vs. persuasive advertising (emotion, humor, etc.) on engagement?
2. Are there differences across industries?

Introduction & Motivation

Page 39

Sample Messages from Walmart (Dec 2012 – https://www.facebook.com/walmart)

• Score an iPad 3 for an iPad2 price! Now at your local store, $50 off the iPad 3. Plus, get a $30 iTunes Gift Card. Offer good through 12/31 or while supplies last. (Product Advertisement + Deal + Product Location + Product Stock Availability)

• Rollback with Vizio. Select models have lower prices ranging from $228 for a 32" (diagonal screen size 31.5") LCD TV to $868 on a 55” (diagonal screen size 54.6") LED TV. http://walmarturl.com/10oZ6yS (Product Advertisement + Price info + Brand Mention + Link)

• Maria’s mission is helping veterans and their families find employment. Like this and watch Maria’s story. http://walmarturl.com/VzWFlh (Philanthropic Message + Explicit Like solicitation + Link)

Data

Page 40


Data
• Post-level panel data on messages posted by many companies from Sep 2011 to July 2012
  – Message content
  – Impressions, likes, and comments on a daily basis
• Page-level panel data on each page
  – Page statistics on a daily basis (e.g., fan count, industry type)
  – Aggregate demographics of fans and post viewers (impression demographics)
• After cleaning: 106,316 unique messages posted by 782 companies
• Daily likes & comments: 1.3 million rows of post-level snapshots recording about 450 million page fans’ responses

Page 41


Variables (Empirical Strategy)

Engagement Metrics (Dependent Variables): LIKES, COMMENTS

Variables that affect engagement (Independent Variables):
• Informative Ad Content: brand and product mention, price, deals, product availability, etc.
• Persuasive Ad Content: emotion, humor, philanthropic, emoticon, small talk, etc.
• Message Type: photo, video, status update, app, link
• Controls: impressions, industry type, days since post, reading complexity, message length, etc.

Page 42

Message Content Tagging
• At least 9 different workers per message + majority vote
• Used to train a natural language processing algorithm to tag the remaining posts
  – 7 statistical classifiers + a rule-based method, combined by ensemble learning
  – Greater than 99% accuracy, precision, and recall for most variables (10-fold CV)
• Worker eligibility criteria
  – Must have > 97% accuracy
  – Must have > 100 previously approved tasks
  – Location: US only
• Criteria for using the input
  – A question for detecting whether the worker is paying attention
  – Completion duration > 30 seconds (avg. took 3 min)
  – Plus 5+ more protocols
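The majority-vote step can be sketched directly. The tag values below are hypothetical; the actual protocol used at least 9 workers per message plus the eligibility filters above:

```python
from collections import Counter

def majority_vote(worker_tags):
    """Collapse multiple workers' tags for one message into a single label,
    returning the winning label and its share of the votes."""
    counts = Counter(worker_tags)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(worker_tags)

# hypothetical tags from 9 workers for one message
tags = ["deal", "deal", "deal", "price", "deal", "deal", "price", "deal", "deal"]
label, support = majority_vote(tags)
```

The support fraction is also a useful quality signal: messages with weak majorities are good candidates for re-tagging before they enter the gold set.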

Page 43

NLP Algorithm Process

Page 44

NLP Algorithm Performance

Page 45

NLTK

Page 46

Open up nlp.py

Page 47

WITH BAD NLP: “COMPUTER, HOT EARL GREY TEA”

WITH GOOD NLP: “COMPUTER, TEA, EARL GREY, HOT”

Page 48

This Concludes the 2014 Wharton Tech/Data Camp

Please help me improve this course by giving feedback. Thank you!

http://wharton.qualtrics.com/SE/?SID=SV_agzfeKZvPQD0hUN