Intro to Practical Natural Language Processing – Wharton Data Camp Sessions 8

Upload: gaenor

Post on 05-Jan-2016


TRANSCRIPT

Page 1: Intro to Practical Natural Language Processing – Wharton Data Camp Sessions 8

Intro to Practical Natural Language Processing

Wharton Data Camp Sessions 8

Agenda:
1) Tasks in NLP
2) Use NLTK

Page 2

Quick Overview of Resources For Machine Learning, NLP, and Econometrics:

Wharton Specific and Books

Page 3

Machine Learning/ NLP classes

• CIS 520 Machine Learning
• CIS 530 NLP
• CIS 630 Machine Learning for NLP
• There are more classes for theory in STAT/CIS

Page 4

Awesome Machine Learning Books!

• The Elements of Statistical Learning – Hastie and Tibshirani – ML bible 1
• Pattern Recognition and Machine Learning – Chris Bishop – ML bible 2

IF YOU WANT A DEEP UNDERSTANDING OF THE MATERIAL:
• Statistical Learning Theory – theory-of-ML bible 1
• A Probabilistic Theory of Pattern Recognition – theory-of-ML bible 2

Page 5

If you are going to do any sort of empirical work

• STAT 500 – if you have never taken a course in econometrics
• STAT 520 – basic econometrics
• STAT 521 – use of R for applied econometrics (this course went through a major change)
• STAT 541 – Andreas Buja on multivariate statistics and writing (there is one and only one required textbook in this course, and it’s a writing book)
• STAT 542 – Shane Jensen, Bayesian statistics (Jensen is the man)
• STAT 921 – Dylan Small, observational studies (required if you are doing any empirical work)
• ECON 705-706 for theory

Page 6

Subjective Econometrics Book Recommendations

• William H. Greene is great
• “Mostly Harmless Econometrics” is great
• Edward Frees’ Longitudinal and Panel Data: Analysis and Applications in the Social Sciences is one of my favorite econometrics books
• Lots more depending on usage, but ask me separately

Page 7

ML in Business & Combining the two

• Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (for intro & overview) http://data-science-for-biz.com/
  – Foster Provost: great researcher in IS at NYU
  – 72 reviews on Amazon – 4.7 average!
• Targeted Learning – Springer Series in Statistics (AKA the serious series)
  – Incorporates machine learning into causal inference
  – UC Berkeley statisticians

Page 8

Good Quick Cook-book style NLP books

• http://www.nltk.org/
• http://nltk.org/book/ – FREE book online
• Jurafsky & Martin, “Speech and Language Processing”, for deep theory
• Bing Liu’s two books: http://www.cs.uic.edu/~liub/

Page 9

There are many tasks that NLP can do and many are hard

• Machine translation – very hard
  – http://translationparty.com/ – funny
  – Hilarious video (Fresh Prince of Bel-Air theme after it was translated several times into different languages): http://www.youtube.com/watch?v=LMkJuDVJdTw
• Sentiment detection
• Automatic summarization
• Etc.

Page 10

Today

• Supervised Learning + NLP
  – Identifying certain content (this is what we will probably use the most): content-coding
  – A research example
  – Sentiment analysis example

Page 11

Supervised Learning + NLP

Given:
– a set of texts (a corpus),
– and labels (together comprising the training set)
– A label can be:
  • certain content exists
  • negative/positive sentiment
  • etc.

Goal:
– create an algorithm that mimics the labels

Page 12

Imagine a task

• You are an NSA agent, OR you are a hacker
• You are given a job, OR you are on a mission and are looking for fellow hackers
• Train an NLP algorithm to tell whether a sentence or short text on the internet contains any planning of hacking/DDoS attacks
• “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”

Page 13

What do we, as humans, do to recognize that the content is present?

• “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”

• Keywords: target, movement, anons, RIAA, not, tolerate
• Bigrams: new target, our movement, against RIAA, not tolerate
• Use of upper case and “!”
• Etc.
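These human cues translate directly into machine-readable features. A minimal pure-Python sketch (the keyword watch-list and feature names here are made up for this toy task; a real project would use NLTK and a much richer feature set):

```python
import re
from collections import Counter

def extract_features(text):
    """Turn a short text into the kinds of attributes a human notices:
    keyword hits, bigrams, upper-case usage, and exclamation marks."""
    # Hypothetical watch-list of keywords for this toy task
    keywords = {"target", "movement", "anons", "riaa", "not", "tolerate"}
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    bigrams = list(zip(lowered, lowered[1:]))
    return {
        "keyword_hits": sum(1 for t in lowered if t in keywords),
        "bigrams": Counter(bigrams),
        "all_caps_tokens": sum(1 for t in tokens if t.isupper() and len(t) > 1),
        "exclamations": text.count("!"),
    }

msg = ("Greetings, fellow anons, we have a new target in our movement "
       "against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!")
feats = extract_features(msg)
```

Each entry of `feats` becomes one x-variable for the classifier later on.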

Page 14

Narrow and Specific NLP Example

• I can only show you one very specific example of NLP today

• You need to take at least a machine learning course and an NLP course to be able to do this type of processing comfortably – two courses will probably suffice for applying ML + NLP in your research

Page 15

Overview of 1 Example in NLP: Identifying certain content in text

e.g., positive/negative sentiment

1. Find text data (short text or a sentence – a review, for example)
2. Break the sentence down into basic building blocks using NLP techniques (I’ll show how – the outcome is an ordered list of building blocks)
3. Process the ordered list of building blocks and come up with many sentence-level patterns – these will be the x-variables or sentence-level attributes (e.g., content = “positive review or not”)
   – Count the occurrences of the word “great”: X1
   – Count the number of laudatory words: X2
   – Etc. Recording certain patterns: Xn
4. Obtain text data with labels (positive or not): this is called the gold set and comes with y-variable {positive, negative} tags
5. Use machine learning techniques on the gold set to learn the relationship between the X-variables from step 3 and the Y-variable from step 4. This part is training the machine learning algorithm.

Page 16

Basic idea in NLP: identifying certain content in text

1. Find text data
2. Breaking Sentence: break the sentence down into understandable building blocks (e.g., words or lemmas)
3. Sentence Attribute Generation: identify different sentence attributes just as humans do when reading (many to be explained)
4. Gold Set Generation: obtain a set of training sentences with labels identifying whether the sentences do or do not contain certain content, from a reliable source (gold data set)
5. Training: use statistical tools to infer which sentence attributes are correlated with certain content outcomes, thereby “learning” to identify content in sentences.
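The five steps above can be sketched end to end in a few lines of plain Python. The gold-set sentences and the counting-based “training” are illustrative stand-ins for a real corpus and a real classifier, not the method any particular paper used:

```python
import re
from collections import Counter

# Step 2: break a sentence into lowercase word tokens (toy tokenizer)
def break_sentence(text):
    return re.findall(r"[a-z']+", text.lower())

# Step 3: generate sentence attributes (here: a counted bag of words)
def attributes(tokens):
    return Counter(tokens)

# Step 4: a tiny hand-labeled gold set (hypothetical examples)
gold = [
    ("This camera is great, simply great", "positive"),
    ("Great battery and a great screen", "positive"),
    ("Terrible service, terrible product", "negative"),
    ("The screen is terrible", "negative"),
]

# Step 5: "training" in its simplest form -- count how often each word
# appears under each label, then score new text by those counts
word_label_counts = {"positive": Counter(), "negative": Counter()}
for text, label in gold:
    word_label_counts[label].update(attributes(break_sentence(text)))

def predict(text):
    tokens = break_sentence(text)
    scores = {label: sum(counts[t] for t in tokens)
              for label, counts in word_label_counts.items()}
    return max(scores, key=scores.get)
```

`predict("A great phone")` leans positive because “great” dominates the positive counts; real systems replace the scoring rule with a proper statistical classifier.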

Page 17

NLP uses machine learning

• Machine Learning (Classification)
  – Supervised Learning – given training data (x-vars and y-vars), infer a function f with y = f(x). Curve fitting is a basic form of supervised learning. You need labeled training data, i.e., X-Y pairs.
  – Unsupervised Learning – the problem of finding hidden structure in unlabeled data (just x-vars), e.g., clustering.
  – NLP uses both; in our context it’s supervised learning
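Curve fitting really is supervised learning in miniature: labeled (x, y) pairs in, an inferred f out. A closed-form ordinary least-squares fit of y = a·x + b, shown here as a self-contained sketch:

```python
# Curve fitting as the simplest supervised learner: given (x, y) pairs,
# infer f(x) = a*x + b by ordinary least squares (closed form).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # exactly y = 2x + 1
a, b = fit_line(xs, ys)
```

The text-classification setting is the same idea with many x-variables (sentence attributes) and a categorical y.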

Page 18

Supervised Learning

Taken from nltk.org

Page 19

Breaking Sentence

• Stop-words removal: removing punctuation and words with low information, such as the definite article “the”

• Tokenizing: the process of breaking a sentence into words, phrases, and symbols or “tokens”

• Stemming: the process of reducing inflected words to their root form, e.g., “playing” to “play”

• Part-of-speech tagging: determining part-of-speech such as noun

• etc
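A rough sketch of these breaking steps in plain Python. The stop-word list and the suffix-stripping “stemmer” are deliberately crude toys; NLTK’s `word_tokenize`, its stopwords corpus, and `PorterStemmer` do this properly:

```python
import re

# toy stop-word list (NLTK ships a much larger one)
STOPWORDS = {"the", "a", "an", "and", "for", "to", "of", "as", "it", "with"}

def tokenize(sentence):
    # crude tokenizer: letters only; NLTK's word_tokenize is far smarter
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # toy suffix-stripper, NOT the real Porter algorithm
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

sentence = "The hurricane grounded flights scheduled for today"
tokens = remove_stopwords(tokenize(sentence))
stems = [stem(t) for t in tokens]
```

The output of this stage – an ordered list of cleaned building blocks – is exactly what the attribute-generation stage consumes next.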

Page 20

Sentence Attribute Generation
• Bag of words: collect the words
• Counted bag of words: the words plus their occurrence counts
• Bigram: a bigram is formed by two adjacent words (e.g., “Bigram is”, “is formed” are bigrams)
• N-gram: self-explanatory
• Specific keywords (“like”, “love”, “bad”)
• Frequency count of certain parts of speech
• Location of certain words
• Count of the use of !, ?, etc.
• SO MANY MORE!
• In big projects, engineers develop algorithms to automatically generate attributes!
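The first few attribute types can be sketched in a few lines (the example sentence is the one from the slide; a real pipeline would run this over the token lists produced by the breaking stage):

```python
from collections import Counter

def ngrams(tokens, n):
    # all runs of n adjacent tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a bigram is formed by two adjacent words".split()
bag = set(tokens)                 # bag of words
counted = Counter(tokens)         # counted bag of words
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```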

Page 21

Gold Set Generation

• Get example sentences or text data
  – You tag them
  – Or get RAs
  – Or use Amazon Mechanical Turk
• Or there may be an existing database
  – Online tagged corpora
• Speaking of databases for NLP: not used in this context, but there are great resources
  – Check out WordNet and FrameNet + more

Page 22

Training the classifiers

• You are done breaking the sentences and generating sentence attributes – these are the x-variables
• The y-variables are the tags you obtained
• Use your favorite ML algorithm, or combinations:
  – Regular GLM
  – SVM
  – Naïve Bayes
  – Neural Network
  – Decision Tree
  – Conditional Random Field
  – Ensemble Learning: Boosting and Bagging
  – Etc.
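To make one of these concrete, here is a from-scratch multinomial Naïve Bayes with Laplace smoothing, trained on a made-up toy gold set. In practice you would reach for NLTK’s `NaiveBayesClassifier` or scikit-learn rather than hand-rolling it:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (token_list, label) pairs from the gold set."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def classify(tokens, model):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)   # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace smoothing so unseen words don't zero out the score
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [
    (["great", "love", "great"], "pos"),
    (["love", "this", "phone"], "pos"),
    (["bad", "hate", "awful"], "neg"),
    (["awful", "battery"], "neg"),
]
model = train_nb(train)
```

Despite its independence assumption, Naïve Bayes is a strong baseline for text because the counted-bag-of-words attributes map directly onto its per-word likelihoods.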

Page 23

Let’s go deeper into each stage. First: Breaking Sentence

Page 24

Natural Language Processing Tasks

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

(Bloomberg article on Sandy)

Slides Taken from: Bommarito Consulting

Page 25

What kind of questions can we ask?
• Basic
  – What is the structure of the text?
    • Paragraphs
    • Sentences
    • Tokens/words
  – What are the words that appear in this text?
    • Nouns
      – Subjects
      – Direct objects
    • Verbs
• Advanced
  – What are the concepts that appear in this text?
  – How does this text compare to other texts?

Natural Language Processing Tasks


Page 26

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

• Segment types:
  – Paragraphs
  – Sentences
  – Tokens

Natural Language Processing Tasks


Page 27

Segmentation and Tokenization – but how does it work?
• Paragraphs
  – Two consecutive line breaks
  – A hard line break followed by an indent
• Sentences
  – Period, except abbreviations, ellipses within quotations, etc.
• Tokens and Words
  – Whitespace
  – Punctuation
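Taking those heuristics literally gives a workable first-pass segmenter. This is only a sketch: real sentence tokenizers (e.g., NLTK’s punkt) handle the abbreviations and quotations that these regexes will get wrong:

```python
import re

def segment(text):
    # paragraphs: split on blank lines (two consecutive line breaks)
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # sentences: split after a period followed by whitespace (naive!)
    sentences = [s for s in re.split(r"(?<=\.)\s+", text.replace("\n", " "))
                 if s.strip()]
    # tokens: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return paragraphs, sentences, tokens

text = "Hurricane Sandy grounded 3,200 flights.\n\nThe storm headed north."
paragraphs, sentences, tokens = segment(text)
```

Note how even this short example exposes a tokenizer design choice: the naive pattern splits “3,200” into three tokens, whereas NLTK’s tokenizers keep numbers like that intact.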

Natural Language Processing Tasks


Page 28

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

• Paragraphs: 2
• Sentences: 2
• Words: 561
  – ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

Natural Language Processing Tasks


Page 29

What kind of questions can we ask?

We now have an ordered list of tokens.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

Natural Language Processing Tasks


Page 30

Stop Words Removal

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain.

System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.


Natural Language Processing Tasks

Page 31

Natural Language Processing Tasks

Stop Words Removal + Stemming

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain.

System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.


Page 32

Natural Language Processing Tasks

Part-of-Speech Tagging

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.

The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

[('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]

NNP: Proper Noun, Singular; NNS: Noun, Plural; VBD: Verb, Past Tense; VBN: Verb, Past Participle; CD: Cardinal Number; IN: Preposition/Subordinating Conjunction; etc.
For more: http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html

Page 33

Let’s go deeper into each stage. Second: Sentence Attribute Generation

Page 34

Remember one thing

• When you read sentences yourself, what do you notice about what you notice?
• Make those into attributes!
• The goal is to mimic what we humans do

Page 35

Let’s go deeper into each stage. Third: Gold Set Generation

Page 36

Resources for Gold Set Generation

• Yourself
• RAs: pretty expensive
• AMT: Amazon Mechanical Turk
• Obtain multiple tags per item, and check inter-rater agreement to be robust
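One standard inter-rater agreement statistic is Cohen’s kappa, which corrects raw agreement for chance. A two-rater version in plain Python (the example tags are made up; NLTK’s `nltk.metrics.agreement` module covers the general multi-rater case):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on the same items, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # chance agreement: probability both raters pick the same label at random
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "pos", "pos", "neg"]
kappa = cohens_kappa(a, b)
```

Kappa of 1 means perfect agreement, 0 means no better than chance; disagreement-heavy items are the ones to re-tag or drop.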

Page 37

Research Example: ssrn.com/abstract=2290802

Page 38


Research Question

What content attributes of social media messages elicit greater consumer response & engagement?

E.g.,
1. What’s the comparative effect of informative advertising (product, price information, etc.) vs. persuasive advertising (emotion, humor, etc.) on engagement?
2. Are there differences across industries?

Introduction & Motivation

Page 39

Sample Messages from Walmart (Dec 2012 – https://www.facebook.com/walmart)

• Score an iPad 3 for an iPad2 price! Now at your local store, $50 off the iPad 3. Plus, get a $30 iTunes Gift Card. Offer good through 12/31 or while supplies last. (Product Advertisement + Deal + Product Location + Product Stock Availability)

• Rollback with Vizio. Select models have lower prices ranging from $228 for a 32" (diagonal screen size 31.5") LCD TV to $868 on a 55” (diagonal screen size 54.6") LED TV. http://walmarturl.com/10oZ6yS (Product Advertisement + Price info + Brand Mention + Link)

• Maria’s mission is helping veterans and their families find employment. Like this and watch Maria’s story. http://walmarturl.com/VzWFlh (Philanthropic Message + Explicit Like solicitation + Link)

Data

Page 40


Data
• Post-level panel data on messages posted by many companies from Sep 2011 to July 2012
  – Message content
  – Impressions, likes, and comments on a daily basis
• Page-level panel data on each page
  – Page statistics on a daily basis (e.g., fan count, industry type)
  – Aggregate demographics of fans and post viewers (impression demographics)
• After cleaning: 106,316 unique messages posted by 782 companies
• Daily likes & comments: 1.3 million rows of post-level snapshots recording about 450 million page fans’ responses

Page 41


Variables (Empirical Strategy)

Engagement Metrics (Dependent Variables): LIKES, COMMENTS

Variables that affect engagement (Independent Variables):
• Informative Ad Content: brand and product mention, price, deals, product availability, etc.
• Persuasive Ad Content: emotion, humor, philanthropic, emoticon, small talk, etc.
• Message Type: photo, video, status update, app, link
• Controls: impressions, industry type, days since post, reading complexity, message length, etc.

Page 42

Message Content Tagging
• At least 9 different workers per message + majority vote
• Used to train a natural language processing algorithm to tag the remaining posts
  – 7 statistical classifiers + a rule-based method, combined by ensemble learning
  – Greater than 99% accuracy, precision, and recall for most variables (10-fold CV)
• Worker eligibility criteria
  – Must have > 97% accuracy
  – Must have > 100 previously approved tasks
  – Location: US only
• Criteria for using the input
  – A question for detecting whether the worker is paying attention
  – Completion duration > 30 seconds (avg. took 3 min)
  – Plus 5+ more protocols
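The majority-vote step can be sketched directly. The tag values below are hypothetical; the actual protocol used at least 9 workers per message plus the eligibility filters above:

```python
from collections import Counter

def majority_vote(worker_tags):
    """Collapse multiple workers' tags for one message into a single label,
    returning the winning label and its share of the votes."""
    counts = Counter(worker_tags)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(worker_tags)

# hypothetical tags from 9 workers for one message
tags = ["deal", "deal", "deal", "price", "deal", "deal", "price", "deal", "deal"]
label, support = majority_vote(tags)
```

The support fraction is also a useful quality signal: messages with weak majorities are good candidates for re-tagging before they enter the gold set.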

Page 43

NLP Algorithm Process

Page 44

NLP Algorithm Performance

Page 45

NLTK

Page 46

Open up nlp.py

Page 47

WITH BAD NLP: “COMPUTER, HOT EARL GREY TEA”

WITH GOOD NLP: “COMPUTER, TEA, EARL GREY, HOT”

Page 48

This Concludes the 2014 Wharton Tech/Data Camp

Please help me improve this course by giving feedback. Thank you!

http://wharton.qualtrics.com/SE/?SID=SV_agzfeKZvPQD0hUN