Natural Language Processing
10 April 2017
Who are we?
Jiri, Tom, Rado, Martin
Infrastructure
● Keboola project invitation
● Python 3+ (preferably Anaconda) installed
● cmder.net on Windows (Mac and Linux should be fine)
NLP – Why do we care?
Problem
Huge amount of text, growing faster and faster
Computers process mostly structured data
Businesses are forced to ignore crucial data
We had a great time in Royal Plaza. We had to wait a while at hte reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!
Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.
Problem
Solution = analyze text automatically
Royal Plaza: great
Room service: excellent
Reception: waiting
Bathrooms: rusty
Speed Taxi: expensive!
Retention: I’m leaving
Billing: last invoice was wrong
Technical support: bad signal at home
Customer support: battery lasts only few hours
Example – demo.geneea.com
Example – Customer feedback – Relations
Just text analysis is not enough
Connection with structured data:
• Time
• Location
• Popularity of text (likes/dislikes, retweets)
• Financial data
• Age, gender of the author of the text
What else?
• Machine translation
• Information retrieval
• Finding similar documents (plagiarism)
• Summarization
• Dictation, IVR, automatic closed-captioning, text-to-speech
• Email/ticket routing
• Grammar/spelling checking
• Recognition of people, companies …; relations between them
• Detection of sentiment, uncertainty
• Intent detection
• Much more
Low level tasks
• Sentence boundary detection
• Tokenization
• Part-of-speech tagging: Can he can me for kicking a can?
• Lemmatization
• Parsing
• Coreference resolution
• Understanding dates: 10th of April 2017; April 10, 2017; 04/10/2017; 04/10/17; 04/10; April 10; 2017-04-10
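To give a feel for why date understanding is its own task, here is a minimal Python sketch that tries a few fixed patterns in turn. The format list and the `parse_date` helper are illustrative assumptions, and the US-style month/day reading of `04/10/2017` is a guess that a real system would have to confirm from context.

```python
from datetime import date, datetime

# Candidate formats, tried in order. Treating "04/10/2017" as month/day
# is an assumption; it could equally mean 4 October.
FORMATS = [
    "%Y-%m-%d",   # 2017-04-10
    "%B %d, %Y",  # April 10, 2017
    "%m/%d/%Y",   # 04/10/2017
    "%m/%d/%y",   # 04/10/17
]

def parse_date(text):
    """Return a date for the first format that matches, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

for raw in ["2017-04-10", "April 10, 2017", "04/10/17", "10th of April 2017"]:
    print(raw, "->", parse_date(raw))
```

Forms such as "10th of April 2017" or the year-less "April 10" fall through to `None` here, which is exactly where hand-written patterns start to multiply.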
Approaches to NLP
• Rule-based approach – from circa 1950
• Machine learning – from circa 1980
Approaches to NLP – Rule based
• Chomsky (1957): Syntactic Structures
• Machine translation from IBM & Georgetown
Time flies like an arrow.
Approaches to NLP – Machine learning
• Statistical methods, machine learning
• Importance of exact evaluation
• Data, data, data
• Annotation
Machine learning
• Unsupervised
  • Finding hidden structure in data
  • For example clustering
• Supervised
  • Requires training data with correct answers
Data – Corpora
● Morphology, tagging: Penn Treebank, PDT, ...
● Parallel corpora: European, Canadian parliament, movie subtitles
● Specialised (e.g. for sentiment)
Sentiment analysis
• German economy is booming.
• Human trafficking is booming in California.
• Burger King has better fries than McDonald.
• Battery is good, but the display is terrible.
Sentiment Analysis
• I was happy.
• I was sad.
• I was not happy.
• I have never been happy in my life.
• I have never been so happy in my life.
• It’s not good, but I still love it.
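A rough Python sketch of why these sentences are hard for simple word counting. The lexicon, the negator list and the `naive_score` function are all made-up illustrations, not the method used in the workshop notebook.

```python
# Tiny hand-made lexicon; the words and scores are illustrative assumptions.
LEXICON = {"happy": 1, "love": 1, "good": 1, "sad": -1}
NEGATORS = {"not", "never"}

def naive_score(sentence):
    """Sum word polarities, flipping the next sentiment word after a negator."""
    tokens = sentence.lower().replace(".", "").replace(",", "").split()
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
            negate = False
    return score

for s in [
    "I was happy.",
    "I was not happy.",
    "I have never been so happy in my life.",
    "It's not good, but I still love it.",
]:
    print(naive_score(s), s)
```

The negation rule handles "not happy", but it also marks "never been so happy" as negative and scores the last sentence as neutral, so hand-written rules of this kind break quickly.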
Sentiment Analysis
• She is pretty.
• She is pretty annoying.
Sentiment Analysis
• Well, that was a success. (sarcasm?)
• Go read the book.
Sentiment Analysis
• The previous version was absolutely great, it was a pleasure to work with, but now, I am a little confused.
Sentiment Analysis
• Harmony Smith drives me crazy.
• Bob’s Bad Breath Burger is delicious.
• That was a bad ass burger!
Sentiment classification - python
go to https://jupyter.geneea.com
Evaluation
Evaluation: Precision/Recall
False positive: bad precision
False negative: bad recall
Evaluation: Confusion matrix
| Reality \ Prediction | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | True positive | False negative |
| Real negative | False positive | True negative |
Evaluation: Confusion matrix
| Reality \ Prediction | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | 10 | 5 |
| Real negative | 3 | 16 |
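Reading the matrix above as 10 true positives, 5 false negatives, 3 false positives and 16 true negatives, the usual measures can be computed as in this short Python sketch. The function name and its argument names are chosen here for illustration only.

```python
def precision_recall(tp, fn, fp, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # how many predicted positives were right
    recall = tp / (tp + fn)      # how many real positives were found
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, accuracy

precision, recall, accuracy = precision_recall(tp=10, fn=5, fp=3, tn=16)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# prints: precision=0.769 recall=0.667 accuracy=0.765
```

False positives lower precision and false negatives lower recall, which matches the earlier precision/recall slide.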
Evaluation
• Multiple possible answers
• Not all errors are equally important
• Inter-annotator agreement (very low for tagging with an open tag set)
Machine learning – Overfitting
Discovery analysis
● explore the data
● prefer recall over precision
● malformed or irrelevant tags not a big deal (as opposed to media tags)
Yelp Sample – 160k Restaurant Reviews
Command line tools
• Simple and generic tools for text transformation
• Where to get:
  • Linux (and Mac) – part of the operating system
  • Windows – install cmder.net or cygwin
https://jupyter.geneea.com/tree/data (click the file, then choose File > Save; do NOT right-click and "Save as"!)
Example
• How many lines, words and characters?
  • The first command: wc <filename>
• How to show a file:
  • cat <filename> (prints the whole file)
  • less <filename> (pages; use space to move on and "q" to quit)
  • head / tail <filename>
• All commands: parameter --help
Encoding
• https://en.wikipedia.org/wiki/Character_encoding
• ASCII – 7 bits
  • Does not cover all languages’ characters
• More encodings for the same language (e.g. windows-1250, iso-8859-2, utf-8 for Czech)
Conversion from one encoding to another (iso-8859-2 to utf-8):
iconv -f iso-8859-2 -t utf-8 text_orig.txt > text_utf8.txt
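For those working in Python rather than on the command line, roughly the same conversion can be sketched as below. The `convert_encoding` helper and the sample word are assumptions for illustration; only the two encoding names come from the slide.

```python
def convert_encoding(src_path, dst_path, src_enc="iso-8859-2", dst_enc="utf-8"):
    """Re-encode a text file, similar in spirit to `iconv -f src_enc -t dst_enc`."""
    # newline="" keeps the original end-of-line characters untouched.
    with open(src_path, "r", encoding=src_enc, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        dst.write(text)

# The same Czech word is stored as different bytes in each encoding.
word = "žluťoučký"
print(len(word), "characters")
print(len(word.encode("iso-8859-2")), "bytes in iso-8859-2")
print(len(word.encode("utf-8")), "bytes in utf-8")
```

Reading a file with the wrong `encoding` either raises an error or silently produces wrong characters, which is why it pays to know the source encoding before any further processing.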
Other commands and generic principles
• sort
  • Sorting the file alphabetically or numerically (-n)
• Each command has input and output
• It is possible to make a chain of commands (output of one command becomes input for another one)
  • Using character | (pipe)
  • cat text_utf8.txt | sort
• How to send output to a file?
  • Using character >
  • cat text_utf8.txt | sort > text_sorted.txt
CSV processing
• csvkit
  • https://csvkit.readthedocs.io
  • pip install csvkit
• csvcut -c 2 file.csv
  • Print only the second column
• in2csv data.xls > data.csv
  • Conversion from Excel to csv
• csvcut -n data.csv
  • Print column names
Other commands – tr, uniq, cut
• tr
  • Replace one character with another:
    • tr 'a' 'b'
    • tr ' ' '\n'
    • tr '[:punct:]' '\n'
• uniq
  • Exclude repeating rows
  • It’s necessary to have the input sorted
  • cat text_utf8.txt | tr '[:punct:]' '\n' | tr ' ' '\n' | sort | uniq -c | sort
• cut
  • Filters columns or characters and prints only selected ones
  • cut -f 1 -d " "
  • Works well with tsv (tab separated)
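The word-counting pipeline above (tr, sort, uniq -c) has a close Python counterpart, sketched here for comparison. The function name and the sample string are assumptions. Unlike the shell version, this sketch drops the empty strings left between consecutive separators.

```python
import re
import string
from collections import Counter

# Split on whitespace and ASCII punctuation, like the two tr calls.
SEPARATORS = re.compile(f"[{re.escape(string.punctuation)}\\s]+")

def word_frequencies(text):
    """Count tokens; plays the role of `sort | uniq -c`."""
    tokens = [t for t in SEPARATORS.split(text) if t]
    return Counter(tokens)

freqs = word_frequencies("the pizza, the box. The pizza!")
# most_common() orders by count, like a final numeric sort.
for word, count in freqs.most_common():
    print(count, word)
```

Note that "the" and "The" are counted separately, just as in the shell pipeline. Lowercasing first is a separate decision.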
Other commands - grep
• Filtering rows
  • grep "foo"
• Regular expressions:
  • Template matching more words/texts
  • [a-z] … characters from a to z
• Exercises:
  • Print rows containing at least one number
  • How many unique words does the file contain?
  • What are the most frequent words starting with a specific letter?
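One possible Python take on the first exercise, using the same character-class idea as grep. The helper name and the sample lines are invented for illustration. The command-line version is still worth working out by hand.

```python
import re

def lines_with_digit(lines):
    """Keep only lines containing at least one digit, like grep '[0-9]'."""
    pattern = re.compile(r"[0-9]")
    return [line for line in lines if pattern.search(line)]

sample = ["We waited 20 minutes.", "Great food!", "Table for 2, please."]
print(lines_with_digit(sample))
```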
Other commands
• wget
  • Crawl pages from a website
• echo
  • Prints text to console
• sed
  • More complex tool for replacing strings
  • echo "wine" | sed -e 's/wine/beer/'
• dos2unix, unix2dos
  • Encoding of end-of-line characters
Other tools
• Notepad++
  • With TextFX plug-in
Simple tagging
1. Tokenization: split text into words
2. Drop unimportant words: stop words
3. Find important words: tf-idf
tf – term frequency
So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!
tf(pizza) = 5
tf(the) = 15
idf – inverse document frequency
idf(the) = log(160,000 / 152,000) = log(1.05)
idf(pizza) = log(160,000 / 11,800) = log(13.56)
idf(Tokyo) = log(160,000 / 194) = log(824.74)
tf-idf
w(t, doc) = log(1 + tf(t)) * log(N / df(t))   (log is base 2 here)
w(the, doc) = log(1+15) * log(1.05) = 0.28
w(pizza, doc) = log(1+5) * log(13.56) = 9.72
w(Tokyo, doc) = log(1+0) * log(824.74) = 0
Simple, yet works well
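The weighting can be checked with a few lines of Python. This is a sketch under one assumption: base-2 logarithms, which is what reproduces the slide's 9.72 for "pizza". The function name is ours. The value for "the" comes out near 0.30 rather than 0.28 because the slide rounds the ratio to 1.05 before taking the logarithm.

```python
import math

def tf_idf(tf, n_docs, df):
    """w(t, doc) = log2(1 + tf) * log2(N / df)."""
    return math.log2(1 + tf) * math.log2(n_docs / df)

N = 160_000  # reviews in the Yelp sample
w_the = tf_idf(tf=15, n_docs=N, df=152_000)
w_pizza = tf_idf(tf=5, n_docs=N, df=11_800)
w_tokyo = tf_idf(tf=0, n_docs=N, df=194)

print(f"the={w_the:.2f} pizza={w_pizza:.2f} Tokyo={w_tokyo:.2f}")
```

A word that appears in almost every review ("the") gets a weight near zero however often it occurs, and a rare word that is absent from this review ("Tokyo") gets exactly zero.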
Tokenization
Merriam-Webster's, Tumu-M'Pongo
10-year, one-liners, self-proclaimed
United Kingdom-United States relations
5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00
10000, 10 000, 10,000
3.14159, 10.12., 10. prosince, U.S.A., H2O
km/h, A/C, s/he, °C
N40° 44.9064', W073° 59.0735'
www.some-news.com/article-about-stuff, [email protected]
Tokenization
Arbitrary decisions have to be made.
Stick to them consistently. Pre-trained models might work poorly if fed with differently tokenized data
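To make the "arbitrary decisions" concrete, here is a small Python sketch comparing two equally defensible tokenizers on the same input. Both functions and the sample string are illustrative assumptions, not the tokenizer used in the workshop.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()

def regex_tokens(text):
    """Runs of word characters, or single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Merriam-Webster's costs 10,000"
print(whitespace_tokens(sample))
print(regex_tokens(sample))
```

The first yields 3 tokens and the second 9 for the same string. A model trained on one scheme would see mostly unfamiliar tokens if fed the other.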
eat x ate x eaten
OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!
tf(eat) = 1
tf(eat) + tf(ate) + tf(eaten) = 3
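The effect on term frequency can be sketched in Python with a tiny lookup table standing in for a real lemmatizer. The `LEMMAS` dictionary and the shortened token list are hypothetical and cover only this example.

```python
from collections import Counter

# Hypothetical mini lemma dictionary; a real lemmatizer covers the whole vocabulary.
LEMMAS = {"ate": "eat", "eaten": "eat"}

def term_frequencies(tokens, lemmatize=False):
    """Count tokens, optionally mapping each word form to its lemma first."""
    if lemmatize:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)

# Shortened excerpt of the review above, already lowercased and split.
tokens = "i've eaten here twice what i ate a great place to eat".split()

print(term_frequencies(tokens)["eat"])                   # surface forms only
print(term_frequencies(tokens, lemmatize=True)["eat"])   # forms merged under one lemma
```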
Lemmatization & Morphology
Processing Morphology
Lemmatization: word → lemma (dictionary form)
Peter saw her. → Peter see she .
POS Tagging: word → tag
Peter saw her. → noun, verb, pronoun, punct
Morphological analysis: ignores context
saw → {[see, verb.past], [saw, noun.sg]}
Morpheme segmentation: de-nation-al-iz-ation
Generation: see + verb.past → saw
Morphology – not so easy
city – citi-es, goose – geese, sheep – sheep, go – went
Stuhl – Stühl-e, Vater – Väter
matk-a – mat-e-k – matc-e – matč-in
Morphology – not so easy
Tagalog (Philippines):
basa ‘read’, b-um-asa ‘read (past)’
sulat ‘write’, s-um-ulat ‘wrote’
rare in English: abso-bloody-lutely
Arabic, Hebrew – templates
Choice of lemma depth
inflection: debates → debate
brought, brings, bringing → bring
negation: unreasonable → reasonable
gradation: highest → high (Highest Court)
Morphology: not so easy - derivation
solution – solve; kind – kindly – kindness
un-happy – in-comprehensive – im-possible – ir-rational
unloosen = loosen
unnerve, unearth
Zipf’s law
Word frequency is inversely proportional to its frequency rank
Unique token count: 145k
Total token count: 20M

| rank | word | freq |
|---|---|---|
| 1 | the | 801132 |
| 2 | and | 635035 |
| 3 | I | 521421 |
| 4 | a | 509089 |
| 5 | to | 398695 |
| ... | | |
| 53038 | turorials | 2 |
| >68000 | | 1 |

| First | Covers |
|---|---|
| 1% | 84% |
| 10% | 97.7% |
| 20% | 98.9% |
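The "First / Covers" figures can be read as: the most frequent x% of distinct words account for y% of all tokens. That reading is our interpretation of the slide. A Python sketch of the computation on a toy corpus follows. The `coverage` function and the sample sentence are illustrative, and the toy numbers will not match those of the review corpus.

```python
from collections import Counter

def coverage(counts, fraction):
    """Share of all tokens covered by the most frequent `fraction` of distinct words."""
    freqs = sorted(counts.values(), reverse=True)
    top_n = max(1, int(len(freqs) * fraction))
    return sum(freqs[:top_n]) / sum(freqs)

toy = Counter("the cat and the dog and the bird saw the fish".split())
for fraction in (0.3, 0.6, 1.0):
    print(f"top {fraction:.0%} of distinct words cover {coverage(toy, fraction):.1%} of tokens")
```

Even in an 11-token toy text, the 2 most frequent of 7 distinct words already cover more than half of the tokens.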
Consequences
Pareto’s rule (80 : 20)
One can achieve “reasonable” quality fast
Costs of additional improvements rise “exponentially” (long tail)
Ambiguity and fuzziness on every layer of language
Part-of-speech tagging
![Page 67: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/67.jpg)
Part-of-speech tagging
I love hiking through the woods on weekends .
PRP VBP NN IN DT NNS IN NNS .
![Page 68: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/68.jpg)
Petrov et al – (Google) Universal POS Tagset
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
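Fine-grained Penn Treebank tags can be collapsed into these universal tags with a lookup table. The sketch below covers only the handful of Penn tags that appear in this deck. The table is partial, tagging "hiking" as NN is an assumption, and falling back to X for unknown tags is a choice made for this example.

```python
# Partial Penn Treebank -> Universal mapping (only tags used in this deck).
PENN_TO_UNIVERSAL = {
    "PRP": "PRON", "VBP": "VERB", "VBN": "VERB",
    "NN": "NOUN", "NNS": "NOUN",
    "IN": "ADP", "DT": "DET", "JJ": "ADJ",
    "RB": "ADV", "RP": "PRT", "CD": "NUM", "CC": "CONJ",
    ".": ".",
}


def to_universal(tagged):
    """Map (word, penn_tag) pairs to (word, universal_tag) pairs."""
    # Unknown tags fall back to X; this fallback is an assumption of the sketch.
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]


penn_tagged = [("I", "PRP"), ("love", "VBP"), ("hiking", "NN"),
               ("through", "IN"), ("the", "DT"), ("woods", "NNS"),
               ("on", "IN"), ("weekends", "NNS"), (".", ".")]
universal = to_universal(penn_tagged)
```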
![Page 69: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/69.jpg)
Penn Treebank tagset
![Page 70: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/70.jpg)
Ambiguity
Mrs. Shaefer never got around/RP to joining.
All we gotta do is go around/IN the corner.
Chateau Petrus costs around/RB 2,500.
They were married/VBN by the Justice of the Peace yesterday at 5:00.
At the time, she was already married/JJ.
![Page 71: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/71.jpg)
Entities
![Page 72: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/72.jpg)
![Page 73: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/73.jpg)
Example – Svejk – characters
![Page 74: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/74.jpg)
![Page 75: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/75.jpg)
Švejk
![Page 76: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/76.jpg)
![Page 77: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/77.jpg)
Entities – named and non-named
Named entities: personal names, organizations, geographical names
Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time
Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain
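Non-named and custom entities are often approachable with patterns and small domain lists. The following is a rough sketch only: the regular expressions are deliberately simplistic, and `PATTERNS`, `INGREDIENTS` and `find_entities` are hypothetical names introduced for this example.

```python
import re

# Deliberately simple patterns; production-grade ones are far more involved.
PATTERNS = {
    "URL": re.compile(r"https?://[^\s,]+"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MONEY": re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*"),
}

# Custom domain entities for the burger-chain example.
INGREDIENTS = {"bacon", "onion", "tomato", "cheese"}


def find_entities(text):
    """Return a list of (label, matched_text) pairs found in `text`."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in INGREDIENTS:
            found.append(("INGREDIENT", match.group()))
    return found


entities = find_entities(
    "Mail orders@example.com or see https://example.com/menu, "
    "the bacon burger with extra cheese costs $7.50."
)
```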
![Page 78: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/78.jpg)
Entities – some basic challenges
Types – fuzziness, hierarchy
Facebook – product or company?
European Union – organization or place?
Embedded entities
[Dr.] Martin Luther King [Jr.]
[The [New England] Journal of Medicine]
[Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]
List look-up not enough
Washington, The police, ANO (Yes)
![Page 79: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/79.jpg)
Entities – ML
• annotation – tag tokens with labels like PERSON_START, PERSON_CONT
• popular classifier – CRF
• features
• word shape (case, is alphanumeric etc.)
• morphological features
• gazetteers
• distsim, word2vec
• labels already assigned to previous word(s)
• add features of surrounding tokens, previous instances of the same word, use n-grams …
• can use two passes
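The feature list above can be sketched as a function that turns each token into a feature dictionary, the kind of input a CRF toolkit typically expects. Everything here is illustrative: `GAZETTEER`, `word_shape`, `token_features`, the `<BOS>`/`<EOS>` markers and the `O` label for tokens outside any entity are assumptions of this example, not part of the original slides.

```python
GAZETTEER = {"martin", "washington"}  # hypothetical list of known names


def word_shape(token):
    """Collapse characters into classes: X upper, x lower, d digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch
        # Merge runs of the same class, so "Martin" becomes "Xx".
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def token_features(tokens, i, prev_label):
    """Features for token i, including context and the previous label."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "in_gazetteer": token.lower() in GAZETTEER,
        "prev_label": prev_label,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["Martin", "Luther", "King", "spoke"]
labels = ["PERSON_START", "PERSON_CONT", "PERSON_CONT", "O"]
features = [token_features(tokens, i, labels[i - 1] if i > 0 else "<BOS>")
            for i in range(len(tokens))]
```

At training time the previous label comes from the annotation, as here. At prediction time it would come from the model's own earlier decisions.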
![Page 80: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/80.jpg)
Entities/Tags – remaining issues
Coreference – resolving mentions such as pronouns or "the president" to their entity increases that entity's tf
Standardization
iPads > iPad
Windows != Window, United States != Unite State
The first stage has landed on Of Course I Still Love You.
He sang Bratříčku zavírej vrátka.
Normalization
USA = United States of America = United States ~ America
Hillary Rodham = Hillary Clinton
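In its simplest form, normalization is a lookup from known surface forms to one canonical name. The sketch below assumes a hand-written alias table. `ALIASES` and `normalize_entity` are hypothetical, and the fuzzy case ("America") is deliberately left out because it cannot be resolved without context.

```python
# Hypothetical alias table: lowercased surface form -> canonical name.
ALIASES = {
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
    "hillary rodham": "Hillary Clinton",
    "hillary clinton": "Hillary Clinton",
}


def normalize_entity(mention):
    """Return the canonical name for a mention, or the mention unchanged."""
    key = " ".join(mention.lower().split())  # case and whitespace insensitive
    return ALIASES.get(key, mention)


canonical = [normalize_entity(m)
             for m in ["USA", "United  States", "Hillary Rodham", "America"]]
```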
![Page 81: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/81.jpg)
Syntax & Parsing
![Page 82: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/82.jpg)
![Page 83: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/83.jpg)
![Page 84: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/84.jpg)
![Page 85: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/85.jpg)
![Page 86: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/86.jpg)
![Page 87: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/87.jpg)
Old men and women are hard to live with.
I saw her duck.
The chickens are too hot to eat.
The mayor is a dirty street fighter.
Happily they left.
Terry loves his wife and so do I.
![Page 88: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/88.jpg)
Vectors
![Page 89: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/89.jpg)
![Page 90: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/90.jpg)
Vector methods
One way to bridge natural language and classical ML
After transforming to vectors, integration with ML systems is easy
Applications:
Search
Text classification
Preprocessing / feature extraction for any ML task e.g., neural networks image -> vector -> text
![Page 91: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/91.jpg)
Vector methods: bag of words
Preprocessing – tokenization, stemming/lemmatization, cleaning
Create a vector d with dimension V (size of vocabulary)
d_i = tf_i (term frequency of the i-th word)
A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
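The example above can be reproduced with a short sketch. The stop-word set, the one-entry lemma table and the fixed vocabulary order are assumptions chosen so that the output matches the slide; real preprocessing would use a proper tokenizer and lemmatizer.

```python
import re
from collections import Counter

STOPWORDS = {"a", "and", "on"}   # hypothetical stop-word list
LEMMAS = {"slept": "sleep"}      # hypothetical lemma table


def bag_of_words(text):
    """Tokenize, drop stop words, lemmatize and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS)


def to_vector(bag, vocabulary):
    """Project the bag onto a fixed vocabulary order (missing terms give 0)."""
    return [bag[term] for term in vocabulary]


vocabulary = ["black", "white", "cat", "sleep", "mat", "dog"]
bag = bag_of_words("A black cat and a white cat slept on a mat")
vector = to_vector(bag, vocabulary)
```

The vector has one position per vocabulary word, so terms that do not occur in the document (here "dog") contribute zeros.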
![Page 92: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/92.jpg)
Vector methods: bag of words improved
Fancier values instead of tf. (e.g., tf-idf)
Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}
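Both improvements fit in a few lines. The sketch below uses idf = log(N / df), which is one common variant among several; `add_bigrams`, `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def add_bigrams(tokens):
    """Extend the token list with adjacent word pairs such as 'black cat'."""
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def tf_idf(documents):
    """Weight each term by tf * log(N / df); one common tf-idf variant."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))   # count each term once per document
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in documents
    ]


docs = [
    add_bigrams(["black", "cat", "sleep", "mat"]),
    add_bigrams(["white", "cat", "sleep", "sofa"]),
    add_bigrams(["black", "dog", "bark"]),
]
weights = tf_idf(docs)
```

The bigram "black cat" occurs in one document only, so it receives a higher weight than "cat", which occurs in two.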
![Page 93: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/93.jpg)
Vector methods: dimensionality reduction
Each term is a feature - very big dimension
Dimensionality reduction
LSI (LSA) – term-document matrix decomposition
LDA – topic inference using probabilistic graphical model
Word2vec – transform words to vectors of given size, capture their context
gensim Python library
![Page 94: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/94.jpg)
Vector methods: Latent Semantic Indexing (LSI)
goal: map semantically similar documents to similar vectors
{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}
reduce dimensionality by singular value decomposition (SVD) of the term-document matrix
partly addresses synonymy and, to a lesser extent, homonymy
From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...
Source: Jialu Liu: Topic Model
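To make the decomposition less abstract, the sketch below extracts only the strongest topic of a tiny term-document matrix. It uses plain power iteration to find the leading left singular vector, where a real system would call a full SVD routine such as the one in gensim. The matrix and the helper names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_topic(term_doc, iterations=100):
    """Leading left singular vector of the term-document matrix."""
    term_doc_t = transpose(term_doc)
    u = normalize([1.0] * len(term_doc))
    for _ in range(iterations):
        # One power-iteration step on A * A^T.
        u = normalize(matvec(term_doc, matvec(term_doc_t, u)))
    return u


terms = ["car", "truck", "flower"]
term_doc = [      # rows: terms, columns: three toy documents
    [2, 1, 0],    # car
    [1, 1, 0],    # truck
    [0, 0, 1],    # flower
]
topic = dict(zip(terms, top_topic(term_doc)))
```

Because "car" and "truck" co-occur in the same documents, both get a large weight in the strongest topic while "flower" gets practically none, mirroring the merge shown on the slide.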
![Page 95: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/95.jpg)
Vector methods: Latent Dirichlet Allocation (LDA)
topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten
topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy
Finds probability distributions of topics for documents and words for topics
Source: Jialu Liu: Topic Model
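The two kinds of distributions combine as p(word | document) = Σ p(topic | document) · p(word | topic). The sketch below only illustrates that combination with the topic weights from this slide. It does not perform LDA inference, and the document's topic mixture is a made-up value.

```python
# Word distributions per topic, taken from the slide (truncated to 3 words).
TOPIC_WORDS = {
    "topic1": {"milk": 0.10, "meow": 0.09, "kitten": 0.08},
    "topic2": {"bark": 0.12, "bone": 0.11, "puppy": 0.07},
}


def word_probability(word, doc_topics):
    """p(word | doc) as a topic-weighted sum of p(word | topic)."""
    return sum(weight * TOPIC_WORDS[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())


# Hypothetical document that is mostly about cats.
doc_topics = {"topic1": 0.7, "topic2": 0.3}
p_milk = word_probability("milk", doc_topics)
p_bark = word_probability("bark", doc_topics)
```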
![Page 96: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/96.jpg)
Vector methods: Latent Dirichlet Allocation (LDA)
From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...
0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace +
![Page 97: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/97.jpg)
Vector methods: Word2vec
Uses the local context of each word (unlike plain bag of words); trained as either skip-gram or continuous bag of words (CBOW)
vector arithmetic: king - man + woman ≈ queen
uses neural networks
research shows analogy to matrix factorization
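The analogy arithmetic itself is simple once word vectors exist. The sketch below uses hand-made three-dimensional vectors purely for illustration, since trained embeddings have hundreds of dimensions and only approximate such relations. `VECTORS`, `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors (dimensions: royal, male, female). Not trained.
VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative):
    """Word closest (by cosine) to sum(positive) - sum(negative)."""
    dim = len(next(iter(VECTORS.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, VECTORS[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, VECTORS[word])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [w for w in VECTORS if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


answer = analogy(positive=["king", "woman"], negative=["man"])
```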
![Page 98: Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950 •Machine learning – circa from 1980 Approaches to NLP – Rule based • Chomsky (159):](https://reader035.vdocuments.net/reader035/viewer/2022071113/5fe9e8ca08228568ee38cf6b/html5/thumbnails/98.jpg)
Vector methods: Word2vec
model.wv.most_similar(positive=['nuclear'])  # current gensim; older releases also accepted model.most_similar
[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]