learning from data lecture 5: practice with new tasks and...
TRANSCRIPT
![Page 1: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/1.jpg)
Learning from DataLecture 5: Practice with new tasks and datasets
Malvina Nissim and Johannes [email protected], [email protected]
ESSLLI, 26 August 2016
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 1 / 20
![Page 2: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/2.jpg)
Getting the updated code
START AS SOON AS YOU ENTER THE ROOM! (PLEASE)
IMPORTANT AS ALL THE NEW DATASETS ARE IN
Approach 1: Use git (updateable, recommended if you have git)
1 In your terminal, type: ’git clonehttps://github.com/bjerva/esslli-learning-from-data-students.git’
2 Followed by ’cd esslli-learning-from-data-students’3 Whenever the code is updated, type: ’git pull’
Approach 2: Download a zip (static)
1 Download the zip archive from: https://github.com/bjerva/esslli-learning-from-data-students/archive/master.zip
2 Whenever the code is updated, download the archive again.
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 2 / 20
![Page 3: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/3.jpg)
General points
how to choose the “right” algorithm?
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 3 / 20
![Page 4: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/4.jpg)
![Page 5: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/5.jpg)
General points Classification vs Regression
Classification vs Regression
create models of prediction from gathered data
classificationthe dependent variables are categorical
input x: feature vectoroutput: discrete class label
regressionthe dependent variables are numerical
input x: feature vectoroutput y: continuous value
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 5 / 20
![Page 6: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/6.jpg)
General points Supervised vs Unsupervised
classification and regression are the most standard ways of doingsupervised learning
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 6 / 20
![Page 7: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/7.jpg)
General points Supervised vs Unsupervised
Supervised and Unsupervised Learning
information about the correct distribution/label of the training examples
in supervised learning it is known
→ fitting a model to labelled data which has the correct answerassociated to it
in unsupervised learning it is not known
→ finding structure in unlabelled data
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 7 / 20
![Page 10: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/10.jpg)
General points Unsupervised learning
Supervised learning
supervised learning – classification or regression
in training, instances are associated with their class label
based on features, the system must search for patterns and build amodel
the model must be able to predict the class of previously unseeninstances
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 10 / 20
![Page 11: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/11.jpg)
General points Unsupervised learning
Unsupervised learning
unsupervised learning – clustering
partitioning instances into subsets (clusters) that share similarcharacteristics
subsets are not predefined
a system can be told how many clusters it should form (K-meansalgorithm)
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 11 / 20
![Page 12: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/12.jpg)
practice
![Page 13: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/13.jpg)
datasets
![Page 14: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/14.jpg)
Practice
(Your) Datasets
Ok Cupid (4 tasks)
Word Sense Disambiguation in Russian (4 words)
Slovene Regional Language Variants
Pragmatic conditionals
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 14 / 20
![Page 15: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/15.jpg)
LEARNING FROM OKCUPID DATA
FILIP KLUBIČKA
![Page 16: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/16.jpg)
THE OKCUPID DATASET
![Page 17: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/17.jpg)
THE OKCUPID DATASET
![Page 18: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/18.jpg)
THE OKCUPID DATASET
EXAMPLE OF USER FEATURE VECTOR
▸ https://github.com/everett-wetchler/okcupid
▸ http://www.stephenleefischer.com/posts/scraping-okcupid-will-bot-for-love
![Page 19: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/19.jpg)
THE OKCUPID DATASET
FOUR TASKS
▸ Predict GENDER
▸ Predict ORIENTATION
▸ Predict DIET
▸ Predict SIGN
NB: there is one dataset per task
![Page 20: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/20.jpg)
WSD for Russian nouns Task: to predict word sense given context Words: lavka (2), kran (2), kosak (4), ruchka (5)
Contexts:
- 185 - 200 for each word
- from the Internet corpus
- about 10 words before and after the target word
Variations:
- left context / full context
- with lemmatization / without lemmatization
![Page 21: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/21.jpg)
WSD for Russian nouns lavka
kran
2
ruchka
1 2
4
3
5
2 1
1
1 2
3 4
kosak
![Page 22: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/22.jpg)
WSD for Russian nouns. Datasets lavka
…left_tokenized …full_tokenized …left_tokenized_lemmatized …full_tokenized_lemmatized
label text-cat targetword-cat 2 одессит игорь который уже привлекался
к ответственности за разбой и 35 летний валерий зайдя в ювелирной нанесли тяжкие телесные повреждения продавцу кастетом травмировали лицо и все же девушке удалось
лавке
![Page 23: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/23.jpg)
Classifica(on of Slovene Regional Language Variants
• Dataset: Slovene tweets manually annotated by user's region of origin – 4 out of 7 Slovene regions (Gorenjska, Dolenjska, Štajerska, Primorska) – 500 tweets per region = 2000 tweets
• Task: build a model to predict the regional language variant of Slovene tweets
![Page 24: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/24.jpg)
Practice
Pragmatic conditionals: ‘if p, q’ (Chi-He Elder, [email protected])
Categories
res = resultative (‘if you take the class you will learn a lot’)
inf = inferential (‘if it’s 6.30 the class must be over’)
pch = propositional content hedge (‘if I remember rightly...’)
ifh = illocutionary force hedge (‘...if you see what I mean’)
tm = topic marker (‘if you think about conditionals, they usually startwith ‘if”)
dir = directive (‘if you could just pay attention...’)
Features
‘Primary meaning’ (the main message communicated)- Bare form ‘if p, q’; p only; q only; enriched forms p′, q′, etc;
completely overridden logical form r
Speech act type (A = assertive, D = directive, C = commissive)
Conditionality of p and q (Y/N)
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 15 / 20
![Page 25: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/25.jpg)
Practice
Practical session
running experiments in small groups (2-3)
reporting on experiments (a couple of minutes per group, dependingon how many groups there are)
taskdatasetset upfeaturesclassifierresultsany reflections
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 16 / 20
![Page 26: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/26.jpg)
Practice
General commands
general options
--max-train-size N (maximum number of training samples to look at)
--nchars N --nwords N --features X Y Z
visualisation options
--cm (print confusion matrix + classification report)
--plot (shows CM)
algorithm-specific options
K-Nearest Neighbor (knn): --k N
Decision Tree (dt): --max-nodes N --min-samples N
example runpython run experiment.py --csv data/trainset-sentiment-extra.csv
--nchars 1 --algorithms svm --cm
Look into data folder for datasets’ names (tasks with more datasets have own dirunder data)
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 17 / 20
![Page 27: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/27.jpg)
Practice
Datasets names
Ok Cupid (4 tasks)
okcupid/okcupid data diet.csv
okcupid/okcupid data orientation.csv
okcupid/okcupid data sex.csv
okcupid/okcupid data sign.csv
Word Sense Disambiguation in Russian (4 words)
look into russian wsd/: 16 files (4 different settings per word)
Slovene Regional Language Variants
slovene-dialects.csv
Pragmatic conditionals
trainif.csv
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 18 / 20
![Page 28: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/28.jpg)
Practice
Running on a real test set
With a simple modification you can apply to the run experiment.py
script, you can eventually evaluate your model on completely unseen data:a 15% of the whole dataset that gets held out while you develop.
Change:
#print(’\nResults on the test set:’)
#evaluate classifier(clf, test X, test y, args)
to
print(’\nResults on the test set:’)
evaluate classifier(clf, test X, test y, args)
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 19 / 20
![Page 29: Learning from Data Lecture 5: Practice with new tasks and ...esslli2016.unibz.it/wp-content/uploads/2015/10/lecture5.pdf · Word Sense Disambiguation in Russian (4 words) look into](https://reader034.vdocuments.net/reader034/viewer/2022043018/5f3a4338f0d5b860a0461cbc/html5/thumbnails/29.jpg)
End
Bye bye
Malvina: [email protected]: [email protected]
repo: github.com/bjerva/esslli-learning-from-data-students
Malvina & Johannes LFD – Lecture 5 ESSLLI, 26 August 2016 20 / 20