applied machine learning lecture 4-2: data collection ...richajo/dit866/lectures/l4/l4_2.pdf ·...


Post on 13-Jul-2020


Page 1: Applied Machine Learning Lecture 4-2: Data collection ...richajo/dit866/lectures/l4/l4_2.pdf · Applied Machine Learning Lecture 4-2: Data collection, bias, and ... Manual annotation

Applied Machine Learning
Lecture 4-2: Data collection, bias, and annotation

Selpi ([email protected])

These slides are a further development of Richard Johansson's slides

January 31, 2020


Overview

The need for DATA and how to get DATA

Data collection and bias

Manual annotation

Quality control of annotation

Review and Closing


Supervised, unsupervised, and semi-supervised learning

- To be able to learn, the machine needs DATA!



Scraping from websites or using open APIs


Copyright issues

- Published on the web ≠ freely available!

- There is a risk that the work you do will be wasted
  - Twitter datasets

- May distribute just the URLs (as in e.g. ImageNet)
  - but they may disappear


How and where do we get data?

- Download publicly open data from:
  - UCI Machine Learning Repository
  - data.europa.eu
  - ...

- Get access to publicly accessible but regulated (to varying degrees) data from:
  - Swedish Traffic Accident Data Acquisition (STRADA)
  - Authors of papers who made their data accessible to users after registration (e.g., HighD)
  - Kaggle, ...

- Pay to get some data
  - SHRP2 Naturalistic Driving Data
  - ...

- Or collect new data → This can be challenging!



Discuss the projects used for illustrations


assumptions about data in machine learning


what's the "population"? what's "representative"?

- the sample is representative if what's true about the sample is also true in general
  - is our sample of drivers for a certain vehicle brand representative of all drivers?
  - are our images taken in good lighting conditions representative of "images in general"?

- depending on the type of data, it can be hard to determine whether a sample is representative in practice

- useful to document the composition of a dataset


Example of bias

- in 1936, Literary Digest polled a few million Americans about their preferred candidate in the presidential election
  - result of the poll: Landon 57%, Roosevelt 43%
  - result of the election: Roosevelt 62%, Landon 38%

- the massive polling error was caused by
  - sampling bias: they polled people with a phone; poorer people were overrepresented among people without a phone
  - nonresponse bias: who are the people who answer the survey?
    - similar difficulty: self-selection in web survey data

- What about in real driving data collection?
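The effect of sampling bias can be reproduced in a small simulation. The sketch below is only an illustration: the population size, the share of phone owners and the voting preferences are invented numbers, not historical data. It compares a poll drawn only from phone owners with a simple random sample.

```python
import random

random.seed(0)

# Hypothetical population (all numbers invented for illustration):
# 30% of voters own a phone; phone owners lean towards Landon,
# everybody else leans towards Roosevelt.
N = 100_000
population = []
for _ in range(N):
    has_phone = random.random() < 0.30
    p_roosevelt = 0.40 if has_phone else 0.70
    votes_roosevelt = random.random() < p_roosevelt
    population.append((has_phone, votes_roosevelt))


def roosevelt_share(voters):
    """Fraction of the given voters who prefer Roosevelt."""
    return sum(vote for _, vote in voters) / len(voters)


true_share = roosevelt_share(population)

# Biased poll: we can only reach people who own a phone.
phone_owners = [person for person in population if person[0]]
biased_poll = random.sample(phone_owners, 5000)

# Unbiased poll: simple random sample from the whole population.
random_poll = random.sample(population, 5000)

print("whole population:", round(true_share, 3))
print("phone-only poll: ", round(roosevelt_share(biased_poll), 3))
print("random poll:     ", round(roosevelt_share(random_poll), 3))
```

Both polls have the same size; only the phone-only poll misses the true share by a wide margin, so a larger sample would not repair it.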


stratification and weighting

[source]

- To decide, take into account the purpose of collecting data

- Illustrate different scenarios of sampling drivers w.r.t. age groups
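One way to picture weighting is the following sketch. It assumes we know the share of each age group in the driver population and that our sample over-represents young drivers. The age groups, shares, counts and per-group averages are hypothetical values chosen for illustration.

```python
# Hypothetical share of each age group among all drivers.
population_share = {"18-30": 0.20, "31-60": 0.55, "61+": 0.25}

# Hypothetical number of drivers per age group in our sample
# (young drivers are over-represented).
sample_counts = {"18-30": 500, "31-60": 400, "61+": 100}

# Hypothetical mean of some measured quantity per group (e.g. speed in km/h).
group_mean = {"18-30": 58.0, "31-60": 52.0, "61+": 47.0}

n = sum(sample_counts.values())

# Plain sample mean: dominated by the over-represented group.
unweighted = sum(sample_counts[g] * group_mean[g] for g in sample_counts) / n

# Weight of each sampled driver = population share / sample share of the group.
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

# Weighted mean: each group counts in proportion to its population share.
weighted_total = sum(weights[g] * sample_counts[g] * group_mean[g] for g in sample_counts)
weight_sum = sum(weights[g] * sample_counts[g] for g in sample_counts)
weighted = weighted_total / weight_sum

print("unweighted mean:", round(unweighted, 2))
print("weighted mean:  ", round(weighted, 2))
```

Whether to weight like this, or to sample each group in fixed proportions from the start, depends on the purpose of the data collection.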


Stratified train/test splits in scikit-learn

from sklearn.model_selection import train_test_split

X, Y = ( ... read the dataset ...)

# the keyword argument is called "stratify": the split keeps the class proportions of Y
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y,
                                                stratify=Y)

- but this doesn't solve the problem of the difference between our sample and the real-world distribution...
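To see what the `stratify` argument of `train_test_split` does, the sketch below runs a stratified split on a synthetic, imbalanced toy dataset and compares the class proportions. The data, `test_size` and `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic toy data: 900 examples of class 0 and 100 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = np.array([0] * 900 + [1] * 100)

# Stratified split: training and test parts keep the 90/10 class ratio.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)

print("share of class 1 in the full data:", Y.mean())
print("share of class 1 in the train set:", Ytrain.mean())
print("share of class 1 in the test set: ", Ytest.mean())
```

The split only reproduces the proportions found in `Y`. If `Y` itself is a biased sample, the train and test sets inherit that bias.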


availability vs. representativity

- sometimes, we don't have the luxury of selecting a "representative" sample: we just have to take what we can get
  - observational data in medicine
  - historical data
  - drivers in naturalistic driving study

- technical issues:
  - fiction is harder to access than web-published text (news, blogs, ...)

- copyright: we get a bias if we only include free data


sampling effects and machine learning systems

- genre: what if there are only book reviews in our sentiment dataset?

- time: how well will my system work in a different year, a different season?

- selection: what if the skin tumor detection system was trained only on people who saw a specialist?

- demography: what if there are only white people, or only people without glasses in the training data for image classifiers?

[source]


Cross-domain classification example

- Example of sample selection bias

- See notebook
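The course notebook is not reproduced here. As an independent sketch of the same idea, the example below uses synthetic data with two "domains": in domain A only the first feature carries the label signal, in domain B only the second one does (think of domain-specific vocabulary). The function `make_domain` and all numbers are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)


def make_domain(n, informative_feature):
    """Two-feature toy data in which only one feature depends on the label."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2))
    # Shift the informative feature by -2 or +2 depending on the label.
    X[:, informative_feature] += 2.0 * (2 * y - 1)
    return X, y


# Training data and in-domain test data come from domain A.
X_train, y_train = make_domain(2000, informative_feature=0)
X_test_a, y_test_a = make_domain(1000, informative_feature=0)

# Out-of-domain test data comes from domain B.
X_test_b, y_test_b = make_domain(1000, informative_feature=1)

clf = LogisticRegression()
clf.fit(X_train, y_train)

acc_in = accuracy_score(y_test_a, clf.predict(X_test_a))
acc_out = accuracy_score(y_test_b, clf.predict(X_test_b))

print("accuracy on domain A (same domain as training):", acc_in)
print("accuracy on domain B (different domain):       ", acc_out)
```

The classifier has learned cues that exist only in the domain it was trained on, so an evaluation on in-domain data alone would overestimate how well it works elsewhere.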


Other aspects to consider for data collection

- Ethical issues (e.g., animal/human testing)

- Legal issues (e.g., foreign companies cannot collect GPS data to be used outside China)

- Budget & time



Training data for imitating human decisions

- in many cases, the goal of a predictive system is to automate human decisions

- in practice, the human input is often missing and has to be added manually

- this process is called annotation
  - or "labeling", "tagging", "coding" etc.

- in real-world scenarios, this is a substantial investment

- we will now discuss some practical aspects of annotation


Some types of annotation

- categories:
  - what type of animal is this?
  - is this email spam or legit?
  - does this event lead to crash or not?

- segmentation or tagging:
  - highlight the parts of an image showing a street sign
  - mark when driver is distracted (eating, texting, talking on phone, etc.)
  - mark the segments of the text that refer to proteins

- graphs, trees and other types of structures:
  - biology, language, ...


Tools for annotation

- small projects, simple annotation: text file, Excel, directories

- in the long run, it usually pays off to find or develop a specialized annotation user interface
  - because the type of data to annotate is complex
  - because we want to keep track of annotators

- Example tool for annotating driving data (Fig.67 D3.3)


example: object annotation

[source]


example of annotating text: names

- http://brat.nlplab.org/

- WebAnno is a similar tool: https://webanno.github.io/webanno/


example: relation annotation in biomedical text


Biases in annotation

- is the user interface biased?

- is some choice easier? is there a "default"?

- are the annotators paid by the hour or by quantity?

- boredom?


Annotation manual / specifications

- See SHRP2 Code book / data dictionary

- we need to write down a manual specifying the task in detail

- the clarity of the manual will influence the quality of the annotation

- a few useful things to include:
  - the purpose of the annotation
  - definitions of the concepts in the model
  - ... and practical explanations of how they are applied
  - a reasonable amount of examples
  - describe common hard cases, borderline situations


example: defining an annotation task


Who should annotate and how to get annotators

- specialists? students with specialist training? a company specialising in this task?

- use software to do semi-automatic annotation?

- use crowdsourcing (e.g., non-experts instead of trained experts)?
  - the most well-known framework is Amazon Mechanical Turk: http://mturk.com
  - risk of cheating, ethical issues (e.g., low salary)


Example of unpaid crowdsourcing: reCAPTCHA


Example of unpaid crowdsourcing: A/B testing


Examples of companies in the annotation business

- https://www.annotell.com/ (Gothenburg)

- https://www.figure-eight.com/ (formerly CrowdFlower)

- https://appen.com

- https://www.cogitotech.com/



Safeguards in crowdsourcing

- inspection after the fact

- mix annotation with checks

- double annotation

- inter-annotator agreement (to see how often the annotators agree with each other)
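With double annotation, agreement can be quantified. One common score (chosen here for illustration; the slides do not single out a particular one) is Cohen's kappa, which corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from scratch on invented labels from two hypothetical annotators.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same category
    # if each annotator labels independently with their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Invented example: two annotators label ten emails.
annotator_a = ["spam", "spam", "legit", "legit", "spam",
               "legit", "legit", "legit", "spam", "legit"]
annotator_b = ["spam", "legit", "legit", "legit", "spam",
               "legit", "spam", "legit", "spam", "legit"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa(annotator_a, annotator_b)

print("raw agreement:", raw_agreement)
print("Cohen's kappa:", round(kappa, 3))
```

The annotators agree on 8 of 10 items, but since about half of that agreement would be expected by chance, kappa is noticeably lower than the raw agreement.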


various inter-annotator scores in Python

- the StatsModels Python library includes some of these scores
  http://www.statsmodels.org/dev/stats.html#module-statsmodels.stats.inter_rater
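As a small usage sketch, the example below computes Fleiss' kappa for three annotators with `aggregate_raters` and `fleiss_kappa` from `statsmodels.stats.inter_rater`. The ratings are invented, and the choice of Fleiss' kappa is an assumption for illustration; it is one of several scores in the module.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Invented ratings: one row per item, one column per annotator,
# labels encoded as integer category codes (0 or 1).
ratings = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
])

# Convert to a table of counts: one row per item, one column per category.
table, categories = aggregate_raters(ratings)

kappa = fleiss_kappa(table, method="fleiss")
print("Fleiss' kappa:", round(kappa, 3))
```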



Review of data collection, bias, and annotation

- On data collection:
  - Reason why the choice of data collection could have a big influence on machine learning performance

- On annotation:
  - Explain the pros and cons of the different methods used for data annotation (see "Pros and cons of labelling approaches")
  - Describe what could be done to control the quality of data annotation
  - Explain how data annotation could influence the performance of machine learning systems

- On bias:
  - Suggest ways to minimise the bias from data collection and annotation


Next lecture (on Friday next week)

- Optimisation in machine learning

- Logistic regression and support vector classifiers