applied machine learning lecture 4-2: data collection ...richajo/dit866/lectures/l4/l4_2.pdf ·...

Applied Machine Learning
Lecture 4-2: Data collection, bias, and annotation

Selpi ([email protected])

The slides are further development of Richard Johansson's slides

January 31, 2020


Overview

The need for DATA and how to get DATA

Data collection and bias

Manual annotation

Quality control of annotation

Review and Closing

Supervised, unsupervised, and semi-supervised learning

- To be able to learn, the machine needs DATA!

Scraping from websites or using open APIs

Copyright issues

- Published on the web ≠ freely available!
- There is a risk that the work you do will be wasted
  - Twitter datasets
- May distribute just the URLs (as in e.g. ImageNet)
  - but they may disappear

How and where do we get data?

- Download publicly open data from:
  - UCI Machine Learning Repository
  - data.europa.eu
  - ...
- Get access to publicly accessible but regulated (to varying degrees) data from:
  - Swedish Traffic Accident Data Acquisition (STRADA)
  - Authors of papers who made their data accessible to users after registration (e.g., HighD)
  - Kaggle, ...
- Pay to get some data
  - SHRP2 Naturalistic Driving Data
  - ...
- Or collect new data. This can be challenging!

https://archive.ics.uci.edu/ml/datasets.php

https://data.europa.eu/

https://www.transportstyrelsen.se/STRADA

https://www.highd-dataset.com/

https://insight.shrp2nds.us/home/index


Discuss the projects used for illustrations

assumptions about data in machine learning

what's the "population"? what's "representative"?

- the sample is representative if what's true about the sample is also true in general
  - is our sample of drivers for a certain vehicle brand representative of all drivers?
  - are our images taken in good lighting conditions representative of "images in general"?
- depending on the type of data, it can be hard to determine whether a sample is representative in practice
- useful to document the composition of a dataset
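
One lightweight way to document the composition of a dataset is to tabulate how the examples are distributed over a metadata attribute. The following is a minimal sketch, assuming each example carries a hypothetical `age_group` field; the records and the helper name `composition` are invented for illustration.

```python
from collections import Counter

def composition(records, attribute):
    """Return the share of records for each value of `attribute`."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical driver records; field names and values are illustrative only.
drivers = [
    {"id": 1, "age_group": "18-30"},
    {"id": 2, "age_group": "31-50"},
    {"id": 3, "age_group": "31-50"},
    {"id": 4, "age_group": "51+"},
]

shares = composition(drivers, "age_group")
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.0%}")
```

Comparing such a table with what is known about the target population is a first, rough check of representativeness.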

Example of bias

- in 1936, Literary Digest polled a few million Americans about their preferred candidate in the presidential election
  - result of the poll: Landon 57%, Roosevelt 43%
  - result of the election: Roosevelt 62%, Landon 38%
- the massive polling error was caused by
  - sampling bias: they polled people with a phone; poorer people were overrepresented among people without a phone
  - nonresponse bias: who are the people who answer the survey?
  - similar difficulty: self-selection in web survey data
- What about in real driving data collection?
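
The mechanism behind this kind of polling error can be illustrated with a small calculation. This is only a sketch: the population below is invented (it is not historical data), and it assumes that phone ownership is correlated with candidate preference.

```python
# Hypothetical population (numbers invented for illustration, not historical data).
# Each group is (has_phone, number of voters, share preferring candidate A).
groups = [
    (True, 4_000, 0.60),   # phone owners lean towards candidate A
    (False, 6_000, 0.25),  # voters without a phone lean towards candidate B
]

def support_for_a(selected_groups):
    """Share of voters preferring candidate A among the selected groups."""
    voters = sum(n for _, n, _ in selected_groups)
    a_voters = sum(n * p for _, n, p in selected_groups)
    return a_voters / voters

true_support = support_for_a(groups)
phone_poll = support_for_a([g for g in groups if g[0]])

print(f"whole population:  {true_support:.0%} for A")
print(f"phone owners only: {phone_poll:.0%} for A")
```

With these made-up numbers, a poll restricted to phone owners predicts a win for A although A loses in the whole population. Polling more phone owners would not help, because the error comes from who is sampled, not from how many.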

stratification and weighting

[source]

- To decide, take into account the purpose of collecting data
- Illustrate different scenarios of sampling drivers w.r.t. age groups

https://commons.wikimedia.org/wiki/File:StratifiedRandomSampling.jpg
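
When the share of each stratum in the target population is known, a sample statistic can be reweighted so that each stratum counts in proportion to that share. The sketch below shows this for a mean over age groups. All numbers, the group names, and the helper names are hypothetical, and it assumes every stratum appears at least once in the sample.

```python
# Hypothetical numbers for illustration only.
# Known share of each age group in the target population.
population_share = {"18-30": 0.2, "31-50": 0.5, "51+": 0.3}

# Sample of (age group, measured value); young drivers are over-represented.
sample = [
    ("18-30", 8.0), ("18-30", 6.0), ("18-30", 7.0), ("18-30", 7.0),
    ("31-50", 5.0), ("31-50", 3.0),
    ("51+", 2.0), ("51+", 4.0),
]

def stratum_means(sample):
    """Mean of the measured value within each stratum."""
    sums, counts = {}, {}
    for group, value in sample:
        sums[group] = sums.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def weighted_mean(sample, population_share):
    """Stratum means weighted by the known population shares."""
    means = stratum_means(sample)
    return sum(population_share[g] * means[g] for g in means)

naive = sum(v for _, v in sample) / len(sample)
adjusted = weighted_mean(sample, population_share)
print(f"naive mean: {naive:.2f}, weighted mean: {adjusted:.2f}")
```

The naive mean is pulled towards the over-represented group, while the weighted mean reflects the assumed population shares. Weighting can only correct for attributes that were recorded and whose population shares are known.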

Stratified train/test splits in scikit-learn

from sklearn.model_selection import train_test_split

X, Y = ( ... read the dataset ...)

# stratify=Y keeps the class proportions of Y the same in both parts
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y,
                                                stratify=Y)

- but this doesn't solve the problem of the difference between our sample and the real-world distribution...
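
To make concrete what a stratified split does, here is a sketch in plain Python that splits each class separately, so that both parts keep roughly the same class proportions. It is an illustration of the idea, not scikit-learn's implementation; the function name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(X, Y, test_fraction=0.25, seed=0):
    """Split (X, Y) so that each class in Y is divided in the same proportion."""
    rng = random.Random(seed)

    # Group the example indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(Y):
        by_class[label].append(index)

    # Shuffle within each class and cut off the test part class by class.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(Y, train_idx), pick(Y, test_idx)

# Toy data: 80 "neg" examples and 20 "pos" examples.
X = list(range(100))
Y = ["neg"] * 80 + ["pos"] * 20

Xtrain, Xtest, Ytrain, Ytest = stratified_split(X, Y)
print(Ytrain.count("pos") / len(Ytrain), Ytest.count("pos") / len(Ytest))
```

Both parts end up with the same share of "pos" examples as the full toy dataset. The split only mirrors the sample's own class distribution, which may still differ from the real-world one.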

availability vs. representativity

- sometimes, we don't have the luxury of selecting a "representative" sample: we just have to take what we can get
  - observational data in medicine
  - historical data
  - drivers in naturalistic driving study
- technical issues:
  - fiction is harder to access than web-published text (news, blogs, ...)
- copyright: we get a bias if we only include free data

sampling effects and machine learning systems

- genre: what if there are only book reviews in our sentiment dataset?
- time: how well will my system work in a different year, or a different season?
- selection: what if the skin tumor detection system was trained only on people who saw a specialist?
- demography: what if there are only white people, or only people without glasses in the training data for image classifiers?

[source]

https://arxiv.org/pdf/1511.05547.pdf

Cross-domain classification example

- Example of sample selection bias
- See notebook

Other aspects to consider for data collection

- Ethical issues (e.g., animal/human testing)
- Legal issues (e.g., foreign companies cannot collect GPS data to be used outside China)
- Budget & time


Training data for imitating human decisions

- in many cases, the goal of a predictive system is to automate human decisions
- in practice, the human input is often missing and has to be added manually
- this process is called annotation
  - or "labeling", "tagging", "coding", etc.
- in real-world scenarios, this is a substantial investment
- we will now discuss some practical aspects of annotation

Some types of annotation

- categories:
  - what type of animal is this?
  - is this email spam or legit?
  - does this event lead to a crash or not?
- segmentation or tagging:
  - highlight the parts of an image showing a street sign
  - mark when the driver is distracted (eating, texting, talking on the phone, etc.)
  - mark the segments of the text that refer to proteins
- graphs, trees and other types of structures:
  - biology, language, ...

Tools for annotation

- small projects, simple annotation: text file, Excel, directories
- in the long run, it usually pays off to find or develop a specialized annotation user interface
  - because the type of data to annotate is complex
  - because we want to keep track of annotators
- Example tool for annotating driving data (Fig. 67, D3.3)

https://research.chalmers.se/publication/171344/file/171344_Fulltext.pdf

example: image tagging

[source]

https://www.androidauthority.com/cloud-automl-vision-guide-894671/

example: object annotation

[source]

https://www.quora.com/What-is-annotation-in-machine-learning

example: tagging relevant audio segments

[source]

https://www.researchgate.net/publication/328997207_To_bee_or_not_to_bee_Investigating_machine_learning_approaches_for_beehive_sound_recognition

example of annotating text: names

- http://brat.nlplab.org/
- WebAnno is a similar tool: https://webanno.github.io/webanno/

example: relation annotation in biomedical text

Biases in annotation

- is the user interface biased?
- is some choice easier? is there a "default"?
- are the annotators paid by the hour or by quantity?
- boredom?

Annotation manual / specifications

- See SHRP2 Code book / data dictionary
- we need to write down a manual specifying the task in detail
- the clarity of the manual will influence the quality of the annotation
- a few useful things to include:
  - the purpose of the annotation
  - definitions of the concepts in the model
  - ... and practical explanations of how they are applied
  - a reasonable number of examples
  - descriptions of common hard cases and borderline situations

https://insight.shrp2nds.us/projectBackground/index

example: defining an annotation task

Who should annotate and how to get annotators

- specialists? students with specialist training? a company specialising in this task?
- use software to do semi-automatic annotation?
- use crowdsourcing (e.g., non-experts instead of trained experts)?
  - the most well-known framework is Amazon Mechanical Turk: http://mturk.com
  - risk of cheating, ethical issues (e.g., low pay)

Example of unpaid crowdsourcing: reCAPTCHA

Example of unpaid crowdsourcing: A/B testing

Examples of companies in the annotation business

- https://www.annotell.com/ (Gothenburg)
- https://www.figure-eight.com/ (formerly CrowdFlower)
- https://appen.com
- https://www.cogitotech.com/


Safeguards in crowdsourcing

- inspection after the fact
- mix annotation with checks
- double annotation
- inter-annotator agreement (to see how often the annotators agree with each other)
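
The idea of mixing annotation with checks can be sketched as follows: a few items with known answers are interleaved with the real task, and each annotator's accuracy on those items is monitored. This is a minimal illustration under stated assumptions; the item ids, labels, and the helper name `gold_accuracy` are hypothetical.

```python
def gold_accuracy(annotations, gold):
    """Fraction of check items that one annotator labelled correctly.

    annotations: dict mapping item id -> label given by the annotator
    gold: dict mapping item id -> known correct label for the check items
    """
    checked = [item for item in gold if item in annotations]
    if not checked:
        return None  # the annotator saw no check items
    correct = sum(annotations[item] == gold[item] for item in checked)
    return correct / len(checked)

# Hypothetical check items with known answers, mixed into the real task.
gold = {"img3": "cat", "img7": "dog", "img9": "cat"}

# Hypothetical labels from two crowd workers.
worker_a = {"img1": "dog", "img3": "cat", "img7": "dog", "img9": "cat"}
worker_b = {"img2": "cat", "img3": "dog", "img7": "dog", "img9": "dog"}

for name, labels in [("worker_a", worker_a), ("worker_b", worker_b)]:
    print(name, gold_accuracy(labels, gold))
```

A worker who always picks the same label, like `worker_b` here, scores poorly on the check items. What accuracy threshold to require is a project decision that this sketch leaves open.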

various inter-annotator scores in Python

- the StatsModels Python library includes some of these scores

http://www.statsmodels.org/dev/stats.html#module-statsmodels.stats.inter_rater
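
As an illustration of what such a score computes, here is a plain-Python sketch of Cohen's kappa for two annotators: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance given each annotator's label frequencies. The extracted slides do not say which scores were presented, so the choice of kappa is an assumption. The code is not taken from StatsModels, and the example labels are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label everywhere; kappa is
        # undefined, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of ten items by two annotators.
ann_1 = ["spam", "spam", "legit", "legit", "spam",
         "legit", "legit", "spam", "legit", "legit"]
ann_2 = ["spam", "legit", "legit", "legit", "spam",
         "legit", "spam", "spam", "legit", "legit"]

print(f"kappa = {cohen_kappa(ann_1, ann_2):.3f}")
```

In this toy example the annotators agree on 8 of 10 items (p_o = 0.8), but with p_e = 0.52 the kappa is only about 0.58, because much of the raw agreement would be expected by chance.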


Review of data collection, bias, and annotation

- On data collection:
  - Reason why the choice of data collection could have a big influence on machine learning performance
- On annotation:
  - Explain the pros and cons of the different methods used for data annotation (see "Pros and cons of labelling approaches")
  - Describe what could be done to control the quality of data annotation
  - Explain how data annotation could influence the performance of machine learning systems
- On bias:
  - Suggest ways to minimise the bias from data collection and annotation

https://www.kdnuggets.com/2018/05/data-labeling-machine-learning.html

Next lecture (on Friday next week)

- Optimisation in machine learning
- Logistic regression and support vector classifiers
