Machine Learning for Language Technology
Lecture 2: Basic Concepts
Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and material
Outline
• Definition of Machine Learning
• Types of Machine Learning:
  – Classification
  – Regression
  – Supervised Learning
  – Unsupervised Learning
  – Reinforcement Learning
• Supervised Learning:
  – Supervised Classification
  – Training set
  – Hypothesis class
  – Empirical error
  – Margin
  – Noise
  – Inductive bias
  – Generalization
  – Model assessment
  – Cross-Validation
  – Classification in NLP
  – Types of Classification
What is Machine Learning
• Machine learning is programming computers to optimize a performance criterion for some task using example data or past experience
• Why learning?
  – No known exact method – vision, speech recognition, robotics, spam filters, etc.
  – Exact method too expensive – statistical physics
  – Task evolves over time – network routing
• Compare:
  – No need to use machine learning for computing payroll… we just need an algorithm
Machine Learning – Data Mining – Artificial Intelligence – Statistics
• Machine Learning: creation of a model that uses training data or past experience
• Data Mining: application of learning methods to large datasets (ex. physics, astronomy, biology, etc.)
  – Text mining = machine learning applied to unstructured textual data (ex. sentiment analysis, social media monitoring, etc.; see Text Mining, Wikipedia)
• Artificial intelligence: a model that can adapt to a changing environment.
• Statistics: Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
The bio-cognitive analogy
• Imagine a learning algorithm as a single neuron.
• This neuron receives input from other neurons, one for each input feature.
• The strengths of these inputs are the feature values.
• Each input has a weight and the neuron simply sums up all the weighted inputs.
• Based on this sum, the neuron decides whether to “fire” or not. Firing is interpreted as being a positive example and not firing is interpreted as being a negative example.
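To make the analogy concrete, here is a minimal sketch of such a neuron in Python; the function name, weights, and feature values are invented for illustration, not taken from the lecture.

```python
def neuron_fires(features, weights, threshold=0.0):
    """Sum the weighted inputs and 'fire' if the sum exceeds the threshold.

    Firing (True) is read as predicting the positive class,
    not firing (False) as predicting the negative class.
    """
    activation = sum(w * x for w, x in zip(weights, features))
    return activation > threshold

# Illustrative example: two input features with hand-picked weights.
print(neuron_fires(features=[0.9, 0.2], weights=[1.0, -0.5]))  # True  (fires)
print(neuron_fires(features=[0.1, 0.8], weights=[1.0, -0.5]))  # False
```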
Elements of Machine Learning
1. Generalization:
   – Generalize from specific examples
   – Based on statistical inference
2. Data:
   – Training data: specific examples to learn from
   – Test data: (new) specific examples to assess performance
3. Models:
   – Theoretical assumptions about the task/domain
   – Parameters that can be inferred from data
4. Algorithms:
   – Learning algorithm: infer model (parameters) from data
   – Inference algorithm: infer predictions from model
Types of Machine Learning
• Association
• Supervised Learning
– Classification
– Regression
• Unsupervised Learning
• Reinforcement Learning
Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services
Example: P ( chips | beer ) = 0.7
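A hedged sketch of how such a conditional probability could be estimated from transaction counts; the toy baskets and the function name are invented.

```python
# Toy transactions; each is the set of products in one basket.
transactions = [
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer"},
    {"beer", "chips"}, {"beer", "diapers"}, {"chips"},
]

def conditional_probability(x, y, baskets):
    """Estimate P(Y | X): fraction of baskets containing x that also contain y."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_probability("beer", "chips", transactions))  # 0.6 on this toy data
```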
Classification
• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
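A minimal sketch of this discriminant as code; the threshold values for θ1 and θ2 below are invented placeholders, since in practice they would be learned from training data.

```python
def credit_risk(income, savings, theta1=30_000, theta2=10_000):
    """Discriminant from the slide: low-risk iff income > θ1 AND savings > θ2.

    The thresholds theta1 and theta2 are made-up illustrative values.
    """
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(credit_risk(income=45_000, savings=15_000))  # low-risk
print(credit_risk(income=45_000, savings=5_000))   # high-risk
```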
Classification in NLP
• Binary classification:
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. non-error)
• Multiclass classification:
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
• Structured prediction:
  – Part-of-speech tagging (classes = tag sequences)
  – Syntactic parsing (classes = parse trees)
Regression
• Example: price of a used car
• x: car attributes
• y: price
• Model: $y = g(x \mid \theta)$, where $g(\cdot)$ is the model and $\theta$ are its parameters
• Linear model: $y = wx + w_0$
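As an illustrative sketch, the parameters $w$ and $w_0$ of the linear model can be fitted by ordinary least squares; the car data below is invented.

```python
# Toy data: car age (years) vs. price (arbitrary units); values are invented.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [9.0, 8.1, 6.9, 6.2, 4.8]

# Closed-form least-squares estimates for w and w0.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w * mean_x

print(f"g(x) = {w:.2f} * x + {w0:.2f}")               # fitted line
print(f"predicted price at x = 2.5: {w * 2.5 + w0:.2f}")
```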
Uses of Supervised Learning
• Prediction of future cases:
– Use the rule to predict the output for future inputs
• Knowledge extraction:
– The rule is easy to understand
• Compression:
– The rule is simpler than the data it explains
• Outlier detection:
– Exceptions that are not covered by the rule, e.g., fraud
Unsupervised Learning
• Finding regularities in data
• No mapping to outputs
• Clustering:
– Grouping similar instances
• Example applications:
  – Customer segmentation in CRM
– Image compression: Color quantization
– NLP: Unsupervised text categorization
Reinforcement Learning
• Learning a policy = sequence of outputs/actions
• No supervised output but delayed reward
• Example applications:
– Game playing
– Robot in a maze
– NLP: Dialogue systems
Supervised Classification
• Learning the class C of a “family car” from examples:
  – Prediction: Is car x a family car?
  – Knowledge extraction: What do people expect from a family car?
• Output (labels): Positive (+) and negative (–) examples
• Input representation (features):
x1: price, x2: engine power
Training set X
$$X = \{\,x^t, r^t\,\}_{t=1}^{N} \qquad
r^t = \begin{cases} 1 & \text{if } x^t \text{ is positive} \\ 0 & \text{if } x^t \text{ is negative} \end{cases} \qquad
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Hypothesis class H
$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
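A minimal sketch of one hypothesis h from this class of axis-aligned rectangles; the interval bounds below are invented placeholders.

```python
def rectangle_hypothesis(price, engine_power,
                         p1=10_000, p2=20_000, e1=60, e2=120):
    """Axis-aligned rectangle hypothesis: positive iff both features fall
    inside the two intervals. The bounds are illustrative, not learned."""
    return p1 <= price <= p2 and e1 <= engine_power <= e2

print(rectangle_hypothesis(15_000, 90))   # True  -> predicted family car
print(rectangle_hypothesis(40_000, 200))  # False -> predicted not family car
```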
Empirical (training) error
Empirical error of h on X:

$$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$$

$$E(h \mid X) = \sum_{t=1}^{N} \mathbb{1}\big(h(x^t) \neq r^t\big)$$
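A small sketch of computing this empirical error, reusing the rectangle-style hypothesis from above; the instances and labels are invented.

```python
def empirical_error(h, X, r):
    """E(h | X): number of training instances on which h disagrees with
    the label (divide by len(X) for the error rate)."""
    return sum(1 for x_t, r_t in zip(X, r) if h(x_t) != r_t)

# Toy check: x = (price, engine power).
h = lambda x: 1 if 10_000 <= x[0] <= 20_000 and 60 <= x[1] <= 120 else 0
X = [(15_000, 90), (25_000, 150), (12_000, 70), (18_000, 200)]
r = [1, 0, 1, 1]          # invented labels; the last instance is misclassified
print(empirical_error(h, X, r))   # 1
```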
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
Any h ∈ H between S and G is consistent [E(h | X) = 0]; these hypotheses make up the version space.
Margin
• Margin: the distance between the class boundary and the instances closest to it
• Choose the h with the largest margin
Noise
• Unwanted anomaly in data:
  – Imprecision in input attributes
  – Errors in labeling data points
  – Hidden attributes (relative to H)
• Consequence:
  – No h in H may be consistent!
Noise and Model Complexity
Arguments for simpler model (Occam’s razor principle):
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if data is noisy)
Inductive Bias
• Learning is an ill-posed problem:
  – Training data is never sufficient to find a unique solution
  – There are always infinitely many consistent hypotheses
• We need an inductive bias:
  – Assumptions that entail a unique h for a training set X
  1. Hypothesis class H – axis-aligned rectangles
  2. Learning algorithm – find a consistent hypothesis with maximum margin
  3. Hyperparameters – trade-off between training error and margin
Model Selection and Generalization
• Generalization – how well a model performs on new data
  – Overfitting: H more complex than C
  – Underfitting: H less complex than C
Triple Trade-Off
• Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size N
3. Generalization error E on new data
• Dependencies:
  – As N↑, E↓
  – As c(H)↑, first E↓ and then E↑
Model Selection: Generalization Error
• To estimate generalization error, we need data unseen during training: a validation set

$$V = \{\,x^t, r^t\,\}_{t=1}^{M} \qquad V \cap X = \emptyset$$

• Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V) to select the model hi with the smallest generalization error:

$$\hat{E} = E(h \mid V) = \sum_{t=1}^{M} \mathbb{1}\big(h(x^t) \neq r^t\big)$$
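A hedged sketch of this selection step: pick the candidate hypothesis with the smallest validation error. The candidate models and the validation set below are toy inventions.

```python
# Hypothetical candidate models h1, ..., hk (here: threshold classifiers).
candidates = [lambda x, t=t: 1 if x > t else 0 for t in (0.3, 0.5, 0.7)]

# A made-up validation set V of (x, r) pairs, unseen during training.
V = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def validation_error(h, V):
    """E(h | V): fraction of validation instances that h misclassifies."""
    return sum(1 for x, r in V if h(x) != r) / len(V)

errors = [validation_error(h, V) for h in candidates]
best = errors.index(min(errors))
print(f"selected h{best + 1} with validation error {errors[best]:.2f}")
```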
Model Assessment
• To estimate the generalization error of the best model hi, we need data unseen during training and model selection
• Standard setup:
  1. Training set X (50–80%)
  2. Validation (development) set V (10–25%)
  3. Test (publication) set T (10–25%)
• Note:
  – Validation data can be added to the training set before testing
  – Resampling methods can be used if data is limited
Cross-Validation
• K-fold cross-validation: divide X into K parts X1, ..., XK; in fold i, part Xi serves as validation data and the remaining parts as training data:

$$\mathcal{V}_1 = X_1 \qquad \mathcal{T}_1 = X_2 \cup X_3 \cup \cdots \cup X_K$$
$$\mathcal{V}_2 = X_2 \qquad \mathcal{T}_2 = X_1 \cup X_3 \cup \cdots \cup X_K$$
$$\vdots$$
$$\mathcal{V}_K = X_K \qquad \mathcal{T}_K = X_1 \cup X_2 \cup \cdots \cup X_{K-1}$$

• Note:
  – Generalization error is estimated by the mean across the K folds
  – Training sets for different folds share K–2 parts
  – A separate test set must be maintained for model assessment
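A minimal sketch of producing such a K-fold partition of instance indices; the interleaved assignment to folds is one simple choice among several.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into K folds; each fold serves once as the
    validation part, the rest as the training part."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

for train, valid in kfold_indices(n=6, k=3):
    print(f"train on {train}, validate on {valid}")
```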
Bootstrapping
• Generate new training sets of size N from X by random sampling with replacement
• Use the original training set as the validation set (V = X)
• The probability that we do not pick a given instance after N draws is

$$\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$$

that is, only 36.8% of instances are new!
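A quick empirical check of the 0.368 figure by simulation; the sample size is arbitrary.

```python
import random

# Draw N instances with replacement and count how many distinct ones we got;
# the distinct fraction should be close to 1 - (1 - 1/N)^N ≈ 0.632.
N = 10_000
sample = [random.randrange(N) for _ in range(N)]
distinct = len(set(sample)) / N
print(f"distinct fraction ≈ {distinct:.3f}  (theory: {1 - (1 - 1/N)**N:.3f})")
```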
Measuring Error
• Error rate = # of errors / # of instances = (FP+FN) / N
• Accuracy = # of correct / # of instances = (TP+TN) / N
• Recall = # of found positives / # of positives = TP / (TP+FN)
• Precision = # of found positives / # of instances predicted positive = TP / (TP+FP)
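These four measures as a small sketch, computed from the cells of a 2×2 confusion matrix; the counts below are invented.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four measures from the slide from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    return {
        "error rate": (fp + fn) / n,
        "accuracy":   (tp + tn) / n,
        "recall":     tp / (tp + fn),
        "precision":  tp / (tp + fp),
    }

# Illustrative counts: 40 TP, 10 FP, 5 FN, 45 TN.
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```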
Statistical Inference
• Interval estimation to quantify the precision of our measurements
• Hypothesis testing to assess whether differences between models are statistically significant
• Interval estimation (95% confidence interval for a mean):

$$m \pm 1.96 \frac{\sigma}{\sqrt{N}}$$

• McNemar's test, where $e_{01}$ and $e_{10}$ count instances misclassified by one model but not the other:

$$\frac{\big(\,|e_{01} - e_{10}| - 1\,\big)^2}{e_{01} + e_{10}} \sim \chi^2_1$$
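A sketch of both tools: a normal-approximation 95% interval for a mean, and the McNemar statistic matching the formula above. The sample values are invented.

```python
import math

def mean_confidence_interval(values):
    """95% interval m ± 1.96 * s / sqrt(N) for the mean
    (normal approximation; a sketch, not a full statistical treatment)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    half = 1.96 * s / math.sqrt(n)
    return m - half, m + half

def mcnemar_statistic(e01, e10):
    """McNemar's statistic (|e01 - e10| - 1)^2 / (e01 + e10), compared
    against the chi-square distribution with 1 degree of freedom."""
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

print(mean_confidence_interval([0.81, 0.79, 0.84, 0.80, 0.82]))  # toy accuracies
print(mcnemar_statistic(e01=15, e10=5))  # 4.05 > 3.84, significant at 0.05
```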
Supervised Learning – Summary
• Training data + learner → hypothesis
  – The learner incorporates an inductive bias
• Test data + hypothesis → estimated generalization
  – Test data must be unseen
Anatomy of a Supervised Learner (Dimensions of a supervised machine learning algorithm)
• Model: $g(x \mid \theta)$
• Loss function: $E(\theta \mid X) = \sum_{t} L\big(r^t, g(x^t \mid \theta)\big)$
• Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid X)$
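A minimal sketch instantiating all three components for a linear model with squared loss, using plain gradient descent as the optimization procedure; the data and hyperparameters are invented.

```python
def g(x, theta):                      # model: g(x | θ) = w*x + w0
    w, w0 = theta
    return w * x + w0

def E(theta, X, r):                   # loss function: sum of squared errors
    return sum((r_t - g(x_t, theta)) ** 2 for x_t, r_t in zip(X, r))

def optimize(X, r, lr=0.01, steps=2000):   # optimization: gradient descent
    w, w0 = 0.0, 0.0
    for _ in range(steps):
        dw  = sum(-2 * x * (y - (w * x + w0)) for x, y in zip(X, r))
        dw0 = sum(-2 *     (y - (w * x + w0)) for x, y in zip(X, r))
        w, w0 = w - lr * dw, w0 - lr * dw0
    return w, w0

X, r = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8]   # invented data, roughly y = 2x + 1
theta = optimize(X, r)
print(theta, E(theta, X, r))
```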
Supervised Classification: Extension
• Divide instances into (two or more) classes
  – Instance (feature vector): $x = (x_1, \ldots, x_m)$; features may be categorical or numerical
  – Class (label): $y$
  – Training data: $X = \{\,x^t, y^t\,\}_{t=1}^{N}$
• Classification in Language Technology:
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
[Slides without transcribed content: NLP: Classification (i)–(iii); Types of Classification (i)–(ii)]
Reading
• Alpaydin (2010): chs. 1–2; 19
• Daumé III (2012): ch. 4, sections 4.5–4.6 only
End of Lecture 2