Machine Learning for Language Technology
Lecture 2: Basic Concepts
Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and material
Outline
• Definition of Machine Learning
• Types of Machine Learning:
  – Classification
  – Regression
  – Supervised Learning
  – Unsupervised Learning
  – Reinforcement Learning
• Supervised Learning:
  – Supervised Classification
  – Training set
  – Hypothesis class
  – Empirical error
  – Margin
  – Noise
  – Inductive bias
  – Generalization
  – Model assessment
  – Cross-Validation
  – Classification in NLP
  – Types of Classification
What is Machine Learning
• Machine learning is programming computers to optimize a performance criterion for some task using example data or past experience
• Why learning?
  – No known exact method – vision, speech recognition, robotics, spam filters, etc.
  – Exact method too expensive – statistical physics
  – Task evolves over time – network routing
• Compare:
  – No need to use machine learning for computing payroll… we just need an algorithm
Machine Learning – Data Mining – Artificial Intelligence – Statistics
• Machine Learning: creation of a model that uses training data or past experience
• Data Mining: application of learning methods to large datasets (ex. physics, astronomy, biology, etc.)
  – Text mining = machine learning applied to unstructured textual data (ex. sentiment analysis, social media monitoring, etc.; see Text Mining, Wikipedia)
• Artificial intelligence: a model that can adapt to a changing environment.
• Statistics: Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
The bio-cognitive analogy
• Imagine a learning algorithm as a single neuron.
• This neuron receives input from other neurons, one for each input feature.
• The strengths of these inputs are the feature values.
• Each input has a weight and the neuron simply sums up all the weighted inputs.
• Based on this sum, the neuron decides whether to “fire” or not. Firing is interpreted as being a positive example and not firing is interpreted as being a negative example.
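To make the analogy concrete, here is a minimal sketch of such a neuron in Python; the function name, weights, and feature values are invented for illustration, not taken from the lecture.

```python
def neuron_fires(features, weights, threshold=0.0):
    """Sum the weighted inputs and 'fire' if the sum exceeds the threshold.

    Firing (True) is read as predicting the positive class,
    not firing (False) as predicting the negative class.
    """
    activation = sum(w * x for w, x in zip(weights, features))
    return activation > threshold

# Illustrative example: two input features with hand-picked weights.
print(neuron_fires(features=[0.9, 0.2], weights=[1.0, -0.5]))  # True  (fires)
print(neuron_fires(features=[0.1, 0.8], weights=[1.0, -0.5]))  # False
```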
Elements of Machine Learning
1. Generalization:
   – Generalize from specific examples
   – Based on statistical inference
2. Data:
   – Training data: specific examples to learn from
   – Test data: (new) specific examples to assess performance
3. Models:
   – Theoretical assumptions about the task/domain
   – Parameters that can be inferred from data
4. Algorithms:
   – Learning algorithm: infer model (parameters) from data
   – Inference algorithm: infer predictions from model
Types of Machine Learning
• Association
• Supervised Learning
– Classification
– Regression
• Unsupervised Learning
• Reinforcement Learning
Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services
Example: P ( chips | beer ) = 0.7
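A hedged sketch of how such a conditional probability could be estimated from transaction counts; the toy baskets and the function name are invented.

```python
# Toy transactions; each is the set of products in one basket.
transactions = [
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer"},
    {"beer", "chips"}, {"beer", "diapers"}, {"chips"},
]

def conditional_probability(x, y, baskets):
    """Estimate P(Y | X): fraction of baskets containing x that also contain y."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_probability("beer", "chips", transactions))  # 0.6 on this toy data
```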
Classification
• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
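A minimal sketch of this discriminant as code; the threshold values for θ1 and θ2 below are invented placeholders, since in practice they would be learned from training data.

```python
def credit_risk(income, savings, theta1=30_000, theta2=10_000):
    """Discriminant from the slide: low-risk iff income > θ1 AND savings > θ2.

    The thresholds theta1 and theta2 are made-up illustrative values.
    """
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(credit_risk(income=45_000, savings=15_000))  # low-risk
print(credit_risk(income=45_000, savings=5_000))   # high-risk
```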
Classification in NLP
• Binary classification:
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. non-error)
• Multiclass classification:
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
• Structured prediction:
  – Part-of-speech tagging (classes = tag sequences)
  – Syntactic parsing (classes = parse trees)
Regression
• Example: price of a used car
• x: car attributes
• y: price
• Model: $y = g(x \mid \theta)$, where $g(\cdot)$ is the model and $\theta$ are its parameters
• Linear model: $y = wx + w_0$
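As an illustrative sketch, the parameters $w$ and $w_0$ of the linear model can be fitted by ordinary least squares; the car data below is invented.

```python
# Toy data: car age (years) vs. price (arbitrary units); values are invented.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [9.0, 8.1, 6.9, 6.2, 4.8]

# Closed-form least-squares estimates for w and w0.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w * mean_x

print(f"g(x) = {w:.2f} * x + {w0:.2f}")               # fitted line
print(f"predicted price at x = 2.5: {w * 2.5 + w0:.2f}")
```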
Uses of Supervised Learning
• Prediction of future cases:
– Use the rule to predict the output for future inputs
• Knowledge extraction:
– The rule is easy to understand
• Compression:
– The rule is simpler than the data it explains
• Outlier detection:
– Exceptions that are not covered by the rule, e.g., fraud
Unsupervised Learning
• Finding regularities in data
• No mapping to outputs
• Clustering:
– Grouping similar instances
• Example applications:
  – Customer segmentation in CRM
– Image compression: Color quantization
– NLP: Unsupervised text categorization
Reinforcement Learning
• Learning a policy = sequence of outputs/actions
• No supervised output but delayed reward
• Example applications:
– Game playing
– Robot in a maze
– NLP: Dialogue systems
Supervised Classification
• Learning the class C of a “family car” from examples:
  – Prediction: Is car x a family car?
  – Knowledge extraction: What do people expect from a family car?
• Output (labels): Positive (+) and negative (–) examples
• Input representation (features):
x1: price, x2: engine power
Training set X
$$X = \{\,x^t, r^t\,\}_{t=1}^{N} \qquad
r^t = \begin{cases} 1 & \text{if } x^t \text{ is positive} \\ 0 & \text{if } x^t \text{ is negative} \end{cases} \qquad
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Hypothesis class H
$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
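A minimal sketch of one hypothesis h from this class of axis-aligned rectangles; the interval bounds below are invented placeholders.

```python
def rectangle_hypothesis(price, engine_power,
                         p1=10_000, p2=20_000, e1=60, e2=120):
    """Axis-aligned rectangle hypothesis: positive iff both features fall
    inside the two intervals. The bounds are illustrative, not learned."""
    return p1 <= price <= p2 and e1 <= engine_power <= e2

print(rectangle_hypothesis(15_000, 90))   # True  -> predicted family car
print(rectangle_hypothesis(40_000, 200))  # False -> predicted not family car
```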
Empirical (training) error
Empirical error of h on X:

$$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$$

$$E(h \mid X) = \sum_{t=1}^{N} \mathbb{1}\big(h(x^t) \neq r^t\big)$$
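A small sketch of computing this empirical error, reusing the rectangle-style hypothesis from above; the instances and labels are invented.

```python
def empirical_error(h, X, r):
    """E(h | X): number of training instances on which h disagrees with
    the label (divide by len(X) for the error rate)."""
    return sum(1 for x_t, r_t in zip(X, r) if h(x_t) != r_t)

# Toy check: x = (price, engine power).
h = lambda x: 1 if 10_000 <= x[0] <= 20_000 and 60 <= x[1] <= 120 else 0
X = [(15_000, 90), (25_000, 150), (12_000, 70), (18_000, 200)]
r = [1, 0, 1, 1]          # invented labels; the last instance is misclassified
print(empirical_error(h, X, r))   # 1
```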
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
Any h ∈ H between S and G is consistent [E(h | X) = 0]; these hypotheses make up the version space.
Margin
• Margin: the distance between the class boundary and the instances closest to it
• Choose the h with the largest margin
Noise
• Unwanted anomaly in data:
  – Imprecision in input attributes
  – Errors in labeling data points
  – Hidden attributes (relative to H)
• Consequence:
  – No h in H may be consistent!
Noise and Model Complexity
Arguments for simpler model (Occam’s razor principle):
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if data is noisy)
Inductive Bias
• Learning is an ill-posed problem:
  – Training data is never sufficient to find a unique solution
  – There are always infinitely many consistent hypotheses
• We need an inductive bias:
  – Assumptions that entail a unique h for a training set X
  1. Hypothesis class H – axis-aligned rectangles
  2. Learning algorithm – find a consistent hypothesis with maximum margin
  3. Hyperparameters – trade-off between training error and margin
Model Selection and Generalization
• Generalization – how well a model performs on new data
  – Overfitting: H more complex than C
  – Underfitting: H less complex than C
Triple Trade-Off
• Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size N
3. Generalization error E on new data
• Dependencies:
  – As N↑, E↓
  – As c(H)↑, first E↓ and then E↑
Model Selection: Generalization Error
• To estimate generalization error, we need data unseen during training: a validation set

$$V = \{\,x^t, r^t\,\}_{t=1}^{M} \qquad V \cap X = \emptyset$$

• Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V) to select the model hi with the smallest generalization error:

$$\hat{E} = E(h \mid V) = \sum_{t=1}^{M} \mathbb{1}\big(h(x^t) \neq r^t\big)$$
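A hedged sketch of this selection step: pick the candidate hypothesis with the smallest validation error. The candidate models and the validation set below are toy inventions.

```python
# Hypothetical candidate models h1, ..., hk (here: threshold classifiers).
candidates = [lambda x, t=t: 1 if x > t else 0 for t in (0.3, 0.5, 0.7)]

# A made-up validation set V of (x, r) pairs, unseen during training.
V = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def validation_error(h, V):
    """E(h | V): fraction of validation instances that h misclassifies."""
    return sum(1 for x, r in V if h(x) != r) / len(V)

errors = [validation_error(h, V) for h in candidates]
best = errors.index(min(errors))
print(f"selected h{best + 1} with validation error {errors[best]:.2f}")
```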
Model Assessment
• To estimate the generalization error of the best model hi, we need data unseen during training and model selection
• Standard setup:
  1. Training set X (50–80%)
  2. Validation (development) set V (10–25%)
  3. Test (publication) set T (10–25%)
• Note:
  – Validation data can be added to the training set before testing
  – Resampling methods can be used if data is limited
Cross-Validation
• K-fold cross-validation: divide X into K parts X1, ..., XK; in fold i, part Xi serves as validation data and the remaining parts as training data:

$$\mathcal{V}_1 = X_1 \qquad \mathcal{T}_1 = X_2 \cup X_3 \cup \cdots \cup X_K$$
$$\mathcal{V}_2 = X_2 \qquad \mathcal{T}_2 = X_1 \cup X_3 \cup \cdots \cup X_K$$
$$\vdots$$
$$\mathcal{V}_K = X_K \qquad \mathcal{T}_K = X_1 \cup X_2 \cup \cdots \cup X_{K-1}$$

• Note:
  – Generalization error is estimated by the mean across the K folds
  – Training sets for different folds share K–2 parts
  – A separate test set must be maintained for model assessment
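A minimal sketch of producing such a K-fold partition of instance indices; the interleaved assignment to folds is one simple choice among several.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into K folds; each fold serves once as the
    validation part, the rest as the training part."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

for train, valid in kfold_indices(n=6, k=3):
    print(f"train on {train}, validate on {valid}")
```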
Bootstrapping
• Generate new training sets of size N from X by random sampling with replacement
• Use the original training set as the validation set (V = X)
• The probability that we do not pick a given instance after N draws is

$$\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$$

that is, only 36.8% of instances are new!
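A quick empirical check of the 0.368 figure by simulation; the sample size is arbitrary.

```python
import random

# Draw N instances with replacement and count how many distinct ones we got;
# the distinct fraction should be close to 1 - (1 - 1/N)^N ≈ 0.632.
N = 10_000
sample = [random.randrange(N) for _ in range(N)]
distinct = len(set(sample)) / N
print(f"distinct fraction ≈ {distinct:.3f}  (theory: {1 - (1 - 1/N)**N:.3f})")
```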
Measuring Error
• Error rate = # of errors / # of instances = (FP+FN) / N
• Accuracy = # of correct / # of instances = (TP+TN) / N
• Recall = # of found positives / # of positives = TP / (TP+FN)
• Precision = # of found positives / # of instances predicted positive = TP / (TP+FP)
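These four measures as a small sketch, computed from the cells of a 2×2 confusion matrix; the counts below are invented.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four measures from the slide from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    return {
        "error rate": (fp + fn) / n,
        "accuracy":   (tp + tn) / n,
        "recall":     tp / (tp + fn),
        "precision":  tp / (tp + fp),
    }

# Illustrative counts: 40 TP, 10 FP, 5 FN, 45 TN.
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```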
Statistical Inference
• Interval estimation to quantify the precision of our measurements
• Hypothesis testing to assess whether differences between models are statistically significant
• Interval estimation (95% confidence interval for a mean):

$$m \pm 1.96 \frac{\sigma}{\sqrt{N}}$$

• McNemar's test, where $e_{01}$ and $e_{10}$ count instances misclassified by one model but not the other:

$$\frac{\big(\,|e_{01} - e_{10}| - 1\,\big)^2}{e_{01} + e_{10}} \sim \chi^2_1$$
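A sketch of both tools: a normal-approximation 95% interval for a mean, and the McNemar statistic matching the formula above. The sample values are invented.

```python
import math

def mean_confidence_interval(values):
    """95% interval m ± 1.96 * s / sqrt(N) for the mean
    (normal approximation; a sketch, not a full statistical treatment)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    half = 1.96 * s / math.sqrt(n)
    return m - half, m + half

def mcnemar_statistic(e01, e10):
    """McNemar's statistic (|e01 - e10| - 1)^2 / (e01 + e10), compared
    against the chi-square distribution with 1 degree of freedom."""
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

print(mean_confidence_interval([0.81, 0.79, 0.84, 0.80, 0.82]))  # toy accuracies
print(mcnemar_statistic(e01=15, e10=5))  # 4.05 > 3.84, significant at 0.05
```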
Supervised Learning – Summary
• Training data + learner → hypothesis
  – The learner incorporates an inductive bias
• Test data + hypothesis → estimated generalization
  – Test data must be unseen
Anatomy of a Supervised Learner (Dimensions of a supervised machine learning algorithm)
• Model: $g(x \mid \theta)$
• Loss function: $E(\theta \mid X) = \sum_{t} L\big(r^t, g(x^t \mid \theta)\big)$
• Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid X)$
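A minimal sketch instantiating all three components for a linear model with squared loss, using plain gradient descent as the optimization procedure; the data and hyperparameters are invented.

```python
def g(x, theta):                      # model: g(x | θ) = w*x + w0
    w, w0 = theta
    return w * x + w0

def E(theta, X, r):                   # loss function: sum of squared errors
    return sum((r_t - g(x_t, theta)) ** 2 for x_t, r_t in zip(X, r))

def optimize(X, r, lr=0.01, steps=2000):   # optimization: gradient descent
    w, w0 = 0.0, 0.0
    for _ in range(steps):
        dw  = sum(-2 * x * (y - (w * x + w0)) for x, y in zip(X, r))
        dw0 = sum(-2 *     (y - (w * x + w0)) for x, y in zip(X, r))
        w, w0 = w - lr * dw, w0 - lr * dw0
    return w, w0

X, r = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8]   # invented data, roughly y = 2x + 1
theta = optimize(X, r)
print(theta, E(theta, X, r))
```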
Supervised Classification: Extension
• Divide instances into (two or more) classes
  – Instance (feature vector): $x = (x_1, \ldots, x_m)$; features may be categorical or numerical
  – Class (label): $y$
  – Training data: $X = \{\,x^t, y^t\,\}_{t=1}^{N}$
• Classification in Language Technology:
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
[Slides without transcribed content: NLP: Classification (i)–(iii); Types of Classification (i)–(ii)]
Reading
• Alpaydin (2010): chs. 1–2; 19
• Daumé III (2012): ch. 4, sections 4.5–4.6 only
End of Lecture 2