Introduction
LING 572
Fei Xia
Week 1: 1/4/06
Outline
• Course overview
• Mathematical foundation (prereq):
  – Probability theory
  – Information theory
• Basic concepts in the classification task
Course overview
General info
• Course url: http://courses.washington.edu/ling572
– Syllabus (incl. slides, assignments, and papers): updated every week.
– Message board
– ESubmit
• Slides:
  – I will try to put the slides online before class.
  – “Additional slides” are not required and not covered in class.
Office hour
• Fei:
  – Email:
    • Email address: fxia@u
    • Subject line should include “ling572”
    • The 48-hour rule
  – Office hour:
    • Time: Fri 10-11:20am
    • Location: Padelford A-210G
Lab session
• Bill McNeil
  – Email: billmcn@u
  – Lab session: what time is good for you?
    • Explaining homework and solutions
    • Mallet-related questions
    • Reviewing class material
• I highly recommend that you attend the lab sessions, especially the first few.
Time for Lab Session
• Time:
  – Monday: 10:00am - 12:20pm, or
  – Tuesday: 10:30am - 11:30am, or
  – ??
• Location: ??
• Thursday 3-4pm, MGH 271?
Misc
• Ling572 Mailing list: ling572a_wi07@u
• EPost
• Mallet developer mailing list: mallet-dev@cs.umass.edu
Prerequisites
• Ling570:
  – Some basic algorithms: FSA, HMM, …
  – NLP tasks: tokenization, POS tagging, …
• Programming: Java (used by Mallet). If you don’t know Java well, talk to me.
• Basic concepts in probability and statistics
  – Ex: random variables, chain rule, Gaussian distribution, …
• Basic concepts in information theory
  – Ex: entropy, relative entropy, …
Expectations
• Reading:
  – Papers are online.
  – Reference book: Manning & Schutze (M&S).
  – Finish reading papers before class; I will ask you questions.
Grades
• Assignments (9 parts): 90%
  – Programming language: Java
• Class participation: 10%
• No quizzes, no final exams
• No “incomplete” unless you can prove your case.
Course objectives
• Covering basic statistical methods that produce state-of-the-art results
• Focusing on classification algorithms
• Touching on unsupervised and semi-supervised algorithms
• Some material is not easy. We will focus on applications, not theoretical proofs.
Course layout
• Supervised methods
  – Classification algorithms:
    • Individual classifiers:
      – Naïve Bayes
      – kNN and Rocchio
      – Decision tree
      – Decision list: ??
      – Maximum Entropy (MaxEnt)
    • Classifier ensembles:
      – Bagging
      – Boosting
      – System combination
Course layout (cont)
• Supervised algorithms (cont)
  – Sequence labeling algorithms:
    • Transformation-based learning (TBL)
    • FST, HMM, …
• Semi-supervised methods
  – Self-training
  – Co-training
Course layout (cont)
• Unsupervised methods
  – EM algorithm
    • Forward-backward algorithm
    • Inside-outside algorithm
    • …
Questions for each method
• Modeling:
  – What is the model?
  – How does the decomposition work?
  – What kind of assumptions are made?
  – How many types of model parameters are there?
  – How many “internal” (or non-model) parameters are there?
  – How is the multi-class problem handled?
  – How are non-binary features handled?
  – …
Questions for each method (cont)
• Training: how to estimate parameters?
• Decoding: how to find the “best” solution?
• Weaknesses and strengths?
  – Is the algorithm
    • robust? (e.g., in handling outliers)
    • scalable?
    • prone to overfitting?
    • efficient in training time? In test time?
  – How much data is needed?
    • Labeled data
    • Unlabeled data
Relation between 570/571 and 572
• 570/571 are organized by tasks; 572 is organized by learning methods.
• 572 focuses on statistical methods.
NLP tasks covered in Ling570
• Tokenization
• Morphological analysis
• POS tagging
• Shallow parsing
• WSD
• NE tagging
NLP tasks covered in Ling571
• Parsing
• Semantics
• Discourse
• Dialogue
• Natural language generation (NLG)
• …
A ML method for multiple NLP tasks
• Task (570/571):
  – Tokenization
  – POS tagging
  – Parsing
  – Reference resolution
  – …
• Method (572):
  – MaxEnt
Multiple methods for one NLP task
• Task (570/571): POS tagging
• Method (572):
  – Decision tree
  – MaxEnt
  – Boosting
  – Bagging
  – …
Projects: Task 1
• Text classification task: 20 groups
  – P1: First look at the Mallet package
  – P2: Your first tui class; Naïve Bayes
  – P3: Feature selection; Decision tree
  – P4: Bagging; Boosting
• Individual project
Projects: Task 2
• Sequence labeling task: IGT detection
  – P5: MaxEnt
  – P6: Beam search
  – P7: TBA
  – P8: Presentation: final class
  – P9: Final report
• Group project (?)
Both projects
• Use Mallet, a Java package
• Two types of work:
  – Reading code to understand ML methods
  – Writing code to solve problems
Feedback on assignments
• “Misc” section in each assignment:
  – How long did it take to finish the homework?
  – Which part was difficult?
  – …
Mallet overview
• It is a Java package that includes many
  – classifiers,
  – sequence labeling algorithms,
  – optimization algorithms,
  – useful data classes,
  – …
• You should
  – read the “Mallet Guides”,
  – attend the Mallet tutorial: next Tuesday 10:30-11:30am, LLC 109, and
  – start on Hw1.
• I will use Mallet class/method names when possible.
Questions for “course overview”?
Outline
• Course overview
• Mathematical foundation
  – Probability theory
  – Information theory
• Basic concepts in the classification task
Probability Theory
Basic concepts
• Sample space, event, event space
• Random variable and random vector
• Conditional probability, joint probability, marginal probability (prior)
Sample space, event, event space
• Sample space (Ω): a collection of basic outcomes.
  – Ex: toss a coin twice: {HH, HT, TH, TT}
• Event: an event is a subset of Ω.
  – Ex: {HT, TH}
• Event space (2Ω): the set of all possible events.
Random variable
• The outcome of an experiment need not be a number.
• We often want to represent outcomes as numbers.
• A random variable X is a function X: Ω → R.
  – Ex: toss a coin twice: X(HH)=0, X(HT)=1, …
Two types of random variables
• Discrete: X takes on only a countable number of possible values.
  – Ex: toss a coin 10 times; X is the number of tails that are noted.
• Continuous: X takes on an uncountable number of possible values.
  – Ex: X is the lifetime (in hours) of a light bulb.
Probability function
• The probability function of a discrete variable X is a function which gives the probability p(xi) that X equals xi: p(xi) = P(X = xi).
• It satisfies:

  0 ≤ p(xi) ≤ 1

  Σi p(xi) = 1
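For instance, in the two-coin-toss example above, let X be the number of heads; then:

  p(0) = P(TT) = 1/4,  p(1) = P({HT, TH}) = 1/2,  p(2) = P(HH) = 1/4,  and  1/4 + 1/2 + 1/4 = 1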
Random vector
• A random vector is a finite-dimensional vector of random variables: X = [X1, …, Xn].
• P(x) = P(x1, x2, …, xn) = P(X1=x1, …, Xn=xn)
• Ex: P(w1, …, wn, t1, …, tn)
Three types of probability
• Joint prob: P(x,y)= prob of x and y happening together
• Conditional prob: P(x|y) = prob of x given a specific value of y
• Marginal prob: P(x) = prob of x, summed over all possible values of y
Common tricks (I): Marginal prob → joint prob

  P(A) = ΣB P(A, B)

  P(A1) = ΣA2,…,An P(A1, A2, …, An)
Common tricks (II): Chain rule

  P(A, B) = P(A) * P(B|A) = P(B) * P(A|B)

  P(A1, …, An) = Πi P(Ai | A1, …, Ai-1)
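For example, applying the chain rule to a three-word sequence, as in language modeling:

  P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)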
Common tricks (III): Bayes rule

  P(B|A) = P(A, B) / P(A) = P(A|B) * P(B) / P(A)

  y* = arg maxy P(y|x)
     = arg maxy P(x|y) * P(y) / P(x)
     = arg maxy P(x|y) * P(y)
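A small made-up numeric example: with two classes where P(y1) = 0.7, P(y2) = 0.3, P(x|y1) = 0.1, and P(x|y2) = 0.4, we get P(x|y1)P(y1) = 0.07 and P(x|y2)P(y2) = 0.12, so y* = y2. Note that the denominator P(x) never has to be computed.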
Common tricks (IV): Independence assumption

  P(A1, …, An) = Πi P(Ai | A1, …, Ai-1) ≈ Πi P(Ai | Ai-1)
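For instance, a bigram language model makes exactly this (first-order Markov) assumption over a word sequence, with w0 taken to be a start symbol:

  P(w1, …, wn) ≈ Πi P(wi | wi-1)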
Prior and Posterior distribution
• Prior distribution: P(θ), a distribution over parameter values θ, set prior to observing any data.
• Posterior distribution: P(θ | data). It represents our belief that θ is true after observing the data.
• Likelihood of the model: P(data | θ)
• Relation among the three (Bayes rule):

  P(θ | data) = P(data | θ) * P(θ) / P(data)
Two ways of estimating θ

• Maximum likelihood (ML):

  θ* = arg maxθ P(data | θ)

• Maximum a posteriori (MAP):

  θ* = arg maxθ P(θ | data) = arg maxθ P(data | θ) * P(θ)
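A standard illustration (not from the slides): toss a coin n times with θ = P(head) and observe k heads. The likelihood is

  P(data | θ) = θ^k * (1-θ)^(n-k)

and maximizing it (e.g., by setting the derivative of the log-likelihood to zero) gives the ML estimate θ* = k/n.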
Information Theory
Information theory
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
  – Entropy
  – Joint entropy and conditional entropy
  – Cross entropy and relative entropy
  – Mutual information and perplexity
Entropy
• Entropy is a measure of the uncertainty associated with a distribution.
• It gives the lower bound on the number of bits it takes to transmit messages.
• An example:
  – Display the results of horse races.
  – Goal: minimize the number of bits to encode the results.
  H(X) = - Σx p(x) * log p(x)
An example
• Uniform distribution: pi = 1/8 for each of the eight outcomes:

  H(X) = - 8 * (1/8 * log2(1/8)) = 3 bits

• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64):

  H(X) = - (1/2 log2(1/2) + 1/4 log2(1/4) + 1/8 log2(1/8) + 1/16 log2(1/16) + 4 * 1/64 log2(1/64)) = 2 bits

  A matching code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)

• Uniform distribution has higher entropy.
• MaxEnt: make the distribution as “uniform” as possible.
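A quick numeric check of both values (a minimal sketch; the class and method names here are ours, not Mallet’s):

  // Computes H(X) = - sum_x p(x) * log2 p(x), in bits.
  public class EntropyDemo {
      static double entropy(double[] p) {
          double h = 0.0;
          for (double pi : p) {
              if (pi > 0) {                              // treat 0 * log 0 as 0
                  h -= pi * Math.log(pi) / Math.log(2);  // log2 via natural log
              }
          }
          return h;
      }

      public static void main(String[] args) {
          double[] uniform = {1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8.};
          double[] skewed  = {1/2., 1/4., 1/8., 1/16., 1/64., 1/64., 1/64., 1/64.};
          System.out.println(entropy(uniform));  // prints 3.0
          System.out.println(entropy(skewed));   // prints 2.0
      }
  }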
Joint and conditional entropy
• Joint entropy:

  H(X, Y) = - Σx Σy p(x, y) * log p(x, y)

• Conditional entropy:

  H(Y|X) = - Σx Σy p(x, y) * log p(y|x) = H(X, Y) - H(X)
Cross Entropy
• Entropy:

  H(X) = - Σx p(x) * log p(x)

• Cross entropy:

  Hc(X) = - Σx p(x) * log q(x)

• Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).

  H(X) ≤ Hc(X)
Relative Entropy
• Also called Kullback-Leibler (KL) divergence:

  KL(p || q) = Σx p(x) * log2 (p(x) / q(x)) = Hc(X) - H(X)

• Another “distance” measure between prob functions p and q.
• KL divergence is asymmetric (not a true distance):

  KL(p, q) ≠ KL(q, p)
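A minimal sketch that makes the asymmetry concrete (the method name is ours; it assumes q(x) > 0 wherever p(x) > 0):

  // KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
  static double kl(double[] p, double[] q) {
      double d = 0.0;
      for (int i = 0; i < p.length; i++) {
          if (p[i] > 0) {                     // terms with p(x) = 0 contribute 0
              d += p[i] * Math.log(p[i] / q[i]) / Math.log(2);
          }
      }
      return d;
  }

For example, with p = (1/2, 1/2) and q = (1/4, 3/4), kl(p, q) ≈ 0.208 bits while kl(q, p) ≈ 0.189 bits.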
Mutual information
• It measures how much is in common between X and Y:

  I(X; Y) = Σx Σy p(x, y) * log (p(x, y) / (p(x) * p(y)))
          = H(X) + H(Y) - H(X, Y)
          = I(Y; X)

• I(X; Y) = KL(p(x, y) || p(x) * p(y))
Perplexity
• Perplexity is 2^H.
• Perplexity is the weighted average number of choices a random variable has to make.
  – Ex: for the uniform 8-outcome distribution above, H = 3 bits, so the perplexity is 2^3 = 8 choices.
Questions for “Mathematical foundation”?
Outline
• Course overview
• Mathematical foundation– Probability theory– Information theory
• Basic concepts in the classification task
Types of ML problems
• Classification problem
• Estimation problem
• Clustering
• Discovery
• …
A learning method can be applied to one or more types of ML problems.
We will focus on the classification problem.
Definition of classification problem
• Task:
  – C = {c1, c2, …, cm} is a set of pre-defined classes (a.k.a. labels, categories).
  – D = {d1, d2, …} is a set of inputs to be classified.
  – A classifier is a function: D × C → {0, 1}.
• Multi-label vs. single-label:
  – Single-label: for each di, only one class is assigned to it.
• Multi-class vs. binary classification problem:
  – Binary: |C| = 2.
Conversion to single-label binary problem
• Multi-label → single-label:
  – We will focus on the single-label problem.
  – A classifier D × C → {0, 1} becomes D → C.
  – More general definition: D × C → [0, 1].
• Multi-class → binary problem:
  – Positive examples vs. negative examples
Examples of classification problems
• Text classification
• Document filtering
• Language/Author/Speaker id
• WSD
• PP attachment
• Automatic essay grading
• …
Problems that can be treated as a classification problem
• Tokenization / Word segmentation
• POS tagging
• NE detection
• NP chunking
• Parsing
• Reference resolution
• …
Labeled vs. unlabeled data
• Labeled data:
  – {(xi, yi)} is a set of labeled data.
  – xi ∈ D: data/input, often represented as a feature vector.
  – yi ∈ C: target/label
• Unlabeled data:
  – {xi} without yi
Instance, training and test data
• xi with or without yi is called an instance.
• Training data: a set of (labeled) instances.
• Test data: a set of unlabeled instances.
• The training data is stored in an InstanceList in Mallet, as is the test data.
Attribute-value table
• Each row corresponds to an instance.
• Each column corresponds to a feature.
• A feature type (a.k.a. a feature template): w-1
• A feature: w-1=book
• Binary features vs. non-binary features
Attribute-value table

        f1    f2   …   fK     Target
  d1    yes   1    no  -1000  c2
  d2
  d3
  …
  dn
Feature sequence vs. Feature vector
• Feature sequence: a (featName, featValue) list for features that are present.
• Feature Vector: a (featName, featValue) list for all the features.
• Representing data x as a feature vector.
Data/Input → a feature vector
• Example:
  – Task: text classification
  – Original x: a document
  – Feature vector: bag-of-words approach
• In Mallet, the process is handled by a sequence of pipes (see the sketch below):
  – Tokenization
  – Lowercasing
  – Merging the counts
  – …
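A minimal sketch of such a pipe sequence, assuming Mallet 2.x class names from cc.mallet.pipe (the class name BagOfWords and the tokenization regex are our own choices):

  import java.util.ArrayList;
  import java.util.regex.Pattern;
  import cc.mallet.pipe.*;
  import cc.mallet.types.InstanceList;

  public class BagOfWords {
      public static InstanceList newInstanceList() {
          ArrayList<Pipe> pipes = new ArrayList<Pipe>();
          pipes.add(new Target2Label());                  // map class names to Labels
          pipes.add(new CharSequence2TokenSequence(
                        Pattern.compile("\\w+")));        // tokenization
          pipes.add(new TokenSequenceLowercase());        // lowercase
          pipes.add(new TokenSequence2FeatureSequence()); // tokens -> feature indices
          pipes.add(new FeatureSequence2FeatureVector()); // merge counts into a vector
          return new InstanceList(new SerialPipes(pipes));
      }
  }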
Classifier and decision matrix
• A classifier is a function f: f(x) = {(ci, scorei)}. It fills out a decision matrix.
• {(ci, scorei)} is called a Classification in Mallet.
        d1    d2    d3   …
  c1    0.1   0.4   0    …
  c2    0.9   0.1   0    …
  c3
  …
Trainer (a.k.a. Learner)
• A trainer is a function that takes an InstanceList as input, and outputs a classifier.
• Training stage:
  – Classifier train(instanceList);
• Test stage:
  – Classification classify(instance);
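Putting the two stages together, a minimal sketch using one concrete Mallet trainer (NaiveBayesTrainer; the .mallet file names are made up):

  import java.io.File;
  import cc.mallet.classify.*;
  import cc.mallet.types.*;

  public class TrainAndTest {
      public static void main(String[] args) {
          // Load previously built InstanceLists (file names are hypothetical).
          InstanceList trainData = InstanceList.load(new File("train.mallet"));
          InstanceList testData  = InstanceList.load(new File("test.mallet"));

          // Training stage: the trainer takes an InstanceList, outputs a Classifier.
          NaiveBayesTrainer trainer = new NaiveBayesTrainer();
          Classifier classifier = trainer.train(trainData);

          // Test stage: each Classification holds the {(ci, scorei)} list.
          for (Instance inst : testData) {
              Classification c = classifier.classify(inst);
              System.out.println(c.getLabeling().getBestLabel());
          }
      }
  }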
Important concepts (summary)
• Instance, InstanceList
• Labeled data, unlabeled data
• Training data, test data
• Feature, feature template
• Feature vector
• Attribute-value table
• Trainer, classifier
• Training stage, test stage
Steps for solving an NLP task with classifiers
• Convert the task into a classification problem (optional)
• Split data into training/test/validation
• Convert the data into an attribute-value table
• Training
• Decoding
• Evaluation
Important subtasks (for you)
• Converting the data into an attribute-value table:
  – Define feature types
  – Feature selection
  – Convert an instance into a feature vector
• Understanding the training/decoding procedures of the various algorithms.
Notation
                 Classification in general   Text categorization
  Input/data     xi                          di
  Target/label   yi                          ci
  Features       fk                          tk (term)
  …              …                           …
Questions for “Concepts in a classification task”?
Summary
• Course overview
• Mathematical foundation (M&S Ch 2)
  – Probability theory
  – Information theory
• Basic concepts in the classification task
Downloading
• Hw1
• Mallet Guide
• Homework Guide
Coming up
• Next Tuesday:
  – Mallet tutorial on 1/8 (Tues): 10:30-11:30am at LLC 109.
  – Classification algorithm overview and Naïve Bayes: read the paper beforehand.
• Next Thursday:
  – kNN and Rocchio: read the other paper.
• Hw1 is due at 11pm on 1/13.
Additional slides
An example
• 570/571:
  – POS tagging: HMM
  – Parsing: PCFG
  – MT: Model 1-4 training
• 572:
  – HMM: forward-backward algorithm
  – PCFG: inside-outside algorithm
  – MT: EM algorithm
• All are special cases of the EM algorithm, one method of unsupervised learning.
Proof: Relative entropy is always non-negative
Since log z ≤ z - 1 for all z > 0:

  KL(p || q) = Σx p(x) * log (p(x) / q(x))
             = - Σx p(x) * log (q(x) / p(x))
             ≥ - Σx p(x) * (q(x) / p(x) - 1)
             = - (Σx q(x) - Σx p(x))
             = - (1 - 1) = 0
Entropy of a language
• The entropy of a language L:

  H(L) = - lim (n→∞) (1/n) Σx1n p(x1n) * log p(x1n)

• If we make certain assumptions that the language is “nice”, then the entropy can be calculated as:

  H(L) = - lim (n→∞) (1/n) log p(x1n)
Cross entropy of a language
• The cross entropy of a language L:

  H(L, q) = - lim (n→∞) (1/n) Σx1n p(x1n) * log q(x1n)

• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:

  H(L, q) = - lim (n→∞) (1/n) log q(x1n)