
Page 1: Final review

Final review

LING 572, Fei Xia

03/07/06

Page 2: Final review

Misc

• Parts 3 and 4 were due at 6am today.

• Presentation: email me the slides by 6am on 3/9

• Final report: email me by 6am on 3/14.

• Group meetings: 1:30-4:00pm on 3/16.

Page 3: Final review

Outline

• Main topics

• Applying to NLP tasks

• Tricks

Page 4: Final review

Main topics

Page 5: Final review

Main topics

• Supervised learning
  – Decision tree
  – Decision list
  – TBL
  – MaxEnt
  – Boosting

• Semi-supervised learning
  – Self-training
  – Co-training
  – EM
  – Co-EM

Page 6: Final review

Main topics (cont)

• Unsupervised learning
  – The EM algorithm
  – The EM algorithm for PM models
    • Forward-backward
    • Inside-outside
    • IBM models for MT

• Others
  – Two dynamic models: FSA and HMM
  – Re-sampling: bootstrap
  – System combination
  – Bagging

Page 7: Final review

Main topics (cont)

• Homework
  – Hw1: FSA and HMM
  – Hw2: DT, DL, CNF, DNF, and TBL
  – Hw3: Boosting

• Project:
  – P1: Trigram (learn to use Carmel, relation between HMM and FSA)
  – P2: TBL
  – P3: MaxEnt
  – P4: Bagging, boosting, system combination, SSL

Page 8: Final review

Supervised learning

Page 9: Final review

A classification problem

District   House type      Income   Previous Customer   Outcome
--------   -------------   ------   -----------------   -------
Suburban   Detached        High     No                  Nothing
Suburban   Semi-detached   High     Yes                 Respond
Rural      Semi-detached   Low      No                  Respond
Urban      Detached        Low      Yes                 Nothing

Page 10: Final review

Classification and estimation problems

• Given
  – x: input attributes
  – y: the goal
  – training data: a set of (x, y) pairs

• Predict y given a new x:
  – y is a discrete variable → classification problem
  – y is a continuous variable → estimation problem

Page 11: Final review

Five ML methods

• Decision tree
• Decision list
• TBL
• Boosting
• MaxEnt

Page 12: Final review

Decision tree

• Modeling: tree representation

• Training: top-down induction, greedy algorithm

• Decoding: find the path from root to a leaf node, where the tests along the path are satisfied.
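A minimal decoding sketch (Python, not from the course materials; the node fields are made up for illustration):

    # Decision-tree decoding sketch: follow the branch whose test the
    # instance satisfies until a leaf is reached. Node fields are illustrative.
    def dt_decode(node, x):
        while not node["is_leaf"]:
            value = x[node["attribute"]]       # test one attribute per node
            node = node["children"][value]
        return node["label"]

    tree = {"is_leaf": False, "attribute": "Income",
            "children": {"High": {"is_leaf": True, "label": "Respond"},
                         "Low":  {"is_leaf": True, "label": "Nothing"}}}
    dt_decode(tree, {"Income": "High"})        # -> "Respond"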

Page 13: Final review

Decision tree (cont)

• Main algorithms: ID3, C4.5, CART

• Strengths:
  – Ability to generate understandable rules
  – Ability to clearly indicate the best attributes

• Weaknesses:
  – Data splitting
  – Trouble with non-rectangular regions
  – The instability of top-down induction → bagging

Page 14: Final review

Decision list

• Modeling: a list of decision rules

• Training: greedy, iterative algorithm

• Decoding: find the 1st rule that applies

• Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, and TBL
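A matching sketch for decision-list decoding (Python; the rule format and the example rules are hypothetical): return the label of the first rule whose condition holds.

    # Decision-list decoding sketch: try the rules in order and let the
    # first matching rule decide; fall back to a default label.
    def dl_decode(rules, default, x):
        for condition, label in rules:
            if condition(x):                   # each rule tests one piece of evidence
                return label
        return default

    rules = [(lambda x: x["Income"] == "High", "Respond"),
             (lambda x: x["District"] == "Rural", "Respond")]
    dl_decode(rules, "Nothing", {"Income": "Low", "District": "Urban"})   # -> "Nothing"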

Page 15: Final review

TBL

• Modeling: a list of transformations (similar to decision rules)

• Training:
  – Greedy, iterative algorithm
  – The concept of a current state

• Decoding: apply every transformation to the data
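A sketch of TBL decoding (Python; the transformation format and initial labeling are placeholders): every learned transformation is applied, in order, to the current labeling.

    # TBL decoding sketch: start from an initial labeling and apply every
    # transformation in the learned order; each one rewrites the current state.
    def tbl_decode(transformations, initial_labels, x):
        labels = list(initial_labels)
        for applies, new_label in transformations:
            for i in range(len(labels)):
                if applies(x, labels, i):      # trigger may look at current labels/context
                    labels[i] = new_label
        return labels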

Page 16: Final review

TBL (cont)

• Strengths:
  – Minimizes the error rate directly
  – Ability to handle non-classification problems
    • Dynamic problems: POS tagging
    • Non-classification problems: parsing

• Weaknesses:
  – Transformations are hard to interpret as they interact with one another
  – Probabilistic TBL: TBL-DT

Page 17: Final review

Boosting

[Diagram: the training sample is repeatedly reweighted; the base learner (ML) is run on each weighted sample to produce weak classifiers f1, f2, ..., fT, which are combined into the final classifier f.]

Page 18: Final review

Boosting (cont)

• Modeling: combining a set of weak classifiers to produce a powerful committee.

• Training: learn one classifier at each iteration

• Decoding: use the weighted majority vote of the weak classifiers
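A minimal decoding sketch for boosting (Python, binary labels in {-1, +1}; the weak classifiers and their weights are assumed to come from training, e.g., AdaBoost):

    # Weighted-majority-vote decoding: sum the weak classifiers' votes,
    # each scaled by its weight alpha_t, and take the sign.
    def boosted_decode(weak_classifiers, alphas, x):
        score = sum(alpha * h(x) for h, alpha in zip(weak_classifiers, alphas))
        return 1 if score >= 0 else -1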

Page 19: Final review

Boosting (cont)

• Strengths:
  – It comes with a set of theoretical guarantees (e.g., on training error and test error).
  – It only needs to find weak classifiers.

• Weaknesses:
  – It is susceptible to noise.
  – The actual performance depends on the data and the base learner.

Page 20: Final review

MaxEnt

The task: find p* s.t.

  p^* = \arg\max_{p \in P} H(p)

where

  P = \{\, p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, \ldots, k\} \,\}

If p* exists, it has the form

  p^*(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)
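A small sketch of evaluating a model of this exponential form (Python, written in the conditional form commonly used for classification; the feature functions, weights, and label set are invented for illustration):

    import math

    # p(y | x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x)
    def maxent_prob(lambdas, features, x, y, labels):
        def score(label):
            return math.exp(sum(l * f(x, label) for l, f in zip(lambdas, features)))
        z = sum(score(label) for label in labels)   # the normalizer Z(x)
        return score(y) / z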

Page 21: Final review

MaxEnt (cont)

• If p* exists, then

  p^* = \arg\max_{q \in Q} L(q)

where

  L(q) = \sum_x \tilde{p}(x) \log q(x)

  Q = \Big\{\, q \;\Big|\; q(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big) \Big\}

Page 22: Final review

MaxEnt (cont)

• Training: GIS, IIS

• Feature selection:
  – Greedy algorithm
  – Select one (or more) at a time

• In general, MaxEnt achieves good performance on many NLP tasks.

Page 23: Final review

Common issues

• Objective function / quality measure:
  – DT, DL: e.g., information gain
  – TBL, Boosting: minimize training errors
  – MaxEnt: maximize entropy while satisfying constraints

Page 24: Final review

Common issues (cont)

• Avoiding overfitting
  – Use development data
  – Two strategies:
    • stop early
    • post-pruning

Page 25: Final review

Common issues (cont)

• Missing attribute values:
  – Assume a “blank” value
  – Assign the most common value among all “similar” examples in the training data
  – (DL, DT): Assign a fraction of the example to each possible class

• Continuous-valued attributes
  – Choosing thresholds by checking the training data

Page 26: Final review

Common issues (cont)

• Attributes with different costs
  – DT: Change the quality measure to include the costs

• Continuous-valued goal attribute
  – DT, DL: each “leaf” node is marked with a real value or a linear function
  – TBL, MaxEnt, Boosting: ??

Page 27: Final review

Comparison of supervised learners

                  DT           DL              TBL                  Boosting               MaxEnt
Probabilistic     PDT          PDL             TBL-DT               Confidence             Y
Parametric        N            N               N                    N                      Y
Representation    Tree         Ordered list    Ordered list of      List of weighted       List of weighted
                               of rules        transformations      classifiers            features
Each iteration    Attribute    Rule            Transformation       Classifier & weight    Feature & weight
Data processing   Split data   Split data*     Change cur_y         Reweight (x,y)         None
Decoding          Path         1st rule        Sequence of rules    Calc f(x)              Calc f(x)

Page 28: Final review

Semi-supervised Learning

Page 29: Final review

Semi-supervised learning

• Each learning method makes some assumptions about the problem.

• SSL works when those assumptions are satisfied.

• SSL could degrade the performance when mistakes reinforce themselves.

Page 30: Final review

SSL (cont)

• We have covered four methods: self-training, co-training, EM, co-EM

Page 31: Final review

Co-training

• The original paper: (Blum and Mitchell, 1998)
  – Two “independent” views: split the features into two sets.
  – Train a classifier on each view.
  – Each classifier labels data that can be used to train the other classifier.

• Extensions:
  – Relax the conditional independence assumptions.
  – Instead of using two views, use two or more classifiers trained on the whole feature set.
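A schematic co-training loop (Python; the train/predict interface, confidence scores, and growth size k are placeholders, following the two-view setting described above):

    # Schematic co-training: each view's classifier labels the unlabeled
    # examples it is most confident about, and those are added to the
    # labeled set used to retrain the other classifier.
    def co_train(labeled, unlabeled, view1, view2, train, predict_conf, rounds=10, k=5):
        for _ in range(rounds):
            c1 = train([(view1(x), y) for x, y in labeled])
            c2 = train([(view2(x), y) for x, y in labeled])
            for clf, view in ((c1, view1), (c2, view2)):
                scored = [(predict_conf(clf, view(x)), x) for x in unlabeled]
                scored.sort(key=lambda s: s[0][1], reverse=True)   # most confident first
                for (label, _conf), x in scored[:k]:
                    labeled.append((x, label))
                    unlabeled.remove(x)
        return c1, c2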

Page 32: Final review

Unsupervised learning

Page 33: Final review

Unsupervised learning

• EM is a method of estimating parameters in the MLE framework.

• It finds a sequence of parameter estimates, each of which improves the likelihood of the training data.

Page 34: Final review

The EM algorithm

• Start with an initial estimate, θ^0

• Repeat until convergence:

  – E-step: calculate

      Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \, \log P(x_i, y \mid \theta)

  – M-step: find

      \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)
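A tiny self-contained EM example (Python; the two-coin mixture and its initialization are a generic illustration, not one of the course models):

    from math import comb

    # EM for a mixture of two biased coins. Each observation is the number
    # of heads out of `flips` tosses; theta = (pi, p1, p2).
    def em_two_coins(data, flips, pi=0.5, p1=0.6, p2=0.4, iters=50):
        for _ in range(iters):
            # E-step: responsibility of coin 1 for each observation
            resp = []
            for h in data:
                l1 = pi * comb(flips, h) * p1**h * (1 - p1)**(flips - h)
                l2 = (1 - pi) * comb(flips, h) * p2**h * (1 - p2)**(flips - h)
                resp.append(l1 / (l1 + l2))
            # M-step: re-estimate the parameters from expected counts
            n1 = sum(resp)
            pi = n1 / len(data)
            p1 = sum(r * h for r, h in zip(resp, data)) / (n1 * flips)
            p2 = sum((1 - r) * h for r, h in zip(resp, data)) / ((len(data) - n1) * flips)
        return pi, p1, p2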

Page 35: Final review

The EM algorithm (cont)

• The optimal solution for the M-step exists for many classes of problems.

• A number of well-known methods are special cases of EM.

• The EM algorithm for PM models:
  – Forward-backward algorithm
  – Inside-outside algorithm
  – …

Page 36: Final review

Other topics

Page 37: Final review

FSA and HMM

• Two types of HMMs:
  – State-emission and arc-emission HMMs
  – They are equivalent

• We can convert an HMM into a WFA

• Modeling: Markov assumption

• Training:
  – Supervised: counting
  – Unsupervised: forward-backward algorithm

• Decoding: Viterbi algorithm
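A standard Viterbi decoding sketch (Python, log space; the dict-based parameter layout is only for illustration):

    # Viterbi decoding: delta[s] holds the best log probability of any
    # path ending in state s; back-pointers recover the best sequence.
    def viterbi(obs, states, log_init, log_trans, log_emit):
        delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
        back = []
        for o in obs[1:]:
            prev, delta, ptr = delta, {}, {}
            for s in states:
                best = max(states, key=lambda r: prev[r] + log_trans[r][s])
                delta[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
                ptr[s] = best
            back.append(ptr)
        last = max(states, key=lambda s: delta[s])
        path = [last]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))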

Page 38: Final review

Bootstrap

[Diagram: B bootstrap samples are drawn from the original sample; the learner (ML) is run on each to produce f1, f2, ..., fB, which are summarized into f.]

Page 39: Final review

Bootstrap (cont)

• A method of re-sampling:
  – One original sample → B bootstrap samples

• It has a strong mathematical background.

• It is a method for estimating standard errors, bias, and so on.
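A minimal bootstrap sketch (Python; the statistic, sample values, and B are arbitrary choices): draw B samples with replacement and estimate the standard error of a statistic.

    import random

    # Bootstrap re-sampling: each bootstrap sample has the same size as the
    # original and is drawn with replacement.
    def bootstrap_se(data, statistic, B=1000, seed=0):
        rng = random.Random(seed)
        estimates = []
        for _ in range(B):
            sample = [rng.choice(data) for _ in data]
            estimates.append(statistic(sample))
        mean = sum(estimates) / B
        return (sum((e - mean) ** 2 for e in estimates) / (B - 1)) ** 0.5

    bootstrap_se([2.1, 3.4, 1.9, 4.2, 2.8], lambda s: sum(s) / len(s))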

Page 40: Final review

System combination

[Diagram: different learners ML1, ML2, ..., MLB produce classifiers f1, f2, ..., fB, which are combined into f.]

Page 41: Final review

System combination (cont)

• Hybridization: combine substructures to produce a new one.
  – Voting
  – Naïve Bayes

• Switching: choose one of the f_i(x)
  – Similarity switching
  – Naïve Bayes

  f(x) = g(f_1(x), \ldots, f_m(x))
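A minimal sketch of the combination function g (Python), using majority voting as the hybridization rule; the member classifiers f_i are assumed given:

    from collections import Counter

    # f(x) = g(f_1(x), ..., f_m(x)) with g = majority vote.
    def combine_by_voting(classifiers, x):
        votes = Counter(f(x) for f in classifiers)
        return votes.most_common(1)[0][0]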

Page 42: Final review

Bagging

[Diagram: B bootstrap samples are each fed to the base learner (ML), producing f1, f2, ..., fB, which are combined into f.]

Bagging = bootstrap + system combination

Page 43: Final review

Bagging (cont)

• It is effective for unstable learning methods:
  – Decision tree
  – Regression tree
  – Neural network

• It does not help stable learning methods:
  – K-nearest neighbors
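Putting the bootstrap and voting sketches together, a bagging sketch (Python; train and the data format are placeholders):

    import random
    from collections import Counter

    # Bagging: train the base learner on B bootstrap samples and predict
    # by majority vote over the resulting models.
    def bagging(train, data, B=25, seed=0):
        rng = random.Random(seed)
        models = [train([rng.choice(data) for _ in data]) for _ in range(B)]
        def predict(x):
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return predict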

Page 44: Final review

Relations

Page 45: Final review

Relations

• WFSA and HMM
• DL, DT, TBL
• EM, EM for PM

Page 46: Final review

WFSA and HMM

[Diagram: an HMM with an added “Start” state and an added “Finish” state.]

• Add a “Start” state and a transition from “Start” to any state in the HMM.
• Add a “Finish” state and a transition from any state in the HMM to “Finish”.

Page 47: Final review

DT, DL, CNF, DNF, TBL

[Diagram: relations among the classes — k-CNF, k-DNF, and k-DT are contained in k-DL, which is contained in k-TBL.]

Page 48: Final review

The EM algorithm

[Diagram: the generalized EM contains the EM algorithm; instances of EM include Gaussian mixtures and the EM algorithm for PM models (forward-backward, inside-outside, IBM models).]

Page 49: Final review

Solving an NLP problem

Page 50: Final review

Issues

• Modeling: represent the problem as a formula and decompose the formula into a function of parameters

• Training: estimate model parameters

• Decoding: find the best answer given the parameters

• Other issues:
  – Preprocessing
  – Postprocessing
  – Evaluation
  – …

Page 51: Final review

Modeling

• Generative vs. discriminative models

• Introducing hidden variables

• The order of decomposition

  P(F, E), \ P(F \mid E) \ \text{vs.} \ P(E \mid F)

  P(F \mid E) = \sum_a P(F, a \mid E)

  P(F, a \mid E) = P(a \mid E) \, P(F \mid a, E)

  P(F, a \mid E) = P(F \mid E) \, P(a \mid F, E)

Page 52: Final review

Modeling (cont)

• Approximation / assumptions

• Final formulae and types of parameters

  P(a \mid E) = \prod_i P(a_i \mid a_1^{i-1}, E) \approx \prod_i P(a_i \mid a_{i-1})

  P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)

Page 53: Final review

Modeling (cont)

• Using classifiers for non-classification problems
  – POS tagging
  – Chunking
  – Parsing

Page 54: Final review

Training

• Objective functions:
  – Maximize likelihood: EM
  – Minimize error rate: TBL
  – Maximum entropy: MaxEnt
  – …

• Supervised, semi-supervised, unsupervised:
  – Ex: maximize likelihood
    • Supervised: simple counting
    • Unsupervised: EM

Page 55: Final review

Training (cont)

• At each iteration:
  – Choose one attribute / rule / weight / … at a time, and never change it later: DT, DL, TBL
  – Update all the parameters at each iteration: EM

• Choose “untrained” parameters (e.g., thresholds) using development data:
  – Minimal “gain” required for continuing the iteration

Page 56: Final review

Decoding

• Dynamic programming:
  – CYK for PCFG
  – Viterbi for HMM

• Dynamic problems:
  – Decode from left to right
  – Features only look at the left context
  – Keep the top-N hypotheses at each position (see the beam-search sketch below)
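A minimal beam-search sketch (Python; score_next is a placeholder that proposes labels and incremental log scores using only the left context):

    # Left-to-right beam search: keep only the top-N partial hypotheses
    # (label sequence, log score) at each position.
    def beam_decode(tokens, score_next, beam_size=5):
        beam = [([], 0.0)]
        for i in range(len(tokens)):
            candidates = []
            for labels, score in beam:
                for label, delta in score_next(tokens, i, labels):
                    candidates.append((labels + [label], score + delta))
            candidates.sort(key=lambda h: h[1], reverse=True)
            beam = candidates[:beam_size]        # prune to the top-N
        return beam[0][0]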

Page 57: Final review

Preprocessing

• Sentence segmentation
• Sentence alignment (for MT)
• Tokenization
• Morphing
• POS tagging
• …

Page 58: Final review

Post-processing

• System combination
• Casing (MT)
• …

Page 59: Final review

Evaluation

• Use standard training/test data if possible.

• Choose appropriate evaluation measures:
  – WSD: for what applications?
  – Word alignment: F-measure vs. AER. How does it affect the MT result?
  – Parsing: F-measure vs. dependency link accuracy

Page 60: Final review

Tricks

Page 61: Final review

Tricks

• Algebra
• Probability
• Optimization
• Programming

Page 62: Final review

Algebra

The order of sums:

  \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} f(x_1, \ldots, x_n) = \sum_{x_2} \cdots \sum_{x_n} \sum_{x_1} f(x_1, \ldots, x_n)

Pulling out constants:

  \sum_{x_1} \cdots \sum_{x_n} c \, f(x_1, \ldots, x_n) = c \sum_{x_1} \cdots \sum_{x_n} f(x_1, \ldots, x_n)

Page 63: Final review

Algebra (cont)

The order of log and product / sum:

  \log \prod_i f_i = \sum_i \log f_i

The order of sums and products:

  \prod_{i=1}^{n} \sum_{x_i} f_i(x_i) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} \prod_{i=1}^{n} f_i(x_i)

Page 64: Final review

Probability

Introducing a new random variable:

  p(x) = \sum_y p(x, y) = \sum_y p(y) \, p(x \mid y)

The order of decomposition:

  P(x, y, z) = P(x) \, P(y \mid x) \, P(z \mid x, y)

  P(x, y, z) = P(y) \, P(z \mid y) \, P(x \mid y, z)

Page 65: Final review

More general cases

  P(A_1, \ldots, A_n) = P(A_1) \, P(A_2 \mid A_1) \cdots P(A_n \mid A_1, \ldots, A_{n-1}) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})

Page 66: Final review

Probability (cont)

Bayes Rule:

  p(y \mid x) = \frac{p(y) \, p(x \mid y)}{p(x)}

Source-channel model:

  \arg\max_y p(y \mid x) = \arg\max_y p(y) \, p(x \mid y)

Page 67: Final review

Probability (cont)

Normalization:

  p(x) = \frac{Ct(x)}{\sum_{x'} Ct(x')}

Jensen’s inequality:

  \log\big( E[p(x)] \big) \geq E\big[ \log p(x) \big]

Page 68: Final review

Optimization

• When there is no analytical solution, use an iterative approach.

• If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).

Page 69: Final review

Optimization (cont)

• Using Lagrange multipliers:
  – Constrained problem: maximize f(x) subject to the constraint g(x) = 0
  – Unconstrained problem: maximize f(x) – λ g(x)

• Take first derivatives to find the stationary points. (A worked example follows.)
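A small worked example (not from the slides): maximizing entropy subject only to the normalization constraint, which yields the uniform distribution.

    % Illustrative example: maximize H(p) = -\sum_i p_i \log p_i
    % subject to g(p) = \sum_i p_i - 1 = 0.
    \begin{align*}
    \Lambda(p, \lambda) &= -\sum_{i=1}^{n} p_i \log p_i - \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big) \\
    \frac{\partial \Lambda}{\partial p_i} &= -\log p_i - 1 - \lambda = 0
      \quad\Longrightarrow\quad p_i = e^{-1-\lambda} \\
    \sum_i p_i = 1 &\quad\Longrightarrow\quad p_i = \tfrac{1}{n}
      \quad \text{(the uniform distribution maximizes entropy)}
    \end{align*}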

Page 70: Final review

Programming

• Using/creating a good package:
  – Tutorial, sample data, well-written code
  – Multiple levels of code:
    • Core ML algorithm: e.g., TBL
    • Wrapper for a task: e.g., POS tagger
    • Wrapper to deal with input, output, etc.

Page 71: Final review

Programming (cont)

• Good practice:
  – Write notes and create wrappers (all the commands should be stored in the notes, or even better in wrapper code)
  – Use standard directory structures:
    • src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
  – Give meaningful filenames to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec
  – Give meaningful function and variable names
  – Don’t use global variables

Page 72: Final review

Final words

• We have covered a lot of topics: 5+4+3+4

• It takes time to digest, but at least we understand the basic concepts.

• The next step: applying them to real applications.