Introduction
LING 572
Fei Xia
Week 1: 1/4/06
Outline
• Course overview
• Mathematical foundation (prereq):
  – Probability theory
  – Information theory
• Basic concepts in the classification task
Course overview
General info
• Course url: http://courses.washington.edu/ling572
– Syllabus (incl. slides, assignments, and papers): updated every week.
– Message board
– ESubmit
• Slides:
  – I will try to put the slides online before class.
  – “Additional slides” are not required and not covered in class.
Office hour
• Fei:
  – Email:
    • Email address: fxia@u
    • Subject line should include “ling572”
    • The 48-hour rule
  – Office hour:
    • Time: Fri 10-11:20am
    • Location: Padelford A-210G
Lab session
• Bill McNeil
  – Email: billmcn@u
  – Lab session: what time is good for you?
    • Explaining homework and solutions
    • Mallet-related questions
    • Reviewing class material
• I highly recommend that you attend the lab sessions, especially the first few.
Time for Lab Session
• Time:
  – Monday: 10:00am - 12:20pm, or
  – Tuesday: 10:30am - 11:30am, or
  – ??
• Location: ??
• Thursday 3-4pm, MGH 271?
Misc
• Ling572 Mailing list: ling572a_wi07@u
• EPost
• Mallet developer mailing list: mallet-dev@cs.umass.edu
Prerequisites
• Ling570:
  – Some basic algorithms: FSA, HMM, …
  – NLP tasks: tokenization, POS tagging, …
• Programming: Java (used by Mallet). If you don’t know Java well, talk to me.
• Basic concepts in probability and statistics
  – Ex: random variables, chain rule, Gaussian distribution, …
• Basic concepts in information theory
  – Ex: entropy, relative entropy, …
Expectations
• Reading:
  – Papers are online.
  – Reference book: Manning & Schutze (M&S).
  – Finish reading papers before class; I will ask you questions.
Grades
• Assignments (9 parts): 90%
  – Programming language: Java
• Class participation: 10%
• No quizzes, no final exams
• No “incomplete” unless you can prove your case.
Course objectives
• Covering basic statistical methods that produce state-of-the-art results
• Focusing on classification algorithms
• Touching on unsupervised and semi-supervised algorithms
• Some material is not easy. We will focus on applications, not theoretical proofs.
Course layout
• Supervised methods
  – Classification algorithms:
    • Individual classifiers:
      – Naïve Bayes
      – kNN and Rocchio
      – Decision tree
      – Decision list: ??
      – Maximum Entropy (MaxEnt)
    • Classifier ensembles:
      – Bagging
      – Boosting
      – System combination
Course layout (cont)
• Supervised algorithms (cont)
  – Sequence labeling algorithms:
    • Transformation-based learning (TBL)
    • FST, HMM, …
• Semi-supervised methods
  – Self-training
  – Co-training
Course layout (cont)
• Unsupervised methods
  – EM algorithm
    • Forward-backward algorithm
    • Inside-outside algorithm
    • …
Questions for each method
• Modeling:
  – What is the model?
  – How does the decomposition work?
  – What kind of assumptions are made?
  – How many types of model parameters are there?
  – How many “internal” (or non-model) parameters are there?
  – How is the multi-class problem handled?
  – How are non-binary features handled?
  – …
Questions for each method (cont)
• Training: how to estimate parameters?
• Decoding: how to find the “best” solution?
• Weaknesses and strengths?
  – Is the algorithm
    • robust? (e.g., in handling outliers)
    • scalable?
    • prone to overfitting?
    • efficient in training time? In test time?
  – How much data is needed?
    • Labeled data
    • Unlabeled data
Relation between 570/571 and 572
• 570/571 are organized by tasks; 572 is organized by learning methods.
• 572 focuses on statistical methods.
NLP tasks covered in Ling570
• Tokenization
• Morphological analysis
• POS tagging
• Shallow parsing
• WSD
• NE tagging
NLP tasks covered in Ling571
• Parsing
• Semantics
• Discourse
• Dialogue
• Natural language generation (NLG)
• …
A ML method for multiple NLP tasks
• Task (570/571):
  – Tokenization
  – POS tagging
  – Parsing
  – Reference resolution
  – …
• Method (572):
  – MaxEnt
Multiple methods for one NLP task
• Task (570/571): POS tagging
• Method (572):
  – Decision tree
  – MaxEnt
  – Boosting
  – Bagging
  – …
Projects: Task 1
• Text classification task: 20 groups
  – P1: First look at the Mallet package
  – P2: Your first tui class; Naïve Bayes
  – P3: Feature selection; Decision tree
  – P4: Bagging; Boosting
• Individual project
Projects: Task 2
• Sequence labeling task: IGT detection
  – P5: MaxEnt
  – P6: Beam search
  – P7: TBA
  – P8: Presentation: final class
  – P9: Final report
• Group project (?)
Both projects
• Use Mallet, a Java package
• Two types of work:
  – Reading code to understand ML methods
  – Writing code to solve problems
Feedback on assignments
• “Misc” section in each assignment:
  – How long did it take to finish the homework?
  – Which part was difficult?
  – …
Mallet overview
• It is a Java package that includes many
  – classifiers,
  – sequence labeling algorithms,
  – optimization algorithms,
  – useful data classes,
  – …
• You should
  – read the “Mallet Guides”,
  – attend the Mallet tutorial: next Tuesday 10:30-11:30am, LLC 109, and
  – start on Hw1.
• I will use Mallet class/method names when possible.
Questions for “course overview”?
Outline
• Course overview
• Mathematical foundation
  – Probability theory
  – Information theory
• Basic concepts in the classification task
Probability Theory
Basic concepts
• Sample space, event, event space
• Random variable and random vector
• Conditional probability, joint probability, marginal probability (prior)
Sample space, event, event space
• Sample space (Ω): a collection of basic outcomes.
  – Ex: toss a coin twice: {HH, HT, TH, TT}
• Event: an event is a subset of Ω.
  – Ex: {HT, TH}
• Event space (2Ω): the set of all possible events.
Random variable
• The outcome of an experiment need not be a number.
• We often want to represent outcomes as numbers.
• A random variable X is a function X: Ω → R.
  – Ex: toss a coin twice: X(HH)=0, X(HT)=1, …
Two types of random variables
• Discrete: X takes on only a countable number of possible values.
  – Ex: toss a coin 10 times; X is the number of tails that are noted.
• Continuous: X takes on an uncountable number of possible values.
  – Ex: X is the lifetime (in hours) of a light bulb.
Probability function
• The probability function of a discrete variable X is a function which gives the probability p(xi) that X equals xi: p(xi) = P(X = xi).
• It satisfies:

  0 ≤ p(xi) ≤ 1

  Σi p(xi) = 1
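For instance, in the two-coin-toss example above, let X be the number of heads; then:

  p(0) = P(TT) = 1/4,  p(1) = P({HT, TH}) = 1/2,  p(2) = P(HH) = 1/4,  and  1/4 + 1/2 + 1/4 = 1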
Random vector
• A random vector is a finite-dimensional vector of random variables: X = [X1, …, Xn].
• P(x) = P(x1, x2, …, xn) = P(X1=x1, …, Xn=xn)
• Ex: P(w1, …, wn, t1, …, tn)
Three types of probability
• Joint prob: P(x,y)= prob of x and y happening together
• Conditional prob: P(x|y) = prob of x given a specific value of y
• Marginal prob: P(x) = prob of x, summed over all possible values of y
Common tricks (I): Marginal prob → joint prob

  P(A) = ΣB P(A, B)

  P(A1) = ΣA2,…,An P(A1, A2, …, An)
Common tricks (II): Chain rule

  P(A, B) = P(A) * P(B|A) = P(B) * P(A|B)

  P(A1, …, An) = Πi P(Ai | A1, …, Ai-1)
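For example, applying the chain rule to a three-word sequence, as in language modeling:

  P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)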
Common tricks (III): Bayes rule

  P(B|A) = P(A, B) / P(A) = P(A|B) * P(B) / P(A)

  y* = arg maxy P(y|x)
     = arg maxy P(x|y) * P(y) / P(x)
     = arg maxy P(x|y) * P(y)
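A small made-up numeric example: with two classes where P(y1) = 0.7, P(y2) = 0.3, P(x|y1) = 0.1, and P(x|y2) = 0.4, we get P(x|y1)P(y1) = 0.07 and P(x|y2)P(y2) = 0.12, so y* = y2. Note that the denominator P(x) never has to be computed.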
Common tricks (IV): Independence assumption

  P(A1, …, An) = Πi P(Ai | A1, …, Ai-1) ≈ Πi P(Ai | Ai-1)
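For instance, a bigram language model makes exactly this (first-order Markov) assumption over a word sequence, with w0 taken to be a start symbol:

  P(w1, …, wn) ≈ Πi P(wi | wi-1)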
Prior and Posterior distribution
• Prior distribution: P(θ), a distribution over parameter values θ, set prior to observing any data.
• Posterior distribution: P(θ | data). It represents our belief that θ is true after observing the data.
• Likelihood of the model: P(data | θ)
• Relation among the three (Bayes rule):

  P(θ | data) = P(data | θ) * P(θ) / P(data)
Two ways of estimating θ

• Maximum likelihood (ML):

  θ* = arg maxθ P(data | θ)

• Maximum a posteriori (MAP):

  θ* = arg maxθ P(θ | data) = arg maxθ P(data | θ) * P(θ)
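A standard illustration (not from the slides): toss a coin n times with θ = P(head) and observe k heads. The likelihood is

  P(data | θ) = θ^k * (1-θ)^(n-k)

and maximizing it (e.g., by setting the derivative of the log-likelihood to zero) gives the ML estimate θ* = k/n.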
Information Theory
Information theory
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
  – Entropy
  – Joint entropy and conditional entropy
  – Cross entropy and relative entropy
  – Mutual information and perplexity
Entropy
• Entropy is a measure of the uncertainty associated with a distribution.
• It gives the lower bound on the number of bits it takes to transmit messages.
• An example:
  – Display the results of horse races.
  – Goal: minimize the number of bits to encode the results.
  H(X) = - Σx p(x) * log p(x)
An example
• Uniform distribution: pi = 1/8 for each of the eight outcomes:

  H(X) = - 8 * (1/8 * log2(1/8)) = 3 bits

• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64):

  H(X) = - (1/2 log2(1/2) + 1/4 log2(1/4) + 1/8 log2(1/8) + 1/16 log2(1/16) + 4 * 1/64 log2(1/64)) = 2 bits

  A matching code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)

• Uniform distribution has higher entropy.
• MaxEnt: make the distribution as “uniform” as possible.
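A quick numeric check of both values (a minimal sketch; the class and method names here are ours, not Mallet’s):

  // Computes H(X) = - sum_x p(x) * log2 p(x), in bits.
  public class EntropyDemo {
      static double entropy(double[] p) {
          double h = 0.0;
          for (double pi : p) {
              if (pi > 0) {                              // treat 0 * log 0 as 0
                  h -= pi * Math.log(pi) / Math.log(2);  // log2 via natural log
              }
          }
          return h;
      }

      public static void main(String[] args) {
          double[] uniform = {1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8.};
          double[] skewed  = {1/2., 1/4., 1/8., 1/16., 1/64., 1/64., 1/64., 1/64.};
          System.out.println(entropy(uniform));  // prints 3.0
          System.out.println(entropy(skewed));   // prints 2.0
      }
  }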
Joint and conditional entropy
• Joint entropy:

  H(X, Y) = - Σx Σy p(x, y) * log p(x, y)

• Conditional entropy:

  H(Y|X) = - Σx Σy p(x, y) * log p(y|x) = H(X, Y) - H(X)
Cross Entropy
• Entropy:

  H(X) = - Σx p(x) * log p(x)

• Cross entropy:

  Hc(X) = - Σx p(x) * log q(x)

• Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).

  H(X) ≤ Hc(X)
Relative Entropy
• Also called Kullback-Leibler (KL) divergence:

  KL(p || q) = Σx p(x) * log2 (p(x) / q(x)) = Hc(X) - H(X)

• Another “distance” measure between prob functions p and q.
• KL divergence is asymmetric (not a true distance):

  KL(p, q) ≠ KL(q, p)
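A minimal sketch that makes the asymmetry concrete (the method name is ours; it assumes q(x) > 0 wherever p(x) > 0):

  // KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
  static double kl(double[] p, double[] q) {
      double d = 0.0;
      for (int i = 0; i < p.length; i++) {
          if (p[i] > 0) {                     // terms with p(x) = 0 contribute 0
              d += p[i] * Math.log(p[i] / q[i]) / Math.log(2);
          }
      }
      return d;
  }

For example, with p = (1/2, 1/2) and q = (1/4, 3/4), kl(p, q) ≈ 0.208 bits while kl(q, p) ≈ 0.189 bits.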
Mutual information
• It measures how much is in common between X and Y:

  I(X; Y) = Σx Σy p(x, y) * log (p(x, y) / (p(x) * p(y)))
          = H(X) + H(Y) - H(X, Y)
          = I(Y; X)

• I(X; Y) = KL(p(x, y) || p(x) * p(y))
Perplexity
• Perplexity is 2^H.
• Perplexity is the weighted average number of choices a random variable has to make.
  – Ex: for the uniform 8-outcome distribution above, H = 3 bits, so the perplexity is 2^3 = 8 choices.
Questions for “Mathematical foundation”?
Outline
• Course overview
• Mathematical foundation– Probability theory– Information theory
• Basic concepts in the classification task
Types of ML problems
• Classification problem
• Estimation problem
• Clustering
• Discovery
• …
A learning method can be applied to one or more types of ML problems.
We will focus on the classification problem.
Definition of classification problem
• Task:
  – C = {c1, c2, …, cm} is a set of pre-defined classes (a.k.a. labels, categories).
  – D = {d1, d2, …} is a set of inputs to be classified.
  – A classifier is a function: D × C → {0, 1}.
• Multi-label vs. single-label:
  – Single-label: for each di, only one class is assigned to it.
• Multi-class vs. binary classification problem:
  – Binary: |C| = 2.
Conversion to single-label binary problem
• Multi-label → single-label:
  – We will focus on the single-label problem.
  – A classifier D × C → {0, 1} becomes D → C.
  – More general definition: D × C → [0, 1].
• Multi-class → binary problem:
  – Positive examples vs. negative examples
Examples of classification problems
• Text classification
• Document filtering
• Language/Author/Speaker id
• WSD
• PP attachment
• Automatic essay grading
• …
Problems that can be treated as a classification problem
• Tokenization / Word segmentation
• POS tagging
• NE detection
• NP chunking
• Parsing
• Reference resolution
• …
Labeled vs. unlabeled data
• Labeled data:
  – {(xi, yi)} is a set of labeled data.
  – xi ∈ D: data/input, often represented as a feature vector.
  – yi ∈ C: target/label
• Unlabeled data:
  – {xi} without yi
Instance, training and test data
• xi with or without yi is called an instance.
• Training data: a set of (labeled) instances.
• Test data: a set of unlabeled instances.
• The training data is stored in an InstanceList in Mallet, as is the test data.
Attribute-value table
• Each row corresponds to an instance.
• Each column corresponds to a feature.
• A feature type (a.k.a. a feature template): w-1
• A feature: w-1=book
• Binary features vs. non-binary features
Attribute-value table

        f1    f2   …   fK     Target
  d1    yes   1    no  -1000  c2
  d2
  d3
  …
  dn
Feature sequence vs. Feature vector
• Feature sequence: a (featName, featValue) list for features that are present.
• Feature Vector: a (featName, featValue) list for all the features.
• Representing data x as a feature vector.
Data/Input → a feature vector
• Example:
  – Task: text classification
  – Original x: a document
  – Feature vector: bag-of-words approach
• In Mallet, the process is handled by a sequence of pipes (see the sketch below):
  – Tokenization
  – Lowercasing
  – Merging the counts
  – …
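A minimal sketch of such a pipe sequence, assuming Mallet 2.x class names from cc.mallet.pipe (the class name BagOfWords and the tokenization regex are our own choices):

  import java.util.ArrayList;
  import java.util.regex.Pattern;
  import cc.mallet.pipe.*;
  import cc.mallet.types.InstanceList;

  public class BagOfWords {
      public static InstanceList newInstanceList() {
          ArrayList<Pipe> pipes = new ArrayList<Pipe>();
          pipes.add(new Target2Label());                  // map class names to Labels
          pipes.add(new CharSequence2TokenSequence(
                        Pattern.compile("\\w+")));        // tokenization
          pipes.add(new TokenSequenceLowercase());        // lowercase
          pipes.add(new TokenSequence2FeatureSequence()); // tokens -> feature indices
          pipes.add(new FeatureSequence2FeatureVector()); // merge counts into a vector
          return new InstanceList(new SerialPipes(pipes));
      }
  }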
Classifier and decision matrix
• A classifier is a function f: f(x) = {(ci, scorei)}. It fills out a decision matrix.
• {(ci, scorei)} is called a Classification in Mallet.
        d1    d2    d3   …
  c1    0.1   0.4   0    …
  c2    0.9   0.1   0    …
  c3
  …
Trainer (a.k.a. Learner)
• A trainer is a function that takes an InstanceList as input, and outputs a classifier.
• Training stage:
  – Classifier train(instanceList);
• Test stage:
  – Classification classify(instance);
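Putting the two stages together, a minimal sketch using one concrete Mallet trainer (NaiveBayesTrainer; the .mallet file names are made up):

  import java.io.File;
  import cc.mallet.classify.*;
  import cc.mallet.types.*;

  public class TrainAndTest {
      public static void main(String[] args) {
          // Load previously built InstanceLists (file names are hypothetical).
          InstanceList trainData = InstanceList.load(new File("train.mallet"));
          InstanceList testData  = InstanceList.load(new File("test.mallet"));

          // Training stage: the trainer takes an InstanceList, outputs a Classifier.
          NaiveBayesTrainer trainer = new NaiveBayesTrainer();
          Classifier classifier = trainer.train(trainData);

          // Test stage: each Classification holds the {(ci, scorei)} list.
          for (Instance inst : testData) {
              Classification c = classifier.classify(inst);
              System.out.println(c.getLabeling().getBestLabel());
          }
      }
  }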
Important concepts (summary)
• Instance, InstanceList
• Labeled data, unlabeled data
• Training data, test data
• Feature, feature template
• Feature vector
• Attribute-value table
• Trainer, classifier
• Training stage, test stage
Steps for solving an NLP task with classifiers
• Convert the task into a classification problem (optional)
• Split data into training/test/validation
• Convert the data into an attribute-value table
• Training
• Decoding
• Evaluation
Important subtasks (for you)
• Converting the data into an attribute-value table:
  – Define feature types
  – Feature selection
  – Convert an instance into a feature vector
• Understanding the training/decoding procedures of the various algorithms.
Notation
                 Classification in general   Text categorization
  Input/data     xi                          di
  Target/label   yi                          ci
  Features       fk                          tk (term)
  …              …                           …
Questions for “Concepts in a classification task”?
Summary
• Course overview
• Mathematical foundation (M&S Ch 2)
  – Probability theory
  – Information theory
• Basic concepts in the classification task
Downloading
• Hw1
• Mallet Guide
• Homework Guide
Coming up
• Next Tuesday:
  – Mallet tutorial on 1/8 (Tues): 10:30-11:30am at LLC 109.
  – Classification algorithm overview and Naïve Bayes: read the paper beforehand.
• Next Thursday:
  – kNN and Rocchio: read the other paper.
• Hw1 is due at 11pm on 1/13.
Additional slides
An example
• 570/571:
  – POS tagging: HMM
  – Parsing: PCFG
  – MT: Model 1-4 training
• 572:
  – HMM: forward-backward algorithm
  – PCFG: inside-outside algorithm
  – MT: EM algorithm
• All are special cases of the EM algorithm, one method of unsupervised learning.
Proof: Relative entropy is always non-negative
Since log z ≤ z - 1 for all z > 0:

  KL(p || q) = Σx p(x) * log (p(x) / q(x))
             = - Σx p(x) * log (q(x) / p(x))
             ≥ - Σx p(x) * (q(x) / p(x) - 1)
             = - (Σx q(x) - Σx p(x))
             = - (1 - 1) = 0
Entropy of a language
• The entropy of a language L:

  H(L) = - lim (n→∞) (1/n) Σx1n p(x1n) * log p(x1n)

• If we make certain assumptions that the language is “nice”, then the entropy can be calculated as:

  H(L) = - lim (n→∞) (1/n) log p(x1n)
Cross entropy of a language
• The cross entropy of a language L:

  H(L, q) = - lim (n→∞) (1/n) Σx1n p(x1n) * log q(x1n)

• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:

  H(L, q) = - lim (n→∞) (1/n) log q(x1n)