LING 696B: Graph-based methods and Supervised learning


Page 1: LING 696B: Graph-based methods and Supervised learning

Page 2: Road map

Types of learning problems:
- Unsupervised: clustering, dimension reduction -- generative models
- Supervised: classification (today) -- discriminative models

Methodology:
- Parametric: stronger assumptions about the distribution (blobs, mixture models)
- Non-parametric: weaker assumptions (neural nets, spectral clustering, Isomap)

Page 3: Puzzle from several weeks ago

How do people learn categories from distributions?

Liberman et al. (1952)

Page 4: Graph-based non-parametric methods

"Learn locally, think globally"
- Local learning produces a graph that reveals the underlying structure: learning the neighbors (see the sketch after this slide)
- The graph is then used to reveal global structure in the data:
  - Isomap: geodesic distance through shortest paths
  - Spectral clustering: connected components from the graph spectrum (see demo)
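
Below is a minimal sketch of the "learn locally" step: building a k-nearest-neighbor graph with numpy. The function name knn_graph, the choice of Euclidean distance, and the default value of k are illustrative assumptions, not part of the original slides.

    import numpy as np

    def knn_graph(X, k=5):
        """Build a symmetric k-nearest-neighbor adjacency matrix from data X (n x d)."""
        n = X.shape[0]
        # Pairwise squared Euclidean distances between all points
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.zeros((n, n))
        for i in range(n):
            # The k nearest neighbors of point i (index 0 is i itself, assuming no duplicate points)
            nbrs = np.argsort(d2[i])[1:k + 1]
            W[i, nbrs] = 1.0
        # Symmetrize: keep an edge if either endpoint lists the other as a neighbor
        return np.maximum(W, W.T)

Isomap would run shortest paths over this graph; spectral clustering would work with its spectrum, as on the following slides.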

Page 5: Clustering as a graph partitioning problem

Normalized-cut problem: split the graph into two parts A and B so that
- each part is not too small
- the edges being cut don't carry too much weight

  Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)

where cut(A, B) is the total weight on edges from A to B, and assoc(A, V) is the total weight on edges attached to A (the edges within A plus the cut edges). A small sketch of this quantity follows.
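
A small sketch of the quantity being minimized, assuming the standard Shi & Malik definition of the normalized cut; the function name and the numpy representation of the partition are my own.

    import numpy as np

    def ncut_value(W, in_A):
        """Normalized-cut value of a two-way partition of a weighted graph.
        W: symmetric weight matrix (n x n); in_A: boolean vector, True for nodes in part A."""
        in_B = ~in_A
        cut = W[np.ix_(in_A, in_B)].sum()    # weight on edges going from A to B
        assoc_A = W[in_A, :].sum()           # weight on all edges attached to A
        assoc_B = W[in_B, :].sum()           # weight on all edges attached to B
        return cut / assoc_A + cut / assoc_B

A cut that severs few, light edges while keeping both parts well connected gives a small value, which is what the partitioning problem asks for.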

Page 6: Normalized cut through spectral embedding

- The exact solution of normalized-cut is NP-hard (explodes for large graphs)
- A "soft" version is solvable: look for coordinates x1, ..., xN for the nodes that minimize

    sum_{i,j} W_ij (x_i - x_j)^2

  where W is the neighborhood (weight) matrix, so that strongly connected nodes stay nearby and weakly connected nodes stay far away
- Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding (see the sketch below)
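
A minimal sketch of the spectral embedding described above, assuming a numpy weight matrix with no isolated nodes; the use of the normalized Laplacian and the choice to drop the first (trivial) eigenvector follow the usual recipe, and the function name is my own.

    import numpy as np

    def spectral_embedding(W, dim=2):
        """Coordinates for graph nodes from eigenvectors of the normalized Laplacian.
        W: symmetric neighborhood/weight matrix; dim: embedding dimension."""
        d = W.sum(axis=1)                      # node degrees (assumed > 0)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        # Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
        L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
        vals, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
        # Skip the constant eigenvector; the next ones give the "soft cut" coordinates
        return vecs[:, 1:dim + 1]

Running an ordinary clustering method (e.g. k-means) on these coordinates is one common way to finish the job.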

Page 7: Is this relevant to how people learn categories?

- Maye & Gerken: learning a bimodal distribution on a curve (living in an abstract manifold) from /d/ to /(s)t/
- Mixture model: transform the signal and approximate it with two "dynamic blobs"
- Can people learn categories from arbitrary manifolds following a "local learning" strategy? Simple case: start from a uniform distribution (see demo)

Page 8: Local learning from graphs

Can people learn categories from arbitrary manifolds following a "local learning" strategy? Most likely no.
- What constrains the kind of manifolds that people can learn?
- What are the reasonable metrics people use?
- How does neighborhood size affect this type of learning?
- Learning through non-uniform distributions?

Page 9: Switching gears

- Supervised learning: learning a function from input-output pairs -- arguably, something that people also do
- Example: the perceptron, which learns a function f(x) = sign(<w, x> + b) (sketch below)
- Also called a "classifier": a machine with yes/no output
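
A minimal sketch of a perceptron learner for f(x) = sign(<w, x> + b), assuming numpy arrays and labels in {-1, +1}; the function names and the epoch limit are illustrative choices.

    import numpy as np

    def perceptron_train(X, y, epochs=100):
        """Learn w, b for the classifier f(x) = sign(<w, x> + b).
        X: n x d inputs; y: labels in {-1, +1}."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi + b) <= 0:    # misclassified (or exactly on the line)
                    w += yi * xi              # nudge the line toward the correct side
                    b += yi
                    errors += 1
            if errors == 0:                   # converged (possible only if the data are separable)
                break
        return w, b

    def classify(x, w, b):
        """The yes/no output of the learned classifier."""
        return np.sign(w @ x + b)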

Page 10: Speech perception as a classification problem

- Speech perception is viewed as a bottom-up procedure involving many decisions, e.g. sonorant/consonant, voiced/voiceless (see Peter's presentation)
- A long-standing effort to build machines that do the same: Stevens' view of distinctive features

Page 11: Knowledge-based speech recognition

Mainstream method:
- Front end: uniform signal representation
- Back end: hidden Markov models

Knowledge-based:
- Front end: sound-specific features based on acoustic knowledge
- Back end: a series of decisions on how lower-level knowledge is integrated

Page 12: The conceptual framework from (Liu, 96) and others

(Flow-chart figure.) Each step is hard work; bypassed in Stevens 02.

Page 13: Implications of the flow-chart architecture

- Requires accurate low-level decisions; mistakes can build up very quickly
- Thought experiment: "linguistic" speech recognition through a sequence of distinctive-feature classifiers
- Hand-crafted decision rules are often not robust/flexible -- hence the need for good statistical classifiers

Page 14: An unlikely marriage

- Recent years have seen several sophisticated classification machines, e.g. the support vector machine by Vapnik (today); interest is moving from neural nets to these new machines
- Many have proposed integrating the new classifiers as a back end
- Niyogi and Burges paper: building feature detectors with SVMs

Page 15: Generalization in classification

Experiment: you are learning a line that separates two classes.

Page 16: Generalization in classification

Question: where does the yellow dot belong?

Page 17: Generalization in classification

Question: where does the yellow dot belong?

Page 18: Margin and linear classifiers

We tend to draw the line that gives the most "room" (the margin) between the two clouds; this is made precise below.
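
One standard way to make the "room" precise, assuming labels y_i in {-1, +1} and the linear classifier f(x) = <w, x> + b from the perceptron slide (textbook notation, not necessarily the slide's own):

\[
\gamma_i = \frac{y_i\,(\langle w, x_i\rangle + b)}{\lVert w\rVert},
\qquad
\text{margin} = \min_i \gamma_i .
\]

Maximizing this minimum distance over w and b is what the following slides formalize.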

Page 19: Margin

The margin needs to be defined on "border" points.

Page 20: Margin

The margin needs to be defined on "border" points.

Page 21: Justification for maximum margin

Hopefully, maximum-margin classifiers generalize well.

Page 22: Justification for maximum margin

Hopefully, maximum-margin classifiers generalize well.

Page 23: Support vectors in the separable case

Support vectors: the data points that reach the maximal margin from the separating line.

Page 24: Formalizing maximum margin -- optimization for SVM

- We need constrained optimization: f(x) = sign(<w, x> + b) is the same as sign(<Cw, x> + Cb) for any C > 0, so the scale of w must be pinned down
- Two strategies for choosing a constrained optimization problem (the second is written out below):
  - limit the length of w and maximize the margin
  - fix the margin and minimize the length of w
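
A conventional way to write the second strategy (fix the margin at 1 and minimize the length of w), again assuming labels y_i in {-1, +1}; this is the standard textbook form rather than the slide's exact notation:

\[
\min_{w,\;b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(\langle w, x_i\rangle + b) \ge 1 \ \text{for all } i .
\]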

Page 25: SVM optimization (see demo)

- This is a constrained quadratic programming problem
- It can be shown (through the Lagrange multiplier method) that the solution looks like

    w = sum_i alpha_i y_i x_i,  with alpha_i >= 0

  where y_i is the label and the constraint y_i(<w, x_i> + b) >= 1 fixes the margin
- A linear combination of the training data!

Page 26: SVM applied to non-separable data

What happens when the data is not separable? The optimization problem has no solution (recall the XOR problem). See demo.

Page 27: Extension to non-separable data through new variables

Allow the data points to "encroach" on the separating line (see demo): add slack variables ξ_i and minimize

    (1/2)||w||^2 + C sum_i ξ_i    (original objective + tolerance)

subject to y_i(<w, x_i> + b) >= 1 - ξ_i, with ξ_i >= 0.

Page 28: When things become wild: non-linear extensions

- The majority of "real world" problems are not separable; this can be due to some deep underlying laws, e.g. the XOR data
- Non-linearity in neural nets comes from hidden layers and non-linear activations
- SVM initiates a trendier way of making non-linear machines -- kernels

Page 29: Kernel methods

- Model-fitting problems are ill-posed without constraining the space
- Avoid committing to a space: non-parametric methods using kernels
- Idea: let the space grow with the data. How? Associate each data point with a little function, e.g. a blob, and let the space be the linear combinations of these
- Connection to neural nets

Page 30: Kernel extension of SVM

- Recall the linear solution w = sum_i alpha_i y_i x_i; substituting this into f gives

    f(x) = sign( sum_i alpha_i y_i <x, x_i> + b )

- What matters is the dot product: use a general kernel function K(x, x_i) in place of <x, x_i> (see the sketch below)
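
A minimal sketch of the kernelized decision rule, assuming the coefficients alpha_i and the bias b have already been found by the quadratic program; the Gaussian (RBF) kernel stands in for a generic K, and the function names are my own.

    import numpy as np

    def rbf_kernel(x, xi, sigma=1.0):
        """Gaussian kernel: a small "blob" centered at xi (cf. radial basis networks)."""
        return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

    def svm_decide(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
        """f(x) = sign( sum_i alpha_i y_i K(x, x_i) + b ):
        the linear solution with <x, x_i> replaced by K(x, x_i)."""
        s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y_train, X_train))
        return np.sign(s + b)

Only the training points with alpha_i > 0 (the support vectors) contribute to the sum, which is why the computation stays finite even when the implicit feature space is not.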

Page 31: Kernel extension of SVM

- This is very much like replacing linear nodes with non-linear nodes in a neural net
- Radial basis network: each K(x, x_i) is a Gaussian centered at x_i -- a small blob
- "Seeing" the non-linearity -- a theorem: the kernel is still a dot product, except that it works in an infinite-dimensional space of "features"

Page 32: This is not a fairy tale

- Hopefully, by throwing the data into infinite dimensions, they will become separable
- How can things work in infinite dimensions?
  - The infinite dimension is implicit
  - Only support vectors act as "anchors" for the separating plane in feature space
  - All the computation is done in finite dimensions by searching through the support vectors and their weights
- As a result, we can do lots of things with SVM by playing with kernels (see demo)

Page 33: Reflections

How likely is this to be a human learning model?

Page 34: Reflections

- How likely is this to be a human learning model?
- Are all learning problems reducible to classification?

Page 35: Reflections

- How likely is this to be a human learning model?
- Are all learning problems reducible to classification?
- What learning models are appropriate for speech?