Function Learning and Neural Nets


Page 1: Function Learning and Neural Nets


FUNCTION LEARNING AND NEURAL NETS

Page 2: Function Learning and Neural Nets

SETTING
Learn a function with:
Continuous-valued examples, e.g., the pixels of an image
Continuous-valued output, e.g., the likelihood that the image is a '7'
Known as regression
[Regression can be turned into classification via thresholds]

Page 3: Function Learning and Neural Nets

FUNCTION-LEARNING (REGRESSION) FORMULATION
Goal function f
Training set: (x(i), y(i)), i = 1,…,n, with y(i) = f(x(i))
Inductive inference: find a function h that fits the points well
Same keep-it-simple bias
[Figure: points (x, f(x)) in the plane]

Page 4: Function Learning and Neural Nets

LEAST-SQUARES FITTING
Hypothesize a class of functions f(x,θ) parameterized by θ
Minimize the squared loss E(θ) = Σi (f(x(i),θ) − y(i))²
[Figure: data points and a fitted curve f(x)]

Page 5: Function Learning and Neural Nets

LINEAR LEAST-SQUARES
f(x,θ) = x θ
The value of θ that minimizes E(θ):
E(θ) = Σi (x(i) θ − y(i))²
     = Σi (x(i)² θ² − 2 x(i) y(i) θ + y(i)²)
E′(θ) = Σi (2 x(i)² θ − 2 x(i) y(i)) = 0
  ⇒ θ = [Σi x(i) y(i)] / [Σi x(i)²]
[Figure: data points, the true f(x), and the fitted line f(x,θ)]
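The closed-form slope above can be checked numerically. A minimal sketch; the function name and data are illustrative:

```python
import numpy as np

def fit_linear_no_offset(x, y):
    """Closed-form least-squares slope for f(x, theta) = theta * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # theta = sum_i x(i) y(i) / sum_i x(i)^2
    return np.dot(x, y) / np.dot(x, x)

# Data generated from y = 2x, so the fit should recover theta = 2
theta = fit_linear_no_offset([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```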

Page 6: Function Learning and Neural Nets

LINEAR LEAST-SQUARES WITH CONSTANT OFFSET
f(x,θ0,θ1) = θ0 + θ1 x
E(θ0,θ1) = Σi (θ0 + θ1 x(i) − y(i))²
Setting ∂E/∂θ0 = 0 and ∂E/∂θ1 = 0 at the optimum (θ0*, θ1*) gives:
  0 = 2 Σi (θ0* + θ1* x(i) − y(i))
  0 = 2 Σi x(i) (θ0* + θ1* x(i) − y(i))
Verify the solution:
  θ1* = [N (Σi x(i) y(i)) − (Σi x(i))(Σi y(i))] / [N (Σi x(i)²) − (Σi x(i))²]
  θ0* = (1/N) Σi (y(i) − θ1* x(i))
[Figure: data points, the true f(x), and the fitted line f(x,θ)]
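The two closed-form expressions translate directly into code. A minimal sketch with made-up data:

```python
import numpy as np

def fit_linear_with_offset(x, y):
    """Closed-form fit of f(x) = theta0 + theta1 * x using the slide's formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # theta1* = [N sum(xy) - sum(x) sum(y)] / [N sum(x^2) - (sum x)^2]
    theta1 = (n * np.dot(x, y) - x.sum() * y.sum()) / (n * np.dot(x, x) - x.sum() ** 2)
    # theta0* = (1/N) sum(y - theta1 x)
    theta0 = (y - theta1 * x).mean()
    return theta0, theta1

# Data generated from y = 1 + 2x, so the fit is exact
t0, t1 = fit_linear_with_offset([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```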

Page 7: Function Learning and Neural Nets

MULTI-DIMENSIONAL LEAST-SQUARES
Let x include attributes (x1,…,xN)
Let θ include coefficients (θ1,…,θN)
Model: f(x,θ) = x1 θ1 + … + xN θN
[Figure: data points and the fitted model]

Page 8: Function Learning and Neural Nets

MULTI-DIMENSIONAL LEAST-SQUARES
f(x,θ) = x1 θ1 + … + xN θN
The best θ is given by θ = (AᵀA)⁻¹ Aᵀ b,
where A is the matrix with the x(i)'s as rows and b is the vector of y(i)'s
[Figure: data points and the fitted model]
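The normal-equation solution θ = (AᵀA)⁻¹ Aᵀ b can be sketched with NumPy; the matrix A and vector b below are illustrative (chosen so the data are exactly linear):

```python
import numpy as np

# A: rows are the examples x(i); b: the vector of targets y(i)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([2.0, 3.0, 5.0])

# Normal equations: solve (A^T A) theta = A^T b rather than forming the inverse
theta = np.linalg.solve(A.T @ A, A.T @ b)

# lstsq computes the same minimizer in a more numerically stable way
theta_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```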

Page 9: Function Learning and Neural Nets

NONLINEAR LEAST-SQUARES
E.g., quadratic: f(x,θ) = θ0 + x θ1 + x² θ2
E.g., exponential: f(x,θ) = exp(θ0 + x θ1)
Any combination: f(x,θ) = exp(θ0 + x θ1) + θ2 + x θ3
Fitting can be done using gradient descent
[Figure: linear, quadratic, and other fits to the same data]
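Gradient-descent fitting of the exponential model can be sketched as follows; the data are synthetic, and the step size and iteration count are assumptions chosen for this tiny example:

```python
import numpy as np

def fit_exponential(x, y, alpha=0.01, iters=20000):
    """Gradient descent on E = sum_i (exp(t0 + t1*x_i) - y_i)^2."""
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        pred = np.exp(t0 + t1 * x)
        r = pred - y
        # Chain rule: d pred/d t0 = pred, d pred/d t1 = x * pred
        t0 -= alpha * 2 * np.sum(r * pred)
        t1 -= alpha * 2 * np.sum(r * x * pred)
    return t0, t1

x = np.array([0.0, 0.5, 1.0])
y = np.exp(0.3 + 0.8 * x)        # generated from t0 = 0.3, t1 = 0.8
t0, t1 = fit_exponential(x, y)
```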

Page 10: Function Learning and Neural Nets

ASIDE: FEATURE TRANSFORMS
Common model: weighted sums of nonlinear functions f1(x),…,fN(x) — linear in the feature space
Polynomial: g(x,θ) = θ0 + x θ1 + … + x^d θd
In general: g(x,θ) = f1(x) θ1 + … + fN(x) θN
The least-squares fit can be solved exactly by considering a transformed dataset (x′, y) with x′ = (f1(x),…,fN(x))
More on this later…
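The transformed-dataset trick can be illustrated with a polynomial fit; the helper name and data are illustrative:

```python
import numpy as np

def poly_features(x, d):
    """Map scalar inputs to the feature space x' = (1, x, ..., x^d)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, d + 1, increasing=True)

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2        # a quadratic, known exactly

# Ordinary linear least squares on the transformed dataset (x', y)
A = poly_features(x, 2)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
# theta recovers (1, 2, 3) because the model class contains the target
```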

Page 11: Function Learning and Neural Nets

Gradient direction is orthogonal to the level sets (contours) of f and points in the direction of steepest increase.

Page 12: Function Learning and Neural Nets

(Same caption, repeated over the figure.)

Page 13: Function Learning and Neural Nets

Gradient descent: iteratively move in the direction opposite the gradient.

(Pages 14–19 repeat this caption over a figure sequence animating successive descent steps.)

Page 20: Function Learning and Neural Nets

GRADIENT DESCENT FOR LEAST SQUARES
Let θ = (θ1,…,θn)
Error: E(θ) = Σi (f(x(i),θ) − y(i))²
Take the gradient: ∇E(θ) = Σi 2 (f(x(i),θ) − y(i)) ∇θ f(x(i),θ)
Update rule: θ ← θ − α ∇E(θ)
∇θ f(x(i),θ) is a vector in which element k indicates how quickly the prediction at example i changes with respect to a change in θk

Page 21: Function Learning and Neural Nets

GRADIENT DESCENT FOR LEAST SQUARES (continued)
∇E(θ) = Σi 2 (f(x(i),θ) − y(i)) ∇θ f(x(i),θ)
The term (f(x(i),θ) − y(i)) is the error at example i

Page 22: Function Learning and Neural Nets

GRADIENT DESCENT EXAMPLE: LINEAR FITTING
f(x,θ) = x1 θ1 + … + xN θN, so ∇θ f(x(i),θ) = x(i)
Update rule (with the constant factor absorbed into α): θ ← θ + α Σi x(i) (y(i) − x(i)·θ)
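A minimal batch-gradient-descent sketch of this update; the data, step size, and iteration count are assumptions:

```python
import numpy as np

def gd_linear(A, b, alpha=0.01, iters=5000):
    """Batch gradient descent on E(theta) = sum_i (x(i).theta - y(i))^2."""
    theta = np.zeros(A.shape[1])
    for _ in range(iters):
        residual = A @ theta - b                # per-example prediction error
        theta -= alpha * 2 * (A.T @ residual)   # gradient of the squared loss
    return theta

# Exactly linear data: the minimizer is theta = [2, 3]
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([2.0, 3.0, 5.0])
theta = gd_linear(A, b)
```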

Page 23: Function Learning and Neural Nets

PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)

y = f(x,w) = g(Σi=1,…,n wi xi)

[Figure: unit diagram — inputs x1,…,xn, weighted by wi, summed (Σ) and thresholded (g) to give output y; a 2-D plot of + and − examples separated by the line w1 x1 + w2 x2 = 0; the step function g(u)]

Page 24: Function Learning and Neural Nets

A SINGLE PERCEPTRON CAN LEARN:
A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3
The majority function
[Figure: unit diagram as before]

Page 25: Function Learning and Neural Nets

A SINGLE PERCEPTRON CAN LEARN:
A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3
The majority function
XOR?
[Figure: unit diagram as before]

Page 26: Function Learning and Neural Nets

PERCEPTRON LEARNING RULE
θ ← θ + x(i) (y(i) − g(θᵀ x(i)))
(g outputs either 0 or 1; y is either 0 or 1)
If the output is correct, the weights are unchanged
If g is 0 but y is 1, the weights on the active attributes are increased, raising g's output
If g is 1 but y is 0, the weights on the active attributes are decreased, lowering g's output
Converges if the data are linearly separable, but oscillates otherwise
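The rule above can be sketched as follows; the OR dataset (a disjunction, so linearly separable) and epoch count are illustrative:

```python
import numpy as np

def g(u):
    """Threshold activation: outputs 0 or 1."""
    return 1 if u >= 0 else 0

def train_perceptron(X, y, epochs=20):
    """Perceptron rule: theta <- theta + x(i) * (y(i) - g(theta . x(i)))."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            theta += xi * (yi - g(theta @ xi))
    return theta

# Learn OR; the first column is a constant-1 bias input
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
theta = train_perceptron(X, y)
preds = [g(theta @ xi) for xi in X]   # separable data, so the rule converges
```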

Page 27: Function Learning and Neural Nets

PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)

y = f(x,w) = g(Σi=1,…,n wi xi)

[Figure: the same unit diagram and step function g(u), now with + and − examples that are not linearly separable, marked with a "?"]

Page 28: Function Learning and Neural Nets

UNIT (NEURON)

y = g(Σi=1,…,n wi xi),  with g(u) = 1/[1 + exp(−u)]

[Figure: unit diagram — inputs x1,…,xn, weights wi, sum Σ, sigmoid g, output y]
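A minimal sketch of a single sigmoid unit; the weights and inputs are illustrative:

```python
import math

def sigmoid(u):
    """Soft threshold g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def unit_output(w, x):
    """y = g(sum_i w_i x_i) for a single neuron."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Net input here is 1*3 + (-2)*1.5 = 0, so the unit outputs g(0) = 0.5
y = unit_output([1.0, -2.0], [3.0, 1.5])
```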

Page 29: Function Learning and Neural Nets

NEURAL NETWORK
Network of interconnected neurons
Acyclic (feed-forward) vs. recurrent networks
[Figure: two connected unit diagrams]

Page 30: Function Learning and Neural Nets

TWO-LAYER FEED-FORWARD NEURAL NETWORK

[Figure: inputs → hidden layer (weights w1j) → output layer (weights w2k)]

Page 31: Function Learning and Neural Nets

NETWORKS WITH HIDDEN LAYERS
Can represent XOR and other nonlinear functions
Common neuron types: linear, soft perceptron (sigmoid), radial basis functions
As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features
How to train hidden layers?

Page 32: Function Learning and Neural Nets

BACKPROPAGATION (PRINCIPLE)
New example: y(k) = f(x(k))
φ(k) = the NN prediction computed with weights w(k−1) for inputs x(k)
Error function: E(k)(w(k−1)) = (φ(k) − y(k))²
Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, and so on
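The layer-by-layer update can be sketched for a two-layer sigmoid network with squared error; the network size, step size, and data below are all assumptions chosen for a tiny demonstration:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(W1, W2, x, y, alpha):
    """One update of a 2-layer sigmoid net on the squared error (phi - y)^2.
    The output layer is updated from the prediction error, then that error is
    propagated back through W2 to update the hidden layer."""
    # Forward pass
    h = sigmoid(W1 @ x)           # hidden activations
    phi = sigmoid(W2 @ h)         # network prediction
    # Backward pass, using g'(u) = g(u) (1 - g(u))
    delta_out = 2 * (phi - y) * phi * (1 - phi)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 -= alpha * np.outer(delta_out, h)
    W1 -= alpha * np.outer(delta_hid, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))      # 2 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))      # 3 hidden units -> 1 output
x = np.array([0.5, -1.0])
y = np.array([1.0])

err0 = float(np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - y) ** 2))
for _ in range(100):
    W1, W2 = backprop_step(W1, W2, x, y, alpha=0.5)
err1 = float(np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - y) ** 2))
# Repeated updates reduce the squared error on this example
```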

Page 33: Function Learning and Neural Nets

STOCHASTIC GRADIENT DESCENT
Gradient descent uses a batch update, because all examples are incorporated in each step
Stochastic gradient descent: use a single example on each step
Update rule: pick an example i (either at random or in order) and a step size α, then
  θ ← θ + α x(i) (y(i) − g(x(i),θ))
This reduces the error on the i'th example… but does it converge?
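A minimal sketch of this update on a linear model g(x,θ) = x·θ; the data, step size, and epoch count are illustrative:

```python
import numpy as np

def sgd_linear(X, y, alpha=0.05, epochs=200):
    """Stochastic gradient descent, one example per step:
    theta <- theta + alpha * x(i) * (y(i) - x(i) . theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):      # in-order pass; random order also works
            theta += alpha * xi * (yi - xi @ theta)
    return theta

# Exactly linear data (theta = [2, 3]), so SGD settles near the fit
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 3.0, 5.0])
theta = sgd_linear(X, y)
```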

Page 34: Function Learning and Neural Nets

UNDERSTANDING BACKPROPAGATION
Minimize E(θ) by gradient descent…
[Figure: curve of E(θ) vs. θ]

Page 35: Function Learning and Neural Nets

UNDERSTANDING BACKPROPAGATION
Minimize E(θ) by gradient descent…
[Figure: curve of E(θ) vs. θ, with the gradient of E marked]

Page 36: Function Learning and Neural Nets

UNDERSTANDING BACKPROPAGATION
Minimize E(θ) by gradient descent…
[Figure: curve of E(θ) vs. θ — each step is proportional to the gradient]

Page 37: Function Learning and Neural Nets

LEARNING ALGORITHM
Given many examples (x(1),y(1)),…,(x(N),y(N)) and a learning rate α
Init: set k = 1 (or rand(1,N))
Repeat:
  Tweak the weights with a backpropagation update on example (x(k), y(k))
  Set k = k+1 (or rand(1,N))

Page 38: Function Learning and Neural Nets

UNDERSTANDING BACKPROPAGATION
Example of stochastic gradient descent
Decompose E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ek = (g(x(k),θ) − y(k))²
On each iteration, take a step to reduce ek
[Figure: curve of E(θ) vs. θ, with the gradient of e1 marked]

(Pages 39–43 repeat this slide, animating successive steps along the gradients of e1, e2, and e3.)

Page 44: Function Learning and Neural Nets

STOCHASTIC GRADIENT DESCENT
The objective-function values (measured over all examples) settle into a local minimum over time
The step size must be reduced over time, e.g., O(1/t)

Page 45: Function Learning and Neural Nets

CAVEATS
Choosing a convergent "learning rate" can be hard in practice
[Figure: curve of E(θ) vs. θ]

Page 46: Function Learning and Neural Nets

EXAMPLE FROM B553: IMAGE ENCODING
2-layer network: 1 hidden radial-basis-function layer with 50 neurons (200 parameters)
1,000 training examples
[Figure: the fitted NN output next to the original 12x18 image]

Page 47: Function Learning and Neural Nets

COMMENTS AND ISSUES
How to choose the size and structure of networks?
If the network is too large, there is a risk of over-fitting (memorizing the data)
If the network is too small, the representation may not be rich enough
Role of representation: e.g., learning the concept of an odd number

Page 48: Function Learning and Neural Nets

BENEFITS / DRAWBACKS OF NNS
Benefits:
  Easy to generate complex nonlinear function classes
  Incremental learning via stochastic gradient descent
  Good performance on many problems
  Predictions evaluated quickly
Drawbacks:
  Difficult to characterize the hypothesis space
  Low interpretability
  Relatively slow training; local minima

Page 49: Function Learning and Neural Nets

PERFORMANCE OF FUNCTION LEARNING
Overfitting: too many parameters
Regularization: penalize large parameter values
  Minimize E(θ) + λ C(θ), where C(θ) measures the cost of the parameters (independent of the data) and λ is a regularization parameter
Efficient optimization:
  If E(θ) is nonconvex, we can only guarantee finding a local minimum
  Batch updates are expensive; stochastic updates converge slowly
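With the quadratic cost C(θ) = ‖θ‖², the regularized minimization has a closed form (ridge regression): θ = (AᵀA + λI)⁻¹ Aᵀ b. A minimal sketch; the data and λ values are illustrative:

```python
import numpy as np

def ridge_fit(A, b, lam):
    """Minimize E(theta) + lam * ||theta||^2 for the linear least-squares error.
    Closed form: theta = (A^T A + lam I)^-1 A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([2.0, 3.0, 5.0])
theta_plain = ridge_fit(A, b, 0.0)    # lam = 0 recovers ordinary least squares
theta_reg = ridge_fit(A, b, 10.0)     # a larger lam shrinks the parameters
```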

Page 50: Function Learning and Neural Nets

READINGS
R&N 18.8–9