
CSE 559A: Computer Vision
Fall 2020: T-R: 11:30-12:50pm @ Zoom
Instructor: Ayan Chakrabarti ([email protected]). Course Staff: Adith Boloor, Patrick Williams
Nov 24, 2020
http://www.cse.wustl.edu/~ayan/courses/cse559a/


Page 1: Title slide (course information as above).

Page 2:
GENERAL
Look at your proposal feedback!
Problem Set 4 due today. Problem Set 5 will be out by tonight.
No class Thursday. No office hours on Friday.

Pages 3-14:
CLASSIFICATION (figure-only slides; no recoverable text).

Page 15:
CLASSIFICATION
Learn x̄ = g(x; θ) and do binary classification on its output.
Again, use (stochastic) gradient descent. But this time, the cost is no longer convex.

Before (logistic regression directly on x̄):

w = \arg\min_w \frac{1}{T} \sum_t y_t \log\left[1 + \exp(-w^T \bar{x}_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T \bar{x}_t)\right]

Now (learning the encoder jointly):

\theta, w = \arg\min_{\theta, w} \frac{1}{T} \sum_t y_t \log\left[1 + \exp(-w^T g(x_t; \theta))\right] + (1 - y_t) \log\left[1 + \exp(w^T g(x_t; \theta))\right]
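The non-convex objective above is just the logistic loss applied to the encoded features. A minimal NumPy sketch (the names `logistic_loss`, `xb` are ours, not part of the course's edf framework):

```python
import numpy as np

# Binary logistic loss from the slide, averaged over T samples.
# y is in {0, 1}; xb holds the encoded features x-bar, one row per sample.
def logistic_loss(w, xb, y):
    z = xb @ w  # w^T x-bar for every sample, shape (T,)
    return np.mean(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))

rng = np.random.default_rng(0)
xb = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=8).astype(float)

# Sanity check: with w = 0, every per-sample term is log 2 regardless of y.
w = np.zeros(3)
assert np.isclose(logistic_loss(w, xb, y), np.log(2))
```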

Page 16:
CLASSIFICATION
Learn x̄ = g(x; θ).
Again, use (stochastic) gradient descent. But this time, the cost is no longer convex. Turns out... it doesn't matter (sort of).
Recall in the previous case (where C_t is the cost of one sample):

\nabla_w C_t = \bar{x}_t \left[ \frac{\exp(w^T \bar{x}_t)}{1 + \exp(w^T \bar{x}_t)} - y_t \right]

What about now? Exactly the same, with x̄ = g(x; θ) for the current value of θ.

Page 17:
CLASSIFICATION
What about θ? First, what is ∇_{x̄_t} C_t?
Take 5 mins!

Page 18:
CLASSIFICATION
What about θ? First, what is

\nabla_{\bar{x}_t} C_t = \; ?

Page 19:
CLASSIFICATION

\nabla_{\bar{x}_t} C_t = w \left[ \frac{\exp(w^T \bar{x}_t)}{1 + \exp(w^T \bar{x}_t)} - y_t \right]
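The gradient on this slide can be verified by finite differences. A standalone sketch (the function names `C` and `grad_xb` are ours):

```python
import numpy as np

# One-sample cost C_t and the slide's gradient w.r.t. the encoded vector x-bar:
# grad = w * (sigma(w^T xb) - y), with sigma(z) = exp(z) / (1 + exp(z)).
def C(w, xb, y):
    z = w @ xb
    return y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z))

def grad_xb(w, xb, y):
    z = w @ xb
    return w * (np.exp(z) / (1 + np.exp(z)) - y)

rng = np.random.default_rng(1)
w, xb, y = rng.normal(size=4), rng.normal(size=4), 1.0

# Central finite differences along each coordinate of x-bar.
eps = 1e-6
num = np.array([(C(w, xb + eps * e, y) - C(w, xb - eps * e, y)) / (2 * eps)
                for e in np.eye(4)])
assert np.allclose(num, grad_xb(w, xb, y), atol=1e-5)
```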

Page 20:
CLASSIFICATION
Now, let's say θ was an M × N matrix, and g(x; θ) = θx.
N is the length of the vector x; M is the length of the encoded vector x̄.
What is ∇_θ C_t?
Take 5 mins!

Page 21:
CLASSIFICATION
(Same setup: θ is M × N, g(x; θ) = θx.) What is

\nabla_\theta C_t = \; ?

Page 22:
CLASSIFICATION
With g(x; θ) = θx:

\nabla_\theta C_t = \left( \nabla_{\bar{x}_t} C_t \right) x_t^T

This is actually a linear classifier on x:

w^T \theta x = (\theta^T w)^T x

But because of our factorization, the cost is no longer convex.
If we want to increase the expressive power of our classifier, g has to be non-linear!
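The collapse into a single linear classifier is a one-line identity to check numerically:

```python
import numpy as np

# The slide's point: with g(x; theta) = theta @ x, the composed score
# w^T (theta @ x) equals a single linear classifier with weights theta^T w.
rng = np.random.default_rng(2)
theta = rng.normal(size=(3, 5))  # M x N
w = rng.normal(size=3)           # M
x = rng.normal(size=5)           # N
assert np.isclose(w @ (theta @ x), (theta.T @ w) @ x)
```

So stacking linear maps buys no expressive power, which is why g must be non-linear.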

Page 23:
CLASSIFICATION: The Multi-Layer Perceptron
x

Page 24:
CLASSIFICATION: The Multi-Layer Perceptron
x ⟶ h̄, with h̄ = θx

Page 25:
CLASSIFICATION: The Multi-Layer Perceptron
x ⟶ h̄ ⟶ h, with h̄ = θx and h_j = κ(h̄_j).
κ is an element-wise non-linearity. For example κ(x) = σ(x). More on this later.
It has no learnable parameters.

Page 26:
CLASSIFICATION: The Multi-Layer Perceptron
x ⟶ h̄ ⟶ h ⟶ y, with h̄ = θx, h_j = κ(h̄_j), and y = w^T h.
κ is an element-wise non-linearity. For example κ(x) = σ(x). More on this later.
It has no learnable parameters.

Page 27:
CLASSIFICATION: The Multi-Layer Perceptron
x ⟶ h̄ ⟶ h ⟶ y ⟶ p, with h̄ = θx, h_j = κ(h̄_j), y = w^T h, p = σ(y).
κ is an element-wise non-linearity. For example κ(x) = σ(x). More on this later. It has no learnable parameters.
σ is our sigmoid to convert log-odds to a probability:

\sigma(y) = \frac{\exp(y)}{1 + \exp(y)}

Multiplication by θ and the action of κ together form a "layer".
It's called a hidden layer, because you're learning a latent representation:
you don't have direct access to the true value of its outputs;
you're learning a representation that, jointly with a learned classifier, is optimal.
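The pipeline x ⟶ h̄ ⟶ h ⟶ y ⟶ p above can be sketched in a few lines of NumPy (a standalone sketch with κ = σ, as the slide suggests; `mlp_forward` is our name, not the course framework's):

```python
import numpy as np

def sigma(z):
    # Sigmoid: converts log-odds to a probability in (0, 1).
    return np.exp(z) / (1 + np.exp(z))

def mlp_forward(theta, w, x):
    hbar = theta @ x  # linear layer: h-bar = theta x
    h = sigma(hbar)   # element-wise non-linearity kappa (= sigma here)
    y = w @ h         # log-odds: y = w^T h
    return sigma(y)   # probability: p = sigma(y)

rng = np.random.default_rng(3)
p = mlp_forward(rng.normal(size=(4, 6)), rng.normal(size=4), rng.normal(size=6))
assert 0.0 < p < 1.0  # output is always a valid probability
```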

Page 28:
CLASSIFICATION: The Multi-Layer Perceptron
This is a neural network: a complex function formed by composition of simple linear and non-linear functions.
This network has learnable parameters θ, w.
Train by gradient descent with respect to a classification loss. What are the gradients?
Doing this manually is going to get old really fast.

Autograd
Express a complex function as a composition of simpler functions.
Store this as nodes in a computation graph.
Use the chain rule to automatically back-propagate.
Popular autograd systems: Tensorflow, Torch, Caffe, MXNet, Theano, ...
We'll write our own!

x ⟶ h̄ ⟶ h ⟶ y ⟶ p, with h̄ = θx, h_j = κ(h̄_j), y = w^T h, p = σ(y).


Page 30:
AUTOGRAD / BACK-PROPAGATION
Say we want to minimize a loss L that is a function of parameters and training data.
Let's say for a parameter θ we can write:

L = f(x); \quad x = g(\theta, y)

where y is independent of θ, and f does not use θ except through x.
Now, let's say I gave you the value of y and the gradient of L with respect to x.
x is an N-dimensional vector; θ is an M-dimensional vector (if it's a matrix, just think of each element as a separate parameter).
Express ∂L/∂θ_j in terms of ∂L/∂x_i and ∂g(θ, y)_i/∂θ_j, which is the partial derivative of one of the dimensions of the output of g with respect to one of the dimensions of its inputs. For every j:

\frac{\partial L}{\partial \theta_j} = \sum_i \frac{\partial L}{\partial x_i} \frac{\partial g(\theta, y)_i}{\partial \theta_j}

We can similarly compute gradients for the other input to g, i.e. y.
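The summation rule above can be checked on a tiny concrete instance. Here the choices of g and f are ours for illustration (element-wise product and sum of squares), not anything from the slides:

```python
import numpy as np

# Tiny instance of dL/dtheta_j = sum_i (dL/dx_i)(dg_i/dtheta_j).
# We pick g(theta, y) = theta * y element-wise and L = f(x) = sum(x**2),
# so dL/dx = 2x and dg_i/dtheta_j = y_i when i == j, else 0.
rng = np.random.default_rng(4)
theta, y = rng.normal(size=5), rng.normal(size=5)
x = theta * y
dL_dx = 2 * x
dL_dtheta = dL_dx * y  # the sum over i collapses: dg_i/dtheta_j is diagonal

# Central finite differences agree with the chain-rule result.
eps = 1e-6
num = np.array([(np.sum(((theta + eps * e) * y) ** 2) -
                 np.sum(((theta - eps * e) * y) ** 2)) / (2 * eps)
                for e in np.eye(5)])
assert np.allclose(num, dL_dtheta, atol=1e-5)
```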

Page 31:
AUTOGRAD / BACK-PROPAGATION
Let's say a specific variable had two paths to the loss:

L = f(x, x'); \quad x = g(\theta, y), \; x' = g'(\theta, y')

\frac{\partial L}{\partial \theta_j} = \sum_i \frac{\partial L}{\partial x_i} \frac{\partial g(\theta, y)_i}{\partial \theta_j} + \sum_i \frac{\partial L}{\partial x'_i} \frac{\partial g'(\theta, y')_i}{\partial \theta_j}

Page 32:
AUTOGRAD / BACK-PROPAGATION
Our very own autograd system:
Build a directed computation graph with a (python) list of nodes G = [n1, n2, n3, ...].
Each node is an object that is one of three kinds: Input, Parameter, Operation.
We will define the graph by calling functions that define functional relationships.

import edf

x = edf.Input()
theta = edf.Parameter()
y = edf.matmul(theta,x)
y = edf.tanh(y)
w = edf.Parameter()
y = edf.matmul(w,y)

Page 33:
AUTOGRAD / BACK-PROPAGATION
Each of these statements adds a node to the list of nodes.
Operation nodes are added by matmul, tanh, etc., and are linked to previous nodes that appear before them in the list as inputs.
Every node object is going to have a member element n.top which will be the value of its output:
this can be an arbitrarily shaped array;
for input and parameter nodes, these top values will just be set (or updated by SGD);
for operation nodes, the top values will be computed from the top values of their inputs.
Every operation node will be an object of a class that has a function called forward.
A forward pass will begin with the values of all inputs and parameters set.

import edf

x = edf.Input()
theta = edf.Parameter()
y = edf.matmul(theta,x)
y = edf.tanh(y)
w = edf.Parameter()
y = edf.matmul(w,y)

Page 34:
AUTOGRAD / BACK-PROPAGATION
A forward pass will begin with the values of all inputs and parameters set.
Then we will go through the list of nodes in order, and compute the values of all operation nodes.
Because nodes were added in order, if we go through them in order, the tops of our inputs will be available.

import edf

x = edf.Input()
theta = edf.Parameter()
y = edf.matmul(theta,x)
y = edf.tanh(y)
w = edf.Parameter()
y = edf.matmul(w,y)

Page 35:
AUTOGRAD / BACK-PROPAGATION
Somewhere in the training loop, where the values of the parameters have been set before:

import edf

x = edf.Input()
theta = edf.Parameter()
y = edf.matmul(theta,x)
y = edf.tanh(y)
w = edf.Parameter()
y = edf.matmul(w,y)

x.set(...)
edf.Forward()
print(y.top)

And this will give us the value of the output. But now, we want to compute gradients:
for each operation class, we will also define a function backward;
all operation and parameter nodes will also have an element called grad;
we will have to then back-propagate gradients in order.

Page 36:
AUTOGRAD
Building Our Own Deep Learning Framework
A computation graph that encodes the symbolic relationship between input and output (and loss). Nodes in this graph are:
Values: set from training data.
Params: initialized and updated using SGD.
Operations: computed as functions of their inputs.
Forward computation to compute the loss.
Backward computation to compute gradients for every node.

Page 37:
AUTOGRAD
Code from pset5/mnist.py:

# Inputs and parameters
inp = edf.Value()
lab = edf.Value()
W1 = edf.Param()
B1 = edf.Param()
W2 = edf.Param()
B2 = edf.Param()

# Model
y = edf.matmul(inp,W1)
y = edf.add(y,B1)
y = edf.RELU(y)
y = edf.matmul(y,W2)
y = edf.add(y,B2)  # This is our final prediction

Page 38:
BRIEF DETOUR
What are RELUs?

RELU(x) = max(0, x)

Element-wise non-linear activations.
Previous activations would be sigmoid-like: σ(x) = exp(x)/(1 + exp(x)).
Great when you want a probability, bad when you want to learn by gradient descent:
for both high and low values of x, ∂σ(x)/∂x ≈ 0.
So if you weren't careful, you would end up with high-magnitude activations, and the network stops learning.
Gradient descent is very fragile!

Page 39:
BRIEF DETOUR
What are RELUs?

RELU(x) = max(0, x)

What is ∂RELU(x)/∂x? 0 if x < 0, 1 otherwise.
So your gradients are passed unchanged if the input is positive, but completely attenuated if the input is negative.
So there's still the possibility of your optimization getting stuck, if you reach a point where all inputs to the RELU are negative.
We'll talk about initialization, etc. later.
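The forward/backward behavior described above fits in a few lines (a standalone NumPy sketch, not the actual pset5 edf.RELU implementation):

```python
import numpy as np

def relu(x):
    # RELU(x) = max(0, x), element-wise.
    return np.maximum(0.0, x)

def relu_backward(grad_out, x):
    # Pass upstream gradients unchanged where the input was positive,
    # zero them out where it was negative.
    return grad_out * (x > 0)

x = np.array([-2.0, -0.5, 0.5, 3.0])
assert np.array_equal(relu(x), np.array([0.0, 0.0, 0.5, 3.0]))
assert np.array_equal(relu_backward(np.ones(4), x),
                      np.array([0.0, 0.0, 1.0, 1.0]))
```

If every entry of x is negative, relu_backward returns all zeros, which is exactly the "stuck" failure mode the slide warns about.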

Page 40:
AUTOGRAD
Code from pset5/mnist.py:

# Inputs and parameters
inp = edf.Value()
lab = edf.Value()
W1 = edf.Param()
B1 = edf.Param()
W2 = edf.Param()
B2 = edf.Param()

# Model
y = edf.matmul(inp,W1)
y = edf.add(y,B1)
y = edf.RELU(y)
y = edf.matmul(y,W2)
y = edf.add(y,B2)  # This is our final prediction

Page 41:
AUTOGRAD
Code from pset5/edf.py.
When you construct an object, it just does book-keeping!

ops = []; params = []; values = []
...
class Param:
    def __init__(self):
        params.append(self)
    ...
class Value:
    def __init__(self):
        values.append(self)
    ...
class matmul:
    def __init__(self,x,y):
        ops.append(self)
        self.x = x
        self.y = y

Pages 42-52:
AUTOGRAD (figure-only slides; no recoverable text).

Page 53:
AUTOGRAD
mean will be across a batch.
accuracy is the actual accuracy of hard predictions (not differentiable).

# Inputs and parameters
inp = edf.Value()
lab = edf.Value()
W1 = edf.Param()
B1 = edf.Param()
W2 = edf.Param()
B2 = edf.Param()

y = edf.matmul(inp,W1)
y = edf.add(y,B1)
y = edf.RELU(y)
y = edf.matmul(y,W2)
y = edf.add(y,B2)

loss = edf.smaxloss(y,lab)
loss = edf.mean(loss)

acc = edf.accuracy(y,lab)
acc = edf.mean(acc)

Page 54:
AUTOGRAD
Now let's train this thing!
At the beginning of training, initialize weights randomly:

nHidden = 1024; K = 10
W1.set(xavier((28*28,nHidden)))
B1.set(np.zeros((nHidden)))
W2.set(xavier((nHidden,K)))
B2.set(np.zeros((K)))

In each iteration of training, load data into the values or inputs:

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])
    . . .

What is this set function anyway?

Page 55:
AUTOGRAD
set is the only function that the classes Param and Value have.
It sets a member called top to be an array that holds these values.

class Value:
    def __init__(self):
        values.append(self)

    def set(self,value):
        self.top = np.float32(value).copy()

class Param:
    def __init__(self):
        params.append(self)

    def set(self,value):
        self.top = np.float32(value).copy()

Page 56:
AUTOGRAD
Note that we are loading our input data in batches, as matrices:
inp is BSZ x N;
then we're doing a matmul with W1, which is N x nHidden;
the output will be BSZ x nHidden.
Essentially, we're replacing a vector-matrix multiply for a single sample with a matrix-matrix multiply for a batch of samples.

W1.set(xavier((28*28,nHidden)))
...
B2.set(np.zeros((K)))

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])

Page 57:
AUTOGRAD
And this will work. It will print the loss and accuracy values for the set inputs, given the current values of the parameters.
What is this magical function Forward?

W1.set(xavier((28*28,nHidden)))
...
B2.set(np.zeros((K)))

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])

    edf.Forward()
    print(loss.top,acc.top)

Page 58:
AUTOGRAD
From edf.py:

# Global forward
def Forward():
    for c in ops:
        c.forward()

But the operation classes have their own forward function:

class matmul:
    def __init__(self,x,y):
        ops.append(self)
        self.x = x
        self.y = y

    def forward(self):
        self.top = np.matmul(self.x.top,self.y.top)

. . .
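Only matmul's forward is shown above; its backward isn't reproduced here. The standard rule for top = x @ y would look like the following sketch (our guess at the shape of such a function, not necessarily pset5/edf.py's actual code, checked by finite differences):

```python
import numpy as np

# For top = x @ y with upstream gradient g = dL/dtop:
#   dL/dx = g @ y^T   and   dL/dy = x^T @ g.
rng = np.random.default_rng(5)
x, y = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))
g = np.ones((2, 4))  # dL/dtop when L = sum of all entries of x @ y

gx = g @ y.T   # gradient flowing back to x
gy = x.T @ g   # gradient flowing back to y

# Finite-difference check of one entry of gx.
eps = 1e-6
e = np.zeros_like(x)
e[0, 1] = eps
num = (np.sum((x + e) @ y) - np.sum((x - e) @ y)) / (2 * eps)
assert np.isclose(num, gx[0, 1], atol=1e-5)
```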

Pages 59-67:
AUTOGRAD (figure-only slides; no recoverable text).

Page 68:
AUTOGRAD
So the forward pass computes the loss. But we want to learn the parameters.
The SGD function is pretty simple:

def SGD(lr):
    for p in params:
        p.top = p.top - lr*p.grad

It requires p.grad (gradients with respect to the loss) to be present.
That's what Backward does!

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])

    edf.Forward()
    print(loss.top,acc.top)
    edf.Backward(loss)
    edf.SGD(lr)
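The SGD update above can be seen to work on a standalone toy problem (our example; p here is a plain array standing in for a parameter node's top, with its gradient computed by hand):

```python
import numpy as np

# Repeated SGD steps p <- p - lr * grad on L(p) = 0.5 * ||p||^2,
# whose gradient is simply p. The iterates should shrink toward 0.
p = np.array([4.0, -2.0])
lr = 0.1
for _ in range(100):
    grad = p          # d/dp of 0.5 * p @ p
    p = p - lr * grad

# Each step scales p by (1 - lr) = 0.9, so after 100 steps p is tiny.
assert np.all(np.abs(p) < 1e-3)
```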