
Page 1

CSCI 315: Artificial Intelligence through Deep Learning

W&L Winter Term 2016
Prof. Levy

Introduction to Deep Learning with Theano

Page 2

Why Theano (vs. just NumPy)?

• Recall two main essentials: dot product and activation function derivative.

• Dot product:

$$\mathrm{net}_i = \sum_{j=0}^{n} x_j\, w_{ij}, \qquad x_0 \equiv 1$$

• “Embarrassingly parallel”: since each unit i has its own incoming weights, net_i can be computed independently from / simultaneously with all other units in its layer.

• On an ordinary computer, we (NumPy dot) must compute one net_i after another, sequentially:

net = np.dot(np.append(x, 1), w)

Page 3

Ordinary dot product computation for a layer

[Figure: layer units computing their net inputs one at a time: “First me! Then me!”]

Page 4

Exploiting Parallelism

[Figure: all layer units computing their net inputs simultaneously: “All together now!”]

Page 5

GPU to the Rescue!

• Graphics Processing Unit: Designed for videogames, to exploit the parallelism in pixel-level updates.

• NVIDIA offers CUDA API for programmers, but it's wicked hard – need to track locations of values in memory.

• Theano exploits GPU / CUDA if they're available.

Page 6

GPU: A Multi-threaded architecture

A traditional architecture has one processor, one memory, and one process running at a time:

[Figure: a single CPU connected to Memory through the von Neumann bottleneck]

http://web.eecs.utk.edu/~plank/plank/classes/cs360/360/notes/Memory/lecture.html

Page 7

• A distributed architecture (e.g., Beowulf cluster) has several processors, each with its own memory

• Communication among processors uses message-passing (e.g., MPI)

[Figure: several CPUs, each with its own Memory, joined by a connecting network]

Page 8

• A shared memory architecture allows several processes to access the same memory, either from a single CPU or several CPUs

• Typically, a single process launches several “lightweight processes” called threads, which all share the same heap and global memory with each having its own stack.

• Ideally, each thread runs on its own processor (“core”)

[Figure: Core 1 … Core n all attached to a single Memory (heap / globals)]

NVIDIA Jetson TK1: 192 cores

NVIDIA Jetson TX1: 256 cores

Page 9

Python vs. NumPy vs. Theano

• Dot product in “naive” Python:

• This will be slow, because the interpreter is executing the loop code c += a[k] * b[k] over and over

• Some speedup is likely once the interpreter has compiled your code into a .pyc (bytecode) file.
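(The slide's code isn't reproduced in this transcript; a minimal sketch of the kind of naive loop it refers to, with illustrative names:)

# Naive Python dot product: the interpreter executes the loop body
# c += a[k] * b[k] once per element, every time.
def naive_dot(a, b):
    c = 0.0
    for k in range(len(a)):
        c += a[k] * b[k]
    return c

print(naive_dot([1, 2, 3], [4, 5, 6]))   # 32.0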

Page 10

Python vs. NumPy vs. Theano

• Dot in NumPy: c = np.dot(a, b)

• “Under the hood”: Your arrays a and b are passed to a pre-compiled C program that computes the dot product, typically much faster than you would get with your own code:
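(The slide's illustration isn't reproduced in the transcript; a rough sketch of the kind of comparison it makes, with an arbitrary array size:)

import numpy as np
import timeit

n = 100000
a = np.random.random(n)
b = np.random.random(n)

def naive_dot(a, b):
    c = 0.0
    for k in range(n):
        c += a[k] * b[k]
    return c

# np.dot hands the arrays to pre-compiled C / BLAS code, so it is
# typically orders of magnitude faster than the interpreted loop.
print(timeit.timeit(lambda: naive_dot(a, b), number=10))
print(timeit.timeit(lambda: np.dot(a, b), number=10))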

• Hence, Theano will require us to specify info about types and memory in order to exploit GPU speedup

Page 11

Why Theano (vs. just NumPy)?

• Recall two main essentials: dot product and activation function derivative.

• Activation function derivative:

$$f(x) = \frac{1}{1+e^{-x}}, \qquad \frac{df(x)}{dx} = f'(x) = \frac{e^{x}}{(1+e^{x})^{2}} = f(x)\,(1-f(x))$$

$$f(x) = \tanh(x), \qquad f'(x) = \mathrm{sech}^{2}(x)$$

$$y_i = f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \;\;\text{(softmax)}, \qquad \frac{\partial y_i}{\partial x_j} = \begin{cases} y_i(1-y_i) & \text{if } i = j \\ -\,y_i\, y_j & \text{if } i \neq j \end{cases}$$

• This is called symbolic differentiation and requires us to use our calculus or a special computation tool, case by case. Theano will automate this for us!

Page 12

Theano: Basics*

* from Chapter 3 of Buduma 2015 (first draft manuscript)

Page 13

Theano: Basics

Page 14

Theano has a special class for functions, which allows it to compute stuff efficiently.

Theano: Basics

Page 15

A scalar (single number) is a zero-dimensional tensor. Theano allows us to create it with a name, i.e., a symbol. The d in dscalar (or dvector) means “double-precision” (64-bit float).
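(A minimal sketch of creating such named symbols; the variable names are arbitrary:)

import theano.tensor as T

x = T.dscalar('x')   # zero-dimensional tensor: one double-precision number
v = T.dvector('v')   # one-dimensional tensor of doubles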

Theano: Basics

Page 16

The + and ** operators have been overloaded to work with dscalar objects.

Theano: Basics

Page 17

We build a function f piece by piece. Theano will compile this function for optimized performance (e.g., GPU).
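(The slide's code isn't in the transcript; a minimal sketch of building and compiling such a function, with an arbitrary expression:)

import theano.tensor as T
from theano import function

a = T.dscalar('a')
b = T.dscalar('b')
c = a + b ** 2           # still a symbolic expression, not a number

f = function([a, b], c)  # Theano compiles this (targeting the GPU if configured)
print(f(2.0, 3.0))       # 11.0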

Theano: Basics

Page 18

Theano: Dataflow Graphs

Page 19

Theano: Dataflow Graphs

Page 20

Theano: Dataflow Graphs (Special Note)

This will give you an error in Python 3 because of a Python 2 / Python 3 incompatibility in the pydot library.

You can use theano.printing.debugprint instead:
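(A sketch of the workaround on a small compiled function; the expression is arbitrary:)

import theano
import theano.tensor as T

a = T.dscalar('a')
b = T.dscalar('b')
f = theano.function([a, b], a + b ** 2)

# Text-only view of the compiled dataflow graph; no pydot/graphviz needed.
theano.printing.debugprint(f)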

Page 21

Theano: Shared Variables and Side-Effects

• You will see the keyword shared in many Theano programs.

• It has two meanings:

– Keep the data on the GPU for efficiency

– Allow a function to have state (side effects)

Page 22

Ordinary CUDA in C++: you have to move data into and out of the GPU yourself!

Page 23

Adding State to a Function

A Python class is a set of functions (methods) that share state (instance variables) – so you're already familiar with state!

Page 24

Adding State to a Function

Python even allows us to create a class that behaves like a function with state:
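(A minimal sketch of such a callable class; the class name and the doubling behavior are made up for illustration:)

class Doubler:
    """Behaves like a function, but remembers how many times it was called."""

    def __init__(self):
        self.count = 0       # the state, shared across calls

    def __call__(self, x):
        self.count += 1
        return 2 * x

f = Doubler()
print(f(3), f(10))           # 6 20
print(f.count)               # 2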

Page 25

State in Theano

Example (Buduma Chapter 3): a simple classifier function that keeps count of how many times we've called it:
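(The slide's code isn't in the transcript; a hedged sketch in the same spirit, where the weights w and the 0.5 threshold are made-up illustrations:)

import numpy as np
import theano
import theano.tensor as T

count = theano.shared(0, name='count')              # the function's state
w = theano.shared(np.array([0.5, -0.3]), name='w')  # toy weights

x = T.dvector('x')
prediction = T.nnet.sigmoid(T.dot(x, w)) > 0.5      # a toy classifier

classify = theano.function([x], prediction,
                           updates=[(count, count + 1)])

classify([1.0, 2.0])
classify([0.0, 1.0])
print(count.get_value())                            # 2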

Page 26

Theano: Randomness

• True Random Number Generators (TRNGs, based on physical phenomena) are pretty uncommon!

• So to understand random numbers in Theano, we need to understand how computers simulate randomness algorithmically: pseudo-random number generators.

https://en.wikipedia.org/wiki/Hardware_random_number_generator

Page 27

Linear Congruential Method

• Uses modulus (clock) arithmetic to generate a sequence x

• Simple example:

$$x_0 = 10, \qquad x_n = (7\,x_{n-1} + 1) \bmod 11$$

https://en.wikipedia.org/wiki/Linear_congruential_generator
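(A minimal sketch of this recurrence in plain Python; the function name and defaults are just for illustration:)

# x0 = 10, x_n = (7 * x_{n-1} + 1) mod 11
def lcg(seed=10, a=7, c=1, m=11, count=12):
    x, seq = seed, []
    for _ in range(count):
        x = (a * x + c) % m
        seq.append(x)
    return seq

print(lcg())   # the sequence repeats once it returns to the seed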

Page 28

Random Numbers in NumPy

No seed specified: an arbitrary value, like the current system time in microseconds, is used as the seed.

Explicit seed (7) used: same pattern every time! Do this to debug stochastic (pseudorandom) programs.
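(A minimal sketch of the two cases:)

import numpy as np

print(np.random.random(3))   # unseeded: different numbers each run

np.random.seed(7)
print(np.random.random(3))   # seeded: the same three numbers every run
np.random.seed(7)
print(np.random.random(3))   # identical to the previous line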

Page 29

Thread-Safe Pseudorandoms

• Recall our linear congruential formula for generating random numbers; e.g.:

$$r_0 = 1 \;(\text{seed}), \qquad r_n = (7\,r_{n-1}) \bmod 11 \qquad (\text{multiplier } 7,\ \text{modulus } 11)$$

r = 1, 7, 5, 2, 3, 10, 4, 6, 9, 8, ...

• We'd like to have each of our p processors generate its share of numbers.

• Problem: each processor will produce the same sequence!

Page 30

Thread-Safe Pseudorandoms

“Interleaving” Trick: For p processors,

1. Generate the first p numbers in the sequence: e.g., for p = 2, get 1, 7. These become the seeds for each processor.

2. To get the new multiplier, raise the old multiplier to the power p and take the result mod the modulus: e.g., for p = 2, 7² = 49, and 49 mod 11 = 5. This becomes the multiplier for all processors.
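(A sketch of the trick in plain Python, using the slide's toy generator r_n = (7 r_{n-1}) mod 11 with seed 1 and p = 2; all names are illustrative:)

a, m, p = 7, 11, 2

# Step 1: the first p numbers of the original sequence become the per-processor seeds.
seeds, r = [], 1
for _ in range(p):
    seeds.append(r)
    r = (a * r) % m          # seeds = [1, 7]

# Step 2: the new multiplier is a**p mod m  (7**2 mod 11 = 5).
a_p = pow(a, p, m)

# Each "processor" now generates every p-th element of the original stream.
for i, s in enumerate(seeds):
    stream, r = [], s
    for _ in range(5):
        stream.append(r)
        r = (a_p * r) % m
    print("p%d:" % i, stream)   # p0: [1, 5, 3, 4, 9]   p1: [7, 2, 10, 6, 8]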

Page 31

Thread-Safe Pseudorandoms

$$p_0:\quad r_0 = 1, \qquad r_n = (5\,r_{n-1}) \bmod 11$$

$$p_1:\quad r_0 = 7, \qquad r_n = (5\,r_{n-1}) \bmod 11$$

p0: 1, 5, 3, 4, 9, ...

p1: 7, 2, 10, 6, 8, ...

Page 32

Random Numbers in Theano: Theano Level (Buduma Ch. 3)

Page 33

Random Numbers in Theano: From NumPy (deeplearning.net)
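(The slides' code isn't in the transcript; the corresponding example in the deeplearning.net tutorial looks roughly like this, where the seed and shapes are arbitrary:)

from theano.tensor.shared_randomstreams import RandomStreams
from theano import function

srng = RandomStreams(seed=234)     # arbitrary seed
rv_u = srng.uniform((2, 2))        # 2x2 matrix of uniform draws
rv_n = srng.normal((2, 2))         # 2x2 matrix of normal draws

f = function([], rv_u)                            # fresh numbers on every call
g = function([], rv_n, no_default_updates=True)   # the same numbers on every call

print(f())
print(f())   # different from the call above
print(g())
print(g())   # identical to the call above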

Page 34

Random Numbers in Theano: From NumPy (deeplearning.net)

Page 35

Random Numbers in Theano: From NumPy (deeplearning.net)

Page 36

Page 37

The borrow keyword

• Memory aliasing: when two names are used for the same piece of memory

• Ordinary NumPy example:
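(A minimal sketch of aliasing in plain NumPy:)

import numpy as np

a = np.zeros(3)
b = a             # b is just another name for a's buffer (aliasing)
b[0] = 5.0
print(a)          # [5. 0. 0.] -- changing b changed a

c = a.copy()      # a genuinely separate buffer
c[1] = 7.0
print(a)          # unchanged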

Page 38

The borrow keyword

• “The aggressive reuse of memory is one of the ways through which Theano makes code fast, and it is important for the correctness and speed of your program that you understand how Theano might alias buffers.”*

• “The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.”*

• So what the #@&% does THIS mean:

*http://deeplearning.net/software/theano/tutorial/aliasing.html

Page 39

The borrow keyword

Conclusion:

Page 40

The borrow keyword

Conclusion: Use borrow=True when you want a Theano shared variable to be aliased to (updated along with) the NumPy array from which you created it.
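(A hedged sketch of the difference; whether the aliasing actually shows through depends on the device and dtype, as the deeplearning.net aliasing tutorial explains:)

import numpy as np
import theano

np_array = np.ones(2, dtype='float64')

s_true  = theano.shared(np_array, borrow=True)    # may reuse np_array's buffer
s_false = theano.shared(np_array, borrow=False)   # always copies the data

np_array += 1
print(s_true.get_value())    # may show [2. 2.] (aliased, when running on CPU)
print(s_false.get_value())   # [1. 1.] -- its own private copy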

Page 41

Theano: Computing Derivatives Symbolically

• As we have seen, computing partial derivatives is a necessity for gradient-descent methods

• Consider a simple example:

$$f(x) = \sum_{k=1}^{n} x_k^2$$

• “Vanilla” NumPy code:

fx = np.sum(x**2)

• Let's compute the partial derivative of f(x) with respect to each element x_k …

Page 42

Theano: Computing Derivatives Symbolically

• This looks complicated, but: since each element x_k is independent of the others (unlike softmax), we can compute each element's derivative using ordinary Calc 101:

$$f(x) = x^2, \qquad \frac{df(x)}{dx} = 2x$$

• Scary notation:

$$\nabla f = \left[ \frac{\partial}{\partial x_1}\sum_{k=1}^{n} x_k^2,\;\; \frac{\partial}{\partial x_2}\sum_{k=1}^{n} x_k^2,\;\; \ldots,\;\; \frac{\partial}{\partial x_n}\sum_{k=1}^{n} x_k^2 \right] = \left[\,2x_1,\; 2x_2,\; \ldots,\; 2x_n\,\right]$$

Page 43

Theano: Computing Derivatives Symbolically

Let x = [3, 5, 7]. Then we expect to see [6, 10, 14] for the derivative:

$$f(x) = x^2, \qquad \frac{df(x)}{dx} = 2x$$
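(A minimal sketch of asking Theano for that gradient; the variable names are arbitrary:)

import theano
import theano.tensor as T

x = T.dvector('x')
fx = T.sum(x ** 2)
gx = T.grad(fx, x)               # symbolic gradient, derived by Theano

grad = theano.function([x], gx)
print(grad([3.0, 5.0, 7.0]))     # expect [ 6. 10. 14.]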

Page 44

Theano: Computing Derivatives in a Real Network

Let's look at a real example, from our Logistic Regression network. First, the logistic regression code (abbreviated):
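(The code itself isn't in the transcript; a hedged sketch in the spirit of the deeplearning.net logistic-regression example, where the layer sizes and variable names are illustrative:)

import numpy as np
import theano
import theano.tensor as T

n_in, n_out = 784, 10                       # e.g., MNIST pixels -> 10 digit classes

x = T.dmatrix('x')                          # a minibatch of inputs
y = T.ivector('y')                          # the correct class labels

W = theano.shared(np.zeros((n_in, n_out)), name='W')
b = theano.shared(np.zeros(n_out), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
y_pred = T.argmax(p_y_given_x, axis=1)

# Negative log-likelihood of the correct labels: the cost we will differentiate.
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])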

Page 45

Page 46

Page 47

Theano: Multi-Layer Networks

Page 48

Page 49

Theano: Where's the Back-Prop?
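(Short answer: it's implicit. T.grad differentiates the whole dataflow graph, so the chain rule / backward pass never has to be written by hand. A hedged sketch with a made-up two-layer network; sizes, names, and the learning rate are illustrative:)

import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix('x')
y = T.ivector('y')

W1 = theano.shared(np.random.randn(784, 100) * 0.01, name='W1')
b1 = theano.shared(np.zeros(100), name='b1')
W2 = theano.shared(np.zeros((100, 10)), name='W2')
b2 = theano.shared(np.zeros(10), name='b2')

hidden = T.tanh(T.dot(x, W1) + b1)
p_y = T.nnet.softmax(T.dot(hidden, W2) + b2)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

# This is where the "back-prop" happens: one symbolic gradient per parameter.
params = [W1, b1, W2, b2]
grads = [T.grad(cost, p) for p in params]

lr = 0.01
train = theano.function([x, y], cost,
                        updates=[(p, p - lr * g) for p, g in zip(params, grads)])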

Page 50

Training Details

Recall from previous lecture:

Page 51

Page 52