
CSCI 315: Artificial Intelligence through Deep Learning

W&L Winter Term 2016, Prof. Levy

Introduction to Deep Learning with Theano

Why Theano (vs. just NumPy)?

• Recall two main essentials: dot product and activation function derivative.

• Dot product:

$\mathrm{net}_i = \sum_{j=0}^{n} x_j w_{ij}, \qquad x_0 \equiv 1$

• “Embarrassingly parallel”: since each unit i has its own incoming weights, net_i can be computed independently from / simultaneously with all other units in its layer.

• On an ordinary computer, we (NumPy dot) must compute one net_i after another, sequentially.

net = np.dot(np.append(x, 1), w)

Ordinary dot product computation for a layer
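For concreteness, here is a small sketch (array sizes are illustrative, not from the slides) of the shapes involved in that one-line layer computation:

import numpy as np

n_inputs, n_units = 3, 4
x = np.random.random(n_inputs)                   # one input pattern
w = np.random.random((n_inputs + 1, n_units))    # +1 row for the bias input

net = np.dot(np.append(x, 1), w)   # all net_i for the layer in one call
print(net.shape)                   # (4,) -- but still computed sequentially on a CPU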

First me! Then me!

Exploiting Parallelism

All together now!

GPU to the Rescue!

• Graphics Processing Unit: designed for video games, to exploit the parallelism in pixel-level updates.

• NVIDIA offers the CUDA API for programmers, but it's wicked hard: you need to track the locations of values in memory yourself.

• Theano exploits GPU / CUDA if they're available.

GPU: A Multi-threaded architecture

A traditional architecture has one processor, one memory, and one process at a time:

[Diagram: a CPU connected to Memory through the von Neumann bottleneck]

http://web.eecs.utk.edu/~plank/plank/classes/cs360/360/notes/Memory/lecture.html

• A distributed architecture (e.g., a Beowulf cluster) has several processors, each with its own memory.

• Communication among processors uses message-passing (e.g., MPI).

[Diagram: CPU … CPU, each with its own Memory, joined by a connecting network]

• A shared memory architecture allows several processes to access the same memory, either from a single CPU or several CPUs

• Typically, a single process launches several “lightweight processes” called threads, which all share the same heap and global memory with each having its own stack.

• Ideally, each thread runs on its own processor (“core”)

[Diagram: Core 1 … Core n, all sharing one Memory (heap / globals)]

NVIDIA Jetson TK1: 192 cores

NVIDIA Jetson TX1: 256 cores

Python vs. NumPy vs. Theano

• Dot product in “naive” Python (see the sketch after this list):

• This will be slow, because the interpreter is executing the loop code c += a[k] * b[k] over and over

• Some speedup is likely once the interpreter has compiled your code into a .pyc (bytecode) file.
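The slide's code isn't reproduced in this transcript; the following is a sketch of the kind of naive loop those bullets describe (the function name naive_dot is illustrative):

def naive_dot(a, b):
    c = 0
    for k in range(len(a)):
        c += a[k] * b[k]     # interpreted over and over -- slow
    return c

print(naive_dot([1, 2, 3], [4, 5, 6]))   # 32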

Python vs. NumPy vs. Theano

• Dot in NumPy: c = np.dot(a, b)

• “Under the hood”: Your arrays a and b are passed to pre-compiled C code that computes the dot product, typically much faster than you would get with your own code (see the timing sketch after this list):

• Hence, Theano will require us to specify info about types and memory in order to exploit GPU speedup
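A hedged sketch of how one might time the two approaches (timings will vary by machine; naive_dot mirrors the sketch above):

import timeit
import numpy as np

def naive_dot(a, b):
    c = 0.0
    for k in range(len(a)):
        c += a[k] * b[k]
    return c

a = np.random.random(100000)
b = np.random.random(100000)

print(timeit.timeit(lambda: naive_dot(a, b), number=10))   # pure-Python loop
print(timeit.timeit(lambda: np.dot(a, b), number=10))      # typically far smaller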

Why Theano (vs. just NumPy)?

• Recall two main essentials: dot product and activation function derivative.

• Activation function derivative:

$f(x) = \dfrac{1}{1+e^{-x}}, \qquad \dfrac{df(x)}{dx} = f'(x) = \dfrac{e^{x}}{(1+e^{x})^2} = f(x)\,(1-f(x))$

$f(x) = \tanh(x), \qquad f'(x) = \mathrm{sech}^2(x)$

Softmax: $y_i = f(x_i) = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$

• This is called symbolic differentiation, and it requires us to use our calculus (or a special computation tool) case by case. Theano will automate this for us!

$\dfrac{\partial y_i}{\partial x_j} = \begin{cases} y_i\,(1-y_i) & \text{if } i = j \\ -\,y_i\,y_j & \text{if } i \neq j \end{cases}$
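A quick numerical sanity check of the sigmoid identity above (not from the slides), comparing a finite-difference estimate with $f(x)(1-f(x))$:

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.5, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # finite-difference derivative
analytic = f(x) * (1 - f(x))                # the identity above
print(numeric, analytic)                    # nearly identical values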

Theano: Basics*

* from Chapter 3 of Buduma 2015 (first draft manuscript)

Theano: Basics

Theano has a special class for functions, which allows it to compute stuff efficiently.

Theano: Basics

A scalar (a single number) is a zero-dimensional tensor. Theano allows us to create it with a name, i.e., a symbol. The d in dscalar (and dvector) means “double-precision” (64 bits).

Theano: Basics

The + and ** operators have been overloaded to work with dscalar objects.

Theano: Basics

We build a function f piece by piece. Theano will compile this function for optimized performance (e.g., GPU).
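A minimal sketch of the kind of code these “Theano: Basics” slides show (the variable names and the expression are illustrative, not the slides' exact code):

import theano
import theano.tensor as T

# Declare symbolic variables: 64-bit ("double") scalars, each with a name
a = T.dscalar('a')
b = T.dscalar('b')

# + and ** are overloaded for symbolic tensors, so this builds an
# expression graph rather than computing a number right away
expr = a**2 + b**2 + 2*a*b

# Compile the graph into a callable, optimized function
f = theano.function(inputs=[a, b], outputs=expr)

print(f(1.0, 2.0))   # 9.0, i.e., (1 + 2)**2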


Theano: Dataflow Graphs


Theano: Dataflow Graphs (Special Note)

This will give you an error in Python 3 because of a Python 2 / Python 3 incompatibility in the pydot library.

You can use theano.printing.debugprint instead:
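A hedged sketch of both options (the function f and the output filename are illustrative):

import theano
import theano.tensor as T

a = T.dscalar('a')
b = T.dscalar('b')
f = theano.function([a, b], a**2 + b**2)

# Graphical version of the dataflow graph (needs pydot; may hit the
# Python 2/3 issue mentioned above):
theano.printing.pydotprint(f, outfile='f_graph.png')

# Text version of the graph, which avoids pydot entirely:
theano.printing.debugprint(f)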

Theano: Shared Variables and Side-Effects

• You will see the keyword shared in many Theano programs.

• It has two meanings:

– Keep the data in the GPU for efficiency

– Allow a function to have state (side-effects)

Ordinary CUDA in C++: you have to move data into and out of the GPU yourself!

Adding State to a Function

A Python class is a set of functions (methods) that share a state (instance variables) – so you're already familiar with state!

Adding State to a Function

Python even allows us to create a class that behaves like a function with state:
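For example (plain Python, not from the slides; the class name is illustrative):

class Accumulator:
    def __init__(self):
        self.total = 0          # the state, shared across calls

    def __call__(self, amount):
        self.total += amount    # side-effect: update the state
        return self.total

acc = Accumulator()
print(acc(10))   # 10
print(acc(5))    # 15 -- the object "remembers" the previous call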

State in Theano

Example (Buduma Chapter 3): a simple classifier function that keeps count of how many times we've called it:
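The slide's code isn't reproduced here; the following is a hedged sketch of the same idea (illustrative names and weights), using a shared counter updated via the updates argument:

import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
w = theano.shared(np.array([0.5, -0.3]), name='w')   # toy "weights"
count = theano.shared(0, name='count')                # call counter

# Classify by the sign of the dot product; bump the counter as a side-effect
classify = theano.function(
    inputs=[x],
    outputs=T.dot(x, w) > 0,
    updates=[(count, count + 1)])

print(classify([1.0, 1.0]), classify([0.0, 1.0]))
print(count.get_value())   # 2 -- the function has been called twice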

Theano: Randomness

• True Random Number Generators (based on physical phenomena) are pretty uncommon!

• So to understand random numbers in Theano, we need to understand how computers simulate randomness algorithmically: pseudo-random number generators.

https://en.wikipedia.org/wiki/Hardware_random_number_generator

[Image: a hardware true random number generator (TRNG)]

Linear Congruential Method

• Uses modulus (clock) arithmetic to generate a sequence x.

• Simple example (a runnable sketch follows the link below):

$x_0 = 10, \qquad x_n = (7\,x_{n-1} + 1) \bmod 11$

https://en.wikipedia.org/wiki/Linear_congruential_generator
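A minimal runnable sketch of that generator (the helper name lcg is illustrative, not from the slides):

def lcg(seed=10, a=7, c=1, m=11):
    x = seed
    while True:
        yield x
        x = (a * x + c) % m

gen = lcg()
print([next(gen) for _ in range(11)])
# [10, 5, 3, 0, 1, 8, 2, 4, 7, 6, 10] -- the sequence repeats after 10 values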

Random Numbers in NumPy

No seed specified: an arbitrary value, like the current system time in microseconds, is used as the seed.

Explicit seed (7) used: same pattern every time! Do this to debug stochastic (pseudorandom) programs.
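A sketch of the behavior described above (not the slide's exact code):

import numpy as np

print(np.random.random(3))   # different values on every run (seed taken from the clock)

np.random.seed(7)            # explicit seed: reproducible
print(np.random.random(3))   # the same three values every time the program runs
np.random.seed(7)
print(np.random.random(3))   # ... and resetting the seed repeats them exactly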

Thread-Safe Pseudorandoms

• Recall our linear congruential formula for generating random numbers; e.g.:

$r_0 = 1 \ (\text{seed}), \qquad r_n = (7\,r_{n-1}) \bmod 11 \quad (\text{multiplier } 7,\ \text{modulus } 11)$

r = 1, 7, 5, 2, 3, 10, 4, 6, 9, 8, ...

• We'd like to have each of our p processors generate its share of the numbers.

• Problem: each processor will produce the same sequence!

Thread-Safe Pseudorandoms

“Interleaving” Trick: For p processors,

1. Generate the first p numbers in the sequence: e.g., for p = 2, get 1, 7. These become the seeds for each processor.

2. To get the new multiplier, raise the old multiplier to the power p and mod the result with the modulus: e.g., for p = 2, $7^2 = 49$; $49 \bmod 11 = 5$. This becomes the multiplier for all processors.

Thread-Safe Pseudorandoms

$p_0:\quad r_0 = 1, \qquad r_n = (5\,r_{n-1}) \bmod 11$

$p_1:\quad r_0 = 7, \qquad r_n = (5\,r_{n-1}) \bmod 11$

p0: 1, 5, 3, 4, 9, ...

p1: 7, 2, 10, 6, 8, ...
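Below is a hedged sketch of the two steps above with the toy generator (multiplier 7, modulus 11, seed 1, p = 2 "processors"); the helper name lcg_stream is illustrative:

def lcg_stream(seed, a, m, count):
    seq, x = [], seed
    for _ in range(count):
        seq.append(x)
        x = (a * x) % m
    return seq

p, a, m = 2, 7, 11
seeds = lcg_stream(1, a, m, p)   # first p numbers of the original sequence: [1, 7]
a_p = pow(a, p, m)               # new multiplier: 7**2 mod 11 = 5

streams = [lcg_stream(s, a_p, m, 5) for s in seeds]
print(streams[0])   # p0: [1, 5, 3, 4, 9]
print(streams[1])   # p1: [7, 2, 10, 6, 8]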

Random Numbers in Theano: Theano Level (Buduma Ch. 3)

Random Numbers in Theano: From NumPy (deeplearning.net)

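The code from these slides isn't reproduced in this transcript; the following is a hedged sketch in the spirit of the deeplearning.net tutorial they cite (the seed and shape are illustrative):

import theano
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=234)   # seeded, like NumPy's generator
rv_u = srng.uniform((2, 2))      # symbolic 2x2 matrix of uniform draws

f = theano.function([], rv_u)
print(f())   # new pseudorandom values on each call
print(f())   # ... different from the first call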

The borrow keyword

• Memory aliasing: when two names are used for the same piece of memory.

• Ordinary NumPy example:
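A sketch of the kind of NumPy aliasing example the slide refers to (the names a, b, c are illustrative):

import numpy as np

a = np.zeros(5)
b = a              # b is another name for the SAME buffer (no copy)
b[0] = 99
print(a)           # [99.  0.  0.  0.  0.] -- changing b changed a too

c = a.copy()       # an explicit copy gets its own memory
c[1] = 42
print(a)           # unchanged: [99.  0.  0.  0.  0.]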

The borrow keyword

• The aggressive reuse of memory is one of the ways through which Theano makes code fast, and it is important for the correctness and speed of your program that you understand how Theano might alias buffers.*

• The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.*

• So what the #@&% does THIS mean:

*http://deeplearning.net/software/theano/tutorial/aliasing.html


The borrow keyword

Conclusion: Use borrow=True when you want a Theano shared variable to be aliased to (updated along with) the NumPy array from which you created it.
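A minimal sketch of that conclusion in code (assuming CPU execution, where the shared variable can truly share the NumPy buffer; on a GPU the data is copied to device memory regardless):

import numpy as np
import theano

np_array = np.ones(3, dtype='float64')
s = theano.shared(np_array, borrow=True)   # s may alias np_array's memory

np_array[0] = 7.0                          # modify the NumPy array in place
print(s.get_value(borrow=True))            # on the CPU, reflects the change: [7. 1. 1.]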

Theano: Computing Derivatives Symbolically

• As we have seen, computing partial derivatives is a necessity for gradient-descent methods

• Consider a simple example:

$f(x) = \sum_{k=1}^{n} x_k^2$

• “Vanilla” NumPy code:

fx = np.sum(x**2)

• Let's compute the partial derivative of $f(x)$ w.r.t. each element $x_k$ ….

Theano: Computing Derivatives Symbolically

• This looks complicated, but since each element $x_k$ is independent of the others (unlike softmax), we can compute each element's derivative using ordinary Calc 101:

$\left[ \dfrac{\partial}{\partial x_1}\sum_{k=1}^{n} x_k^2,\;\; \dfrac{\partial}{\partial x_2}\sum_{k=1}^{n} x_k^2,\;\; \ldots,\;\; \dfrac{\partial}{\partial x_n}\sum_{k=1}^{n} x_k^2 \right]$

• Scary notation: $\nabla f = \left[ \dfrac{\partial f}{\partial x_1},\;\; \dfrac{\partial f}{\partial x_2},\;\; \ldots,\;\; \dfrac{\partial f}{\partial x_n} \right]$

• Since $f(x) = x^2 \Rightarrow \dfrac{df(x)}{dx} = 2x$ for each term, the k-th component is just $2x_k$.

Theano: Computing Derivatives Symbolically

Let x = [3, 5, 7]. Then, since $f(x) = x^2 \Rightarrow \dfrac{df(x)}{dx} = 2x$ componentwise, we expect to see [6, 10, 14] for the derivative:
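A hedged sketch of what the slide's Theano code presumably does (variable names are illustrative): let T.grad compute the gradient of $f(x) = \sum_k x_k^2$ symbolically.

import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
fx = T.sum(x ** 2)            # f(x) = sum of squares
gx = T.grad(fx, wrt=x)        # symbolic gradient: 2*x

grad_f = theano.function([x], gx)
print(grad_f(np.array([3.0, 5.0, 7.0])))   # [ 6. 10. 14.]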

Theano: Computing Derivatives in a Real Network

Let's look at a real example from our logistic regression network. First, the logistic regression code (abbreviated):
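The course's actual code isn't reproduced in this transcript; the following hedged sketch (illustrative names and sizes) shows the general shape of such a Theano logistic-regression setup, and why the next slides can ask "Where's the back-prop?": T.grad produces the gradients symbolically, so no hand-written back-propagation is needed.

import numpy as np
import theano
import theano.tensor as T

n_in, n_out, lr = 784, 10, 0.1           # illustrative sizes and learning rate

X = T.dmatrix('X')                        # minibatch of inputs
y = T.ivector('y')                        # integer class labels

W = theano.shared(np.zeros((n_in, n_out)), name='W')
b = theano.shared(np.zeros(n_out), name='b')

p_y = T.nnet.softmax(T.dot(X, W) + b)     # class probabilities
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])   # negative log-likelihood

gW, gb = T.grad(cost, [W, b])             # symbolic gradients w.r.t. the parameters
train = theano.function(
    inputs=[X, y],
    outputs=cost,
    updates=[(W, W - lr * gW), (b, b - lr * gb)])   # gradient-descent step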

Theano: Multi-Layer Networks

Theano: Where's the Back-Prop?

Training Details

Recall from previous lecture:
