TRANSCRIPT
CSCI 315: Artificial Intelligence through Deep Learning
W&L Winter Term 2016, Prof. Levy
Introduction to Deep Learning with Theano
Why Theano (vs. just NumPy)?
• Recall two main essentials: dot product and activation function derivative.
• Dot product: net_i = Σ_{j=0..n} x_j w_ij, with x_0 ≡ 1 (the bias input)
• “Embarrassingly parallel”: since each unit i has its own incoming weights, net_i can be computed independently from / simultaneously with all other units in its layer.
• On an ordinary computer, we (NumPy dot) must compute one net_i after another, sequentially.
net = np.dot(np.append(x, 1), w)
Ordinary dot product computation for a layer
First me! Then me!
Exploiting Parallelism
All together now!
GPU to the Rescue!
• Graphics Processing Unit: Designed for videogames, to exploit the parallelism in pixel-level updates.
• NVIDIA offers CUDA API for programmers, but it's wicked hard – need to track locations of values in memory.
• Theano exploits GPU / CUDA if they're available.
GPU: A Multi-threaded Architecture
A traditional architecture has one processor, one memory, one process at a time:
[Diagram: CPU ↔ Memory, the “Von Neumann Bottleneck”]
http://web.eecs.utk.edu/~plank/plank/classes/cs360/360/notes/Memory/lecture.html
• A distributed architecture (e.g., Beowulf cluster) has several processors, each with its own memory
• Communication among processors uses message-passing (e.g., MPI)
[Diagram: several CPUs, each with its own Memory, linked by a Connecting Network]
• A shared memory architecture allows several processes to access the same memory, either from a single CPU or several CPUs
• Typically, a single process launches several “lightweight processes” called threads, which all share the same heap and global memory with each having its own stack.
• Ideally, each thread runs on its own processor (“core”)
[Diagram: Core 1 … Core n, all sharing one Memory (Heap / Globals)]
NVIDIA Jetson TK1: 192 cores
NVIDIA Jetson TX1: 256 cores
Python vs. NumPy vs. Theano
• Dot product in “naive” Python:
• This will be slow, because the interpreter executes the loop body c += a[k] * b[k] over and over.
• Some speedup is likely once the interpreter has compiled your code into a .pyc (bytecode) file.
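The slide's code is not reproduced in the transcript; a minimal sketch of the naive loop, using the names a, b, c from the bullet above (the function name naive_dot is ours, not the slide's):

```python
# Naive dot product in pure Python: the interpreter executes
# the loop body c += a[k] * b[k] once per element.
def naive_dot(a, b):
    c = 0
    for k in range(len(a)):
        c += a[k] * b[k]
    return c

print(naive_dot([1, 2, 3], [4, 5, 6]))  # 32
```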
Python vs. NumPy vs. Theano
• Dot in NumPy: c = np.dot(a, b)
• “Under the hood”: Your arrays a and b are passed to a pre-compiled C program that computes the dot product, typically much faster than you would get with your own code:
• Hence, Theano will require us to specify info about types and memory in order to exploit GPU speedup
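As a quick sanity check (a sketch, not from the slides): the pre-compiled NumPy routine and the interpreted loop compute the same arithmetic, NumPy just does it far faster on large arrays:

```python
import numpy as np

a = np.arange(5.0)   # [0, 1, 2, 3, 4]
b = np.ones(5)

# Pre-compiled C under the hood:
c_numpy = np.dot(a, b)

# Same arithmetic, interpreted loop:
c_loop = 0.0
for k in range(len(a)):
    c_loop += a[k] * b[k]

print(c_numpy, c_loop)  # 10.0 10.0
```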
Why Theano (vs. just NumPy)?
• Recall two main essentials: dot product and activation function derivative.
• Activation function derivatives:

  Logistic: f(x) = 1 / (1 + e^(−x)),  f′(x) = e^x / (1 + e^x)² = f(x)(1 − f(x))

  Tanh: f(x) = tanh(x),  f′(x) = sech²(x)

  Softmax*: y_i = f(x_i) = e^(x_i) / Σ_j e^(x_j)

• This is called symbolic differentiation and requires us to use our calculus or a special computation tool, case by case. Theano will automate this for us!

  * ∂y_i / ∂x_j = y_i (1 − y_i) if i = j,  −y_i y_j if i ≠ j
Theano: Basics*
* from Chapter 3 of Buduma 2015 (first draft manuscript)
Theano: Basics
Theano has a special class for functions, which allows it to compute stuff efficiently.
Theano: Basics
A scalar (single number) is a zero-dimensional tensor. Theano allows us to create it with a name, i.e., a symbol. The d in dscalar / dvector means “double precision” (64 bits).
Theano: Basics
The + and ** operators have been overloaded to work with dscalar objects.
Theano: Basics
We build a function f piece by piece. Theano will compile this function for optimized performance (e.g., GPU).
Theano: Dataflow Graphs
Theano: Dataflow Graphs(Special Note)
This will give you an error in Python 3 because of a Python 2 / Python 3 incompatibility in the pydot library.
You can use theano.printing.debugprint instead:
Theano: Shared Variables and Side-Effects
• You will see the keyword shared in many Theano programs.
• It has two meanings:
–Keep the data in the GPU for efficiency
–Allow a function to have state (side-effects)
Ordinary CUDA in C++: have to move data in/out of GPU yourself!
Adding State to a Function
A Python class is a set of functions (methods) that share a state (instance variables) – so you're already familiar with state!
Adding State to a Function
Python even allows us to create a class that behaves like a function with state:
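For instance (a sketch not reproduced on the slide; the class name is ours), a class with a __call__ method acts as a function that remembers how many times it has been invoked, in the spirit of the Theano counter example that follows:

```python
class CountingDoubler:
    """A 'function with state': doubles its input and counts calls."""

    def __init__(self):
        self.count = 0      # state shared across calls

    def __call__(self, x):
        self.count += 1     # side-effect: update state
        return 2 * x

f = CountingDoubler()
print(f(10), f(21))   # 20 42
print(f.count)        # 2
```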
State in Theano
Example (Buduma Chapter 3): a simple classifier function that keeps count of how many times we've called it:
Theano: Randomness
• True Random Number Generators (based on physical phenomena) are pretty uncommon!
• So to understand random numbers in Theano, we need to understand how computers simulate randomness algorithmically: pseudo-random number generators.
[Image: a hardware TRNG]
https://en.wikipedia.org/wiki/Hardware_random_number_generator
Linear Congruential Method
• Uses modulus (clock) arithmetic to generate a sequence x
• Simple example:
  x_0 = 10
  x_n = (7 x_{n−1} + 1) mod 11
https://en.wikipedia.org/wiki/Linear_congruential_generator
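This generator is a few lines of Python (the constants 7, 1, 11 are the slide's toy example, far too small for real use; the function name lcg is ours):

```python
def lcg(x0, a, c, m, n):
    """Linear congruential generator: x_n = (a * x_{n-1} + c) mod m."""
    xs = []
    x = x0
    for _ in range(n):
        x = (a * x + c) % m
        xs.append(x)
    return xs

# The slide's example: x_0 = 10, x_n = (7 x_{n-1} + 1) mod 11
print(lcg(10, 7, 1, 11, 10))  # [5, 3, 0, 1, 8, 2, 4, 7, 6, 10]
```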
Random Numbers in NumPy
No seed specified: an arbitrary value, like the current system time in microseconds, is used as the seed.
Explicit seed (7) used: same pattern every time! Do this to debug stochastic (pseudorandom) programs.
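The reproducibility point can be demonstrated directly (a sketch, not the slide's code):

```python
import numpy as np

np.random.seed(7)           # explicit seed
first = np.random.rand(3)

np.random.seed(7)           # same seed again ...
second = np.random.rand(3)  # ... same "random" values!

print(np.allclose(first, second))  # True
```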
Thread-Safe Pseudorandoms
• Recall our linear congruential formula for generating random numbers; e.g.:
  r_0 = 1   (the seed)
  r_n = (7 r_{n−1}) mod 11   (7 is the multiplier, 11 the modulus)
  r = 1, 7, 5, 2, 3, 10, 4, 6, 9, 8, ...
• We'd like to have each of our p processors generate its share of numbers.
• Problem: each processor will produce the same sequence!
Thread-Safe Pseudorandoms
“Interleaving” Trick: For p processors,
1. Generate the first p numbers in the sequence: e.g., for p = 2, get 1, 7. These become the seeds for each processor.
2. To get the new multiplier, raise the old multiplier to the power p and take the result mod the modulus: e.g., for p = 2, 7² = 49; 49 mod 11 = 5. This becomes the multiplier for all processors.
Thread-Safe Pseudorandoms
p0:  r_0 = 1,  r_n = (5 r_{n−1}) mod 11
p1:  r_0 = 7,  r_n = (5 r_{n−1}) mod 11

p0: 1, 5, 3, 4, 9, ...
p1: 7, 2, 10, 6, 8, ...
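The trick is easy to verify in plain Python, using the slide's numbers (multiplier 7, modulus 11, p = 2; the helper name lcg_seq is ours):

```python
def lcg_seq(seed, mult, mod, n):
    """Generate n values of r_n = (mult * r_{n-1}) mod mod, after the seed."""
    xs, x = [], seed
    for _ in range(n):
        x = (mult * x) % mod
        xs.append(x)
    return xs

mult, mod, p = 7, 11, 2

# Step 1: the first p numbers of the original sequence become the seeds.
seeds = [1, 7]                    # r = 1, 7, ...

# Step 2: the new multiplier is the old one raised to p, mod the modulus.
new_mult = (mult ** p) % mod      # 7^2 = 49; 49 mod 11 = 5

p0 = [seeds[0]] + lcg_seq(seeds[0], new_mult, mod, 4)  # [1, 5, 3, 4, 9]
p1 = [seeds[1]] + lcg_seq(seeds[1], new_mult, mod, 4)  # [7, 2, 10, 6, 8]

# Interleaving the two streams recovers the original sequence:
merged = [x for pair in zip(p0, p1) for x in pair]
print(merged)  # [1, 7, 5, 2, 3, 10, 4, 6, 9, 8]
```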
Random Numbers in Theano: Theano Level (Buduma Ch. 3)
Random Numbers in Theano: From NumPy (deeplearning.net)
The borrow keyword
• Memory aliasing: when two names are used for the same piece of memory
• Ordinary NumPy Example:
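The slide's NumPy example is not in the transcript, but aliasing is easy to demonstrate (a sketch): plain assignment and slicing both create another name for the same memory:

```python
import numpy as np

a = np.zeros(3)
b = a            # b is an alias: same buffer, two names
v = a[1:]        # slicing also aliases: v is a *view* into a

b[0] = 7.0       # writing through one name ...
v[0] = 9.0       # ... or through a view ...
print(a)         # [7. 9. 0.] -- both writes visible through a
```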
The borrow keyword
• The aggressive reuse of memory is one of the ways through which Theano makes code fast, and it is important for the correctness and speed of your program that you understand how Theano might alias buffers.*
• The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.*
• So what the #@&% does THIS mean:
*http://deeplearning.net/software/theano/tutorial/aliasing.html
The borrow keyword
Conclusion: Use borrow=True when you want a Theano shared variable to be aliased to (updated along with) the NumPy array from which you created it.
Theano: Computing Derivatives Symbolically
• As we have seen, computing partial derivatives is a necessity for gradient-descent methods
• Consider a simple example:
  f(x) = Σ_{k=1..n} x_k²
• “Vanilla” NumPy code:
  fx = np.sum(x**2)
• Let's compute the partial derivative of f(x) w.r.t. each element x_k ...
Theano: Computing Derivatives Symbolically
• This looks complicated, but: since each element x_k is independent of the others (unlike softmax), we can compute each element's derivative using ordinary Calc 101:
  f(x) = x²  ⇒  df(x)/dx = 2x
• Scary notation:
  ∇f = [ ∂/∂x_1 Σ_{k}^{n} x_k²,  ∂/∂x_2 Σ_{k}^{n} x_k²,  ...,  ∂/∂x_n Σ_{k}^{n} x_k² ]
Theano: Computing Derivatives Symbolically
Let x = [3, 5, 7]. Then, since f(x) = x² gives df(x)/dx = 2x, we expect to see [6, 10, 14] for the derivative.
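The Theano code for this slide is not in the transcript, but we can sanity-check the expected answer numerically (a sketch, independent of Theano): a central finite-difference approximation of each ∂f/∂x_k for f(x) = Σ x_k² should match 2 x_k:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)

x = np.array([3.0, 5.0, 7.0])
eps = 1e-6

# Central-difference approximation of each partial derivative
grad = np.zeros_like(x)
for k in range(len(x)):
    e = np.zeros_like(x)
    e[k] = eps
    grad[k] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.round(grad, 4))  # [ 6. 10. 14.] -- i.e., 2 * x
```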
Theano: Computing Derivatives in a Real Network
Let's look at a real example, from our Logistic Regression network. First, the logistic regression code (abbreviated):
Theano: Multi-Layer Networks
Theano: Where's the Back-Prop?
Training Details
Recall from previous lecture: