TRANSCRIPT
Random Projection Neural Networks: Algorithms & Hardware
Jonathan Tapson, Arindam Basu
Telluride 2015
All of Machine Learning in 2 Slides
• We generally use machine learning for two purposes: given input data X and corresponding output data Y, find a function of X that performs either
  – Classification: classify each sample by finding a separating hyperplane, so that y ∈ {1, …, N} for N classes
  – Regression: find a function y = f(x) that minimises some error condition
• If there is a linear solution, we don't need machine learning
• For a nonlinear solution, we use a machine learning method that:
  – Projects the data into a higher dimensional space
  – The projection must be nonlinear, so as to create separations which did not exist in the original space
  – We can then solve using a linear solution in the higher dimensional space
Machine Learning
[Figure: 2-D data (x, y) projected into a higher dimensional space by adding a z dimension non-linearly, e.g. z = x²]
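As a minimal sketch of this idea (the toy data and the feature map z = x² are illustrative assumptions, not from the slides), lifting 1-D data that no single threshold can separate into 2-D makes a linear decision boundary sufficient:

```python
import numpy as np

# Toy 1-D data: class 1 lies in the middle, class 0 on both sides,
# so no single threshold on x can separate the classes.
x = np.array([-2.0, -1.5, -0.25, 0.0, 0.3, 1.4, 2.1])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Nonlinear projection into a higher dimensional space: add z = x^2.
# In (x, z) the middle class sits below z = 1, the outer class above it.
features = np.column_stack([x, x**2])

# A linear decision boundary in the lifted space: z < 1  =>  class 1.
# (Weights chosen by inspection here; a perceptron or least-squares fit
# would find an equivalent separating hyperplane.)
w, b = np.array([0.0, -1.0]), 1.0
predictions = (features @ w + b > 0).astype(int)
print(predictions)   # matches y: the lifted data is linearly separable
```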
Machine Learning
[Figure: find a separating hyperplane in the higher dimensional (x, y, z) space]
Neural Networks
• Single layer (perceptron): linear classification only; easy to train
• Multi-layer perceptron: universal approximator; difficult to train
• Support vector machines (SVM): maximise the margin of classification
• Combination of weak classifiers: voting, AdaBoost
• Random projection based neural networks: universal approximator; fast training; easy to implement (?)
Random Projection Neural Networks-I
• Large number of randomly weighted connections: not trained.
• Only the output weights are trained (linear decoding): very fast training.
• Examples: Extreme Learning Machine (ELM), Neural Engineering Framework (NEF)
• Input dimension: d; hidden layer dimension: L (L ≫ d)
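A minimal sketch of the forward pass (illustrative only; the Gaussian random weights and sigmoid nonlinearity are assumptions, since the slides do not fix them):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 10, 200                      # input dimension d, hidden dimension L >> d

# First layer: random, fixed, never trained.
W = rng.standard_normal((L, d))     # random input weights
b = rng.standard_normal(L)          # random biases

def hidden(x):
    """Nonlinear random projection of input x into L dimensions."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))   # sigmoid activation

# Second layer: linear readout, the only part that gets trained.
beta = np.zeros(L)                  # output weights (trained later)
y = beta @ hidden(rng.standard_normal(d))
```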
Random Projection Neural Networks-II
• Another example: reservoir computing (liquid state machine, echo state network)
• Recurrence encodes time history implicitly.
• This can be done explicitly in ELM.
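For comparison, a minimal echo state network state update (a sketch under common conventions such as tanh neurons and spectral-radius scaling; these specifics are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res = 3, 100

# Random, fixed input and recurrent weights (the "reservoir").
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

x = np.zeros(n_res)
for u in rng.standard_normal((50, n_in)):          # drive with an input sequence
    # Recurrence: x depends on its own past, so the reservoir state
    # implicitly encodes the time history of the inputs.
    x = np.tanh(W @ x + W_in @ u)

# As in ELM, only a linear readout from x would be trained.
```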
ELM: Algorithms and Applications
Extreme Learning Machine (ELM)
• Multi-class regression & classification.
• Quick training, good generalization.
• Exploits the fixed random weights of the 1st layer for VLSI implementation.
G.-B. Huang et al., "Extreme learning machine for regression and multiclass classification," IEEE Transactions SMC-B, 2012.
G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machines: Theory and Applications," Neurocomputing, 2006.
G(·) can be even more general.
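For reference, the ELM output function has the standard form given in the Huang et al. papers cited above; for additive hidden nodes,

```latex
f_L(\mathbf{x}) = \sum_{i=1}^{L} \boldsymbol{\beta}_i \, G(\mathbf{a}_i, b_i, \mathbf{x}),
\qquad
G(\mathbf{a}_i, b_i, \mathbf{x}) = g(\mathbf{a}_i \cdot \mathbf{x} + b_i)
```

where the hidden-node parameters (a_i, b_i) are random and fixed, only the output weights β_i are trained, and g can be a sigmoid or, as the slide notes, something more general (e.g. RBF nodes with G(a_i, b_i, x) = g(b_i ‖x − a_i‖)).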
ELM: Training
Stack the hidden-layer outputs for all training samples into a matrix H and the training targets into a matrix T; training then amounts to solving the linear system Hβ = T for the output weights:
β = H†T
where T is the training target matrix and H† is the Moore–Penrose pseudoinverse of H (more sophisticated learning is possible).
However, other online training methods can also be used: perceptron, LMS, etc.
G.-B. Huang et al., "Extreme learning machine for regression and multiclass classification," IEEE Transactions SMC-B, 2012.
G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machines: Theory and Applications," Neurocomputing, 2006.
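A minimal end-to-end ELM training sketch of the procedure above (the toy dataset, layer sizes, and sigmoid nonlinearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, N = 5, 100, 1000                    # input dim, hidden dim, training samples

# Training data: a toy nonlinear regression target.
X = rng.standard_normal((N, d))
T = np.sin(X.sum(axis=1, keepdims=True))  # targets, shape (N, 1)

# Fixed random first layer.
W = rng.standard_normal((d, L))
b = rng.standard_normal(L)

# Hidden layer output matrix H, one row per training sample.
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # shape (N, L)

# Training = one linear solve: beta = pinv(H) @ T.
beta = np.linalg.pinv(H) @ T

# Prediction on new data reuses the same random projection.
X_test = rng.standard_normal((10, d))
H_test = 1.0 / (1.0 + np.exp(-(X_test @ W + b)))
y_pred = H_test @ beta
```

In practice a ridge-regularised solve, β = (HᵀH + λI)⁻¹HᵀT, is often preferred for numerical stability and generalization, as in the 2012 Huang et al. paper.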
Intuition about Universal Approximation
G.-B. Huang, L. Chen and C.-K. Siew, “Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes”, IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
What if you train all weights?
A. Rahimi and B. Recht, "Weighted Sums of Random Kitchen Sinks:…," NIPS, 2008.
• Roughly 3.5× more random projections are needed to match the error of fully trained weights.
• But what if randomness gives a ~100× benefit in other aspects (e.g. training time)?
[Figure: % error and training time vs. number of weak learners]
Random Projections in Image Recognition
K. Jarrett et al., "What is the best multi-stage architecture… ," ICCV 2009.
• Lots of architectures for deep networks – what is really important?
• 3 questions:
  – Which nonlinearities following the filters are good? (tanh, abs, max-pool, avg-pool)
  – Is unsupervised training much better than random filters?
  – 2 stages vs. 1 stage?
Testing on Caltech 101
Random Projections in Image Recognition – Intuition
A. Saxe et al., "On Random Weights and Unsupervised..… ," NIPS 2010.
• Random filters are also tuned to specific frequencies!
• A great option for quick architectural exploration (size of convolution filter, stride, pooling, etc.)
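A toy illustration of that intuition (an assumption-laden sketch, not the paper's code): probe a random convolution filter with sinusoidal gratings and observe that its average-pooled response peaks at a particular spatial frequency:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
filt = rng.standard_normal((9, 9))            # a random 9x9 convolution filter

def pooled_response(freq, size=64):
    """Average-pooled squared response of the filter to a sinusoidal grating."""
    xs = np.arange(size)
    grating = np.ones((size, 1)) * np.sin(2 * np.pi * freq * xs)[None, :]
    windows = sliding_window_view(grating, filt.shape)   # "valid" convolution
    out = np.einsum('ijkl,kl->ij', windows, filt)
    return np.mean(out ** 2)

freqs = np.linspace(0.01, 0.5, 50)
responses = [pooled_response(f) for f in freqs]
print("preferred spatial frequency:", freqs[int(np.argmax(responses))])
```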
ELM in Image Processing
L. Kasun et al., "Representational Learning with ELMs for Big Data," IEEE Intelligent Systems, 2013.
• Can train multiple layers with an ELM auto-encoder.
• 784-700-700-15000-10 network
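A minimal sketch of the ELM auto-encoder idea from that paper (sizes and activation are illustrative; the key step is that the decoder weights β learned for reconstruction are reused, transposed, as the next layer's weights):

```python
import numpy as np

def elm_autoencoder_layer(X, n_hidden, rng):
    """Learn one ELM-AE layer: random projection, then solve for output
    weights beta that reconstruct the input, X ~= H @ beta."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.pinv(H) @ X          # decoder, shape (n_hidden, n_features)
    # beta.T serves as the (now data-adapted) weights of the next layer.
    return np.tanh(X @ beta.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 784))      # e.g. flattened 28x28 images
X1 = elm_autoencoder_layer(X, 700, rng)   # 784 -> 700
X2 = elm_autoencoder_layer(X1, 700, rng)  # 700 -> 700
# A final wide random layer plus a linear classifier (15000 -> 10) would
# complete the 784-700-700-15000-10 network from the slide.
```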
ELM: Hardware Designs
Case study: Implantable Brain Machine Interfaces (BMI)
[Image: BrainGate, Brown University]
Large Scale Implantable BMI: Problems & Solutions
PROBLEMS
• Data rate per channel ~ 200 kbps; 1000 channels → 200 Mbps
• Power dissipation
UNSUSTAINABLE
I. Stevenson and K. Kording, "How advances in neural recording ..," Nature Neuroscience, 2011.
SOLUTION
• Compress data
• On-chip neural decoding
• E.g. decode which finger (5 choices) moves in which direction (2 choices): 10 outcomes ≈ 4 bits @ 1 kHz, i.e. ~4 kbps instead of 200 Mbps
• Spiking Neural Network (SNN) on-chip.
Proposal: Machine Learning Co-processor (MLCP)
Example Application: Decoding Dexterous Finger Movement
• Classify 12 movement types and the onset time of movement: reuse H!
• Moving average* of the number of spikes is the input feature x.
• Notation: input dimension D, number of hidden neurons L, number of classes C
[Diagram: monkey moves finger → recorded neural activity → neuromorphic decoder → predicted movement]
MOTOR INTENTION DECODING
• 3 monkeys trained to perform visually cued flexion and extension of wrist & fingers
• Single unit activities recorded from M1 neurons.
• Pseudorandom sequence of movement types.
• Unsuccessful trials discarded from analysis.
V. Aggarwal et al., "Asynchronous decoding of …," IEEE TNSRE, 2008.
*Average over 100 ms with a moving step of 20 ms
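A sketch of the input feature computation described above (the window and step sizes come from the slide's footnote; the spike-train representation is an assumption):

```python
import numpy as np

def moving_average_features(spike_times, t_end, window=0.100, step=0.020):
    """Moving-average spike count per electrode: 100 ms window, 20 ms step.
    spike_times: list of 1-D arrays, one array of spike times (s) per electrode."""
    starts = np.arange(0.0, t_end - window, step)
    feats = np.empty((len(starts), len(spike_times)))
    for e, times in enumerate(spike_times):
        for i, t0 in enumerate(starts):
            feats[i, e] = np.sum((times >= t0) & (times < t0 + window))
    return feats   # shape: (time steps, electrodes); each row is one input x

# Example: 4 electrodes with random spikes over 2 seconds.
rng = np.random.default_rng(0)
spikes = [np.sort(rng.uniform(0, 2.0, rng.integers(50, 150))) for _ in range(4)]
X = moving_average_features(spikes, t_end=2.0)
```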
Algorithmic Novelty: Time Delay Based Dimension Increase (TDBDI)
• Common problem: loss of signal over time in some electrodes.
• Use extra information from the previous (p−1) time samples of the functional electrodes.
• For n electrodes, the input dimension becomes D = n × p (see the sketch below).
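A sketch of TDBDI as described above (the array layout is an assumption): stack each feature vector with its previous p−1 samples to raise the input dimension from n to n × p:

```python
import numpy as np

def tdbdi(X, p):
    """Time Delay Based Dimension Increase.
    X: (T, n) array of features, one row per time step, n electrodes.
    Returns (T - p + 1, n * p): each row concatenates the current sample
    with its previous (p - 1) samples, so D = n * p."""
    T, n = X.shape
    rows = [X[t - p + 1 : t + 1].ravel() for t in range(p - 1, T)]
    return np.asarray(rows)

X = np.arange(20.0).reshape(10, 2)   # 10 time steps, n = 2 electrodes
X_delayed = tdbdi(X, p=3)            # D = 2 * 3 = 6
print(X_delayed.shape)               # (8, 6)
```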
Hardware Architecture: One Channel
[Block diagram, one channel: spikes → counter (WinCNT) → moving window average → digital-to-analog converter (DAC) → current mirror array (random projection)]
Proposed Design: Hardware Architecture of MLCP
Chen Yi, Yao Enyi and Arindam Basu, "A 128 Channel 290 GMACs/W Machine Learning..," IEEE ISCAS, 2015.
• D = 128, L = 128
• 2nd stage on DSP
Machine Learning Co-Processor (MLCP): Architecture
A. Basu, S. Shuo, H. Zhou, G. Huang and M. Lim, "Silicon Spiking Neurons for Hardware Implementation of Extreme Learning Machines," Neurocomputing, 2013.
Y. Enyi, S. Hussain, A. Basu and G. Huang, "Computation using Mismatch:…," IEEE BioCAS, 2013.
• D = 128, L = 128
• 2nd stage on DSP
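The "Computation using Mismatch" idea (from the cited BioCAS paper) uses device mismatch in the current mirror array as the source of the random first-layer weights. A toy simulation of that principle (the lognormal mismatch model, its spread, and the output scaling are assumptions, not measured chip data):

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 128, 128

# Device mismatch model: each current mirror's gain deviates randomly from 1.
# For subthreshold mirrors, threshold-voltage mismatch gives a roughly
# lognormal current gain (an assumed sigma of 0.3 here).
mirror_gain = np.exp(rng.normal(0.0, 0.3, size=(L, D)))

def random_projection(x):
    """Input currents x are mirrored with random gains: the mismatch IS the
    random weight matrix, so no weight storage is needed on chip."""
    return mirror_gain @ x

x = rng.uniform(0, 1e-9, D)                     # e.g. DAC output currents, nA scale
h = np.tanh(random_projection(x) / (D * 1e-9))  # hidden activations, scaled to O(1)
# Only the second stage (the linear readout, here on the DSP) is trained.
```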
Results: Characterization
• IC fabricated in 0.35 µm CMOS
• Portable External Unit (PEU)
Results: Performance in Neural Decoding
• Accuracy on par with the state of the art.
• Can use extra information from earlier time samples (TDBDI) if the number of recorded M1 neurons is small!
• Power ~ 0.4 µW @ 50 classifications/s → 8 nJ/classification!
Results: Real Time Operation of MLCP
Chen Yi, Yao Enyi and Arindam Basu, "A 128 Channel 290 GMACs/W Machine Learning..," IEEE ISCAS, 2015.
In Telluride…
• Speech recognition with spikes from cochlea or MFCC.
• Word2vec
• Combined/fused classification of cochlea + DAVIS for lip-reading.