TRANSCRIPT
Random Projection Neural Networks: Algorithms & Hardware
Jonathan Tapson, Arindam Basu
Telluride 2015
All of Machine Learning in 2 Slides
• We generally use machine learning for two purposes: given input data X and corresponding output data Y, find a function of X that performs either
  – Classification: classify each sample by finding a separating hyperplane, so that y ∈ {1, …, N} for N classes
  – Regression: find a function y = f(x) that minimises some error condition
• If there is a linear solution, we don't need machine learning
• For a nonlinear solution, we use a machine learning method that:
  – Projects the data into a higher dimensional space
  – The projection must be nonlinear, so as to create separations which did not exist in the original space
  – We can then solve using a linear solution in the higher dimensional space
Machine Learning
[Figure: 2-D data (x, y) projected into a higher dimensional space by adding a z dimension non-linearly, e.g. z = x²]
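As a minimal sketch of this idea (the toy data and the feature map z = x² are illustrative assumptions, not from the slides), lifting 1-D data that no single threshold can separate into 2-D makes a linear decision boundary sufficient:

```python
import numpy as np

# Toy 1-D data: class 1 lies in the middle, class 0 on both sides,
# so no single threshold on x can separate the classes.
x = np.array([-2.0, -1.5, -0.25, 0.0, 0.3, 1.4, 2.1])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Nonlinear projection into a higher dimensional space: add z = x^2.
# In (x, z) the middle class sits below z = 1, the outer class above it.
features = np.column_stack([x, x**2])

# A linear decision boundary in the lifted space: z < 1  =>  class 1.
# (Weights chosen by inspection here; a perceptron or least-squares fit
# would find an equivalent separating hyperplane.)
w, b = np.array([0.0, -1.0]), 1.0
predictions = (features @ w + b > 0).astype(int)
print(predictions)   # matches y: the lifted data is linearly separable
```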
Machine Learning
[Figure: find a separating hyperplane in the higher dimensional (x, y, z) space]
Neural Networks
• Single layer (perceptron): linear classification only; easy to train
• Multi-layer perceptron: universal approximator; difficult to train
• Support vector machines (SVM): maximise the margin of classification
• Combination of weak classifiers: voting, AdaBoost
• Random projection based neural networks: universal approximator; fast training; easy to implement (?)
Random Projection Neural Networks-I
• Large number of randomly weighted connections: not trained.
• Only the output weights are trained (linear decoding): very fast training.
• Examples: Extreme Learning Machine (ELM), Neural Engineering Framework (NEF)
• Input dimension: d; hidden layer dimension: L (L ≫ d)
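A minimal sketch of the forward pass (illustrative only; the Gaussian random weights and sigmoid nonlinearity are assumptions, since the slides do not fix them):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 10, 200                      # input dimension d, hidden dimension L >> d

# First layer: random, fixed, never trained.
W = rng.standard_normal((L, d))     # random input weights
b = rng.standard_normal(L)          # random biases

def hidden(x):
    """Nonlinear random projection of input x into L dimensions."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))   # sigmoid activation

# Second layer: linear readout, the only part that gets trained.
beta = np.zeros(L)                  # output weights (trained later)
y = beta @ hidden(rng.standard_normal(d))
```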
Random Projection Neural Networks-II
• Another example: reservoir computing (liquid state machine, echo state network)
• Recurrence encodes time history implicitly.
• This can be done explicitly in ELM.
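For comparison, a minimal echo state network state update (a sketch under common conventions such as tanh neurons and spectral-radius scaling; these specifics are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res = 3, 100

# Random, fixed input and recurrent weights (the "reservoir").
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

x = np.zeros(n_res)
for u in rng.standard_normal((50, n_in)):          # drive with an input sequence
    # Recurrence: x depends on its own past, so the reservoir state
    # implicitly encodes the time history of the inputs.
    x = np.tanh(W @ x + W_in @ u)

# As in ELM, only a linear readout from x would be trained.
```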
ELM: Algorithms and Applications
Extreme Learning Machine (ELM)
• Multi-class regression & classification.
• Quick training, good generalization.
• Exploits the fixed random weights of the 1st layer for VLSI implementation.
G.-B. Huang et al., "Extreme learning machine for regression and multiclass classification," IEEE Transactions SMC-B, 2012.
G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machines: Theory and Applications," Neurocomputing, 2006.
G(·) can be even more general.
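For reference, the ELM output function has the standard form given in the Huang et al. papers cited above; for additive hidden nodes,

```latex
f_L(\mathbf{x}) = \sum_{i=1}^{L} \boldsymbol{\beta}_i \, G(\mathbf{a}_i, b_i, \mathbf{x}),
\qquad
G(\mathbf{a}_i, b_i, \mathbf{x}) = g(\mathbf{a}_i \cdot \mathbf{x} + b_i)
```

where the hidden-node parameters (a_i, b_i) are random and fixed, only the output weights β_i are trained, and g can be a sigmoid or, as the slide notes, something more general (e.g. RBF nodes with G(a_i, b_i, x) = g(b_i ‖x − a_i‖)).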
ELM: Training
Stack the hidden-layer outputs for all training samples into a matrix H and the training targets into a matrix T; training then amounts to solving the linear system Hβ = T for the output weights:
β = H†T
where T is the training target matrix and H† is the Moore–Penrose pseudoinverse of H (more sophisticated learning is possible).
However, other online training methods can also be used: perceptron, LMS, etc.
G.-B. Huang et al., "Extreme learning machine for regression and multiclass classification," IEEE Transactions SMC-B, 2012.
G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machines: Theory and Applications," Neurocomputing, 2006.
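A minimal end-to-end ELM training sketch of the procedure above (the toy dataset, layer sizes, and sigmoid nonlinearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, N = 5, 100, 1000                    # input dim, hidden dim, training samples

# Training data: a toy nonlinear regression target.
X = rng.standard_normal((N, d))
T = np.sin(X.sum(axis=1, keepdims=True))  # targets, shape (N, 1)

# Fixed random first layer.
W = rng.standard_normal((d, L))
b = rng.standard_normal(L)

# Hidden layer output matrix H, one row per training sample.
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # shape (N, L)

# Training = one linear solve: beta = pinv(H) @ T.
beta = np.linalg.pinv(H) @ T

# Prediction on new data reuses the same random projection.
X_test = rng.standard_normal((10, d))
H_test = 1.0 / (1.0 + np.exp(-(X_test @ W + b)))
y_pred = H_test @ beta
```

In practice a ridge-regularised solve, β = (HᵀH + λI)⁻¹HᵀT, is often preferred for numerical stability and generalization, as in the 2012 Huang et al. paper.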
Intuition about Universal Approximation
G.-B. Huang, L. Chen and C.-K. Siew, “Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes”, IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
What if you train all weights?
A. Rahimi and B. Recht, "Weighted Sums of Random Kitchen Sinks:…," NIPS, 2008.
• Roughly 3.5× more random projections are needed to match the error of fully trained weights.
• But what if randomness gives a ~100× benefit in other aspects (e.g. training time)?
[Figure: % error and training time vs. number of weak learners]
Random Projections in Image Recognition
K. Jarrett et al., "What is the best multi-stage architecture… ," ICCV 2009.
• Lots of architectures for deep networks – what is really important?
• 3 questions:
  – Which nonlinearities following the filters are good? (tanh, abs, max-pool, avg-pool)
  – Is unsupervised training much better than random filters?
  – 2 stages vs. 1 stage?
Testing on Caltech 101
Random Projections in Image Recognition – Intuition
A. Saxe et al., "On Random Weights and Unsupervised..… ," NIPS 2010.
• Random filters are also tuned to specific frequencies!
• A great option for quick architectural exploration (size of convolution filter, stride, pooling, etc.)
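A toy illustration of that intuition (an assumption-laden sketch, not the paper's code): probe a random convolution filter with sinusoidal gratings and observe that its average-pooled response peaks at a particular spatial frequency:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
filt = rng.standard_normal((9, 9))            # a random 9x9 convolution filter

def pooled_response(freq, size=64):
    """Average-pooled squared response of the filter to a sinusoidal grating."""
    xs = np.arange(size)
    grating = np.ones((size, 1)) * np.sin(2 * np.pi * freq * xs)[None, :]
    windows = sliding_window_view(grating, filt.shape)   # "valid" convolution
    out = np.einsum('ijkl,kl->ij', windows, filt)
    return np.mean(out ** 2)

freqs = np.linspace(0.01, 0.5, 50)
responses = [pooled_response(f) for f in freqs]
print("preferred spatial frequency:", freqs[int(np.argmax(responses))])
```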
ELM in Image Processing
L. Kasun et al., "Representational Learning with ELMs for Big Data," IEEE Intelligent Systems, 2013.
• Can train multiple layers with an ELM auto-encoder.
• 784-700-700-15000-10 network
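A minimal sketch of the ELM auto-encoder idea from that paper (sizes and activation are illustrative; the key step is that the decoder weights β learned for reconstruction are reused, transposed, as the next layer's weights):

```python
import numpy as np

def elm_autoencoder_layer(X, n_hidden, rng):
    """Learn one ELM-AE layer: random projection, then solve for output
    weights beta that reconstruct the input, X ~= H @ beta."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.pinv(H) @ X          # decoder, shape (n_hidden, n_features)
    # beta.T serves as the (now data-adapted) weights of the next layer.
    return np.tanh(X @ beta.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 784))      # e.g. flattened 28x28 images
X1 = elm_autoencoder_layer(X, 700, rng)   # 784 -> 700
X2 = elm_autoencoder_layer(X1, 700, rng)  # 700 -> 700
# A final wide random layer plus a linear classifier (15000 -> 10) would
# complete the 784-700-700-15000-10 network from the slide.
```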
ELM: Hardware Designs
Case study: Implantable Brain Machine Interfaces (BMI)
[Image: BrainGate, Brown University]
Large Scale Implantable BMI: Problems & Solutions
PROBLEMS
• Data rate per channel ~ 200 kbps; 1000 channels → 200 Mbps
• Power dissipation
UNSUSTAINABLE
I. Stevenson and K. Kording, "How advances in neural recording ..," Nature Neuroscience, 2011.
SOLUTION
• Compress data
• On-chip neural decoding
• E.g. decode which finger (5 choices) moves in which direction (2 choices): 10 outcomes ≈ 4 bits @ 1 kHz, i.e. ~4 kbps instead of 200 Mbps
• Spiking Neural Network (SNN) on-chip.
Proposal: Machine Learning Co-processor (MLCP)
Example Application: Decoding Dexterous Finger Movement
• Classify 12 movement types and the onset time of movement: reuse H!
• Moving average* of the number of spikes is the input feature x.
• Notation: input dimension D, number of hidden neurons L, number of classes C
[Diagram: monkey moves finger → recorded neural activity → neuromorphic decoder → predicted movement]
MOTOR INTENTION DECODING
• 3 monkeys trained to perform visually cued flexion and extension of wrist & fingers
• Single unit activities recorded from M1 neurons.
• Pseudorandom sequence of movement types.
• Unsuccessful trials discarded from analysis.
V. Aggarwal et al., "Asynchronous decoding of …," IEEE TNSRE, 2008.
*Average over 100 ms with a moving step of 20 ms
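A sketch of the input feature computation described above (the window and step sizes come from the slide's footnote; the spike-train representation is an assumption):

```python
import numpy as np

def moving_average_features(spike_times, t_end, window=0.100, step=0.020):
    """Moving-average spike count per electrode: 100 ms window, 20 ms step.
    spike_times: list of 1-D arrays, one array of spike times (s) per electrode."""
    starts = np.arange(0.0, t_end - window, step)
    feats = np.empty((len(starts), len(spike_times)))
    for e, times in enumerate(spike_times):
        for i, t0 in enumerate(starts):
            feats[i, e] = np.sum((times >= t0) & (times < t0 + window))
    return feats   # shape: (time steps, electrodes); each row is one input x

# Example: 4 electrodes with random spikes over 2 seconds.
rng = np.random.default_rng(0)
spikes = [np.sort(rng.uniform(0, 2.0, rng.integers(50, 150))) for _ in range(4)]
X = moving_average_features(spikes, t_end=2.0)
```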
Algorithmic Novelty: Time Delay Based Dimension Increase (TDBDI)
• Common problem: loss of signal over time in some electrodes.
• Use extra information from the previous (p−1) time samples of the functional electrodes.
• For n electrodes, the input dimension becomes D = n × p (see the sketch below).
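A sketch of TDBDI as described above (the array layout is an assumption): stack each feature vector with its previous p−1 samples to raise the input dimension from n to n × p:

```python
import numpy as np

def tdbdi(X, p):
    """Time Delay Based Dimension Increase.
    X: (T, n) array of features, one row per time step, n electrodes.
    Returns (T - p + 1, n * p): each row concatenates the current sample
    with its previous (p - 1) samples, so D = n * p."""
    T, n = X.shape
    rows = [X[t - p + 1 : t + 1].ravel() for t in range(p - 1, T)]
    return np.asarray(rows)

X = np.arange(20.0).reshape(10, 2)   # 10 time steps, n = 2 electrodes
X_delayed = tdbdi(X, p=3)            # D = 2 * 3 = 6
print(X_delayed.shape)               # (8, 6)
```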
Hardware Architecture: One Channel
[Block diagram, one channel: spikes → counter (WinCNT) → moving window average → digital-to-analog converter (DAC) → current mirror array (random projection)]
Proposed Design: Hardware Architecture of MLCP
Chen Yi, Yao Enyi and Arindam Basu, "A 128 Channel 290 GMACs/W Machine Learning..," IEEE ISCAS, 2015.
• D = 128, L = 128
• 2nd stage on DSP
Machine Learning Co-Processor (MLCP): Architecture
A. Basu, S. Shuo, H. Zhou, G. Huang and M. Lim, "Silicon Spiking Neurons for Hardware Implementation of Extreme Learning Machines," Neurocomputing, 2013.
Y. Enyi, S. Hussain, A. Basu and G. Huang, "Computation using Mismatch:…," IEEE BioCAS, 2013.
• D = 128, L = 128
• 2nd stage on DSP
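The "Computation using Mismatch" idea (from the cited BioCAS paper) uses device mismatch in the current mirror array as the source of the random first-layer weights. A toy simulation of that principle (the lognormal mismatch model, its spread, and the output scaling are assumptions, not measured chip data):

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 128, 128

# Device mismatch model: each current mirror's gain deviates randomly from 1.
# For subthreshold mirrors, threshold-voltage mismatch gives a roughly
# lognormal current gain (an assumed sigma of 0.3 here).
mirror_gain = np.exp(rng.normal(0.0, 0.3, size=(L, D)))

def random_projection(x):
    """Input currents x are mirrored with random gains: the mismatch IS the
    random weight matrix, so no weight storage is needed on chip."""
    return mirror_gain @ x

x = rng.uniform(0, 1e-9, D)                     # e.g. DAC output currents, nA scale
h = np.tanh(random_projection(x) / (D * 1e-9))  # hidden activations, scaled to O(1)
# Only the second stage (the linear readout, here on the DSP) is trained.
```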
Results: Characterization
• IC fabricated in 0.35 µm CMOS
• Portable External Unit (PEU)
Results: Performance in Neural Decoding
• Accuracy on par with the state of the art.
• Can use extra information from earlier time samples (TDBDI) if the number of recorded M1 neurons is small!
• Power ~ 0.4 µW @ 50 classifications/s → 8 nJ/classification!
Results: Real Time Operation of MLCP
Chen Yi, Yao Enyi and Arindam Basu, "A 128 Channel 290 GMACs/W Machine Learning..," IEEE ISCAS, 2015.
In Telluride…
• Speech recognition with spikes from cochlea or MFCC.
• Word2vec
• Combined/fused classification of cochlea + DAVIS for lip-reading.