A Perspective on the Limits of Computation
Oskar Mencer
May 2012
Limits of Computation
Objective: Maximum Performance Computing (MPC)
What is the fastest we can compute desired results?
Conjecture: Data movement is the real limit on computation.
Maximum Performance Computing (MPC)
The journey will take us through:
1. Information Theory: Kolmogorov Complexity
2. Optimised Arithmetic: Winograd Bounds
3. Optimisation via Kahneman and Von Neumann
4. Real World Dataflow Implications and Results
Less Data Movement = Less Data + Less Movement
Kolmogorov Complexity (K)
Definition (Kolmogorov): “If a description of string s, d(s), is of minimal length, […] it is called a minimal description of s. Then the length of d(s), […] is the Kolmogorov complexity of s, written K(s), where K(s) = |d(s)|”
Of course, K(s) depends heavily on the language L used for the description d(s) (e.g. Java, Esperanto, an executable file, etc.)
Kolmogorov, A.N. (1965). "Three Approaches to the Quantitative Definition of Information". Problems Inform. Transmission 1 (1): 1–7.
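K(s) itself is not computable, but any lossless compressor gives a concrete upper bound on it: the compressed string plus a fixed-size decompressor is a valid description of s. A minimal Python sketch, with zlib standing in for the description language L:

```python
import os
import zlib

# Any lossless compressor yields an upper bound on K(s): the compressed
# bytes, together with a constant-size decompressor, describe s exactly.
def k_upper_bound(s: bytes) -> int:
    """Upper bound on K(s), ignoring the constant for the decompressor."""
    return len(zlib.compress(s, 9))

highly_regular = b"ab" * 4096      # 8192 bytes with a very short description
incompressible = os.urandom(8192)  # 8192 random bytes: almost surely no short one

print(k_upper_bound(highly_regular))   # tiny: the repeating pattern compresses
print(k_upper_bound(incompressible))   # close to 8192: random data does not
```

Swapping zlib for another compressor (another language L) changes the bound, which is exactly the language dependence noted above.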
A Maximum Performance Computing Theorem
For a computational task f that computes result r from inputs i, i.e. r = f(i):

i → [ f ] → r

Assuming infinite capacity to compute and remember inside the box f, the time T to compute task f depends only on moving the data into and out of the box.

Thus, for a machine f with infinite memory and infinitely fast arithmetic, the Kolmogorov complexity K(i + r) of the combined input and result defines the fastest way to compute task f.
The representation K(σ, F) of the state (σ, F) is critical!

SABR model:
dF_t = σ_t F_t^β dW_t
dσ_t = α σ_t dZ_t,  dW_t dZ_t = ρ dt

We integrate in time (Euler in log-forward, Milstein in volatility):
ln F_{t+Δt} = ln F_t + σ_t exp((β−1) ln F_t) ΔW_t − ½ σ_t² exp(2(β−1) ln F_t) Δt
σ_{t+Δt} = σ_t (1 + α ΔZ_t + ½ α² (ΔZ_t² − Δt))

[Diagram: the machine split into logic and state (σ, F)]
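The Euler/Milstein time stepping described above can be sketched in Python. This is a generic textbook discretisation of SABR; the parameter names (α, β, ρ) and the exact step arrangement are my assumptions, not taken from the slide:

```python
import math
import random

def sabr_step(F, sigma, alpha, beta, rho, dt, rng):
    """One SABR step: Euler in ln F, Milstein in the volatility.
    Model: dF = sigma*F**beta dW, dsigma = alpha*sigma dZ, corr(dW, dZ) = rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    dW, dZ = math.sqrt(dt) * z1, math.sqrt(dt) * z2

    # Euler in the log-forward:
    # d(ln F) = sigma*F**(beta-1) dW - 0.5*sigma^2*F**(2beta-2) dt
    g = math.exp((beta - 1.0) * math.log(F))  # F**(beta-1), kept in log form
    lnF = math.log(F) + sigma * g * dW - 0.5 * (sigma * g) ** 2 * dt

    # Milstein in the volatility: 0.5*alpha^2*(dZ^2 - dt) is the correction term
    sigma_new = sigma * (1.0 + alpha * dZ + 0.5 * alpha * alpha * (dZ * dZ - dt))
    return math.exp(lnF), sigma_new

rng = random.Random(42)
F, sigma = 100.0, 0.2
for _ in range(1000):
    F, sigma = sabr_step(F, sigma, alpha=0.3, beta=0.5, rho=-0.4, dt=0.001, rng=rng)
```

Working in ln F keeps the forward positive by construction; the choice of state representation, not the speed of the arithmetic, is what the slide flags as critical.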
MPC – Bad News
1. Real computers have neither infinite memory nor infinitely fast arithmetic units.
2. Kolmogorov's theorem: K is not a computable function.
MPC – Good News
Today’s arithmetic units are fast enough.
So in practice...
Kolmogorov Complexity => Discretisation & Compression
=> MPC depends on the Representation of the Problem.
Euclid's Elements, representing a² + b² = c²
17 × 24 = ?
Thinking Fast and Slow
Daniel Kahneman Nobel Prize in Economics, 2002
back to 17 × 24
Kahneman splits thinking into:
System 1: fast, hard to control ... 400
System 2: slow, easier to control ... 408
(System 1 offers a quick estimate of 17 × 24; System 2 grinds out the exact answer, 408.)
Remembering Fast and Slow
John von Neumann, 1946:
“We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding, but which is less quickly accessible.”
Consider Computation and Memory Together
Computing f(x) in the range [a, b] with |E| ≤ 2⁻ⁿ spans a spectrum of methods:

Table  |  Table + Arithmetic (+, −, ×, ÷)  |  Arithmetic (+, −, ×, ÷)

Design choices along the spectrum: uniform vs non-uniform tables, the number of table entries, how many coefficients, polynomial or rational approximation, continued fractions, multi-partite tables.
Underlying hardware/technology changes the optimum
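As a toy instance of this tradeoff, here are a table-based and an arithmetic-based evaluation of sin(x) on [0, 1], both to roughly 2⁻¹⁷ accuracy. The function, table size and polynomial degree are my choices for illustration:

```python
import math

N = 1024
TABLE = [math.sin(i / N) for i in range(N + 1)]  # memory-heavy: N+1 entries

def sin_table(x):
    """Table + a little arithmetic: two lookups, one mul/add (interpolation)."""
    t = x * N
    i = min(int(t), N - 1)
    return TABLE[i] + (t - i) * (TABLE[i + 1] - TABLE[i])

def sin_poly(x):
    """Pure arithmetic: degree-7 Taylor polynomial, no table at all."""
    x2 = x * x
    return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)))

# Worst-case error of each method over a fine grid on [0, 1]
err_table = max(abs(math.sin(i / 1000) - sin_table(i / 1000)) for i in range(1001))
err_poly = max(abs(math.sin(i / 1000) - sin_poly(i / 1000)) for i in range(1001))
```

On a CPU the polynomial usually wins; on an FPGA with cheap block RAM the table may. Which end of the spectrum is optimal depends on the underlying technology, as the slide says.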
MPC in Practice
Tradeoff Representation, Memory and Arithmetic
Limits on Computing + and ×
Shmuel Winograd, 1965
Bounds on Addition:
- Binary: O(log n)
- Residue Number System: O(log 2 log α(N))
- Redundant Number System: O(1)

Bounds on Multiplication:
- Binary: O(log n)
- Residue Number System: O(log 2 log β(N))
- Using Tables: O(2[log n/2] + 2 + [log 2n/2])
- Logarithmic Number System: O(Addition)
However, binary and logarithmic numbers are easy to compare; the other representations are not!
Lesson: If you optimize only a little piece of the computation, the result is useless in practice => need to optimize ENTIRE programs.
Or in other words: abstraction kills performance.
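A sketch of the logarithmic number system line above: storing values as fixed-point log₂ makes multiplication a single addition, at the cost of expensive addition of the underlying values. The 20-bit fraction width is an arbitrary choice:

```python
import math

FRAC_BITS = 20  # fixed-point fraction width for the stored log2 value

def to_lns(x: float) -> int:
    """Encode a positive real as a fixed-point log2 (sign handling omitted)."""
    return round(math.log2(x) * (1 << FRAC_BITS))

def from_lns(e: int) -> float:
    return 2.0 ** (e / (1 << FRAC_BITS))

def lns_mul(a: int, b: int) -> int:
    return a + b  # log2(x*y) = log2(x) + log2(y): multiplication in O(Addition)

product = from_lns(lns_mul(to_lns(17.0), to_lns(24.0)))  # ≈ 408
```

Comparison is also cheap (a larger stored log means a larger value), matching the remark that log numbers compare easily, while adding two LNS values needs tables or approximation.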
Addition in O(1)
Redundant: 2 bits represent 1 binary digit
=> use counters to reduce the input

(3,2) counters reduce three numbers (a, b, c) to two numbers (out1, out2) so that a + b + c = out1 + out2
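A bit-level sketch of the (3,2) counter, i.e. a full adder at every bit position, here applied to Python ints:

```python
def counter_3_2(a: int, b: int, c: int):
    """Reduce three addends to two with NO carry propagation: every bit
    position is handled independently, so the delay is O(1) in word length."""
    sums = a ^ b ^ c                               # per-position sum bit
    carries = ((a & b) | (b & c) | (a & c)) << 1   # per-position carry bit
    return sums, carries

out1, out2 = counter_3_2(0b1011, 0b0110, 0b1101)
assert 0b1011 + 0b0110 + 0b1101 == out1 + out2
```

Chaining such counters reduces any number of addends to two in constant depth per stage; only the final two-number sum needs a real carry chain.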
From Theory to Practice
Optimise Whole Programs
Bit Level Representation
Storage
Processor
Discretisation
Iteration
Method
Customise Numerics
Customise Architecture
Mission Impossible?
Maximum Performance Computing (MPC)
The journey will take us through:
1. Information Theory: Kolmogorov Complexity
2. Optimised Arithmetic: Winograd Bounds
3. Optimisation via Kahneman and Von Neumann
4. Real World Dataflow Implications and Results
Less Data Movement = Less Data + Less Movement
Optimise Whole Programs with Finite Resources
System 1: x86 cores
System 2: flexible memory + logic
Low Latency Memory
High Throughput Memory
Balance Computation and Memory
The Ideal System 2 is a Production Line
System 1: x86 cores
System 2: flexible memory + logic
Low Latency Memory
High Throughput Memory
Balance Computation and Memory
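The production-line idea can be mimicked with Python generators: each stage holds one item in flight and data streams through fixed logic. This is illustrative only, not how Maxeler kernels are written:

```python
# Each stage of the "production line" consumes one item and emits one item;
# composing stages yields a pipeline the data flows through.
def scale(xs, k):
    for x in xs:
        yield k * x

def offset(xs, d):
    for x in xs:
        yield x + d

def clamp(xs, lo, hi):
    for x in xs:
        yield min(max(x, lo), hi)

pipeline = clamp(offset(scale(range(10), 3), -5), 0, 20)
print(list(pipeline))  # [0, 0, 1, 4, 7, 10, 13, 16, 19, 20]
```

In hardware every stage runs concurrently, producing one result per cycle; the generator version mimics only the structure, not the parallelism.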
8 Maxeler DFEs replacing 1,900 Intel CPU cores
[Chart: equivalent CPU cores (0–2,000) vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz and 70Hz peak frequency]
presented by ENI at the Annual SEG Conference, 2010
Compared to 32 3GHz x86 cores parallelized using MPI
100kWatts of Intel cores => 1kWatt of Maxeler Dataflow Engines
Given matrix A, vector b, find vector x in Ax = b.
Example: Sparse Matrix Computations
O. Lindtjorn et al., HotChips 2010
[Chart: speedup per 1U node (0–60) vs. compression ratio (0–10) for a set of test matrices]
Domain Specific Address and Data Encoding (*Patent Pending)
Maxeler solution: 20–40x speedup in 1U.
The CPU implementation does not scale beyond six x86 CPU cores.
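A minimal sketch of why encoding pays off here: compressed sparse row (CSR) storage moves only the nonzeros through the Ax kernel. This is plain CSR for illustration; the slide's domain-specific encoding is proprietary:

```python
# CSR: keep only nonzeros plus index metadata, so the bytes streamed per
# multiply shrink with the compression ratio.
def csr_from_dense(A):
    vals, cols, rowptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0.0:
                vals.append(v)
                cols.append(j)
        rowptr.append(len(vals))
    return vals, cols, rowptr

def csr_matvec(vals, cols, rowptr, x):
    """y = A @ x, touching only the stored nonzeros."""
    return [sum(vals[k] * x[cols[k]] for k in range(r0, r1))
            for r0, r1 in zip(rowptr, rowptr[1:])]

A = [[4.0, 0.0, 0.0],
     [0.0, 3.0, 1.0],
     [0.0, 0.0, 2.0]]
y = csr_matvec(*csr_from_dense(A), [1.0, 2.0, 3.0])  # [4.0, 9.0, 6.0]
```

For an iterative Ax = b solver the matrix is streamed every iteration, so any reduction in stored bytes translates directly into less data movement.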
• Compute value and risk of complex credit derivatives.
• Moving the overnight run to realtime intra-day
• Reported speedup: 220–270x (8 hours => 2 minutes)
• Power consumption per node drops from 250W to 235W
Example: JP Morgan Derivatives Pricing
O. Mencer, S. Weston, Journal on Concurrency and Computation, July 2011.
See the JP Morgan talk at Stanford on YouTube; search “weston maxeler”.
Maxeler Loop Flow Graphs for JP Morgan Credit Derivatives
Whole Program Transformation Options
Maxeler Data Flow Graph for JP Morgan Interest Rates Monte Carlo Acceleration
Example: dataflow graph generated by MaxCompiler
4866 static dataflow cores in 1 chip
Maxeler Dataflow Engines (DFEs)
High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
The Dataflow Appliance: dense compute with 8 DFEs, 384GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections
MaxWorkstation: desktop dataflow development system
Dataflow Engines: 48GB DDR3, high-speed connectivity and dense configurable logic
MaxRack: 10, 20 or 40 node rack systems integrating compute, networking & storage
MaxCloud: hosted, on-demand, scalable accelerated compute