custom computing systemswl/teachlocal/cuscomp/notes/cc... · 2020-01-14 · wl 2020 2.12 custom...
wl 2020 2.1
Custom computing systems
• difference engine: Charles Babbage 1832 - compute maths tables
• digital orrery: MIT 1985 - special-purpose engine, found Pluto's motion chaotic
• Splash2: Supercomputing Research Center 1993 - multi-FPGA engine, for video processing, DNA computing etc.
• Harp1: Oxford University 1995 - FPGA + microprocessor (transputer)
• SONIC, UltraSonic: Sony + Imperial College 1999-2002 - multi-FPGA, professional video processing
• MaxWorkstation, MaxNode: 2011, Max5: 2017 - FPGA cards adopted by JP Morgan, Amazon…
wl 2020 2.2
The Exaflop Supercomputer (2022)
• 1 exaflop = 10^18 FLOPS (TaihuLight: 93 petaflops)
• using processor cores with 8 FLOPS/clock at 2.5 GHz
• 50M CPU cores
• what about power?
- assume power envelope of 100 W per chip
- Moore’s Law scaling: 6 cores today → ~100 cores/chip
- 500k CPU chips
• 50 MW (just for CPUs!), 100 MW likely
• ‘TaihuLight’ power consumption: 15 MW
source: Maxeler
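The figures on this slide follow from simple division. A minimal sketch of that arithmetic, using only the numbers quoted above; the function name `exaflop_budget` and its parameter names are illustrative, not from the slides:

```python
def exaflop_budget(target_flops=10**18, flops_per_clock=8, clock_hz=2.5e9,
                   cores_per_chip=100, watts_per_chip=100):
    """Return (cores, chips, megawatts) needed to reach target_flops."""
    flops_per_core = flops_per_clock * clock_hz   # 8 x 2.5 GHz = 2e10 FLOPS per core
    cores = target_flops / flops_per_core         # cores needed for 1 exaflop
    chips = cores / cores_per_chip                # at ~100 cores per chip
    megawatts = chips * watts_per_chip / 1e6      # CPU power only
    return cores, chips, megawatts


cores, chips, megawatts = exaflop_budget()
print(f"{cores:.0f} cores, {chips:.0f} chips, {megawatts:.0f} MW")
```

With the slide's assumptions this reproduces the 50M cores, 500k chips and 50 MW; changing any one parameter shows how sensitive the power estimate is to it.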
wl 2020 2.3
The Exaflop Supercomputer (2018)
How do we program this?
Who pays for this?
source: Maxeler
wl 2020 2.4
Technology comparison
DSP: Digital Signal Processor
Dedicated HW = ASIC/FPGA
wl 2020 2.5
Intel 6-Core X5680 “Westmere”
[Figure: annotated die photo; labelled regions: Core (x6), Uncore, Computation]
• core: execution units; out-of-order scheduling & retirement; L1 data cache; memory ordering and execution; instruction decode and microcode; L2 cache & interrupt servicing; paging; branch prediction; instruction fetch & L1 cache
• uncore: memory controller; shared L3 cache; I/O and QPI
wl 2020 2.6
A special purpose computer
• a chip customised for a specific application
• no instructions → no instruction decode logic
• no branches → no branch prediction
• explicit parallelism → no out-of-order scheduling
• data streamed onto chip → no multi-level caches
[Figure: MyApplication Chip connected to (Lots of) Memory and to the Rest of the world]
source: Maxeler
wl 2020 2.7
A special purpose computer
• but we have more than one application
• impractical to optimise machines for only one application
- need to run many applications in a typical system
[Figure: three MyApplication Chips and one OtherApplication Chip, each with its own Memory, joined by Network links and connected to the Rest of the world]
source: Maxeler
wl 2020 2.8
A special purpose computer
• use reconfigurable chip: reprogram at runtime to implement:
- different applications, or
- different versions of the same application
[Figure: reconfigurable chip (Config 1) with Memory and Network; alternative configurations optimized for Applications A, B, C, D and E]
source: Maxeler
wl 2020 2.9
Instruction processors
source: Maxeler
wl 2020 2.10
Dataflow/stream processors
source: Maxeler
wl 2020 2.11
Accelerating real applications
• CPUs are good for:
- latency-sensitive, control-intensive, non-repetitive code
• dataflow engines are good for:
- high throughput repetitive processing on large data volumes
a system should contain both

Lines of code
Total application          1,000,000
Kernel to accelerate           2,000
Software to restructure       20,000
source: Maxeler
wl 2020 2.12
Custom computing in a PC
where is the Custom Architecture?
• on-chip with access to register file
• co-processor w/ access to level 1 cache
• next to level 2 cache
• in adjacent processor socket, connected using QPI/HyperTransport
• as Memory Controller not North/South Bridge
• as main memory (DIMMs)
• as a peripheral on PCI Express bus
• inside peripheral, e.g. customizable Disk controller
[Figure: PC block diagram: Processor with Register file, L1$ and L2$; North/South Bridge; PCI Bus; Disk; DIMMs]
wl 2020 2.13
Embedded systems
• partition programs into software and hardware (custom architecture)
- hardware/software co-design
• System-on-Chip: SoC (cover later)
• custom architecture as extension of the processor instruction set
[Figure: Processor with Register file, Data and Instructions, and an attached Custom Architecture]
wl 2020 2.14
Where to locate custom architecture?
• depends on the application
- avoid system bottleneck for the application
• possible bottlenecks
- memory access latency
- memory access bandwidth
- memory size
- processor local memory size
- processor ALU resource
- processor ALU operation latency
- various bus bandwidths
wl 2020 2.15
Bottleneck example: Bing page ranking
source: Microsoft
wl 2020 2.16
Reconfigurable computing with FPGAs
[Figure: Xilinx Virtex-6 FPGA fabric with labelled resources]
• DSP Block
• Block RAM (20 TB/s)
• IO Block
• Logic Cell (10^5 elements)
wl 2020 2.17
High density compute with FPGAs: examples
• 1U Form Factor for racks
DFE: Data Flow Engine
source: Maxeler
wl 2020 2.18
How could we program it?
• schematic entry of circuits
• hardware description languages
- VHDL, Verilog, SystemC
• object-oriented languages
- C/C++, Python, Java, and related languages
• dataflow languages: e.g. MaxJ, OpenSPL
• functional languages: e.g. Haskell, Ruby
• high level interface: e.g. Mathematica, MATLAB
• schematic block diagram: e.g. Simulink
• domain specific languages (DSLs)
wl 2020 2.19
Accelerator programming models
[Figure: axes: level of abstraction vs possible applications; layers: Flexible Compiler System (MaxCompiler/Ruby), Higher Level Libraries, DSLs]
wl 2020 2.20
Acceleration development flow
[Figure: flowchart with two YES/NO decision points]
Start → Original Application → Identify code for acceleration and analyze bottlenecks → Transform app, architect and model performance → Write accelerator code → Simulate → Functions correctly? (YES/NO) → Build for Hardware → Integrate with Host code → Meets performance goals? (YES/NO) → Accelerated Application
source: Maxeler
wl 2020 2.21
Acceleration development flow
[Figure: flowchart repeated from slide 2.20, annotated "Mainly for project"]
source: Maxeler
wl 2020 2.22
Customisation techniques
• FPGA technology offers customisation opportunities
- some data may remain constant: e.g. algebraic simplification
- adopt different data structures: e.g. number representation
- transform: e.g. enhance parallelism, pipelining, serialisation
• reuse possibilities (more next lecture)
- description: repeating unit, parametrisation
- transforms: patterns, laws, proofs
• example: polynomial evaluation for numbers ai, x:
  y = a0 + a1 x + a2 x^2 + a3 x^3 (repeat many times)
wl 2020 2.23
Performance estimation
• clocked circuit: no combinational loops
• gates have delay, and speed limited by propagation delay through the slowest combinational path
• slowest path: usually carry path
• clock rate: approx. 1/(delay of slowest path), assuming
- edge-triggered design
- register propagation delay, set-up time, clock skew etc. negligible
• lowest level: logic gates, do not worry about transistors
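As a rough illustration of this estimate, the sketch below takes a set of combinational path delays and returns the clock limit set by the slowest one. The delay values and the per-bit carry delay are made-up numbers for illustration, and register overheads are ignored, as the slide assumes:

```python
def max_clock_mhz(path_delays_ns):
    """Clock rate limit in MHz: 1 / (delay of the slowest combinational path)."""
    slowest_ns = max(path_delays_ns)
    return 1e3 / slowest_ns          # 1/ns is GHz, so 1e3/ns is MHz


def ripple_carry_delay_ns(bits, delay_per_bit_ns):
    """Carry path of a ripple-carry adder: the carry crosses every bit position."""
    return bits * delay_per_bit_ns


# Hypothetical path delays in nanoseconds.
paths = {
    "control logic": 1.5,
    "16-bit carry chain": ripple_carry_delay_ns(16, 0.25),
}
print(f"{max_clock_mhz(paths.values()):.0f} MHz")
```

Here the carry chain dominates, which matches the remark that the slowest path is usually the carry path.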
wl 2020 2.24
First polynomial evaluator
• compute y = a0 + a1 x + a2 x^2 + a3 x^3
• simplification: assume x constant
• problems: speed? size? repeating units?
[Figure: evaluator circuit with inputs x, a0..a3 and output y, built from six multipliers and three adders]
y = 0 ;
for i = 0 .. 3
  y = y + ai × x^i ;
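To make the "size?" question concrete, the sketch below evaluates the polynomial term by term and counts the operators a direct hardware mapping would need. It is an illustrative model, not the lecture's code: `eval_direct` is a hypothetical name, and the sum starts from a0 rather than 0 so that the adder count matches a circuit with one adder per remaining term.

```python
def eval_direct(coeffs, x):
    """Evaluate sum(a_i * x**i) term by term.

    Returns (y, multiplications, additions), where each term a_i * x**i
    is built independently with i multiplications.
    """
    mults = adds = 0
    y = coeffs[0]                      # a0 needs no operator
    for i in range(1, len(coeffs)):
        term = coeffs[i]
        for _ in range(i):             # a_i * x * x * ... (i multiplications)
            term *= x
            mults += 1
        y += term
        adds += 1
    return y, mults, adds


y, mults, adds = eval_direct([1, 2, 3, 4], 2)   # 1 + 2x + 3x^2 + 4x^3 at x = 2
print(y, mults, adds)
```

For a cubic this gives six multiplications and three additions; the multiplier count grows quadratically with the degree.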
wl 2020 2.25
Customisation possibilities
1. exploit algebraic properties
2. enhance parallelism
3. pipelining
Other possibilities
• serialisation
• customise data representation
- non-standard word-length, e.g. 18 bits rather than 32 bits
- non-standard arithmetic, e.g. logarithmic, residue…
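One way to explore a non-standard word-length before committing to hardware is to model it in software. The sketch below quantises a value to signed fixed point with saturation; the 18-bit width comes from the slide, while the split into 16 fractional bits and the helper names `to_fixed` and `from_fixed` are assumptions made for illustration:

```python
def to_fixed(value, total_bits, frac_bits):
    """Round value to signed fixed point, saturating; return the integer code."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))          # most negative code
    hi = (1 << (total_bits - 1)) - 1       # most positive code
    code = round(value * scale)
    return max(lo, min(hi, code))


def from_fixed(code, frac_bits):
    """Convert an integer code back to a real value."""
    return code / (1 << frac_bits)


v = 0.7071
code = to_fixed(v, total_bits=18, frac_bits=16)
error = abs(from_fixed(code, 16) - v)
print(code, error)       # error is at most half of one least-significant bit
```

Comparing the error against the application's tolerance indicates whether the narrower, cheaper datapath is acceptable.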
wl 2020 2.26
1. Algebraic property: Horner’s Rule
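• given: a x + b x = (a + b) x
[Figure: two multipliers feeding an adder, equal to one adder feeding one multiplier]
• then: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x (a1 + x (a2 + a3 x))
[Figure: evaluator with six multipliers and three adders, equal to an evaluator with three multipliers and three adders]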
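A minimal sketch of Horner's rule in software, counting operators so the saving is visible. The function name `eval_horner` is illustrative; the nesting follows the identity a0 + x (a1 + x (a2 + a3 x)):

```python
def eval_horner(coeffs, x):
    """Evaluate a0 + x*(a1 + x*(a2 + ...)) from the innermost bracket outwards.

    Returns (y, multiplications, additions).
    """
    mults = adds = 0
    y = coeffs[-1]                         # start with the highest coefficient
    for a in reversed(coeffs[:-1]):
        y = y * x                          # one multiplier per remaining coefficient
        mults += 1
        y = y + a                          # one adder per remaining coefficient
        adds += 1
    return y, mults, adds


y, mults, adds = eval_horner([1, 2, 3, 4], 2)   # 1 + 2x + 3x^2 + 4x^3 at x = 2
print(y, mults, adds)
```

For a cubic the result is unchanged while the multiplier count drops from six to three; each step is also the same multiply-add unit, which answers the "repeating units?" question.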
wl 2020 2.27
2. Enhance parallelism
[Figure: diagrams of interconnected R components]
wl 2020 2.28
3. Pipelining
• split up combinational circuit: add pipeline registers
• shorter cycle time, assembly-line parallelism, lower power
• pipelined design (if regular: systolic array, more later)
- mandatory: same number of additional registers for all inputs
- preferable: balance delay in different stages
- preferable: addition of registers preserves regularity
[Figure: combinational blocks f, g and h]
Source: M Spivey
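The effect of pipelining on cycle time, and the reason for balancing stage delays, can be sketched with a simple timing model. The stage delays below are invented for illustration, register overhead is taken as negligible, and `pipeline_timing` is a hypothetical helper:

```python
def pipeline_timing(stage_delays_ns, n_items):
    """Return (unpipelined_ns, pipelined_ns) to process n_items.

    Unpipelined: one long cycle per item, equal to the sum of all stage delays.
    Pipelined: the slowest stage sets the clock; the first result appears after
    one cycle per stage, then one result per cycle.
    """
    unpipelined = n_items * sum(stage_delays_ns)
    cycle = max(stage_delays_ns)
    pipelined = (len(stage_delays_ns) + n_items - 1) * cycle
    return unpipelined, pipelined


print(pipeline_timing([2, 2, 2], 1000))   # balanced stages
print(pipeline_timing([1, 4, 1], 1000))   # same total delay, unbalanced stages
```

Both designs have the same total combinational delay, but the unbalanced one is limited by its 4 ns stage and gains much less from pipelining.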
wl 2020 2.29
Horner’s Rule for pipelining?
• given
• then
[Figure: block diagrams for the "given" and "then" cases, built from components P, Q and R]
P and Q are registers, R is a computational component
wl 2020 2.30
Pipelined incrementer: Verilog
• parameterize:
- G groups of N bits
- width = G*N
- bits per stage = N
• Verilog implementation:
- decompose into:
  • upper register triangle
  • chain of incrementers + register (1-stage pipeline)
  • lower register triangle
- only top level shown
- need to manage array indices
[Figure: four 4-bit incrementers chained from cin to cout, taking a[3..0], a[7..4], a[11..8], a[15..12] and producing sum[3..0], sum[7..4], sum[11..8], sum[15..12]; each incrementer is a 1-stage pipeline]

module incr_pipe
  #(parameter G=4, N=4)                            // G groups of N bits
  (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);
  wire [G:0] carry;                                // carry chain
  wire [G*N-1:0] temp1;                            // output of upper delay triangle
  wire [G*N-1:0] temp2;                            // incrementer outputs, input of lower delay triangle
  genvar i;                                        // loop counter
  assign carry[G] = 1'b1;                          // prime carry input
  upper_tri_delay #(G, N) tru (temp1, inp, clk);   // upper reg triangle
  lower_tri_delay #(G, N) trl (outp, temp2, clk);  // lower reg triangle
  generate
    for (i = 0; i < G; i = i + 1)                  // for each group generate
    begin : stage                                  // 1-stage pipelined incrementer
      incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                            temp1[(i+1)*N-1:i*N], carry[G-i], clk);
    end
  endgenerate
endmodule
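The role of the two register triangles can be checked with a small cycle-by-cycle software model. This is a sketch under assumptions, since the slides show only the top level: it assumes group 0 is the least significant, that the upper triangle delays group i by i cycles, that each incrementer stage registers its sum and carry, and that the lower triangle delays group i by G-1-i cycles. The name `simulate_incr_pipe` is hypothetical.

```python
from collections import deque


def simulate_incr_pipe(inputs, G=4, N=4):
    """Cycle-by-cycle model of a G-stage pipelined incrementer.

    Returns the incremented words in input order (latency of G cycles removed).
    """
    mask = (1 << N) - 1
    # Upper triangle: group i is delayed by i cycles before its incrementer.
    upper = [deque([0] * i) for i in range(G)]
    # Lower triangle: group i is delayed by G-1-i cycles so all groups realign.
    lower = [deque([0] * (G - 1 - i)) for i in range(G)]
    stage_sum = [0] * G      # registered sum output of each stage
    stage_cout = [0] * G     # registered carry output of each stage
    outputs = []
    for x in list(inputs) + [0] * G:           # G extra cycles flush the pipeline
        word = 0
        new_sum, new_cout = [0] * G, [0] * G
        for i in range(G):
            # Registered stage outputs feed the lower triangle this cycle.
            lower[i].append(stage_sum[i])
            word |= lower[i].popleft() << (i * N)
            # Delayed input group plus carry from the previous stage's register.
            upper[i].append((x >> (i * N)) & mask)
            cin = 1 if i == 0 else stage_cout[i - 1]
            total = upper[i].popleft() + cin
            new_sum[i], new_cout[i] = total & mask, total >> N
        outputs.append(word)
        stage_sum, stage_cout = new_sum, new_cout   # clock edge
    return outputs[G:]                         # drop the start-up latency


xs = [0, 15, 255, 0xFFFF, 0x0FFF, 1234]
print(simulate_incr_pipe(xs))
```

A new input can be accepted every cycle, and each carry only has to cross N bits per cycle instead of all G*N bits, which is where the shorter cycle time comes from.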
wl 2020 2.31
Concise parametric representation
• given: [P, Q] ; R = R ; Q, where P and Q are registers
[Figure: block diagrams of both sides of the given equation and of the derived equation, built from components P, Q and R]
• then:
[nP, Qn] ; rdrn R = rdrn (2Q ; R)
wl 2020 2.32
Pipelined incrementer: Verilog vs Ruby
• parameterize:
- G groups of N bits
- width = G*N
- bits per stage = N
[Figure: chain of four 4-bit incrementers from cin to cout, with inputs a[3..0] to a[15..12] and outputs sum[3..0] to sum[15..12]]
Verilog: top-level module incr_pipe, listed on slide 2.30
Ruby:
Pipelined_incrementer G N
  = snd (tri G (tri N D)) ;              # upper reg triangle
    row G (row N (halfadd ; snd D) ;     # 1-stage pipelined incrementer
    fst (tri~ G (tri~ N D))              # lower reg triangle
* can generate Verilog or MaxJ!