custom computing systemswl/teachlocal/cuscomp/notes/cc... · 2020-01-14 · wl 2020 2.12 custom...

wl 2020 2.1

Custom computing systems

• difference engine: Charles Babbage, 1832 - computes mathematical tables

• digital orrery: MIT, 1985 - special-purpose engine; found Pluto's motion to be chaotic

• Splash2: Supercomputing Research Center, 1993 - multi-FPGA engine for video processing, DNA computing etc.

• Harp1: Oxford University, 1995 - FPGA + microprocessor (transputer)

• SONIC, UltraSonic: Sony + Imperial College, 1999-2002 - multi-FPGA, professional video processing

• MaxWorkstation, MaxNode: 2011; Max5: 2017 - FPGA cards adopted by JP Morgan, Amazon…

wl 2020 2.2

The Exaflop Supercomputer (2022)

• 1 exaflop = 10^{18} FLOPS (TaihuLight: 93 petaflops)

• using processor cores with 8 FLOPS/clock at 2.5 GHz

• 50M CPU cores

• what about power?

- assume power envelope of 100 W per chip

- Moore’s Law scaling: 6 cores today → ~100 cores/chip

- 500k CPU chips

• 50 MW (just for CPUs!), 100 MW likely

• ‘TaihuLight’ power consumption: 15 MW

source: Maxeler
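The core, chip and power counts on this slide follow from a few divisions. A minimal sketch of that arithmetic, using only the slide's figures as defaults (the function and parameter names are illustrative, not from the lecture):

```python
def exaflop_estimate(target_flops=1e18, flops_per_clock=8, clock_hz=2.5e9,
                     cores_per_chip=100, watts_per_chip=100.0):
    """Return (cores, chips, power in MW) needed to reach target_flops."""
    flops_per_core = flops_per_clock * clock_hz   # 8 x 2.5 GHz = 20 GFLOPS per core
    cores = target_flops / flops_per_core         # cores needed for one exaflop
    chips = cores / cores_per_chip                # assuming ~100 cores per chip
    power_mw = chips * watts_per_chip / 1e6       # CPU power only, in megawatts
    return cores, chips, power_mw


cores, chips, power_mw = exaflop_estimate()
print(f"{cores:.0f} cores, {chips:.0f} chips, {power_mw:.0f} MW")
```

Changing `cores_per_chip` or `watts_per_chip` shows how sensitive the 50 MW figure is to the scaling assumptions.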

wl 2020 2.3

The Exaflop Supercomputer (2018)

• 1 exaflop = 10^{18} FLOPS

• using processor cores with 8 FLOPS/clock at 2.5 GHz

• 50M CPU cores

• what about power?

- assume power envelope of 100 W per chip

- Moore’s Law scaling: 6 cores today → ~100 cores/chip

- 500k CPU chips

• 50 MW (just for CPUs!), 100 MW likely

• ‘TaihuLight’ power consumption: 15 MW

How do we program this?

Who pays for this?

source: Maxeler

wl 2020 2.4

Technology comparison

[Figure: technology comparison chart. Legend: DSP = Digital Signal Processor; Dedicated HW = ASIC/FPGA]

wl 2020 2.5

Intel 6-Core X5680 “Westmere”

[Figure: annotated die diagram. Six cores with a shared L3 cache and an uncore region containing the memory controller, I/O and QPI. Blocks labelled within a core: execution units; out-of-order scheduling & retirement; L1 data cache; memory ordering and execution; instruction decode and microcode; L2 cache & interrupt servicing; paging; branch prediction; instruction fetch & L1 cache. Callouts: “Computation”, “Core”.]

wl 2020 2.6

A special purpose computer

• a chip customised for a specific application

• no instructions → no instruction decode logic

• no branches → no branch prediction

• explicit parallelism → no out-of-order scheduling

• data streamed onto chip → no multi-level caches

[Figure: a MyApplication chip connected to (lots of) memory and to the rest of the world]

source: Maxeler

wl 2020 2.7

• but we have more than one application

• impractical to optimise machines for only one application

- need to run many applications in a typical system

[Figure: several MyApplication chips and an OtherApplication chip, each with its own memory, linked by a network and connected to the rest of the world]

source: Maxeler

wl 2020 2.8

• use reconfigurable chip: reprogram at runtime to implement:

- different applications, or

- different versions of the same application

[Figure: one chip (Config 1) with memory and network, with alternative configurations optimized for Application A, B, C, D and E]

source: Maxeler

wl 2020 2.9

Instruction processors

source: Maxeler

wl 2020 2.10

Dataflow/stream processors

source: Maxeler

wl 2020 2.11

Accelerating real applications

                          Lines of code

Total application         1,000,000

Kernel to accelerate      2,000

Software to restructure   20,000

• CPUs are good for:

- latency-sensitive, control-intensive, non-repetitive code

• dataflow engines are good for:

- high-throughput repetitive processing on large data volumes

a system should contain both

source: Maxeler

wl 2020 2.12

Custom computing in a PC

where is the Custom Architecture?

• on-chip with access to register file

• co-processor w/ access to level 1 cache

• next to level 2 cache

• in adjacent processor socket, connected using QPI/Hypertransport

• as Memory Controller, not North/South Bridge

• as main memory (DIMMs)

• as a peripheral on PCI Express bus

• inside peripheral, e.g. customizable disk controller

[Figure: PC block diagram: processor with register file, L1$ and L2$; North/South Bridge; PCI bus; disk; DIMMs]

wl 2020 2.13

Embedded systems

• partition programs into software and hardware (custom architecture)

- hardware/software co-design

• System-on-Chip: SoC (cover later)

• custom architecture as extension of the processor instruction set

[Figure: processor with register file, with data and instruction connections to the custom architecture]

wl 2020 2.14

Where to locate custom architecture?

• depends on the application

- avoid system bottleneck for the application

• possible bottlenecks

- memory access latency

- memory access bandwidth

- memory size

- processor local memory size

- processor ALU resource

- processor ALU operation latency

- various bus bandwidths
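One rough way to see which of these resources limits an application is to estimate the time each would need on its own and take the largest. The sketch below does that for three of the bottlenecks listed; it assumes the stages overlap fully so that the slowest one dominates, and all numbers and names are hypothetical, chosen only for illustration.

```python
def stage_times(n_bytes, n_ops, mem_bandwidth, bus_bandwidth, ops_per_sec):
    """Time in seconds each resource would need if it were the only limit."""
    return {
        "memory bandwidth": n_bytes / mem_bandwidth,
        "bus bandwidth": n_bytes / bus_bandwidth,
        "ALU resource": n_ops / ops_per_sec,
    }


def bottleneck(times):
    """Name of the resource with the largest time, i.e. the limiting one."""
    return max(times, key=times.get)


# Hypothetical workload: 8 GB of data, 1e9 operations
times = stage_times(n_bytes=8e9, n_ops=1e9,
                    mem_bandwidth=40e9,   # bytes/s, assumed
                    bus_bandwidth=8e9,    # bytes/s, assumed
                    ops_per_sec=100e9)    # operations/s, assumed
print(bottleneck(times), times)
```

With these assumed figures the bus dominates, which would argue for placing the custom architecture closer to the data rather than behind that bus.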

wl 2020 2.15

Bottleneck example: Bing page ranking

source: Microsoft

wl 2020 2.16

Reconfigurable computing with FPGAs

[Figure: Xilinx Virtex-6 FPGA. Labelled resources: DSP blocks, block RAM (20 TB/s), IO blocks, logic cells (10^5 elements)]

wl 2020 2.17

High density compute with FPGAs: examples

• 1U form factor for racks

DFE: Data Flow Engine

source: Maxeler

wl 2020 2.18

How could we program it?

• schematic entry of circuits

• hardware description languages

- VHDL, Verilog, SystemC

• object-oriented languages

- C/C++, Python, Java, and related languages

• dataflow languages: e.g. MaxJ, OpenSPL

• functional languages: e.g. Haskell, Ruby

• high level interface: e.g. Mathematica, MatLab

• schematic block diagram: e.g. Simulink

• domain specific languages (DSLs)

wl 2020 2.19

Accelerator programming models

[Figure: layers plotted against two axes, “Level of Abstraction” and “Possible applications”. Layers shown: flexible compiler system (MaxCompiler/Ruby), higher level libraries, DSLs.]

wl 2020 2.20

Acceleration development flow

[Figure: flowchart. Start → Original Application → Identify code for acceleration and analyze bottlenecks → Transform app, architect and model performance → Write accelerator code → Simulate → Functions correctly? (YES/NO) → Build for Hardware → Integrate with Host code → Meets performance goals? (YES/NO) → Accelerated Application]

source: Maxeler

wl 2020 2.21

Acceleration development flow

[Figure: the same flowchart. Start → Original Application → Identify code for acceleration and analyze bottlenecks → Transform app, architect and model performance → Write accelerator code → Simulate → Functions correctly? (YES/NO) → Build for Hardware → Integrate with Host code → Meets performance goals? (YES/NO) → Accelerated Application. Annotation: “Mainly for project”.]

source: Maxeler

wl 2020 2.22

Customisation techniques

• FPGA technology offers customisation opportunities

- some data may remain constant: e.g. algebraic simplification

- adopt different data structures: e.g. number representation

- transform: e.g. enhance parallelism, pipelining, serialisation

• reuse possibilities (more next lecture)

- description: repeating unit, parametrisation

- transforms: patterns, laws, proofs

• example: polynomial evaluation for numbers a_i, x: y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 (repeat many times)

wl 2020 2.23

Performance estimation

• clocked circuit: no combinational loops

• gates have delay, and speed limited by propagation delay through the slowest combinational path

• slowest path: usually carry path

• clock rate: approx. 1/(delay of slowest path), assuming

- edge-triggered design

- register propagation delay, set-up time, clock skew etc assumed negligible

• lowest level: logic gates, do not worry about transistors
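Under these assumptions the clock-rate estimate is a one-line calculation. A minimal sketch, with register delay, set-up time and clock skew neglected as stated above; the path names and delays are made up for illustration:

```python
def max_clock_rate_hz(path_delays_ns):
    """Approximate clock rate as 1 / (delay of the slowest combinational path)."""
    slowest_ns = max(path_delays_ns)
    return 1e9 / slowest_ns   # convert a period in nanoseconds to a rate in Hz


# Hypothetical combinational paths and their delays in nanoseconds
paths = {"carry chain": 4.0, "control": 1.5, "mux": 2.5}
rate = max_clock_rate_hz(paths.values())
print(f"{rate / 1e6:.0f} MHz, limited by {max(paths, key=paths.get)}")
```

Here the carry chain sets the limit, matching the observation that the slowest path is usually the carry path.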

wl 2020 2.24

First polynomial evaluator

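• compute y = a_0 + a_1 x + a_2 x^2 + a_3 x^3

• simplification: assume x constant

• problems: speed? size? repeating units?

[Figure: direct evaluator circuit with coefficient inputs a0..a3 and output y, built from six multipliers and three adders]

    y = 0 ;

    for i = 0 .. 3

        y = y + a_i × x^i ;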
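The loop above can be transcribed directly into software to count how many multiplications the direct form needs. This is a sketch for illustration only: it forms each power of x by repeated multiplication, which is one reading of the circuit, and the coefficient values are arbitrary.

```python
def direct_eval(coeffs, x):
    """Evaluate sum(a_i * x**i), forming each term by repeated multiplication.

    Returns (value, number of multiplications used).
    """
    y = 0
    mults = 0
    for i, a in enumerate(coeffs):
        term = a
        for _ in range(i):      # multiply by x, i times, to form a_i * x^i
            term *= x
            mults += 1
        y += term
    return y, mults


value, mults = direct_eval([1, 2, 3, 4], x=2)   # 1 + 2*2 + 3*4 + 4*8
print(value, mults)
```

For a cubic this uses 0 + 1 + 2 + 3 = 6 multiplications, which is the size problem the slide raises.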

wl 2020 2.25

Customisation possibilities

1. exploit algebraic properties

2. enhance parallelism

3. pipelining

Other possibilities

• serialisation

• customise data representation

- non-standard word-length, e.g. 18 bits rather than 32 bits

- non-standard arithmetic, e.g. logarithmic, residue…
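To make the word-length trade-off concrete, the sketch below converts a real value to a signed fixed-point number of a chosen width and reports the rounding error. The split between integer and fraction bits, the saturation behaviour and all names are assumptions made for the example, not part of the lecture.

```python
def to_fixed(value, word_length=18, frac_bits=12):
    """Quantise value to a signed fixed-point integer, saturating on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (word_length - 1))         # most negative representable code
    hi = (1 << (word_length - 1)) - 1      # most positive representable code
    raw = round(value * scale)
    return max(lo, min(hi, raw))


def from_fixed(raw, frac_bits=12):
    """Convert a fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


x = 3.14159
for word_length, frac_bits in [(18, 12), (32, 26)]:
    raw = to_fixed(x, word_length, frac_bits)
    err = abs(from_fixed(raw, frac_bits) - x)
    print(f"{word_length} bits: code {raw}, error {err:.2e}")
```

The narrower format has a larger rounding error but needs fewer logic resources per operator, which is the kind of trade an FPGA design can exploit when the application tolerates it.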

wl 2020 2.26

1. Algebraic property: Horner’s Rule

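• given: a x + b x = (a + b) x

[Figure: two multipliers feeding an adder are replaced by one adder feeding one multiplier]

• then: a_0 + a_1 x + a_2 x^2 + a_3 x^3 = a_0 + x (a_1 + x (a_2 + a_3 x))

[Figure: the direct evaluator (six multipliers, three adders) is transformed into the Horner form (three multipliers, three adders)]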
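Horner's Rule can be checked in software by evaluating the nested form and counting multiplications. A minimal sketch with arbitrary coefficients (the function name is illustrative):

```python
def horner_eval(coeffs, x):
    """Evaluate a_0 + x*(a_1 + x*(a_2 + ...)) from the innermost bracket outward.

    coeffs is [a_0, a_1, ..., a_n]. Returns (value, number of multiplications).
    """
    y = coeffs[-1]                  # start from the highest-order coefficient
    mults = 0
    for a in reversed(coeffs[:-1]):
        y = y * x + a               # one multiply and one add per remaining coefficient
        mults += 1
    return y, mults


coeffs = [1, 2, 3, 4]
value, mults = horner_eval(coeffs, x=2)
print(value, mults)
```

A cubic needs only three multiplications in this form, and every step is the same multiply-add unit, which gives the repeating structure that later slides build on.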

wl 2020 2.27

2. Enhance parallelism

[Figure: arrangements of repeated components R]

wl 2020 2.28

3. Pipelining

• split up combinational circuit: add pipeline registers

• shorter cycle time, assembly-line parallelism, lower power

• pipelined design (if regular: systolic array, more later)

- mandatory: same number of additional registers for all inputs

- preferable: balance delay in different stages

- preferable: addition of registers preserves regularity

[Figure: combinational blocks f, g and h]

Source: M Spivey
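The effect of adding pipeline registers on cycle time and throughput can be estimated with a simple model. The sketch assumes three blocks in series (named after f, g and h in the figure, whose exact arrangement is not recoverable here), neglects register overhead, and uses made-up delays.

```python
def unpipelined(stage_delays_ns, n_items):
    """One long combinational path: cycle time is the sum of all stage delays."""
    cycle = sum(stage_delays_ns)
    return cycle, n_items * cycle


def pipelined(stage_delays_ns, n_items):
    """Registers between stages: cycle time is set by the slowest stage."""
    cycle = max(stage_delays_ns)
    # the first result appears after len(stages) cycles, then one result per cycle
    total = (len(stage_delays_ns) + n_items - 1) * cycle
    return cycle, total


delays = [2.0, 3.0, 2.5]            # hypothetical delays of f, g, h in ns
print(unpipelined(delays, 1000))    # cycle time and total time in ns
print(pipelined(delays, 1000))
```

The gap between the slowest stage (3.0 ns) and the average (2.5 ns) is why balancing the delay across stages is listed as preferable.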

wl 2020 2.29

Horner’s Rule for pipelining?

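P and Q are registers, R is computational component

[Figure: the “given” diagram relates a single component R to registers P and Q; the “then” diagram shows the corresponding arrangement for a chain of R components]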

wl 2020 2.30

Pipelined incrementer: Verilog

• parameterize:

- G groups of N bits

- width = G*N

- bits per stage = N

• Verilog implementation:

- decompose into:

• upper register triangle

• chain of incrementers + register (1-stage pipeline)

• lower register triangle

- only top level shown

- need to manage array indices

    module incr_pipe
      #(parameter G = 4, N = 4)                  // G groups of N bits
       (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);

      wire [G:0]     carry;                      // carry chain
      wire [G*N-1:0] temp1;                      // output of upper delay triangle
      wire [G*N-1:0] temp2;                      // stage outputs, input of lower delay triangle
      genvar i;                                  // loop counter

      assign carry[G] = 1'b1;                    // prime carry input

      upper_tri_delay #(G, N) tru (temp1, inp, clk);    // upper reg triangle
      lower_tri_delay #(G, N) trl (outp, temp2, clk);   // lower reg triangle

      generate
        for (i = 0; i < G; i = i + 1)            // for each group generate
        begin : stage                            // 1-stage pipelined incrementer
          incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                                temp1[(i+1)*N-1:i*N], carry[G-i], clk);
        end
      endgenerate
    endmodule

[Figure: chain of four 4-bit incrementers, each a 1-stage pipeline, taking a[3..0], a[7..4], a[11..8] and a[15..12] and producing sum[3..0], sum[7..4], sum[11..8] and sum[15..12], with cin at one end of the carry chain and cout at the other]
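The grouping of the incrementer into G groups of N bits can be checked with a behavioural model. The sketch below is not cycle-accurate and does not model the register triangles; it only shows that rippling a carry between N-bit groups, least significant group first, gives the same result as adding 1 to the full word. The function name and return convention are assumptions for illustration.

```python
def incr_groups(value, G=4, N=4):
    """Increment a G*N-bit value one N-bit group at a time.

    Returns (result, carry_out). In the pipelined hardware each group would be
    handled in a successive clock cycle; here the loop index stands in for time.
    """
    mask = (1 << N) - 1
    carry = 1                                  # primed carry input: add 1
    result = 0
    for i in range(G):                         # least significant group first
        group = (value >> (i * N)) & mask
        total = group + carry
        result |= (total & mask) << (i * N)    # N-bit sum of this group
        carry = total >> N                     # carry into the next group
    return result, carry


print(incr_groups(0x00FF))   # carry ripples from group 0 through group 1
print(incr_groups(0xFFFF))   # wraps to zero with carry out
```

Because group i needs the carry from group i-1, a hardware version has to delay the inputs of later groups and the outputs of earlier ones, which appears to be the role of the upper and lower register triangles in the Verilog.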

wl 2020 2.31

Concise parametric representation

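• given: [P, Q] ; R = R ; Q, where P and Q are registers

• then: [nP, Qn] ; rdrn R = rdrn (2Q ; R)

[Figure: the “given” diagram relates a single component R to registers P and Q; the “then” diagram shows the corresponding arrangement for a chain of R components]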

wl 2020 2.32

Pipelined incrementer: Verilog vs Ruby

• parameterize:

- G groups of N bits

- width = G*N

- bits per stage = N

[Figure: chain of four 4-bit incrementers taking a[3..0], a[7..4], a[11..8] and a[15..12] and producing sum[3..0], sum[7..4], sum[11..8] and sum[15..12], with cin at one end of the carry chain and cout at the other]

Verilog:

    module incr_pipe
      #(parameter G = 4, N = 4)                  // G groups of N bits
       (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);

      wire [G:0]     carry;                      // carry chain
      wire [G*N-1:0] temp1;                      // output of upper delay triangle
      wire [G*N-1:0] temp2;                      // stage outputs, input of lower delay triangle
      genvar i;                                  // loop counter

      assign carry[G] = 1'b1;                    // prime carry input

      upper_tri_delay #(G, N) tru (temp1, inp, clk);    // upper reg triangle
      lower_tri_delay #(G, N) trl (outp, temp2, clk);   // lower reg triangle

      generate
        for (i = 0; i < G; i = i + 1)            // for each group generate
        begin : stage                            // 1-stage pipelined incrementer
          incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                                temp1[(i+1)*N-1:i*N], carry[G-i], clk);
        end
      endgenerate
    endmodule

Ruby:

    Pipelined_incrementer G N
      = snd (tri G (tri N D)) ;            # upper reg triangle
        row G (row N (halfadd ; snd D) ;   # 1-stage pipelined incrementer
        fst (tri~ G (tri~ N D))            # lower reg triangle

* can generate Verilog or MaxJ!
