ECE8833 Polymorphous and Many-Core Computer Architecture, Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering. Lecture 4: Billion-Transistor Architecture 97 (Part II)


Page 1: ECE8833 Polymorphous and Many-Core Computer Architecture

ECE8833 Polymorphous and Many-Core Computer Architecture

Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering

Lecture 4 Billion-Transistor Architecture 97 (Part II)

Page 2: ECE8833 Polymorphous and Many-Core Computer Architecture

ECE8833 H.-H. S. Lee 2009

Practitioners’ Groups
Everyone has an acronym!
• IRAM
– Implementation at Berkeley
• CMP
– Led to Sun Niagara and the multicore (r)evolution
• SMT
– Intel HyperThreading (arguably Intel first envisioned the idea), IBM POWER5, Alpha 21464
– Many credit this technology to UCSB’s multistreaming work in the early 1990s
• RAW
– Led to Tilera’s TILE64

Page 3: ECE8833 Polymorphous and Many-Core Computer Architecture


C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick

Page 4: ECE8833 Polymorphous and Many-Core Computer Architecture


Mission Statement

Page 5: ECE8833 Polymorphous and Many-Core Computer Architecture

Future Roadblocks that Inspire IRAM
• Latency issues
– Continually increasing performance gap between processor and memory
– DRAM optimized for density, not speed
• Bandwidth issues
– Off-chip bus: slow and narrow; high capacitance, high energy
– Especially hurts scientific codes, databases, etc.

Page 6: ECE8833 Polymorphous and Many-Core Computer Architecture

IRAM Approach
• Move DRAM closer to the processor
– Enlarge on-chip bandwidth
• Fewer I/O pins
– Smaller package
– Serial interface

Anything look familiar?

Page 7: ECE8833 Polymorphous and Many-Core Computer Architecture

IRAM Chip Design Research
• How much larger and slower is a processor designed in a straight DRAM process vs. a standard logic process?
– Microprocessor fabs offer fast transistors for fast logic and many metal layers for accelerating communication and simplifying power distribution
– DRAM fabs offer many poly layers to give small DRAM cells and low leakage for a low refresh rate
• Speed of the page buffer vs. registers and cache
• New DRAM interface based on fast serial links (2.5 Gbit/s, or ~300 MB/s per pin)
• Quantify the bandwidth vs. area/power tradeoff
• Area overhead for IRAM vs. a DRAM
• Extra power dissipation for IRAM vs. a DRAM
• Performance of IRAM with the same area and power as DRAM (“processor for free”)

Source: David Patterson’s slide in his IRAM Overview talk

Page 8: ECE8833 Polymorphous and Many-Core Computer Architecture

IRAM Architecture Research
• How much slower can a processor with a high-bandwidth memory be and yet be as fast as a conventional computer? (very interesting point)
• Compare memory management schemes (e.g., vector registers, scratch pad, wide TLB/cache)
• Compare schemes for running large programs, i.e., spanning multiple IRAMs
• Quantify the value of compact programs and data (e.g., compact code, on-the-fly compression)
• Quantify pros and cons of a standard instruction set vs. a custom IRAM instruction set

Source: David Patterson’s slide in his IRAM Overview talk

Page 9: ECE8833 Polymorphous and Many-Core Computer Architecture

IRAM Compiler Research
• Explicit SW control of memory management vs. conventional implicit HW designs
– Protection (software fault isolation)
– Paging (dynamic relocation, overlapped I/O accesses)
– “Cache” control (vector registers, scratch pad)
– I/O interrupt/polling
• Evaluate benchmark performance in conjunction with architectural research
– Number crunching (vector vs. superscalar)
– Memory intensive (database, operating system)
– Real-time benchmarks (stability and performance)
– Pointer intensive (GCC compiler)
• Impact of language on IRAM (Fortran 77 vs. HPF, C/C++ vs. Java)

Source: David Patterson’s slide in his IRAM Overview talk

Page 10: ECE8833 Polymorphous and Many-Core Computer Architecture

Potential IRAM Architecture
• “New Model”: VSIW = Very Short Instruction Word!
– Compact: describe N operations with 1 short (vector) instruction
– Predictable: (real-time) performance vs. statistical performance (cache)
– Multimedia ready: choose N×64b, 2N×32b, 4N×16b
– Easy to get high performance; the N operations:
• are independent
• use the same functional unit
• access disjoint registers
• access registers in the same order as previous instructions
• access contiguous memory words or a known pattern
• hide memory latency (and any other latency)
– Compiler technology already developed

Source: David Patterson’s slide in his IRAM talk
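The compactness claim above can be made concrete with a small sketch (plain Python; the names and the vector length are illustrative, not VIRAM's actual ISA): one vector instruction describes up to `vlen` independent element operations that a scalar ISA would encode as separate instructions.

```python
import math

def scalar_instr_count(n):
    # a scalar ISA needs one instruction per element operation
    return n

def vector_instr_count(n, vlen):
    # one vector instruction describes up to vlen independent,
    # same-functional-unit, contiguous element operations
    return math.ceil(n / vlen)

def vector_add(a, b, vlen=4):
    # strip-mined elementwise add: each strip of vlen elements
    # corresponds to "one instruction" in the VSIW sense
    out = []
    for s in range(0, len(a), vlen):
        out.extend(x + y for x, y in zip(a[s:s+vlen], b[s:s+vlen]))
    return out

# 64 element adds: 64 scalar instructions vs. 16 vector instructions
assert scalar_instr_count(64) == 64
assert vector_instr_count(64, vlen=4) == 16
assert vector_add([1, 2, 3], [10, 20, 30]) == [11, 22, 33]
```

The predictability claim follows from the same structure: execution time is a simple function of N and vlen, not of cache hit statistics.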

Page 11: ECE8833 Polymorphous and Many-Core Computer Architecture

Berkeley Vector-Intelligent RAM
Why vector processing?
• Scalable design
• Higher code density
• Runs at a higher clock rate
• Better energy efficiency due to easier clock gating for the vector/scalar units
• Lower die temperature helps keep a good data-retention rate
• On-chip DRAM is sufficient for embedded applications
• Use external off-chip DRAM as secondary memory
– Pages swapped between on-chip and off-chip DRAMs

Page 12: ECE8833 Polymorphous and Many-Core Computer Architecture

VIRAM-1 Floorplan
• 180nm CMOS, 6-layer copper
• 125 million transistors, 325 mm²
• 2 watts @ 200MHz
• 13MB of eDRAM (IBM embedded DRAM macros, each 13Mbit) and 4 vector units (total 8KB of vector registers; ¼ of the 8KB VRF per unit, custom layout)
• VRF = 32×64b or 64×32b or 128×16b
• 64-bit MIPS M5Kc scalar core

[Gebis et al. DAC student contest 04]

Page 13: ECE8833 Polymorphous and Many-Core Computer Architecture


S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, D. M. Tullsen

Page 14: ECE8833 Polymorphous and Many-Core Computer Architecture

SMT Concept vs. Other Alternatives

[Figure: utilization of four functional units (FU1–FU4) over execution time for five threads, comparing a conventional single-threaded superscalar, coarse-grained multithreading (block interleaving), fine-grained multithreading (cycle-by-cycle interleaving), simultaneous multithreading (or Intel’s HT), and a chip multiprocessor (CMP); unused slots represent wasted issue bandwidth.]

• The early SMT idea was developed at UCSB (Mario Nemirovsky’s group, HICSS’94)
• The name SMT was christened by the group at the University of Washington (ISCA’95)

Page 15: ECE8833 Polymorphous and Many-Core Computer Architecture

Exploiting Choice: SMT Inst Fetch Policies
• FIFO, round-robin: simple but may be too naive
• RR.X.Y: fetch from X threads, up to Y instructions each, per cycle
– RR.1.8
– RR.2.4 or RR.4.2
– RR.2.8
• What are the main design and/or performance issues when X > 1?

[Tullsen et al. ISCA96]

Page 16: ECE8833 Polymorphous and Many-Core Computer Architecture

Exploiting Choice: SMT Inst Fetch Policies
• Adaptive fetch policies
– BRCOUNT (reduce wrong-path issuing)
• Count # of branch instructions in the decode/rename/IQ stages
• Give top priority to the thread with the least BRCOUNT
– MISSCOUNT (reduce IQ clog)
• Count # of outstanding D-cache misses
• Give top priority to the thread with the least MISSCOUNT
– ICOUNT (reduce IQ clog)
• Count # of instructions in the decode/rename/IQ stages
• Give top priority to the thread with the least ICOUNT
– IQPOSN (reduce IQ clog)
• Give lowest priority to threads with instructions closest to the head of the INT or FP instruction queues
– Threads with the oldest instructions are most prone to IQ clog
• No counter needed

[Tullsen et al. ISCA96]
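The ICOUNT heuristic above is easy to sketch (illustrative Python; the per-thread counts are invented for the example): each cycle, fetch priority goes to the thread with the fewest instructions in the decode/rename/IQ stages, and an RR.X.Y-style front end then fetches from the top X threads.

```python
def icount_priority(in_flight):
    """Order thread ids by ICOUNT: fewest in-flight
    decode/rename/IQ instructions first (ties broken by id)."""
    return sorted(range(len(in_flight)), key=lambda t: (in_flight[t], t))

def fetch_threads(in_flight, x=2):
    # RR.X.Y-style: fetch from the X highest-priority threads this cycle
    return icount_priority(in_flight)[:x]

# Thread 2 has clogged the IQ; ICOUNT pushes it to the back,
# throttling the thread most likely to clog the queue further.
in_flight = [5, 3, 12, 7]   # instructions in decode/rename/IQ per thread
assert icount_priority(in_flight) == [1, 0, 3, 2]
assert fetch_threads(in_flight, x=2) == [1, 0]
```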

Page 17: ECE8833 Polymorphous and Many-Core Computer Architecture


Exploiting Choice: SMT Inst Fetch Policies

[Tullsen et al. ISCA96]

Page 18: ECE8833 Polymorphous and Many-Core Computer Architecture

Alpha 21464 (EV8)
• Leading-edge process technology
– 1.2 to 2.0 GHz
– 0.125 µm CMOS
– SOI-compatible
– Cu interconnect, 7 metal layers
– Low-k dielectrics
• Chip characteristics
– 1.2V Vdd, 250W (EV6: 72W, EV7: 125W)
– 250 million transistors, 350 mm²
– 1100 signal pins in flip-chip packaging

Slide Source: Dr. Joel Emer

Page 19: ECE8833 Polymorphous and Many-Core Computer Architecture

EV8 Architecture Overview
• Enhanced OoO execution
• 8-wide issue superscalar processor
• Large on-die L2 (1.75MB)
• 8 DRDRAM channels
• On-chip router for system interconnect
• Directory-based ccNUMA for up to 512-way SMP
• 4-way SMT

Slide Source: Dr. Joel Emer

Page 20: ECE8833 Polymorphous and Many-Core Computer Architecture

SMT Pipeline
• Replicated resources
– PCs
– Register maps
• Shared resources
– Register file
– Instruction queue
– First- and second-level caches
– Translation buffers
– Branch predictor

[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire, with per-thread PCs and register maps in front of a shared Icache, Dcache, and register files.]

Slide Source: Dr. Joel Emer

Page 21: ECE8833 Polymorphous and Many-Core Computer Architecture

Intel HyperThreading
• Intel Xeon Processor, Xeon MP Processor, and ATOM
• Enables Simultaneous Multi-Threading (SMT)
– Exploits ILP through TLP (Thread-Level Parallelism)
– Issues and executes multiple threads in the same snapshot
• Appears to be 2 logical processors
• Shares the same execution resources
• Duplicates architectural state and certain microarchitectural state
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table

Page 22: ECE8833 Polymorphous and Many-Core Computer Architecture

Sharing Resources in Intel HT
• P4’s trace cache (TC) and µop ROM are accessed in alternating cycles by the two logical processors, unless one is stalled due to a TC miss
• TLBs shared (tagged with the logical-processor ID) but partitioned
– x86 does not employ ASIDs
– Hard partitioning appears to be the only option to allow HT
• Resources hard-partitioned in half after µops are fetched from the TC:
• µop queue
• ROB (126/2 in P4)
• Load buffer (48/2 in P4)
• Store buffer (24/2 or 32/2 in P4)
• General µop queue and memory µop queue
• Retirement: alternates between the 2 logical processors

Page 23: ECE8833 Polymorphous and Many-Core Computer Architecture

HT in Intel ATOM
• First in-order processor with HT
• HT claimed to enlarge silicon area by 8%
• Claimed 30% performance increase at a 15% power increase
• Shared cache space is contended between threads
• No dedicated multiplier: uses the SIMD multiplier
• No dedicated integer divider: uses the FP divider

[Die photo: 25 mm² at 45nm, showing the 32KB I-cache, 24KB D-cache, and 512KB L2.]

Source: Microprocessor Report and Intel

Page 24: ECE8833 Polymorphous and Many-Core Computer Architecture


L. Hammond, B. A. Nayfeh, K. Olukotun

Page 25: ECE8833 Polymorphous and Many-Core Computer Architecture

Main Argument
• A single thread of control has limited parallelism (ILP is dead)
• The cost of extracting it is prohibitive due to complexity
• Achieve parallelization with SW, not HW
– Inherently parallel multimedia applications
– Widespread multi-tasking OS
– Emerging parallel compilers (ref. SUIF), mainly for loop-level parallelism
• Why not SMT?
– Interconnect delay issues
– Partitioning is less localized than in a CMP
• Use relatively simple single-thread processors
– Exploit only a “modest” amount of ILP
– Execute multiple threads in parallel
• Bottom line

Page 26: ECE8833 Polymorphous and Many-Core Computer Architecture


Architectural Comparison

Page 27: ECE8833 Polymorphous and Many-Core Computer Architecture


Single Chip Multiprocessor

Page 28: ECE8833 Polymorphous and Many-Core Computer Architecture

Commercial CMP (AMD Phenom II Quad-Core)
• AMD K10 (Barcelona) microarchitecture
• Code name “Deneb”
• 45nm process
• 4 cores, private 512KB L2
• Shared 6MB L3 (2MB in Phenom)
• Integrated northbridge
– Up to 4 DIMMs
• Sideband Stack Optimizer (SSO)
– Parallelizes many POPs and PUSHs (which were dependent on each other)
• Converts them into pure load/store instructions
– No uops in the FUs for stack-pointer adjustment
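The SSO idea can be sketched abstractly (hypothetical µop names and a software model, not AMD's actual implementation): a sideband counter tracks the net stack-pointer displacement, so each PUSH/POP becomes an independent load/store at a known offset, and no ALU uops are spent serially adjusting the stack pointer.

```python
def expand_stack_ops(ops):
    """Rewrite PUSH/POP into plain stores/loads at rsp-relative offsets.

    A sideband counter tracks the net rsp displacement, so the
    rewritten memory ops carry no serial dependence on one another
    through the stack pointer. Hypothetical sketch, not AMD's design.
    """
    delta = 0          # sideband rsp displacement, in bytes
    uops = []
    for op in ops:
        if op[0] == "push":            # ("push", reg)
            delta -= 8
            uops.append(("store", op[1], f"[rsp{delta:+d}]"))
        elif op[0] == "pop":           # ("pop", reg)
            uops.append(("load", op[1], f"[rsp{delta:+d}]"))
            delta += 8
    # one final rsp adjustment (or none), instead of one per push/pop
    return uops, delta

uops, delta = expand_stack_ops([("push", "rax"), ("push", "rbx"), ("pop", "rcx")])
assert uops == [("store", "rax", "[rsp-8]"),
                ("store", "rbx", "[rsp-16]"),
                ("load",  "rcx", "[rsp-16]")]
assert delta == -8
```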

Page 29: ECE8833 Polymorphous and Many-Core Computer Architecture

Intel Core i7 (Nehalem)
• 4 cores
• HT support in each core
• 8MB shared L3
• 3 DDR3 channels
• 25.6GB/s memory BW
• Turbo Boost Technology
– New P-state (Performance)
– DVFS when workloads operate under the max power envelope
– Same frequency for all cores

Page 30: ECE8833 Polymorphous and Many-Core Computer Architecture

UltraSPARC T1
• Up to eight cores, each 4-way threaded
• Fine-grained multithreading
– Thread-selection logic takes out threads that encounter long-latency events
– Round-robin, cycle by cycle
– 4 threads in a group share a processing pipeline (SPARC pipe)
• 1.2 GHz (90nm)
• In-order, 8 instructions per cycle (single issue from each core)
• 1 shared FPU
• Caches
– 16KB 4-way 32B-line L1-I
– 8KB 4-way 16B-line L1-D
– Blocking cache (a reason for MT)
– 4-banked 12-way 3MB L2 + 4 memory controllers (shared by all)
– Data moves between the L2 and the cores over an integrated crossbar switch for high throughput (200GB/s)

Page 31: ECE8833 Polymorphous and Many-Core Computer Architecture

UltraSPARC T1 (cont’d)
• Thread-select logic marks a thread inactive based on
– Instruction type
• A predecode bit in the I-cache indicates a long-latency instruction
– Misses
– Traps
– Resource conflicts
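The selection behavior above can be modeled with a tiny sketch (illustrative Python, not Sun's actual logic): round-robin among the four threads of a group, skipping any thread marked inactive by a long-latency event.

```python
def next_thread(last, active):
    """Pick the next thread after `last` in round-robin order,
    skipping threads marked inactive (miss, trap, long-latency op,
    resource conflict)."""
    n = len(active)
    for step in range(1, n + 1):
        t = (last + step) % n
        if active[t]:
            return t
    return None  # all threads of the group are stalled

# threads 1 and 3 are waiting on cache misses
active = [True, False, True, False]
assert next_thread(0, active) == 2
assert next_thread(2, active) == 0
assert next_thread(0, [False] * 4) is None
```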

Page 32: ECE8833 Polymorphous and Many-Core Computer Architecture

UltraSPARC T2
• A fatter version of T1
• 1.4GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (1 FPU per die in T1), 16 INT EUs (8 in T1)
• L2 increased to 8-banked 16-way 4MB, shared
• 8-stage integer pipeline (as opposed to 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/Os, 1831 pins total
• The subsequent T2 Plus supports 2 sockets: 16 cores / 128 threads

Page 33: ECE8833 Polymorphous and Many-Core Computer Architecture

Sun ROCK Processor
• 16 cores, two threads per core
• Hardware scout threading (runahead)
– Invisible to SW
– A long-latency instruction starts an automatic HW scout
• L1 D$ miss
• Micro-DTLB miss
• Divide
– Warms up the branch predictor
– Prefetches memory
• Execute Ahead (EXE)
– Retires independent instructions while scouting
• Simultaneous Speculative Threading (SST) [ISCA’09]
– Two hardware threads for one program
– Runahead speculatively executes under a cache miss
– OoO retirement
• HTM support

Page 34: ECE8833 Polymorphous and Many-Core Computer Architecture

Many-Core Processors: Intel Teraflops (Polaris)
• 2KB data memory and 3KB instruction memory per tile
• No coherence support
• 2 FMACs per tile
• Next-gen will have 3D-integrated memory
– SRAM first
– DRAM in the future

Page 35: ECE8833 Polymorphous and Many-Core Computer Architecture


E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal

Page 36: ECE8833 Polymorphous and Many-Core Computer Architecture

MIT RAW Design Tenets
• Long wires across the chip will be the constraint
• Expose the architecture to software (parallelizing compilers)
– Explicit parallelization
– Pins
– Communication
• Use a tile-based architecture
– Similar designs sponsored by the DARPA PCA program: UT TRIPS, Stanford Smart Memories
• Simple point-to-point static routing network
– One cycle across each tile
– More scalable than a bus
– Harnessed by the compiler with a precise count of wire hops
– A dynamic router supports memory accesses that cannot be analyzed statically

Page 37: ECE8833 Polymorphous and Many-Core Computer Architecture

Application Mapping on RAW

[Figure: one RAW chip simultaneously running a four-way parallelized scalar code region, a two-way threaded Java program, an httpd server, idle tiles in sleep mode (power saving), and a compiler-generated custom datapath pipeline streaming video data to a frame buffer and screen; fast inter-tile ALU forwarding takes 3 cycles.]

[Taylor IEEE MICRO’02]

Page 38: ECE8833 Polymorphous and Many-Core Computer Architecture

Scalar Operand Network Design

[Figure: evolution of the scalar operand network (SON): non-pipelined; pipelined with a bypass link; pipelined with a bypass link and multiple ALUs, at which point lots of live values reside in the SON.]

[Taylor et al. HPCA’03]

Page 39: ECE8833 Polymorphous and Many-Core Computer Architecture

Communication Scalability Issue
• RB (# of result buses) × WS (window size) comparisons made per cycle
• Long, dense wires elongate cycle time
– Pipeline the wire
• The cost of processing incoming information is high
• A similar problem exists in bus-based snoopy cache protocols

[Figure: wakeup/bypass hardware showing the routing area, large MUXes, and complex compare logic.]

Page 40: ECE8833 Polymorphous and Many-Core Computer Architecture

Scalar Operand Network

[Figure: a multiscalar operand network (distributed ILP machine) connecting register files by broadcast, contrasted with a scalar operand network on a 2-D point-to-point interconnect (e.g., Raw or TRIPS), where each tile’s register file connects through a switch.]

[Taylor et al. HPCA’03]

Page 41: ECE8833 Polymorphous and Many-Core Computer Architecture

Mapping Operations to a Tile-Based Architecture
• Done at
– Compile time (RAW)
– Or runtime
• “Point-to-point” 2-D mesh
• Tradeoffs
– Computation vs. communication
– Compute affinity (data flows through fewer hops)
• How to maintain control flow

Example scalar code, with its operations (ld a, ld b, *, >>, +, st b, st b) spread across the tiles’ register files:

i = a[j];
q = b[i];
r = q + j;
s = q >> 3;
t = r * s;
b[j] = i;
b[t] = t;

Page 42: ECE8833 Polymorphous and Many-Core Computer Architecture

RAW Core-to-Core Communication
• Static router
– Place-and-route wires by software
– Point-to-point scalar transport
– Compilers (or assembly writers) handle predictable communication
• Dynamic router
– Transports dynamic, unpredictable operations
• Interrupts
• Cache misses
– For communication that is unpredictable at compile time

Page 43: ECE8833 Polymorphous and Many-Core Computer Architecture

Architectural Comparison
• Raw replaces the bus of a superscalar with a switched network
• The switched network is tightly integrated into the processor’s pipeline to support single-cycle message injection and receive operations
• Raw software (the compiler) has to implement functions such as instruction scheduling and dependency checking
• Raw yields complexity to software so that more hardware can be used for ALUs and memory

[Figure: side-by-side organization of RAW, a superscalar, and a multiprocessor.]

Page 44: ECE8833 Polymorphous and Many-Core Computer Architecture

RAW’s Four On-Chip Mesh Networks

[Figure: tile array with a compute pipeline per tile; network links are registered at the input, the longest wire equals the length of one tile, and each link provides 8 32-bit channels.]

[Slide Source: Michael B. Taylor]

Page 45: ECE8833 Polymorphous and Many-Core Computer Architecture


Raw Architecture

[Slide Source: Volker Strumpen]

Page 46: ECE8833 Polymorphous and Many-Core Computer Architecture

Raw Compute Processor Pipeline

[Figure: the compute processor pipeline, showing the fast ALU-to-network path (4 cycles), registers R24–R27 mapped to the 4 on-chip physical networks, and the 0-cycle local bypass.]

[Taylor IEEE MICRO’02]

Page 47: ECE8833 Polymorphous and Many-Core Computer Architecture

RAW Processor Tile
Each tile contains
• Tile processor
– 32-bit MIPS, 8-stage in-order, single issue
– 32KB instruction memory
– 32KB data cache (not coherent, user managed)
• Switch processor
– 8K-instruction memory
– Executes basic move and branch instructions
– Transfers between the local switch and neighbor switches
• Dynamic router
– Hardware controlled (not directly under the programmer’s control)

Page 48: ECE8833 Polymorphous and Many-Core Computer Architecture

Raw Programming
• Compute the sum c = a + b across four tiles:
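The slide's Raw code itself is an image and is not reproduced here; as a loose stand-in, the shape of the computation can be mimicked in plain Python (a hypothetical tile/static-network model, not Raw assembly): each of four tiles adds its slice of a and b and injects the results into a "static network" modeled as a FIFO queue.

```python
from queue import Queue

def tile(tid, a_slice, b_slice, snet):
    # each tile computes its portion of c = a + b and injects
    # the results into the static network in program order
    for x, y in zip(a_slice, b_slice):
        snet.put((tid, x + y))

a = list(range(8))
b = [10 * x for x in a]
snet = Queue()
for tid in range(4):                 # four tiles, two elements each
    tile(tid, a[2*tid:2*tid+2], b[2*tid:2*tid+2], snet)

# drain the network in arrival order to assemble c
c = []
while not snet.empty():
    _tid, val = snet.get()
    c.append(val)
assert c == [x + y for x, y in zip(a, b)]
```

On real Raw hardware the routes are compiled into the switch processors and the tiles run concurrently; the FIFO here only stands in for the in-order delivery the static network guarantees.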

Page 49: ECE8833 Polymorphous and Many-Core Computer Architecture

Data Path: Zoom 1
• Stateful hardware: local data memory (a, c), register (b), and both static networks (snet1 and snet2)

Page 50: ECE8833 Polymorphous and Many-Core Computer Architecture


Zoom 2: Processor Datapaths

Page 51: ECE8833 Polymorphous and Many-Core Computer Architecture


Zoom 2: Switch Datapaths (+-tile processor)

Page 52: ECE8833 Polymorphous and Many-Core Computer Architecture


Raw Assembly

Page 53: ECE8833 Polymorphous and Many-Core Computer Architecture

RAW On-Chip Network
• 2-D mesh
– Longest wire is no greater than one side of a tile
– Worst case: 6 hops (or cycles) for 16 tiles
• 2 static routers, “point-to-point”; each has
– A 64KB SW-managed instruction cache
– A pair of routing crossbars
– Example (Tile 0 sends to Tile 1):

  Tile 0 (sender):
    or   $csto, $0, $5
    nop  route $csto->$cEo2    #SWITCH0
  Tile 1 (receiver):
    nop  route $cWi2->$csti2   #SWITCH1
    and  $5, $5, $csti2

• 2 dynamic routers
– Dimension-ordered routing by hardware
– Example (Tile 0 sends to Tile 15):

  Tile 0 (sender):
    lui  $3, $0, 15
    ihdr $cgno, $3, 0x0200     #header, msg len=2
    or   $cgno, $0, $9         #send word1
    ld   $cgno, $0, $csti      #send word2
  Tile 15 (receiver):
    or   $2, $cgni, $0         #word1
    or   $3, $cgni, $0         #word2
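Dimension-ordered (XY) routing, which the dynamic router implements in hardware, is easy to sketch (illustrative Python): route fully in X first, then in Y. On a 4×4 mesh the worst case is 3 + 3 = 6 hops, matching the bullet above.

```python
def xy_route(src, dst, width=4):
    """Return the hop-by-hop path of tile ids from `src` to `dst`
    using dimension-ordered (X-first, then Y) routing on a mesh."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    path = []
    while sx != dx:                     # resolve the X dimension first
        sx += 1 if dx > sx else -1
        path.append(sy * width + sx)
    while sy != dy:                     # then the Y dimension
        sy += 1 if dy > sy else -1
        path.append(sy * width + sx)
    return path

# Tile 0 -> tile 15 on a 4x4 mesh: 3 X-hops + 3 Y-hops = 6 hops,
# the worst case cited for 16 tiles.
assert xy_route(0, 15) == [1, 2, 3, 7, 11, 15]
assert len(xy_route(0, 15)) == 6
```

Because every packet between the same pair of tiles takes the same X-then-Y path, the routing is deadlock-free without virtual channels, which keeps the hardware router simple.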

Page 54: ECE8833 Polymorphous and Many-Core Computer Architecture

Control Orchestration Optimization
• Orchestrated by the Raw compiler
• Control localization
– Hide a control-flow sequence within a “macro-instruction” (macroins) assigned to a tile; to the rest of the system it looks like one instruction

[Lee et al. ASPLOS’98]

Page 55: ECE8833 Polymorphous and Many-Core Computer Architecture

Example of RAW Compiler Transformation

Source code:
y = a+b; z = a*a; a = y*a*5; y = y*b*6;

After the initial code transformation (renaming into single-assignment form):
read(a); read(b); y_1 = a+b; z_1 = a*a; tmp_1 = y_1*a; a_1 = tmp_1*5; tmp_2 = y_1*b; y_2 = tmp_2*6; write(z); write(a); write(y)

Compiler phases: Initial Code Transformation → Instruction Partitioner → Global Data Partitioner → Data & Inst Placer → Communication Code Gen → Event Scheduler

The Instruction Partitioner splits this into two candidate streams:
Stream 1: read(a); z_1 = a*a; write(z); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
Stream 2: read(b); y_1 = a+b; tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)

[Lee et al. ASPLOS’98]

Page 56: ECE8833 Polymorphous and Many-Core Computer Architecture

Example of RAW Compiler Transformation (cont’d)

The Global Data Partitioner assigns the data sets {a, z} and {b, y} to the two streams; the Data & Inst Placer then binds them to processors P0 and P1:

P0 ({a, z}): read(a); z_1 = a*a; write(z); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
P1 ({b, y}): read(b); y_1 = a+b; tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)

[Lee et al. ASPLOS’98]

Page 57: ECE8833 Polymorphous and Many-Core Computer Architecture

Example of RAW Compiler Transformation (cont’d)

Communication Code Gen inserts the send/receive operations and the matching static-switch routes (S0, S1): P0 sends a to P1, and P1 sends y_1 back to P0:

P0: read(a); send(a); z_1 = a*a; write(z); y_1 = rcv(); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
S0: route(P0,S1); route(S1,P0)
S1: route(S0,P1); route(P1,S0)
P1: read(b); a = rcv(); y_1 = a+b; send(y_1); tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)

[Lee et al. ASPLOS’98]

Page 58: ECE8833 Polymorphous and Many-Core Computer Architecture

Example of RAW Compiler Transformation (cont’d)

The Event Scheduler orders the computation, the sends/receives, and the switch routes on each resource (P0, S0, S1, P1):

P0: read(a); send(a); z_1 = a*a; write(z); y_1 = rcv(); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
S0: route(P0,S1); route(S1,P0)
S1: route(S0,P1); route(P1,S0)
P1: read(b); a = rcv(); y_1 = a+b; send(y_1); tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)

[Lee et al. ASPLOS’98]

Page 59: ECE8833 Polymorphous and Many-Core Computer Architecture

Raw Compiler Example

tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
…

[Figure: the expression dataflow graph for this code, with renamed nodes such as seed.0 = seed, pval1 = seed.0*3.0, pval0 = pval1+2.0, tmp0.1 = pval0/2.0, pval5 = seed.0*6.0, pval4 = pval5+2.0, tmp3.6 = pval4/3.0, and so on, assigned to the tiles to maximize locality. The compiler then generates the static-router instructions that transfer operands and streams between tiles.]

[Slide Source: Michael B. Taylor]

Page 60: ECE8833 Polymorphous and Many-Core Computer Architecture

Scalability

Just stamp out more tiles! One cycle per hop: 16 tiles at 180 nm, 64 tiles at 90 nm.

Longest wire, frequency, and design and verification complexity are all independent of issue width. The architecture is backwards compatible.

[Slide Source: Michael B. Taylor]