3/23/99: 1
VLSI ArchitecturePast, Present, and Future
William J. DallyComputer Systems Laboratory
Stanford University
March 23, 1999
3/23/99: 2
Past, Present, and Future
• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
• We expect this trend to continue for the next 20 years
• For the past 20 years, these devices have been applied to implicit parallelism
• We will see a shift toward explicit parallelism over the next 20 years
3/23/99: 3
Technology Evolution
[Figure: wire pitch (µm) and gate length (µm) vs. year, 1960-2010, log scale 10^-1 to 10^3; gate delay (ns) vs. year, log scale 10^-2 to 10^2]
3/23/99: 4
Technology Evolution (2)
Parameter     1979       1999       2019        Units
Gate Length   5          0.2        0.008       µm
Gate Delay    3000       150        7.5         ps
Clock Cycle   200        2.5        0.08        ns
Gates/Clock   67         17         10
Wire Pitch    15         1          0.07        µm
Chip Edge     6          15         38          mm
Grids/Chip    1.6×10^5   2.3×10^8   3.0×10^11
3/23/99: 5
Architecture Evolution
Year   Microprocessor                        High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray 1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS    (× 4?)
2019   X: 10,000 MIPS, 10,000 MFLOPS         MP with 1000 X's
3/23/99: 6
Incremental Returns
[Figure: Performance vs. Processor Cost (Die Area) — diminishing returns moving from pipelined RISC to dual-issue in-order to quad-issue out-of-order]
3/23/99: 7
Efficiency and Granularity
[Figure: Peak Performance vs. System Cost (Die Area) for configurations P+M, 2P+M, and 2P+2M]
3/23/99: 8
VLSI in 1979
3/23/99: 9
VLSI Architecture in 1979
• 5 µm NMOS technology
• 6 mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor
  – 0.5 MIPS
3/23/99: 10
1979-1989: Attack of the Killer Micros
• 50% per year improvement in performance
• Transistors applied to implicit parallelism
  – pipelined processor (10 CPI --> 1 CPI)
  – shortened clock cycle (67 gates/clock --> 30 gates/clock)
• In 1989 a 32-bit processor w/ floating point and caches fits on one chip
  – e.g., i860: 40 MIPS, 40 MFLOPS
  – 5,000,000 grids, 1M transistors (many for memory)
3/23/99: 11
1989-1999: The Era of Diminishing Returns
• 50% per year increase in performance through 1996, but
  – projects delayed, performance below expectations
  – 50% increase in grids, 15% increase in frequency (72% total)
• Squeezing out the last implicit parallelism
  – 2-way to 6-way issue, out-of-order issue, branch prediction
  – 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
  – Intel Pentium II (3-way o-o-o)
  – Compaq 21264 (4-way o-o-o)
3/23/99: 12
1979-1999: Why Implicit Parallelism?
• Opportunity
  – large gap between micros and fastest processors
• Compatibility
  – software pool ready to run on implicitly parallel machines
• Technology
  – not available for fine-grain explicitly parallel machines
3/23/99: 13
1999-2019: Explicit Parallelism Takes Over
• Opportunity
  – no more processor gap
• Technology
  – interconnection, interaction, and shared memory technologies have been proven
3/23/99: 14
Technology for Fine-Grain Parallel Machines
• A collection of workstations does not make a good parallel machine. (BLAGG)
  – Bandwidth - large fraction (0.1) of local memory BW
  – LAtency - small multiple (3) of local memory latency
  – Global mechanisms - sync, fetch-and-op
  – Granularity - of tasks (100 inst) and memory (8MB)
3/23/99: 15
Technology for Parallel Machines: Three Components
• Networks
  – 2 clocks/hop latency
  – 8GB/s global bandwidth
• Interaction mechanisms
  – single-cycle communication and synchronization
• Software
k-ary n-cubes
• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select radix to trade-off diameter, D, against B.
• Latency: T = L/B + D, with average distance D ≈ nk/4 hops in a torus
Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”, IEEE TC, 1990
[Figure: Latency (0-70 cycles) vs. Dimension (0-12) for 4K nodes, L = 256, Bs = 16K — latency is minimized at low dimension]
Delay of Express Channels
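The radix/dimension trade-off on this slide can be checked numerically. The sketch below assumes a simplified model (illustrative constants, not the 1990 paper's exact derivation): average distance D ≈ nk/4 hops for a torus, channel width W = Bs·k/2N wires under a fixed wire bisection, and latency T = D + L/W cycles.

```python
# Sketch of the radix/dimension trade-off for k-ary n-cubes.
# Assumed model (illustration only, not the paper's exact one):
#   T = D + L/W cycles, with average distance D = n*k/4 (torus) and
#   channel width W = Bs*k/(2*N) wires under a fixed wire bisection.

N  = 4096    # nodes (the slide's 4K-node example)
L  = 256     # message length in bits
Bs = 16384   # bisection width in wires (16K)

def latency(n, k):
    D = n * k / 4.0              # average hop count in a torus
    W = Bs * k / (2.0 * N)       # wires per channel at fixed bisection
    return D + L / W             # hops plus serialization time

# All (n, k) with k**n == N, i.e. the ways to fold 4K nodes into a cube.
configs = [(n, k) for n in range(1, 13) for k in range(2, N + 1) if k ** n == N]
best_n, best_k = min(configs, key=lambda nk: latency(*nk))
print(best_n, best_k, latency(best_n, best_k))
```

Under these assumptions the minimum falls at low dimension (n = 3, k = 16), consistent with the slide's point that the radix should be chosen to balance diameter against bandwidth.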
The Torus Routing Chip
• k-ary n-cube topology
  – 2D Torus Network
  – 8bit x 20MHz Channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully Self-Timed Design
• Internal Crossbar Architecture
Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986
The Reliable Router
Dally, Dennison, Harris, Kan, and Xanthopoulos, "Architecture and Implementation of the Reliable Router," Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, "Low-Latency Plesiochronous Data Retiming," ARVLSI 1995
Dennison, Lee, and Dally, "High Performance Bidirectional Signalling in VLSI Systems," SIS 1993
• Fault-tolerant
  – Adaptive routing (adaptation of Duato's algorithm)
  – Link-level retry
  – Unique token protocol
• 32bit x 200MHz channels
  – Simultaneous bidirectional signalling
  – Low latency plesiochronous synchronizers
• Optimistic routing
3/23/99: 20
Equalized 4Gb/s Signaling
End-to-End Latency
• Software sees ~10 µs latency on a 500 ns network
• Heavy compute load associated with sending a message
  – system call
  – buffer allocation
  – synchronization
• Solution: treat the network like memory, not like an I/O device
  – hardware formatting, addressing, and buffer allocation
[Diagram: Tx node (Regs → Send → Net) and Rx node (Buffer → Dispatch)]
3/23/99: 22
Network Summary
• We can build networks with 2-4 clocks/hop latency (12-24 clocks for a 512-node 3-cube)
  – networks faster than main memory access of modern machines
  – need end-to-end hardware support to see this, no 'libraries'
• With high-speed signaling, bandwidth of 4GB/s or more per channel (512GB/s bisection) is easy to achieve
  – nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection constraints to the packaging technology
  – it's hard to beat a 3-D mesh or torus
• This gives us B and LA (of BLAGG)
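The "12-24 clocks for a 512-node 3-cube" figure can be verified with a quick average-distance calculation (a sketch assuming a torus with wraparound links, minimal routing, and uniform random traffic):

```python
# Average hop count in an 8-ary 3-cube (512-node 3-D torus), assuming
# uniform random traffic and minimal routing with wraparound links.
k, n = 8, 3

# Average distance along one ring of k nodes (shorter of two directions).
ring_avg = sum(min(d, k - d) for d in range(k)) / k   # 2.0 for k = 8

# Distances in each dimension are independent, so they add.
avg_hops = n * ring_avg                               # 6.0

# At the slide's 2-4 clocks per hop:
lo, hi = 2 * avg_hops, 4 * avg_hops
print(avg_hops, lo, hi)   # 6.0 12.0 24.0
```

Six average hops at 2-4 clocks each gives exactly the 12-24 clock range the slide quotes.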
3/23/99: 23
The Importance of Mechanisms
[Figure: tasks A and B, serial execution]
3/23/99: 24
The Importance of Mechanisms
[Figure: serial execution of tasks A and B vs. parallel execution with high overhead (0.5) — per-task overhead (OVH), communication (COM), and synchronization (Sync) costs dominate]
3/23/99: 25
The Importance of Mechanisms
[Figure: serial execution of tasks A and B; parallel execution with high overhead (0.5); parallel execution with low overhead (0.062) — only with low-overhead mechanisms does the parallel version win]
3/23/99: 26
Granularity and Cost Effectiveness
• Parallel Computers Built for
  – Capability - run problems that are too big or take too long to solve any other way
    • absolute performance at any cost
  – Capacity - get throughput on lots of small problems
    • performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
  – sublinear speedup
  – economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation
[Diagram: one workstation-size node (M + P + $) vs. four fine-grain nodes, each with a small M and P + $]
3/23/99: 27
MIT J-Machine (1991)
3/23/99: 28
Exploiting fine-grain threads
• Where will the parallelism come from to keep all of these processors busy?
  – ILP - limited to about 5
  – Outer-loop parallelism
    • e.g., domain decomposition
    • requires big problems to get lots of parallelism
• Fine threads
  – make communication and synchronization very fast (1 cycle)
  – break the problem into smaller pieces
  – more parallelism
3/23/99: 29
Mechanism and Granularity Summary
• Fast communication and synchronization mechanisms enable fine-grain task decomposition
  – simplifies programming
  – exposes parallelism
  – facilitates load balance
• Have demonstrated
  – 1-cycle communication and synchronization locally
  – 10-cycle communication, synchronization, and task dispatch across a network
• Physically fine-grain machines have better performance/cost than sequential machines
3/23/99: 30
A 2009 Multicomputer
[Diagram: System = 16 Chips; Chip = 64 Tiles; Tile = Processor + 8MB Memory]
3/23/99: 31
Challenges for the Explicitly Parallel Era
• Compatibility• Managing locality• Parallel software
3/23/99: 32
Compatibility
• Almost no fine-grain parallel software exists
• Writing parallel software is easy
  – with good mechanisms
• Parallelizing sequential software is hard
  – needs to be designed from the ground up
• An incremental migration path
  – run sequential codes with acceptable performance
  – parallelize selected applications for considerable speedup
3/23/99: 33
Performance Depends on Locality
• Applications have data/time-dependent graph structure
  – Sparse-matrix solution
    • non-zero and fill-in structure
  – Logic simulation
    • circuit topology and activity
  – PIC codes
    • structure changes as particles move
  – 'Sort-middle' polygon rendering
    • structure changes as viewpoint moves
3/23/99: 34
Fine-Grain Data Migration: Drift and Diffusion
• Run-time relocation based on pointer use
  – move data at both ends of pointer
  – move control and data
• Each 'relocation cycle'
  – compute drift vector based on pointer use
  – compute diffusion vector based on density potential (Taylor)
  – need to avoid oscillations
• Should data be replicated?
  – not just update vs. invalidate
  – need to duplicate computation to avoid communication
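One relocation cycle of this kind might look like the sketch below. The data structures, constants, and 1-D tile array are entirely hypothetical (the slide does not specify the algorithm): an object drifts one step toward the tile that uses it most, and the move is suppressed when the target tile is already dense, which also damps the oscillations the slide warns about.

```python
# Hypothetical sketch of one drift-and-diffusion relocation cycle on a
# 1-D array of tiles. All names and constants are illustrative
# assumptions; the talk does not give the algorithm.

def relocate(obj_tile, use_counts, density, capacity):
    """obj_tile: current tile of one object.
    use_counts: {tile: pointer uses of this object from that tile}.
    density: mutable list of object counts per tile.
    capacity: per-tile density limit (diffusion constraint)."""
    # Drift: step toward the tile that uses the object most.
    hottest = max(use_counts, key=use_counts.get)
    drift = (hottest > obj_tile) - (hottest < obj_tile)   # -1, 0, or +1
    target = obj_tile + drift
    # Diffusion: refuse the move if the target is too dense, which
    # also avoids ping-ponging between two popular tiles.
    if drift == 0 or density[target] >= capacity:
        return obj_tile
    density[obj_tile] -= 1
    density[target] += 1
    return target

density = [3, 1, 4, 2]                 # objects currently on each tile
new_tile = relocate(1, {0: 1, 3: 9}, density, capacity=5)
print(new_tile, density)
```

Here the object on tile 1 is used mostly from tile 3, so it drifts one step to tile 2; a lower capacity would have pinned it in place instead.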
3/23/99: 35
Migration and Locality
[Figure: Distance (in tiles, 0-6) vs. Migration Period (1-37) for four policies: No Migration, One Step, Hierarchy, Mixed]
3/23/99: 36
Parallel Software: Focus on the Real Problems
• Almost all demanding problems have ample parallelism
• Need to focus on fundamental problems
  – extracting parallelism
  – load balance
  – locality
    • load balance and locality can be covered by excess parallelism
• Avoid incidental issues
  – aggregating tasks to avoid overhead
  – manually managing data movement and replication
  – oversynchronization
3/23/99: 37
Parallel Software: Design Strategy
• A program must be designed for parallelism from the ground up
  – no bottlenecks in the data structures
    • e.g., arrays instead of linked lists
• Data parallelism
  – many for loops (over data, not time) can be forall
  – break dependencies out of the loop
  – synchronize on natural units (no barriers)
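The "over data, not time" distinction can be made concrete with a sketch (illustrative code, not from the talk): in a stencil-style update, the time loop carries a true dependence, but once the dependence is broken out with double buffering, every iteration of the inner loop over data is independent and could run as a forall.

```python
# Illustrative sketch: the time loop is inherently sequential, but the
# data loop has no loop-carried dependence once we double-buffer, so
# each iteration over i is a candidate forall (run sequentially here).

def step(a):
    # Every element of the new array depends only on the OLD array 'a',
    # so the loop over i is independent: a forall, not a for.
    n = len(a)
    return [(a[i - 1] + a[i] + a[(i + 1) % n]) / 3.0 for i in range(n)]

a = [0.0, 0.0, 9.0, 0.0, 0.0]
for t in range(2):          # time loop: sequential by nature
    a = step(a)             # data loop inside: parallel across i
print(a)                    # [1.0, 2.0, 3.0, 2.0, 1.0]
```

Writing the update in place, by contrast, would thread a dependence through the data loop and serialize it, which is exactly the bottleneck the slide says to design out.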
3/23/99: 38
Conclusion: We are on the threshold of the explicitly parallel era
• As in 1979, we expect a 1000-fold increase in 'grids' per chip in the next 20 years
• Unlike 1979, these 'grids' are best applied to explicitly parallel machines
  – Diminishing returns from sequential processors (ILP) - no alternative to explicit parallelism
  – Enabling technologies have been proven
    • interconnection networks, mechanisms, cache coherence
  – Fine-grain machines are more efficient than sequential machines
• Fine-grain machines will be constructed from multi-processor/DRAM chips
• Incremental migration to parallel software