3/23/99: 1
VLSI ArchitecturePast, Present, and Future
William J. DallyComputer Systems Laboratory
Stanford University
March 23, 1999
3/23/99: 2
Past, Present, and Future
• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
• We expect this trend to continue for the next 20 years
• For the past 20 years, these devices have been applied to implicit parallelism
• We will see a shift toward explicit parallelism over the next 20 years
3/23/99: 3
Technology Evolution
[Figure: wire pitch (µm) and gate length (µm) vs. year, 1960-2010, log scale 10^-1 to 10^3; gate delay (ns) vs. year, log scale 10^-2 to 10^2]
3/23/99: 4
Technology Evolution (2)
Parameter     1979       1999       2019        Units
Gate Length   5          0.2        0.008       µm
Gate Delay    3000       150        7.5         ps
Clock Cycle   200        2.5        0.08        ns
Gates/Clock   67         17         10
Wire Pitch    15         1          0.07        µm
Chip Edge     6          15         38          mm
Grids/Chip    1.6×10^5   2.3×10^8   3.0×10^11
3/23/99: 5
Architecture Evolution
Year   Microprocessor                        High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray 1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS    (× 4?)
2019   X: 10,000 MIPS, 10,000 MFLOPS         MP with 1000 X's
3/23/99: 6
Incremental Returns
[Figure: Performance vs. Processor Cost (Die Area) — diminishing returns moving from pipelined RISC to dual-issue in-order to quad-issue out-of-order]
3/23/99: 7
Efficiency and Granularity
[Figure: Peak Performance vs. System Cost (Die Area) for configurations P+M, 2P+M, and 2P+2M]
3/23/99: 8
VLSI in 1979
3/23/99: 9
VLSI Architecture in 1979
• 5 µm NMOS technology
• 6 mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor
  – 0.5 MIPS
3/23/99: 10
1979-1989: Attack of the Killer Micros
• 50% per year improvement in performance
• Transistors applied to implicit parallelism
  – pipelined processor (10 CPI --> 1 CPI)
  – shortened clock cycle (67 gates/clock --> 30 gates/clock)
• In 1989 a 32-bit processor w/ floating point and caches fits on one chip
  – e.g., i860: 40 MIPS, 40 MFLOPS
  – 5,000,000 grids, 1M transistors (many for memory)
3/23/99: 11
1989-1999: The Era of Diminishing Returns
• 50% per year increase in performance through 1996, but
  – projects delayed, performance below expectations
  – 50% increase in grids, 15% increase in frequency (72% total)
• Squeezing out the last implicit parallelism
  – 2-way to 6-way issue, out-of-order issue, branch prediction
  – 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
  – Intel Pentium II (3-way o-o-o)
  – Compaq 21264 (4-way o-o-o)
3/23/99: 12
1979-1999: Why Implicit Parallelism?
• Opportunity
  – large gap between micros and fastest processors
• Compatibility
  – software pool ready to run on implicitly parallel machines
• Technology
  – not available for fine-grain explicitly parallel machines
3/23/99: 13
1999-2019: Explicit Parallelism Takes Over
• Opportunity
  – no more processor gap
• Technology
  – interconnection, interaction, and shared memory technologies have been proven
3/23/99: 14
Technology for Fine-Grain Parallel Machines
• A collection of workstations does not make a good parallel machine. (BLAGG)
  – Bandwidth - large fraction (0.1) of local memory BW
  – LAtency - small multiple (3) of local memory latency
  – Global mechanisms - sync, fetch-and-op
  – Granularity - of tasks (100 inst) and memory (8MB)
3/23/99: 15
Technology for Parallel Machines: Three Components
• Networks
  – 2 clocks/hop latency
  – 8GB/s global bandwidth
• Interaction mechanisms
  – single-cycle communication and synchronization
• Software
k-ary n-cubes
• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select radix to trade-off diameter, D, against B.
• Latency: T = L/B + D, with average distance D ≈ nk/4 hops in a torus
Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”, IEEE TC, 1990
[Figure: Latency (0-70 cycles) vs. Dimension (0-12) for 4K nodes, L = 256, Bs = 16K — latency is minimized at low dimension]
Delay of Express Channels
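The radix/dimension trade-off on this slide can be checked numerically. The sketch below assumes a simplified model (illustrative constants, not the 1990 paper's exact derivation): average distance D ≈ nk/4 hops for a torus, channel width W = Bs·k/2N wires under a fixed wire bisection, and latency T = D + L/W cycles.

```python
# Sketch of the radix/dimension trade-off for k-ary n-cubes.
# Assumed model (illustration only, not the paper's exact one):
#   T = D + L/W cycles, with average distance D = n*k/4 (torus) and
#   channel width W = Bs*k/(2*N) wires under a fixed wire bisection.

N  = 4096    # nodes (the slide's 4K-node example)
L  = 256     # message length in bits
Bs = 16384   # bisection width in wires (16K)

def latency(n, k):
    D = n * k / 4.0              # average hop count in a torus
    W = Bs * k / (2.0 * N)       # wires per channel at fixed bisection
    return D + L / W             # hops plus serialization time

# All (n, k) with k**n == N, i.e. the ways to fold 4K nodes into a cube.
configs = [(n, k) for n in range(1, 13) for k in range(2, N + 1) if k ** n == N]
best_n, best_k = min(configs, key=lambda nk: latency(*nk))
print(best_n, best_k, latency(best_n, best_k))
```

Under these assumptions the minimum falls at low dimension (n = 3, k = 16), consistent with the slide's point that the radix should be chosen to balance diameter against bandwidth.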
The Torus Routing Chip
• k-ary n-cube topology
  – 2D Torus Network
  – 8bit x 20MHz Channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully Self-Timed Design
• Internal Crossbar Architecture
Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986
The Reliable Router
Dally, Dennison, Harris, Kan, and Xanthopoulos, "Architecture and Implementation of the Reliable Router," Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, "Low-Latency Plesiochronous Data Retiming," ARVLSI 1995
Dennison, Lee, and Dally, "High Performance Bidirectional Signalling in VLSI Systems," SIS 1993
• Fault-tolerant
  – Adaptive routing (adaptation of Duato's algorithm)
  – Link-level retry
  – Unique token protocol
• 32bit x 200MHz channels
  – Simultaneous bidirectional signalling
  – Low latency plesiochronous synchronizers
• Optimistic routing
3/23/99: 20
Equalized 4Gb/s Signaling
End-to-End Latency
• Software sees ~10 µs latency on a 500 ns network
• Heavy compute load associated with sending a message
  – system call
  – buffer allocation
  – synchronization
• Solution: treat the network like memory, not like an I/O device
  – hardware formatting, addressing, and buffer allocation
[Diagram: Tx node (Regs → Send → Net) and Rx node (Buffer → Dispatch)]
3/23/99: 22
Network Summary
• We can build networks with 2-4 clocks/hop latency (12-24 clocks for a 512-node 3-cube)
  – networks faster than main memory access of modern machines
  – need end-to-end hardware support to see this, no 'libraries'
• With high-speed signaling, bandwidth of 4GB/s or more per channel (512GB/s bisection) is easy to achieve
  – nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection constraints to the packaging technology
  – it's hard to beat a 3-D mesh or torus
• This gives us B and LA (of BLAGG)
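The "12-24 clocks for a 512-node 3-cube" figure can be verified with a quick average-distance calculation (a sketch assuming a torus with wraparound links, minimal routing, and uniform random traffic):

```python
# Average hop count in an 8-ary 3-cube (512-node 3-D torus), assuming
# uniform random traffic and minimal routing with wraparound links.
k, n = 8, 3

# Average distance along one ring of k nodes (shorter of two directions).
ring_avg = sum(min(d, k - d) for d in range(k)) / k   # 2.0 for k = 8

# Distances in each dimension are independent, so they add.
avg_hops = n * ring_avg                               # 6.0

# At the slide's 2-4 clocks per hop:
lo, hi = 2 * avg_hops, 4 * avg_hops
print(avg_hops, lo, hi)   # 6.0 12.0 24.0
```

Six average hops at 2-4 clocks each gives exactly the 12-24 clock range the slide quotes.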
3/23/99: 23
The Importance of Mechanisms
[Figure: tasks A and B, serial execution]
3/23/99: 24
The Importance of Mechanisms
[Figure: serial execution of tasks A and B vs. parallel execution with high overhead (0.5) — per-task overhead (OVH), communication (COM), and synchronization (Sync) costs dominate]
3/23/99: 25
The Importance of Mechanisms
[Figure: serial execution of tasks A and B; parallel execution with high overhead (0.5); parallel execution with low overhead (0.062) — only with low-overhead mechanisms does the parallel version win]
3/23/99: 26
Granularity and Cost Effectiveness
• Parallel Computers Built for
  – Capability - run problems that are too big or take too long to solve any other way
    • absolute performance at any cost
  – Capacity - get throughput on lots of small problems
    • performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
  – sublinear speedup
  – economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation
[Diagram: one workstation-size node (M + P + $) vs. four fine-grain nodes, each with a small M and P + $]
3/23/99: 27
MIT J-Machine (1991)
3/23/99: 28
Exploiting fine-grain threads
• Where will the parallelism come from to keep all of these processors busy?
  – ILP - limited to about 5
  – Outer-loop parallelism
    • e.g., domain decomposition
    • requires big problems to get lots of parallelism
• Fine threads
  – make communication and synchronization very fast (1 cycle)
  – break the problem into smaller pieces
  – more parallelism
3/23/99: 29
Mechanism and Granularity Summary
• Fast communication and synchronization mechanisms enable fine-grain task decomposition
  – simplifies programming
  – exposes parallelism
  – facilitates load balance
• Have demonstrated
  – 1-cycle communication and synchronization locally
  – 10-cycle communication, synchronization, and task dispatch across a network
• Physically fine-grain machines have better performance/cost than sequential machines
3/23/99: 30
A 2009 Multicomputer
[Diagram: System = 16 Chips; Chip = 64 Tiles; Tile = Processor + 8MB Memory]
3/23/99: 31
Challenges for the Explicitly Parallel Era
• Compatibility• Managing locality• Parallel software
3/23/99: 32
Compatibility
• Almost no fine-grain parallel software exists
• Writing parallel software is easy
  – with good mechanisms
• Parallelizing sequential software is hard
  – needs to be designed from the ground up
• An incremental migration path
  – run sequential codes with acceptable performance
  – parallelize selected applications for considerable speedup
3/23/99: 33
Performance Depends on Locality
• Applications have data/time-dependent graph structure
  – Sparse-matrix solution
    • non-zero and fill-in structure
  – Logic simulation
    • circuit topology and activity
  – PIC codes
    • structure changes as particles move
  – 'Sort-middle' polygon rendering
    • structure changes as viewpoint moves
3/23/99: 34
Fine-Grain Data Migration: Drift and Diffusion
• Run-time relocation based on pointer use
  – move data at both ends of pointer
  – move control and data
• Each 'relocation cycle'
  – compute drift vector based on pointer use
  – compute diffusion vector based on density potential (Taylor)
  – need to avoid oscillations
• Should data be replicated?
  – not just update vs. invalidate
  – need to duplicate computation to avoid communication
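One relocation cycle of this kind might look like the sketch below. The data structures, constants, and 1-D tile array are entirely hypothetical (the slide does not specify the algorithm): an object drifts one step toward the tile that uses it most, and the move is suppressed when the target tile is already dense, which also damps the oscillations the slide warns about.

```python
# Hypothetical sketch of one drift-and-diffusion relocation cycle on a
# 1-D array of tiles. All names and constants are illustrative
# assumptions; the talk does not give the algorithm.

def relocate(obj_tile, use_counts, density, capacity):
    """obj_tile: current tile of one object.
    use_counts: {tile: pointer uses of this object from that tile}.
    density: mutable list of object counts per tile.
    capacity: per-tile density limit (diffusion constraint)."""
    # Drift: step toward the tile that uses the object most.
    hottest = max(use_counts, key=use_counts.get)
    drift = (hottest > obj_tile) - (hottest < obj_tile)   # -1, 0, or +1
    target = obj_tile + drift
    # Diffusion: refuse the move if the target is too dense, which
    # also avoids ping-ponging between two popular tiles.
    if drift == 0 or density[target] >= capacity:
        return obj_tile
    density[obj_tile] -= 1
    density[target] += 1
    return target

density = [3, 1, 4, 2]                 # objects currently on each tile
new_tile = relocate(1, {0: 1, 3: 9}, density, capacity=5)
print(new_tile, density)
```

Here the object on tile 1 is used mostly from tile 3, so it drifts one step to tile 2; a lower capacity would have pinned it in place instead.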
3/23/99: 35
Migration and Locality
[Figure: Distance (in tiles, 0-6) vs. Migration Period (1-37) for four policies: No Migration, One Step, Hierarchy, Mixed]
3/23/99: 36
Parallel Software: Focus on the Real Problems
• Almost all demanding problems have ample parallelism
• Need to focus on fundamental problems
  – extracting parallelism
  – load balance
  – locality
    • load balance and locality can be covered by excess parallelism
• Avoid incidental issues
  – aggregating tasks to avoid overhead
  – manually managing data movement and replication
  – oversynchronization
3/23/99: 37
Parallel Software: Design Strategy
• A program must be designed for parallelism from the ground up
  – no bottlenecks in the data structures
    • e.g., arrays instead of linked lists
• Data parallelism
  – many for loops (over data, not time) can be forall
  – break dependencies out of the loop
  – synchronize on natural units (no barriers)
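The "over data, not time" distinction can be made concrete with a sketch (illustrative code, not from the talk): in a stencil-style update, the time loop carries a true dependence, but once the dependence is broken out with double buffering, every iteration of the inner loop over data is independent and could run as a forall.

```python
# Illustrative sketch: the time loop is inherently sequential, but the
# data loop has no loop-carried dependence once we double-buffer, so
# each iteration over i is a candidate forall (run sequentially here).

def step(a):
    # Every element of the new array depends only on the OLD array 'a',
    # so the loop over i is independent: a forall, not a for.
    n = len(a)
    return [(a[i - 1] + a[i] + a[(i + 1) % n]) / 3.0 for i in range(n)]

a = [0.0, 0.0, 9.0, 0.0, 0.0]
for t in range(2):          # time loop: sequential by nature
    a = step(a)             # data loop inside: parallel across i
print(a)                    # [1.0, 2.0, 3.0, 2.0, 1.0]
```

Writing the update in place, by contrast, would thread a dependence through the data loop and serialize it, which is exactly the bottleneck the slide says to design out.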
3/23/99: 38
Conclusion: We are on the threshold of the explicitly parallel era
• As in 1979, we expect a 1000-fold increase in 'grids' per chip in the next 20 years
• Unlike 1979, these 'grids' are best applied to explicitly parallel machines
  – Diminishing returns from sequential processors (ILP) - no alternative to explicit parallelism
  – Enabling technologies have been proven
    • interconnection networks, mechanisms, cache coherence
  – Fine-grain machines are more efficient than sequential machines
• Fine-grain machines will be constructed from multi-processor/DRAM chips
• Incremental migration to parallel software