TRANSCRIPT
Influence of Technology Directions on System Architecture
Dr. Randy Isaac, VP of Science and Technology
IBM Research Division, September 10, 2001
Moore's Law continues beyond conventional scaling
Power becomes the limiting metric
The integration focus moves from circuit to processor
$1000 Buys...
[Chart: computations per second purchasable for $1000 vs. year, 1900-2020, rising from roughly 1E-6 to beyond 1E+9 across the mechanical, electro-mechanical, vacuum tube, discrete transistor, and integrated circuit eras; after Kurzweil, 1999 & Moravec, 1998.]
Integrated Circuit Performance Trends
[Chart: transistors per chip vs. year, 1980-2000, spanning roughly 10^3 to 10^11; memory density milestones at 16Kb, 256Kb, 4Mb, 64Mb, and 1Gb, and logic density with clock frequencies rising from 1 MHz through 10 MHz and 100 MHz to 1 GHz.]
The Original Moore's Law Proposal
After G. E. Moore, "Electronics," 1965
A Decade of Agreement
After G. E. Moore, Proc. IEDM, 1975
Complexity's Influence
After G. E. Moore, SPIE v. 2440, 1995
Increased integration
Function implemented with:
Many silicon components
Few silicon components
One chip
[Diagram: gains in function, speed, and cost.]
Partitioning the Improvement Rate
Improving Integration: Components per chip
50% gain from lithography
25% gain from device and circuit innovation
25% gain from increased chip size (manufacturability)
Improving Performance:
Transistor performance improvement
Interconnect density and delay
Packaging and cooling
Circuit-level and system-level gains
Evolution of Memory Density
[Chart: megabits per chip vs. year, 1980-2015, from below 1 to beyond 100,000 on a log scale; doubling time 1.5 yr. historically, slowing to 3 yr.]
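The two doubling times compound in the usual way; a minimal sketch (Python), assuming an illustrative 4 Mb starting point around 1989 read off the chart:

```python
# Hedged sketch: projecting DRAM megabits per chip under the two doubling
# times shown on the chart (1.5 yr. historically, slowing to 3 yr.).
# The 4 Mb / 1989 starting point is illustrative, taken from the chart.
def project(start_mbits, start_year, end_year, doubling_years):
    return start_mbits * 2 ** ((end_year - start_year) / doubling_years)

print(round(project(4, 1989, 2001, 1.5)))   # ~1024 Mb (1 Gb) with 1.5-yr doubling
print(round(project(4, 1989, 2001, 3.0)))   # ~64 Mb with 3-yr doubling
```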
ITRS Lithography Roadmap
Industry-Wide Lithography Technology Acceleration
[Chart: minimum feature size (DRAM half-pitch, nm) vs. year, 1995-2011, stepping from 500 nm through 350, 250, 180, 130, 100, 70, 50, and 35 toward 25 nm; successive curves for the 1994 SIA NTRS, 1997 SIA NTRS, 1998/1999 ITRS, and the ISMT Litho 2000 Plan, with an area marked for future acceleration.]
Dimensions in Lithography
[Chart: feature sizes and exposure wavelengths on a scale from 1000 nm down to 0.01 nm; wavelengths of 435 & 405, 365, 248, 193, and 157 nm spanning deep UV toward extreme UV, with X-ray proximity and electron beam lithography at still smaller dimensions.]
Device Scaling
Original Device
[Diagram: bulk FET with gate over an n+ source and n+ drain, gate oxide thickness tox, gate length L, junction depth xd, wiring of width W at voltage V, and p substrate with doping NA.]
Scaling recipe -- Voltage: V/α; Oxide: tox/α; Wire width: W/α; Gate width: L/α; Diffusion: xd/α; Substrate doping: α·NA
Device Scaling
Scaled Device
[Diagram: scaled device with gate length L/α, junction depth xd/α, wire width W/α, oxide thickness tox/α, voltage V/α, and substrate doping α·NA.]
Scaling recipe -- Voltage: V/α; Oxide: tox/α; Wire width: W/α; Gate width: L/α; Diffusion: xd/α; Substrate doping: α·NA
Results -- Higher density: ~α²; Higher speed: ~α; Lower power per circuit: ~1/α²; Power density: ~constant
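A minimal sketch (Python) of the constant-field scaling arithmetic behind these results, assuming ideal scaling of all dimensions and the supply voltage by 1/α:

```python
# Classical (Dennard) constant-field scaling: dimensions and voltage shrink by
# 1/alpha, substrate doping rises by alpha.  The results quoted on the slide
# follow directly.
def scale(alpha):
    capacitance = (1 / alpha**2) / (1 / alpha)   # C ~ area / t_ox   -> 1/alpha
    current = 1 / alpha                          # drive current     -> 1/alpha
    voltage = 1 / alpha
    delay = capacitance * voltage / current      # CV/I              -> 1/alpha
    density = alpha**2                           # circuits per unit area
    power_per_ckt = capacitance * voltage**2 / delay   # C V^2 f     -> 1/alpha^2
    return {
        "density": density,                       # ~alpha^2 (higher density)
        "speed": 1 / delay,                       # ~alpha   (higher speed)
        "power_per_circuit": power_per_ckt,       # ~1/alpha^2
        "power_density": power_per_ckt * density, # ~constant
    }

print(scale(2.0))
# {'density': 4.0, 'speed': 2.0, 'power_per_circuit': 0.25, 'power_density': 1.0}
```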
Device Scaling
Scaled Device
[Scaled-device diagram shown again.]
Fundamental atomic limit to the scaling recipe
Oxide thickness is approaching a few atomic layers
[Image: present vs. future silicon bulk field effect transistor (FET); the present device uses a 1.2 nm oxynitride gate dielectric.]
Limit of Oxide Scaling
[Chart: gate current density (A/cm²) vs. gate oxide thickness (nm), 0 to 4 nm, rising from ~1E-8 to beyond 1E+4 A/cm² as the oxide thins, for gate voltages of 0.9 to 2.0 V.]
High Performance CMOS Logic Trend
[Chart: industry logic performance trend (FPG, 1 to 1000 on a log scale) vs. year of first production, 1994-2008.]
MOSFET Device Structure (R)evolution
Relative CMOS Device Performance
[Chart: relative device performance vs. year of technology capability, 1986-2010, with successive curves for bulk FETs, SOI FETs, low-temperature operation, and double-gate FETs.]
New structures are needed to maintain device performance... ?
Better Performance Without Scaling
New devices/materials support accelerated growth rate
[Chart: relative device performance vs. year, with the future trajectory marked "?".]
Novel Devices
Carbon nanotubes
Organic transistors
Quantum computing
Molecular devices
V-groove transistors
64-bit S/390 Microprocessor
47 million transistors
Copper interconnect -- 7 layers
Size: 17.9 x 9.9 mm
Single scalar, in-order execution
Split L1 cache (256K I & D)
BTB 2K x 4, multiported
On-chip compression unit
> 1 GHz frequency on a 20-way system
Blue Pacific
3.9 trillion operations/sec
Can simulate nuclear devices
15,000 X speed of average desktop
80,000 X memory of average desktop
75 terabytes of disk storage capacity
System Level Performance Improvement
Overall system level performance improvement will come from many small improvements
[Chart: performance CAGR vs. year, rising from a ~20% CAGR baseline toward 60 to 90% CAGR around 2000 as contributions are stacked. Legend, top to bottom:]
Overall performance
Application tuning
Middleware tuning
OS: tuning/scalability
Compilers
Multi-way systems
Motherboard design: electrical, debug
Memory subsystem: latency/bandwidth
Packaging: more pins, better electrical/cooling
Tools / environment / designer productivity
Architecture/Microarchitecture/Logic design
Circuit design
New device structures
Other process technology
Traditional CMOS scaling
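The arithmetic behind the chart is multiplicative: individually modest annual gains at each layer compound. A hedged sketch (Python) -- the per-item percentages are illustrative assumptions, not numbers from the talk; only the ~20% scaling baseline and the 60 to 90% system goal appear on the slide:

```python
# Stacking small annual improvements multiplicatively.  Every factor below is
# an illustrative assumption except the ~1.20 device-scaling baseline implied
# by the slide's 20% CAGR.
gains = {
    "traditional CMOS scaling": 1.20,
    "circuit design": 1.05,
    "architecture/microarchitecture": 1.10,
    "memory subsystem": 1.05,
    "compilers": 1.05,
    "OS/middleware/application tuning": 1.05,
    "multi-way systems": 1.07,
}

overall = 1.0
for factor in gains.values():
    overall *= factor

print(f"overall CAGR ~ {overall - 1:.0%}")   # ~72% from stacked small gains
```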
Moore's Law continues beyond conventional scaling
Power becomes the limiting metric
The integration focus moves from circuit to processor
Microprocessor Size Trends
[Charts: magnitude of change (1/16x to 256x, log scale) vs. year, 1988-2000, comparing Moore's Law trends with industry data for Lithography, # Transistors, and Chip Area. Rates shown include ½x / 6 yrs. and ½x / 3.8 yrs. for lithography, 2x / 1.5 yrs. and 2x / 1.9 yrs. for transistor count, and ~flat to 2x / 6 yrs. for chip area.]
Microprocessor Performance Trends
[Charts: magnitude of change (1/16x to 256x, log scale) vs. year, 1988-2000, industry data against Moore's Law. Performance: 2x / 1.5 yrs. (64x over the period); Frequency: 2x / 2 yrs.; Power: 2x / 3 yrs.; Power Density: 2x / 6 yrs.]
Microprocessor Scaling Trends

Processor                  486DX      Device Scaling  Moore's Law  Pentium 4
Date                       04/10/89   2001            2001         04/23/01
Technology (um)            1          0.25            0.25         0.18
Vdd (V)                    5          1.25            1.25         1.75
FPG                        5          20              20           51
Frequency (MHz)            25         100             6400         1700
SpecInt95                  0.5        2.0             128          71
# Transistors (M)          1.2        1.2             307          42
Chip Size (sq. mm)         165        10              660          216
Power (W)                  4          0.25            66           64
Power Density (W/sq. cm)   2.5        2.5             10           29.5
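The two projection columns can be reproduced from the measured 486DX row; a hedged sketch (Python), where "Device Scaling" applies the constant-field recipe with α = 1 µm / 0.25 µm = 4, and "Moore's Law" compounds 2x / 1.5 yrs. for transistors and performance and 2x / 6 yrs. for chip area over the 12 years from 1989 to 2001:

```python
# Reconstructing the "Device Scaling" and "Moore's Law" columns of the table
# above from the 1989 486DX figures.
alpha, years = 4.0, 12.0
f486 = {"freq_MHz": 25, "transistors_M": 1.2, "area_mm2": 165,
        "power_W": 4, "specint95": 0.5}

device_scaling = {
    "freq_MHz": f486["freq_MHz"] * alpha,            # 100
    "transistors_M": f486["transistors_M"],          # same design: 1.2
    "area_mm2": f486["area_mm2"] / alpha**2,         # ~10
    "power_W": f486["power_W"] / alpha**2,           # 0.25 (power density constant)
}

moore = 2 ** (years / 1.5)                           # 256x in 12 years
moores_law = {
    "freq_MHz": f486["freq_MHz"] * moore,            # 6400
    "transistors_M": f486["transistors_M"] * moore,  # ~307
    "specint95": f486["specint95"] * moore,          # 128
    "area_mm2": f486["area_mm2"] * 2 ** (years / 6), # 660
}

print(device_scaling)
print(moores_law)
```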
Power Density: The Fundamental Problem
[Chart: power density (W/cm², 1 to 1000, log scale) vs. technology generation from 1.5 µm to 0.07 µm, with i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III marked; reference lines for a hot plate and a nuclear reactor.]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32.
Power
IT electrical power needs are projected to reach crisis proportions
Server farm energy consumption is increasing exponentially
...more Watts/sq. ft than semiconductor or automobile plants
...energy needs constitute 60% of cost
Interesting anecdotes -- the "2,400 megawatt problem":
27 farms proposed for South King County will require as much energy as Seattle (including Boeing)
Exodus considering building power plant near its Santa Clara facility
San Jose City Council approved 250 MW power plant for US DataPort server farm, plus installation of 80 back-up diesel generators
Server Farm Heat Density Trend
[Chart: heat density trends by equipment class. Highest growth -- communication equipment at 28% AGR; lowest -- tape storage at 7%. Growth slows after 2005 due to improvement in semiconductor power consumption.]
Reprinted with permission of The Uptime Institute from a White Paper titled "Heat Density Trends in Data Processing, Computer Systems, and Telecommunications Equipment," Version 1.0.
Energy Dissipated per Logic Operation
[Chart: energy per logic operation (pJ) vs. year, 1940-2020, on a log scale from 1E-10 to 1E+10 pJ, falling steadily toward the kT (room temperature) reference line.]
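For reference, the kT floor marked on the chart works out as follows; a small worked calculation (Python), with an illustrative CMOS switching energy for comparison:

```python
# Thermal energy kT at room temperature (the chart's fundamental floor),
# compared with a hedged estimate of one CMOS switching event circa 2001:
# C ~ 1 fF node capacitance at Vdd ~ 1.2 V are illustrative values, not
# figures from the talk.
k_B = 1.380649e-23               # Boltzmann constant, J/K
T = 300.0                        # room temperature, K

kT = k_B * T                     # ~4.1e-21 J
E_switch = 0.5 * 1e-15 * 1.2**2  # CV^2/2 ~ 7.2e-16 J

print(f"kT          = {kT * 1e12:.1e} pJ")                        # ~4.1e-09 pJ
print(f"CMOS switch = {E_switch * 1e12:.1e} pJ (~{E_switch/kT:.0e} kT)")
```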
Device Scaling
Scaled Device
Scaling recipe -- Voltage: V/α; Oxide: tox/α; Wire width: W/α; Gate width: L/α; Diffusion: xd/α; Substrate doping: α·NA
Results -- Higher density: ~α²; Higher speed: ~α; Lower power per circuit: ~1/α²; Power density: ~constant
[Scaled-device diagram repeated.]
MOSFET Device Parameter Trends
[Chart: Tox, Vdd, and Vt vs. gate length Lgate from 1 µm down to 0.01 µm, compared against classic scaling.]
Low Temperature CMOS
[Chart: source/drain current (log scale) vs. gate voltage, -0.5 to 1 V, for an L = 25 nm device at Vds = 1 V, measured at T = 100 K, 200 K, and 300 K.]
Subthreshold slope steepens as temperature is reduced
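The temperature dependence follows from the ideal subthreshold slope, S = n·(kT/q)·ln 10 per decade of current; a minimal sketch (Python), assuming an illustrative ideality factor n ≈ 1.3:

```python
# Subthreshold slope vs. temperature: cooling from 300 K to 100 K sharpens the
# turn-off by 3x.  The ideality factor n = 1.3 is an illustrative assumption.
import math

k_B, q, n = 1.380649e-23, 1.602177e-19, 1.3

def slope_mV_per_decade(T):
    return n * (k_B * T / q) * math.log(10) * 1e3

for T in (300, 200, 100):
    print(f"T = {T:3d} K: S ~ {slope_mV_per_decade(T):.0f} mV/decade")
# ~77, ~52, ~26 mV/decade for n = 1.3
```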
CMOS Performance Parameter Trends
[Chart: Cgate (fF/µm), inverter delay (ps), NFET Id-sat (A/m), power density (W/cm²), and CV/I delay (a.u.) vs. gate length LGATE from 1 µm down to 0.01 µm.]
Relative Power Density in Scaled CMOS
After B. Davari, et al., IEEE Proc. Vol. 83, p. 595, 1995.
[Chart: relative power density (1 to 4+) vs. channel length from 1.0 µm down to 0.05 µm, with supply voltages from 5.0 V down to 0.8 V along high-performance and low-power paths; relative circuit density, in parentheses, grows from 1.0 to 48.]
CMOS Power Density Trends
[Chart: active power density and subthreshold power density (W/cm², 1E-5 to 1000, log scale) vs. gate length from 1 µm down to 0.01 µm, with the projected trend marked "?".]
Microprocessor Power Draw vs. Frequency
[Chart: power (watts, 0 to 40) vs. operating frequency (MHz, 200 to 700).]
Moore's Law continues beyond conventional scaling
Power becomes the limiting metric
The integration focus moves from circuit to processor
We've been here before!
Heat Flux Explosion
[Chart: module heat flux (watts/cm², 0 to 14) vs. year of announcement, 1950-2010. Vacuum-tube and bipolar machines -- IBM 360, 370, 3033, 3081, 3090, 3090S, 4381, ES9000; Fujitsu M380, M-780, VP2000; CDC Cyber 205; NTT -- climb past the ~5 W/cm² of a steam iron, then the switch to CMOS (IBM RY4 through RY7, GP, Apache, Pulsar, Merced, Pentium II (DSIP)) resets heat flux before it begins climbing again.]
S/390 Mainframe CPU Performance
[Chart: relative performance (1 to 1000, log scale) vs. year, 1970-2005; bipolar machines (168, 9021-711) followed by the CMOS generations S/390 G3, G4, G5, G6, and G7.]
S/390: Comparison of Bipolar and CMOS

                         ES9000 9X2    S/390 G5
Technology               Bipolar       CMOS
Total Chips              5000          29 (12 CPUs)
Total Parts              6659          92
Weight (lbs)             31.1 K        2.0 K
Power Req (KW)           153           5
Chips/processor          390           1
Maximum Memory (GB)      10            24
Space (sq ft)            672           52
Focus on massively parallel systems
Use slower processors with much greater power efficiency
Scale to desired performance with parallel systems
Workload scaling efficiency must sustain power efficiency
Physical distance must be small to keep communication power manageable
Example: Processor A is slower than B by a factor S but more power efficient by E. Then MP System A at the same performance as MP System B has lower power by E/S.
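A minimal sketch (Python) of that example, assuming ideal workload scaling (the caveat above) and illustrative numbers for S, E, and the per-processor figures:

```python
# Processor A runs S times slower than B but draws E times less power per
# processor.  Matching System B's throughput therefore takes S times as many
# A processors, and total power drops by E/S.  All numbers are illustrative;
# perfect workload scaling is assumed.
S, E = 2.0, 8.0                          # A: 2x slower, 8x lower power per chip
perf_B, power_B, n_B = 1000.0, 60.0, 64  # per-processor DMIPS and watts; node count

perf_A, power_A = perf_B / S, power_B / E
n_A = n_B * S                            # processors needed to match B's throughput

total_power_A = n_A * power_A
total_power_B = n_B * power_B

print(total_power_B / total_power_A)     # = E / S = 4.0
```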
Microprocessor Efficiencies
[Chart: performance (DMIPS, 0 to 4000) vs. active power (mW, 100 to 100,000) for a range of microprocessors, with reference lines at 1 MIPS/mW and 0.1 MIPS/mW.]
Parallel Performance Scaling Model
[Chart: relative performance vs. number of processors; ideal (linear) scaling compared with real scaling, which saturates at Pmax near Nmax processors.]
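The talk does not specify the model behind the "real scaling" curve; one common illustrative choice is Amdahl's law with a small serial fraction, sketched below (Python):

```python
# Ideal (linear) scaling vs. an Amdahl's-law curve with serial fraction s.
# The serial fraction s = 1% is an illustrative assumption; with it, real
# speedup saturates near 1/s = 100x, i.e. the chart's Pmax.
def ideal(n):
    return n

def amdahl(n, s=0.01):
    return 1.0 / (s + (1.0 - s) / n)

for n in (1, 16, 256, 4096, 65536):
    print(f"{n:6d} processors: ideal {ideal(n):6d}x, real ~{amdahl(n):5.0f}x")
```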
Power/Bandwidth by Interconnect Length
[Chart: power per bandwidth (mW/Gbit/sec, 1 to 100) vs. interconnect length (0.001 to 100 meters), rising from on-chip through chip-chip, board-board, and rack-rack to room-room links.]
Supercomputer Peak Performance
[Chart: peak speed (flops, 1E+2 to 1E+16) vs. year introduced, 1940-2010; doubling time = 1.5 yr. Machines include ENIAC (vacuum tubes), UNIVAC, IBM 701, IBM 704, IBM 7090 (transistors), IBM Stretch, CDC 6600 (ICs), CDC 7600, CDC STAR-100 (vectors), CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, i860 (MPPs), Delta, CM-5, Paragon, NWT, CP-PACS, ASCI Red, Blue Pacific, ASCI White, and a projected petaflop BlueGene.]
ASCI White
Cellular Architecture
computational efficiency ~ 0.2 GFLOP/W
IBM PPC440 system-on-chip
[Diagram: cellular node built around the IBM PPC440 system-on-chip -- a 440 PowerPC core (~1 Watt) with 32 kB I-cache and 32 kB D-cache, a 2.8 GF floating-point unit, 4 MB of EDRAM (memory mapped or cache), a prefetch buffer, a possible L2 control / PLB interface, a DDR/DDR2 controller to 256-512 MB of DDR SDRAM (an integrated memory system), link DMA and buffers for six 2 Gb/sec serial links and the global tree, 10/100 Mb Ethernet for boot, 1 Gb Ethernet or Infiniband for I/O, and global functions.]
Example of a Cellular Node
Cellular Communication Networks
65536 nodes interconnected with three integrated networks
Ethernet: incorporated into every node ASIC; disk I/O; host control, booting and diagnostics
3-Dimensional Torus: virtual cut-through hardware routing to maximize efficiency; 2.8 Gb/s on each of 12 node links (total 4.2 GB/s per node); communication backbone; 134 TB/s total torus interconnect bandwidth; 1.4/2.8 TB/s bisection bandwidth
Global Tree: one-to-all or all-to-all broadcast functionality; arithmetic operations implemented in the tree; ~1.4 GB/s of bandwidth from any node to all other nodes; tree latency less than 1 µsec; ~90 TB/s total binary tree bandwidth (64k machine)
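A quick check of the torus arithmetic (Python), assuming each bidirectional link is counted once (shared between its two endpoint nodes) and 1 TB = 1024 GB:

```python
# Torus bandwidth figures quoted above: 12 links per node at 2.8 Gb/s each.
link_Gbps, links_per_node, nodes = 2.8, 12, 65536

per_node_GBps = link_Gbps * links_per_node / 8    # 4.2 GB/s per node
total_TBps = per_node_GBps * nodes / 2 / 1024     # each link shared by 2 nodes

print(f"per node: {per_node_GBps:.1f} GB/s")      # 4.2 GB/s
print(f"total torus: {total_TBps:.0f} TB/s")      # ~134 TB/s
```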
Node Card and I/O Card Design
Compute cards: 8 processors, 2 x 2 x 2 (x,y,z); 256 MB RAM each processor; redundant power supplies; Fast Ethernet
I/O cards: 4 processors (no torus); 512MB-1GB each processor; redundant power supplies; Fast and 1Gb Ethernet
[Diagram: compute nodes, an I/O node, and a switch linked by Gb Ethernet and 100 Mb Ethernet.]
Rack Design
1024 compute nodes; 256 GB DRAM; 2.8 TF peak
16 I/O nodes; 8 GB DRAM; 16 Gb Ethernet
~15 KW, air cooled; 1+1 or 2+1 redundant power; 2+1 redundant fans
[Diagram: one compute node -- a BL ASIC (2 cores) with nine DRAM chips; one I/O node -- a BL ASIC (2 cores) with eighteen DRAM chips.]
Building a Cellular System
Chip (2 processors): 5.6 GF/s, 4 MB
Board (8 chips, 2x2x2): 44.8 GF/s, 2.08 GB
Cabinet (128 boards, 8x8x16): 5.7 TF/s, 266 GB
System (64 cabinets, 32x32x64): 360 TF/s, 16 TB
[Diagram: each chip integrates two 440 cores, EDRAM, and I/O.]
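A minimal sketch (Python) of the aggregation arithmetic in the hierarchy above. The per-chip memory of 256 MB external DRAM plus 4 MB embedded DRAM is inferred from the 2.08 GB board total; the slide rounds the system figures to 360 TF/s and 16 TB:

```python
# Multiplying per-chip peak rate and memory up through board, cabinet, system.
chip_gflops = 5.6                  # one chip = 2 processors
chip_mem_GB = 0.256 + 0.004        # external DRAM + embedded DRAM (inferred)

levels = [("chip", 1), ("board", 8), ("cabinet", 8 * 128), ("system", 8 * 128 * 64)]
for name, chips in levels:
    print(f"{name:8s}: {chip_gflops * chips:9.1f} GF/s, {chip_mem_GB * chips:9.2f} GB")
# board ~44.8 GF/s, 2.08 GB; cabinet ~5734 GF/s, 266 GB; system ~367,000 GF/s, ~17,000 GB
```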
Moore's Law continues beyond conventional scaling
Technology innovation will overcome limits
Power becomes the limiting metric
Technology trend is to higher power density
The integration focus moves from circuit to processor
Radical power reduction depends on efficient processors
Massively parallel systems have great potential
(Hopefully Not) The End!