
Page 1:

NRC Review Panel on High Performance Computing

11 March 1994

Gordon Bell

Page 2:

Position

Dual use: Exploit parallelism with in situ nodes & networks. Leverage WS & mP industrial HW/SW/app infrastructure!

No Teraflop before its time -- it's Moore's Law

It is possible to help fund computing: heuristics from federal funding & use (50 computer systems and 30 years)

Stop Duel Use, genetic engineering of State Computers
• 10+ years: nil payback, mono use, poor, & still to come
• plans for apps porting to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
• let "Challenges" choose apps, not mono-use computers
• "industry" offers better computers & these are jeopardized
• users must be free to choose their computers, not funders
• next generation State Computers "approach" industry
• 10 Tflop ... why?

Summary recommendations

Page 3:

Principal computing environments circa 1994 --> 4 networks to support mainframes, minis, UNIX servers, workstations & PCs

[Diagram: the '50s IBM & proprietary mainframe world, the '70s mini (proprietary) world & '90s UNIX mini world, the '80s UNIX distributed workstation & server world, and the late-'80s LAN-PC world, tied together by a wide-area inter-site network. Elements shown: mainframes, minicomputers, 3270 (& PC) terminals, ASCII & PC terminals, a POTS net for switching terminals, Token Ring and Ethernet LANs (gateways, bridges, routers, hubs, etc.), PCs (DOS, Windows, NT), Novell & NT servers, UNIX workstations, X terminals, NFS servers, compute & database uni- & mP servers, clusters, and UNIX multiprocessor servers operated as traditional minicomputers. >4 interconnect & comm. standards: POTS & 3270 terminals, WAN (comm. stds.), LAN (2 stds.), clusters (proprietary).]

Page 4:

Computing Environments circa 2000

[Diagram: a local & global data comm. world -- ATM† & local area networks for terminals, PCs, workstations & servers, a wide-area global ATM network, and a universal high-speed data service using ATM or ?? -- connecting: legacy mainframe & minicomputer servers & terminals; centralized & departmental uni- & mP servers (UNIX & NT); centralized & departmental scalable uni- & mP servers* (NT & UNIX); NT, Windows & UNIX person servers* on platforms such as X86, PowerPC, Sparc, etc.; and TC = TV + PC home ... (CATV or ATM) ???]

* multicomputers built from multiple simple servers: NFS, database, compute, print, & communication servers
† also 10-100 Mb/s point-to-point Ethernet

Page 5:

Beyond Dual & Duel Use Technology: Parallelism can & must be free!

HPCS, corporate R&D, and technical users must have the goal to design, install and support parallel environments using and leveraging:
• every in situ workstation & multiprocessor server
• as part of the local ... national network.

Parallelism is a capability that all computing environments can & must possess! -- not a feature to segment "mono use" computers

Parallel applications become a way of computing utilizing existing, zero cost resources -- not subsidy for specialized ad hoc computers

Apps follow pervasive computing environments

Page 6:

Computer genetic engineering & species selection has been ineffective

Although Problem x Machine scalability using SIMD for simulating some physical systems has been demonstrated, given extraordinary resources, the cost-effectiveness of larger problems has not. Hamming: "The purpose of computing is insight, not numbers."

The "demand side" Challenge users have the problems and should be drivers. ARPA's contractors should re-evaluate their research in light of driving needs.

Federally funded "Challenge" apps porting should be to multiple platforms, including workstations & compatible multis that support // environments, to ensure portability and an understanding of mainline cost-effectiveness

Continued "supply side"programs aimed at designing, purchasing, supporting, sponsoring, & porting of apps to specialized, State Computers, including programs aimed at 10 Tflops, should be re-directed to networked computing.

Users must be free to choose and buy any computer, including PCs & WSs, WS clusters, multiprocessor servers, supercomputers, mainframes, and even highly distributed, coarse grain, data parallel, MPP State Computers.

Page 7:

The teraflops

[Chart: Bell Prize performance vs. time, 1988-2000, log scale 1 to 10,000; points include Intel $55M, Intel $300M, NEC, Cray Super $30M, Cray DARPA, CM5 $30M, CM5 $120M, and CM5 $240M.]

Page 8:

We get no Teraflop before its time: it's Moore's Law!

Flops = f(t, $), not f(t); technology plans, e.g. BAA 94-08, ignore $s!

All Flops are not equal (peak announced performance -- PAP -- or real app performance -- RAP)

Flops_CMOS PAP* < C x 1.6**(t-1992) x $; C = 128 x 10**6 flops / $30,000

Flops_RAP = Flops_PAP x 0.5; for real apps, 1/2 of PAP is a great goal

Flops_supers = Flops_CMOS x 0.1; improvement of supers is 15-40%/year; higher cost is f(need for profitability, lack of subsidies, volume, SRAM)

'92-'94: Flops_PAP/$ = 4K; Flops_supers/$ = 500; Flops_vsp/$ = 50M (1.6G @ $25)

*Assumes primary & secondary memory size & costs scale with time: memory = $50/MB in 1992-1994 violates Moore's Law; disks = $1/MB in 1993, size must continue to increase at 60%/year

When does a Teraflop arrive if only $30 million** is spent on a super?

1 Tflop_CMOS PAP in 1996 (x7.8) with 1 Gflop nodes!!!; or 1997 if RAP

10 Tflop_CMOS PAP will be reached in 2001 (x78), or 2002 if RAP

How do you get a teraflop earlier?

**A $60 - $240 million Ultracomputer reduces the time by 1.5 - 4.5 years.
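The arithmetic above can be checked with a short sketch (Python). Only the constants stated on this slide are used -- C = 128 x 10**6 flops per $30,000 in 1992, 1.6x/year CMOS growth, RAP = 0.5 x PAP; the function names are illustrative, not anything from the talk.

import math

# Constants from the slide: in 1992, $30,000 buys ~128 Mflops of CMOS PAP,
# and CMOS flops/$ grows ~1.6x per year (Moore's Law as stated here).
FLOPS_PER_DOLLAR_1992 = 128e6 / 30_000   # ~4,267 flops/$ ("4K")
GROWTH_PER_YEAR = 1.6

def peak_flops(year, budget_dollars):
    """Peak announced performance (PAP) purchasable in a given year."""
    return FLOPS_PER_DOLLAR_1992 * GROWTH_PER_YEAR ** (year - 1992) * budget_dollars

def year_to_reach(target_flops, budget_dollars, rap=False):
    """Year when a machine of the given budget reaches target_flops (RAP = 0.5 x PAP)."""
    if rap:
        target_flops *= 2            # need twice the PAP to deliver the RAP target
    factor = target_flops / (FLOPS_PER_DOLLAR_1992 * budget_dollars)
    return 1992 + math.log(factor) / math.log(GROWTH_PER_YEAR)

print(peak_flops(1996, 30e6))               # ~8.4e11: just under a Tflop PAP for $30M in 1996
print(year_to_reach(1e12, 30e6))            # ~1996.4: 1 Tflop PAP on a $30M machine
print(year_to_reach(1e12, 30e6, rap=True))  # ~1997.8: 1 Tflop RAP (slide rounds to 1997)
print(year_to_reach(10e12, 30e6))           # ~2001.3: 10 Tflops PAP
print(year_to_reach(1e12, 240e6))           # ~1992: 8x the budget buys back ~4.4 years

Run as-is, it reproduces the slide's ~1996 (PAP) and ~2001 (10 Tflops) dates and the 1.5-4.5 year head start bought by a $60-$240 million ultracomputer.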

Page 9:

Funding Heuristics (50 computers & 30 years of hindsight)

1. Demand side works, i.e., we need this product/technology for x; supply side doesn't work! "Field of Dreams": build it and they will come.

2. Direct funding of university research resulting in technology and product prototypes that is carried over to start up a company is the most effective -- provided the right person & team are backed and have a transfer avenue.
a. Forest Baskett > Stanford funding of various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs ... rare, an accident if something emerges

3. A demanding & tolerant customer or user who "buys" products works best to influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN)
a. DOE labs have been effective buyers and influencers ("Fernbach policy"); unclear if labs are effective product, apps, or process developers
b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence are unlikely to be very helpful as users in the trek to the teraflop.

4. Direct funding of "large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!

Page 10:

Funding Heuristics - 2

5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g. by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g. supercomputers).

A significantly smaller universe of computing environments is needed. Cray & IBM are givens; SGI is probably the most profitable technical vendor; HP/Convex are likely to be a contender, & others (e.g., DEC) are trying. No State Co. (Intel, TMC, Tera) is likely to be profitable & hence self-sustaining.

6. "University-Company collaboration is a new area of government R&D. So far it hasn't worked nor is it likely to, unless the company invests. Appears to be a way to help company fund marginal people and projects.

7. CRADAs, or cooperative research and development agreements, are very closely allied to direct product development and are equally likely to be ineffective.

8. Direct subsidy of software apps or the porting of apps to one platform (e.g., EMI analysis) is a way to keep marginal computers afloat. If government funds apps, they must be ported cross-platform!

9. Encourage the use of computers across the board, but discourage designs from those who have not used or built a successful computer.

Page 11:

Scalability: The Platform of HPCS & why continued funding is unnecessary

Mono use, aka MPPs, have been, are, and will be doomed

The law of scalability

Four scalabilities: machine, problem x machine, generation (t), & now spatial

How do flops, memory size, efficiency & time vary with problem size? Does insight increase with problem size?

What's the nature of problems & work for monos?

What about the mapping of problems onto monos?

What about the economics of software to support monos?

What about all the competitive machines? e.g. workstations, workstation clusters, supers, scalable multis, attached P?

Page 12:

Special, mono-use MPPs are doomed ... no matter how much the feds spend!

Special because they have non-standard nodes & networks -- with no apps. Having not evolved to become mainline, events have overtaken them.

It's special purpose if it's only in Dongarra's Table 3. Flop rate, execution time, and memory size vs. problem size show limited applicability: very large scale problems must be scaled up to cover the inherent, high overhead.

Conjecture: a properly used supercomputer will provide greater insight and utility because of the apps and generality -- running more, smaller sized problems with a plan produces more insight

The problem domain is limited & now they have to compete with:
• supers -- do scalars, fine grain, and work, and have apps
• workstations -- do very long grain, are in situ, and have apps
• workstation clusters -- have identical characteristics and have apps
• low priced ($2 million) multis -- are superior, i.e., shorter grain, and have apps
• scalable multiprocessors -- formed from multis, are in the design stage

Mono useful (>>//) -- hence, illegal because they are not dual use. Duel use -- only useful to keep a high budget intact, e.g., 10 TF

Page 13:

The Law of Massive Parallelism is based on application scale

There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other problem.

A ... any parallel problem can be scaled to run on an arbitrary network of computers, given enough memory and time

Challenge to theoreticians: How well will an algorithm run?

Challenge for software: Can package be scalable & portable?

Challenge to users: Do larger scale, faster, longer run times increase problem insight and not just flops?

Challenge to HPCC: Is the cost justified? If so, let users do it!

Page 14:

Scalabilities

Size scalable computers are designed from a few components, with no bottleneck component.

Generation scalable computers can be implemented with the next generation of technology with no rewrite/recompile.

Problem x machine scalability - ability of a problem, algorithm, or program to exist at a range of sizes so that it can be run efficiently on a given, scalable computer.

Although large scale problems allow high flops, large problems running longer may not produce more insight (see the sketch below).

Spatial scalability -- ability of a computer to be scaled over a large physical space to use in situ resources.
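As an illustration of problem x machine scalability (a toy model assumed here, not taken from the talk): computation grows as n**3 while per-node communication grows more slowly, so scaling the problem with the machine keeps efficiency high -- at the price of longer runs and more memory, which is exactly the insight-per-flop question raised above.

# Toy model: work per node is n**3/p (e.g. a 3-D grid solve); communication per node
# scales as the surface of its share, (n**3/p)**(2/3). Purely illustrative parameters.
def efficiency(n, p, comm_cost=10.0):
    work_per_proc = n ** 3 / p
    comm_per_proc = comm_cost * work_per_proc ** (2 / 3)
    return work_per_proc / (work_per_proc + comm_per_proc)

# Fixed-size problem: efficiency erodes as p grows.
# Scaled problem (n grows with p**(1/3)): efficiency holds near 0.99,
# but run time and memory per node grow with n.
for p in (16, 256, 1024):
    print(p, round(efficiency(1_000, p), 3), round(efficiency(int(1_000 * p ** (1 / 3)), p), 3))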

Page 15:

Linpack rate in Gflops vs. Matrix Order

[Chart: Linpack rate in Gflops (log scale, 1 to ~100) vs. matrix order (1 to 100,000) for the SX-3 4P and the 1K-node CM5.]

Page 16:

Linpack Solution Time vs. Matrix Order

[Chart: Linpack solution time (log scale, 1 to 1,000) vs. matrix order (1 to 100,000) for the SX-3 4P and the 1K-node CM5.]
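The shape of both Linpack curves is the classic rate-versus-problem-size behavior often summarized with a Hockney-style (r_inf, n_1/2) model. The sketch below is not from the slides, and the machine parameters are invented placeholders, not measured SX-3 or CM5 figures; it only shows why a machine with a large n_1/2 (an MPP) loses at small orders and wins only at very large ones.

# Hockney-style model: delivered rate approaches the asymptotic rate r_inf only once
# the matrix order n is well beyond n_half, the order at which half of r_inf is reached.
def rate_gflops(n, r_inf, n_half):
    return r_inf / (1.0 + n_half / n)

def solution_time_s(n, r_inf, n_half):
    flops = (2.0 / 3.0) * n ** 3            # LU factorization operation count
    return flops / (rate_gflops(n, r_inf, n_half) * 1e9)

# Invented parameters: a vector super with small n_half vs. a 1K-node MPP with large n_half.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(rate_gflops(n, 20, 500), 2), round(rate_gflops(n, 60, 20_000), 2))

With these made-up numbers the crossover is around order 10,000: below that the "super" delivers more Gflops, above it the "MPP" does.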

Page 17:

GB's Estimate of Parallelism in Engineering & Scientific Applications

[Chart: log(# of apps) vs. granularity & degree of coupling (comp./comm.), ranging from dusty decks for supers to new or scaled-up apps, and from supers and WSs to massive mCs & WSs, with scalable multiprocessors spanning the range.]
• scalar: 60%
• vector: 15%
• mP (<8) vector: 5%
• >>//: 5%
• embarrassingly or perfectly parallel: 15%

Page 18:

MPPs are only for unique, very large scale, data parallel apps

[Chart: application characterization -- scalar | vector | vector mP | data // | emb. // | gp work | viz | apps -- vs. price in $M (0.01 to 100, log scale), populated by supers (s), mPs, and WSs across the categories, with >>// (mono use) machines appearing only for data // apps at the top of the price range.]

Page 19:

Applicability of various technical computer alternatives

Domain        PC|WS   Multi servr   SC & Mfrm   >>//   WS Clusters
scalar         1       1             2           na     1*
vector         2*      2             1           3      2
vect. mP       na      2             1           3      na
data //        na      1             2           1      1
ep & inf. //   1       2             3           2      1
gp wrkld       3       1             1           na     2
visualiz'n     1       na            na          na     1
apps           1       1             1           na     from WS

*Current micros are weak, but improving rapidly such that subsequent >>//s that use them will have no advantage for node vectorization

Page 20:

Performance using distributed computers depends on problem & machine granularity

Berkeley's LogP model characterizes granularity & needs to be understood, measured, and used

Three parameters are given in terms of processing ops:

l = latency -- delay time to communicate between apps

o = overhead -- time lost transmitting messages

g = gap -- 1 / message-passing rate (bandwidth) -- time between messages

p = number of processors
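A rough reading of these parameters (a back-of-envelope sketch, not code from the talk; the numbers are placeholders): a short message costs about 2o + L, messages can be issued no faster than every max(g, o), and a node must do enough work per message for that cost to disappear into the grain.

# LogP-style back-of-envelope, with everything expressed in processor cycles (ops).
def burst_time(n_msgs, L, o, g):
    """Rough cycles for a node to issue n short messages and have the last one arrive."""
    return (n_msgs - 1) * max(g, o) + 2 * o + L

def min_grain(L, o, comm_fraction=0.1):
    """Ops of work needed per message to keep communication below comm_fraction of the time."""
    return (2 * o + L) / comm_fraction

# Placeholder parameter sets: a tightly coupled MPP network vs. workstations on a LAN.
print(burst_time(100, L=50, o=10, g=20))     # ~2,050 cycles to stream 100 short messages
print(min_grain(L=50, o=10))                 # ~700 ops: medium grain suffices
print(min_grain(L=100_000, o=30_000))        # ~1.6M ops: only very coarse grain pays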

Page 21:

Granularity Nomograph

[Nomograph relating processor speed (10M to 1G ops/s; markers for 1993 micro, 1995 micro, 1993 super, Ultra, C90, MPPs, WANs & LANs), grain communication latency & synchronization overhead (100 ns to 1 sec; 1 ms = LAN, 100 ms = WAN), and the resulting grain length in ops (100 to 10M), classified fine / medium / coarse / very coarse.]

Page 22:

Granularity Nomograph (with machines)

[The same nomograph annotated with machines: Cray T3D, C90, VPP 500, VP; processor speed 10M to 1G ops/s (1993 micro, 1995 micro, 1993 super, Ultra); grain communication latency & synchronization overhead from 100 ns (supers' memory) through 1 ms (LAN) and 100 ms (WAN) to 1 sec; resulting grain length 100 to 10M ops, classified fine / medium / coarse / very coarse.]
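The relation behind both nomographs is simply grain length (ops) ≈ processor speed x communication/synchronization time; a small sketch with assumed 1993-1995-era figures (not values read off the chart):

# Grain length needed to cover a given communication/synchronization delay, i.e. the
# product the nomograph reads off graphically. Speeds and latencies are rough assumptions.
def grain_ops(proc_ops_per_sec, latency_sec):
    return proc_ops_per_sec * latency_sec

cases = {
    "super, local memory (1 Gops/s, 100 ns)":    grain_ops(1e9, 100e-9),    # ~100 ops: fine
    "MPP network (100 Mops/s, 10 us)":           grain_ops(100e6, 10e-6),   # ~1K ops: medium
    "workstation on a LAN (100 Mops/s, 1 ms)":   grain_ops(100e6, 1e-3),    # ~100K ops: coarse
    "workstation on a WAN (100 Mops/s, 100 ms)": grain_ops(100e6, 100e-3),  # ~10M ops: very coarse
}
for name, ops in cases.items():
    print(f"{name}: ~{ops:,.0f} ops per grain")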

Page 23:

Economics of Packaged Software

Platform             Cost ($)   Leverage   # copies
MPP                  >100K      1          1-10 copies
Minis, mainframes†   10-100K    10-100     1000s of copies
Workstation          1-100K     1-10K      1-100K copies
PC                   25-500     50K-1M     1-10M copies

† also, evolving high performance multiprocessor servers
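One way to read the table: a package's price roughly tracks its development cost amortized over the platform's installed base. A toy calculation with an assumed $5M development cost (the figure is made up; the copy counts are representative points from the table's ranges):

# Hypothetical amortization of one package's development cost over each platform's volume.
DEV_COST = 5_000_000  # assumed, not from the slide

copies = {"MPP": 10, "Mini/mainframe": 5_000, "Workstation": 50_000, "PC": 5_000_000}
for platform, n in copies.items():
    print(f"{platform}: break-even price per copy ~${DEV_COST / n:,.0f}")
# MPP ~ $500,000/copy vs. PC ~ $1/copy -- which is why packaged apps follow volume platforms.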

Page 24:

Chuck Seitz comments on multicomputers

"I believe that the commercial, medium grained multicomputers aimed at ultra-supercomputer performance have adopted a relatively unprofitable scaling track, and are doomed to extinction. ... they may as Gordon Bell believes be displaced over the next several years by shared memory multiprocessors. ... For loosely coupled computations at which they excel, ultra-super multicomputers will, in any case, be more economically implemented as networks of high-performance workstations connected by high-bandwidth, local area networks ..."

Page 25:

Convergence to a single architecture with a single address space that uses a distributed, shared memory

limited (<20) scalability multiprocessors >> scalable multiprocessors

workstations with 1-4 processors >> workstation clusters & scalable multiprocessors

workstation clusters >> scalable multiprocessors

State Computers built as message passing multicomputers >> scalable multiprocessors

Page 26:

Convergence to one architecture

[Diagram: evolution of scalable multiprocessors, multicomputers, & workstations to shared memory computers]

Limited scalability mP, uniform memory access:
• mP bus-based multi: mini, W/S -- DEC, Encore, Sequent, Stratus, SGI, SUN, etc.
• mP ring-based multi
• mP mainframe, super -- Convex, Cray, Fujitsu, IBM, Hitachi, NEC mainframes & supers

Experimental, scalable multicomputer (smC), non-uniform memory access:
• 1st smC: hypercube, Transputer (grid) -- Cosmic Cube, iPSC 1, NCUBE, Transputer-based
• smC fine-grain -- Mosaic-C, J-machine
• smC med-coarse grain -- Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994
• smC next gen.: DSM?? => smP (1995?)
• smC coarse grain: clusters -- WS clusters via special switches 1994 & ATM 1995
• smC very coarse grain: networked workstations -- Apollo, SUN, HP, etc.; WS micros with a fast, high bandwidth switch & comm. protocols, e.g. ATM

Scalable mP (smP), non-uniform memory access:
• 1st smP, 0 cache -- Cm* ('75), Butterfly ('85), Cedar ('88)
• smP DSM, some cache -- DASH, Convex, Cray T3D, SCI
• smP all-cache arch. -- KSR Allcache
• next gen. smP research, e.g. DDM, DASH+ (1995?)

Natural evolution: cache for locality. Note, only two structures: 1. shared memory mP with uniform & non-uniform memory access; and 2. networked workstations, shared nothing. mPs continue to be the main line.

Page 27:

Re-engineering HPCS

Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation. Hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!

High price to support a DARPA U. to learn computer design -- the market is only $200 million and R&D is billions -- competition works far better

The inevitable movement to standard networks and nodes cannot and need not be accelerated; these best evolve by a normal market mechanism driven by users

Dual use of Networks & Nodes is the path to widescale parallelism, not weird computers

Networking is free via ATM. Nodes are free via in situ workstations. Apps follow pervasive computing environments.

Applicability was small and is getting smaller very fast, with many experienced computer companies entering the market with fine products (e.g. Convex/HP, Cray, DEC, IBM, SGI & SUN) that are leveraging their R&D, apps, apps, & apps

Japan has a strong supercomputer industry. The more we jeopardize ours by mandating the use of weird machines that take away from its use, the weaker it becomes.

MPP won; mainstream vendors have adopted multiple CMOS. Stop funding! Environments & apps are needed, but are unlikely because the market is small.

Page 28:

Recommendations to HPCS

Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes

Dual use, not duel use, of products and technology, or the principle of "elegance" -- one part serves more than one function: network companies supply networks; node suppliers use ordinary workstations/servers with existing apps; this will leverage $30 billion x 10**6 R&D

Fund high speed, low latency networks for a ubiquitous service as the base of all forms of interconnection, from WANs to supercomputers (in addition, some special networks will exist for small grain problems)

Observe heuristics in future federal program funding scenarios ... eliminate direct or indirect product development and mono-use computers. Fund Challenges, who in turn fund purchases, not product development.

Funding or purchase of apps porting must be driven by Challenges, but build on binary compatible workstation/server apps to leverage nodes, be cross-platform based to benefit multiple vendors, & have cross-platform use

Review effectiveness of State Computers, e.g., need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer.

Review // program environments & the efficacy to produce & support apps

Eliminate all forms of State Computers & recommend a balanced HPCS program: nodes & networks, based on industrial infrastructure. Stop funding the development of mono computers, including the 10 Tflop. It must be acceptable & encouraged to buy any computer for any contract.

Page 29:

Gratis advice for HPCC* & BS*

D. Bailey warns that scientists have almost lost credibility ...

Focus on Gigabit NREN with low overhead connections that will enable multicomputers as a by-product

Provide many small, scalable computers vs large, centralized

Encourage (revert to) & support not so grand challenges

Grand Challenges (GCs) need explicit goals & plans -- disciplines fund & manage (demand side) ... HPCC will not

Fund balanced machines/efforts; stop starting Viet Nams

Drop the funding & directed purchase of state computers

Revert to university research -> company & product development

Review the HPCC & GCs program's output ...

*High Performance Cash Conscriptor; Big Spenders

Page 30:

Disclaimer

This talk may appear inflammatory... i.e. the speaker may have appeared "to flame".

It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.

Page 31:

Scalability: The Platform of HPCS

The law of scalability

Three kinds: machine, problem x machine, & generation (t)

How do flops, memory size, efficiency & time vary with problem size?

What's the nature of problems & work for the computers?

What about the mapping of problems onto the machines?