TRANSCRIPT
© Gordon Bell1
Goal of the Committee
The goal of the committee is to assess the status of supercomputing in the United States, including the characteristics of relevant systems and architecture research in government, industry, and academia and the characteristics of the relevant market. The committee will examine key elements of context--the history of supercomputing, the erosion of research investment, the needs of government agencies for supercomputing capabilities--and assess options for progress. Key historical or causal factors will be identified. The committee will examine the changing nature of problems demanding supercomputing (e.g., weapons design, molecule modeling and simulation, cryptanalysis, bioinformatics, climate modeling) and the implications for systems design. It will seek to understand the role of national security in the supercomputer market and the long-term federal interest in supercomputing.
© Gordon Bell2
NRC-CSTB Future of Supercomputing Committee
22 May 2003
NRC Brooks-Sutherland Committee, 11 March 1994
NRC OSTP Report, 18 August 1984
Gordon Bell
Microsoft Bay Area Research Center
San Francisco
© Gordon Bell3
Outline
• Community re-Centric Computing vs. Computer Centers
• Background: Where we are and how we got here.
• Performance (t). Hardware trends and questions
• If we didn't have general purpose computation centers, would we invent them?
• Atkins Report: Past and future concerns... be careful what we wish for
• Appendices: NRC Brooks-Sutherland 94 comments; CISE (gbell) Concerns re. Supercomputing Centers ('87); CISE (gbell) //ism goals ('87); NRC Report 84 comments.
• Bottom line, independent of the question: It has been and always will be the software and apps, stupid! And now it's the data, too!
© Gordon Bell4
Community re-centric Computing...
• Goal: Enable technical communities to create their own computing environments for personal, data, and program collaboration and distribution.
• Design based on technology trends, especially networking, apps program maintenance, databases, & providing web and other services
• Many alternative styles and locations are possible
– Service from existing centers, including many state centers
– Software vendors could be encouraged to supply apps svcs
– NCAR style center around data and apps
– Instrument-based databases. Both central & distributed when multiple viewpoints create the whole
– Wholly distributed with many individual groups
© Gordon Bell5
Community re-Centric Computing: Time for a major change

Community Centric:
• Community is responsible
– Planned & budgeted as resources
– Responsible for its infrastructure
– Apps are community centric
– Computing is integral to sci./eng.
• In sync with technologies
– 1-3 Tflops/$M; 1-3 PBytes/$M to buy smallish Tflops & PBytes
• New scalables are "centers" fast
– Community can afford
– Dedicated to a community
– Program, data & database centric
– May be aligned with instruments or other community activities
• Output = web service; an entire community demands real-time web service

Centers Centric:
• Center is responsible
– Computing is "free" to users
– Provides a vast service array for all
– Runs & supports all apps
– Computing grant disconnected
• Counter to technology directions
– More costly. Large centers operate at a dis-economy of scale
• Based on unique, fast computers
– Center can only afford
– Divvy cycles among all communities
– Cycles centric; but politically difficult to maintain highest power vs. more centers
– Data is shipped to centers, requiring expensive, fast networking
• Output = diffuse among gp centers; are centers ready or able to support on-demand, real-time web services?
Background: scalability at last
• Q: How did we get to scalable computing and parallel processing?
• A: Scalability evolved from a two-decade-old vision and plan starting at DARPA & NSF. Now picked up by DOE & the rest of the world.
• Q: What should be done now?
• A: Realize scalability, the web, and now web services change everything. Redesign to get with the program!
• Q: Why do you seem to be wanting to de-center?
• A: Besides the fact that user demand has been and is totally de-coupled from supply, I believe technology doesn't necessarily support users or their mission, and that centers are potentially inefficient compared with a more distributed approach.
Copyright Gordon Bell
Steve Squires & Gordon Bell at our “Cray” at the start of DARPA’s SCI program c1984.
20 years later: Clusters of Killer micros become the single standard
Copyright Gordon Bell
RIP
Lost in the search for parallelism ACRI Alliant American Supercomputer Ametek Applied Dynamics Astronautics BBN CDC Cogent Convex > HP Cray Computer Cray Research > SGI > Cray Culler-Harris Culler Scientific Cydrome Dana/Ardent/Stellar/Stardent Denelcor Encore Elexsi ETA Systems Evans and Sutherland Computer Exa Flexible Floating Point Systems Galaxy YH-1
Goodyear Aerospace MPP Gould NPL Guiltech Intel Scientific Computers International Parallel Machines Kendall Square Research Key Computer Laboratories searching again MasPar Meiko Multiflow Myrias Numerix Pixar Parsytec nCube Prisma Pyramid Ridge Saxpy Scientific Computer Systems (SCS) Soviet Supercomputers Supertek Supercomputer Systems Suprenum Tera > Cray Company Thinking Machines Vitesse Electronics Wavetracer
A brief, simplified history of HPC
1. Cray formula smPv evolves for Fortran. 60-02 (US: 60-90)
2. 1978: VAXen threaten computer centers...
3. NSF response: Lax Report. Create 7 Cray centers, 1982-
4. SCI: DARPA searches for parallelism using killer micros
5. Scalability found: "bet the farm" on micro clusters. Users "adapt": MPI, the lcd programming model, found. >95. Result: EVERYONE gets to re-write their code!!
6. Beowulf clusters form by adopting PCs and Linus' Linux to create the cluster standard! (In spite of funders.) >1995
7. ASCI: DOE gets petaflops clusters, creating "arms" race!
8. "Do-it-yourself" Beowulfs negate computer centers since everything is a cluster and shared power is nil! >2000
9. 1997-2002: Tell Fujitsu & NEC to get "in step"!
10. High speed nets enable peer2peer & Grid or Teragrid
11. Atkins Report: Spend $1B/year, form more and larger centers and connect them as a single center...
Copyright Gordon Bell

The Virtuous Economic Cycle drives the PC industry... & Beowulf
[Cycle diagram linking: Innovation; Volume; Competition; Standards; DOJ; Utility/value; Greater availability @ lower cost; Creates apps, tools, training; Attracts users; Attracts suppliers.]
Copyright Gordon Bell
Lessons from Beowulf
An experiment in parallel computing systems, '92
• Established vision: low cost high end computing
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Provided cluster management tools
• Conveyed findings to broad community: tutorials and the book
• Provided design standard to rally community!
• Standards beget: books, trained people, software... a virtuous cycle that allowed apps to form
• Industry began to form beyond a research project
Courtesy, Thomas Sterling, Caltech.
Copyright Gordon Bell
Technology: peta-bytes, -flops, -bps
We get no technology before its time
• Moore's Law 2004-2012: 40X
• The big surprise will be the 64 bit micro
• 2004: O(100) processors = 300 GF PAP, $100K
– 3 TF/$M, not a diseconomy of scale for large systems
– 1 PF => $330M, but 330K processors; other paths
• Storage: 1-10 TB disks; 100-1000 disks
• Internet II killer app -- NOT teragrid
– Access Grid, new methods of communication
– Response time to provide web services
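The slide's cost/performance figures can be cross-checked with a few lines of arithmetic (a sketch; the 300 GF per $100K and 40X-by-2012 figures are the slide's, everything else is derived from them):

```python
# Check of the slide's cost/performance arithmetic. Assumed inputs (from
# the slide): 300 GF peak (PAP) per $100K system in 2004, and a 40X
# Moore's Law improvement over 2004-2012.
gf_per_system = 300.0     # GF peak per system, 2004
cost_per_system_m = 0.1   # $M per system

tf_per_million = gf_per_system / 1000 / cost_per_system_m
print(f"{tf_per_million:.0f} TF per $M")

# A 1 PF (1e6 GF) machine built from O(100)-processor systems:
systems = 1e6 / gf_per_system
print(f"1 PF: ${systems * cost_per_system_m:.0f}M, {systems * 100:,.0f} processors")

# The annual growth factor implied by 40X over 8 years:
annual = 40 ** (1 / 8)
print(f"implied growth: {annual:.2f}x/year")
```

The result is 3 TF/$M and roughly $333M / 333K processors for a petaflop (the slide rounds to $330M and 330K); the implied ~1.59x/year growth matches the 1.6x/year factor used throughout this deck.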
Computing Laws
Performance metrics (t) 1987-2009
[Chart: log scale, 0 to 10,000,000 -- RAP (GF), processor count, cost ($M), density (Gb/in), and Flops/$, with growth-rate guides at 60%, 100%, and 110%/yr; ES marked.]
© Gordon Bell14
Perf (PAP) = c x 1.6**(t-1992); c = 128 GF/$300M. ('94 prediction: c = 128 GF/$30M.)
[Chart: flops (PAP), 1E+08 to 1E+16 (log scale), vs. year 1992-2012 -- curves for GB peak, $30M super, $100M super, $300M super, and Flops(PAP)M/$.]
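Plugging years into the model above shows what the two constants imply (the 1.6x/year factor and both values of c are from the slides; this is just the formula tabulated):

```python
# Peak performance model from the slide: Perf(PAP) = c * 1.6**(t - 1992).
# The '94 prediction put c = 128 GF at $30M; the 2003 restatement puts
# c = 128 GF at $300M -- the same curve, with a 10x more expensive base point.
def perf_gf(t, c_gf=128.0):
    """Peak GF in year t, for a machine at the cost c was quoted for."""
    return c_gf * 1.6 ** (t - 1992)

for t in (1992, 1996, 2000, 2004):
    print(t, f"{perf_gf(t):,.0f} GF")
```

The 1996 value (~840 GF, crossing 1 TF during 1996-97) is the same crossing computed in the '94-appendix slides later in this deck; read against the 2003 constant, the curve arrives a decade later per dollar.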
Performance (TF) vs. cost ($M) of non-central and centrally distributed systems
[Chart: log-log plot, performance 0.01-100 TF vs. cost $0.1M-$100M, showing non-central systems, centers delivery, center purchase base, the centers allocation range, and "+" for centers (old style supers).]
National Semiconductor Technology Roadmap (size)
[Chart: 1995-2010 -- memory (Mbytes/chip) and micros (Mtransistors/chip), 1 to 10,000 (log scale), plotted against line width, 0 to 0.35; the 1 Gbit part is marked.]
Disk Density Explosion
Magnetic disk recording density (bits per mm2) grew at 25% per year from 1975 until 1989.
Since 1989 it has grown at 60-70% per year; since 1998 it has grown at >100% per year
– This rate will continue into 2003
Factors causing accelerated growth:
– Improvements in head and media technology
– Improvements in signal processing electronics
– Lower head flying heights
Courtesy Richie Lary
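To see what the rate shift means in practice, compound each regime over ten years (the annual rates are the slide's; the decade multipliers follow from them):

```python
# Compound the slide's annual areal-density growth rates over a decade.
def per_decade(annual_rate):
    return (1 + annual_rate) ** 10

for label, rate in [("1975-1989", 0.25), ("1989-1998", 0.60), ("since 1998", 1.00)]:
    print(f"{label}: {rate:.0%}/yr -> {per_decade(rate):,.0f}x per decade")
```

The move from 25%/yr to >100%/yr is the difference between ~9x and ~1000x per decade, which is what drives the disk/tape convergence and capacity/performance slides that follow.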
Computing Laws
National Storage Roadmap 2000
[Chart: areal density projections -- 100x/decade (labeled 100%/year) vs. the historical ~10x/decade (labeled 60%/year).]
Disk / Tape Cost Convergence
[Chart: retail price, $0.00-$3.00, 1/01 to 1/05, for a 5400 RPM ATA disk vs. an SDLT tape cartridge.]
3½" ATA disk could cost less than SDLT cartridge in 2004, if disk manufacturers maintain the 3½", multi-platter form factor.
Volumetric density of disk will exceed tape in 2001. "Big Box of ATA Disks" could be cheaper than a tape library of equivalent size in 2001.
Courtesy of Richard Lary
Disk Capacity / Performance Imbalance
Capacity growth outpacing performance growth
Difference must be made up by better caching and load balancing
Actual disk capacity may be capped by market (red line); shift to smaller disks (already happening for high speed disks).
[Chart: 1992-2001 -- capacity grew 140x in 9 years (73%/yr); performance grew 3x in 9 years (13%/yr).]
Courtesy of Richard Lary
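The chart's per-year rates follow from its 9-year multipliers (140x and 3x are the chart's figures; this just inverts the compounding):

```python
# Recover annual growth rates from the chart's 9-year multipliers.
def annual_rate(multiplier, years=9):
    return multiplier ** (1 / years) - 1

print(f"capacity:    {annual_rate(140):.0%}/yr")  # matches the chart's 73%/yr
print(f"performance: {annual_rate(3):.0%}/yr")    # matches the chart's 13%/yr
```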
© Gordon Bell21
Re-Centering to Community Centers
• There is little rational support for general purpose centers
– Scalability changes the architecture of the entire Cyberinfrastructure
– No need to have a computer bigger than the largest parallel app.
– They aren't super.
– World is substantially data driven, not cycles driven.
– Demand is de-coupled from supply planning, payment, or services
• Scientific / Engineering computing has to be the responsibility of each of its communities
– Communities form around instruments, programs, databases, etc.
– Output is web service for the entire community
© Gordon Bell22
Grand Challenge (c2002) problems become desktop (c2004) tractable
I don't buy the problem-growth mantra: 2x resolution => >2**4 the work (yrs.)
• Radical machines will come from the low cost 64-bit explosion
• Today's desktop has trumped and will increasingly trump yesteryear's super, simply due to the memory size explosion
• Pittsburgh Alpha: 3D MRI skeleton computing/viewing is a desktop problem, given a large memory
• Tony Jamieson: "I can model an entire 747 on my laptop!"
© Gordon Bell23
Centers aren’t very super…
• Pittsburgh: 6; NCAR: 10; LSU: 17; Buffalo: 22; FSU: 38; San Diego: 52; NCSA: 82; Cornell: 88; Utah: 89
• 17 universities, world-wide, in top 100
• Massive upgrade is continuously required:
– Large memories: machines aren't balanced and haven't been. Bandwidth 1 Byte/flop vs. 24 Bytes/flop
– File Storage > Databases
• Since centers systems have >4 year lives, they start out obsolete and overpriced... and then get worse.
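A sketch of the obsolescence arithmetic, assuming the ~60%/yr (1.6x/year) commodity improvement rate used elsewhere in this deck:

```python
# How far a fixed installation falls behind commodity hardware that
# improves 1.6x per year (the rate assumed throughout this deck).
RATE = 1.6
for age in range(5):
    print(f"year {age}: commodity parts are {RATE ** age:4.1f}x the installed system")
```

By the end of a 4-year life the gap is ~6.5x, before counting the procurement and installation lag that leaves a big system behind the curve on day one.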
© Gordon Bell24
Centers: The role going forward
• The US builds scalable clusters, NOT supercomputers
– Scalables are 1 to n commodity PCs that anyone can assemble.
– Unlike the "Crays," all are equal. Use is allocated in small clusters.
– Problem parallelism sans ∞// has been elusive (limited to 1K)
– No advantage to having a computer larger than a //able program
• User computation can be acquired and managed effectively.
– Computation is divvied up in small clusters, e.g. 128 nodes, that individual groups can acquire and manage effectively
• The basic hardware evolves; it doesn't especially favor centers
– 64-bit architecture. 512 Mb x 32/DIMM = 2 GB; x 4/system = 8 GB systems >> 16 GB. (Centers machine will be obsolete, by memory / balance rules.)
– 3 year timeframe: 1 TB disks at $0.20/TB
– Last mile communication costs are not decreasing to favor centers or grids.
© Gordon Bell25
Review the bidding
• 1984: Japanese are coming. CMOS and killer micros. Build // machines.
– 40+ computers were built & failed based on CMOS and/or micros
– No attention to software or apps
• 1994: Parallelism and Grand Challenges
– Converge to Linux Clusters (Constellations: nodes >1 proc.) & MPI
– No noteworthy middleware software to aid apps & replace Fortran; e.g. HPF failed.
– Whatever happened to the Grand Challenges??
• 2004: Teragrid has potential as a massive computer and massive research
– We have and will continue to have the computer clusters to reach a <$300M petaflop
– Massive review and re-architecture of centers and their function.
– Instrument/program/data/community centric (CERN, Fermi, NCAR, Calera)
© Gordon Bell26
Recommendations
• Give careful advice on the Atkins Report (it is just the kind of plan that is likely to fly.)
• Community Centric Computing
– Community/instrument/data/program centric (Calera, CERN, NCAR)
• Small number of things to report
– Forget about hardware for now... it's scalables. The die has been cast.
– Support training, apps, and any research to ease apps development.
– Databases represent the biggest gain. Don't grep, just reference it.
© Gordon Bell27
The End
© Gordon Bell28
Atkins Report: Be careful of what you ask for
• Suggestions (gbell)
– Centers to be re-centered in light of data versus flops
– Overall re-architecture based on user need & technology
– Highly distributed structure aligned with users who plan their facilities
– Very skeptical of "gridized" projects, e.g. tera, GGF
– Training in the use of databases is needed! It will get more productivity than another generation of computers.
• The Atkins Report
– $1.02 billion per year recommendation for research, buying software, and $600M to build and maintain more centers that are certain to be obsolete and non-productive
Summary to Atkins Report 2/15/02 15:00 gbell
• Same old concerns: "I don't have as many flops as users at the national labs."
• Many facilities should be distributed, with build-it-yourself Beowulf clusters to get extraordinary cycles and bytes.
• Centers need to be re-centered; see Bell & Gray, "What's Next in High Performance Computing?", Comm. ACM, Feb. 2002, pp. 91-95.
• Scientific computing needs re-architecting based on networking, communication, computation, and storage. Centrality versus distributed depends on costs and the nature of the work, e.g. instrumentation that generates lots of data. (Last mile problem is significant.)
– Fedex'd hard drive is low cost. Cost of hard drive < network cost. Net is very expensive!
– Centers flops and bytes are expensive. Distributed likely to be less so.
– Many sciences need to be reformulated as distributed computing/database
• Network costs (last mi.) are a disgrace. $1 billion boondoggle with NGI, Internet II.
• Grid funding: Not in line with COTS or IETF model. Another very large SW project!
• Give funding to scientists in joint grants with tool builders; e.g. the www came from a user
• Database technology is not understood by users and computer scientists
– Training, tool funding, & combined efforts, especially when large & distributed
– Equipment, problems, etc. are dramatically outstripping capabilities!
• It is still about software, especially in light of scalable computers that require reformulation into a new, // programming model
© Gordon Bell30
Atkins Report: the critical challenges
1) build real synergy between computer and information science research and development, and its use in science and engineering research and education;
2) capture the cyberinfrastructure commonalities across science and engineering disciplines;
3) use cyberinfrastructure to empower and enable, not impede, collaboration across science and engineering disciplines;
4) exploit technologies being developed commercially and apply them to research applications, as well as feed back new approaches from the scientific realm into the larger world;
5) engage social scientists to work constructively with other scientists and technologists.
© Gordon Bell31
Atkins Report: Be careful of what you ask for
1. fundamental research to create advanced cyberinfrastructure ($60M);
2. research on the application of cyberinfrastructure to specific fields of science and engineering research ($100M);
3. acquisition and development of production quality software for cyberinfrastructure and supported applications ($200M);
4. provisioning and operations (including computational centers, data repositories, digital libraries, networking, and application support) ($600M).
5. archives for software ($60M)
© Gordon Bell32
NRC Review Panel on High Performance Computing
11 March 1994
Gordon Bell
© Gordon Bell33
Position
Dual use: Exploit parallelism with in situ nodes & networks. Leverage WS & mP industrial HW/SW/app infrastructure!
No Teraflop before its time -- it's Moore's Law
It is possible to help fund computing: Heuristics from federal funding & use (50 computer systems and 30 years)
Stop Duel Use, genetic engineering of State Computers
• 10+ years: nil pay back, mono use, poor, & still to come
• plan for apps porting to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
• let "Challenges" choose apps, not mono use computers
• "industry" offers better computers & these are jeopardized
• users must be free to choose their computers, not funders
• next generation State Computers "approach" industry
• 10 Tflop ... why?
Summary recommendations
© Gordon Bell34
Principal computing environments circa 1994 --> 4 networks to support mainframes, minis, UNIX servers, workstations & PCs
[Diagram: four coexisting worlds -- the IBM & proprietary mainframe world ('50s), with 3270 (& PC) terminals and a POTS net for switching; the '70s proprietary mini world / '90s UNIX mini world, with ASCII & PC terminals; the '80s UNIX distributed workstation & server world, with Ethernet LANs (gateways, bridges, routers, hubs, etc.), UNIX workstations, X terminals, NFS servers, compute & dbase uni- & mP servers, and clusters; and the late-'80s LAN-PC world, with Token-ring LANs, PCs (DOS, Windows, NT), and Novell & NT servers -- tied together by a wide-area inter-site network. >4 interconnect & comm. standards: POTS & 3270 terminals; WAN (comm. stds.); LAN (2 stds.); clusters (proprietary). UNIX multiprocessor servers operated as traditional minicomputers.]
© Gordon Bell35
Computing environments circa 2000
[Diagram: a universal high speed data service, using ATM or ??, connects the local & global data comm worlds -- ATM† & Local Area Networks for terminals, PCs, workstations, & servers; centralized & departmental uni- & mP servers (UNIX & NT); legacy mainframe & minicomputer servers & terminals; centralized & departmental scalable uni- & mP servers* (NT & UNIX); NT, Windows & UNIX person servers (platforms: x86, PowerPC, Sparc, etc.); a wide-area global ATM network; and TC=TV+PC home... (CATV or ATM) ???
*multicomputers built from multiple simple servers: NFS, database, compute, print, & communication servers
†also 10-100 mb/s pt-to-pt Ethernet]
© Gordon Bell36
Beyond Dual & Duel Use Technology: Parallelism can & must be free!
HPCS, corporate R&D, and technical users must have the goal to design, install, and support parallel environments using and leveraging:
• every in situ workstation & multiprocessor server
• as part of the local ... national network.
Parallelism is a capability that all computing environments can & must possess! -- not a feature to segment "mono use" computers
Parallel applications become a way of computing utilizing existing, zero cost resources -- not a subsidy for specialized ad hoc computers
Apps follow pervasive computing environments
© Gordon Bell37
Computer genetic engineering & species selection has been ineffective
Although Problem x Machine Scalability using SIMD for simulating some physical systems has been demonstrated, given extraordinary resources, the efficacy of larger problems to justify cost-effectiveness has not. Hamming: "The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be the drivers. ARPA's contractors should re-evaluate their research in light of driving needs.
Federally funded "Challenge" apps porting should be to multiple platforms, including workstations & compatible multis that support // environments, to ensure portability and understand main line cost-effectiveness.
Continued "supply side" programs aimed at designing, purchasing, supporting, sponsoring, & porting of apps to specialized State Computers, including programs aimed at 10 Tflops, should be re-directed to networked computing.
Users must be free to choose and buy any computer, including PCs & WSs, WS clusters, multiprocessor servers, supercomputers, mainframes, and even highly distributed, coarse grain, data parallel, MPP State Computers.
© Gordon Bell38
Performance (t)
[Chart: performance, 1 to 10,000 (log scale), vs. year, 1988-2000 -- points for Intel $55M, Intel $300M, NEC, Cray Super $30M, Cray DARPA, and CM5 at $30M, $120M, $240M; "The teraflops" level and Bell Prize winners marked.]
© Gordon Bell39
We get no Teraflop before its time: it's Moore's Law!
Flops = f(t,$), not f(t) -- technology plans, e.g. BAA 94-08, ignore $s!
All Flops are not equal: peak announced performance (PAP) or real app performance (RAP)
Flops_CMOS-PAP* ≤ C x 1.6**(t-1992) x $; C = 128 x 10**6 flops / $30,000
Flops_RAP = Flops_PAP x 0.5 for real apps; 1/2 PAP is a great goal
Flops_supers = Flops_CMOS x 0.1; improvement of supers 15-40%/year; higher cost is f(need for profitability, lack of subsidies, volume, SRAM)
'92-'94: Flops_PAP/$ = 4K; Flops_supers/$ = 500; Flops_vsp/$ = 50M (1.6 G @ $25)
*Assumes primary & secondary memory size & costs scale with time: memory = $50/MB in 1992-1994 violates Moore's Law; disks = $1/MB in 1993, size must continue to increase at 60%/year
When does a Teraflop arrive if only $30 million** is spent on a super?
1 Tflop_CMOS PAP in 1996 (x7.8) with 1 GFlop nodes!!!; or 1997 if RAP
10 Tflop_CMOS PAP will be reached in 2001 (x78), or 2002 if RAP
How do you get a teraflop earlier?
**A $60-$240 million Ultracomputer reduces the time by 1.5-4.5 years.
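The dates above follow directly from the formula on this slide; a short check (C, the 1.6x/year factor, and the budgets are the slide's):

```python
import math

# Flops(t, $) = C * 1.6**(t - 1992) * dollars, with C = 128e6 flops / $30,000.
C = 128e6 / 30_000  # peak flops per dollar in 1992

def year_reached(target_flops, dollars):
    """Year the model says `dollars` buys `target_flops` of peak."""
    growth_needed = target_flops / (C * dollars)
    return 1992 + math.log(growth_needed, 1.6)

print(f"1 TF @ $30M:  {year_reached(1e12, 30e6):.1f}")  # ~1996 (the 7.8x)
print(f"10 TF @ $30M: {year_reached(1e13, 30e6):.1f}")  # ~2001 (the 78x)
# An Ultracomputer at 2x-8x the budget ($60M-$240M) pulls the date in by:
print(f"{math.log(2, 1.6):.1f} to {math.log(8, 1.6):.1f} years")
```

The last line reproduces the slide's "1.5-4.5 years" head start for a bigger budget.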
© Gordon Bell40
Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation. Hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
High price to support a DARPA U. to learn computer design -- the market is only $200 million and R&D is billions -- competition works far better
Inevitable movement of standard networks and nodes cannot or need not be accelerated; these best evolve by a normal market mechanism driven by users
Dual use of Networks & Nodes is the path to widescale parallelism, not weird computers
Networking is free via ATM. Nodes are free via in situ workstations. Apps follow pervasive computing environments.
Applicability was small and getting smaller very fast, with many experienced computer companies entering the market with fine products, e.g. Convex/HP, Cray, DEC, IBM, SGI & SUN, that are leveraging their R&D, apps, apps, & apps
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating use of weird machines that take away from use, the weaker it becomes.
MPP won; mainstream vendors have adopted multiple CMOS. Stop funding! Environments & apps are needed, but are unlikely because the market is small.
© Gordon Bell41
Recommendations to HPCS
Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes
Dual use, not duel use, of products and technology, or the principle of "elegance" -- one part serves more than one function: network companies supply networks; node suppliers use ordinary workstations/servers with existing apps; will leverage $30 billion x 10**6 R&D
Fund high speed, low latency networks for a ubiquitous service as the base of all forms of interconnection, from WANs to supercomputers (in addition, some special networks will exist for small grain probs)
Observe heuristics in future federal program funding scenarios ... eliminate direct or indirect product development and mono-use computers. Fund Challenges, who in turn fund purchase, not product development.
Funding or purchase of apps porting must be driven by Challenges, but build on binary compatible workstation/server apps to leverage nodes; be cross-platform based to benefit multiple vendors & have cross-platform use
Review effectiveness of State Computers, e.g., need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer.
Review // program environments & the efficacy to produce & support apps
Eliminate all forms of State Computers & recommend a balanced HPCS program: nodes & networks, based on industrial infrastructure; stop funding the development of mono computers, including the 10 Tflop; it must be acceptable & encouraged to buy any computer for any contract
Gratis advice for HPCC* & BS*
D. Bailey warns that scientists have almost lost credibility....
Focus on Gigabit NREN with low overhead connections that will enable multicomputers as a by-product
Provide many small, scalable computers vs large, centralized
Encourage (revert to) & support not so grand challenges
Grand Challenges (GCs) need explicit goals & plans --disciplines fund & manage (demand side)... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams (efforts that are rat holes that you can’t get out of)
Drop the funding & directed purchase of state computers
Revert to university research -> company & product development
Review the HPCC & GCs program's output ...
*High Performance Cash Conscriptor; Big Spenders
© Gordon Bell43
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?
© Gordon Bell44
Disclaimer
This talk may appear inflammatory... i.e. the speaker may have appeared "to flame".
It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.
© Gordon Bell45
Backups
© Gordon Bell46
Funding Heuristics (50 computers & 30 years of hindsight)
1. Demand side works, i.e., "we need this product/technology for x"; supply side doesn't work! "Field of Dreams": build it and they will come.
2. Direct funding of university research resulting in technology and product prototypes that is carried over to start up a company is the most effective -- provided the right person & team are backed and have a transfer avenue.
a. Forest Baskett > Stanford, funding various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs... rare; an accident if something emerges
3. A demanding & tolerant customer or user who "buys" products works best to influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN)
a. DOE labs have been effective buyers and influencers ("Fernbach policy"); unclear if labs are effective product or apps or process developers
b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of "large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!
© Gordon Bell47
Funding Heuristics - 2
5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g. by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g. supercomputers).
A significantly smaller universe of computing environments is needed. Cray & IBM are given; SGI is probably the most profitable technical; HP/Convex are likely to be a contender, & others (e.g., DEC) are trying. No state company (Intel, TMC, Tera) is likely to be profitable & hence self-sustaining.
6. University-company collaboration is a new area of government R&D. So far it hasn't worked, nor is it likely to, unless the company invests. Appears to be a way to help a company fund marginal people and projects.
7. CRADAs, or cooperative research and development agreements, are very closely allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or the porting of apps to one platform (e.g., EMI analysis), is a way to keep marginal computers afloat. If government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs from those who have not used or built a successful computer.
© Gordon Bell48
Scalability: The Platform of HPCS & why continued funding is unnecessary
Mono use (aka MPP) machines have been, are, and will be doomed
The law of scalability
Four scalabilities: machine, problem x machine, generation (t), & now spatial
How do flops, memory size, efficiency & time vary with problem size? Does insight increase with problem size?
What's the nature of problems & work for monos?
What about the mapping of problems onto monos?
What about the economics of software to support monos?
What about all the competitive machines? e.g. workstations, workstation clusters, supers, scalable multis, attached P?
© Gordon Bell49
Special, mono-use MPPs are doomed... no matter how much fedspend!
Special because they have non-standard nodes & networks -- with no apps. Having not evolved to become mainline, events have overtaken them.
It's special purpose if it's only in Dongarra's Table 3. Flop rate, execution time, and memory size vs. problem size show applicability limited to very large scale problems, which must be scaled to cover the inherent, high overhead.
Conjecture: a properly used supercomputer will provide greater insight and utility because of the apps and generality -- running more, smaller sized problems with a plan produces more insight.
The problem domain is limited & now they have to compete with:
• supers -- do scalars, fine grain, and work, and have apps
• workstations -- do very long grain, are in situ, and have apps
• workstation clusters -- have identical characteristics and have apps
• low priced ($2 million) multis -- are superior, i.e., shorter grain, and have apps
• scalable multiprocessors -- formed from multis, are in design stage
Mono useful (>>//) -- hence illegal, because they are not dual use. Duel use -- only useful to keep a high budget intact, e.g., 10 TF.
© Gordon Bell50
The Law of Massive Parallelism isbased on application scale
There exists a problem that can be made sufficiently large such that any network of computers can run efficiently given enough memory, searching, & work -- but this problem may be unrelated to no other problem.
A ... any parallel problem can be scaled to run on an arbitrary network of computers, given enough memory and time
Challenge to theoreticians: How well will an algorithm run?
Challenge for software: Can package be scalable & portable?
Challenge to users: Do larger scale, faster, longer run times, increase problem insight and not just flops?
Challenge to HPCC: Is the cost justified? If so, let users do it!
Scalabilities
Size scalable computers are designed from a few components, with no bottleneck component.
Generation scalable computers can be implemented with next-generation technology with no rewrite or recompile.
Problem x machine scalability - ability of a problem, algorithm, or program to exist at a range of sizes so that it can be run efficiently on a given, scalable computer.
Although large scale problems allow high flops, large problems running longer may not produce more insight.
Spatial scalability -- ability of a computer to be scaled over a large physical space to use in situ resources.
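The worry that scaled-up problems deliver flops rather than insight can be made concrete with a toy weak-scaling model (my illustration, not from the talk; the 5% serial fraction is an assumed number), using Gustafson's scaled-speedup formula:

```python
# Toy weak-scaling model (illustrative; the 5% serial fraction is an
# assumption, not a figure from the talk).
# Gustafson's law: if the problem grows with the machine, scaled speedup
# on p processors with serial fraction s is  s + (1 - s) * p.

def scaled_speedup(p: int, serial_fraction: float) -> float:
    """Speedup when the problem size is scaled up with p."""
    s = serial_fraction
    return s + (1.0 - s) * p

def efficiency(p: int, serial_fraction: float) -> float:
    """Fraction of peak flops actually delivered."""
    return scaled_speedup(p, serial_fraction) / p

# Efficiency stays high as p grows -- the flops scale -- but nothing in
# the model says one 1024-node run of a scaled-up problem yields more
# insight than many small runs, which is the slide's caution.
for p in (8, 64, 1024):
    print(p, efficiency(p, 0.05))
```

Under this sketch, efficiency barely moves between 8 and 1024 nodes, which is exactly why "high flops on a scaled problem" is a weak proxy for utility.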
[Chart: Linpack rate in Gflops (1 to 100) vs. matrix order (1 to 100,000); curves for the NEC SX-3 (4 processors) and the CM-5 (1K nodes).]
[Chart: Linpack solution time (1 to 1,000) vs. matrix order (1 to 100,000); curves for the NEC SX-3 (4 processors) and the CM-5 (1K nodes).]
GB's Estimate of Parallelism in Engineering & Scientific Applications
[Chart: log(# of apps) vs. granularity & degree of coupling (computation/communication). Estimated shares: scalar 60%; vector 15%; mP (<8) vector 5%; >>// 5%; embarrassingly or perfectly parallel 15%. Dusty decks for supers sit at the fine-grain end; new or scaled-up apps at the coarse end. Machine coverage: supers and WSs for the fine-grain classes, massive mCs & WSs for the embarrassingly parallel end, with scalable multiprocessors spanning the range.]
MPPs are only for unique, very large scale, data parallel apps
[Chart: application characterization -- machine price ($M, 0.01 to 100) vs. application class (scalar | vector | vector mP | data // | emb. // | gp work | viz | apps). Supercomputers (s), multiprocessors (mP), and workstations (WS) appear across every class; >>// (mono-use) machines appear only in the data parallel and embarrassingly parallel columns.]
Applicability of various technical computer alternatives

Domain        PC|WS   Multi servr   SC & Mfrm   >>//   WS Clusters
scalar          1         1             2        na        1*
vector          2*        2             1        3         2
vect. mP        na        2             1        3         na
data //         na        1             2        1         1
ep & inf. //    1         2             3        2         1
gp wrkld        3         1             1        na        2
visualiz'n      1         na            na       na        1
apps            1         1             1        na        from WS

*Current micros are weak but improving rapidly, such that subsequent >>// machines that use them will have no advantage for node vectorization.
Performance using distributed computers depends on problem & machine granularity

Berkeley's LogP model characterizes granularity & needs to be understood, measured, and used.
The model's parameters, given in terms of processor operations:
L = latency -- delay time to communicate a message between nodes
o = overhead -- processor time lost sending & receiving messages
g = gap = 1 / message-passing rate (bandwidth) -- minimum time between messages
P = number of processors
Granularity Nomograph
[Nomograph relating three scales: processor speed (10M to 1G ops/s), grain length (100 to 10M ops), and grain communication latency & synchronization overhead (100 ns to 1 s, with ~1 ms marked LAN and ~100 ms marked WAN). Grains classified fine / medium / coarse / very coarse. Points plotted: C90, MPPs, WANs & LANs, 1993 and 1995 micros, and Ultra.]
Granularity Nomograph (machines plotted)
[Same nomograph with machines placed: Cray T3D, C90, Fujitsu VPP 500, and VP; supercomputer memory latency at ~100 ns, LAN at ~1 ms, WAN at ~100 ms; 1993 and 1995 micros, 1993 supers, and Ultra also shown.]
Economics of Packaged Software
Platform            Cost       Leverage   # copies
MPP                 >100K      1          1-10
Mini, mainframe*    10-100K    10-100     1000s
Workstation         1-100K     1-10K      1-100K
PC                  25-500     50K-1M     1-10M

*also evolving high-performance multiprocessor servers
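The leverage column is essentially a fixed development cost amortized over each platform's installed base; a sketch of the arithmetic (the $10M development cost is a hypothetical number, not from the slide):

```python
# Amortizing a fixed development cost over each platform's volume
# (illustrative; the $10M development cost is an assumption).

def price_floor(dev_cost: float, copies: int) -> float:
    """Development cost that each sold copy must recover."""
    return dev_cost / copies

DEV_COST = 10_000_000  # hypothetical packaged-software development cost
for platform, copies in [("MPP", 10), ("mini/mainframe", 1_000),
                         ("workstation", 100_000), ("PC", 10_000_000)]:
    print(f"{platform}: ${price_floor(DEV_COST, copies):,.0f} per copy")
```

Three orders of magnitude in volume is three orders of magnitude in the price floor per copy, which is why packaged software follows the high-volume platforms and the MPP app base never forms.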
Chuck Seitz comments on multicomputers

"I believe that the commercial, medium grained multicomputers aimed at ultra-supercomputer performance have adopted a relatively unprofitable scaling track, and are doomed to extinction. ... they may as Gordon Bell believes be displaced over the next several years by shared memory multiprocessors. ... For loosely coupled computations at which they excel, ultra-super multicomputers will, in any case, be more economically implemented as networks of high-performance workstations connected by high-bandwidth, local area networks..."
Convergence to a single architecture with a single address space that uses a distributed, shared memory:
• limited (<20) scalability multiprocessors >> scalable multiprocessors
• workstations with 1-4 processors >> workstation clusters & scalable multiprocessors
• workstation clusters >> scalable multiprocessors
• State Computers built as message-passing multicomputers >> scalable multiprocessors
Convergence to one architecture
[Diagram: evolution of scalable multiprocessors, multicomputers, and workstations toward shared-memory computers, c. 1995. Note: only two structures remain: 1. shared-memory mPs with uniform & non-uniform memory access, which continue to be the main line; and 2. networked, shared-nothing workstations.
• mP, limited scalability, uniform memory access: bus-based multis for minis & workstations (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.) and ring-based multis; mainframes & supers (Convex, Cray, Fujitsu, IBM, Hitachi, NEC); early scalable-mP experiments: Cm* ('75), Butterfly ('85), Cedar ('88).
• smC, experimental scalable multicomputers, non-uniform memory access: 1st smCs as hypercubes & Transputer grids (Cosmic Cube, iPSC 1, NCUBE, Transputer-based); fine-grain smCs (Mosaic-C, J-machine) => DSM?; medium-to-coarse grain smCs (Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994) => next-generation DSM => smP.
• smP, scalable mP, non-uniform memory access: 1st smPs with no cache; DSM with some cache (DASH, Convex, Cray T3D, SCI); all-cache architecture (KSR Allcache); next-generation smP research, e.g., DDM, DASH+. Natural evolution adds cache for locality.
• Very coarse grain smC: networked workstations (Apollo, SUN, HP, etc.); WS clusters via special switches (1994) and ATM (1995), built from WS micros, fast switches, and high-bandwidth switches & communication protocols (e.g., ATM).]
Re-engineering HPCS

Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation, hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
High price to pay to support a DARPA U. to learn computer design -- the market is only $200 million and the R&D is billions; competition works far better.
The inevitable movement to standard networks and nodes cannot and need not be accelerated; these best evolve by a normal market mechanism, driven by users.
Dual use of networks & nodes is the path to widescale parallelism, not weird computers: networking is free via ATM, nodes are free via in situ workstations, and apps follow pervasive computing environments.
Applicability was small and getting smaller very fast, with many experienced computer companies entering the market with fine products (e.g., Convex/HP, Cray, DEC, IBM, SGI & SUN) that are leveraging their R&D and apps, apps, & apps.
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating the use of weird machines that take away from use, the weaker it becomes.
MPP won; mainstream vendors have adopted multiple CMOS micros. Stop funding! Environments & apps are needed, but are unlikely because the market is small.
Recommendations to HPCS

Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes.
• Dual use, not duel use, of products and technology -- the principle of "elegance": one part serves more than one function. Network companies supply networks; node suppliers use ordinary workstations/servers with existing apps, leveraging $30 billion x 10**6 R&D.
• Fund high speed, low latency networks as a ubiquitous service -- the base of all forms of interconnection from WANs to supercomputers (in addition, some special networks will exist for small-grain problems).
• Observe heuristics in future federal program funding scenarios: eliminate direct or indirect product development and mono-use computers. Fund Challenges, who in turn fund purchases, not product development.
• Funding or purchase of apps porting must be driven by Challenges, but should build on binary-compatible workstation/server apps to leverage nodes, be cross-platform based to benefit multiple vendors, & have cross-platform use.
• Review the effectiveness of State Computers, e.g., need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer.
• Review // programming environments & their efficacy to produce & support apps.
• Eliminate all forms of State Computers & recommend a balanced HPCS program: nodes & networks, based on industrial infrastructure. Stop funding the development of mono computers, including the 10 Tflop. It must be acceptable & encouraged to buy any computer for any contract.
Gratis advice for HPCC* & BS*

D. Bailey warns that scientists have almost lost credibility....
Focus on Gigabit NREN with low overhead connections that will enable multicomputers as a by-product
Provide many small, scalable computers vs large, centralized
Encourage (revert to) & support not so grand challenges
Grand Challenges (GCs) need explicit goals & plans --disciplines fund & manage (demand side)... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams (efforts that are rat holes that you can’t get out of)
Drop the funding & directed purchase of state computers
Revert to university research -> company & product development
Review the HPCC & GCs program's output ...
*High Performance Cash Conscriptor; Big Spenders
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?
© Gordon Bell68
Disclaimer
This talk may appear inflammatory... i.e. the speaker may have appeared "to flame".
It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.
Re. Centers Funding: August 1987 gbell memo to E. Bloch

A fundamentally broken system!
1. Status quo. NSF funds them, as we do now, in competition with computer science... use is completely decoupled from the supply... If I make the decision to trade off, it will not favor the centers.
2. Central facility. NSF funds ASC as an NSF central facility. This allows the Director, who has purview over all facilities and research, to make the trade-offs across the foundation.
3. NSF Directorate taxation. NSF funds it via some combination of the directorates on a taxed basis. The overall budget is set by the ADs. DASC would present the options and administer the program.
4. Directorate-based centers. The centers (all or in part) are "given" to the research directorates. NCAR provides an excellent model. Engineering might also operate a facility. I see great economy, increased quality, and effectiveness coming through specialization of programs, databases, and support.
5. Co-pay. In order to differentially charge for all the upgrades... a tax would be levied on various allocation awards. Such a tax would be nominal (e.g. 5%) in order to deal with the infinite appetite for new hardware and software. This would allow other agencies who use the computers to help pay as well.
6. Manufacturer support. Somehow, I don't see this changing for a long time. A change would require knowing something about the power of the machines so that manufacturers could compete to provide lower costs. BTW: Erich Bloch and I visited Cray Research and succeeded in getting their assistance.
7. Make the centers larger to share support costs. Manufacturers or service providers could contract with the centers to "run" facilities. This would reduce our costs somewhat on a per-machine basis.
8. Fewer physical centers. While we could keep the number of centers constant, greater economy of scale would be created by locating machines in a central facility... LANL and LLNL each run 8 Crays to share operators, mass storage, and other hardware and software support. With decent networks, multiple centers are even less important.
9. Simply have fewer centers, but with increasing power. This is the sole argument for centers!
10. Maintain centers at their current or constant core levels for some specified period. Each center would be totally responsible for upgrades, etc., and their own ultimate fate.
11. Free-market mechanism. Provide grant money for users to buy time. This might cost more, because I am sure we get free rides, e.g., Berkeley, Michigan, Texas, and the increasing number of institutions providing service.
GB Interview as CISE AD, July 1987

• We, together with our program advisory committees, have described the need for basic work in parallel processing to exploit both the research challenge and the plethora of parallel-processing machines that are available and emerging. We believe NSF's role is to sponsor a wide range of software research about these machines.
• This research includes basic computational models more suited to parallelism, new algorithms, standardized primitives (a small number) for addition to the standard programming languages, new languages based on parallel-computation primitives rather than extensions to sequential languages, and new applications that exploit parallelism.
• Three approaches to parallelism are clearly here now:
Bell CISE Interview, July 1987

• First, vector processing has become a primitive in supercomputers and mini-supercomputers. In becoming so, it has created a revolution in scientific applications. Unfortunately, computer science and engineering departments are not part of the revolution in scientific computation that is occurring as a result of the availability of vectors. New texts and curricula are needed.
• Second, message-passing models of computation can be used now on workstation clusters, on the various multicomputers such as the Hypercube and VAX clusters, and on the shared-memory multiprocessors (from supercomputers to multiple microprocessors). The Unix pipes mechanism may be acceptable as a programming model, but it has to be an awful lot faster for use in problems where medium-grain parallelism occurs. A remote procedure-call mechanism may be required for control.
• Third, microtasking of a single process using shared-memory multiprocessors must also be used independently. On shared-memory multiprocessors, both mechanisms would be provided and used in forms appropriate to the algorithms and applications. Of course, other forms of parallelism will be used, because it is relatively easy to build large, useful SIMD [single-instruction, multiple-data] machines.
Q: What performance do you expect from parallelism in the next decade?

A: Our goal is obtaining a factor of 100 in the performance of computing, not counting vectors, within the decade, and a factor of 10 within five years. I think 10 will be easy because it is inherently there in most applications right now. The hardware will clearly be there if the software can support it or the users can use it.
Many researchers think this goal is aiming too low. They think it should be a factor of 1 million within 15 years. However, I am skeptical: anything more than our goal will be too difficult in this time period. Still, a factor of 1 million may be possible through SIMD.
The reasoning behind the NSF goals is that we have parallel machines now and on the near horizon that can actually achieve these levels of performance. Virtually all new computer systems support parallelism in some form (such as vector processing or clusters of computers). However, this quiet revolution demands a major update of computer science, from textbooks and curriculum to applications research.
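The two stated goals imply the same annual growth rate; a quick check (the 100x and 10x figures are from the interview, the growth-rate arithmetic is mine):

```python
# 100x in 10 years and 10x in 5 years both imply the same yearly
# multiplier, so the five-year goal is the ten-year goal on schedule.

def annual_factor(total_gain: float, years: int) -> float:
    """Constant yearly multiplier achieving total_gain over years."""
    return total_gain ** (1.0 / years)

# 100**(1/10) == 10**(1/5), about 1.585, i.e. ~58% per year
assert abs(annual_factor(100, 10) - annual_factor(10, 5)) < 1e-9
```

At roughly 58% per year this sits above pure clock-rate improvement of the era, which is why the answer hinges on parallelism delivering the remainder.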
Bell Prize
Initiated…
NRC/OSTP Report, 18 August 1984

• Summary
– Pare the poor projects; fund proven researchers.
– Understand the range of technologies required, and especially the Japanese position; also vector processors.
• Heuristics for the program
– Apply current multiprocessors and multicomputers now.
– Fund software and applications for //ism starting now...
…the report greatly underestimates the position and underlying strength of the Japanese in regard to supercomputers. The report fails to make a substantive case about the U.S. position, based on actual data, in all the technologies from chips (where the Japanese clearly lead) to software engineering. The numbers used for present and projected performance appear to be wildly optimistic, with no real underlying experimental basis. A near-term future based on parallelism other than evolving pipelining is probably not realistic.

The report continues the tradition of recommending that funding science is good, and in addition that everything be funded. The conclusion to continue to invest in small-scale fundamental research, without a prioritization across the levels of integration or kinds of projects, would seem to be of little value to decision makers. For example, the specific knowledge that we badly need in order to exploit parallelism is not addressed. Nor is the issue of how we go about getting this knowledge.

My own belief is that small-scale research around a single researcher is the only style of work we understand or are effective with. This may not get us very far in supercomputers. Infrastructure is more important than wild, new computer structures if the "one professor" research model is to be useful in the supercomputer effort. While this is useful to generate small startup companies, it also generates basic ideas for improving the Japanese state of the art. This occurs because the Japanese excel in the transfer of knowledge from world research laboratories into their products, and because the U.S. has a declining technological base of product and process (manufacturing) engineering.

The problem of organizing experimental research in the many projects requiring a small laboratory (a Cray-style lab of 40 or so) to actually build supercomputer prototypes isn't addressed; these larger projects have been uniformly disastrous and the transfer to non-Japanese products negligible.

Surprisingly, no one asked Seymour Cray whether there was anything he wanted in order to stay ahead…
1. Narrow the choice of architectures that are to be pursued. There are simply too many poor ones, and too few that can be adequately staffed.
2. Fund only competent, full‑time efforts where people have proven ability to build … systems. These projects should be carried out by full‑time people, not researchers servicing multiple contracts and doing consulting. New entrants must demonstrate competence by actually building something!
3. Have competitive proposals and projects. If something is really an important area to fund…, then have two projects with …information exchange.
4. Fund balanced hardware/software/systems applications. Doing architectures without user involvement (or understanding) is sure to produce useless toys.
5. Recognize the various types of projects and what the various organizational structures are likely to be able to produce.
6. A strong infrastructure, from chips to systems, to support individual researchers will continue to produce interesting results. These projects should be no more than a dozen people, because professors don't work for or with other professors well.
7. There are many existing multicomputers and multiprocessors that could be delivered to universities to understand parallelism before we go off to build…
8. It is essential to get the Cray X‑MP alongside the Fujitsu machine to understand …parallelism associated with multiple processors, and pipelines.
9. Build "technology transfer mechanisms" in up front. Transfer doesn't happen automatically. Monitor the progress associated with "the transfer".
Residue
NSF TeraGrid c2003
Notes with Jim

• Applications software and its development is still the no. 1 problem.
• 64-bit addressing will change everything.
• Many machines are used for their large memory.
• Centers will always use all available time: cycles bottom-feeders.
• Allocation is still central, and a bad idea.
• Centers are not big enough.
• Can't really architect or recommend a plan unless you have some notion of the needs and costs!
• No handle on communication costs, especially for the last mile, where it's 50-150 Mbps, not fiber (10 Gbps). Two orders of magnitude low…
• Beowulf happened as an embarrassment to funders, not because of them.
• Walk through 7 > 2, now 3 centers. A center is a $50M/year expense when you upgrade!
• NSF: The tools development is questionable. Part of business. Feel very, very uncomfortable developing tools.
• Centers should be functionally specialized around communities and databases.
• Planning, budgets, and allocation must be with the disciplines. People vs. machines.
• TeraGrid problem: having not solved the clusters problem, move to a larger problem.
• File-oriented vs. database HPSS.