TRANSCRIPT
© Gordon Bell1
Goal of the Committee
The goal of the committee is to assess the status of supercomputing in the United States, including the characteristics of relevant systems and architecture research in government, industry, and academia and the characteristics of the relevant market. The committee will examine key elements of context--the history of supercomputing, the erosion of research investment, the needs of government agencies for supercomputing capabilities--and assess options for progress. Key historical or causal factors will be identified. The committee will examine the changing nature of problems demanding supercomputing (e.g., weapons design, molecule modeling and simulation, cryptanalysis, bioinformatics, climate modeling) and the implications for systems design. It will seek to understand the role of national security in the supercomputer market and the long-term federal interest in supercomputing.
© Gordon Bell2
NRC-CSTB Future of Supercomputing Committee
22 May 2003
NRC Brooks-Sutherland Committee, 11 March 1994
NRC OSTP Report, 18 August 1984
Gordon Bell
Microsoft Bay Area Research Center
San Francisco
© Gordon Bell3
Outline
• Community re-Centric Computing vs. Computer Centers
• Background: Where we are and how we got here.
• Performance (t). Hardware trends and questions
• If we didn't have general purpose computation centers, would we invent them?
• Atkins Report: Past and future concerns... be careful what we wish for
• Appendices: NRC Brooks-Sutherland 94 comments; CISE (gbell) Concerns re. Supercomputing Centers ('87); CISE (gbell) //ism goals ('87); NRC Report 84 comments.
• Bottom line, independent of the question: It has been and always will be the software and apps, stupid! And now it's the data, too!
© Gordon Bell4
Community re-centric Computing...
• Goal: Enable technical communities to create their own computing environments for personal, data, and program collaboration and distribution.
• Design based on technology trends, especially networking, apps program maintenance, databases, & providing web and other services
• Many alternative styles and locations are possible
– Service from existing centers, including many state centers
– Software vendors could be encouraged to supply apps svcs
– NCAR style center around data and apps
– Instrument-based databases. Both central & distributed when multiple viewpoints create the whole
– Wholly distributed with many individual groups
© Gordon Bell5
Community re-Centric Computing: Time for a major change

Community Centric:
• Community is responsible
– Planned & budgeted as resources
– Responsible for its infrastructure
– Apps are community centric
– Computing is integral to sci./eng.
• In sync with technologies
– 1-3 Tflops/$M; 1-3 PBytes/$M to buy smallish Tflops & PBytes
• New scalables are "centers" fast
– Community can afford
– Dedicated to a community
– Program, data & database centric
– May be aligned with instruments or other community activities
• Output = web service; an entire community demands real-time web service

Centers Centric:
• Center is responsible
– Computing is "free" to users
– Provides a vast service array for all
– Runs & supports all apps
– Computing grant disconnected
• Counter to technology directions
– More costly. Large centers operate at a dis-economy of scale
• Based on unique, fast computers
– Center can only afford
– Divvy cycles among all communities
– Cycles centric; but politically difficult to maintain highest power vs. more centers
– Data is shipped to centers, requiring expensive, fast networking
• Output = diffuse among gp centers; are centers ready or able to support on-demand, real-time web services?
Background: scalability at last
• Q: How did we get to scalable computing and parallel processing?
• A: Scalability evolved from a two-decade-old vision and plan starting at DARPA & NSF. Now picked up by DOE & the rest of the world.
• Q: What should be done now?
• A: Realize scalability, the web, and now web services change everything. Redesign to get with the program!
• Q: Why do you seem to be wanting to de-center?
• A: Besides the fact that user demand has been and is totally de-coupled from supply, I believe technology doesn't necessarily support users or their mission, and that centers are potentially inefficient compared with a more distributed approach.
Copyright Gordon Bell
Steve Squires & Gordon Bell at our “Cray” at the start of DARPA’s SCI program c1984.
20 years later: Clusters of Killer micros become the single standard
Copyright Gordon Bell
RIP
Lost in the search for parallelism ACRI Alliant American Supercomputer Ametek Applied Dynamics Astronautics BBN CDC Cogent Convex > HP Cray Computer Cray Research > SGI > Cray Culler-Harris Culler Scientific Cydrome Dana/Ardent/Stellar/Stardent Denelcor Encore Elexsi ETA Systems Evans and Sutherland Computer Exa Flexible Floating Point Systems Galaxy YH-1
Goodyear Aerospace MPP Gould NPL Guiltech Intel Scientific Computers International Parallel Machines Kendall Square Research Key Computer Laboratories searching again MasPar Meiko Multiflow Myrias Numerix Pixar Parsytec nCube Prisma Pyramid Ridge Saxpy Scientific Computer Systems (SCS) Soviet Supercomputers Supertek Supercomputer Systems Suprenum Tera > Cray Company Thinking Machines Vitesse Electronics Wavetracer
A brief, simplified history of HPC
1. Cray formula smPv evolves for Fortran. 60-02 (US: 60-90)
2. 1978: VAXen threaten computer centers...
3. NSF response: Lax Report. Create 7 Cray centers, 1982-
4. SCI: DARPA searches for parallelism using killer micros
5. Scalability found: "bet the farm" on micro clusters. Users "adapt": MPI, the lcd programming model, found. >95. Result: EVERYONE gets to re-write their code!!
6. Beowulf clusters form by adopting PCs and Linus' Linux to create the cluster standard! (In spite of funders.) >1995
7. ASCI: DOE gets petaflops clusters, creating "arms" race!
8. "Do-it-yourself" Beowulfs negate computer centers since everything is a cluster and shared power is nil! >2000
9. 1997-2002: Tell Fujitsu & NEC to get "in step"!
10. High speed nets enable peer2peer & Grid or Teragrid
11. Atkins Report: Spend $1B/year, form more and larger centers and connect them as a single center...
Copyright Gordon Bell

The Virtuous Economic Cycle drives the PC industry... & Beowulf
[Cycle diagram linking: Innovation; Volume; Competition; Standards; DOJ; Utility/value; Greater availability @ lower cost; Creates apps, tools, training; Attracts users; Attracts suppliers.]
Copyright Gordon Bell
Lessons from Beowulf
An experiment in parallel computing systems, '92
• Established vision: low cost high end computing
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Provided cluster management tools
• Conveyed findings to broad community: tutorials and the book
• Provided design standard to rally community!
• Standards beget: books, trained people, software... a virtuous cycle that allowed apps to form
• Industry began to form beyond a research project
Courtesy, Thomas Sterling, Caltech.
Copyright Gordon Bell
Technology: peta-bytes, -flops, -bps
We get no technology before its time
• Moore's Law 2004-2012: 40X
• The big surprise will be the 64 bit micro
• 2004: O(100) processors = 300 GF PAP, $100K
– 3 TF/$M, not a diseconomy of scale for large systems
– 1 PF => $330M, but 330K processors; other paths
• Storage: 1-10 TB disks; 100-1000 disks
• Internet II killer app -- NOT teragrid
– Access Grid, new methods of communication
– Response time to provide web services
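The slide's cost/performance figures can be cross-checked with a few lines of arithmetic (a sketch; the 300 GF per $100K and 40X-by-2012 figures are the slide's, everything else is derived from them):

```python
# Check of the slide's cost/performance arithmetic. Assumed inputs (from
# the slide): 300 GF peak (PAP) per $100K system in 2004, and a 40X
# Moore's Law improvement over 2004-2012.
gf_per_system = 300.0     # GF peak per system, 2004
cost_per_system_m = 0.1   # $M per system

tf_per_million = gf_per_system / 1000 / cost_per_system_m
print(f"{tf_per_million:.0f} TF per $M")

# A 1 PF (1e6 GF) machine built from O(100)-processor systems:
systems = 1e6 / gf_per_system
print(f"1 PF: ${systems * cost_per_system_m:.0f}M, {systems * 100:,.0f} processors")

# The annual growth factor implied by 40X over 8 years:
annual = 40 ** (1 / 8)
print(f"implied growth: {annual:.2f}x/year")
```

The result is 3 TF/$M and roughly $333M / 333K processors for a petaflop (the slide rounds to $330M and 330K); the implied ~1.59x/year growth matches the 1.6x/year factor used throughout this deck.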
Computing Laws
Performance metrics (t) 1987-2009
[Chart: log scale, 0 to 10,000,000 -- RAP (GF), processor count, cost ($M), density (Gb/in), and Flops/$, with growth-rate guides at 60%, 100%, and 110%/yr; ES marked.]
© Gordon Bell14
Perf (PAP) = c x 1.6**(t-1992); c = 128 GF/$300M. ('94 prediction: c = 128 GF/$30M.)
[Chart: flops (PAP), 1E+08 to 1E+16 (log scale), vs. year 1992-2012 -- curves for GB peak, $30M super, $100M super, $300M super, and Flops(PAP)M/$.]
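Plugging years into the model above shows what the two constants imply (the 1.6x/year factor and both values of c are from the slides; this is just the formula tabulated):

```python
# Peak performance model from the slide: Perf(PAP) = c * 1.6**(t - 1992).
# The '94 prediction put c = 128 GF at $30M; the 2003 restatement puts
# c = 128 GF at $300M -- the same curve, with a 10x more expensive base point.
def perf_gf(t, c_gf=128.0):
    """Peak GF in year t, for a machine at the cost c was quoted for."""
    return c_gf * 1.6 ** (t - 1992)

for t in (1992, 1996, 2000, 2004):
    print(t, f"{perf_gf(t):,.0f} GF")
```

The 1996 value (~840 GF, crossing 1 TF during 1996-97) is the same crossing computed in the '94-appendix slides later in this deck; read against the 2003 constant, the curve arrives a decade later per dollar.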
Performance (TF) vs. cost ($M) of non-central and centrally distributed systems
[Chart: log-log plot, performance 0.01-100 TF vs. cost $0.1M-$100M, showing non-central systems, centers delivery, center purchase base, the centers allocation range, and "+" for centers (old style supers).]
National Semiconductor Technology Roadmap (size)
[Chart: 1995-2010 -- memory (Mbytes/chip) and micros (Mtransistors/chip), 1 to 10,000 (log scale), plotted against line width, 0 to 0.35; the 1 Gbit part is marked.]
Disk Density Explosion
Magnetic disk recording density (bits per mm2) grew at 25% per year from 1975 until 1989.
Since 1989 it has grown at 60-70% per year; since 1998 it has grown at >100% per year
– This rate will continue into 2003
Factors causing accelerated growth:
– Improvements in head and media technology
– Improvements in signal processing electronics
– Lower head flying heights
Courtesy Richie Lary
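To see what the rate shift means in practice, compound each regime over ten years (the annual rates are the slide's; the decade multipliers follow from them):

```python
# Compound the slide's annual areal-density growth rates over a decade.
def per_decade(annual_rate):
    return (1 + annual_rate) ** 10

for label, rate in [("1975-1989", 0.25), ("1989-1998", 0.60), ("since 1998", 1.00)]:
    print(f"{label}: {rate:.0%}/yr -> {per_decade(rate):,.0f}x per decade")
```

The move from 25%/yr to >100%/yr is the difference between ~9x and ~1000x per decade, which is what drives the disk/tape convergence and capacity/performance slides that follow.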
Computing Laws
National Storage Roadmap 2000
[Chart: areal density projections -- 100x/decade (labeled 100%/year) vs. the historical ~10x/decade (labeled 60%/year).]
Disk / Tape Cost Convergence
[Chart: retail price, $0.00-$3.00, 1/01 to 1/05, for a 5400 RPM ATA disk vs. an SDLT tape cartridge.]
3½" ATA disk could cost less than SDLT cartridge in 2004, if disk manufacturers maintain the 3½", multi-platter form factor.
Volumetric density of disk will exceed tape in 2001. "Big Box of ATA Disks" could be cheaper than a tape library of equivalent size in 2001.
Courtesy of Richard Lary
Disk Capacity / Performance Imbalance
Capacity growth outpacing performance growth
Difference must be made up by better caching and load balancing
Actual disk capacity may be capped by market (red line); shift to smaller disks (already happening for high speed disks).
[Chart: 1992-2001 -- capacity grew 140x in 9 years (73%/yr); performance grew 3x in 9 years (13%/yr).]
Courtesy of Richard Lary
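The chart's per-year rates follow from its 9-year multipliers (140x and 3x are the chart's figures; this just inverts the compounding):

```python
# Recover annual growth rates from the chart's 9-year multipliers.
def annual_rate(multiplier, years=9):
    return multiplier ** (1 / years) - 1

print(f"capacity:    {annual_rate(140):.0%}/yr")  # matches the chart's 73%/yr
print(f"performance: {annual_rate(3):.0%}/yr")    # matches the chart's 13%/yr
```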
© Gordon Bell21
Re-Centering to Community Centers
• There is little rational support for general purpose centers
– Scalability changes the architecture of the entire Cyberinfrastructure
– No need to have a computer bigger than the largest parallel app.
– They aren't super.
– World is substantially data driven, not cycles driven.
– Demand is de-coupled from supply planning, payment, or services
• Scientific / Engineering computing has to be the responsibility of each of its communities
– Communities form around instruments, programs, databases, etc.
– Output is web service for the entire community
© Gordon Bell22
Grand Challenge (c2002) problems become desktop (c2004) tractable
I don't buy the problem-growth mantra: 2x resolution => >2**4 the work (yrs.)
• Radical machines will come from the low cost 64-bit explosion
• Today's desktop has trumped and will increasingly trump yesteryear's super, simply due to the memory size explosion
• Pittsburgh Alpha: 3D MRI skeleton computing/viewing is a desktop problem, given a large memory
• Tony Jamieson: "I can model an entire 747 on my laptop!"
© Gordon Bell23
Centers aren’t very super…
• Pittsburgh: 6; NCAR: 10; LSU: 17; Buffalo: 22; FSU: 38; San Diego: 52; NCSA: 82; Cornell: 88; Utah: 89
• 17 universities, world-wide, in top 100
• Massive upgrade is continuously required:
– Large memories: machines aren't balanced and haven't been. Bandwidth 1 Byte/flop vs. 24 Bytes/flop
– File Storage > Databases
• Since centers systems have >4 year lives, they start out obsolete and overpriced... and then get worse.
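A sketch of the obsolescence arithmetic, assuming the ~60%/yr (1.6x/year) commodity improvement rate used elsewhere in this deck:

```python
# How far a fixed installation falls behind commodity hardware that
# improves 1.6x per year (the rate assumed throughout this deck).
RATE = 1.6
for age in range(5):
    print(f"year {age}: commodity parts are {RATE ** age:4.1f}x the installed system")
```

By the end of a 4-year life the gap is ~6.5x, before counting the procurement and installation lag that leaves a big system behind the curve on day one.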
© Gordon Bell24
Centers: The role going forward
• The US builds scalable clusters, NOT supercomputers
– Scalables are 1 to n commodity PCs that anyone can assemble.
– Unlike the "Crays," all are equal. Use is allocated in small clusters.
– Problem parallelism sans ∞// has been elusive (limited to 1K)
– No advantage to having a computer larger than a //able program
• User computation can be acquired and managed effectively.
– Computation is divvied up in small clusters, e.g. 128 nodes, that individual groups can acquire and manage effectively
• The basic hardware evolves; it doesn't especially favor centers
– 64-bit architecture. 512 Mb x 32/DIMM = 2 GB; x 4/system = 8 GB systems >> 16 GB. (Centers machine will be obsolete, by memory / balance rules.)
– 3 year timeframe: 1 TB disks at $0.20/TB
– Last mile communication costs are not decreasing to favor centers or grids.
© Gordon Bell25
Review the bidding
• 1984: Japanese are coming. CMOS and killer micros. Build // machines.
– 40+ computers were built & failed based on CMOS and/or micros
– No attention to software or apps
• 1994: Parallelism and Grand Challenges
– Converge to Linux Clusters (Constellations: nodes >1 proc.) & MPI
– No noteworthy middleware software to aid apps & replace Fortran; e.g. HPF failed.
– Whatever happened to the Grand Challenges??
• 2004: Teragrid has potential as a massive computer and massive research
– We have and will continue to have the computer clusters to reach a <$300M petaflop
– Massive review and re-architecture of centers and their function.
– Instrument/program/data/community centric (CERN, Fermi, NCAR, Calera)
© Gordon Bell26
Recommendations
• Give careful advice on the Atkins Report (it is just the kind of plan that is likely to fly.)
• Community Centric Computing
– Community/instrument/data/program centric (Calera, CERN, NCAR)
• Small number of things to report
– Forget about hardware for now... it's scalables. The die has been cast.
– Support training, apps, and any research to ease apps development.
– Databases represent the biggest gain. Don't grep, just reference it.
© Gordon Bell27
The End
© Gordon Bell28
Atkins Report: Be careful of what you ask for
• Suggestions (gbell)
– Centers to be re-centered in light of data versus flops
– Overall re-architecture based on user need & technology
– Highly distributed structure aligned with users who plan their facilities
– Very skeptical of "gridized" projects, e.g. tera, GGF
– Training in the use of databases is needed! It will get more productivity than another generation of computers.
• The Atkins Report
– $1.02 billion per year recommendation for research, buying software, and $600M to build and maintain more centers that are certain to be obsolete and non-productive
Summary to Atkins Report 2/15/02 15:00 gbell
• Same old concerns: "I don't have as many flops as users at the national labs."
• Many facilities should be distributed, with build-it-yourself Beowulf clusters to get extraordinary cycles and bytes.
• Centers need to be re-centered; see Bell & Gray, "What's Next in High Performance Computing?", Comm. ACM, Feb. 2002, pp. 91-95.
• Scientific computing needs re-architecting based on networking, communication, computation, and storage. Centrality versus distributed depends on costs and the nature of the work, e.g. instrumentation that generates lots of data. (Last mile problem is significant.)
– Fedex'd hard drive is low cost. Cost of hard drive < network cost. Net is very expensive!
– Centers flops and bytes are expensive. Distributed likely to be less so.
– Many sciences need to be reformulated as distributed computing/database
• Network costs (last mi.) are a disgrace. $1 billion boondoggle with NGI, Internet II.
• Grid funding: Not in line with COTS or IETF model. Another very large SW project!
• Give funding to scientists in joint grants with tool builders; e.g. the www came from a user
• Database technology is not understood by users and computer scientists
– Training, tool funding, & combined efforts, especially when large & distributed
– Equipment, problems, etc. are dramatically outstripping capabilities!
• It is still about software, especially in light of scalable computers that require reformulation into a new, // programming model
© Gordon Bell30
Atkins Report: the critical challenges
1) build real synergy between computer and information science research and development, and its use in science and engineering research and education;
2) capture the cyberinfrastructure commonalities across science and engineering disciplines;
3) use cyberinfrastructure to empower and enable, not impede, collaboration across science and engineering disciplines;
4) exploit technologies being developed commercially and apply them to research applications, as well as feed back new approaches from the scientific realm into the larger world;
5) engage social scientists to work constructively with other scientists and technologists.
© Gordon Bell31
Atkins Report: Be careful of what you ask for
1. fundamental research to create advanced cyberinfrastructure ($60M);
2. research on the application of cyberinfrastructure to specific fields of science and engineering research ($100M);
3. acquisition and development of production quality software for cyberinfrastructure and supported applications ($200M);
4. provisioning and operations (including computational centers, data repositories, digital libraries, networking, and application support) ($600M).
5. archives for software ($60M)
© Gordon Bell32
NRC Review Panel on High Performance Computing
11 March 1994
Gordon Bell
© Gordon Bell33
Position
Dual use: Exploit parallelism with in situ nodes & networks. Leverage WS & mP industrial HW/SW/app infrastructure!
No Teraflop before its time -- it's Moore's Law
It is possible to help fund computing: Heuristics from federal funding & use (50 computer systems and 30 years)
Stop Duel Use, genetic engineering of State Computers
• 10+ years: nil pay back, mono use, poor, & still to come
• plan for apps porting to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
• let "Challenges" choose apps, not mono use computers
• "industry" offers better computers & these are jeopardized
• users must be free to choose their computers, not funders
• next generation State Computers "approach" industry
• 10 Tflop ... why?
Summary recommendations
© Gordon Bell34
Principal computing environments circa 1994 --> 4 networks to support mainframes, minis, UNIX servers, workstations & PCs
[Diagram: four coexisting worlds -- the IBM & proprietary mainframe world ('50s), with 3270 (& PC) terminals and a POTS net for switching; the '70s proprietary mini world / '90s UNIX mini world, with ASCII & PC terminals; the '80s UNIX distributed workstation & server world, with Ethernet LANs (gateways, bridges, routers, hubs, etc.), UNIX workstations, X terminals, NFS servers, compute & dbase uni- & mP servers, and clusters; and the late-'80s LAN-PC world, with Token-ring LANs, PCs (DOS, Windows, NT), and Novell & NT servers -- tied together by a wide-area inter-site network. >4 interconnect & comm. standards: POTS & 3270 terminals; WAN (comm. stds.); LAN (2 stds.); clusters (proprietary). UNIX multiprocessor servers operated as traditional minicomputers.]
© Gordon Bell35
Computing environments circa 2000
[Diagram: a universal high speed data service, using ATM or ??, connects the local & global data comm worlds -- ATM† & Local Area Networks for terminals, PCs, workstations, & servers; centralized & departmental uni- & mP servers (UNIX & NT); legacy mainframe & minicomputer servers & terminals; centralized & departmental scalable uni- & mP servers* (NT & UNIX); NT, Windows & UNIX person servers (platforms: x86, PowerPC, Sparc, etc.); a wide-area global ATM network; and TC=TV+PC home... (CATV or ATM) ???
*multicomputers built from multiple simple servers: NFS, database, compute, print, & communication servers
†also 10-100 mb/s pt-to-pt Ethernet]
© Gordon Bell36
Beyond Dual & Duel Use Technology: Parallelism can & must be free!
HPCS, corporate R&D, and technical users must have the goal to design, install, and support parallel environments using and leveraging:
• every in situ workstation & multiprocessor server
• as part of the local ... national network.
Parallelism is a capability that all computing environments can & must possess! -- not a feature to segment "mono use" computers
Parallel applications become a way of computing utilizing existing, zero cost resources -- not a subsidy for specialized ad hoc computers
Apps follow pervasive computing environments
© Gordon Bell37
Computer genetic engineering & species selection has been ineffective
Although Problem x Machine Scalability using SIMD for simulating some physical systems has been demonstrated, given extraordinary resources, the efficacy of larger problems to justify cost-effectiveness has not. Hamming: "The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be the drivers. ARPA's contractors should re-evaluate their research in light of driving needs.
Federally funded "Challenge" apps porting should be to multiple platforms, including workstations & compatible multis that support // environments, to ensure portability and understand main line cost-effectiveness.
Continued "supply side" programs aimed at designing, purchasing, supporting, sponsoring, & porting of apps to specialized State Computers, including programs aimed at 10 Tflops, should be re-directed to networked computing.
Users must be free to choose and buy any computer, including PCs & WSs, WS clusters, multiprocessor servers, supercomputers, mainframes, and even highly distributed, coarse grain, data parallel, MPP State Computers.
© Gordon Bell38
Performance (t)
[Chart: performance, 1 to 10,000 (log scale), vs. year, 1988-2000 -- points for Intel $55M, Intel $300M, NEC, Cray Super $30M, Cray DARPA, and CM5 at $30M, $120M, $240M; "The teraflops" level and Bell Prize winners marked.]
© Gordon Bell39
We get no Teraflop before its time: it's Moore's Law!
Flops = f(t,$), not f(t) -- technology plans, e.g. BAA 94-08, ignore $s!
All Flops are not equal: peak announced performance (PAP) or real app performance (RAP)
Flops_CMOS-PAP* ≤ C x 1.6**(t-1992) x $; C = 128 x 10**6 flops / $30,000
Flops_RAP = Flops_PAP x 0.5 for real apps; 1/2 PAP is a great goal
Flops_supers = Flops_CMOS x 0.1; improvement of supers 15-40%/year; higher cost is f(need for profitability, lack of subsidies, volume, SRAM)
'92-'94: Flops_PAP/$ = 4K; Flops_supers/$ = 500; Flops_vsp/$ = 50M (1.6 G @ $25)
*Assumes primary & secondary memory size & costs scale with time: memory = $50/MB in 1992-1994 violates Moore's Law; disks = $1/MB in 1993, size must continue to increase at 60%/year
When does a Teraflop arrive if only $30 million** is spent on a super?
1 Tflop_CMOS PAP in 1996 (x7.8) with 1 GFlop nodes!!!; or 1997 if RAP
10 Tflop_CMOS PAP will be reached in 2001 (x78), or 2002 if RAP
How do you get a teraflop earlier?
**A $60-$240 million Ultracomputer reduces the time by 1.5-4.5 years.
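The dates above follow directly from the formula on this slide; a short check (C, the 1.6x/year factor, and the budgets are the slide's):

```python
import math

# Flops(t, $) = C * 1.6**(t - 1992) * dollars, with C = 128e6 flops / $30,000.
C = 128e6 / 30_000  # peak flops per dollar in 1992

def year_reached(target_flops, dollars):
    """Year the model says `dollars` buys `target_flops` of peak."""
    growth_needed = target_flops / (C * dollars)
    return 1992 + math.log(growth_needed, 1.6)

print(f"1 TF @ $30M:  {year_reached(1e12, 30e6):.1f}")  # ~1996 (the 7.8x)
print(f"10 TF @ $30M: {year_reached(1e13, 30e6):.1f}")  # ~2001 (the 78x)
# An Ultracomputer at 2x-8x the budget ($60M-$240M) pulls the date in by:
print(f"{math.log(2, 1.6):.1f} to {math.log(8, 1.6):.1f} years")
```

The last line reproduces the slide's "1.5-4.5 years" head start for a bigger budget.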
© Gordon Bell40
Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation. Hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
High price to support a DARPA U. to learn computer design -- the market is only $200 million and R&D is billions -- competition works far better
Inevitable movement of standard networks and nodes cannot or need not be accelerated; these best evolve by a normal market mechanism driven by users
Dual use of Networks & Nodes is the path to widescale parallelism, not weird computers
Networking is free via ATM. Nodes are free via in situ workstations. Apps follow pervasive computing environments.
Applicability was small and getting smaller very fast, with many experienced computer companies entering the market with fine products, e.g. Convex/HP, Cray, DEC, IBM, SGI & SUN, that are leveraging their R&D, apps, apps, & apps
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating use of weird machines that take away from use, the weaker it becomes.
MPP won; mainstream vendors have adopted multiple CMOS. Stop funding! Environments & apps are needed, but are unlikely because the market is small.
© Gordon Bell41
Recommendations to HPCS
Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes
Dual use, not duel use, of products and technology, or the principle of "elegance" -- one part serves more than one function: network companies supply networks; node suppliers use ordinary workstations/servers with existing apps; will leverage $30 billion x 10**6 R&D
Fund high speed, low latency networks for a ubiquitous service as the base of all forms of interconnection, from WANs to supercomputers (in addition, some special networks will exist for small grain probs)
Observe heuristics in future federal program funding scenarios ... eliminate direct or indirect product development and mono-use computers. Fund Challenges, who in turn fund purchase, not product development.
Funding or purchase of apps porting must be driven by Challenges, but build on binary compatible workstation/server apps to leverage nodes; be cross-platform based to benefit multiple vendors & have cross-platform use
Review effectiveness of State Computers, e.g., need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer.
Review // program environments & the efficacy to produce & support apps
Eliminate all forms of State Computers & recommend a balanced HPCS program: nodes & networks, based on industrial infrastructure; stop funding the development of mono computers, including the 10 Tflop; it must be acceptable & encouraged to buy any computer for any contract
Gratis advice for HPCC* & BS*
D. Bailey warns that scientists have almost lost credibility....
Focus on Gigabit NREN with low overhead connections that will enable multicomputers as a by-product
Provide many small, scalable computers vs large, centralized
Encourage (revert to) & support not so grand challenges
Grand Challenges (GCs) need explicit goals & plans --disciplines fund & manage (demand side)... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams (efforts that are rat holes that you can’t get out of)
Drop the funding & directed purchase of state computers
Revert to university research -> company & product development
Review the HPCC & GCs program's output ...
*High Performance Cash Conscriptor; Big Spenders
© Gordon Bell43
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?
© Gordon Bell44
Disclaimer
This talk may appear inflammatory... i.e. the speaker may have appeared "to flame".
It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.
© Gordon Bell45
Backups
© Gordon Bell46
Funding Heuristics (50 computers & 30 years of hindsight)
1. Demand side works, i.e., "we need this product/technology for x"; supply side doesn't work! "Field of Dreams": build it and they will come.
2. Direct funding of university research resulting in technology and product prototypes that is carried over to start up a company is the most effective -- provided the right person & team are backed and have a transfer avenue.
a. Forest Baskett > Stanford, funding various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs... rare; an accident if something emerges
3. A demanding & tolerant customer or user who "buys" products works best to influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN)
a. DOE labs have been effective buyers and influencers ("Fernbach policy"); unclear if labs are effective product or apps or process developers
b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of "large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!
© Gordon Bell47
Funding Heuristics - 2
5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g. by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g. supercomputers).
A significantly smaller universe of computing environments is needed. Cray & IBM are given; SGI is probably the most profitable technical; HP/Convex are likely to be a contender, & others (e.g., DEC) are trying. No state company (Intel, TMC, Tera) is likely to be profitable & hence self-sustaining.
6. University-company collaboration is a new area of government R&D. So far it hasn't worked, nor is it likely to, unless the company invests. Appears to be a way to help a company fund marginal people and projects.
7. CRADAs, or cooperative research and development agreements, are very closely allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or the porting of apps to one platform (e.g., EMI analysis), is a way to keep marginal computers afloat. If government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs from those who have not used or built a successful computer.
© Gordon Bell48
Scalability: The Platform of HPCS & why continued funding is unnecessary
Mono use (aka MPP) machines have been, are, and will be doomed
The law of scalability
Four scalabilities: machine, problem x machine, generation (t), & now spatial
How do flops, memory size, efficiency & time vary with problem size? Does insight increase with problem size?
What's the nature of problems & work for monos?
What about the mapping of problems onto monos?
What about the economics of software to support monos?
What about all the competitive machines? e.g. workstations, workstation clusters, supers, scalable multis, attached P?
© Gordon Bell49
Special, mono-use MPPs are doomed... no matter how much fedspend!
Special because they have non-standard nodes & networks -- with no apps. Having not evolved to become mainline, events have overtaken them.
It's special purpose if it's only in Dongarra's Table 3. Flop rate, execution time, and memory size vs. problem size show applicability limited to very large scale problems, which must be scaled to cover the inherent, high overhead.
Conjecture: a properly used supercomputer will provide greater insight and utility because of the apps and generality -- running more, smaller sized problems with a plan produces more insight.
The problem domain is limited & now they have to compete with:
• supers -- do scalars, fine grain, and work, and have apps
• workstations -- do very long grain, are in situ, and have apps
• workstation clusters -- have identical characteristics and have apps
• low priced ($2 million) multis -- are superior, i.e., shorter grain, and have apps
• scalable multiprocessors -- formed from multis, are in design stage
Mono useful (>>//) -- hence illegal, because they are not dual use. Duel use -- only useful to keep a high budget intact, e.g., 10 TF.
© Gordon Bell50
The Law of Massive Parallelism isbased on application scale
There exists a problem that can be made sufficiently large such that any network of computers can run efficiently given enough memory, searching, & work -- but this problem may be unrelated to no other problem.
A ... any parallel problem can be scaled to run on an arbitrary network of computers, given enough memory and time
Challenge to theoreticians: How well will an algorithm run?
Challenge for software: Can package be scalable & portable?
Challenge to users: Do larger scale, faster, longer run times, increase problem insight and not just flops?
Challenge to HPCC: Is the cost justified? If so, let users do it!
Scalabilities
Size scalable computers are designed from a few components, with no bottleneck component.
Generation scalable computers can be implemented with next-generation technology with no rewrite or recompile.
Problem x machine scalability - ability of a problem, algorithm, or program to exist at a range of sizes so that it can be run efficiently on a given, scalable computer.
Although large scale problems allow high flops, large problems running longer may not produce more insight.
Spatial scalability -- ability of a computer to be scaled over a large physical space to use in situ resources.
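The worry that scaled-up problems deliver flops rather than insight can be made concrete with a toy weak-scaling model (my illustration, not from the talk; the 5% serial fraction is an assumed number), using Gustafson's scaled-speedup formula:

```python
# Toy weak-scaling model (illustrative; the 5% serial fraction is an
# assumption, not a figure from the talk).
# Gustafson's law: if the problem grows with the machine, scaled speedup
# on p processors with serial fraction s is  s + (1 - s) * p.

def scaled_speedup(p: int, serial_fraction: float) -> float:
    """Speedup when the problem size is scaled up with p."""
    s = serial_fraction
    return s + (1.0 - s) * p

def efficiency(p: int, serial_fraction: float) -> float:
    """Fraction of peak flops actually delivered."""
    return scaled_speedup(p, serial_fraction) / p

# Efficiency stays high as p grows -- the flops scale -- but nothing in
# the model says one 1024-node run of a scaled-up problem yields more
# insight than many small runs, which is the slide's caution.
for p in (8, 64, 1024):
    print(p, efficiency(p, 0.05))
```

Under this sketch, efficiency barely moves between 8 and 1024 nodes, which is exactly why "high flops on a scaled problem" is a weak proxy for utility.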
[Chart: Linpack rate in Gflops (1 to 100) vs. matrix order (1 to 100,000); curves for the NEC SX-3 (4 processors) and the CM-5 (1K nodes).]
[Chart: Linpack solution time (1 to 1,000) vs. matrix order (1 to 100,000); curves for the NEC SX-3 (4 processors) and the CM-5 (1K nodes).]
GB's Estimate of Parallelism in Engineering & Scientific Applications
[Chart: log(# of apps) vs. granularity & degree of coupling (computation/communication). Estimated shares: scalar 60%; vector 15%; mP (<8) vector 5%; >>// 5%; embarrassingly or perfectly parallel 15%. Dusty decks for supers sit at the fine-grain end; new or scaled-up apps at the coarse end. Machine coverage: supers and WSs for the fine-grain classes, massive mCs & WSs for the embarrassingly parallel end, with scalable multiprocessors spanning the range.]
MPPs are only for unique, very large scale, data parallel apps
[Chart: application characterization -- machine price ($M, 0.01 to 100) vs. application class (scalar | vector | vector mP | data // | emb. // | gp work | viz | apps). Supercomputers (s), multiprocessors (mP), and workstations (WS) appear across every class; >>// (mono-use) machines appear only in the data parallel and embarrassingly parallel columns.]
Applicability of various technical computer alternatives

Domain        PC|WS   Multi servr   SC & Mfrm   >>//   WS Clusters
scalar          1         1             2        na        1*
vector          2*        2             1        3         2
vect. mP        na        2             1        3         na
data //         na        1             2        1         1
ep & inf. //    1         2             3        2         1
gp wrkld        3         1             1        na        2
visualiz'n      1         na            na       na        1
apps            1         1             1        na        from WS

*Current micros are weak but improving rapidly, such that subsequent >>// machines that use them will have no advantage for node vectorization.
Performance using distributed computers depends on problem & machine granularity

Berkeley's LogP model characterizes granularity & needs to be understood, measured, and used.
The model's parameters, given in terms of processor operations:
L = latency -- delay time to communicate a message between nodes
o = overhead -- processor time lost sending & receiving messages
g = gap = 1 / message-passing rate (bandwidth) -- minimum time between messages
P = number of processors
Granularity Nomograph
[Nomograph relating three scales: processor speed (10M to 1G ops/s), grain length (100 to 10M ops), and grain communication latency & synchronization overhead (100 ns to 1 s, with ~1 ms marked LAN and ~100 ms marked WAN). Grains classified fine / medium / coarse / very coarse. Points plotted: C90, MPPs, WANs & LANs, 1993 and 1995 micros, and Ultra.]
Granularity Nomograph (machines plotted)
[Same nomograph with machines placed: Cray T3D, C90, Fujitsu VPP 500, and VP; supercomputer memory latency at ~100 ns, LAN at ~1 ms, WAN at ~100 ms; 1993 and 1995 micros, 1993 supers, and Ultra also shown.]
Economics of Packaged Software
Platform            Cost       Leverage   # copies
MPP                 >100K      1          1-10
Mini, mainframe*    10-100K    10-100     1000s
Workstation         1-100K     1-10K      1-100K
PC                  25-500     50K-1M     1-10M

*also evolving high-performance multiprocessor servers
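The leverage column is essentially a fixed development cost amortized over each platform's installed base; a sketch of the arithmetic (the $10M development cost is a hypothetical number, not from the slide):

```python
# Amortizing a fixed development cost over each platform's volume
# (illustrative; the $10M development cost is an assumption).

def price_floor(dev_cost: float, copies: int) -> float:
    """Development cost that each sold copy must recover."""
    return dev_cost / copies

DEV_COST = 10_000_000  # hypothetical packaged-software development cost
for platform, copies in [("MPP", 10), ("mini/mainframe", 1_000),
                         ("workstation", 100_000), ("PC", 10_000_000)]:
    print(f"{platform}: ${price_floor(DEV_COST, copies):,.0f} per copy")
```

Three orders of magnitude in volume is three orders of magnitude in the price floor per copy, which is why packaged software follows the high-volume platforms and the MPP app base never forms.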
Chuck Seitz comments on multicomputers

"I believe that the commercial, medium grained multicomputers aimed at ultra-supercomputer performance have adopted a relatively unprofitable scaling track, and are doomed to extinction. ... they may as Gordon Bell believes be displaced over the next several years by shared memory multiprocessors. ... For loosely coupled computations at which they excel, ultra-super multicomputers will, in any case, be more economically implemented as networks of high-performance workstations connected by high-bandwidth, local area networks..."
Convergence to a single architecture with a single address space that uses a distributed, shared memory:
• limited (<20) scalability multiprocessors >> scalable multiprocessors
• workstations with 1-4 processors >> workstation clusters & scalable multiprocessors
• workstation clusters >> scalable multiprocessors
• State Computers built as message-passing multicomputers >> scalable multiprocessors
Convergence to one architecture
[Diagram: evolution of scalable multiprocessors, multicomputers, and workstations toward shared-memory computers, c. 1995. Note: only two structures remain: 1. shared-memory mPs with uniform & non-uniform memory access, which continue to be the main line; and 2. networked, shared-nothing workstations.
• mP, limited scalability, uniform memory access: bus-based multis for minis & workstations (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.) and ring-based multis; mainframes & supers (Convex, Cray, Fujitsu, IBM, Hitachi, NEC); early scalable-mP experiments: Cm* ('75), Butterfly ('85), Cedar ('88).
• smC, experimental scalable multicomputers, non-uniform memory access: 1st smCs as hypercubes & Transputer grids (Cosmic Cube, iPSC 1, NCUBE, Transputer-based); fine-grain smCs (Mosaic-C, J-machine) => DSM?; medium-to-coarse grain smCs (Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994) => next-generation DSM => smP.
• smP, scalable mP, non-uniform memory access: 1st smPs with no cache; DSM with some cache (DASH, Convex, Cray T3D, SCI); all-cache architecture (KSR Allcache); next-generation smP research, e.g., DDM, DASH+. Natural evolution adds cache for locality.
• Very coarse grain smC: networked workstations (Apollo, SUN, HP, etc.); WS clusters via special switches (1994) and ATM (1995), built from WS micros, fast switches, and high-bandwidth switches & communication protocols (e.g., ATM).]
Re-engineering HPCS

Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation, hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
High price to pay to support a DARPA U. to learn computer design -- the market is only $200 million and the R&D is billions; competition works far better.
The inevitable movement to standard networks and nodes cannot and need not be accelerated; these best evolve by a normal market mechanism, driven by users.
Dual use of networks & nodes is the path to widescale parallelism, not weird computers: networking is free via ATM, nodes are free via in situ workstations, and apps follow pervasive computing environments.
Applicability was small and getting smaller very fast, with many experienced computer companies entering the market with fine products (e.g., Convex/HP, Cray, DEC, IBM, SGI & SUN) that are leveraging their R&D and apps, apps, & apps.
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating the use of weird machines that take away from use, the weaker it becomes.
MPP won; mainstream vendors have adopted multiple CMOS micros. Stop funding! Environments & apps are needed, but are unlikely because the market is small.
Recommendations to HPCS

Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes.
• Dual use, not duel use, of products and technology -- the principle of "elegance": one part serves more than one function. Network companies supply networks; node suppliers use ordinary workstations/servers with existing apps, leveraging $30 billion x 10**6 R&D.
• Fund high speed, low latency networks as a ubiquitous service -- the base of all forms of interconnection from WANs to supercomputers (in addition, some special networks will exist for small-grain problems).
• Observe heuristics in future federal program funding scenarios: eliminate direct or indirect product development and mono-use computers. Fund Challenges, who in turn fund purchases, not product development.
• Funding or purchase of apps porting must be driven by Challenges, but should build on binary-compatible workstation/server apps to leverage nodes, be cross-platform based to benefit multiple vendors, & have cross-platform use.
• Review the effectiveness of State Computers, e.g., need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer.
• Review // programming environments & their efficacy to produce & support apps.
• Eliminate all forms of State Computers & recommend a balanced HPCS program: nodes & networks, based on industrial infrastructure. Stop funding the development of mono computers, including the 10 Tflop. It must be acceptable & encouraged to buy any computer for any contract.
Gratis advice for HPCC* & BS*

D. Bailey warns that scientists have almost lost credibility....
Focus on Gigabit NREN with low overhead connections that will enable multicomputers as a by-product
Provide many small, scalable computers vs large, centralized
Encourage (revert to) & support not so grand challenges
Grand Challenges (GCs) need explicit goals & plans --disciplines fund & manage (demand side)... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams (efforts that are rat holes that you can’t get out of)
Drop the funding & directed purchase of state computers
Revert to university research -> company & product development
Review the HPCC & GCs program's output ...
*High Performance Cash Conscriptor; Big Spenders
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?
© Gordon Bell68
Disclaimer
This talk may appear inflammatory... i.e. the speaker may have appeared "to flame".
It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.
Re. Centers Funding: August 1987 gbell memo to E. Bloch

A fundamentally broken system!
1. Status quo. NSF funds them, as we do now, in competition with computer science... use is completely decoupled from the supply... If I make the decision to trade off, it will not favor the centers.
2. Central facility. NSF funds ASC as an NSF central facility. This allows the Director, who has purview over all facilities and research, to make the trade-offs across the foundation.
3. NSF Directorate taxation. NSF funds it via some combination of the directorates on a taxed basis. The overall budget is set by the ADs. DASC would present the options and administer the program.
4. Directorate-based centers. The centers (all or in part) are "given" to the research directorates. NCAR provides an excellent model. Engineering might also operate a facility. I see great economy, increased quality, and effectiveness coming through specialization of programs, databases, and support.
5. Co-pay. In order to differentially charge for all the upgrades... a tax would be levied on various allocation awards. Such a tax would be nominal (e.g. 5%) in order to deal with the infinite appetite for new hardware and software. This would allow other agencies who use the computers to help pay as well.
6. Manufacturer support. Somehow, I don't see this changing for a long time. A change would require knowing something about the power of the machines so that manufacturers could compete to provide lower costs. BTW: Erich Bloch and I visited Cray Research and succeeded in getting their assistance.
7. Make the centers larger to share support costs. Manufacturers or service providers could contract with the centers to "run" facilities. This would reduce our costs somewhat on a per-machine basis.
8. Fewer physical centers. While we could keep the number of centers constant, greater economy of scale would be created by locating machines in a central facility... LANL and LLNL each run 8 Crays to share operators, mass storage, and other hardware and software support. With decent networks, multiple centers are even less important.
9. Simply have fewer centers, but with increasing power. This is the sole argument for centers!
10. Maintain centers at their current or constant core levels for some specified period. Each center would be totally responsible for upgrades, etc., and their own ultimate fate.
11. Free-market mechanism. Provide grant money for users to buy time. This might cost more, because I am sure we get free rides, e.g., Berkeley, Michigan, Texas, and the increasing number of institutions providing service.
GB Interview as CISE AD, July 1987

• We, together with our program advisory committees, have described the need for basic work in parallel processing to exploit both the research challenge and the plethora of parallel-processing machines that are available and emerging. We believe NSF's role is to sponsor a wide range of software research about these machines.
• This research includes basic computational models more suited to parallelism, new algorithms, standardized primitives (a small number) for addition to the standard programming languages, new languages based on parallel-computation primitives rather than extensions to sequential languages, and new applications that exploit parallelism.
• Three approaches to parallelism are clearly here now:
Bell CISE Interview, July 1987

• First, vector processing has become a primitive in supercomputers and mini-supercomputers. In becoming so, it has created a revolution in scientific applications. Unfortunately, computer science and engineering departments are not part of the revolution in scientific computation that is occurring as a result of the availability of vectors. New texts and curricula are needed.
• Second, message-passing models of computation can be used now on workstation clusters, on the various multicomputers such as the Hypercube and VAX clusters, and on the shared-memory multiprocessors (from supercomputers to multiple microprocessors). The Unix pipes mechanism may be acceptable as a programming model, but it has to be an awful lot faster for use in problems where medium-grain parallelism occurs. A remote procedure-call mechanism may be required for control.
• Third, microtasking of a single process using shared-memory multiprocessors must also be used independently. On shared-memory multiprocessors, both mechanisms would be provided and used in forms appropriate to the algorithms and applications. Of course, other forms of parallelism will be used, because it is relatively easy to build large, useful SIMD [single-instruction, multiple-data] machines.
Q: What performance do you expect from parallelism in the next decade?

A: Our goal is obtaining a factor of 100 in the performance of computing, not counting vectors, within the decade, and a factor of 10 within five years. I think 10 will be easy because it is inherently there in most applications right now. The hardware will clearly be there if the software can support it or the users can use it.
Many researchers think this goal is aiming too low. They think it should be a factor of 1 million within 15 years. However, I am skeptical: anything more than our goal will be too difficult in this time period. Still, a factor of 1 million may be possible through SIMD.
The reasoning behind the NSF goals is that we have parallel machines now and on the near horizon that can actually achieve these levels of performance. Virtually all new computer systems support parallelism in some form (such as vector processing or clusters of computers). However, this quiet revolution demands a major update of computer science, from textbooks and curriculum to applications research.
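The two stated goals imply the same annual growth rate; a quick check (the 100x and 10x figures are from the interview, the growth-rate arithmetic is mine):

```python
# 100x in 10 years and 10x in 5 years both imply the same yearly
# multiplier, so the five-year goal is the ten-year goal on schedule.

def annual_factor(total_gain: float, years: int) -> float:
    """Constant yearly multiplier achieving total_gain over years."""
    return total_gain ** (1.0 / years)

# 100**(1/10) == 10**(1/5), about 1.585, i.e. ~58% per year
assert abs(annual_factor(100, 10) - annual_factor(10, 5)) < 1e-9
```

At roughly 58% per year this sits above pure clock-rate improvement of the era, which is why the answer hinges on parallelism delivering the remainder.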
Bell Prize
Initiated…
NRC/OSTP Report, 18 August 1984

• Summary
– Pare the poor projects; fund proven researchers.
– Understand the range of technologies required, and especially the Japanese position; also vector processors.
• Heuristics for the program
– Apply current multiprocessors and multicomputers now.
– Fund software and applications for //ism starting now...
…the report greatly underestimates the position and underlying strength of the Japanese in regard to supercomputers. The report fails to make a substantive case about the U.S. position, based on actual data, in all the technologies from chips (where the Japanese clearly lead) to software engineering. The numbers used for present and projected performance appear to be wildly optimistic, with no real underlying experimental basis. A near-term future based on parallelism other than evolving pipelining is probably not realistic.

The report continues the tradition of recommending that funding science is good, and in addition that everything be funded. The conclusion to continue to invest in small-scale fundamental research, without a prioritization across the levels of integration or kinds of projects, would seem to be of little value to decision makers. For example, the specific knowledge that we badly need in order to exploit parallelism is not addressed. Nor is the issue of how we go about getting this knowledge.

My own belief is that small-scale research around a single researcher is the only style of work we understand or are effective with. This may not get us very far in supercomputers. Infrastructure is more important than wild, new computer structures if the "one professor" research model is to be useful in the supercomputer effort. While this is useful to generate small startup companies, it also generates basic ideas for improving the Japanese state of the art. This occurs because the Japanese excel in the transfer of knowledge from world research laboratories into their products, and because the U.S. has a declining technological base of product and process (manufacturing) engineering.

The problem of organizing experimental research in the many projects requiring a small laboratory (a Cray-style lab of 40 or so) to actually build supercomputer prototypes isn't addressed; these larger projects have been uniformly disastrous and the transfer to non-Japanese products negligible.

Surprisingly, no one asked Seymour Cray whether there was anything he wanted in order to stay ahead…
1. Narrow the choice of architectures that are to be pursued. There are simply too many poor ones, and too few that can be adequately staffed.
2. Fund only competent, full‑time efforts where people have proven ability to build … systems. These projects should be carried out by full‑time people, not researchers servicing multiple contracts and doing consulting. New entrants must demonstrate competence by actually building something!
3. Have competitive proposals and projects. If something is really an important area to fund…, then have two projects with …information exchange.
4. Fund balanced hardware/software/systems applications. Doing architectures without user involvement (or understanding) is sure to produce useless toys.
5. Recognize the various types of projects and what the various organizational structures are likely to be able to produce.
6. A strong infrastructure, from chips to systems, to support individual researchers will continue to produce interesting results. These projects should be no more than a dozen people, because professors don't work for or with other professors well.
7. There are many existing multicomputers and multiprocessors that could be delivered to universities to understand parallelism before we go off to build…
8. It is essential to get the Cray X‑MP alongside the Fujitsu machine to understand …parallelism associated with multiple processors, and pipelines.
9. Build "technology transfer mechanisms" in up front. Transfer doesn't happen automatically. Monitor the progress associated with "the transfer".
Residue
NSF TeraGrid c2003
Notes with Jim

• Applications software and its development is still the no. 1 problem.
• 64-bit addressing will change everything.
• Many machines are used for their large memory.
• Centers will always use all available time: cycles bottom-feeders.
• Allocation is still central, and a bad idea.
• Centers are not big enough.
• Can't really architect or recommend a plan unless you have some notion of the needs and costs!
• No handle on communication costs, especially for the last mile, where it's 50-150 Mbps, not fiber (10 Gbps). Two orders of magnitude low…
• Beowulf happened as an embarrassment to funders, not because of them.
• Walk through 7 > 2, now 3 centers. A center is a $50M/year expense when you upgrade!
• NSF: The tools development is questionable. Part of business. Feel very, very uncomfortable developing tools.
• Centers should be functionally specialized around communities and databases.
• Planning, budgets, and allocation must be with the disciplines. People vs. machines.
• TeraGrid problem: having not solved the clusters problem, move to a larger problem.
• File-oriented vs. database HPSS.