the impact of computer architectures on linear architectures...

1

The Impact Of ComputerThe Impact Of ComputerArchitectures On LinearArchitectures On Linear Algebra and NumericalAlgebra and Numerical

LibrariesLibraries

Jack Dongarra Innovative Computing Laboratory University of Tennessee

/

High Performance ComputersHigh Performance Computers♦ ~ 20 years ago ¾ 1x106 Floating Point Ops/sec (Mflop/s)

¾ Scalar based

♦ ~ 10 years ago ¾ 1x109 Floating Point Ops/sec (Gflop/s)

¾ Vector & Shared memory computing, bandwidth aware¾ Block partitioned, latency tolerant

♦ ~ Today ¾ 1x1012 Floating Point Ops/sec (Tflop/s)

¾ Highly parallel, distributed processing, message passing, network based ¾ data decomposition, communication/computation

♦ ~ 10 years away ¾ 1x1015 Floating Point Ops/sec (Pflop/s)

¾ Many more levels MH, combination/grids&HPC ¾ More adaptive, LT and bandwidth aware, fault tolerant,

extended precision, attention to SMP nodes 2

TOP500TOP500- Listing of the 500 most powerful Computers in the World

- Yardstick: Rmax from LINPACK MPP Ax=b, dense problem

Size

TPP performance

- Updated twice a year SC‘xy in the States in November Meeting in Mannheim, Germany in June

Rat

e

- All data available from www.top500.org 3

GFl

op/s

Fastest Computer Over Time

504540353025201510

5 0

(2048)

(8)

TMC CM-2

Fujitsu VP-2600

Cray Y-MP

1990 1992 1994 1996 1998 2000

YearIn 1980 a computation that took 1 full year to complete can now be done in ~ 9 hours!

GFl

op/s

Fastest Computer Over Time

500 450 400 350 300 250 200 150 100

50 Y-MP

0 1990 1992 1994 1996 1998 2000

Year

(2040)

(6788)

Fuj

(140)

(1024)

(4)(2048)

Fuj

(8)

Hitachi CP-PACS

Intel Paragon

itsu VPP-500

TMC CM-5 NEC

SX-3 TMC CM-2 its u

VP-2600 Cray

In 1980 a computation that took 1 full year to complete can now be done in ~ 13 minutes!

GFl

op/s

Fastest Computer Over Time5000 ASCI White

Pacific 4500 (7424) 4000 3500 Intel ASCI

Red Xeon3000 ASCI Blue

2500 Pacific SST (9632)

Intel (5808)2000 ASCI Red SGI ASCI 1500 (9152) Blue

Intel Hitachi Mountain 1000

0

TMC Fujitsu Paragon CP-PACS

500 Cray TMC NEC CM-5 VPP-500 (6788) (2040) (5040)

Fujitsu CM-2 SX-3 (1024) (140)Y-MP (8) VP-2600 (2048) (4)

1990 1992 1994 1996 1998 2000

YearIn 1980 a computation that took 1 full year to complete can today be done in ~90 second!

Top 10 Machines (Nov 2000)Top 10 Machines (Nov 2000)Rank Company Machine Procs Gflop/s Place Country Year

1 IBM ASCI White 8192 4938 Livermore National Laboratory Livermore 2000

2 Intel ASCI Red 9632 2380 Sandia National Labs Albuquerque USA 1999

3 IBM ASCI Blue-Pacific SST, IBM SP 604e 5808 2144 Lawrence Livermore National

Laboratory Livermore USA 1999

4 SGI ASCI Blue Mountain 6144 1608 Los Alamos National Laboratory

Los Alamos USA 1998

5 IBM SP Power3 375 MHz 1336 1417 Naval Oceanographic Office

(NAVOCEANO) USA 2000

6 IBM SP Power3 375 MHz 1104 1179 National Center for

Environmental Protection USA 2000

7 Hitachi SR8000-F1/112 112 1035 Leibniz Rechenzentrum Muenchen Germany 2000

8 IBM SP Power3 375 MHz, 8 way 1152 929 UCSD/San Diego

Supercomputer Center USA 2000

9 Hitachi SR8000-F1/100 100 917 High Energy Accelerator

Research Organization /KEK Tsukuba

Japan 2000

10 Cray Inc. T3E1200 1084 892 Government USA 1998

Performance DevelopmentPerformance Development100 Tflop/s

10 Tflop/s

1 Tflop/s

100 Gflop/s

10 Gflop/s

1 Gflop/s

100 Mflop/s

88.1 TF/s

4.9 TF/s

55.1 GF/s Intel XP/S140

Sandia

Fujitsu 'NWT' NAL

SNI VP200EX Uni Dresden

Cray Y-MP M94/4

KFA Jülich

Cray Y-MP C94/364

'EPA' USA

Hitachi/Tsukuba CP-PACS/2048

SGI POWER

CHALLANGE GOODYEAR

Intel ASCI Red

Sandia

Intel ASCI Red

Sandia

Sun Ultra HPC 1000

News International

Sun HPC 10000 Merril Lynch

Fujitsu 'NWT' NAL

IBM ASCI White

IBM 604e 69 proc

A&P

N=1

N=500

SUM

n-93Nov-9

3Ju

n-94Nov-9

4Ju

n-95ov-9

5Ju

n-96 6Ju

n-97 7n-98

Nov-98

Jun-99

Nov-99

Jun-00

Nov-009 9v v-

Ju N o o JuN N

[60G - 400 M][4.9 Tflop/s 55Gflop/s], Schwab #15, 1/2 each year, 209 > 100 Gf, faster than Moore’s law,

Performance DevelopmentPerformance Development

0.1

1

10

100

1000

10000

100000

1000000

Perfo

rman

ce [G

Flop

/s]

N=1

N=500

Sum

1 TFlop/s

1 PFlop/s ASCI

Earth Simulator

My Laptop

9 4 97-93 -96 99 0 02 3 05 6 08- - -0 -0 -0 -

Nov

Nov n - n- n- nn n v v v v

Ju Ju Ju No Ju No Ju No Ju No

Entry 1 T 2005 and 1 P 2010

-09

ArchitecturesArchitectures

Single Processor

SMP

MPP

SIMD Constellation Cluster - NOW

0

100

200

300

400

500

Y-MP C90

Sun HPC

Paragon

CM5 T3D

T3E

SP2

Cluster of Sun HPC

ASCI Red

CM2

VP500

SX3

93

94

95 96 96 97 97

98 -99 99 -00 00v-93 -9 5 -9894-n- n- n-

ov ov- v- n- v- ov- v v-un

No un

Non no No

NoJu Ju Ju Ju Ju JuN N N NJ J

112 const, 28 clus, 343 mpp, 17 smp

10

Chip Technology Chip Technology

Inmos Transputer

Alpha

IBM

HP

intel

MIPS

SUN

Other

0

100

200

300

400

500

Jun-93

Nov-93

Jun-94

Nov-94

Jun-95

Nov-95

Jun-96

Nov-96

Jun-97

Nov-97

Jun-98

Nov-98

Jun-99

Nov-99

Jun-00

Nov-00

HighHigh--Performance Computing Directions:Performance Computing Directions: BeowulfBeowulf--class PC Clustersclass PC Clusters

♦ COTS PC Nodes ♦ Best price-Definition: Advantages:

¾ Pentium, Alpha,PowerPC, SMP

♦ COTS LAN/SANInterconnect

performance ♦ Low entry-level cost ♦ Just-in-place

configuration¾ Ethernet, Myrinet,

♦ Vendor invulnerableGiganet, ATM ♦ Scalable♦ Open Source Unix

¾ Linux, BSD ♦ Rapid technology ♦ Message Passing tracking

Computing Enabled by PC hardware, networks and operating system ¾ MPI, PVM achieving capabilities of scientific workstations at a fraction of

the cost and availability of industry standard message¾ HPF passing libraries. However, much more of a contact sport. 12

Perf

orm

ance

Where Does the Performance Go? orWhere Does the Performance Go? orWhy Should I Care About the Memory Hierarchy?Why Should I Care About the Memory Hierarchy?

µProc1000

DRAM

CPU

Processor-Memory Performance Gap: (grows 50% / year)

“Moore’s Law”

Processor-DRAM Memory Gap (latency)

60%/yr. (2X/1.5yr)

100

10DRAM9%/yr.

1 (2X/10 yrs)

1980

19

81

1982

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

Time 13

Optimizing Computation andOptimizing Computation and Memory UseMemory Use

♦ Computational optimizations¾ Theoretical peak:(# fpus)*(flops/cycle) * Mhz

¾ PIII: (1 fpu)*(1 flop/cycle)*(850 Mhz) = 850 MFLOP/s ¾ Athlon: (2 fpu)*(1flop/cycle)*(600 Mhz) = 1200 MFLOP/s ¾ Power3: (2 fpu)*(2 flops/cycle)*(375 Mhz) = 1500 MFLOP/s

♦ Operations like: T¾ α = x y : 2 operands (16 Bytes) needed for 2 flops;

at 850 Mflop/s will requires 1700 MW/s bandwidth ¾ y = α x + y : 3 operands (24 Bytes) needed for 2 flops;

at 850 Mflop/s will requires 2550 MW/s bandwidth

♦ Memory optimization ¾ Theoretical peak: (bus width) * (bus speed)

¾ PIII : (32 bits)*(133 Mhz) = 532 MB/s = 66.5 MW/s¾ Athlon: (64 bits)*(133 Mhz) = 1064 MB/s = 133 MW/s¾ Power3: (128 bits)*(100 Mhz) = 1600 MB/s = 200 MW/s

14

Memory HierarchyMemory Hierarchy♦ By taking advantage of the principle of locality:¾ Present the user with as much memory as is available in

the cheapest technology. ¾ Provide access at the speed offered by the fastest

technology.

Control

Datapath

Secondary Storage (Disk)

Processor

Registers

Main Memory (DRAM)

Level 2 and 3 Cache

(SRAM)

On-C

hipC

ache

Tertiary Storage

(Disk/Tape)

Distributed Memory

Remote Cluster

Memory

Speed (ns): 1s 10s Size (bytes): 100s

Ks

100s 10,000,000s 10,000,000,000s(10s ms) (10s sec)

Ms 100,000 s 10,000,000 s(.1s ms) (10s ms)

Gs Ts

SelfSelf--Adapting NumericalAdapting Numerical Software (SANS)Software (SANS)

♦ Today’s processors can achieve high-performance, butthis requires extensive machine-specific hand tuning.

♦ Operations like the BLAS require many man-hours /platform • Software lags far behind hardware introduction • Only done if financial incentive is there

♦ Hardware, compilers, and software have a largedesign space w/many parameters ¾ Blocking sizes, loop nesting permutations, loop unrolling

depths, software pipelining strategies, register allocations,and instruction schedules.

¾ Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.

♦ Need for quick/dynamic deployment of optimized routines. ♦ ATLAS - Automatic Tuned Linear Algebra Software 16

Software GenerationSoftware Generation StrategyStrategy♦ Level 1 cache multiply

optimizes for: ¾ TLB access¾ L1 cache reuse¾ FP unit usage¾ Memory fetch¾ Register reuse¾ Loop overhead

minimization ♦ Takes about 30 minutes

to run. ♦ “New” model of high

performance programmingwhere critical code is machine generated usingparameter optimization.

♦ Code is iterativelygenerated & timed untiloptimal case is found. We try: ¾ Differing NBs ¾ Breaking false

dependencies ¾ M, N and K loop unrolling

♦ Designed for RISC arch ¾ Super Scalar ¾ Need reasonable C

compiler

♦ Today ATLAS in use byMatlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, …

17

ATLASATLAS (DGEMM n = 500)(DGEMM n = 500)

0.0

100.0

200.0

300.0

400.0

500.0

600.0

700.0

800.0

900.0

MFL

OPS

Vendor BLAS ATLAS BLAS F77 BLAS

0 00270

112 00

0

135

500 933

200

60 200

266

16533 2

n -- -2- -

28-

c2-

30

ro II -

04- 2 335/

56 6 -ev

IIIo erer mP

ow arw m p phl

C6/7ev p0 i 00iu00 00

um

AMD At C tio

BMPP ti SnC iu enDE PP 2090 a

DE Pe

GI R10 1ent trM M

R

P P UlBIBH PII GI nSuS SArchitectures

♦ ATLAS is faster than all other portable BLASimplementations and it is comparable withmachine-specific libraries provided by the vendor. 18

Intel PIII 933 MHzIntel PIII 933 MHzMKL 5.0MKL 5.0 vsvs ATLAS 3.2.0 using WindowsATLAS 3.2.0 using Windows 20002000

800

700

600

500

400

300

200

100

0

DGEMM DSYMM DSYRK DSYR2K DTRMM DTRSM

MFL

OPS

Vendor BLAS ATLAS BLAS Reference BLAS

BLAS


Matrix Vector Multiply DGEMVMatrix Vector Multiply DGEMV

0

50

100

150

200

250

300

MFL

OPS

Vendor NoTrans Vendor Trans ATLAS NoTrans Atlas Trans F77 NoTrans F77 NoTrans

000 00 63500 00

11233

2933

20 7

/1

0 62000 16 2-26 -25 -5 28- 0-Ion- 2- 3-56- 4- o35 II6 0 II c2

mPr 3r r

ECev ripe e m pa7 m phl v 6 w w00/

00iu

C e

BMPPC 00u

D At

IBM

Po

BMPo

enti Stiiu n 00

R12

0 a90E D nt Pe trM P UlD 1

HP Pe RA I n

SGII GI

Su

Architectures

S 20

LU Factorization, RecursiveLU Factorization, Recursive w/ATLASw/ATLAS

0

10 0

20 0

30 0

40 0

50 0

60 0

70 0

MFL

OPS

Ven dor B L AS ATLA S B L AS F77 B L AS

AMDAthl

on-600

DEC ev56

-533

DEC ev6-5

00 HP90

00/735

/135

IBMPPC60

4-112

IBM Power2

-160

IBM Power3

-200

Pentiu

m Pro-200

Pen

tium II-

266

Pentiu

m III-93

3

SGI R12

000ip

30-27

0

SGI R10

000ip

28-20

0

Sun U

ltraSparc

2-200

Architecture


Related Tuning ProjectsRelated Tuning Projects ♦ PHiPAC ¾ Portable High Performance ANSI C

www.icsi.berkeley.edu/~bilmes/phipac initial automatic GEMM generation project

♦ FFTW Fastest Fourier Transform in the West ¾ www.fftw.org

♦ UHFFT ¾ tuning parallel FFT algorithms¾ rodin.cs.uh.edu/~mirkovic/fft/parfft.htm

♦ SPIRAL¾ Signal Processing Algorithms Implementation Research for

Adaptable Libraries maps DSP algorithms to architectures ♦ Sparsity ¾ Sparse-matrix-vector and Sparse-matrix-matrix multiplication

www.cs.berkeley.edu/~ejim/publication/ tunes code to sparsitystructure of matrix more later in this tutorial

22

23

0500

10001500200025003000350040004500

100

300

500

700

900

Size

Mflo

p/s

Intel P4 1.5 GHz SSEIntel P4 1.5 GHzAMD Athlon 1GHzIntel IA64 666MHz

ATLAS Matrix Multiply ATLAS Matrix Multiply (64 & 32 bit floating point results)(64 & 32 bit floating point results)

32 bit floating point using SSE

MachineMachine--Assisted ApplicationAssisted Application Development and AdaptationDevelopment and Adaptation

♦ Communication libraries¾Optimize for the specifics of one’s configuration.

♦ Algorithm layout and implementation¾Look at the different ways to express implementation

24

Work in Progress:Work in Progress:ATLASATLAS--like Approach Applied to Broadcastlike Approach Applied to Broadcast (PII 8 Way Cluster with 100 Mb/s switched network)(PII 8 Way Cluster with 100 Mb/s switched network)

Root

Sequential Binary Ring

Binomial

Message Size Optimal algorithm Buffer Size (bytes) (bytes)

8 binomial 8 16 binomial 16 32 binary 32 64 binomial 64 128 binomial 128 256 binomial 256 512 binomial 512 1K sequential 1K 2K binary 2K 4K binary 2K 8K binary 2K 16K binary 4K 32K binary 4K 64K ring 4K 128K ring 4K 256K 512K

ring ring

4K 4K 25

1M binary 4K

Reformulating/Rearranging/ReuseReformulating/Rearranging/Reuse ♦ Example is the reduction to narrow band

from for the SVD

A TA = − u T −wvnew Ty = A unew

w = A vnew new

♦ Fetch each entry of A once ♦ Restructure and combined operations ♦ Results in a speedup of > 30%

26

CG Variants by DynamicCG Variants by Dynamic Selection at Run TimeSelection at Run Time♦ Variants combine

inner products toreduce communication bottleneck at the expense of morescalar ops.

♦ Same number of iterations, no advantage on asequential processor

♦ With a large numberof processor and a

may be advantages. ♦ Improvements can

range from 15% to50% depending onsize. 27

high-latency network

CG Variants by DynamicCG Variants by Dynamic Selection at Run TimeSelection at Run Time♦ Variants combine

inner products toreduce communication bottleneck at the expense of morescalar ops.

♦ Same number of iterations, no advantage on asequential processor

♦ With a large numberof processor and ahigh-latency networkmay be advantages.

♦ Improvements canrange from 15% to50% depending onsize. 28

Gaussian EliminationGaussian Elimination

0 x

x

x x

.

.

.

Standard Way

0

x

00

. . .

0

LINPACK subtract a multiple of a row apply sequence to a column

0

x

00

. . .

0 x

nb

a3

a2

a1

L

=L-1 a2a2 a3=a3-a1*a2nb LAPACK

apply sequence to nb then apply nb to rest of matrix29

Gaussian Elimination via aGaussian Elimination via a Recursive AlgorithmRecursive Algorithm

F. Gustavson and S. ToledoLU Algorithm:

1: Split matrix into two rectangles (m x n/2)if only 1 column, scale by reciprocal of pivot & return

2: Apply LU Algorithm to the left part

3: Apply transformations to right part (triangular solve A12 = L-1A12 and matrix multiplication A22=A22 -A21*A12 )

30

4: Apply LU Algorithm to right part

L A12

A21 A22

Most of the work in the matrix multiply Matrices of size n/2, n/4, n/8, …

Recursive FactorizationsRecursive Factorizations♦ Just as accurate as conventional method ♦ Same number of operations ♦ Automatic variable blocking ¾ Level 1 and 3 BLAS only !

♦ Extreme clarity and simplicity of expression ♦ Highly efficient ♦ The recursive formulation is just a

rearrangement of the point-wise LINPACKalgorithm

♦ The standard error analysis applies (assumingthe matrix operations are computed the“conventional” way).

31

32

DGEMM ATLAS & DGETRF RecursiveAMD Athlon 1GHz (~$1100 system)

0

100

200

300

400

500 1000 1500 2000 2500 3000

Order

MFl

op/s

Pentium III 550 MHz Dual ProcessorLU Factorization

800

600

Mflo

p /s

400

200

0

LAPACK

Recursive LU

Recursive LU

LAPACKUniprocessor

Dual-processor

500 1000 1500 2000 2500 3000 3500 4000 4500 5000

Order

33


0

500

1000

1500

2000

100 200 300 400 500 600

Order

MFl

op/s

Pentium III 933 MHzSGEMM

Windows 2000using SSE

2000

1500

1000

M flo

p/ s

500

0

100 200 300 400 500 600 700 800 900 1000

Order

34


0

500

1000

1500

2000

100 200 300 400 500 600

Order

MFl

op/s

Pentium III 933 MHzDGEMM

Windows 2000

800

600

400

Mflo

p /s

200

0

1 2 3 4 5 6 7 8 9 10

Order

35


0

500

1000

1500

2000

100 200 300 400 500 600

Order

MFl

op/s

Pentium III 933 MHzS, D, C, and Z GEMM

Windows 2000S & C use SSE

2000

1500

1000

Mflo

p /s

500

0

100 200 300 400 500 600 700 800 900 1000

Order

SuperLUSuperLU -- High PerformanceHigh Performance Sparse SolversSparse Solvers♦ SuperLU; X. Li and J. Demmel

¾ Solve sparse linear system A x = b usingGaussian elimination.

¾ Efficient and portable implementation on modern architectures:¾ Sequential SuperLU : PC and workstations

¾ Achieved up to 40% peak Megafloprate

¾ SuperLU_MT : shared-memory parallel machines¾ Achieved up to 10 fold speedup

¾ SuperLU_DIST : distributed-memory parallel machines¾ Achieved up to 100 fold speedup

¾ Support real and complex matrices, fill-reducing orderings, equilibration, numerical pivoting, condition estimation, iterative refinement, and error bounds.

♦ Enabled Scientific Discovery ¾ First solution to quantum scattering of 3

charged particles. [Recigno, Baertschy, Isaacs & McCurdy, Science, 24 Dec 1999]

¾ SuperLU solved complex unsymmetric systems of order up to 1.79 million, on the ASCI Blue Pacific Computer at LLNL.

36

Recursive Factorization AppliedRecursive Factorization Applied to Sparse Direct Methodsto Sparse Direct Methods

Victor Eijkhout, Piotr Luszczek & JD

1. Symbolic Factorization 2. Search for blocks that

contain non-zeros 3. Conversion to sparse

recursive storage 4. Search for embedded

blocks 5. Numerical factorization

Layout of sparse recursive matrix

37

Dense recursive factorizationDense recursive factorization

♦ The algorithm:

function rlu(A)

A

A

A

begin

rlu(A11); recursive call

21←A21 · U-1(A11); xTRSM() on upper triangular submatrix

12 ←L1-1(A11) · A12; xTRSM() on lower triangular submatrix

22 ←A22-A21·A12; xGEMM()

rlu(A22); recursive call

end.

♦ Replace xTRSM and xGEMM with sparse implementations that are themselves recursive 38

Sparse Recursive Factorization AlgorithmSparse Recursive Factorization Algorithm

♦ Solutions - continued¾fast sparse xGEMM() is two-level algorithm ¾recursive operation on sparse data structures ¾dense xGEMM() call when recursion reaches single block

♦ Uses Reverse Cuthill-McKee orderingcausing fill-in around the band

♦ No partial pivoting¾use iterative improvement or¾pivot only within blocks

39

Recursive storage conversion stepsMatrix divided into 2x2 blocks Matrix with explicit 0’s and fill-in

Recursive algorithm division lines

• - original nonzero value0 - zero value introduced due to

blocking x - zero value introduced due to fill-in

40

Breakdown of Time Across Phases

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

l t.

ing

iion

ion

it.

Breakdown of Time Across Phases For the Recursive Sparse Factorization

100% Numericafac

Ebedded block

Recurs ve convers

Block convers

Symbol cal fac

a f23560 e x11 goodwin jpwh_991 mcfe memplus ola fu orsre g_1 psmigr_1 ra e fsky_3 ra e fsky_4 sa ylr4 she rma n3 she rma n5 wa ng3 42

Di f f e r e nt Te st M a t r i c e s

SETI@homeSETI@home♦ Use thousands of Internet-

connected PCs to help inthe search for extraterrestrial intelligence.

♦ Uses data collected with the Arecibo Radio Telescope, in Puerto Rico

♦ When their computer is idleor being wasted thissoftware will download a 300 kilobyte chunk of datafor analysis.

♦ The results of this analysisare sent back to the SETI team, combined with thousands of other participants.

♦ Largest distributed computation project in existence ¾ ~ 400,000 machines

¾ Averaging 26 Tflop/s

♦ Today many companies trying this for profit.

43

Distributed and Parallel SystemsDistributed and Parallel Systems

memt

cennocr

t

er

swfsut

psol Massively

emoh

@

isDl

Distributed cfl ok

edesba

d

wsre

/

I TCSA

parallel

fntgna uwoeBl ellilpoi owrsystems

heterogeneous

it aupmoC

aTESI isut ceps

tr aPreNtnE

systems homo

rGi

C l

geneous

♦ Bounded set of resources ♦ Apps grow to consume all cycles ♦ Application manages resources ♦ System SW gets in the way ♦ 5% overhead is maximum ♦ Apps drive purchase of

equipment ♦ Real-time constraints ♦ Space-shared

♦ Gather (unused) resources ♦ Steal cycles ♦ System SW manages resources ♦ System SW adds value ♦ 10% - 20% overhead is OK ♦ Resources drive applications ♦ Time to completion is not

critical ♦ Time-shared

Napster

The GridThe Grid♦ To treat CPU cycles and software like commodities.

♦ on steroids.

♦ Enable the coordinated use of geographicallydistributed resources – in the absence of central control and existing trust relationships.

♦ Computing power is produced much like utilities suchas power and water are produced for consumers.

♦ Users will have access to “power” on demand

45

NetSolveNetSolve Network Enabled ServerNetwork Enabled Server

♦ NetSolve is an example of a grid basedhardware/software server.

♦ Easy-of-use paramount ♦ Based on a RPC model but with …¾ resource discovery, dynamic problem solving

capabilities, load balancing, fault toleranceasynchronicity, security, …

♦ Other examples are NEOS from Argonneand NINF Japan.

♦ Use a resource, not tie togethergeographically distributed resources fora single application. 46

NetSolve: The Big PictureNetSolve: The Big PictureClient Schedule

47

AGENT(s)

A C

S1 S2

S3 S4

A, B, C

Answer (C)

S2 !

Request

Op(C, A, B)

Matlab

Mathematica

C, Fortran

Java, Excel

Database

No knowledge of the grid required, RPC like.

Basic UsageBasic Usage ScenariosScenarios

♦ Grid based numerical library routines¾ User doesn’t have to have

software library on theirmachine, LAPACK, SuperLU,ScaLAPACK, PETSc, AZTEC, ARPACK

♦ Task farming applications¾ “Pleasantly parallel”

execution¾ eg Parameter studies

♦ Remote applicationexecution ¾ Complete applications with

user specifying inputparameters and receiving output

♦ “Blue Collar” Grid Based Computing ¾ Does not require deep

knowledge of network programming

¾ Level of expressiveness right for many users

¾ User can set things up, no “su” required

¾ In use today, up to 200 servers in 9 countries

48

Futures for Linear Algebra NumericalFutures for Linear Algebra Numerical Algorithms and SoftwareAlgorithms and Software♦ Numerical software will be adaptive,

exploratory, and intelligent♦ Determinism in numerical computing will be

gone.¾ After all, its not reasonable to ask for exactness in numerical computations.

¾ Audibility of the computation, reproducibility at a cost

♦ Importance of floating point arithmetic will beundiminished. ¾ 16, 32, 64, 128 bits and beyond.

♦ Reproducibility, fault tolerance, and auditability ♦ Adaptivity is a key so applications can function

appropriately

49

Contributors to These IdeasContributors to These Ideas♦ Top500 ¾ Erich Strohmaier, LBL ¾ Hans Meuer, Mannheim U

♦ Linear Algebra ¾ Victor Eijkhout, UTK ¾ Piotr Luszczek, UTK ¾ Antoine Petitet, UTK ¾ Clint Whaley, UTK

♦ NetSolve ¾ Dorian Arnold, UTK ¾ Susan Blackford, UTK ¾ Henri Casanova, UCSD ¾ Michelle Miller, UTK ¾ Sathish Vadhiyar, UTK

For additional information see…

www.netlib.org/atlas/ www.netlib.org/netsolve/

www.cs.utk.edu/~dongarra/

Many opportunities within the group at Tennessee 50

Intel® Math Kernel Library 5.1 for IA32 andIntel® Math Kernel Library 5.1 for IA32 andItaniumItanium™ Processor™ Processor--based Linux applications Betabased Linux applications Beta License Agreement PRELicense Agreement PRE--RELEASERELEASE

♦ The Materials are pre-release code, which may not be fully functional and which Intel may substantially modify in producing any final version. Intel can provide no assurance that it will ever produce or make generally available a final version. You agree to maintain as confidential all information relating to your use of the Materials and not to disclose to any third party any benchmarks, performance results, or other information relating to the Materials comprising the pre-release.

♦ See: http://developer.intel.com/software/products/mkl/mkllicense51_lnx.htm

51

the impact of computer architectures on linear architectures...

Documents