TRANSCRIPT

Page 1: The Other HPC: High Productivity Computing

The Other HPC: High Productivity Computing in Polystore Environments

Bill Howe, Ph.D.
Associate Director, eScience Institute
Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering

University of Washington, 06/28/2022

Page 2: The Other HPC: High Productivity Computing

[Figure: curves over time comparing the amount of data in the world with processing power]

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Amount of data in the world

Page 3: The Other HPC: High Productivity Computing

[Figure: the same chart with a third curve added for human cognitive capacity]

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Human cognitive capacity

Idea adapted from “Less is More” by Bill Buxton (2001)

Amount of data in the world

slide src: Cecilia Aragon, UW HCDE

Page 4: The Other HPC: High Productivity Computing


How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”


Page 5: The Other HPC: High Productivity Computing

“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used).

In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants)

So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format.

I guess in total [I spent] 6 months [on this project].”

At least 3 months on issues of scale, file handling, and feature extraction.

Martin Kircher, Genome Sciences

Why? 3k NSF postdocs in 2010, $50k per postdoc, at least 50% of time spent on overhead: 3,000 × $50,000 × 0.5, so maybe $75M annually at NSF alone?

Where does the time go? (2)

Page 6: The Other HPC: High Productivity Computing

Productivity: how long I have to wait for results

[Figure: time-to-result on a scale from milliseconds to months; HPC, Systems, and Databases each target different ranges; two thresholds are marked: a feasibility threshold and an interactivity threshold]

These two performance thresholds are really important; other requirements are situation-specific

Page 7: The Other HPC: High Productivity Computing


[Diagram: data models (Table, Graph, Array, Matrix, Key-Value, Dataframe) and the systems built around them (RDBMS, HIVE, Spark, R, Pandas, Ibis, MATLAB, GEMS, GraphX, Neo4J, Dato, Accumulo, SciDB, HDF5), all mapped onto the Myria Polystore Algebra]

Page 8: The Other HPC: High Productivity Computing


Desiderata for a Polystore Algebra

• Captures user intent
• Affords reasoning and optimization
• Accommodates best-known algorithms


Page 9: The Other HPC: High Productivity Computing
Page 10: The Other HPC: High Productivity Computing


Why do we care? Algebraic Optimization

N = ((z*2) + ((z*3) + 0)) / 1

Algebraic laws:
1. (+) identity: x + 0 = x
2. (/) identity: x / 1 = x
3. (*) distributes: (n*x + n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2: N = (2+3)*z

two operations instead of five, no division operator

Same idea works with the Relational Algebra
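As an aside (not from the talk), SymPy will perform this same simplification mechanically, which makes the "optimization as term rewriting" point concrete:

import sympy

z = sympy.symbols('z')
expr = ((z * 2) + ((z * 3) + 0)) / 1
# SymPy applies the identity and distributivity laws shown above:
print(sympy.simplify(expr))   # 5*z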

Page 11: The Other HPC: High Productivity Computing

The Myria Algebra is…

Relational Algebra
+ While / Sequence
+ Flatmap
+ Window Ops
+ Sample
(+ Dimension Bounds)

https://github.com/uwescience/raco/

Page 12: The Other HPC: High Productivity Computing

[Architecture diagram: MyriaL programs enter a middleware layer implementing the Polystore Algebra with rewrite rules; specialized algebras sit below it (a Parallel Algebra, an Array Algebra), targeting the back ends MyriaX, Radish, SciDB, and GEMS through their APIs (MyriaX API, Radish API, SciDB API, Graph API); orchestration spans the back ends]

Services: visualization, logging, discovery, history, browsing

Page 13: The Other HPC: High Productivity Computing

How does this actually work?

(1) Client submits a program in one of several Big Data languages (or programs directly against the API).

(2) The program is parsed into an expression tree.

Page 14: The Other HPC: High Productivity Computing

How does this actually work?

(3) The expression tree is optimized into a parallel, federated execution plan involving one or more Big Data platforms.

(4) Depending on the back end, the parallel plan may be directly compiled into executable code.
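A minimal sketch of steps (2) and (3) in Python, with hypothetical operator names (Myria's real expression trees live in the raco package linked earlier): build an expression tree, then apply one classic rewrite rule, pushing a selection below a join so the filter runs before the expensive operator.

from dataclasses import dataclass
from typing import Any

@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: Any
    right: Any
    cond: str

@dataclass
class Select:
    child: Any
    pred: str   # predicate text
    side: str   # which join input the predicate references: "left" or "right"

def push_selection(plan):
    # Rewrite rule: Select(Join(L, R)) -> Join(Select(L), R)
    # when the predicate touches only one join input.
    if isinstance(plan, Select) and isinstance(plan.child, Join):
        j = plan.child
        if plan.side == "left":
            return Join(Select(j.left, plan.pred, "left"), j.right, j.cond)
        return Join(j.left, Select(j.right, plan.pred, "right"), j.cond)
    return plan

tree = Select(Join(Scan("density"), Scan("annotation"), "d.x = a.x"),
              "d.v > 0", "left")
print(push_selection(tree))   # the Select now sits under the Join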

Page 15: The Other HPC: High Productivity Computing

How does this actually work?

(5) Orchestrates the parallel, federated plan execution across the platforms.

[Diagram: Client → MyriaQ → Sys1, Sys2]

Page 16: The Other HPC: High Productivity Computing

How does this actually work?

(6) Exposes query execution logs and results through a REST API and a visual web-based interface.
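For flavor, a sketch of polling such a REST API from a client. The host, port, endpoint path, and response fields below are assumptions for illustration, not Myria's documented interface:

import requests

BASE = "http://localhost:8753"   # assumed host and port

def query_status(query_id):
    # assumed resource path; consult the Myria docs for the real one
    resp = requests.get(f"{BASE}/query/query-{query_id}")
    resp.raise_for_status()
    return resp.json()           # status plus links to logs and results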

Page 17: The Other HPC: High Productivity Computing


What can you do with a Polystore Algebra?

1) Facilitate Experiments
– Provide reference implementations
– Apply shared optimizations for apples-to-apples comparisons
– K-means, Markov chain, Naïve Bayes, TPC-H, Betweenness Centrality, Sigma-clipping, Linear Algebra
– LANL is using this idea to express algorithms that solve the governing equations for heat transfer models!


Page 18: The Other HPC: High Productivity Computing


What can you do with a Polystore Algebra?

2) Rapidly develop new applications
– Microbial Oceanography
– Neuroanatomy
– Music Analytics
– Video Analytics
– Clinical Analytics
– Astronomical image de-noising


Page 19: The Other HPC: High Productivity Computing

Ex: SeaFlow

[Diagram: flow cytometer: laser, microscope objective, pinhole lens, nozzle (d1, d2); detectors measure FSC (forward scatter), orange fluorescence, and red fluorescence]

Francois Ribalet, Jarred Swalwell, Ginger Armbrust

Page 20: The Other HPC: High Productivity Computing

Ex: SeaFlow

[Figure: RED fluorescence vs. FSC, with clusters labeled Prochlorococcus, Picoplankton, Ultraplankton, and Nanoplankton]

Continuous observations of various phytoplankton groups from 1-20 µm in size:
– Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
– Based on ORANGE fluo: Synechococcus, Cryptophytes
– Based on FSC: Coccolithophores

Francois Ribalet, Jarred Swalwell, Ginger Armbrust

Page 21: The Other HPC: High Productivity Computing

SeaFlow in Myria

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Dan Halperin, Sophie Clayton

Page 22: The Other HPC: High Productivity Computing


Page 23: The Other HPC: High Productivity Computing
Page 24: The Other HPC: High Productivity Computing
Page 25: The Other HPC: High Productivity Computing

select a.annotation, var_samp(d.density) as var
from density d
join annotation a on d.x = a.x and d.y = a.y and d.z = a.z
group by a.annotation
order by var desc
limit 10

Sample variance by annotation across all experiments

Page 26: The Other HPC: High Productivity Computing


Are two regions connected?

adjacent(r1, r2) :- annotation(experiment, x1, y1, z1, r1),
                    annotation(experiment, x2, y2, z2, r2),
                    x2 = x1+1 or y2 = y1+1 or z2 = z1+1

connected(r1, r2) :- adjacent(r1, r2)
connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
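These recursive rules compute a transitive closure. A minimal Python fixpoint over toy adjacency pairs (an illustration, not the Myria evaluator):

def connected(adjacent):
    # connected(r1, r2) :- adjacent(r1, r2)
    # connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
    closure = set(adjacent)
    frontier = set(adjacent)            # semi-naive: extend only new facts
    while frontier:
        new = {(r1, r3)
               for (r1, r2) in frontier
               for (s, r3) in adjacent if s == r2} - closure
        closure |= new
        frontier = new
    return closure

print(connected({("a", "b"), ("b", "c"), ("c", "d")}))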

Page 27: The Other HPC: High Productivity Computing

Music Analytics

segments = scan(Jeremy:MSD:SegmentsTable);
songs = scan(Jeremy:MSD:SongsTable);

seg_count = select song_id, count(segment_number) as c
            from segments;
density = select songs.song_id, (seg_count.c / songs.duration) as density
          from songs, seg_count
          where songs.song_id = seg_count.song_id;
store(density, public:adhoc:song_density);

Computing song density on the Million-Song Dataset

http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/

(a blog post on how to run it in 20 minutes on Hadoop)
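As a point of comparison, the same computation in pandas (toy rows; column names taken from the MyriaL above):

import pandas as pd

songs = pd.DataFrame({"song_id": ["s1", "s2"], "duration": [200.0, 180.0]})
segments = pd.DataFrame({"song_id": ["s1", "s1", "s2"],
                         "segment_number": [0, 1, 0]})

# seg_count: number of segments per song
seg_count = (segments.groupby("song_id", as_index=False)["segment_number"]
             .count().rename(columns={"segment_number": "c"}))

# density: segments per second of song
density = songs.merge(seg_count, on="song_id")
density["density"] = density["c"] / density["duration"]
print(density[["song_id", "density"]])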

Page 28: The Other HPC: High Productivity Computing


-- calculate probability of outcomes
Poe = select input_sp.id as inputId,
             sum(CondP.lp) as lprob,
             CondP.outcome as outcome
      from CondP, input_sp
      where CondP.index = input_sp.index
        and CondP.value = input_sp.value;

-- select the max probability outcome
classes = select inputId, ArgMax(outcome, lprob) from Poe;

Naïve Bayes Classification: Million Song Dataset

Predict song year in a 515,345-song dataset using eight timbre features, discretized into intervals of size 10
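A sketch of those two steps in pandas, with toy stand-ins for the CondP and input_sp relations (the ArgMax aggregate becomes a groupby plus idxmax):

import pandas as pd

CondP = pd.DataFrame({"index": [0, 0, 1, 1], "value": [10, 10, 20, 20],
                      "outcome": [1990, 2000, 1990, 2000],
                      "lp": [-1.0, -2.0, -0.5, -0.1]})  # log P(feature=value | outcome)
input_sp = pd.DataFrame({"id": [7, 7], "index": [0, 1], "value": [10, 20]})

# Poe: total log-probability per (input, outcome)
poe = (CondP.merge(input_sp, on=["index", "value"])
       .groupby(["id", "outcome"], as_index=False)["lp"].sum())

# classes: pick the max-probability outcome per input
classes = poe.loc[poe.groupby("id")["lp"].idxmax()]
print(classes)   # id 7 -> outcome 1990 (lp -1.5 beats -2.1)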

Page 29: The Other HPC: High Productivity Computing

[Figure: average heart rate (beats/minute) and average relative heart rate variance vs. time (hours); annotations mark a possible bad-data segment and a region of lower heart rate variance]

Page 30: The Other HPC: High Productivity Computing

MIMIC Information Flow

[Diagram: a client (headless Octave + web interface) talks to the Myria middleware (REST interface, optimization, orchestration), which dispatches structured data to MyriaX and waveform data to SciDB; the client/server boundary sits between the client and the middleware]

Page 31: The Other HPC: High Productivity Computing

https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf

Page 32: The Other HPC: High Productivity Computing

https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf

Page 33: The Other HPC: High Productivity Computing


Ollie Lo, Los Alamos National Lab

Page 34: The Other HPC: High Productivity Computing


What can you do with a Polystore Algebra?

3) Reason about algorithms

• Apply application-specific optimizations (in addition to automatic optimizations)


Page 35: The Other HPC: High Productivity Computing


CurGood = SCAN(public:adhoc:sc_points);

DO
  mean = [FROM CurGood EMIT val=AVG(v)];
  std = [FROM CurGood EMIT val=STDEV(v)];
  NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
  CurGood = CurGood - NewBad;
  continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;

DUMP(CurGood);

Sigma-clipping, V0
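For intuition, a minimal NumPy rendering of V0 (an illustration, not the MyriaL that actually runs on Myria): recompute the statistics from scratch each pass and drop points beyond 2 standard deviations.

import numpy as np

def sigma_clip(v, k=2.0):
    # V0: full re-aggregation every iteration
    good = np.asarray(v, dtype=float)
    while True:
        mean, std = good.mean(), good.std(ddof=1)
        keep = np.abs(good - mean) <= k * std
        if keep.all():
            return good
        good = good[keep]

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 1000), [50.0, -40.0]])
print(sigma_clip(data).size)   # planted outliers (and the 2-sigma tails) clipped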

Page 36: The Other HPC: High Productivity Computing


CurGood = SCAN(public:adhoc:sc_points);
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = [];
DO
  sum = sum - [FROM NewBad EMIT SUM(val)];
  sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
  cnt = cnt - [FROM NewBad EMIT CNT(*)];
  mean = sum / cnt;
  std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum));
  NewBad = FILTER([ABS(val-mean) > std], CurGood);
  CurGood = CurGood - NewBad;
WHILE NewBad != {};

Sigma-clipping, V1: Incremental
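The incremental idea in V1, sketched in NumPy (illustration only): maintain the sufficient statistics (sum, sum of squares, count), and when points are rejected, subtract their contribution rather than re-aggregating CurGood each pass.

import numpy as np

def sigma_clip_incremental(v, k=2.0):   # the slide's V1 uses k=1
    good = np.asarray(v, dtype=float)
    s, ss, n = good.sum(), (good * good).sum(), good.size
    while True:
        mean = s / n
        # sample std recovered from the sufficient statistics
        std = np.sqrt((n * ss - s * s) / (n * (n - 1)))
        bad = np.abs(good - mean) > k * std
        if not bad.any():
            return good
        rejected = good[bad]
        s -= rejected.sum()                # subtract, don't rescan
        ss -= (rejected * rejected).sum()
        n -= rejected.size
        good = good[~bad]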

Page 37: The Other HPC: High Productivity Computing


Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = [];
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];

DO
  new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
  aggs = [FROM aggs, new_aggs
          EMIT _sum=aggs._sum - new_aggs._sum,
               sumsq=aggs.sumsq - new_aggs.sumsq,
               cnt=aggs.cnt - new_aggs.cnt];

  stats = [FROM aggs EMIT mean=_sum/cnt,
                          std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];

  newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];

  tooLow = [FROM Points, bounds, newBounds
            WHERE newBounds.lower > v AND v >= bounds.lower
            EMIT v=Points.v];
  tooHigh = [FROM Points, bounds, newBounds
             WHERE newBounds.upper < v AND v <= bounds.upper
             EMIT v=Points.v];
  newBad = UNIONALL(tooLow, tooHigh);

  bounds = newBounds;
  continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;

output = [FROM Points, bounds
          WHERE Points.v > bounds.lower AND Points.v < bounds.upper
          EMIT v=Points.v];
DUMP(output);

Sigma-clipping, V2
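V2's further trick, sketched in NumPy (illustration only): the clip bounds only tighten from one pass to the next, so each pass has to classify only the points that fall between the old bounds and the new ones; everything else keeps its status.

import numpy as np

def sigma_clip_bounds(v, k=2.0):
    pts = np.asarray(v, dtype=float)
    s, ss, n = pts.sum(), (pts * pts).sum(), pts.size
    lo, hi = pts.min(), pts.max()          # initial bounds
    while True:
        mean = s / n
        std = np.sqrt((n * ss - s * s) / (n * (n - 1)))
        new_lo, new_hi = mean - k * std, mean + k * std
        # only points between the old and new bounds can newly become bad
        too_low = pts[(pts >= lo) & (pts < new_lo)]
        too_high = pts[(pts > new_hi) & (pts <= hi)]
        new_bad = np.concatenate([too_low, too_high])
        if new_bad.size == 0:
            return pts[(pts > new_lo) & (pts < new_hi)]
        s -= new_bad.sum()
        ss -= (new_bad * new_bad).sum()
        n -= new_bad.size
        lo, hi = new_lo, new_hi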

Page 38: The Other HPC: High Productivity Computing


What can you do with a Polystore Algebra?

4) Orchestrate Federated Workflows


Page 39: The Other HPC: High Productivity Computing

More Orchestrating Federated Workflows

[Diagram: Client → MyriaQ → MyriaX, SciDB, Spark, Hadoop, RDBMS]

Page 40: The Other HPC: High Productivity Computing


What can you do with a Polystore Algebra?

5) Study the price of abstraction


Page 41: The Other HPC: High Productivity Computing

Compiling the Myria algebra to bare-metal PGAS programs

RADISH (ICDE 15), Brandon Myers

Page 42: The Other HPC: High Productivity Computing

RADISH (ICDE 15), Brandon Myers

Page 43: The Other HPC: High Productivity Computing

Query compilation for distributed processing

[Diagram: two compilation strategies. (a) Compile each pipeline as parallel code, which a parallel compiler turns into machine code [Myers '14]. (b) Compile each pipeline fragment separately with a sequential compiler into machine code [Crotty '14, Li '14, Seo '14, Murray '11].]

Page 44: The Other HPC: High Productivity Computing


1% selection microbenchmark, 20GB

Avoid long code paths

Page 45: The Other HPC: High Productivity Computing


Q2 SP2Bench, 100M triples, multiple self-joins

Communication optimization

Page 46: The Other HPC: High Productivity Computing


Graph Patterns

• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria's supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more

RADISH (ICDE 15)

Page 47: The Other HPC: High Productivity Computing


Page 48: The Other HPC: High Productivity Computing

select A.i, B.k, sum(A.val * B.val)
from A, B
where A.j = B.j
group by A.i, B.k

Matrix multiply in RA
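A quick sanity check of that equivalence (illustration only): encode each matrix as (i, j, val) tuples, and the join-plus-aggregate reproduces NumPy's matrix product.

import numpy as np
from itertools import product

A = np.arange(12, dtype=float).reshape(3, 4)
B = np.arange(8, dtype=float).reshape(4, 2)

# relational encoding: one (i, j, val) tuple per (non-zero) entry
tA = [(i, j, A[i, j]) for i, j in product(range(3), range(4))]
tB = [(j, k, B[j, k]) for j, k in product(range(4), range(2))]

C = np.zeros((3, 2))
for (i, ja, va), (jb, k, vb) in product(tA, tB):
    if ja == jb:               # where A.j = B.j
        C[i, k] += va * vb     # group by (i, k), sum(A.val * B.val)

assert np.allclose(C, A @ B)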

Page 49: The Other HPC: High Productivity Computing

Complexity of matrix multiply (n = number of rows, m = number of non-zeros):

[Figure: complexity exponent vs. sparsity exponent r, where m = n^r.
– best known dense algorithm: n^2.38
– naïve sparse algorithm: mn
– best known sparse algorithm: m^0.7 n^1.2 + n^2
There is lots of room between these curves.]

slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication

Page 50: The Other HPC: High Productivity Computing

[Figure: BLAS vs. SpBLAS vs. SQL (10k), with an off-the-shelf database; a 15X gap is annotated]

Page 51: The Other HPC: High Productivity Computing

Relative Speedup of SpBLAS vs. HyperDB

– speedup = T_HyperDB / T_SpBLAS
– benchmark datasets with r = 1.2, plus the real data cases (the three largest datasets: 1.17 < r < 1.20)
– on star (nTh = 12), on dragon (nTh = 60)
– As n increases, the relative speedup of SpBLAS over HyperDB shrinks.
– soc-Pokec: the speedup is only around 5x.
– On star, HyperDB got stuck thrashing on the soc-Pokec data.

Page 52: The Other HPC: High Productivity Computing


20k × 20k matrix multiply by sparsity: CombBLAS, MyriaX, Radish

Page 53: The Other HPC: High Productivity Computing


50k × 50k matrix multiply by sparsity: CombBLAS, MyriaX, Radish; filter to the upper-left corner of the result matrix

Page 54: The Other HPC: High Productivity Computing


What can you do with a Polystore Algebra?

6) Provide new services over a Polystore Ecosystem


Page 55: The Other HPC: High Productivity Computing

Lowering barrier to entry

Page 56: The Other HPC: High Productivity Computing

Exposing Performance Issues, Dominik Moritz (EuroSys 15)

Page 57: The Other HPC: High Productivity Computing

Exposing Performance Issues, Dominik Moritz (EuroSys 15)

[Figure: communication heatmap, source worker vs. destination worker]

Page 58: The Other HPC: High Productivity Computing

Voyager: Visualization Recommendation, Kanit "Ham" Wongsuphasawat (InfoVis 15)

Page 59: The Other HPC: High Productivity Computing

Scalable Graph Clustering, Seung-Hee Bae

Version 1: Parallelize the best-known serial algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approximate algorithm, 1.5B edges (SC 2015)

Page 60: The Other HPC: High Productivity Computing

Viziometrics: Analysis of Visualization in the Scientific Literature, Poshen Lee

[Figure: proportion of non-quantitative figures in a paper vs. paper impact, grouped into 5% percentiles]

Page 61: The Other HPC: High Productivity Computing

http://escience.washington.edu

http://myria.cs.washington.edu

http://uwescience.github.io/sqlshare/