The Other HPC: High Productivity Computing in Polystore Environments


Page 1: The Other HPC: High Productivity Computing in Polystore Environments

The Other HPC: High Productivity Computing in Polystore Environments

Bill Howe, Ph.D.

Associate Professor, Information School

Adjunct Associate Professor, Computer Science & Engineering

Associate Director, eScience Institute

Director, Urbanalytics Group

8/7/2017 Bill Howe, UW 1

Page 2: The Other HPC: High Productivity Computing in Polystore Environments

What is the rate-limiting step in data understanding?

[Figure: amount of data in the world and processing power (Moore’s Law), each plotted against time.]

Page 3: The Other HPC: High Productivity Computing in Polystore Environments

What is the rate-limiting step in data understanding?

[Figure: amount of data in the world, processing power (Moore’s Law), and human cognitive capacity, plotted against time.]

Idea adapted from “Less is More” by Bill Buxton (2001)

slide src: Cecilia Aragon, UW HCDE

Page 4: The Other HPC: High Productivity Computing in Polystore Environments

Productivity: How long I have to wait for results

[Figure: wait-time scale running months, weeks, days, hours, minutes, seconds, milliseconds, marked with a feasibility threshold and an interactivity threshold. Other labels: HPC, Systems, Databases.]

Claim: Only these two performance thresholds are generally important; other performance requirements are application-specific

Page 5: The Other HPC: High Productivity Computing in Polystore Environments

HPC                                            DB / Dataflow
priority is machine efficiency                 priority is developer efficiency
data manipulation considered pre-processing    analysis considered post-processing
batch                                          batch and interactive

Page 6: The Other HPC: High Productivity Computing in Polystore Environments

Observations

• Every interesting application has both a data manipulation component and an analytics component

• Different people like to express things in different ways

• Different systems offer better performance at different things

• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra

• So we want full “anything anywhere” rewrites

Page 7: The Other HPC: High Productivity Computing in Polystore Environments


Page 8: The Other HPC: High Productivity Computing in Polystore Environments

Matrix Multiply

Matrix multiply in RA

select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k

Sparse means: |non-zero elements| < |rows|^~1.2

Naïve sparse algorithm: |non-zero elements| * |rows|

Best-known dense algorithm: |rows|^2.38
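The relational form of matrix multiply can be tried directly. The following is a minimal sketch, assuming matrices stored as (row, column, value) triples in an in-memory SQLite database; the 2x2 example values are made up for illustration and SQLite merely stands in for the systems benchmarked later in the talk.

```python
import sqlite3

# Hypothetical 2x2 matrices as (row, column, value) triples.
# Zero entries are simply absent, which is what makes the encoding sparse.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # [[1, 2], [0, 3]]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # [[4, 0], [5, 6]]

conn = sqlite3.connect(":memory:")
conn.execute("create table A (i integer, j integer, val real)")
conn.execute("create table B (j integer, k integer, val real)")
conn.executemany("insert into A values (?, ?, ?)", A)
conn.executemany("insert into B values (?, ?, ?)", B)

# The query from the slide: join on the shared index j, then sum per (i, k).
rows = conn.execute(
    """
    select A.i, B.k, sum(A.val * B.val)
    from A, B
    where A.j = B.j
    group by A.i, B.k
    order by A.i, B.k
    """
).fetchall()

product = {(i, k): v for i, k, v in rows}
print(product)
```

The join pairs every non-zero of A with the non-zeros of B that share its inner index, and the group-by performs the summation, so the work scales with the number of matching non-zero pairs rather than with the full dense dimensions.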

Page 9: The Other HPC: High Productivity Computing in Polystore Environments

Complexity of matrix multiply

n = number of rows

m = number of non-zeros

[Figure: complexity exponent plotted against sparsity exponent r, where m = n^r. Annotation between the curves: "lots of room here".]

naïve sparse algorithm: mn

best known sparse algorithm: m^0.7 n^1.2 + n^2

best known dense algorithm: n^2.38

slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
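The three bounds can be compared by substituting m = n^r and looking only at the exponent of n. The sketch below does that arithmetic; it ignores constants and lower-order factors, and the function names are illustrative, not from the talk.

```python
def naive_sparse_exponent(r: float) -> float:
    # m * n = n^r * n = n^(r + 1)
    return r + 1.0


def best_sparse_exponent(r: float) -> float:
    # m^0.7 * n^1.2 + n^2 is dominated by the larger of its two terms
    return max(0.7 * r + 1.2, 2.0)


DENSE_EXPONENT = 2.38

for r in (1.0, 1.2, 1.5, 2.0):
    print(r, naive_sparse_exponent(r), best_sparse_exponent(r), DENSE_EXPONENT)

# Sparsity exponents at which each sparse bound meets the dense bound.
naive_crossover = DENSE_EXPONENT - 1.0           # solves r + 1 = 2.38
best_crossover = (DENSE_EXPONENT - 1.2) / 0.7    # solves 0.7 r + 1.2 = 2.38
print(naive_crossover, best_crossover)
```

Under these assumptions the naïve sparse bound beats the dense bound up to roughly r = 1.38, and the best known sparse bound up to roughly r = 1.69.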

Page 10: The Other HPC: High Productivity Computing in Polystore Environments

Single-Server Experiment

Top-shelf SQL (HyPer)

vs. Top-shelf Dense Library (MKL BLAS)

vs. Top-shelf Sparse Library (MKL SpBLAS)

Who wins? By how much?

Page 11: The Other HPC: High Productivity Computing in Polystore Environments

BLAS vs. SpBLAS vs. SQL (10k)

off-the-shelf database (HyPer)

15X

Page 12: The Other HPC: High Productivity Computing in Polystore Environments

Single Node Sparse Matrix Multiply: BLAS vs. SpBLAS vs. HyPer (N=20k)

1. Dense is not competitive

2. 50X-100X gap between DB and library

Page 13: The Other HPC: High Productivity Computing in Polystore Environments

Single Node Sparse Matrix Multiply: BLAS vs. SpBLAS vs. SQL (N=50k)

1. Dense is not competitive

2. 10X-50X gap between DB and library

Page 14: The Other HPC: High Productivity Computing in Polystore Environments

Single Node Sparse Matrix Multiply: Relative Speedup of SpBLAS vs. HyPer

100X on small data

“Only” 5X on big data

Page 15: The Other HPC: High Productivity Computing in Polystore Environments

Single Node Sparse Matrix Multiply: SpBLAS vs. SQL (Real Data)

About 5X on Real Data

Page 16: The Other HPC: High Productivity Computing in Polystore Environments

Distributed Experiment

MyriaX

vs. CombBLAS

Who wins? By how much?

Page 17: The Other HPC: High Productivity Computing in Polystore Environments

CombBLAS vs. MyriaX (N=50k) on star

8X to 45X

Page 18: The Other HPC: High Productivity Computing in Polystore Environments

CombBLAS vs. MyriaX (Real Data)

• CombBLAS 10X faster on one dataset

• MyriaX 1.5X faster on another!

Page 19: The Other HPC: High Productivity Computing in Polystore Environments

A x B x C

Plan: group . join . group . join

select AB.i, C.m, sum(AB.val*C.val)
from
  (select A.i, B.k, sum(A.val*B.val) as val
   from A, B
   where A.j = B.j
   group by A.i, B.k
  ) AB,
  C
where AB.k = C.k
group by AB.i, C.m

Plan: group . join . join

select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
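The two plans for A x B x C compute the same result; they differ only in whether the intermediate product is aggregated before the second join. A minimal sketch, assuming small made-up matrices in SQLite and an explicit `val` alias on the inner aggregate so the outer query can refer to it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table A (i integer, j integer, val real);
    create table B (j integer, k integer, val real);
    create table C (k integer, m integer, val real);
    insert into A values (0, 0, 1), (0, 1, 2), (1, 1, 3);
    insert into B values (0, 0, 4), (1, 0, 5), (1, 1, 6);
    insert into C values (0, 0, 7), (1, 0, 8), (1, 1, 9);
    """
)

# group . join . group . join: aggregate A x B first, then join with C.
nested = conn.execute(
    """
    select AB.i, C.m, sum(AB.val * C.val)
    from (select A.i as i, B.k as k, sum(A.val * B.val) as val
          from A, B
          where A.j = B.j
          group by A.i, B.k) AB,
         C
    where AB.k = C.k
    group by AB.i, C.m
    order by AB.i, C.m
    """
).fetchall()

# group . join . join: one three-way join, aggregated once at the end.
flat = conn.execute(
    """
    select A.i, C.m, sum(A.val * B.val * C.val)
    from A, B, C
    where A.j = B.j and B.k = C.k
    group by A.i, C.m
    order by A.i, C.m
    """
).fetchall()

print(nested)
print(flat)
```

The rewrite between the two is valid because multiplication distributes over addition; which plan is cheaper depends on how much the early aggregation shrinks the intermediate result.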

Page 20: The Other HPC: High Productivity Computing in Polystore Environments

Observations

• Every interesting application has both a data manipulation component and an analytics component

• Different people like to express things in different ways

• Different systems offer better performance at different things

• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra

• So we want full “anything anywhere” rewrites


Page 21: The Other HPC: High Productivity Computing in Polystore Environments


Page 22: The Other HPC: High Productivity Computing in Polystore Environments


Linear Algebra

Relational Algebra

Page 23: The Other HPC: High Productivity Computing in Polystore Environments
Page 24: The Other HPC: High Productivity Computing in Polystore Environments


Page 25: The Other HPC: High Productivity Computing in Polystore Environments


Page 26: The Other HPC: High Productivity Computing in Polystore Environments

Example: Combine measurements from sensors, compute means & covariances

Preprocessing (easier to express in RA)

Analysis (easier to express in LA)

Dylan Hutchison

Page 27: The Other HPC: High Productivity Computing in Polystore Environments

Example: Sensor Difference Mean & Covariance

Array of Things sensor data, collected in CSV files (https://arrayofthings.github.io/)

t    c     v
466  temp  55.2
466  hum   40.1
492  temp  56.3
492  hum   35.0
528  temp  56.5

Steps shown: Filter, bin onto common time buckets (x2); Subtract; Compute Mean; Compute Covariance

Preprocessing (easier to express in RA)

Analysis (easier to express in LA)

Dylan Hutchison

Page 28: The Other HPC: High Productivity Computing in Polystore Environments

Bin query: easy in RA, harder in LA

Input, as a table:

t    c     v
466  temp  55.2
466  hum   40.1
492  temp  56.3
492  hum   35.0
528  temp  56.5

Input, as a matrix:

      temp  hum
466   55.2  40.1
492   56.3  35.0
528   56.5

Binned output, as a table:

t'   c     v
460  temp  55.2
460  hum   40.1
520  temp  56.4
520  hum   35.0

Binned output, as a matrix:

      temp  hum
460   55.2  40.1
520   56.4  35.0

bin(t) = t − t % 60 + 60 ⌊(t % 60)/60 + .5⌋

RA

SELECT bin(t) AS t', c, avg(v) AS v
GROUP BY t', c

LA

Multiply (binning matrix * input matrix = output matrix), using avg on added elements:

      466  492  528
460   1
520        1    1

Dylan Hutchison
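The RA form of the bin query can be sketched in a few lines of plain Python. The `origin` parameter below is an assumption added for illustration: with `origin=0` the function rounds to the nearest multiple of the bucket width, and `origin=40` is chosen only because it reproduces the 460 and 520 buckets shown in the example tables.

```python
import math
from collections import defaultdict


def bin_time(t: int, width: int = 60, origin: int = 0) -> int:
    """Round t to the nearest bucket centre of the form origin + k * width."""
    return origin + width * math.floor((t - origin) / width + 0.5)


# (t, c, v) readings from the example table.
readings = [
    (466, "temp", 55.2), (466, "hum", 40.1),
    (492, "temp", 56.3), (492, "hum", 35.0),
    (528, "temp", 56.5),
]

# Equivalent of: SELECT bin(t) AS t', c, avg(v) AS v GROUP BY t', c
groups = defaultdict(list)
for t, c, v in readings:
    groups[(bin_time(t, origin=40), c)].append(v)

binned = {key: sum(vs) / len(vs) for key, vs in groups.items()}
for key in sorted(binned):
    print(key, binned[key])
```

The grouping step is what the LA formulation has to emulate with a 0/1 binning matrix and an averaging "add", which is why the query reads more naturally in RA.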

Page 29: The Other HPC: High Productivity Computing in Polystore Environments

Covariance query: easy in LA, harder in RA

X is an n ⨉ d matrix (n points as rows, d attributes as columns), e.g. for n = 3, d = 2:

x11 x12
x21 x22
x31 x32

M = (1/n) 1^T X is a 1 ⨉ d matrix

C = (1/n) X^T X − M^T M is a d ⨉ d matrix

LA

N = size(X, 1);
M = mean(X, 1);
C = X'*X / N - M'*M;

RA

(Generated SQL statements for each entry)

T = SELECT sum(1.0) AS N,
      avg(X1) AS M1, avg(X2) AS M2, …, avg(Xd) AS Md,
      sum(X1*X1) AS Q11, sum(X1*X2) AS Q12, …,
      sum(Xd-1*Xd) AS Q(d-1)d, sum(Xd*Xd) AS Qdd
    FROM X

C = (SELECT 1 AS i, 1 AS j, Q11/N - M1*M1 AS v FROM T) UNION
    (SELECT 1 AS i, 2 AS j, Q12/N - M1*M2 AS v FROM T) UNION
    …

Carlos Ordonez. Building Statistical Models and Scoring with UDFs. SIGMOD 2007.

Dylan Hutchison
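The two routes to the covariance matrix can be checked against each other on a toy input. This is a sketch under stated assumptions: the four data points are invented, d is fixed at 2, SQLite stands in for the relational engine, the means are taken with avg so that Q/N − M·M is the covariance, and the result is the population covariance (divided by n, as in the formula for C).

```python
import sqlite3

# Hypothetical data: n = 4 points, d = 2 attributes.
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0), (4.0, 7.0)]
n, d = len(X), 2

# LA route: M = (1/n) 1^T X and C = (1/n) X^T X - M^T M, written with loops.
M = [sum(row[a] for row in X) / n for a in range(d)]
C_la = [
    [sum(row[a] * row[b] for row in X) / n - M[a] * M[b] for b in range(d)]
    for a in range(d)
]

# RA route: one pass of aggregates, then arithmetic on the single result row.
conn = sqlite3.connect(":memory:")
conn.execute("create table X (X1 real, X2 real)")
conn.executemany("insert into X values (?, ?)", X)
N, M1, M2, Q11, Q12, Q22 = conn.execute(
    "select sum(1.0), avg(X1), avg(X2), "
    "sum(X1*X1), sum(X1*X2), sum(X2*X2) from X"
).fetchone()
C_ra = [
    [Q11 / N - M1 * M1, Q12 / N - M1 * M2],
    [Q12 / N - M2 * M1, Q22 / N - M2 * M2],
]

print(C_la)
print(C_ra)
```

The SQL side needs one aggregate per entry of the result, which is why it is generated rather than written by hand once d grows.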

Page 30: The Other HPC: High Productivity Computing in Polystore Environments

LARA: COMPREHENSIVE UNIFIED LINEAR AND RELATIONAL ALGEBRA

Page 31: The Other HPC: High Productivity Computing in Polystore Environments

R A G K

Myria Algebra

Page 32: The Other HPC: High Productivity Computing in Polystore Environments

[Figure: architecture diagram. Labels shown: MyriaL; Services: visualization, logging, discovery, history, browsing; Myria Middleware; Logical Algebra; rewrite rules; Orchestration and Execution of the Polystore Plan. Algebras: Parallel Algebra, Array Algebra, Graph Algebra, KeyVal Algebra, Serial Algebra. APIs: Spark API, MyriaAPI, CombBLAS API, GEMS API, Accumulo API, C. Systems: Spark, Myria, CombBLAS, GEMS, Accumulo, Serial C.]

Page 33: The Other HPC: High Productivity Computing in Polystore Environments

[Figure: the same architecture diagram (MyriaL, services, Myria Middleware with Logical Algebra and rewrite rules, per-system algebras and APIs for Spark, Myria, CombBLAS, GEMS, Accumulo, and Serial C), extended with LARA components: LARA Algebra, LARA API, LARA Physical Plans, LaraDB (Accumulo).]

Page 34: The Other HPC: High Productivity Computing in Polystore Environments

Objects: Associative Tables

Total functions from keys to values with finite support

k1  k2  v1 [0]  v2 ['']
a   37  7       'dan'
a   20  0       ''
b   25  0       'dylan'
b   20  2       'bill'

Diagram labels: Attributes, Keys, Values, Default Values, Support

Operators:

Join   ⋈⊗     “horizontal concat”
Union  ⨁      “vertical concat”
Ext    ext f  “flatmap”

UDFs: ⊗, ⨁, f

Think “Semiring”

Join and Union adapted from: M. Spight and V. Tropashko. First steps in relational lattice. 2006.

Ext is a restricted form of monadic bind

Page 35: The Other HPC: High Productivity Computing in Polystore Environments

Join: Horizontal Concat

A:

a   c   x [0]  z [0]
a1  c1  11     1
a1  c2  12     2
a2  c1  13     3
a3  c3  14     4

B:

c   b   z [0']  y [0']
c1  b1  5       15
c2  b1  6       16
c2  b2  7       17
c4  b1  8       18

A ⋈⊗ B:

a   c   b   z [0 ⊗ 0']
a1  c1  b1  1 ⊗ 5
a1  c2  b1  2 ⊗ 6
a1  c2  b2  2 ⊗ 7
a2  c1  b1  3 ⊗ 5
(a3 c3  b1  4 ⊗ 0' = 0 ⊗ 0')

Requires:

vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'
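The join example can be traced with a small sketch. It assumes an associative table is represented as a Python dict from key tuples to a single value (only the shared z column), with absent keys standing for the default, and it instantiates ⊗ as ordinary multiplication with default 0; the `join` helper is illustrative and is not the LARA API.

```python
def join(A, keys_a, B, keys_b, times):
    """Join two associative tables on their shared key attributes.

    A and B map key tuples to one value; absent keys hold the default.
    Only rows present in both inputs can produce a non-default result,
    which is what the 'Requires' condition on the default guarantees.
    """
    shared = [k for k in keys_a if k in keys_b]
    out_keys = keys_a + [k for k in keys_b if k not in keys_a]
    out = {}
    for ka, va in A.items():
        ra = dict(zip(keys_a, ka))
        for kb, vb in B.items():
            rb = dict(zip(keys_b, kb))
            if all(ra[k] == rb[k] for k in shared):
                merged = {**ra, **rb}
                out[tuple(merged[k] for k in out_keys)] = times(va, vb)
    return out_keys, out


# The z columns of the two example tables.
A = {("a1", "c1"): 1, ("a1", "c2"): 2, ("a2", "c1"): 3, ("a3", "c3"): 4}
B = {("c1", "b1"): 5, ("c2", "b1"): 6, ("c2", "b2"): 7, ("c4", "b1"): 8}

keys, AB = join(A, ["a", "c"], B, ["c", "b"], times=lambda x, y: x * y)
print(keys)
for key in sorted(AB):
    print(key, AB[key])
```

The output keys are the union of the input keys, and rows such as (a3, c3), which have no partner in B, stay at the default and are not stored.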

Page 36: The Other HPC: High Productivity Computing in Polystore Environments

Union: Vertical Concat

A:

a   c   x [0]  z [0]
a1  c1  11     1
a1  c2  12     2
a2  c1  13     3
a3  c3  14     4

B:

c   b   z [0]  y [0]
c1  b1  5      15
c2  b1  6      16
c2  b2  7      17
c4  b1  8      18

A ⨁ B:

c   x [0]     z [0 ⨁ 0 = 0]  y [0]
c1  11 ⨁ 13   1 ⨁ 3 ⨁ 5      15
c2  12        2 ⨁ 6 ⨁ 7      16 ⨁ 17
c3  14        4              0
c4  0         8              18

Requires:

v ⨁ 0 = 0 ⨁ v = v
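Union can be sketched the same way. The assumptions match the representation used for illustration here: dicts from key tuples to a single value (only the shared z column is shown), absent keys meaning the default 0, and ⨁ instantiated as ordinary addition. The `union` helper is hypothetical, not part of any LARA implementation.

```python
def union(A, keys_a, B, keys_b, plus, zero=0):
    """Union two associative tables: keep shared keys, combine values with plus."""
    shared = [k for k in keys_a if k in keys_b]
    out = {}
    for table, keys in ((A, keys_a), (B, keys_b)):
        positions = [keys.index(k) for k in shared]
        for key, value in table.items():
            projected = tuple(key[p] for p in positions)
            # Starting from zero is safe because v + 0 = 0 + v = v.
            out[projected] = plus(out.get(projected, zero), value)
    return shared, out


# The z columns of the two example tables.
A = {("a1", "c1"): 1, ("a1", "c2"): 2, ("a2", "c1"): 3, ("a3", "c3"): 4}
B = {("c1", "b1"): 5, ("c2", "b1"): 6, ("c2", "b2"): 7, ("c4", "b1"): 8}

keys, U = union(A, ["a", "c"], B, ["c", "b"], plus=lambda x, y: x + y)
print(keys)
for key in sorted(U):
    print(key, U[key])
```

Projecting both inputs onto the shared key c acts as an aggregation, so union doubles as group-by when one of the inputs is empty.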

Page 37: The Other HPC: High Productivity Computing in Polystore Environments

Ext: Flatmap

A:

a   c   x [0]  z [0]
a1  c1  11     1
a1  c2  12     2
a2  c1  13     3

f(a, c, x, z) =

k'  v'
ac  x – z
ca  z – x

ext f A =

a   c   k'    v' [0 – 0 = 0]
a1  c1  a1c1  11 – 1
a1  c1  c1a1  1 – 11
a1  c2  a1c2  12 – 2
a1  c2  c2a1  2 – 12
a2  c1  a2c1  13 – 3
a2  c1  c1a2  3 – 13

Requires:
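The flatmap behaviour of Ext can be sketched as follows, using the x and z values from the input table. The dict representation and the `ext` helper are assumptions made for illustration; the user function f returns a small table keyed by the new attribute k', and its rows are appended under the original key.

```python
def ext(A, f):
    """Apply f to every (key, value) entry and extend the key with f's new keys."""
    out = {}
    for key, value in A.items():
        for new_key, new_value in f(key, value).items():
            out[key + new_key] = new_value
    return out


# A maps the key (a, c) to the value pair (x, z).
A = {("a1", "c1"): (11, 1), ("a1", "c2"): (12, 2), ("a2", "c1"): (13, 3)}


def f(key, value):
    a, c = key
    x, z = value
    # Two output rows per input row, keyed by the new attribute k'.
    return {(a + c,): x - z, (c + a,): z - x}


result = ext(A, f)
for key in sorted(result):
    print(key, result[key])
```

Each input row expands into as many rows as f returns, which is the sense in which Ext is a restricted monadic bind.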

Page 38: The Other HPC: High Productivity Computing in Polystore Environments

Summary: Union, Join, Ext

                  Key Types      Value Types   Support
Union (A ⊕ B)     = K_A ∩ K_B    = V_A ∪ V_B   ⊆ S_A ∪ S_B
Join (A ⋈⊗ B)     = K_A ∪ K_B    = V_A ∩ V_B   ⊆ S_A ∩ S_B
Ext (ext f A)     extended by f  set by f      ⊆ S_A × S_f

Union and Join are dual ("Duality").

For Support, ‘⊆’ becomes ‘=’ if ⊕ is zero-sum-free or ⊗ has zero-product-property

Page 39: The Other HPC: High Productivity Computing in Polystore Environments

LARA Properties

> If ⨁ or ⊗ is associative, commutative, or idempotent, then so is Union or Join

> (Push Aggregation into Join) If ⊗ distributes over ⨁,

> (Distribute Join over Union) If , then

= sum(AB ⊗ CT)

Page 40: The Other HPC: High Productivity Computing in Polystore Environments

RADISH: COMPILING QUERIES TO HPC ARCHITECTURES

Page 41: The Other HPC: High Productivity Computing in Polystore Environments

Query compilation for distributed processing

pipeline as parallel code -> parallel compiler -> machine code [Myers ’14]

pipeline fragment code (per fragment) -> sequential compiler -> machine code [Crotty ’14, Li ’14, Seo ’14, Murray ’11]

Page 42: The Other HPC: High Productivity Computing in Polystore Environments

RADISH

ICS 16

Brandon Myers

Page 43: The Other HPC: High Productivity Computing in Polystore Environments

1% selection microbenchmark, 20GB

Avoid long code paths

ICS 16

Brandon Myers

Page 44: The Other HPC: High Productivity Computing in Polystore Environments

Q2 SP2Bench, 100M triples, multiple self-joins

Communication optimization

ICS 16

Brandon Myers

Page 45: The Other HPC: High Productivity Computing in Polystore Environments

Graph Patterns

• SP2Bench, 100 million triples

• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler

• One of Myria’s supported back ends

• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems

• …plus PageRank, Naïve Bayes, and more

RADISH

ICS 16

Brandon Myers

Page 46: The Other HPC: High Productivity Computing in Polystore Environments

ICS 15

RADISH

ICS 16

Brandon Myers

Page 47: The Other HPC: High Productivity Computing in Polystore Environments

Recap

• Productivity is the new performance

• …but this doesn’t mean giving up orders of magnitude in performance by doing everything on one system

• Everything interesting is LA + RA

• There is no difference except syntax and systems

• We want to comprehensively optimize across them, generate code anywhere

Page 48: The Other HPC: High Productivity Computing in Polystore Environments

Other Productivity Work

• Workload Analytics for SQL Data Lakes
– Shrainik Jain

• AI for Scientific Data Curation
– Maxim Grechkin, Hoifung Poon (MSR)

• Visualization Recommendation
– Kanit “Ham” Wongsuphasawat, Dom Moritz, Jeff Heer

• Information Extraction from Scientific Figures
– Poshen Lee, Sean Yang

• Scalable Approximate Community Detection
– Seung-Hee Bae (Western Michigan)

Page 49: The Other HPC: High Productivity Computing in Polystore Environments

Workload Analytics for Data Lakes

The SQLShare Corpus: A multi-year log of hand-written SQL queries

Queries  24275
Views    4535
Tables   3891
Users    591

SIGMOD 2016

Shrainik Jain

https://uwescience.github.io/sqlshare

Page 50: The Other HPC: High Productivity Computing in Polystore Environments

lifetime = days between first and last access of table

SIGMOD 2016

Shrainik Jain

http://uwescience.github.io/sqlshare/

Data “Grazing”: Short dataset lifetimes
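The lifetime metric is simple to compute from an access log. The sketch below assumes a log of (table name, access date) pairs; the entries are invented for illustration and are not taken from the SQLShare corpus.

```python
from datetime import date

# Hypothetical access log: (table name, access date).
access_log = [
    ("samples", date(2014, 3, 1)),
    ("samples", date(2014, 3, 4)),
    ("genes", date(2014, 5, 10)),
    ("samples", date(2014, 3, 2)),
]

first, last = {}, {}
for table, day in access_log:
    first[table] = min(first.get(table, day), day)
    last[table] = max(last.get(table, day), day)

# lifetime = days between first and last access of table
lifetime_days = {t: (last[t] - first[t]).days for t in first}
print(lifetime_days)
```

A table touched on a single day gets a lifetime of 0, which is the kind of short-lived dataset the "grazing" observation refers to.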

Page 51: The Other HPC: High Productivity Computing in Polystore Environments
Page 52: The Other HPC: High Productivity Computing in Polystore Environments

Key idea: Embed queries as vectors

• Learn query embeddings; use them for all workload analytics tasks:

– Query recommendation

– Workload summarization / index selection

– User behavior modeling

– Predicting heavy hitters

– Forensics

• Get rid of specialized feature engineering

Page 53: The Other HPC: High Productivity Computing in Polystore Environments

Doc2Vec on SQL

Can we recover known patterns in the workload?

TPC-H queries, generated with different parameters

Page 54: The Other HPC: High Productivity Computing in Polystore Environments

Doc2Vec on Templatized Query Plans

Can we recover known patterns in the workload?

TPC-H queries, generated with different parameters

Page 55: The Other HPC: High Productivity Computing in Polystore Environments

Workload Summarization and Index Selection

Page 56: The Other HPC: High Productivity Computing in Polystore Environments

DEEP CURATION FOR SCIENTIFIC DATA LAKES

Page 57: The Other HPC: High Productivity Computing in Polystore Environments

Microarray experiments

Page 58: The Other HPC: High Productivity Computing in Polystore Environments

Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the bottleneck to data sharing

Maxim Grechkin

Hoifung Poon

Page 59: The Other HPC: High Productivity Computing in Polystore Environments

No growth in number of datasets used per paper!

Maxim Grechkin

Hoifung Poon

Page 60: The Other HPC: High Productivity Computing in Polystore Environments

Majority of samples are one-time-use only!

Maxim Grechkin

Hoifung Poon

Page 61: The Other HPC: High Productivity Computing in Polystore Environments

Can we curate algorithmically?

color = labels supplied as metadata

clusters = 1st two PCA dimensions on the gene expression data itself

The expression data and the text labels appear to disagree

Maxim Grechkin

Hoifung Poon

Page 62: The Other HPC: High Productivity Computing in Polystore Environments

[Figure: diagram. Labels shown: Free-text Metadata; Expression data; Domain knowledge (Ontology); 2 Deep Networks (text, expr); SVM; Better Tissue Type Labels.]

NIPS 18 (review)

Maxim Grechkin

Hoifung Poon

Page 63: The Other HPC: High Productivity Computing in Polystore Environments

Deep Curation

Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.

[Figure labels: Free-text classifier, Expression classifier]

NIPS 18 (review)

Maxim Grechkin

Hoifung Poon

Page 64: The Other HPC: High Productivity Computing in Polystore Environments

Deep Curation: Our stuff wins, with ZERO training data

[Figure: comparison against amount of training data used. Series: state of the art; our reimplementation of the state of the art; our dueling pianos NN.]

NIPS 18 (review)

Maxim Grechkin

Hoifung Poon

Page 65: The Other HPC: High Productivity Computing in Polystore Environments

Viziometrics: Analysis of Visualization in the Scientific Literature

[Figure: proportion of non-quantitative figures in paper, against paper impact grouped into 5% percentiles.]

Poshen Lee

Page 66: The Other HPC: High Productivity Computing in Polystore Environments

Voyager

Kanit “Ham” Wongsuphasawat, Dominik Moritz

InfoVis 15

Page 67: The Other HPC: High Productivity Computing in Polystore Environments

Scalable Graph Clustering

Seung-Hee Bae

Version 1: Parallelize best-known serial algorithm (ICDM 2013)

Version 2: Free 30% improvement for any algorithm (TKDD 2014)

Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)

Page 68: The Other HPC: High Productivity Computing in Polystore Environments

RESPONSIBLE DATA SCIENCE


Page 69: The Other HPC: High Productivity Computing in Polystore Environments

ProPublica, May 2016

Page 70: The Other HPC: High Productivity Computing in Polystore Environments

Philadelphia is grappling with the prospect of a racist computer algorithm

Technical.ly, September 2016

The Special Committee on Criminal Justice Reform's hearing of reducing the pre-trial jail population.

Any background signal in the data of institutional racism is

  amplified by the algorithm

  operationalized by the algorithm

  legitimized by the algorithm

“Should I be afraid of risk assessment tools?”

“No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?”

“And what’s the culture of policing in the neighborhood in which I grew up in?”

Page 71: The Other HPC: High Productivity Computing in Polystore Environments

Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)

Page 72: The Other HPC: High Productivity Computing in Polystore Environments

Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)

Page 73: The Other HPC: High Productivity Computing in Polystore Environments

Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)

Page 74: The Other HPC: High Productivity Computing in Polystore Environments

The way I think about this… (1)

First decade of Data Science research and practice:

What can we do with massive, noisy, heterogeneous datasets?

Next decade of Data Science research and practice:

What should we do with massive, noisy, heterogeneous datasets?

Page 75: The Other HPC: High Productivity Computing in Polystore Environments

The way I think about this… (2)

Decisions are based on two sources of information:

1. Past examples, e.g., “prior arrests tend to increase likelihood of future arrests”

2. Societal constraints, e.g., “we must avoid racial discrimination”

We’ve become very good at automating the use of past examples

We’ve only just started to think about incorporating societal constraints

8/7/2017 Data, Responsibly / SciTech NW 75

Page 76: The Other HPC: High Productivity Computing in Polystore Environments

The way I think about this… (3)

How do we apply societal constraints to algorithmic decision-making?

Option 1: Rely on human oversight

Ex: EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making

Ex: Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models

Issues with scalability, prejudice

Option 2: Build systems to help enforce these constraints

This is the approach we are exploring

Page 77: The Other HPC: High Productivity Computing in Polystore Environments

The way I think about this… (4)

On transparency vs. accountability:

• For human decision-making, sometimes explanations are required, improving transparency

– Supreme Court decisions

– Employee reprimands/termination

• But when transparency is difficult, accountability takes over

– medical emergencies, business decisions

• As we shift decisions to algorithms, we lose both transparency AND accountability

• “The buck stops where?”

Page 78: The Other HPC: High Productivity Computing in Polystore Environments

So what do we do about it?

Fides: A platform for responsible data science

joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]

Data Curation

novel features to support: Fairness, Accountability, Transparency, Privacy, Reproducibility

Page 79: The Other HPC: High Productivity Computing in Polystore Environments