Designing for self-serve science

Daniel Halperin

Upload: dhalperi

Post on 26-Jun-2015

Category: Data & Analytics


DESCRIPTION

I gave this talk at the UW Systems, Architecture, & Networking (SANE) retreat in May 2014. I argued that, as a community, big data system-builders may be great at building fast systems, but that these systems DO NOT serve the scientists we work with at the UW eScience Institute. I then offered a few ideas for how to build services that let scientists do their own work, thus "serving themselves".

TRANSCRIPT

Page 1: Designing for self-serve science

Designing for self-serve science

Daniel Halperin

Page 2: Designing for self-serve science

How much time “handling data” vs “doing science”?

Page 3: Designing for self-serve science

How much time “handling data” vs “doing science”?

90%

Page 4: Designing for self-serve science

“I sort both my spreadsheets on Gene ID, then I copy matches into a new one”
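The workflow quoted above (sort two spreadsheets on Gene ID, then copy the matches) is what a database would call an inner join. A minimal sketch of that idea, using Python's built-in sqlite3 module; the table names, column names, and values are invented for illustration and do not come from the talk.

```python
import sqlite3

# Hypothetical stand-ins for the two spreadsheets; all names and values are made up.
expression = [("g1", 2.5), ("g2", 0.1), ("g3", 7.8)]
annotation = [("g2", "kinase"), ("g3", "transporter"), ("g4", "unknown")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
conn.execute("CREATE TABLE annotation (gene_id TEXT, label TEXT)")
conn.executemany("INSERT INTO expression VALUES (?, ?)", expression)
conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation)

# The "sort both, copy matches" step expressed as one declarative query.
matches = conn.execute(
    "SELECT e.gene_id, e.level, a.label "
    "FROM expression AS e JOIN annotation AS a ON e.gene_id = a.gene_id "
    "ORDER BY e.gene_id"
).fetchall()

for row in matches:
    print(row)
```

Only genes present in both tables survive the join, which is the same result the manual copy step is meant to produce.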

Page 5: Designing for self-serve science

We are the problem

Page 6: Designing for self-serve science

[Bar chart: Benchmark 1 and Benchmark 2 results for Old system, Your system, and Our system; y-axis from 0 to 120]

Page 7: Designing for self-serve science

[Bar chart: Benchmark 1 and Benchmark 2 results for Old system, Your system, Our system, and What people use; y-axis from 0 to 10000]

Page 8: Designing for self-serve science

[Figure: Performance (y-axis) vs. Complexity (x-axis)]

Page 9: Designing for self-serve science

[Figure: Performance (y-axis) vs. Complexity (x-axis)]

Page 10: Designing for self-serve science

[Figure: Performance (y-axis) vs. Complexity (x-axis)]

Page 11: Designing for self-serve science

[Figure: Performance (y-axis) vs. Complexity (x-axis)]

Design for here

Page 12: Designing for self-serve science

What we build

What they need

Steve Jurvetson https://www.flickr.com/photos/jurvetson/7408464122

sutton-images.com http://biser3a.com/formula-1/f1-airboxes-all-you-need-to-know/

terms: http://sutton-images.com/terms.asp

Page 13: Designing for self-serve science

Lowering barrier to entry

Page 14: Designing for self-serve science

Developing a new language

• SQL: 3 great features for science
  • THE language of data management
  • We know how to scale it
  • Scientists can learn it

• MyriaL is better
  • Imperative & declarative: easy to write
  • Iteration & recursion
  • Lots of practical extensions

Page 15: Designing for self-serve science

Giving users insight

Page 16: Designing for self-serve science

Diagnosing problems

[Figure: Source node (y-axis) vs. Destination node (x-axis)]

Page 17: Designing for self-serve science

Automating the ‘CS parts’

• Do work on the user’s behalf:

(Ratul Mahajan’s Buffet Principle)

• Infer indexes and constraints!

• Aggressively reuse computation

• Speculatively apply queries to data

• Key enabler: science data is (mostly) read-only
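One way to read "aggressively reuse computation" together with the read-only observation: if the data never changes, a stored query result never goes stale, so it can be handed back without re-running the query. The sketch below illustrates that idea only. The `CachingDB` class, the `reads` table, and its values are hypothetical and are not taken from the talk or from any real system.

```python
import sqlite3


class CachingDB:
    """Hypothetical wrapper that reuses results of earlier queries.

    Assumes the underlying data is read-only, so cached results stay valid.
    """

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}      # (sql, params) -> list of result rows
        self.executions = 0  # how many queries actually reached the database

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        # Return a copy so callers cannot modify the cached rows.
        return list(self.cache[key])


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (sample TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [("a", 3), ("a", 4), ("b", 10)],
)

db = CachingDB(conn)
sql = "SELECT sample, SUM(n) FROM reads GROUP BY sample ORDER BY sample"
first = db.query(sql)   # runs against the database
second = db.query(sql)  # served from the cache
print(first, db.executions)
```

With mutable data this cache would need invalidation on every write; the read-only assumption is what makes the simple dictionary sufficient here.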

Page 18: Designing for self-serve science

Enable authoring & sharing

• “Autocomplete for science” - predict query snippets as users work. (Nodira Khoussainova)

• Natural language interface: queries → English questions → queries
  “Compute the fraction of CGs that are methylated in the oyster genome.”
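To make the mapping between an English question and a query concrete, here is a sketch of what a query behind the example question above might look like. The `cg_sites` table, its columns, and its four rows are invented for illustration; the talk does not describe the real schema, and actual methylation data is likely more involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per CG site, methylated stored as 0 or 1.
conn.execute("CREATE TABLE cg_sites (pos INTEGER, methylated INTEGER)")
conn.executemany(
    "INSERT INTO cg_sites VALUES (?, ?)",
    [(10, 1), (25, 0), (40, 1), (71, 1)],
)

# "Compute the fraction of CGs that are methylated":
# the mean of a 0/1 column is the fraction of rows equal to 1.
(fraction,) = conn.execute("SELECT AVG(methylated) FROM cg_sites").fetchone()
print(fraction)
```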

Page 19: Designing for self-serve science

Improve their state of the art

• “You just did in 1 minute what took me a week”

• “Replaced 100 lines of Python with 1 line of SQL”

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Page 20: Designing for self-serve science

Trust, but Verify (& Support)
