Designing for self-serve science

Daniel Halperin

How much time “handling data” vs “doing science”?

90%

“I sort both my spreadsheets on Gene ID, then I copy matches into a new one”
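The manual workflow in that quote is a join on Gene ID. A minimal sketch of the same step as one SQL query, run through Python's built-in `sqlite3` module; the table names, column names, and values are invented for illustration and stand in for the two spreadsheets:

```python
import sqlite3

# Hypothetical stand-ins for the two spreadsheets; all values are invented.
expression = [("g1", 2.5), ("g2", 0.4), ("g3", 1.1)]
annotation = [("g2", "kinase"), ("g3", "transporter"), ("g4", "unknown")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
conn.execute("CREATE TABLE annotation (gene_id TEXT, label TEXT)")
conn.executemany("INSERT INTO expression VALUES (?, ?)", expression)
conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation)

# The join replaces "sort both sheets, then copy the matches by hand".
matches = conn.execute(
    """
    SELECT e.gene_id, e.level, a.label
    FROM expression AS e
    JOIN annotation AS a ON e.gene_id = a.gene_id
    ORDER BY e.gene_id
    """
).fetchall()

print(matches)  # [('g2', 0.4, 'kinase'), ('g3', 1.1, 'transporter')]
```

Only genes present in both tables survive the join, which is exactly the "copy matches into a new one" step.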

We are the problem

[Figure: bar chart of Benchmark 1 and Benchmark 2 for Old system, Your system, and Our system; axis ticks 0 to 120]

[Figure: bar chart of Benchmark 1 and Benchmark 2 for Old system, Your system, Our system, and What people use; axis ticks 0 to 10000]

[Figure: Performance vs. Complexity]

Design for here

What we build What they need

Steve Jurvetson https://www.flickr.com/photos/jurvetson/7408464122

sutton-images.com http://biser3a.com/formula-1/f1-airboxes-all-you-need-to-know/

terms: http://sutton-images.com/terms.asp

Lowering barrier to entry

Developing a new language

• SQL: 3 great features for science
  • THE language of data management!
  • We know how to scale it
  • Scientists can learn it

• MyriaL is better
  • Imperative & declarative: easy to write
  • Iteration & recursion!
  • Lots of practical extensions
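To make the "Iteration & recursion" bullet concrete, here is a small sketch of a computation that needs a loop: graph reachability, computed by repeating a join-like step until nothing new appears. It is written in Python, not MyriaL (the slides do not show MyriaL syntax), and the function name and sample edges are invented for illustration.

```python
def reachable(edges, start):
    """Return every node reachable from start by following edges."""
    reached = {start}
    frontier = {start}
    while frontier:
        # One join-like step: follow every edge leaving the current frontier.
        next_nodes = {dst for (src, dst) in edges if src in frontier}
        # Keep only nodes not seen before; the loop stops when there are none.
        frontier = next_nodes - reached
        reached |= frontier
    return reached

# Hypothetical edge list: a -> b -> c -> a forms a cycle, d -> e is separate.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
result = reachable(edges, "a")
print(sorted(result))  # ['a', 'b', 'c']
```

The number of passes depends on the data, which is why this kind of query calls for iteration rather than a fixed chain of joins.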

Giving users insight

Diagnosing problems

[Figure: axes: Source node, Destination node]

Automating the ‘CS parts’

• Do work on the user’s behalf (Ratul Mahajan’s Buffet Principle):

• Infer indexes and constraints!

• Aggressively reuse computation

• Speculatively apply queries to data

• Key enabler: science data is (mostly) read-only
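One way to picture "aggressively reuse computation" on read-only data: if the data never changes, a result computed once stays valid, so repeated queries can be answered from a cache. The sketch below assumes that read-only property; the `CachingQueryRunner` class, the table, and the values are hypothetical, and it is not a description of how any particular system does this.

```python
import sqlite3

class CachingQueryRunner:
    """Hypothetical helper: reuses results when the same query text is run again."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}       # query text -> result rows
        self.executions = 0   # how many times the database was actually hit

    def run(self, sql):
        key = sql.strip()  # only trims surrounding whitespace; no real normalization
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = self.conn.execute(key).fetchall()
        # Safe only because the data is assumed read-only; writes would need invalidation.
        return self.cache[key]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (sample TEXT, n_reads INTEGER)")
conn.executemany("INSERT INTO reads VALUES (?, ?)", [("s1", 10), ("s2", 32)])

runner = CachingQueryRunner(conn)
first = runner.run("SELECT SUM(n_reads) FROM reads")
second = runner.run("SELECT SUM(n_reads) FROM reads  ")  # same query, trailing spaces
print(first, second, runner.executions)  # [(42,)] [(42,)] 1
```

With mutable data the cache would need invalidation on every write, which is why the read-only property matters.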

Enable authoring & sharing

• “Autocomplete for science” - predict query snippets as users work. (Nodira Khoussainova)

• Natural language interface: queries → English questions → queries
  Example: “Compute the fraction of CGs that are methylated in the oyster genome.”
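A toy illustration of the "autocomplete for science" idea: rank candidate snippets by how often they appeared alongside what the user has typed so far. This is a simple co-occurrence count over an invented query log, not the method behind the cited work; `suggest`, `query_log`, and the snippets are all hypothetical.

```python
from collections import Counter

# Hypothetical query log: each past query reduced to a set of snippets.
query_log = [
    {"FROM genes", "WHERE organism = 'oyster'", "GROUP BY chromosome"},
    {"FROM genes", "WHERE organism = 'oyster'"},
    {"FROM genes", "WHERE organism = 'oyster'", "GROUP BY chromosome"},
    {"FROM samples", "WHERE depth > 10"},
]

def suggest(partial, log, k=2):
    """Rank snippets by how often they co-occur with the snippets typed so far."""
    partial = set(partial)
    counts = Counter()
    for past in log:
        # Only past queries that contain everything typed so far are relevant.
        if partial <= past:
            counts.update(past - partial)
    return [snippet for snippet, _ in counts.most_common(k)]

suggestions = suggest({"FROM genes"}, query_log)
print(suggestions)
```

Here the `WHERE` snippet ranks first because it co-occurs with `FROM genes` in three past queries, against two for `GROUP BY chromosome`.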

Improve their state of the art

• “You just did in 1 minute what took me a week”

• “Replaced 100 lines of Python with 1 line of SQL”

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Trust, but Verify (& Support)
