three tools for "human-in-the-loop" data science
![Page 1: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/1.jpg)
Aditya Parameswaran
Assistant Professor
University of Illinois
http://data-people.cs.illinois.edu
Three Tools for “Human-in-the-loop” Data Science
![Page 2: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/2.jpg)
2
Many many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (Spanning Illinois, UMD, MIT, Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
![Page 3: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/3.jpg)
3
Scale is a Solved Problem
Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets.
My claim: Scale is a solved problem.
Findings:
– Median job size at Microsoft and Yahoo is 16GB
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on large datasets!
Of course, exceptions exist: the “1%” of data analysis needs
![Page 4: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/4.jpg)
4
What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract value from even moderately sized datasets
From “Big Data and Its Technical Challenges”, CACM 2014:
For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly “absorb” the results of the analysis and not get lost in a sea of data.
![Page 5: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/5.jpg)
5
Need of the hour: Human-In-the-Loop Data Analytics Tools
HILDA tools:
• treat both humans and data as first-class citizens
• reduce human labor
• minimize complexity
[Diagram of three overlapping areas: Interaction, Data Mining, Databases. Annotations: “Taking the human perspective into account”; “Go beyond SQL”; “Scalability/Interactivity is still important”; “Magic happens here”.]
![Page 6: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/6.jpg)
6
A Maslow’s Hierarchy for HILDA
Background: Maslow developed a theory for what motivates individuals in 1943; highly influential
Complex Needs
Basic Needs
![Page 7: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/7.jpg)
7
A Maslow’s Hierarchy for HILDA
Share & Collaborate
Play & View
Touch & Feel
Increasing sophistication of analysis
![Page 8: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/8.jpg)
8
Touch and Feel:
DataSpread is a spreadsheet-database hybrid
Goal: Marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases
Enables the “99%” with large datasets but limited programming skills to open, touch, and examine their datasets
http://dataspread.github.io
[VLDB’15,VLDB’15,ICDE’16]
![Page 9: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/9.jpg)
9
Play and View:
Zenvisage is an effortless visual exploration tool.
Goal: “fast-forward” to visual patterns and trends, without having the analyst step through each one individually
Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.
http://zenvisage.github.io
[TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14]
![Page 10: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/10.jpg)
10
Collaborate and Share:
OrpheusDB is a tool for managing dataset versions with a database
Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis
Enables individuals to collaborate on data analysis, and share, keep track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15]
(also part of DataHub: a collaborative analysis system with MIT & UMD)
![Page 11: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/11.jpg)
11
This talk
About 10 minutes per system: overview + architecture + one key technical challenge
Common theme: if you torture databases enough, you can get them to do what you want!
Share & Collaborate
Play & View
Touch & Feel
Increasing sophistication of analysis
![Page 12: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/12.jpg)
12
![Page 13: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/13.jpg)
13
Motivation
Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
![Page 14: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/14.jpg)
14
But Spreadsheets are Terrible!
– Slow
• a single change can mean waiting minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not Powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, no collaboration/recovery
• using spreadsheets for collaboration is painful and error-prone
![Page 15: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/15.jpg)
15
Let’s turn to Databases
Databases are:
• Scalable (not slow)
• Powerful and expressive with SQL (not tedious or underpowered)
• Built for collaboration and recovery, and succinct (not brittle)
So why not use databases? For the same reasons that spreadsheets are so useful. Databases are:
• Not easy to use
• Lacking built-in visualization
• Not flexible
![Page 16: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/16.jpg)
16
Combining the benefits of spreadsheets and databases
Spreadsheet as a frontend interface
Databases as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
![Page 17: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/17.jpg)
17
Different Ideologies
Databases and spreadsheets have different ideologies that need to be reconciled…
Due to this, the integration is not trivial…

| Feature | Databases | Spreadsheets |
|---|---|---|
| Data Model | Schema-first | Dynamic/No Schema |
| Addressing | Tuples with PK | Cells, using Row/Col |
| Presentation | Set-oriented, no such notion | Notion of current window, order |
| Modifications | Must correspond to queries | Can be done at any granularity |
| Computation | Query at a time | Value at a time |
![Page 18: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/18.jpg)
18
First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables (Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples (Row #, Column #, Value)
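To make the two layouts concrete, here is a minimal Python sketch. The sheet is modeled as a dict from (row, column) to value, and the helper names `to_table`, `to_triples`, and `stored_values` are illustrative assumptions, not part of DataSpread. It simply counts how many values each layout would have to store.

```python
def to_table(cells, n_rows, n_cols):
    """Dense layout: one tuple per row, (row #, col1 val, col2 val, ...)."""
    return [
        (r,) + tuple(cells.get((r, c)) for c in range(1, n_cols + 1))
        for r in range(1, n_rows + 1)
    ]


def to_triples(cells):
    """Sparse layout: one (row #, column #, value) triple per filled cell."""
    return sorted((r, c, v) for (r, c), v in cells.items())


def stored_values(rows):
    """Count every stored field, including empty (None) slots."""
    return sum(len(row) for row in rows)


# A fully filled 2 x 2 sheet: the table layout is more compact.
dense = {(1, 1): "a", (1, 2): "b", (2, 1): "c", (2, 2): "d"}
dense_as_table = stored_values(to_table(dense, 2, 2))
dense_as_triples = stored_values(to_triples(dense))

# Two cells far apart on a 100 x 50 sheet: the triple layout is more compact.
sparse = {(1, 1): "a", (100, 50): "z"}
sparse_as_table = stored_values(to_table(sparse, 100, 50))
sparse_as_triples = stored_values(to_triples(sparse))
```

In this toy count the dense sheet costs 6 values as a table versus 12 as triples, while the sparse sheet costs 5100 versus 6, which is the trade-off the slide describes.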
![Page 19: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/19.jpg)
19
First Problem: Representation
Q: how do we represent spreadsheet data?
Can we do even better than the two extremes? Yes!
Carve out dense areas and store them as tables; store the remaining sparse areas as triples
![Page 20: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/20.jpg)
20
First Problem: Representation
However, even if we only use “tables”, carving out the ideal partitions (min. storage, modification, access cost) is NP-Hard
Reduction from min. edge-length partition of rectilinear polygons
Thankfully, we have a way out…
![Page 21: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/21.jpg)
21
Solution: Constrain the Problem
A new class of partitionings: recursive decomposition
A very natural class of partitionings!
![Page 22: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/22.jpg)
22
Solution: Constrain the Problem
The optimal recursive decomposition partitioning can be found in PTIME using dynamic programming (DP)
Still quadratic in the number of rows and columns
Optimization: merge rows/columns with identical signatures
Result: ~ the time for a single scan
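A minimal sketch of such a dynamic program, under stated assumptions: "recursive decomposition" is read here as repeatedly cutting a rectangle with one horizontal or vertical line, and the cost model (a fixed overhead per table plus one unit per cell slot, filled or not) is hypothetical, since the slides do not give the actual storage/modification/access costs. The row/column merging optimization is not shown.

```python
from functools import lru_cache

TABLE_OVERHEAD = 2  # hypothetical fixed cost of creating one table
CELL_COST = 1       # hypothetical cost per cell slot inside a table


def optimal_cost(grid):
    """Cheapest recursive decomposition of a 0/1 grid into tables."""
    n_rows, n_cols = len(grid), len(grid[0])

    # prefix[r][c] = number of filled cells in the first r rows and c columns
    prefix = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for r in range(n_rows):
        for c in range(n_cols):
            prefix[r + 1][c + 1] = (
                grid[r][c] + prefix[r][c + 1] + prefix[r + 1][c] - prefix[r][c]
            )

    def filled(r1, c1, r2, c2):
        return prefix[r2][c2] - prefix[r1][c2] - prefix[r2][c1] + prefix[r1][c1]

    @lru_cache(maxsize=None)
    def best(r1, c1, r2, c2):
        # Rectangle covers rows r1..r2-1 and columns c1..c2-1.
        if filled(r1, c1, r2, c2) == 0:
            return 0  # empty areas need no table
        # Option 1: store the whole rectangle as a single table.
        cost = TABLE_OVERHEAD + CELL_COST * (r2 - r1) * (c2 - c1)
        # Option 2: cut horizontally or vertically and solve both halves.
        for r in range(r1 + 1, r2):
            cost = min(cost, best(r1, c1, r, c2) + best(r, c1, r2, c2))
        for c in range(c1 + 1, c2):
            cost = min(cost, best(r1, c1, r2, c) + best(r1, c, r2, c2))
        return cost

    return best(0, 0, n_rows, n_cols)


# Two dense 2 x 2 blocks in opposite corners of a 4 x 4 sheet.
two_blocks = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
full_block = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

With these assumed costs, the two-block sheet is cheapest as two separate 2 x 2 tables, while the fully dense sheet is cheapest as a single table.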
![Page 23: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/23.jpg)
23
Initial Progress and Architecture
Postgres backend
ZK spreadsheet: open-source web frontend
Comfortably scales to arbitrarily many rows + handles SQL queries
Hopefully bring spreadsheets to the big data age!
1224560
![Page 24: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/24.jpg)
![Page 25: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/25.jpg)
25
Standard Visual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches desired visual pattern or insight
4. Repeat until you find a match
![Page 26: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/26.jpg)
26
Tedious and Time-consuming!
![Page 27: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/27.jpg)
27
Key Issue: Visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find desired visual patterns!
![Page 28: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/28.jpg)
28
Motivation
This is a real problem!
• Advertisers at Turn: find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center: find aspects on which two sets of genes differ
• Battery scientists at CMU: find solvents with desired properties
Common theme: finding the “right” visualization can take several hours of combing through visualizations manually.
![Page 29: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/29.jpg)
29
Key Insight
We can automate that!
• instead of combing through visualizations manually
• tell us what you want, and we can “fast-forward” to desired insights
Desiderata for automation:
• Expressive: the ability to specify what you want
• Interactive: interact with the results, catering to non-programmers
• Scalable: get interesting results quickly
Enter Zenvisage (zen + envisage: to effortlessly visualize)
![Page 30: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/30.jpg)
30
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients
• Drag-and-drop and sketch based interactions
  • to find specific patterns
• Sophisticated visual exploration language, ZQL
  • to ask more elaborate questions
• Scalable visualization generation engine
  • preprocess, batch and parallel eval. for interactive results
• Rapid pattern matching algorithms
  • sampling-based techniques
![Page 31: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/31.jpg)
31
Screenshots
[Screenshot of the Zenvisage interface; labeled panels: Attribute Selection, Sketching Canvas, Matches, Typical Trends and Outliers, ZQL: Advanced Exploration Interface]
![Page 32: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/32.jpg)
32
Screenshots
![Page 33: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/33.jpg)
Challenges: One Specific Instance
Find visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d, m, f] on a specific subset of data can be computed using a single SQL query.
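As an illustration of that last point, the sketch below turns one [d, m, f] triple plus a subset predicate into a single GROUP BY query and runs it on an in-memory SQLite database. The `sales` table, its columns, and the helper `viz_query` are assumptions made up for this example; the slides do not specify a schema or query template.

```python
import sqlite3


def viz_query(table, d, m, f, subset):
    """SQL for one [d, m, f] visualization on a subset (a WHERE predicate).

    Identifiers are interpolated directly, so this is only safe for
    trusted, hard-coded names as in this sketch.
    """
    return (
        f"SELECT {d}, {f}({m}) FROM {table} "
        f"WHERE {subset} GROUP BY {d} ORDER BY {d}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("x", 2014, 10.0),
        ("x", 2014, 20.0),
        ("x", 2015, 40.0),
        ("y", 2014, 5.0),
        ("y", 2015, 7.0),
    ],
)

# d = year (x axis), m = revenue (y axis), f = SUM, subset = product x
sql = viz_query("sales", d="year", m="revenue", f="SUM", subset="product = 'x'")
points = conn.execute(sql).fetchall()
```

Each returned row is one (x, y) point of the visualization for the chosen subset.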
33
![Page 34: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/34.jpg)
Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]:
  Compute visualization for both products (two SQL queries), then compare
Pick k best (“highest utility”) [d, m, f]
Utility Metric: We ignore how to compare for now, but there are many standard distance metrics
Scale: 10s of dimensions, 10s of measures, and a handful of aggregates → 100s of queries for a single user task!
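The naïve loop can be sketched as follows. The utility function used here (L1 distance between the two normalized series) is only one assumed choice among the standard distance metrics the slide mentions, and the `sales` table and helper names are hypothetical. The sketch also counts the queries issued, which shows how quickly the number grows.

```python
import sqlite3
from itertools import product


def series(conn, table, d, m, f, subset):
    """Run one [d, m, f] query on a subset and return {x value: y value}."""
    sql = f"SELECT {d}, {f}({m}) FROM {table} WHERE {subset} GROUP BY {d}"
    return dict(conn.execute(sql).fetchall())


def utility(a, b):
    """Assumed metric: L1 distance between the two normalized series."""
    total_a = sum(a.values()) or 1.0
    total_b = sum(b.values()) or 1.0
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) / total_a - b.get(k, 0.0) / total_b) for k in keys)


def naive_top_k(conn, table, dims, measures, funcs, subset_a, subset_b, k):
    """Score every [d, m, f] with two queries each and keep the k best."""
    scored, n_queries = [], 0
    for d, m, f in product(dims, measures, funcs):
        a = series(conn, table, d, m, f, subset_a)
        b = series(conn, table, d, m, f, subset_b)
        n_queries += 2
        scored.append((utility(a, b), (d, m, f)))
    scored.sort(reverse=True)
    return scored[:k], n_queries


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, year INTEGER, region TEXT, revenue REAL)"
)
rows = [
    ("x", 2014, "east", 10.0), ("x", 2014, "west", 10.0),
    ("x", 2015, "east", 40.0), ("x", 2015, "west", 40.0),
    ("y", 2014, "east", 25.0), ("y", 2014, "west", 25.0),
    ("y", 2015, "east", 25.0), ("y", 2015, "west", 25.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

top, n_queries = naive_top_k(
    conn, "sales", ["year", "region"], ["revenue"], ["SUM"],
    "product = 'x'", "product = 'y'", k=1,
)
```

In this toy data the two products differ by year but not by region, so the year view ranks first. Two dimensions already cost four queries; tens of dimensions and measures multiply that into hundreds.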
34
![Page 35: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/35.jpg)
Issues w/ Naïve Approach
• Repeated processing of same data in sequence across queries
• Computation wasted on low-utility visualizations
Sharing
Pruning
35
![Page 36: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/36.jpg)
Sharing Optimizations
1. Minimize # of queries: Group queries together
• Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) → (d1, [m1, m2], f1)
• Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) → ([d1, d2], m1, f1)
2. Minimize sequential execution: Parallel query evaluation
A bit tricky!
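A minimal sketch of the first rewrite only (combining multiple aggregates that share a dimension into one query). The function name `combine_aggregates` and the query template are assumptions for illustration; combining group-bys, which needs re-aggregation of the results, is not shown.

```python
from collections import defaultdict


def combine_aggregates(views, table, subset):
    """Group [d, m, f] views by dimension and emit one query per dimension."""
    by_dim = defaultdict(list)
    for d, m, f in views:
        if (m, f) not in by_dim[d]:  # skip duplicate aggregates
            by_dim[d].append((m, f))

    queries = {}
    for d, aggs in by_dim.items():
        select = ", ".join(f"{f}({m})" for m, f in aggs)
        queries[d] = (
            f"SELECT {d}, {select} FROM {table} WHERE {subset} GROUP BY {d}"
        )
    return queries


views = [
    ("year", "revenue", "SUM"),
    ("year", "units", "SUM"),
    ("region", "revenue", "AVG"),
]
queries = combine_aggregates(views, "sales", "product = 'x'")
```

Three views collapse into two queries here, because both `year` views can be answered in one scan with two aggregate columns.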
36
![Page 37: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/37.jpg)
Pruning Optimizations
• Keep running estimates of utility
• Prune visualizations based on estimates: Two flavors
– Vanilla Confidence Interval based Pruning
– Multi-armed Bandit Pruning
Discard low-utility views early to avoid wasted computation
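One way confidence-interval pruning could look, as a sketch under stated assumptions: utilities are scaled to [0, 1], every view's running estimate is a mean over the same number of samples `n`, and the interval is a Hoeffding bound. None of these specifics come from the slides, and the function names are hypothetical. A view is discarded once its upper bound falls below the k-th highest lower bound.

```python
import math


def half_width(n, delta=0.05):
    """Hoeffding half-width for a mean of n samples in [0, 1] (assumed bound)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def prune(estimates, n, k, delta=0.05):
    """Keep only views that could still be among the top k."""
    eps = half_width(n, delta)
    lower_bounds = sorted((u - eps for u in estimates.values()), reverse=True)
    threshold = lower_bounds[min(k, len(lower_bounds)) - 1]
    return {view: u for view, u in estimates.items() if u + eps >= threshold}


estimates = {"a": 0.90, "b": 0.85, "c": 0.20, "d": 0.10}
after_few_samples = prune(estimates, n=1, k=2)
after_many_samples = prune(estimates, n=1000, k=2)
```

With one sample the intervals are too wide to discard anything; after 1000 samples the two low-utility views are pruned and no further computation is spent on them.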
37
![Page 38: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/38.jpg)
38
[Architecture diagram; labels: Viz interface, Queries (100s), Middleware Layer, Optimizer, Sharing, Pruning, DBMS, Visualizations]
![Page 39: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/39.jpg)
Experimental Findings
Up to 300X speedup: <1s for SM, 4s for L
39
![Page 40: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/40.jpg)
40
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients
• Drag-and-drop and sketch based interactions
• Sophisticated visual exploration language, ZQL
• Scalable visualization generation engine
• Rapid pattern matching algorithms
![Page 41: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/41.jpg)
41
![Page 42: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/42.jpg)
42
Motivation
Collaborative data science is ubiquitous
• Many users, many versions of the same dataset stored at many stages of analysis
• Status quo:
– Stored in a file system, relationships unknown
Challenge: can we build a versioned data store?
– Support efficient access, retrieval, querying, and modification of versions
![Page 43: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/43.jpg)
43
Motivation: Starting Points
• VCS: Git/SVN are inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: Relational databases don’t support versioning, but are efficient and scalable
![Page 44: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/44.jpg)
44
OrpheusDB: Current Focus
PostgreSQL + Versioning Commands
![Page 45: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/45.jpg)
45
Challenge: Storing Versions Compactly/Retrieving Versions Quickly
1000s of versions, spanning millions of records.
Store all versions independently: huge storage, version access time is very small
Store one version, all others via chains of “deltas”: very small storage, version access time is high
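The two extremes can be sketched with versions modeled as sets of record ids. This is a toy model with hypothetical helper names, not OrpheusDB's storage format: storage is counted in records, and access cost is the number of deltas that must be replayed.

```python
def full_copies(versions):
    """Extreme 1: materialize every version independently."""
    return [set(v) for v in versions]


def delta_chain(versions):
    """Extreme 2: store the first version plus (added, removed) deltas."""
    base = set(versions[0])
    deltas, prev = [], base
    for v in versions[1:]:
        cur = set(v)
        deltas.append((cur - prev, prev - cur))
        prev = cur
    return base, deltas


def checkout(base, deltas, i):
    """Rebuild version i by replaying the first i deltas on the base."""
    records = set(base)
    for added, removed in deltas[:i]:
        records = (records - removed) | added
    return records


def storage_full(copies):
    return sum(len(c) for c in copies)


def storage_delta(base, deltas):
    return len(base) + sum(len(a) + len(r) for a, r in deltas)


versions = [{1, 2, 3}, {1, 2, 3, 4}, {1, 3, 4, 5}]
copies = full_copies(versions)
base, deltas = delta_chain(versions)
latest = checkout(base, deltas, 2)
```

Full copies store 11 records but return any version directly; the delta chain stores 6 but has to replay two deltas to reach the latest version. The chain grows with the number of versions, which is why access time becomes high.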
![Page 46: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/46.jpg)
46
And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where the average(salary) is greater than 1000
• Find all pairs of versions where over 100 new tuples were added
• Show the history of the tuple with record id 34.
For more examples, see [TAPP’15]
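Two of these queries can be sketched in plain SQL if versions are exposed as relations. The schema below (a `records` table plus a `membership` table mapping version ids to record ids) is a hypothetical layout chosen for illustration, not necessarily the one OrpheusDB uses, and "first version" assumes version ids are assigned in creation order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (rid INTEGER PRIMARY KEY, name TEXT, salary REAL);
    CREATE TABLE membership (vid INTEGER, rid INTEGER);
    """
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "ann", 500.0), (2, "bob", 1500.0), (3, "cy", 3000.0)],
)
# Version 1 = {1, 2}, version 2 = {2, 3}, version 3 = {1, 2, 3}
conn.executemany(
    "INSERT INTO membership VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)],
)

# Find versions where the average(salary) is greater than 1000.
high_avg_versions = [
    vid
    for (vid,) in conn.execute(
        """
        SELECT vid
        FROM membership JOIN records USING (rid)
        GROUP BY vid
        HAVING AVG(salary) > 1000
        ORDER BY vid
        """
    )
]

# Retrieve the first version that contains the tuple with record id 3.
first_version_with_3 = conn.execute(
    "SELECT MIN(vid) FROM membership WHERE rid = ?", (3,)
).fetchone()[0]
```

Because the version information lives in ordinary tables, the backing database needs no knowledge of versioning to answer these queries.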
![Page 47: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/47.jpg)
47
Framework
• User Interface Layer: git commands, or SQL (versions as relations)
• “Versioning” Layer (translation/bookkeeping): Parser & Translator, Layout Optimizer
• DBMS: Unmodified Postgres Backend (not aware of versions)
![Page 48: Three Tools for "Human-in-the-loop" Data Science](https://reader035.vdocuments.net/reader035/viewer/2022070522/58ee33bd1a28abcd0c8b46a3/html5/thumbnails/48.jpg)
48
Summary: Make Data Analytics Great Again!
• Share & Collaborate: orpheus-db.github.io
• Play & View: zenvisage.github.io
• Touch & Feel: dataspread.github.io
Increasing sophistication of analysis
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp