meaningful visual exploration of massive datammds-data.org/presentations/2016/s-wang.pdf · bokeh...
TRANSCRIPT
![Page 1: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/1.jpg)
© 2016 Continuum Analytics
Meaningful Visual Exploration of Massive Data
Peter Wang Continuum Analytics CTO & Co-Founder
@pwang
![Page 2: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/2.jpg)
Big data magnifies small problems
2
• Big data presents storage and computation problems • More importantly, standard plotting tools have problems that are
magnified by big data: • Overdrawing/Overplotting • Saturation • Undersaturation • Binning issues
![Page 3: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/3.jpg)
Overdrawing
3
• For a scatterplot, the order in which points are drawn is very important
• The same distribution can look entirely different depending on plotting order
• Last data plotted overplots
![Page 4: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/4.jpg)
Overdrawing
4
• Underlying issue is just occlusion
• Same problem happens with one category, but less obvious
• Can prevent occlusion using transparency
![Page 5: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/5.jpg)
Saturation
5
• E.g. for alpha = 0.1, up to 10 points can overlap before saturating the available brightness
• Now the order of plotting matters less
• After 10 points, first-plotted data still lost
• For one category, 10, 20, or 2000 points overlapping will look identical
![Page 6: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/6.jpg)
Saturation
6
• Same alpha value, more points:
• Now is highly misleading
• alpha value depends on size, overlap of dataset
• Difficult-to-set parameter, hard to know when data is misrepresented
![Page 7: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/7.jpg)
Saturation
7
• Can try to reduce point size to reduce overplotting and saturation
• Now points are hard to see, with no guarantee of avoiding problems
• Another difficult-to-set parameter
• For really big data, scatterplots start to become very inefficient, because there are many datapoints per pixel — may as well be binning by pixel
![Page 8: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/8.jpg)
Binning issues
8
• Can use heatmap instead of scatter
• Avoids saturation by auto-ranging on bins
• Result independent of data size
• Here two merged normal distributions look very different at different binning
• Another difficult-to-set parameter
![Page 9: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/9.jpg)
Notebook
9
![Page 10: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/10.jpg)
Real world example
10
![Page 11: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/11.jpg)
So what?
11
![Page 12: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/12.jpg)
Plotting big data
12
• When exploring really big data, the visualization is all you have — there’s no way to look at each of the individual data points
• Common plotting problems can lead to completely incorrect conclusions based on misleading visualizations
• Slow processing makes trial and error approach ineffective
When data is large, you don’t know when the viz is lying.
![Page 13: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/13.jpg)
Datashading
13
• Flexible, configurable pipeline for automatic plotting • Provides flexible plugins for viz stages, like in graphics shaders • Completely prevents overplotting, saturation, and undersaturation • Mitigates binning issues by providing fully interactive exploration in web
browsers, even of very large datasets on ordinary machines • Statistical transformations of data are a first-class aspect of the
visualization • Allows rapid iteration of visual styles & configs, interactive selections and
filtering, to support data exploration
![Page 14: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/14.jpg)
Datashading Pipeline: Projection
Data
Project / Synthesize
Scene
• Stage 1: select variables (columns) to project onto the screen
• Data often filtered at this stage
14
![Page 15: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/15.jpg)
Datashading Pipeline: Aggregation
Data
Project / Synthesize
Scene Aggregates
Sample / Raster
• Stage 2: Aggregate data into a fixed set of bins
• Each bin yields one or more scalars (total count, mean, stddev, etc.)
15
![Page 16: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/16.jpg)
Datashading Pipeline: Transfer
Data
Project / Synthesize
Scene Aggregates
Sample / Raster Transfer
Image
• Stage 3: Transform data using one or more transfer functions, culminating in a function that yields a visible image
• Each stage can be replaced and configured separately
16
![Page 17: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/17.jpg)
Infovis Pipeline
17
Visual Abstraction
DataTransforms
VisualMappings
ViewTransforms
Data Tables
Source Data Views
![Page 18: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/18.jpg)
Infovis Pipeline… modified
18
Visual Abstraction
DataTransforms
VisualMappings
ViewTransforms
Data Tables
Source Data Views
Selection Aggregation Transfer
SignificantSet Aggregates
![Page 19: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/19.jpg)
NYC Taxi DemoThis demo shows how traditional plotting tools break down for large datasets, and how to use datashading to make even large datasets practical interactively.
19
• Data for 10 million New York City taxi trips
• Even 100,000 points gets slow for scatterplot
• Parameters usually need adjusting for every zoom
• True relationships within data not visible in std plot
Datashading automatically reveals the entire dataset, including outliers, hot spots, and missing data
![Page 20: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/20.jpg)
Categorical data: 2010 US Census
20
• One point per person
• 300 million total • Categorized by
race • Datashading
shows faithful distribution per pixel
![Page 21: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/21.jpg)
OSM Dataset: 3 Billion PointsBecause Datashader decouples the data-processing from the visualization, it can handle arbitrarily large data
21
• About 3 billion GPS coordinates
• https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/.
• This image was rendered in one minute on a standard MacBook with 16 GB RAM
• Renders in 7 seconds on a 128GB Amazon EC2 instance
![Page 22: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/22.jpg)
Additional
• Timeseries • Trajectory • Race & Elevation • API / Pipeline walkthrough
22
![Page 23: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/23.jpg)
Bokeh
• Interactive visualization • Novel graphics • Streaming, dynamic, large data • For the browser, with or without a server • No need to write Javascript
23
http://bokeh.pydata.org
![Page 25: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/25.jpg)
Linked plots, tools
25
• Easy to show multiple plots and link them • Easy to link data selections between plots • Can easily customize the kind of linkage straight from
Python, without needing to fiddle around with JS
![Page 26: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/26.jpg)
Large data
• With easy WebGL support, can scale to 500k points or so
• Bottlenecks are browser performance, JSON encoding, network transport
26
![Page 27: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/27.jpg)
rBokehPlays well with R ecosystem: HTMLwidget, RMarkdown…
27
http://hafen.github.io/rbokeh
![Page 28: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/28.jpg)
rBokeh with RStudio & ShinyPlays well with R ecosystem: HTMLwidget, RMarkdown…
28
![Page 29: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/29.jpg)
Bokeh Apps: Shiny for Python
• Fully interactive data web apps • Streaming data, dynamic data • Easy-to-write pure Python charts, widgets, event
handlers • Open source (BSD licensed), including server • Enterprise on-prem version in Anaconda Enterprise,
with Active Directory/LDAP auth
29
![Page 30: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/30.jpg)
Example Apps
30
![Page 31: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/31.jpg)
Easy Streaming Apps
In this demo, we will demonstrate how the Bokeh server makes it easy to visualize streaming and dynamic data.
31
• A minimal example with < 50 LOC • Demonstrates ease of pushing data
from Python code into the browser
![Page 34: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/34.jpg)
For more information on Bokeh Apps
• Webinar: http://www.slideshare.net/continuumio/hassle-free-data-science-apps-with-bokeh-webinar
• PyData Videos, Tutorials
34
![Page 35: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/35.jpg)
Community & AdoptionGithub • 4100+ stars • 860+ forks
Mailing list • 400+ members • 150+ posts in November
Downloads • 45,000 / month (conda) • 4,000 / month (pip)
![Page 36: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/36.jpg)
Architecture
![Page 37: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/37.jpg)
37Server-side Data Processing: Python, Java, etc.
HTML
Javascript
D3 Highcharts Flot nvd3 dcjs
JavaScript Plotting library
CSV, SQL
Data
Traditional Web Visualization
CSSTech: • Python/R/Java • HTML & browser compat • CSS/LESS/Sass • JS plotting library API • Javascript
• jQuery, underscore
• svg, canvas2D • webGL, three.js • React • Angular • node.js,
browserify, gulp, grunt, npm, …
![Page 38: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/38.jpg)
Browser
HTML
38
HTML
CSSJavascript
User
Data
Python, Ruby, Java, .NET
Server
Traditional Web Viz - Interaction
Javascript
Javascript
Data’
Simple dashboard: Server language generating HTML, JS, CSS styling, subset of data
Handling user interaction: Custom Javascript, calling Server endpoint, which generates updated JSON or JS that gets pushed back to client via websocket
![Page 39: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/39.jpg)
Server
Bokeh BokehJS
JSON
(HTML, CSS)
Client
Bokeh Conceptual Architecture
UserPython, R,
Scala
Data
Simple dashboard: Single language, no need to write HTML, JS, CSS
Handling user interaction: Single language that you already know; interactive data updates feel seamless to the user
![Page 40: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/40.jpg)
• Skills required: 5-10 skills • Time to market: weeks to months • Server code: 100s to 1000s lines
• Skills required: ~1 skill • Time to market: minutes • Server code: 0
Client
Data
BokehJSPython, RBokeh
Server
Python, Ruby Java, .NET
Data
Client
CSSData
Comparison Chart
![Page 41: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic](https://reader030.vdocuments.net/reader030/viewer/2022040306/5ec4631f3afddc3080328562/html5/thumbnails/41.jpg)
Thank you
Email: [email protected]
Twitter: @ContinuumIO
Peter WangTwitter: @pwang
Bokeh
Twitter: @bokehplots