sparkly notebook: interactive analysis and visualization with spark
![Page 1: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/1.jpg)
SPARKLY NOTEBOOK: INTERACTIVE ANALYSIS AND VISUALIZATION WITH SPARK
FELIX CHEUNG
APRIL 2015
http://www.meetup.com/seattle-spark-meetup/events/208711962/
![Page 2: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/2.jpg)
SETUP
• Spark on CDH cluster
• Vagrant - 2-nodes - custom provisioning
![Page 3: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/3.jpg)
AGENDA
• IPython + PySpark cluster
• Zeppelin
• Spark’s Streaming k-means
• Lightning
![Page 4: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/4.jpg)
![Page 5: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/5.jpg)
SPARK - 10 SEC INTRODUCTION
• Spark
• Spark SQL + Data Frame + data source
• Spark Streaming
• MLlib
• GraphX
![Page 6: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/6.jpg)
It’s a lot of time spent looking at data...
![Page 7: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/7.jpg)
REPL
• Read-Eval-Print-Loop
![Page 8: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/8.jpg)
A set of REPLs related to Spark…
![Page 9: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/9.jpg)
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
15/04/15 11:31:28 INFO SparkILoop: Created spark context..
Spark context available as sc.
scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> a.collect.foreach(x => println(x))
1
2
3
4
![Page 10: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/10.jpg)
GOOD
• See results instantly
![Page 11: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/11.jpg)
NOT SO GOOD
• Ok as an IDE
• No Save / Repeat
• No visualization
![Page 12: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/12.jpg)
NOTEBOOK
![Page 13: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/13.jpg)
![Page 14: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/14.jpg)
IPYTHON/JUPYTER
“IPython” http://ipython.org/
• interactive shell
• browser-based notebook
• 'Kernel'
• great support for visualization library (eg. matplotlib)
• built on pyzmq, tornado
Jupyter: IPython will continue to exist as a Python kernel for Jupyter, but the notebook and other language-agnostic parts of IPython will move to new projects under the Jupyter name. IPython 3.0 will be the last monolithic release of IPython.
![Page 15: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/15.jpg)
IPYTHON NOTEBOOK
NOTEBOOK == BROWSER-BASED REPL
IPython Notebook is a web-based interactive computational environment for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media.
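To make the "JSON document with an ordered list of cells" idea concrete, here is a minimal sketch that builds a notebook-shaped dictionary by hand and serializes it. The field names follow the nbformat 4 layout as I understand it; the cell contents are made up for illustration and are not from the talk.

```python
import json

# A hand-built, notebook-shaped document (nbformat 4 layout).
# The cell contents below are illustrative, not taken from the talk.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Word count\n", "Some *rich* text."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(1 + 1)"],
        },
    ],
}

# Serialize to the JSON text that would be stored in an .ipynb file.
text = json.dumps(notebook, indent=1)

# Reading it back preserves the order of the cells.
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Because the file is plain JSON, a notebook can be saved, diffed and shared like any other text file, which is what the REPL was missing.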
![Page 16: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/16.jpg)
MATPLOTLIB
matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc, with just a few lines of code, with familiar MATLAB APIs.
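import matplotlib.pyplot as plt
people = ['Tom', 'Dick', 'Harry', 'Slim', 'Jim']  # placeholder data, not from the slide
y_pos = list(range(len(people)))
performance = [8.0, 5.5, 11.0, 4.0, 9.5]
error = [0.5, 0.3, 0.8, 0.2, 0.6]
plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
plt.yticks(y_pos, people)
plt.xlabel('Performance')
plt.title('How fast do you want to go today?')
plt.show()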
![Page 17: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/17.jpg)
PYSPARK
• Spark on Python; PySpark serves as the kernel, integrating with IPython
• Each notebook spins up a new instance of the kernel (i.e. PySpark running as the Spark driver, in whichever deploy mode Spark/PySpark supports)
![Page 18: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/18.jpg)
(All notebook examples shown are a subset of those from the Meetup, reconstructed here)
![Page 19: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/19.jpg)
Markdown
![Page 20: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/20.jpg)
Spark in Python
![Page 21: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/21.jpg)
Source: http://nbviewer.ipython.org/github/ResearchComputing/scientific_computing_tutorials/blob/master/spark/02_word_count.ipynb
![Page 22: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/22.jpg)
![Page 23: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/23.jpg)
WORD2VEC EXAMPLE
Word2Vec computes distributed vector representations of words. Distributed vector representations have been shown to be useful in many natural language processing applications such as named entity recognition, disambiguation, parsing, tagging and machine translation.
https://code.google.com/p/word2vec/
Spark MLlib implements the Skip-gram approach. With Skip-gram we want to predict a window of words given a single word.
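As a rough illustration of "predict a window of words given a single word", the sketch below only shows how (center word, context word) training pairs can be formed from a sentence. It is not MLlib's implementation; `skipgram_pairs`, `window` and the sample sentence are hypothetical names and data chosen for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


sentence = "spark makes big data simple".split()
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```

A Skip-gram model is then trained so that, given the first element of each pair, it assigns high probability to the second.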
![Page 24: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/24.jpg)
WORD2VEC DATASET
Wikipedia dump http://mattmahoney.net/dc/textdata
grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
then randomly sampled to ~200k lines
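For readers without grep at hand, the same preparation can be sketched in Python: the pattern cuts the text into chunks of at most 16 words, and a random subset of those chunks is kept. The talk does not say how the sampling was done, so `sample_lines`, its seed and the dummy text are assumptions for illustration.

```python
import random
import re

# Same pattern as the grep command: one word followed by up to 15 more.
CHUNK = re.compile(r"\w+(?:\W+\w+){0,15}")


def split_into_lines(text):
    """Return non-overlapping chunks of at most 16 words each."""
    return [m.group(0) for m in CHUNK.finditer(text)]


def sample_lines(lines, n, seed=0):
    """Pick n lines at random without replacement (n is capped at len(lines))."""
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))


text = " ".join("w%d" % i for i in range(40))  # 40 dummy words
lines = split_into_lines(text)
words_per_line = [len(line.split()) for line in lines]
sampled = sample_lines(lines, 2)
print(words_per_line)
```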
![Page 25: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/25.jpg)
![Page 26: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/26.jpg)
![Page 27: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/27.jpg)
MORE VISUALIZATIONS
• matplotlib: http://matplotlib.org
• Seaborn: http://stanford.edu/~mwaskom/software/seaborn/
• Bokeh: http://bokeh.pydata.org/en/latest/
![Page 28: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/28.jpg)
SETUP
To set up IPython:
• Python 2.7.9 (separate from CentOS default 2.6.6), on all nodes
• matplotlib, on the host running IPython
To run IPython with the PySpark kernel, set these in the environment (please check out my handy script on github):
• PYSPARK_PYTHON: command to run python, e.g. “python2.7”
• PYSPARK_DRIVER_PYTHON: command to run ipython
• PYSPARK_DRIVER_PYTHON_OPTS: “notebook --profile”
• PYSPARK_SUBMIT_ARGS: pyspark command line, e.g. --master, --deploy-mode
• YARN_CONF_DIR: if YARN mode
• LD_LIBRARY_PATH: for matplotlib
![Page 29: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/29.jpg)
IPYTHON/JUPYTER KERNELS
• IPython
• IGo
• Bash
• IR
• IHaskell
• IMatlab
• ICSharp
• IScala
• IRuby
• IJulia
... and more: https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
![Page 30: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/30.jpg)
ZEPPELIN
![Page 31: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/31.jpg)
Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful interactive web-based interface, data visualization, a collaborative work environment and many other nice features to make your data analytics more fun and enjoyable.
Zeppelin has been incubating since Dec 2014.
https://zeppelin.incubator.apache.org/
![Page 32: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/32.jpg)
![Page 33: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/33.jpg)
shell script & calling library package
Load and process data with Spark
![Page 34: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/34.jpg)
SQL query powered by Spark SQL - progress & parameterization via dynamic form
![Page 35: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/35.jpg)
Python & data passing across languages (interpreters)
![Page 36: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/36.jpg)
ZEPPELIN ARCHITECTURE
Realtime collaboration - enabled by websocket communications
• Frontend: AngularJS
• Backend server: Java
• Interpreters: Java
• Visualization: NVD3
![Page 37: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/37.jpg)
INTERPRETERS
• Spark group
• Spark (Scala)
• PySpark
• Spark SQL
• Dependency
• Markdownjs
• Shell
• Hive
• Coming: jdbc, Tajo, etc.
![Page 38: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/38.jpg)
CLUSTERING
• Clustering tries to find natural groupings in data. It puts objects into groups in which those within a group are more similar to each other than to those in other groups.
• Unsupervised learning
![Page 39: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/39.jpg)
K-MEANS
• First, given an initial set of k cluster centers, we find which cluster each data point is closest to
• Then, we compute the average of each of the new clusters and use the result to update our cluster centers
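The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up 2-D points, not the MLlib implementation; `assign` and `update` are names chosen for this example.

```python
def assign(points, centers):
    """Step 1: index of the closest center for every point."""
    def sqdist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centers)), key=lambda k: sqdist(p, centers[k]))
            for p in points]


def update(points, labels, centers):
    """Step 2: move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:  # keep an empty cluster where it was
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


# Made-up data: two obvious groups and one initial center near each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):  # alternate the two steps a few times
    labels = assign(points, centers)
    centers = update(points, labels, centers)
print(labels, centers)
```

Repeating the two steps until the centers stop moving is the whole algorithm; everything else is about choosing good starting centers and deciding when to stop.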
![Page 40: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/40.jpg)
![Page 41: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/41.jpg)
K-MEANS|| IN MLLIB
• a parallelized variant of k-means++
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Parameters:
• k is the number of desired clusters.
• maxIterations is the maximum number of iterations to run.
• initializationMode specifies either random initialization or initialization via k-means||.
• runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
• initializationSteps determines the number of steps in the k-means|| algorithm.
• epsilon determines the distance threshold within which we consider k-means to have converged.
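To show what k, maxIterations, runs and epsilon control, here is a small pure-Python sketch. It is not MLlib code and does not call Spark: it uses random initialization only, so initializationMode and initializationSteps (the k-means|| part) are not modeled, and the function names, seed and data are assumptions for illustration.

```python
import random


def sqdist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_once(points, k, max_iterations, epsilon, rng):
    """One run of k-means from a random initialization."""
    centers = rng.sample(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sqdist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        moved = max(sqdist(a, b) ** 0.5 for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= epsilon:  # converged: no center moved more than epsilon
            break
    cost = sum(min(sqdist(p, c) for c in centers) for p in points)
    return centers, cost


def kmeans(points, k, max_iterations=20, runs=5, epsilon=1e-4, seed=42):
    """Repeat k-means `runs` times and keep the lowest-cost clustering."""
    rng = random.Random(seed)
    results = [kmeans_once(points, k, max_iterations, epsilon, rng)
               for _ in range(runs)]
    return min(results, key=lambda result: result[1])


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, cost = kmeans(points, k=2)
print(sorted(centers), cost)
```

The cost used to pick the best run is the sum of squared distances from each point to its nearest center, which is the quantity k-means tries to minimize.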
![Page 42: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/42.jpg)
CASE STUDY: K-MEANS - ZEPPELIN
![Page 43: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/43.jpg)
ANOMALY DETECTION WITH K-MEANS
• Using Spark DataFrame, csv data source, to process KDDCup’99 data
• Scoring with different k values
Details on github at: http://bit.ly/1JWOPh8
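The slide does not spell out how scoring was done. One common approach, assumed here, is to use the distance from a point to its nearest cluster center: averaged over all points it gives a score for comparing different k values, and individual points far from every center are treated as anomalies. The centers, points and threshold below are made up.

```python
import math


def anomaly_score(point, centers):
    """Distance from a point to its nearest cluster center."""
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
               for c in centers)


def clustering_score(points, centers):
    """Mean distance to the nearest center; lower means tighter clusters."""
    return sum(anomaly_score(p, centers) for p in points) / len(points)


def flag_anomalies(points, centers, threshold):
    """Points that lie farther than `threshold` from every center."""
    return [p for p in points if anomaly_score(p, centers) > threshold]


centers = [(0.0, 0.0), (10.0, 10.0)]  # pretend these came from k-means
points = [(0.5, 0.0), (10.0, 9.0), (5.0, 5.0)]
scores = [anomaly_score(p, centers) for p in points]
flagged = flag_anomalies(points, centers, threshold=3.0)
print(scores, flagged)
```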
![Page 44: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/44.jpg)
COMING SOON (NOW!)
![Page 45: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/45.jpg)
Realtime updates
![Page 46: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/46.jpg)
Dashboard
![Page 47: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/47.jpg)
OTHER NOTEBOOKS
• Spark-notebook: https://github.com/andypetrella/spark-notebook
• ISpark: https://github.com/tribbloid/ISpark
• Spark Kernel: https://github.com/ibm-et/spark-kernel
• Jove Notebook: https://github.com/jove-sh/jove-notebook
• Beaker: https://github.com/twosigma/beaker-notebook
• Databricks Cloud notebook
![Page 48: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/48.jpg)
PART 2
STREAMING K-MEANS
![Page 49: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/49.jpg)
WHY STREAMING?
• Train - model - predict works well on static data
• What if data is
• Coming in streams
• Changing over time?
![Page 50: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/50.jpg)
STREAMING K-MEANS DESIGN
• Proposed by Dr Jeremy Freeman (here)
![Page 51: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/51.jpg)
STREAMING K-MEANS
• key concept: forgetfulness
• balances the relative importance of new data versus past history
• half-life
• time it takes before past data contributes to only one half of the current model
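A minimal sketch of forgetfulness, assuming the usual weighted-average form of the update: the old center counts as its accumulated weight shrunk by a decay factor, and the new batch counts as its number of points. The decay factor is derived from the half-life so that after that many steps the old data carries half its original weight. Function names and numbers are illustrative, not the Spark API.

```python
def decay_from_half_life(half_life):
    """Decay factor a such that a ** half_life == 0.5."""
    return 0.5 ** (1.0 / half_life)


def update_center(center, weight, batch_mean, batch_count, decay):
    """Blend an old center with the mean of newly assigned points.

    The old center counts as `weight * decay` points and the new batch as
    `batch_count` points. Returns the new center and its new weight.
    """
    discounted = weight * decay
    total = discounted + batch_count
    if total == 0:  # nothing old and nothing new: leave the center alone
        return center, 0.0
    new_center = tuple((c * discounted + x * batch_count) / total
                       for c, x in zip(center, batch_mean))
    return new_center, total


old_center, old_weight = (0.0, 0.0), 10.0
batch_mean, batch_count = (10.0, 10.0), 10

# decay = 1.0 remembers everything, decay = 0.0 only trusts the newest batch.
remember_all = update_center(old_center, old_weight, batch_mean, batch_count, 1.0)
forget_all = update_center(old_center, old_weight, batch_mean, batch_count, 0.0)
half = update_center(old_center, old_weight, batch_mean, batch_count,
                     decay_from_half_life(1.0))
print(remember_all, forget_all, half)
```

With a half-life of one step, the old center above is treated as 5 points instead of 10, so the new center lands two thirds of the way toward the batch mean.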
![Page 52: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/52.jpg)
STREAMING K-MEANS
• time unit
• batches (which have a fixed duration in time), or points
• eliminate dying clusters
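The time unit decides how often forgetting is applied: once per batch, or once per incoming point. The sketch below shows that choice, plus one possible way to eliminate a dying cluster by splitting the heaviest cluster in two. The slide only names the idea, so `ratio`, `nudge` and the splitting strategy are assumptions for illustration.

```python
def discount(decay, time_unit, num_new_points):
    """How much to shrink old cluster weights before merging a new batch."""
    if time_unit == "batches":
        return decay                    # one step of forgetting per batch
    if time_unit == "points":
        return decay ** num_new_points  # one step of forgetting per point
    raise ValueError("time_unit must be 'batches' or 'points'")


def revive_dying_cluster(centers, weights, ratio=1e-8, nudge=1e-6):
    """If the lightest cluster is negligible, split the heaviest one in two."""
    largest = max(range(len(weights)), key=weights.__getitem__)
    smallest = min(range(len(weights)), key=weights.__getitem__)
    if weights[smallest] >= ratio * weights[largest]:
        return centers, weights         # nothing is dying
    centers, weights = list(centers), list(weights)
    shared = (weights[largest] + weights[smallest]) / 2.0
    weights[largest] = weights[smallest] = shared
    base = centers[largest]
    # Place the two halves a tiny distance apart so they can drift separately.
    centers[largest] = tuple(x + nudge for x in base)
    centers[smallest] = tuple(x - nudge for x in base)
    return centers, weights


per_batch = discount(0.5, "batches", 100)
per_point = discount(0.5, "points", 2)
new_centers, new_weights = revive_dying_cluster(
    [(0.0, 0.0), (5.0, 5.0)], [100.0, 0.0])
print(per_batch, per_point, new_centers, new_weights)
```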
![Page 53: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/53.jpg)
VISUALIZING STREAMING K-MEANS - LIGHTNING
![Page 54: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/54.jpg)
LIGHTNING
![Page 55: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/55.jpg)
VISUALIZING STREAMING K-MEANS ON IPYTHON + LIGHTNING
• Lightning - data visualization server
http://lightning-viz.org
• provides API-based access to reproducible, web-based, interactive visualizations. It includes a core set of visualization types, but is built for extendability and customization. Lightning supports modern libraries like d3.js and three.js, and is designed for interactivity over large data sets and continuously updating data streams.
![Page 56: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/56.jpg)
RUNNING LIGHTNING
• API: node.js, Python, Scala
• Extension support for custom chart (eg. d3.js)
• Requirements:
• Postgres recommended (SQLite ok)
• node.js (npm, gulp)
![Page 57: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/57.jpg)
The Freeman Lab at Janelia Research Campus uses Lightning to visualize large-scale neural recordings from zebrafish, in collaboration with the Ahrens Lab
![Page 58: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/58.jpg)
SPARK STREAMING K-MEANS DEMO
Environment
• requires: numpy, scipy, scikit-learn
• IPython/Python requires: lightning-python package
Demo consists of 3 parts: https://github.com/felixcheung/spark-ml-streaming
• Python driver script, data generator
• Scala job - Spark Streaming & Streaming k-means
• IPython notebook to process result, visualize with Lightning
Originally this was part of the Python driver script - it has been modified for this talk to run within IPython
![Page 59: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/59.jpg)
![Page 60: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/60.jpg)
![Page 61: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/61.jpg)
CHALLENGES
• Package management
• Version/build conflicts!
![Page 62: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/62.jpg)
YOU CAN RUN THIS TOO!
• Notebooks available at http://bit.ly/1JWOPh8
• Everything is heavily scripted and automated
• Vagrant config for local, virtual environment available at http://bit.ly/1DB3OLw
![Page 63: Sparkly Notebook: Interactive Analysis and Visualization with Spark](https://reader036.vdocuments.net/reader036/viewer/2022062503/58f9a962760da3da068b6e16/html5/thumbnails/63.jpg)
QUESTION?!
github: https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI