© 2017 Continuum Analytics - Confidential & Proprietary 1
Open Data Science in the AI Era: Breaking Data Science Open
About Us
Michele Chambers (@mcAnalytics)
• EVP Product & CMO, Continuum Analytics
• M.B.A., Duke University; B.S., Computer Engineering
• Author:
  • Breaking Data Science Open (O’Reilly)
  • Modern Analytics Methodologies: Driving Business Value with Analytics (Pearson FT Press)
  • Advanced Analytics Methodologies: Driving Business Value with Analytics (Pearson FT Press)
  • Big Data, Big Analytics (Wiley)

About Us
Ian Stokes-Rees (@ijstokes)
• Product Marketing Manager
• PhD in Particle Physics from Oxford University
• Passionate advocate of Open Data Science
• Educator and evangelist for use of Python and Anaconda
• Author: Breaking Data Science Open (O’Reilly)
• 20+ years of experience engineering large-scale systems for numerical computing
Business Intelligence & Predictive Analytics: Using Data for Insight & Human-in-the-Loop Actions

Cognitive Intelligence: Using Data & Deep Learning to Make Recommendations

Open Data Science: Connecting Data, Analytics & Computation
AI is…
“The ability of machines and software to think, work and react like humans.”
Data science has expanded to include AI
• AI: deep learning, natural language processing, neural networks, regression, statistics
• Distributed Systems: Hadoop, Spark
• Web: web crawling, scraping, 3rd-party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores
• Business Intelligence: data warehouse, querying, reporting
Open Source Communities Create Powerful Technology for Data Science
Numba, xlwings, Airflow, Blaze
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
AI
Diverse Languages Poised to Embrace AI
Distributed Systems
Business Intelligence
AI
Web
Scientific Computing / HPC
SQL
Anaconda is the Open Data Science Platform bringing technology together
Numba, xlwings, Airflow, Blaze
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
AI
Open Data Science Technologies
AI
Notebooks
Visualization
Big Data
Languages
Open Data Science Landscape
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption (Niche to Wide); Open Core to Open Source
Open Data Science Landscape for AI
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption; Open Core to Open Source
MS Azure ML, AWS Machine Learning, Google Cloud Platform
DataRobot, H2O, Anaconda Enterprise, deepsense
NLTK
Theano, TensorFlow, Caffe, Keras, MXNet
Spark MLlib
spaCy
Anaconda, NumPy, SciPy, Pandas, Scikit-Learn
dplyr, caret
Open Data Science Landscape for Notebooks
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption; Open Core to Open Source
Anaconda Enterprise
RStudio, Databricks
Jupyter
Zeppelin, JupyterHub
Rodeo
JupyterLab
Beaker
nteract, binder
nbviewer
Hue
Open Data Science Landscape for Visualization
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption; Open Core to Open Source
Plotly Server
Anaconda Enterprise
HoloViews, Datashader
Shiny Server Pro
Bokeh, Shiny, Shiny Server
ggplot
d3
matplotlib
seaborn
Plotly
Open Data Science Landscape for Big Data
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption; Open Core to Open Source
Hortonworks
Cloudera
MapR, Anaconda Enterprise
Databricks, Pivotal
Dask
Ibis
Numba
Kudu, Spark
Hadoop
Flink
Spark Streaming
Kafka
Impala, Storm
Elasticsearch
Solr, Hive
Open Data Science Landscape for Languages
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption; Open Core to Open Source
RStudio, Anaconda Enterprise
R, Anaconda, Python, PyData
Microsoft R Open
Scala
Microsoft R Client
Julia
DataRobot
MS Azure ML
IBM DSX, Microsoft R Server
yhat, dataiku
Domino Data Lab
DataScience.com
Landscape for Open Data Science Platforms
Quadrants: Leaders, Challengers, Niche, Visionaries
Axes: Adoption; Open Core to Open Source
Everyone Else
Cloudera
Microsoft
Anaconda Enterprise
RStudio
R, Anaconda, Jupyter
Anaconda now includes AI
Art of the Possible: Open Data Science Applications
• Deep Learning means creating models that attempt to match the structure of the human brain through artificial neural networks
• TensorFlow, from Google, has grown enormously in popularity and provides a framework for creating deep learning systems
• Commonly applied to text, speech, and image processing to “learn” patterns, then identify them and score or classify new data streams
https://www.tensorflow.org/
Art of the Possible: Open Data Science Applications
• Artificial General Intelligence refers to systems that can understand and respond to a wide range of inputs
• Universe, from OpenAI, provides a framework for testing AGI systems on video games
• Open source, Python-based
https://universe.openai.com/
Art of the Possible: Open Data Science Applications
• Production-ready open source AI toolkit from Microsoft
• Used in Bing, Skype, Cortana
• Designed to work with Anaconda
• GPU and multi-node ready
https://aka.ms/cognitivetoolkit
Anaconda is the leading Open Data Science platform, powered by Python, the fastest-growing data science language
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
ACCELERATE Time-to-Value
• INNOVATE faster through managed agile experimentation
• MOVE from analysis to deployment immediately
• DELIVER powerful results backed by a high-performance open data science platform
CONNECT Data, Analytics & Compute
• LEVERAGE innovative open source analytics to extract value from data
• MAXIMIZE your computational power to easily analyze all data
• CONNECT and integrate all your data sources for predictive models
EMPOWER Data Science Teams
• ITERATE quickly to create powerful analysis and predictive models
• COLLABORATE and share with your data science team
• PUBLISH interactive results to the business
Open Data Science Platform: ACCELERATE. CONNECT. EMPOWER.
Open Data Science in Enterprises
• OPEN DATA SCIENCE
• DATA SCIENCE COLLAB
• DATA SCIENCE GOVERNANCE
• DASHBOARDS & APPS
• SELF-SERVICE ANALYTICS
• DATA SCIENCE OPERATIONS
• DATA SCIENCE FOR BIG DATA
• AI
Anaconda Powers

OPEN DATA SCIENCE
Anaconda
• High performance Python & R
• 720+ data science packages
• Cross-platform package, dependency & environments
• Community driven package repository collaboration
Anaconda Navigator
• Desktop Portal & Installer

DATA SCIENCE GOVERNANCE
Anaconda Repository
• Storage & sharing of packages, environments, notebooks
• On-premise governance
• Enterprise authentication

DATA SCIENCE COLLABORATION
Anaconda Enterprise Notebooks
• Collaborative project based workflows for Python & R
• Enterprise authentication & permissioning
• Notebook sharing, versioning, search, differencing

DATA SCIENCE FOR BIG DATA
Anaconda Scale
• Hadoop & Spark integration
• Scalable distributed processing framework
• Integration with resource management & data stores
• Self-service cluster launching
• Distributed package, dependency & environments
Anaconda Fusion
• Big Data querying & transformations
Anaconda Powers

AI
Anaconda
• 720+ data science packages
• Deep Learning: Theano, TensorFlow, Caffe, Keras, Neon, Lasagne
• Natural Language Processing: NLTK, spaCy
• Machine Learning: Scikit-learn
• GPU enablement

DASHBOARDS & APPS
Anaconda
• Interactive browser based dashboards & visualizations with Bokeh
• Bokeh apps using Python, R, Scala
• Big Data visualizations with Datashader

DATA SCIENCE OPERATIONS
Anaconda Adam
• Server & Cluster Installer
Anaconda Accelerate
• Python compilation for multi-core & GPUs
• Code, data, in-notebook profilers
• Pre-optimized numerical libraries

SELF-SERVICE ANALYTICS
Anaconda Fusion
• Integration of Open Data Science with Microsoft Excel®
• Interactive exploration & visualization
• Predictive modeling
• Big Data querying & transformations
Growing Anaconda Partner Ecosystem
Anaconda Gives Superpowers To People Who Change The World
Open Data Science: Vibrant and Growing Community
• Python Community: 30M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 50%+
• Anaconda Downloads: 11M+
Anaconda is Trusted by Industry Leaders
Financial Services
• Risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting
Government
• Fraud detection, data crawling, web & cyber data analytics, statistical modeling
Healthcare & Life Sciences
• Genomics data processing, cancer research, natural language processing for health data science
High Tech
• Customer behavior, recommendations, ad bidding, retargeting, social media analytics
Retail & CPG
• Engineering simulation, supply chain modeling, scientific analysis
Oil & Gas
• Pipeline monitoring, noise logging, seismic data processing, geophysics
Open Data Science in Action
• Dask: Parallel processing with Dask makes large data sets accessible
• Deep Learning with Anaconda & TensorFlow: Classic digit recognition leveraging Anaconda + GPUs
Dask: Parallel Data Processing
• Parallel and Distributed Pandas and NumPy
• Low latency workflow manager
• Graphical tools
• Simple APIs
• Extensible and generalizable to other data structures
Dask: Parallel Data Processing
Synthetic views of NumPy ndarrays
Synthetic views of Pandas DataFrames with HDFS support
DAG construction and workflow manager
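The idea behind this slide (split a large collection into chunks, describe the work as a DAG of small tasks, and let a scheduler run independent tasks in parallel) can be sketched with the Python standard library alone. This is a toy illustration, not Dask's implementation or API: the dict-of-tuples graph layout is only loosely modeled on how Dask describes task graphs, and `add_all`, `build_graph`, `run_graph`, and the chunk size are hypothetical names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def add_all(*parts):
    """Combine partial results (hypothetical helper for this sketch)."""
    return sum(parts)


def build_graph(data, chunk_size):
    """Build a toy task graph: one partial sum per chunk, then a final sum.

    Each task is a tuple (function, *arguments). A string argument that is
    also a key in the graph refers to the result of another task.
    """
    graph = {}
    partial_keys = []
    for start in range(0, len(data), chunk_size):
        key = f"chunk-sum-{start // chunk_size}"
        graph[key] = (sum, data[start:start + chunk_size])
        partial_keys.append(key)
    graph["total"] = (add_all, *partial_keys)
    return graph


def run_graph(graph, target, max_workers=4):
    """Run tasks in dependency order until `target` has a result.

    Tasks whose dependencies are already computed are submitted to the
    thread pool together, so independent chunks run concurrently.
    """
    def is_key(arg):
        return isinstance(arg, str) and arg in graph

    results = {}
    pending = dict(graph)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            ready = [
                key for key, (_, *args) in pending.items()
                if all(arg in results for arg in args if is_key(arg))
            ]
            if not ready:
                raise ValueError("cycle or missing dependency in task graph")
            futures = {}
            for key in ready:
                func, *args = pending.pop(key)
                resolved = [results[a] if is_key(a) else a for a in args]
                futures[key] = pool.submit(func, *resolved)
            for key, future in futures.items():
                results[key] = future.result()
    return results[target]


data = list(range(1, 101))             # stand-in for a "large" data set
graph = build_graph(data, chunk_size=25)
total = run_graph(graph, "total")
print(total)  # prints 5050
```

In this sketch the four chunk sums have no dependencies on each other, so they are submitted in the same round; the final `total` task only runs once all of them have finished.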
Deep Learning == Neural Nets
[Diagram: neural network with four ReLU layers]
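As a rough illustration of what a stack of ReLU layers computes, here is a minimal forward pass in plain Python. It is a sketch under stated assumptions, not the network from the demo: the layer sizes and weights are made up by hand, nothing is trained, and `relu`, `dense`, and `forward` are hypothetical helper names.

```python
def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j][i] links input i to output unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Apply ReLU after every layer except the last, which stays linear."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)


# Hand-picked (untrained) weights: 2 inputs -> 3 hidden units -> 1 output.
hidden_layer = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]], [0.0, 1.0, 0.0])
output_layer = ([[1.0, 2.0, 3.0]], [0.5])

inputs = [1.0, -2.0]
hidden = relu(dense(inputs, *hidden_layer))
output = forward(inputs, [hidden_layer, output_layer])
print(hidden)  # prints [3.0, 0.5, 0.0]
print(output)  # prints [4.5]
```

The third hidden unit receives a negative pre-activation (-1.0), which ReLU clips to zero. Frameworks such as TensorFlow perform the same arithmetic on tensors and additionally learn the weights from data.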
Deep Learning Software Stack
• NEURAL NETWORKS: Keras, TFLearn, Caffe
• TENSOR MATH: TensorFlow, Theano, PyTorch
• PRIMITIVES: Intel MKL 2017, cuDNN, MIOpen
• HARDWARE: multi-core CPU, GPU, many-core CPU (Xeon Phi)
...and many others
Example: Digit recognition
Trained to 98% accuracy in 4 minutes using a single NVIDIA GTX 1080 GPU
98% success
2% failure
A collaborative environment for data science teams
Anaconda Enterprise
Anaconda Foundation
http://continuum.io/downloads
Accelerating adoption of Open Data Science
• Easy to install
• Agile data exploration
• Powerful data analysis
• Simple to collaborate
• Accessible to everyone
PYTHON & R OPEN SOURCE ANALYTICS
NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, TensorFlow, Cython, Theano, Scikit-image, NLTK, Dask, Caffe, dplyr, shiny, ggplot2, tidyr, caret, PySpark & 720+ packages
Third Party Asset Management
http://anaconda.org
Step 2: Anaconda Cloud
Personal Assets Management
http://anaconda.org/ijstokes
Publishing Data Science
Provenance and Reproducibility
Enterprise Collaboration
Apps
Search
Tags
Team
Assets
Data Science App Creation & Deployment
Excel Integration
BRING interactive visualizations, machine learning and ETL to Excel
BRIDGE Excel Data to Python & R through notebooks
ACCESS all the power of Python and Big Data, natively embedded inside Excel
Anaconda Fusion brings Open Data Science to Microsoft Excel
Enterprise-Ready Anaconda
• OPEN DATA SCIENCE
• DATA SCIENCE COLLAB
• DATA SCIENCE GOVERNANCE
• DASHBOARDS & APPS
• SELF-SERVICE ANALYTICS
• DATA SCIENCE OPERATIONS
• DATA SCIENCE FOR BIG DATA
• AI
Executive Sponsorship
Static Investments
Dynamic Investments
Dynamic Investments
Data Lab Data Science Team CDO Production Operations
Executive Sponsorship Responsibilities
Governance Provenance Reproducibility
Data Science is a Team Sport
Traditional Roles: Individual | Silo
Modern Roles: Team | Collaborative
Empowering the Data Science Team
We can help you get started…
Assessments
• Journey to Open Data Science
Best Practices
• SAS to Python Workflow Best Practices
• Open Data Science Workflow Best Practices
Data Science
• Deployed Data Science Model
• Interactive Visualization Data Science
Migrations
• SAS to Python Migration
• Matlab to Python Migration
• SAS to R Migration
Tuning
• Python Tuning
• Model Tuning
• Infrastructure Tuning
Development
• Open Source Development
• Package Building
Next Steps
DOWNLOAD Breaking Data Science Open eBook: Go.continuum.io/download-ebook-breaking-data-science-open/
CHECK OUT Anaconda Enterprise: continuum.io/anaconda-overview
EXPERIENCE Anaconda Enterprise on your own: http://know.continuum.io/Anaconda-Enterprise-Test-Drive.html
Thank You
Michele Chambers, Twitter: @mcAnalytics
Ian Stokes-Rees, Twitter: @ijstokes
[email protected]
@ContinuumIO
Continuum Analytics
We empower data science teams to make the world a better place