power of data. simplicity of design. speed of innovation. · what spark isn’t a data store...
TRANSCRIPT
© 2016 IBM Corporation
Carlo Appugliese
Big Data Evangelist
www.linkedin.com/in/carloappugliese
Power of data. Simplicity of design. Speed of innovation.
Pijush Chatterjee
Big Data Evangelist
https://www.linkedin.com/in/pijushchatterjee
2 © 2016 IBM Corporation
The digital age is changing the way we
Live, Play, Learn and Work…
3 © 2016 IBM Corporation
Landing and
Archive Zone
Real-time
Analytics
Zone
Enterprise Warehouse
and Mart Zone
Information Governance, Security and Business Continuity
Analytic Appliances
Big Data Platform Capabilities
Streaming Data
Text Data
Applications Data
Time Series
Geo Spatial
Relational
• Information Ingest
• Real Time Analytics
• Warehouse & Data Marts
• Analytic Appliances
Social Network
Video &
Image
All Data Sources
Advanced Analytics /
New InsightsNew / Enhanced
Applications
Automated Process
Case Management
Analytic Applications
CognitiveLearn Dynamically?
PrescriptiveBest Outcomes?
PredictiveWhat Could Happen?
DescriptiveWhat Has Happened?
Exploration and
DiscoveryWhat Do You Have?
Watson
Cloud Services
ISV Solutions
Alerts
Big Data and analytics sample architecture
Ingestion
and
Operational
Information
4 © 2016 IBM Corporation
Apache Spark
Apache Spark is enabling competitive advantage
5 © 2016 IBM Corporation
What is Spark?
Spark is an open source
in-memory
application framework for
distributed data processing and
iterative analysis
on massive data volumes
“Analytic Operating System”
6 © 2016 IBM Corporation
Apache Spark has no limits!!
Spark Application
(Driver Program)
SparkContext
Local Threads
Swift (SoftLayer), AWS S3, HDFS, or other storage
Cluster Manager
Worker Node Worker Node
Spark
ExecutorSpark
Executor
Cache Cache
• Spark programs generally consist of two
components: driver program and worker
program(s)
– Driver Program manages the division of
computations (Task) that are sent to worker
nodes
– Worker programs run smaller portions of
computations
• The SparkContext object instructs Spark on
how & where to access a cluster
• Cluster Manager manages the physical
resources needed to run driver and worker
programs.
7 © 2016 IBM Corporation
Traditional Approach: MapReduce jobs for complex jobs, interactive query,
and online event-hub processing involves lots of (slow) disk I/O
Solution: Keep data in-memory with a new distributed execution engine
Apache Spark is Fast!
HDFS
Read
Input
CPU
Iteration 1
Memory CPU
Iteration 2
Memory
10–100x faster than
network & disk
Minimal
Read/Write Disk
Bottleneck
Chain Job Output
into New Job Input
HDFS
Read
HDFS
Write
HDFS
Read
HDFS
Write
CPU
Iteration 1
Memory CPU
Iteration 2
Memory
8 © 2016 IBM Corporation
What Spark isn’t
A data store – Spark attaches to many data stores but does not provide its own
Only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a separate, standalone system
Only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally as well
Only for batch processing – Spark has streaming capabilities, machine learning and is increasingly being used for end user real-time analytic applications
9 © 2016 IBM Corporation
Spark includes a set of core libraries that enable various
analytic methods which can process data from many sources
Spark Core Engine
general compute
engine, handles
distributed task
dispatching,
scheduling and basic
I/O functions
Spark SQLSpark
Streaming
MLlib
(machine
learning)
GraphX
(graph)
executes SQL
statements
performs
streaming
analytics using
micro-batches
common
machine
learning and
statistical
algorithms
distributed
graph
processing
framework
large variety of
data sources and
formats can be
supported, both on
premise or cloud
BigInsights
(HDFS)
Cloudant
dashDB
Object
Storage
SQL
DB
…many
others
IBM CLOUD OTHER CLOUD CLOUD APPS ON-PREMISE
10 © 2016 IBM Corporation
Spark Programming Languages
Scala– Functional programming
– Spark written in Scala
– Scala compiles into Java byte code
Java– New features in Java 8 makes for more
compact coding (lambda expressions)
Python– Most widely used API with Spark today
R– Functional programming language used to create and manipulate functions
This probably means that more “data scientists” are starting to use Spark
DataFrames make all languages equally performant
Language 2014 2015
Scala 84% 71%
Java 38% 31%
Python 38% 58%
R unknown 18%
Survey done by Databricks,
Summer 2015
11 © 2016 IBM Corporation
What is a “Notebook”?
Pen and Paper Pen and paper has long provided the rich
experience that scientists need to document
progress through notes and drawings:– Expressive
– Cumulative
– Collaborative
Notebooks Notebooks are the digital equivalent of the
“pen and paper” lab notebook, enabling data
scientists to document reproducible
analysis:– Markdown and visualization
– Iterative exploration
– Easy to share
12 © 2016 IBM Corporation
Open Source Web-Based Notebooks
Notebooks:
“interactive computational environment, in which you can combine code
execution, rich text, mathematics, plots and rich media”
Zeppelin– Classified by Apache with a status of “incubator”
– Current version: 0.5.5 (Nov 18, 2015)
– Support multiple interpreters• Scala, Python, SparkSQL
Jupyter– Based on IPython
– Current version: 4.1.0b1
– Supports multiple interpreters• Python, Scala
13 © 2016 IBM Corporation
Key reasons for interest in Spark
Performant In-memory architecture greatly reduces disk I/O
Anywhere from 20-100x faster for common tasks
Productive Concise and expressive syntax, especially compared to prior approaches
Single programming model across a range of use cases and steps in data lifecycle
Integrated with common programming languages – Java, Python, Scala
New tools continually reduce skill barrier for access (e.g. SQL for analysts)
Leverages existing
investments
Works well within existing Hadoop ecosystem
Improves with age Large and growing community of contributors continuously improve full analytics stack and extend capabilities
14 © 2016 IBM Corporation
Spark Common Use Cases
•Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
•Faster time to job completion allows analysts to ask the “next” question about their data & business
Interactive Query
•Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
•Nightly ETL processing from production systemsLarge-Scale Batch
Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
Data mining across various types of data
Complex Analytics
•Web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time
•Responsive monitoring of RFID-tagged devices
Event Processing
•Predictive modeling answers questions of "what will happen?"
•Self-tuning machine learning, continually updating algorithms, and predictive modeling
Model Building
15 © 2016 IBM Corporation
IBM is all-in on Spark
Launch Spark Technology Cluster (STC), 300 engineers
Open source SystemML
Partner with Databricks
Contribute to the Core
Foster Community
Educate 1M+ data scientists and engineers via online
courses
Sponsor AMPLab, creators and evangelists of Spark
Infuse the Portfolio
Integrate Spark throughout portfolio
3,500 employees working on Spark-related topics
Spark however customers want it – standalone, platform or products
"It's like Spark just got
blessed by the enterprise
rabbi."
Ben Horowitz,Andreessen Horowitz
16 © 2016 IBM Corporation
IBM w/ Spark empowers all Data Professionals Who
Are Hungry to Put Data to Work
Business Analysts
Data ScienceApp Developers
Data Engineers
Easily discover and
explore data to
improve decisions
Tame, curate and secure
data to make it relevant
and accessible.
Easily plug into data
and models to make
apps more powerful
Streamline algorithm
development to
deliver insight faster
17 © 2016 IBM Corporation
IBM Announcement - Watson Data Platform Project
ibm.co/dataworks
Community-centric User Experiences– Fit for purpose tooling for each data professional
– Bound together through collaboration and integrated platform
Cloud-Based Data and Analytics Services – Cloud based data and analytics services
– Built on Open Source Technologies
– Supplemented w/ enterprise and cognitive capabilities
Solution Blueprints – Set of data and analytics offerings integrated together to solve
specific use cases
Business
Analyst
Data
Scientist
App
Developer
Data
Engineer
18 © 2016 IBM Corporation
Tailored Experiences for Users Collaborating Together
Architects how data is
organized & ensures operability
Gets deep into the data to draw
hidden insights for the business
Works with data to apply insights
to the business strategy
Plugs into data and models &
writes code to build apps
Ingest
data
Transform
: clean
Create
and build
model
Evaluate
Deliver
and deploy
model
Communicate
results
Understand
problem and
domain
Explore and
understand
data
Transform:
shape
OUTPUT
ANALYSIS
INPUT
(Beta)
Data Engineer
Data Scientist
Business Analyst
App Developer
IBM Bluemix Data Connect
Data Science Experience
Analytics for Apache Spark, SPSS
Watson Analytics
Bluemix
19 © 2016 IBM Corporation
IBM DataWorks Provides Choice of Collaborative User
Experiences, Solution Blueprints, and Individual Services
Access
& Ingest
Find Share Collaborate
StoreAnalyze
& BuildDeploy
• IOT
• Streaming
• Data Preparation
• ETL/ELT
• Hadoop
• NoSQL/SQL
• Object Store
• Descriptive
• Predictive
• Prescriptive
• Dev. environment
• Apps/APIs
• Data Pipelines
• Reports
• Models
Solution
BlueprintsSelf-Service
Analytics
Internet of
Things
Data
Lake
Mobile
Applications
User
Experiences
Individual
Services
Powered by
Governance
Data Access
Data Recognition
Advanced Analytics
20 © 2016 IBM Corporation
Data Professional Evolution – Data Science!
Statistician
Programmer Business Expert
ETL using SED/AWK
Scripting, SQL
Data Formats
Python, R Scala
Mathematical Background
Computational Science
Business/Industry Expertise
Domain Knowledge
Supply Chain
CRM
Financials
Networking
Data Science Combines Skills across areas of Expertise
Data Science Today vary in combinations of these skills
21 © 2016 IBM Corporation
Benefits of Spark for Data Science
Spark is Easy… Allows Data Scientist to code at scale…
Support multiple programing interfaces (Scala, Python, Java and R)
Less lines of Code to get answers.
Spark is Agile… Allows Data Scientist to build pipelines, learn and get answers...
Unified APIs (SQL, DataFrames, Streaming, Machine Learning, etc.)
Supports Notebooks (Jupyter, Zeppelin, Etc.)
Spark is Fast… Allows Data Scientist to iterate quicker...
In-Memory processing that scales in a distributed architecture.
Application follows lazy evaluations architecture w/ optimized execution.
General compute engine
Basic I/O functions
Task dispatching
SchedulingSpark Core
Spark SQLSpark
StreamingMLlib
Machine Learning
GraphX Graphing
+
22 © 2016 IBM Corporation
Built-in learning to
get started or go
the distance with
advanced tutorials
Learn
The best of open source
and IBM value-add to
create state-of-the-art
data products
Create
Community and
social features that
provide meaningful
collaboration
Collaborate
URL: http://datascience.ibm.com
Supporting Data Scientist…
Introducing the Data Science ExperienceCurrently in Beta
Powered by
23 © 2016 IBM Corporation
A L L Y O U R T O O L S I N O N E P L A C E
IBM Data Science Experience is an environment that brings
together everything that a Data Scientist needs. It includes the
most popular Open Source tools and IBM unique value-add
functionalities with community and social features, integrated
as a first class citizen to make Data Scientists more successful.
datascience.ibm.co
m
IBM Data Science Experience
24 © 2016 IBM Corporation
IBM Data Science Experience
Community Open Source IBM Added Value
Powered by IBM DataWorks in the Cloud
• Find tutorials and datasets
• Connect with Data Scientists
• Ask questions
• Read articles and papers
• Fork and share projects
• Code in Scala/Python/R/SQL
• Jupyter and Zeppelin* Notebooks
• RStudio IDE and Shiny apps
• Apache Spark
• Your favorite libraries
• Data Shaping/Pipeline UI *
• Auto-data preparation
and modeling*
• Advanced Visualizations*
• Model management
and deployment*
• Documented Model APIs*
• Spark as a Service
* DSX product roadmap items
Core Attributes of the Data Science Experience
25 © 2016 IBM Corporation
https://youtu.be/1HjzkLRdP5k
26 © 2016 IBM Corporation
* DSX product roadmap items
Demonstration