Robert Hryniewicz
Data Evangelist
@RobertH8z
Hands-on Intro to Spark & Zeppelin
Crash Course
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
• Quick Demo
• Spark Overview
• Zeppelin + HDP
• Lab ~ 1hr
• Spark 2.0
• Q/A
“Big Data”
Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
Where does “Big Data” come from?
44ZB in 2020
The “Big Data” Problem
Problem
• A single machine cannot process or even store all the data!
Solution
• Distribute data over large clusters
Difficulty
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?
Spark Background
History of Hadoop & Spark
Access Rates
At least an order of magnitude difference between memory and hard drive / network speed
[Figure: access rates compared, labeled FAST, slower, slowest]
What is Spark?
Apache Open Source Project
– originally developed at AMPLab (University of California, Berkeley)
Data Processing Engine
– In-memory computation
– FAST!
Elegant Developer-friendly APIs
– Supports: Scala, Python, Java and R
– Single environment for Data Wrangling, Machine Learning (ML), SQL Queries, Streaming Apps
Spark Ecosystem
Spark Core
Spark SQL | Spark Streaming | Spark MLlib | GraphX
Apache Spark Basics
Spark Context
What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
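A minimal sketch of creating and using a SparkContext, assuming a standalone Scala program with Spark on the classpath. The app name and the `local[*]` master are illustrative choices, not part of the original slides. In Zeppelin or spark-shell, `sc` already exists, so only the `parallelize` and `sum` lines apply there.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode for illustration; on a cluster the master is
// normally supplied by spark-submit or by the notebook.
val conf = new SparkConf().setAppName("sc-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Distribute a local collection across the cluster as an RDD.
val numbers = sc.parallelize(1 to 100)

// Run a simple job: the sum is computed per partition, then combined.
val total = numbers.sum()

println(s"Sum of 1..100 = $total")
sc.stop()
```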
Spark SQL
Spark SQL Overview
• Spark module for structured data processing (e.g. DB tables, JSON files)
• Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in relational DB or data frame in R/Python
– rows, columns, and schema
• API available in Scala, Java, Python, and R
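To make "named columns and a schema" concrete, here is a small sketch that builds a DataFrame from a local collection. It assumes the Spark 1.x style SQLContext used elsewhere in this deck and a local master; the two sample rows and the app name are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("df-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // enables .toDF on local collections

// Illustrative rows; the column names mirror the flight examples in this deck.
val flights = Seq(("IAD", "TPA", 8), ("IND", "BWI", -4))
  .toDF("Origin", "Dest", "DepDelay")

// The schema records each column's name and type.
flights.printSchema()
flights.show()

val columnNames = flights.columns.toSeq
sc.stop()
```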
DataFrames: Created from Various Sources
[Diagram: sources such as CSV, Avro, Hive, and Text are read through Spark SQL into a DataFrame of rows and columns (Col1, Col2, …, ColN)]
DataFrames from Hive:
– Reading and writing Hive tables
DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBase, Avro
Data is described as a DataFrame with rows, columns and a schema
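As a rough illustration of a built-in source, the sketch below writes a tiny DataFrame out as JSON and reads it back. The temporary directory, app name, and sample rows are assumptions made for the example; `sqlContext.read.parquet(path)` follows the same pattern for Parquet files.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sources-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical scratch location; write.json expects a path that does not exist yet.
val jsonPath = Files.createTempDirectory("flights").resolve("json").toString

// Write a small DataFrame as JSON, one object per line.
Seq(("IAD", "TPA", 8), ("IND", "BWI", -4))
  .toDF("Origin", "Dest", "DepDelay")
  .write.json(jsonPath)

// Read it back; column names and types are inferred from the data.
val fromJson = sqlContext.read.json(jsonPath)
fromJson.printSchema()

val rowCount = fromJson.count()
sc.stop()
```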
SQL Context and Hive Context
SQLContext
• Entry point into all functionality in Spark SQL
• All you need is a SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext
• Superset of functionality provided by basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
• Use when your data resides in Hive
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
Spark SQL Examples
DataFrame Example
Reading Data From Table
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+
DataFrame Example
Using DataFrame API to Filter Data (show delays more than 15 min)
// The $"col" syntax needs import sqlContext.implicits._ (already imported in Zeppelin and spark-shell)
df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
SQL Example
Using SQL to Query and Filter Data (again, show delays more than 15 min)
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show()
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
Spark Streaming
Spark Streaming
Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
[Diagram: input sources include ZeroMQ and MQTT]
Spark Streaming
Spark Streaming
Window Operations
Apply transformations over a sliding window of data, e.g. rolling average
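To make the rolling-average idea concrete, here is a sketch in plain Scala collections rather than the Spark Streaming API. The readings, the window length of 3, and the slide of 1 are made-up values. In Spark Streaming the analogous call on a DStream is `window(windowDuration, slideDuration)`, where each window covers a span of time instead of a fixed number of elements.

```scala
// Hypothetical sensor readings, in arrival order.
val readings = Seq(10.0, 12.0, 11.0, 15.0, 14.0)

// Window length of 3 readings, sliding forward 1 reading at a time.
val windowLength = 3
val slideInterval = 1

// Average each window as it slides over the data.
val rollingAverages = readings
  .sliding(windowLength, slideInterval)
  .map(window => window.sum / window.size)
  .toList

println(rollingAverages)
```

Five readings with a window of 3 and a slide of 1 yield three overlapping windows, so each reading in the middle contributes to more than one average.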
Spark MLlib
Where Can We Use Data Science / Machine Learning
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Spark ML: Spark API for building ML pipelines
Pipeline: Feature transform 1 → Feature transform 2 → Combine features → LinearRegression
Train: Input DataFrame → Pipeline → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model
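The pipeline stages can be sketched with the Spark ML API as follows. This is an illustration under assumptions, not the presenter's lab code: `StringIndexer` stands in for the two unnamed feature transforms, the rows are the sample flights shown earlier in this deck, and treating airport indices as numeric features is statistically crude. The point is the shape of train and predict, not model quality.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

val sc = new SparkContext(new SparkConf().setAppName("ml-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Sample rows from the flight examples; the label column must be a Double.
val training = Seq(
  ("IAD", "TPA", 8.0), ("IAD", "TPA", 19.0), ("IND", "BWI", 8.0),
  ("IND", "BWI", -4.0), ("IND", "BWI", 34.0), ("IND", "JAX", 25.0),
  ("IND", "LAS", 67.0), ("IND", "MCO", 94.0)
).toDF("Origin", "Dest", "DepDelay")

// Feature transforms 1 and 2: map airport codes to numeric indices.
val originIndexer = new StringIndexer().setInputCol("Origin").setOutputCol("OriginIdx")
val destIndexer = new StringIndexer().setInputCol("Dest").setOutputCol("DestIdx")

// Combine features into the single vector column the estimator expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("OriginIdx", "DestIdx"))
  .setOutputCol("features")

val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("DepDelay")

// Train: fitting the Pipeline on a DataFrame yields a PipelineModel.
val pipeline = new Pipeline().setStages(Array(originIndexer, destIndexer, assembler, lr))
val model = pipeline.fit(training)

// Predict: the PipelineModel maps an input DataFrame to an output DataFrame.
val predictions = model.transform(training)
predictions.select("Origin", "Dest", "DepDelay", "prediction").show()

val predictedRows = predictions.count()
sc.stop()
```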
Spark GraphX
GraphX
• PageRank
• Topic Modeling (LDA)
• Community Detection
Source: ampcamp.berkeley.edu
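As one example of the algorithms listed, the sketch below runs GraphX's built-in PageRank on a tiny, made-up link graph. The vertex ids, the edges, and the tolerance of 0.0001 are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setAppName("graphx-demo").setMaster("local[*]"))

// Hypothetical link graph: pages 1, 2, and 3 all link to page 4; page 1 also links to page 2.
val edges = sc.parallelize(Seq(
  Edge(1L, 4L, 1), Edge(2L, 4L, 1), Edge(3L, 4L, 1), Edge(1L, 2L, 1)
))

// Build the graph from its edges; every vertex gets the default attribute 0.
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Run PageRank until the ranks change by less than the tolerance.
val ranks = graph.pageRank(0.0001).vertices

// Pick the vertex id with the highest rank.
val topPage = ranks.collect().maxBy(_._2)._1
println(s"Highest-ranked page: $topPage")
sc.stop()
```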
Apache Zeppelin & HDP Sandbox
What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features: Sharing, Collaboration, Reports, Import/Export
How does Zeppelin work?
Notebook Author
Collaborators/Report viewers
Zeppelin
Cluster: Spark | Hive | HBase, any of 30+ back ends
Big Data Lifecycle
Stages: Collect, ETL / Process, Analysis, Report, Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer
HDP Sandbox
What’s included in the Sandbox?
• Zeppelin
• Latest Hortonworks Data Platform (HDP)
– Spark
– YARN Resource Management
– HDFS Distributed Storage Layer
– And many more components...
[Diagram: APIs (Scala, Java, Python, R) over Spark SQL, Spark Streaming, MLlib, and GraphX, on the Spark Core Engine, running on YARN over HDFS nodes 1 … N]
There’s more to HDP
Hortonworks Data Platform 2.4.x
GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory
– Others: ISV Engines
– Tez, Slider
SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, ZooKeeper
– Scheduling: Oozie
DATA MANAGEMENT
– YARN: Data Operating System
– HDFS: Hadoop Distributed File System
Deployment Choice
– Linux, Windows, On-Premise, Cloud
Spark 2.0
What’s New in Spark 2.0
API Unification
– DataFrame is now an alias for Dataset[Row]
– SparkSession (%spark) replaces SparkContext, SQLContext, and HiveContext
• spark is the new entry point to all Spark features
Structured Streaming
– DataFrame/Dataset API for manipulating stream data
– Real-time incremental processing
– Attempt to unify streaming, interactive, and batch processing
Performance Improvements
– Tungsten: “bare metal” code generation
– ORC & Parquet file formats
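A short sketch of the unified entry point, assuming Spark 2.x on the classpath and a local master. The app name and sample rows are illustrative. In Zeppelin, the `spark` variable is already provided, so the builder call is only needed in a standalone program.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration.
val spark = SparkSession.builder().appName("spark2-demo").master("local[*]").getOrCreate()
import spark.implicits._

// The underlying SparkContext is still reachable from the session when needed.
val sc = spark.sparkContext

// Sample rows from the flight examples; a DataFrame is a Dataset[Row].
val flights = Seq(("IAD", "TPA", 19), ("IND", "BWI", -4), ("IND", "LAS", 67))
  .toDF("Origin", "Dest", "DepDelay")

// createOrReplaceTempView is the 2.0 replacement for registerTempTable.
flights.createOrReplaceTempView("flights")
val delayed = spark.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")
delayed.show()

val delayedCount = delayed.count()
spark.stop()
```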
Hortonworks Community Connection
Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories
Community Engagement
Participate now at: community.hortonworks.com
7,500+ Registered Users
15,000+ Answers
20,000+ Technical Assets
One Website!
Lab Preview
Link to Lab Setup Instructions
http://tinyurl.com/hwx-spark-intro
Robert Hryniewicz
[email protected]
@RobertH8z
Thanks!