Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin

Robert Hryniewicz
Data Evangelist
@RobertH8z

Hands-on Intro to Spark & Zeppelin
Crash Course

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

• Quick Demo

• Spark Overview

• Zeppelin + HDP

• Lab ~ 1hr

• Spark 2.0

• Q/A


“Big Data”

Where does “Big Data” come from?

Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables

User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo

44 ZB of data in 2020


The “Big Data” Problem

Problem
A single machine cannot process or even store all the data!

Solution
Distribute data over large clusters

Difficulty
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?


Spark Background


History of Hadoop & Spark


Access Rates

At least an order of magnitude difference between memory and hard drive / network speed

FAST (memory), slower (hard drive), slowest (network)


What is Spark?

Apache Open Source Project
– originally developed at AMPLab (University of California Berkeley)

Data Processing Engine
– In-memory computation
– FAST!

Elegant Developer-friendly APIs
– Supports: Scala, Python, Java and R
– Single environment for Data Wrangling, Machine Learning (ML), SQL Queries, Streaming Apps


Spark Ecosystem

Spark Core, with libraries on top:
– Spark SQL
– Spark Streaming
– Spark MLlib
– GraphX


Apache Spark Basics


Spark Context

What is it?

• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
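To make the role of sc concrete, here is a minimal sketch of a SparkContext driving a small distributed computation. In Zeppelin or spark-shell the context already exists as sc; the object name, the application name, and the "local[*]" master below are assumptions made only so the sketch is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Sum of squares of 1..n, computed as a distributed map followed by a reduce.
  def sumOfSquares(sc: SparkContext, n: Int): Long =
    sc.parallelize(1 to n)      // distribute the range across partitions
      .map(x => x.toLong * x)   // transformation: evaluated lazily
      .reduce(_ + _)            // action: triggers the actual job

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; on a real cluster the master
    // is normally supplied by the environment (e.g. YARN), not hard-coded.
    val conf = new SparkConf().setAppName("SparkContextSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(sumOfSquares(sc, 10)) // prints 385
    finally sc.stop()
  }
}
```

In a notebook paragraph, only the body of sumOfSquares would be needed, called with the sc that Zeppelin provides.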


Spark SQL


Spark SQL Overview

Spark module for structured data processing (e.g. DB tables, JSON files)

Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API


DataFrames

• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
– rows, columns, and schema
• API available in Scala, Java, Python, and R


DataFrames

Created from Various Sources

[Diagram: sources such as CSV, Avro, HIVE, and Text are read through Spark SQL into a DataFrame, a table of rows and columns Col1, Col2, …, ColN]

DataFrames from HIVE:
– Reading and writing HIVE tables

DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro

Data is described as a DataFrame with rows, columns and a schema


SQL Context and Hive Context

SQLContext

• Entry point into all functionality in Spark SQL
• All you need is a SparkContext

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

HiveContext

• Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
• Use when your data resides in Hive

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)


Spark SQL Examples


DataFrame Example

Reading Data From Table

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+


DataFrame Example

Using DataFrame API to Filter Data (show delays of more than 15 min)

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


SQL Example

Using SQL to Query and Filter Data (again, show delays of more than 15 min)

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show()

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


Spark Streaming


Spark Streaming

Overview

• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant

[Diagram: input sources, including ZeroMQ and MQTT, feeding Spark Streaming]


Spark Streaming

Window Operations

Apply transformations over a sliding window of data, e.g. rolling average
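The rolling-average idea can be illustrated without a cluster. The sketch below uses plain Scala collections, not the Spark Streaming API; the names rollingAverage, size, and step are illustrative assumptions. In Spark Streaming, the corresponding DStream operation is window(windowDuration, slideDuration), where the window and slide are time intervals rather than element counts.

```scala
object RollingAverageSketch {
  // Average of each window of `size` consecutive values, advancing by `step`.
  // This mimics a sliding window over a stream using an in-memory sequence.
  def rollingAverage(values: Seq[Double], size: Int, step: Int = 1): List[Double] =
    values.sliding(size, step)
      .filter(_.size == size)   // drop a trailing partial window, if any
      .map(w => w.sum / size)   // aggregate each window
      .toList

  def main(args: Array[String]): Unit = {
    val readings = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    // Windows: (1,2,3), (2,3,4), (3,4,5)
    println(rollingAverage(readings, size = 3)) // prints List(2.0, 3.0, 4.0)
  }
}
```

A larger step makes the windows overlap less; with step equal to size the windows do not overlap at all.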


Spark MLlib


Where Can We Use Data Science / Machine Learning

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels


Spark ML: Spark API for building ML pipelines

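[Diagram: a Pipeline chains Feature transform 1, Feature transform 2, Combine features, and LinearRegression. Train: Input DataFrame → Pipeline → Pipeline Model. Predict: Input DataFrame → Pipeline Model → Output DataFrame. The Pipeline Model can also be exported (Export Model).]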
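The train and predict flow of a pipeline can be sketched with the Spark ML API as follows. This is a toy under stated assumptions, not the lab's actual model: it uses the Spark 2.x SparkSession API, the choice of StringIndexer for the two feature transforms is an assumption, the rows are the sample flight rows shown in the earlier slides, and predicting DepDelay from airport codes is for illustration only.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  def fitAndPredict(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Input DataFrame: sample rows from the flights examples, delay as Double.
    val flights = Seq(
      ("IAD", "TPA", 8.0), ("IAD", "TPA", 19.0), ("IND", "BWI", 8.0),
      ("IND", "BWI", -4.0), ("IND", "BWI", 34.0), ("IND", "JAX", 25.0),
      ("IND", "LAS", 67.0), ("IND", "MCO", 94.0)
    ).toDF("Origin", "Dest", "DepDelay")

    // Feature transforms 1 and 2: map airport codes to numeric indices.
    val originIdx = new StringIndexer().setInputCol("Origin").setOutputCol("OriginIdx")
    val destIdx = new StringIndexer().setInputCol("Dest").setOutputCol("DestIdx")
    // Combine features into the single vector column the estimator expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("OriginIdx", "DestIdx")).setOutputCol("features")
    // A small regularization term keeps the toy problem well conditioned.
    val lr = new LinearRegression()
      .setLabelCol("DepDelay").setFeaturesCol("features").setRegParam(0.1)

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](originIdx, destIdx, assembler, lr))
    val model: PipelineModel = pipeline.fit(flights) // Train
    // Export would be model.write.overwrite().save(path) for a path of your choice.
    model.transform(flights)                         // Predict: adds "prediction"
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, only so the sketch runs standalone.
    val spark = SparkSession.builder().appName("PipelineSketch").master("local[*]").getOrCreate()
    try fitAndPredict(spark).select("Origin", "Dest", "DepDelay", "prediction").show()
    finally spark.stop()
  }
}
```

The fitted PipelineModel replays every stage in order, so new data only needs the raw Origin and Dest columns to be scored.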


Spark GraphX


GraphX

• PageRank
• Topic Modeling (LDA)
• Community Detection

Source: ampcamp.berkeley.edu


Apache Zeppelin & HDP Sandbox


What’s Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.


What is a Note/Notebook?

• A web-based GUI for small code snippets
• Write code snippets in the browser
• Zeppelin sends code to the backend for execution
• Zeppelin gets data back from the backend
• Zeppelin visualizes data
• Zeppelin Note = Set of Paragraphs/Cells
• Other Features: Sharing/Collaboration/Reports/Import/Export


How does Zeppelin work?

[Diagram: Notebook Author and Collaborators/Report viewers ↔ Zeppelin ↔ Cluster (Spark | Hive | HBase, any of 30+ back ends)]


Big Data Lifecycle

[Diagram: Collect → ETL / Process → Analysis → Report → Data Product. Roles shown: Data Engineer, Data Scientist, Business user, Customer]


HDP Sandbox

What’s included in the Sandbox?

• Zeppelin
• Latest Hortonworks Data Platform (HDP)
– Spark
– YARN Resource Management
– HDFS Distributed Storage Layer
– And many more components...

[Diagram: APIs (Scala, Java, Python, R) on top of Spark SQL, Spark Streaming, MLlib, and GraphX, on the Spark Core Engine, running on YARN over HDFS nodes 1 … N]


There’s more to HDP

Hortonworks Data Platform 2.4.x

GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory: Spark
– Others: ISV Engines
– Tez and Slider sit beneath several of these engines

DATA MANAGEMENT
– YARN: Data Operating System
– HDFS: Hadoop Distributed File System (nodes 1 … N)

SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, ZooKeeper
– Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premise, Cloud


Spark 2.0


What’s New in Spark 2.0

API Unification
– DataFrame is now an alias for Dataset[Row]
– SparkSession (%spark) replaces SQLContext and HiveContext and wraps SparkContext
• spark is the new entry point to all Spark features

Structured Streaming
– DataFrame/Dataset API for manipulating stream data
– Real-time incremental processing
– Attempt to unify streaming, interactive, and batch processing

Performance Improvements
– Tungsten: “bare metal” code generation
– ORC & Parquet file formats
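As a rough illustration of the unified entry point, the earlier SQL example can be rewritten against a SparkSession. This is a sketch assuming Spark 2.x; the object name and the "local[*]" master are assumptions, and the rows are the sample rows shown in the earlier DataFrame slide. In a Zeppelin %spark paragraph the session already exists as spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkSessionSketch {
  // Flights delayed by more than 15 minutes, queried through the session.
  def delayedFlights(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables toDF on local collections
    val flights = Seq(
      ("IAD", "TPA", 8), ("IAD", "TPA", 19), ("IND", "BWI", 8),
      ("IND", "BWI", -4), ("IND", "BWI", 34)
    ).toDF("Origin", "Dest", "DepDelay")

    // Spark 2.0 counterpart of registerTempTable.
    flights.createOrReplaceTempView("flights")
    spark.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")
  }

  def main(args: Array[String]): Unit = {
    // One builder call yields SQL, DataFrame, and Dataset functionality;
    // the underlying SparkContext stays reachable as spark.sparkContext.
    val spark = SparkSession.builder().appName("SparkSessionSketch").master("local[*]").getOrCreate()
    try delayedFlights(spark).show()
    finally spark.stop()
  }
}
```

If Hive tables are needed, the builder also accepts enableHiveSupport() before getOrCreate(), which takes over the role of HiveContext.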


Hortonworks Community Connection


Hortonworks Community Connection

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)

• Knowledge Base Articles

• Code Samples and Repositories


Lab Preview


Link to Lab Setup Instructions

http://tinyurl.com/hwx-spark-intro

Robert Hryniewicz

@RobertH8z

Thanks!
