Spark + Hadoop: Perfect Together
TRANSCRIPT
Spark and Hadoop
Perfect Together
Vinay Shukla, Director, Product Management (@neomythos)
Ram Sriharsha, Spark & Data Science Architect (@halfabrane)
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Operating System: Open Enterprise Hadoop

[Diagram: YARN as the data operating system, with governance, security, operations, and resource management spanning data access (batch, interactive, real-time) and storage, deployed on commodity hardware, appliances, or cloud]

Built on a centralized architecture of shared enterprise services:
• Scalable tiered storage
• Resource and workload management
• Trusted data governance & metadata management
• Consistent operations
• Comprehensive security
• Developer APIs and tools

A Hadoop/YARN-powered data operating system: a 100% open source, multi-tenant data platform for any application, any data set, anywhere.
Why We Love Spark at Hortonworks
• Elegant developer APIs: DataFrames, Machine Learning, and SQL
• Made for data science: all apps need to get predictive at scale and fine granularity
• Democratize machine learning: Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
• Community: broad developer, customer, and partner interest
• Realize the value of the data operating system: a key tool in the Hadoop toolbox
[Diagram: the Spark stack on YARN. Language APIs (Scala, Java, Python, R) sit over Spark SQL, Spark Streaming, MLlib, and GraphX, all built on the Spark Core Engine, which runs on YARN across HDFS nodes 1 through N]
Customer Use Cases with Spark

Insurance: detect over- and under-payments
• Overwhelmed by data ingest rates; sampled data down to fit on an edge node
• Random Forest models in R
• Team expertise is in R, not Scala/Java
• Many key features (e.g., textual features) not incorporated, because R cannot handle the feature blowup

Internet retailer: optimize offers/coupons
• Leverage Spark's ML, SQL & Streaming
• Process streaming data to offer coupons for at-risk shopping carts

Finance: react quickly to earnings reports
• Efficiently bring HBase data into Spark
• HBase has efficient scans; can Spark leverage them?
• Push predicates and prune columns

Health care: patient care system
• ETL, Streaming, SparkSQL & ML
• Need guidance on the various ecosystem projects
• How to size a cluster for Spark and other workloads?
• How does Spark run best on YARN?
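The "push predicates and prune columns" item in the Finance use case can be sketched conceptually. This is a plain-Python toy (the `scan` function and sample rows are invented for illustration, not the actual Spark DataSource API); the point is that row filters and column lists are handed down to the storage layer, so unneeded data never reaches Spark.

```python
# Conceptual sketch of predicate pushdown and column pruning.
# A naive reader pulls every row and every column; a pushdown-aware
# reader lets the storage layer (e.g., HBase) filter rows and drop
# columns before any data crosses the wire to Spark.

ROWS = [
    {"ticker": "AAA", "quarter": "Q1", "eps": 1.2, "notes": "..."},
    {"ticker": "BBB", "quarter": "Q1", "eps": 0.4, "notes": "..."},
    {"ticker": "AAA", "quarter": "Q2", "eps": 1.5, "notes": "..."},
]

def scan(rows, predicate=None, columns=None):
    """Simulate a storage-side scan: filter rows first, then prune columns."""
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # filtered row never leaves the storage layer
        yield {k: row[k] for k in (columns or row)}

# Only the AAA rows, and only the columns the query needs, are returned.
result = list(scan(ROWS,
                   predicate=lambda r: r["ticker"] == "AAA",
                   columns=["ticker", "eps"]))
print(result)
```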
Emerging Spark Patterns
• Spark as a query federation engine: bring data from multiple sources together to join/query in Spark
• Leverage Spark DataFrames for custom DSLs
• Use multiple Spark libraries together: common to see Core, ML & SQL used together
• Use Spark with various Hadoop ecosystem projects: Spark & Hive together, Spark & HBase together
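The query-federation pattern can be illustrated in miniature: pull rows from two independent sources into one SQL engine and join them there. This toy uses Python's stdlib sqlite3 in place of Spark SQL, and the table names and data are invented; at cluster scale, Spark DataFrames play the same role over Hive, HBase, JDBC, and other sources.

```python
# Toy illustration of query federation: two independent "sources"
# (plain Python lists standing in for, say, a Hive table and an
# HBase scan) are landed in one SQL engine and joined there.
import sqlite3

orders = [("cart1", "AAA", 3), ("cart2", "BBB", 1)]   # source 1
customers = [("cart1", "gold"), ("cart2", "basic")]   # source 2

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(cart TEXT, sku TEXT, qty INT)")
con.execute("CREATE TABLE customers(cart TEXT, tier TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
con.executemany("INSERT INTO customers VALUES (?, ?)", customers)

# One query spanning both sources, as if they were a single database.
joined = con.execute(
    "SELECT o.cart, o.sku, c.tier FROM orders o "
    "JOIN customers c ON o.cart = c.cart ORDER BY o.cart").fetchall()
print(joined)
```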
Our Focus on Spark

Data Science Acceleration: a compelling user experience for data scientists
Data science notebooks and automation for the most common analysis scenarios, including support for geospatial analytics and entity resolution.

Seamless Data Access: bring as much data under analysis with Spark as possible
Seamless use of capabilities across Spark and Hive via SQL, including common file formats. Deliver connectors for HBase (HFile).

Innovate at the Core: leverage the strengths of Spark & HDP to unlock additional value
Allow RDD sharing with the HDFS memory tier. Improve dynamic resource allocation via YARN. Mature SparkSQL and Spark Streaming to GA quality.

Apache is a trademark of the Apache Software Foundation.
Spark & Data Science Roadmap

Data Integration
– Improved data source integration
– One Hive: interop between Hive in HDP and Hive in SparkSQL
– Spark connector for HBase
– HDFS: RDD caching in HDFS

Enterprise Readiness
– General availability: SparkSQL, Spark Streaming
– Improved Spark security
– Easier job submission: REST ways to submit and query Spark jobs
– Enhanced YARN support: improved dynamic executor allocation without using the NodeManager

Ease of Use
– General availability: Zeppelin in HDP
– Enhanced ML: ML optimizations and new ML algorithms
– Improved R support: SparkR and RStudio certification
– Data science libraries: Magellan, Entity Resolution
Zeppelin Roadmap to GA

Enterprise Readiness

Operations
- Ambari Stack & Ambari View integration

Security
- Authentication against LDAP
- SSL
- Run in a Kerberized cluster
- Authorization of notebooks
- Security vulnerability fixes

Stabilization
- PySpark stability & validation
- Fix random freezes in the UI
- Better text editor

Ease of Use

R support
- SparkR support & validation

Usability
- Note hierarchy, navigation to a new note
- Auto-complete, syntax highlighting, line numbers

Visualization
- Pluggable visualization
- More charts, maps, and state-of-the-art visualizations
Spark & Zeppelin: Ongoing Innovation

Spark in HDP:
– Spark 1.2.1 GA in HDP 2.2.4
– Spark 1.3.1 Tech Preview (May 2015); GA in HDP 2.3.0
– Spark 1.4.1 Tech Preview (Aug 2015); GA in HDP 2.3.2
– Spark 1.5.1* Tech Preview (Nov 2015); Spark 1.5.2* GA planned in HDP 2.x
– Spark 1.z.1* GA planned in HDP x.y

Apache Zeppelin:
– Zeppelin Tech Preview (Oct 2015)
– Zeppelin Tech Preview refresh (Jan 2016)
– Zeppelin GA (1H 2016)

(Current time on the slide: Dec 2015.)
Magellan: Geospatial Analytics on Spark
Ram Sriharsha (Twitter: @halfabrane)
Spark and Data Science Architect, Hortonworks
Geospatial Insight
• Where do people go on weekends?
• Does the usage pattern change with time?
• Predict a user's drop-off point
• Predict the location where the next pick-up can be expected
• Identify crime hotspots; how do these hotspots evolve with time?
• Predict the likelihood of crime occurring in a given neighborhood
• Predict climate at a fairly granular level
• Climate insurance: do I need to buy insurance for my crops?
• Climate as a factor in crime: join the climate dataset with crime data
Geospatial data is pervasive
Magellan in a Nutshell

• Read Shapefiles / GeoJSON as data sources:
  sqlContext.read.format("magellan").load(path)
  sqlContext.read.format("magellan").option("type", "geojson").load(path)
• Spatial queries using expressions:
  point(-122.5, 37.6)                    // shape literal
  $"point" within $"polygon"             // boolean expression
  $"polygon1" intersection $"polygon2"   // binary expression
• Joins using Catalyst + spatial optimizations:
  points.join(polygons).where($"point" within $"polygon")
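Under the `within` predicate sits a point-in-polygon test. As a rough, Spark-free illustration of what such a predicate has to compute, here is a plain-Python ray-casting sketch (Magellan's actual implementation differs and handles edge cases this toy ignores):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray from (x, y) toward +x and count
    how many polygon edges it crosses; an odd count means inside.
    `polygon` is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # X-coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # point inside the square
print(point_in_polygon(5, 2, square))  # point outside the square
```

A spatial join like the one above conceptually evaluates this test for every (point, polygon) pair; Magellan's Catalyst optimizations exist precisely to avoid doing that naively.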
Magellan on Spark Packages

What
– Brings geospatial analytics to big data, powered by Spark
– Available at Spark Packages: http://spark-packages.org/package/harsha2010/magellan

Key Features
– Parses geospatial data and metadata into shapes + metadata
– Python and Scala support
– Efficient geometric queries with simple and intuitive syntax
– Scalable implementations of common algorithms

Learn More
– Magellan blog: http://tinyurl.com/magellanBlog
– Sample Magellan notebook: http://tinyurl.com/zeppelinNotebooks
Magellan Walk-through
Maximizing Revenue for Uber Drivers

Data Insight Opportunity
– Uber publishes anonymized geospatial trip data
– The City of San Francisco has an active open data program: demographics, neighborhoods, traffic

Challenge
– In which neighborhood should a driver hang out to maximize revenue?

Solution
– Leverage Spark to transform and aggregate the data
– Use Magellan for the geospatial queries
[Notebook walk-through screenshots: uber-magellan-nb]
Hortonworks Data Platform
Spark + Hadoop: Perfect Together
Learn More: Spark + Hadoop, Perfect Together
– HDP Spark general info: http://hortonworks.com/hadoop/spark/
– Learn more about our focus on Spark: http://hortonworks.com/hadoop/spark/#section_6
– Get the HDP Spark 1.5.1 Tech Preview: http://hortonworks.com/hadoop/spark/#section_5
– Get started with Spark and Zeppelin; download the Sandbox: http://hortonworks.com/sandbox
– Try these tutorials: http://hortonworks.com/hadoop/spark/#tutorials
– Learn more about geospatial Spark processing with Magellan: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/