Spark + Hadoop: Perfect Together
TRANSCRIPT
Spark and Hadoop
Perfect Together
Vinay Shukla, Director, Product Management (@neomythos)
Ram Sriharsha, Spark & Data Science Architect (@halfabrane)
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Operating System: Open Enterprise Hadoop

[Diagram: YARN as the data operating system, with governance, security, operations, and resource management spanning data access (batch, interactive, real-time) and storage, deployed on commodity hardware, appliances, or cloud]

Built on a centralized architecture of shared enterprise services:
• Scalable tiered storage
• Resource and workload management
• Trusted data governance & metadata management
• Consistent operations
• Comprehensive security
• Developer APIs and tools

A Hadoop/YARN-powered data operating system: a 100% open source, multi-tenant data platform for any application, any data set, anywhere.
Why We Love Spark at Hortonworks
• Elegant developer APIs: DataFrames, Machine Learning, and SQL
• Made for data science: all apps need to get predictive at scale and fine granularity
• Democratize machine learning: Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
• Community: broad developer, customer, and partner interest
• Realize the value of the data operating system: a key tool in the Hadoop toolbox
[Diagram: the Spark stack on YARN. Language APIs (Scala, Java, Python, R) sit over Spark SQL, Spark Streaming, MLlib, and GraphX, all built on the Spark Core Engine, which runs on YARN across HDFS nodes 1 through N]
Customer Use Cases with Spark

Insurance: detect over- and under-payments
• Overwhelmed by data ingest rates; sampled data down to fit on an edge node
• Random Forest models in R
• Team expertise is in R, not Scala/Java
• Many key features (e.g., textual features) not incorporated, because R cannot handle the feature blowup

Internet retailer: optimize offers/coupons
• Leverage Spark's ML, SQL & Streaming
• Process streaming data to offer coupons for at-risk shopping carts

Finance: react quickly to earnings reports
• Efficiently bring HBase data into Spark
• HBase has efficient scans; can Spark leverage them?
• Push predicates and prune columns

Health care: patient care system
• ETL, Streaming, SparkSQL & ML
• Need guidance on the various ecosystem projects
• How to size a cluster for Spark and other workloads?
• How does Spark run best on YARN?
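The "push predicates and prune columns" item in the Finance use case can be sketched conceptually. This is a plain-Python toy (the `scan` function and sample rows are invented for illustration, not the actual Spark DataSource API); the point is that row filters and column lists are handed down to the storage layer, so unneeded data never reaches Spark.

```python
# Conceptual sketch of predicate pushdown and column pruning.
# A naive reader pulls every row and every column; a pushdown-aware
# reader lets the storage layer (e.g., HBase) filter rows and drop
# columns before any data crosses the wire to Spark.

ROWS = [
    {"ticker": "AAA", "quarter": "Q1", "eps": 1.2, "notes": "..."},
    {"ticker": "BBB", "quarter": "Q1", "eps": 0.4, "notes": "..."},
    {"ticker": "AAA", "quarter": "Q2", "eps": 1.5, "notes": "..."},
]

def scan(rows, predicate=None, columns=None):
    """Simulate a storage-side scan: filter rows first, then prune columns."""
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # filtered row never leaves the storage layer
        yield {k: row[k] for k in (columns or row)}

# Only the AAA rows, and only the columns the query needs, are returned.
result = list(scan(ROWS,
                   predicate=lambda r: r["ticker"] == "AAA",
                   columns=["ticker", "eps"]))
print(result)
```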
Emerging Spark Patterns
• Spark as a query federation engine: bring data from multiple sources together to join/query in Spark
• Leverage Spark DataFrames for custom DSLs
• Use multiple Spark libraries together: common to see Core, ML & SQL used together
• Use Spark with various Hadoop ecosystem projects: Spark & Hive together, Spark & HBase together
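The query-federation pattern can be illustrated in miniature: pull rows from two independent sources into one SQL engine and join them there. This toy uses Python's stdlib sqlite3 in place of Spark SQL, and the table names and data are invented; at cluster scale, Spark DataFrames play the same role over Hive, HBase, JDBC, and other sources.

```python
# Toy illustration of query federation: two independent "sources"
# (plain Python lists standing in for, say, a Hive table and an
# HBase scan) are landed in one SQL engine and joined there.
import sqlite3

orders = [("cart1", "AAA", 3), ("cart2", "BBB", 1)]   # source 1
customers = [("cart1", "gold"), ("cart2", "basic")]   # source 2

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(cart TEXT, sku TEXT, qty INT)")
con.execute("CREATE TABLE customers(cart TEXT, tier TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
con.executemany("INSERT INTO customers VALUES (?, ?)", customers)

# One query spanning both sources, as if they were a single database.
joined = con.execute(
    "SELECT o.cart, o.sku, c.tier FROM orders o "
    "JOIN customers c ON o.cart = c.cart ORDER BY o.cart").fetchall()
print(joined)
```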
Our Focus on Spark

Data Science Acceleration: a compelling user experience for data scientists
Data science notebooks and automation for the most common analysis scenarios, including support for geospatial analytics and entity resolution.

Seamless Data Access: bring as much data under analysis with Spark as possible
Seamless use of capabilities across Spark and Hive via SQL, including common file formats. Deliver connectors for HBase (HFile).

Innovate at the Core: leverage the strengths of Spark & HDP to unlock additional value
Allow RDD sharing with the HDFS memory tier. Improve dynamic resource allocation via YARN. Mature SparkSQL and Spark Streaming to GA quality.

Apache is a trademark of the Apache Software Foundation.
Spark & Data Science Roadmap

Data Integration
– Improved data source integration
– One Hive: interop between Hive in HDP and Hive in SparkSQL
– Spark connector for HBase
– HDFS: RDD caching in HDFS

Enterprise Readiness
– General availability: SparkSQL, Spark Streaming
– Improved Spark security
– Easier job submission: REST ways to submit and query Spark jobs
– Enhanced YARN support: improved dynamic executor allocation without using the NodeManager

Ease of Use
– General availability: Zeppelin in HDP
– Enhanced ML: ML optimizations and new ML algorithms
– Improved R support: SparkR and RStudio certification
– Data science libraries: Magellan, Entity Resolution
Zeppelin Roadmap to GA

Enterprise Readiness

Operations
- Ambari Stack & Ambari View integration

Security
- Authentication against LDAP
- SSL
- Run in a Kerberized cluster
- Authorization of notebooks
- Security vulnerability fixes

Stabilization
- PySpark stability & validation
- Fix random freezes in the UI
- Better text editor

Ease of Use

R support
- SparkR support & validation

Usability
- Note hierarchy, navigation to a new note
- Auto-complete, syntax highlighting, line numbers

Visualization
- Pluggable visualization
- More charts, maps, and state-of-the-art visualizations
Spark & Zeppelin: Ongoing Innovation

Spark in HDP:
– Spark 1.2.1 GA in HDP 2.2.4
– Spark 1.3.1 Tech Preview (May 2015); GA in HDP 2.3.0
– Spark 1.4.1 Tech Preview (Aug 2015); GA in HDP 2.3.2
– Spark 1.5.1* Tech Preview (Nov 2015); Spark 1.5.2* GA planned in HDP 2.x
– Spark 1.z.1* GA planned in HDP x.y

Apache Zeppelin:
– Zeppelin Tech Preview (Oct 2015)
– Zeppelin Tech Preview refresh (Jan 2016)
– Zeppelin GA (1H 2016)

(Current time on the slide: Dec 2015.)
Magellan: Geospatial Analytics on Spark
Ram Sriharsha (Twitter: @halfabrane)
Spark and Data Science Architect, Hortonworks
Geospatial Insight
• Where do people go on weekends?
• Does the usage pattern change with time?
• Predict a user's drop-off point
• Predict the location where the next pick-up can be expected
• Identify crime hotspots; how do these hotspots evolve with time?
• Predict the likelihood of crime occurring in a given neighborhood
• Predict climate at a fairly granular level
• Climate insurance: do I need to buy insurance for my crops?
• Climate as a factor in crime: join the climate dataset with crime data
Geospatial data is pervasive
Magellan in a Nutshell

• Read Shapefiles / GeoJSON as data sources:
  sqlContext.read.format("magellan").load(path)
  sqlContext.read.format("magellan").option("type", "geojson").load(path)
• Spatial queries using expressions:
  point(-122.5, 37.6)                    // shape literal
  $"point" within $"polygon"             // boolean expression
  $"polygon1" intersection $"polygon2"   // binary expression
• Joins using Catalyst + spatial optimizations:
  points.join(polygons).where($"point" within $"polygon")
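Under the `within` predicate sits a point-in-polygon test. As a rough, Spark-free illustration of what such a predicate has to compute, here is a plain-Python ray-casting sketch (Magellan's actual implementation differs and handles edge cases this toy ignores):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray from (x, y) toward +x and count
    how many polygon edges it crosses; an odd count means inside.
    `polygon` is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # X-coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # point inside the square
print(point_in_polygon(5, 2, square))  # point outside the square
```

A spatial join like the one above conceptually evaluates this test for every (point, polygon) pair; Magellan's Catalyst optimizations exist precisely to avoid doing that naively.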
Magellan on Spark Packages

What
– Brings geospatial analytics to big data, powered by Spark
– Available at Spark Packages: http://spark-packages.org/package/harsha2010/magellan

Key Features
– Parses geospatial data and metadata into shapes + metadata
– Python and Scala support
– Efficient geometric queries with simple and intuitive syntax
– Scalable implementations of common algorithms

Learn More
– Magellan blog: http://tinyurl.com/magellanBlog
– Sample Magellan notebook: http://tinyurl.com/zeppelinNotebooks
Magellan Walk-through
Maximizing Revenue for Uber Drivers

Data Insight Opportunity
– Uber publishes anonymized geospatial trip data
– The City of San Francisco has an active open data program: demographics, neighborhoods, traffic

Challenge
– In which neighborhood should a driver hang out to maximize revenue?

Solution
– Leverage Spark to transform and aggregate the data
– Use Magellan for the geospatial queries
[Notebook walk-through screenshots: uber-magellan-nb]
Hortonworks Data Platform
Spark + Hadoop: Perfect Together
Learn More: Spark + Hadoop, Perfect Together
– HDP Spark general info: http://hortonworks.com/hadoop/spark/
– Learn more about our focus on Spark: http://hortonworks.com/hadoop/spark/#section_6
– Get the HDP Spark 1.5.1 Tech Preview: http://hortonworks.com/hadoop/spark/#section_5
– Get started with Spark and Zeppelin; download the Sandbox: http://hortonworks.com/sandbox
– Try these tutorials: http://hortonworks.com/hadoop/spark/#tutorials
– Learn more about geospatial Spark processing with Magellan: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/