Apache Spark: killer or savior of Apache Hadoop?

Roman Shaposhnik, Director of Open Source @Pivotal (Twitter: @rhatr)

Upload: rhatr

Post on 10-May-2015


DESCRIPTION

The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?

TRANSCRIPT

Page 1: Apache Spark: killer or savior of Apache Hadoop?

Apache Spark: killer or savior of Apache Hadoop?

Roman Shaposhnik Director of Open Source @Pivotal

(Twitter: @rhatr)

Page 2: Apache Spark: killer or savior of Apache Hadoop?

Who’s this guy?

•  Director of Open Source (building a team of OS contributors)

•  Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc)

•  Used to be root@Cloudera

•  Used to be PHB@Yahoo! (original Hadoop team)

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)

Page 3: Apache Spark: killer or savior of Apache Hadoop?

Shameless plug

http://manning.com/martella

Page 4: Apache Spark: killer or savior of Apache Hadoop?

Dearly beloved…

Page 5: Apache Spark: killer or savior of Apache Hadoop?

40 minutes to figure out

Hadoop vs. Spark

Page 6: Apache Spark: killer or savior of Apache Hadoop?

40 minutes to figure out

Hadoop++ == Spark

Page 7: Apache Spark: killer or savior of Apache Hadoop?

40 minutes to figure out

Hadoop + Spark

Page 8: Apache Spark: killer or savior of Apache Hadoop?

40 minutes to figure out

Page 9: Apache Spark: killer or savior of Apache Hadoop?

Long, long time ago…

[Diagram: MapReduce running on top of HDFS. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

Page 10: Apache Spark: killer or savior of Apache Hadoop?

In a blink of an eye

[Diagram: the Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

Components shown: HDFS, YARN, Tachyon, MapReduce, Tez, Spark (Shark, Streaming, MLlib, GraphX), Pig, Hive, Crunch, Mahout, Giraph, Hamster, HBase, Phoenix, SolrCloud, Impala, HAWQ, MADlib, PivotalR, GemFire XD, SpringXD, Sqoop, Flume, Oozie, ZooKeeper, Coordination and workflow management, Hadoop UI, Hue, Command Center

Page 11: Apache Spark: killer or savior of Apache Hadoop?

A Spark view?

[Diagram: the same stack as seen from Spark. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

Components shown: HDFS, YARN, Tachyon, Spark (Shark, Streaming, MLlib, GraphX), HBase, Phoenix, SolrCloud, GemFire XD, SpringXD, Sqoop, Flume, Oozie, ZooKeeper, Coordination and workflow management, Hadoop UI, Hue, Command Center

Page 12: Apache Spark: killer or savior of Apache Hadoop?

BDAS

Page 13: Apache Spark: killer or savior of Apache Hadoop?

Principle #1

HDFS is the datalake

Page 14: Apache Spark: killer or savior of Apache Hadoop?

Your datacenter

server 1

server N

Page 15: Apache Spark: killer or savior of Apache Hadoop?

Hadoop’s view

MapReduce

server 1

server N

HDFS

Page 16: Apache Spark: killer or savior of Apache Hadoop?

HDFS: decoupled storage

… MR

HDFS

MR

Page 17: Apache Spark: killer or savior of Apache Hadoop?
Page 18: Apache Spark: killer or savior of Apache Hadoop?

Anatomy of MapReduce

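[Diagram: word count flowing HDFS → mappers → reducers → HDFS]

HDFS (input): "a c a" and "a b c"

mappers: (a 1, c 1, a 1) and (a 1, b 1, c 1)

shuffle: a → 1 1 1, b → 1, c → 1 1

reducers → HDFS (output): a 3, b 1, c 2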
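The three phases on this slide can be sketched with plain Scala collections. This is only an illustration, not Hadoop's API: the object and function names are made up for this sketch, and the two input lines ("a c a" and "a b c") are inferred from the mapper output shown on the slide.

```scala
object MapReduceSketch {
  // Map phase: each input line becomes a list of (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: group the emitted values by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => word -> kvs.map(_._2) }

  // Reduce phase: sum the grouped values for each key.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.sum }

  // The whole job: map every line, shuffle by word, reduce by summing.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("a c a", "a b c"))
    // Prints "a 3", "b 1", "c 2", one per line.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w $n") }
  }
}
```

In a real cluster each phase runs on different machines and the shuffle moves data over the network; here everything happens in one process.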

Page 19: Apache Spark: killer or savior of Apache Hadoop?

Principle #2

MR is assembly language

Page 20: Apache Spark: killer or savior of Apache Hadoop?

MapReduce 1.0

Job Tracker

Task Tracker (HDFS)

Task Tracker (HDFS)

task1 task1 task1 task1 task1

task1 task1 task1 task1 taskN

Page 21: Apache Spark: killer or savior of Apache Hadoop?

YARN (AKA MR2.0)

Resource Manager

Job Tracker

task1 task1 task1 task1 task1 Task Tracker

Page 22: Apache Spark: killer or savior of Apache Hadoop?

YARN (AKA MR2.0)

Resource Manager

Job Tracker

task1 task1 task1 task1 task1 Task Tracker

Page 23: Apache Spark: killer or savior of Apache Hadoop?

Principle #3

MR: YARN + library

Page 24: Apache Spark: killer or savior of Apache Hadoop?

What’s wrong with MR?

Source: UC Berkeley Spark project (just the image)

Page 25: Apache Spark: killer or savior of Apache Hadoop?

Principle #4

$ grep -R | awk | sort …

Page 26: Apache Spark: killer or savior of Apache Hadoop?

Spark philosophy

• Make life easy for Data Scientists

• Provide well documented and expressive APIs

• Powerful Domain Specific Libraries

• Easy integration with storage systems

• Caching to avoid data movement

• Well defined releases, stable API

Page 27: Apache Spark: killer or savior of Apache Hadoop?

Spark innovations

• Resilient Distributed Datasets (RDDs)

• Distributed on a cluster

• Manipulated via parallel operators (map, etc.)

• Automatically rebuilt on failure

• A parallel ecosystem

• A solution to iterative and multi-stage apps

Page 28: Apache Spark: killer or savior of Apache Hadoop?

RDDs

val warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1))

HadoopRDD: path = hdfs://

FilteredRDD: contains…

MappedRDD: split…
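The chain above (HadoopRDD → FilteredRDD → MappedRDD) is a lineage: each dataset remembers its parent and the function applied to it, so lost data can be recomputed rather than restored from a replica. Below is a toy model of that idea in plain Scala. `Dataset`, `Source`, `Filtered` and `Mapped` are hypothetical names for this sketch, not Spark classes, and the in-memory `log` stands in for a file on HDFS.

```scala
object LineageSketch {
  // A toy stand-in for an RDD: it stores no data, only how to compute it.
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage; `load` stands in for reading a file from HDFS.
  final case class Source[A](load: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = load()
  }

  // Remembers its parent and the predicate applied to it.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }

  // Remembers its parent and the function applied to it.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // Made-up log lines of the form "<level> <component>".
  val log: Seq[String] = Seq("warning disk", "info network", "warning cpu")

  // Same shape as the slide: filter on "warning", then take the second field.
  val warnings: Dataset[String] =
    Source(() => log).filter(_.contains("warning")).map(_.split(' ')(1))

  def main(args: Array[String]): Unit = {
    // Nothing is materialized until compute() walks the lineage.
    println(warnings.compute())
    // "Recovering" from a lost result is just walking the lineage again.
    println(warnings.compute())
  }
}
```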

Page 29: Apache Spark: killer or savior of Apache Hadoop?

Parallel operators

• map, reduce

• sample, filter

• groupBy, reduceByKey

• join, leftOuterJoin, rightOuterJoin

• union, cross
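To make the semantics of a few of these operators concrete, here is a sketch of `reduceByKey`, `join` and `leftOuterJoin` over ordinary Scala sequences of key/value pairs. The function names mirror the operator names on the slide, but the bodies are illustrative helpers written for this example, not Spark's implementation, and they run on a single machine.

```scala
object OperatorsSketch {
  // reduceByKey: merge all values that share a key with a binary function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // join: inner join of two keyed collections on their keys.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    left.flatMap { case (k, a) =>
      right.collect { case (k2, b) if k2 == k => (k, (a, b)) }
    }

  // leftOuterJoin: keep every left row; a missing right side becomes None.
  def leftOuterJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
    left.flatMap { case (k, a) =>
      val matches: Seq[Option[B]] = right.collect { case (k2, b) if k2 == k => Some(b) }
      val padded: Seq[Option[B]] = if (matches.isEmpty) Seq(None) else matches
      padded.map(m => (k, (a, m)))
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample data.
    val visits = Seq("ann" -> 2, "bob" -> 1, "ann" -> 3)
    val cities = Seq("ann" -> "Oslo")
    println(reduceByKey(visits)(_ + _))
    println(join(visits, cities))
    println(leftOuterJoin(visits, cities))
  }
}
```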

Page 30: Apache Spark: killer or savior of Apache Hadoop?

How do I use it?

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 31: Apache Spark: killer or savior of Apache Hadoop?

Principle #5

Memory is the new disk
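One way to see why this matters for iterative jobs: count how often the source has to be read. The sketch below is plain Scala, not Spark. `loadFromDisk` is a hypothetical stand-in for an expensive HDFS scan, and a `lazy val` plays the role of a dataset cached in memory after its first use.

```scala
object CacheSketch {
  // Counts how often the (pretend) disk is read.
  var diskReads = 0

  // Stand-in for an expensive scan of a file on HDFS.
  def loadFromDisk(): Vector[Double] = {
    diskReads += 1
    Vector(1.0, 2.0, 3.0, 4.0)
  }

  // An iterative job that touches the data once per iteration.
  def iterate(data: () => Vector[Double], iterations: Int): Double =
    (1 to iterations).foldLeft(0.0)((acc, _) => acc + data().sum)

  // Every iteration goes back to disk.
  def readsWithoutCache(iterations: Int): Int = {
    diskReads = 0
    iterate(() => loadFromDisk(), iterations)
    diskReads
  }

  // The first iteration reads from disk; later ones reuse the in-memory copy.
  def readsWithCache(iterations: Int): Int = {
    diskReads = 0
    lazy val cached = loadFromDisk()
    iterate(() => cached, iterations)
    diskReads
  }

  def main(args: Array[String]): Unit = {
    println(s"without cache: ${readsWithoutCache(10)} reads") // 10 reads
    println(s"with cache:    ${readsWithCache(10)} reads")    // 1 read
  }
}
```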

Page 32: Apache Spark: killer or savior of Apache Hadoop?

RDDs are the foundation

• SQL

• Graph

• ML

• Streaming

Page 33: Apache Spark: killer or savior of Apache Hadoop?

Spark SQL

• Library in Spark Core that models RDDs as relations

• SchemaRDD

• Replaces Shark

• Lightweight with no code from Hive

• Import/Export into different storage formats

• Columnar storage (as in Shark)

Page 34: Apache Spark: killer or savior of Apache Hadoop?

Spark Streaming

• Extend Spark to do large scale stream processing

• Simple, batch like API with RDDs

• Single semantics for both real time and high latency

Page 35: Apache Spark: killer or savior of Apache Hadoop?

D-Streams

Page 36: Apache Spark: killer or savior of Apache Hadoop?

Streaming from Twitter

TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
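The idea behind D-Streams is that a stream is chopped into small batches, and windowed operations are ordinary batch computations over the most recent few. A rough sketch in plain Scala follows. The `countByWindow` here is a hypothetical helper that borrows the name from the slide, not the Spark Streaming method; its window is measured in batches rather than seconds, and the sample tweets are made up.

```scala
object DStreamSketch {
  // One Batch per batch interval (say, one second of tweets).
  type Batch = Seq[String]

  // For each batch interval, count the matching items seen in the
  // last `window` batches (fewer at the start of the stream).
  def countByWindow(batches: Seq[Batch], window: Int)(p: String => Boolean): Seq[Int] = {
    val perBatch = batches.map(_.count(p))
    perBatch.indices.map { i =>
      perBatch.slice(math.max(0, i - window + 1), i + 1).sum
    }
  }

  def main(args: Array[String]): Unit = {
    val tweets: Seq[Batch] = Seq(
      Seq("Spark rocks", "hello"),
      Seq.empty[String],
      Seq("I use Spark", "Spark!"),
      Seq("nope")
    )
    // Sliding count over the last 2 batches of tweets mentioning "Spark".
    println(countByWindow(tweets, 2)(_.contains("Spark")))
  }
}
```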

Page 37: Apache Spark: killer or savior of Apache Hadoop?

Spark GraphX

• Pregel (BSP) (formerly known as Bagel)

• Graph-centric modeling

• Unification of processing

• No more MR trickery
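The Pregel (BSP) model mentioned above proceeds in supersteps: every vertex sends messages to its neighbours, then updates its own state from what it received, and the computation stops once nothing changes. The following toy version in plain Scala labels connected components that way. It is a sketch of the model only; GraphX's actual Pregel API looks different, and all names here are invented for the example.

```scala
object PregelSketch {
  type VertexId = Int

  // Each vertex starts labelled with its own id. In every superstep a vertex
  // sends its label to its neighbours and adopts the smallest label it has
  // seen. When no label changes, vertices in one component share a label.
  def connectedComponents(
      vertices: Set[VertexId],
      edges: Seq[(VertexId, VertexId)]
  ): Map[VertexId, VertexId] = {
    // Treat edges as undirected: index neighbours in both directions.
    val neighbours: Map[VertexId, Seq[VertexId]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    @annotation.tailrec
    def superstep(labels: Map[VertexId, VertexId]): Map[VertexId, VertexId] = {
      // Message passing: (destination, label) pairs.
      val messages = labels.toSeq.flatMap { case (v, label) =>
        neighbours.getOrElse(v, Seq.empty[VertexId]).map(n => (n, label))
      }
      // Each vertex keeps only the smallest label in its inbox.
      val inbox = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
      val next = labels.map { case (v, label) => v -> math.min(label, inbox.getOrElse(v, label)) }
      if (next == labels) labels else superstep(next)
    }

    superstep(vertices.map(v => v -> v).toMap)
  }

  def main(args: Array[String]): Unit = {
    // Two components: {1, 2, 3} and {4, 5}.
    println(connectedComponents(Set(1, 2, 3, 4, 5), Seq((1, 2), (2, 3), (4, 5))))
  }
}
```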

Page 38: Apache Spark: killer or savior of Apache Hadoop?

You killed Apache Giraph?

Page 39: Apache Spark: killer or savior of Apache Hadoop?

MLbase

• Machine Learning toolset

• MATLAB for scale-out computing

• Built on Spark MLlib

• Classification, Regression, Collaborative Filtering, etc.

Page 40: Apache Spark: killer or savior of Apache Hadoop?

What is really happening?

[Diagram: the Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

Components shown: HDFS, YARN, Tachyon, MapReduce, Tez, Spark (Shark, Streaming, MLlib, GraphX), Pig, Hive, Crunch, Mahout, Giraph, Hamster, HBase, Phoenix, SolrCloud, Impala, HAWQ, MADlib, PivotalR, GemFire XD, SpringXD, Sqoop, Flume, Oozie, ZooKeeper, Coordination and workflow management, Hadoop UI, Hue, Command Center

Page 41: Apache Spark: killer or savior of Apache Hadoop?

Principle #6

Spark: the ecosystem

Page 42: Apache Spark: killer or savior of Apache Hadoop?

Maybe it's not so bad

server 1

server N

Page 43: Apache Spark: killer or savior of Apache Hadoop?

But HDFS/YARN are safe?

HDFS, Ceph, S3, NAS, etc.

New HDFS

New YARN

Page 44: Apache Spark: killer or savior of Apache Hadoop?

What is *really* going on?

• 2009 Research at UCB, written in Scala

• 2010 Open Sourced

• 2013 Accepted into Apache Incubator

• 2013 Databricks formed ($14M funding)

• 2014 Becomes TLP with ASF

• 2014 Spark 1.0 is out

• 2014 Databricks gets an extra $33M

Page 45: Apache Spark: killer or savior of Apache Hadoop?

Bigdata: brought to U by ASF

• >50% ML traffic

• 100-200 contributors across 25-35 companies

• More active than Hadoop

• Cross-pollination with other TLPs

Page 46: Apache Spark: killer or savior of Apache Hadoop?

Principle #7

Where Hadoop was in ’09

Page 47: Apache Spark: killer or savior of Apache Hadoop?

This is how hardening looks

Page 48: Apache Spark: killer or savior of Apache Hadoop?

What is Hadoop?

Hadoop != MR + HDFS

Page 49: Apache Spark: killer or savior of Apache Hadoop?

The ecosystem

• Apache HBase

• Apache Crunch, Pig, Hive and Phoenix

• Apache Giraph

• Apache Oozie

• Apache Mahout

• Apache Sqoop and Flume

Page 50: Apache Spark: killer or savior of Apache Hadoop?

Principle #8

Spark: an alternative backend

Page 51: Apache Spark: killer or savior of Apache Hadoop?

Spark is best for cloud

Page 52: Apache Spark: killer or savior of Apache Hadoop?

Principle #9

Memory is expensive

Page 53: Apache Spark: killer or savior of Apache Hadoop?

What’s new?

• True elasticity

• Resource partitioning

• Security

• Data marketplace

• Multi datacenter deployments

Page 54: Apache Spark: killer or savior of Apache Hadoop?

Hadoop Maturity

ETL Offload: Accommodate massive data growth with existing EDW investments

Data Lakes: Unify Unstructured and Structured Data Access

Big Data Apps: Build analytic-led applications impacting top line revenue

Data-Driven Enterprise

App Dev and Operational Management on HDFS

Data Architecture

Page 55: Apache Spark: killer or savior of Apache Hadoop?

Pivotal HD on Pivotal CF

• Enterprise PaaS Management System

• Flexible multi-language ‘buildpack’ architecture

• Deployed applications enjoy built-in services

• On-Premise Hadoop as a Service

• Single cluster deployment of Pivotal HD

• Developers instantly bind to shared Hadoop Clusters

• Speeds up time-to-value

Page 56: Apache Spark: killer or savior of Apache Hadoop?

Pivotal’s view

[Diagram: Pivotal's data platform. Labels shown: Data Science Platform, Data Scientists, Data Sources, End Users, App Dev / Ops, Application, Cluster Manager, MR, SparkSQL, MLbase, Streaming, Stream Server, MPP SQL, Tachyon/Gem, GemFireXD, Data Lake / HDFS / Virtual Storage, Hadoop HDFS, Isilon, ...ETC, Legacy Systems, Legacy]

Page 57: Apache Spark: killer or savior of Apache Hadoop?

Principle #10

The rumors of my death…

Page 58: Apache Spark: killer or savior of Apache Hadoop?

It will be called Hadoop

[Diagram: the Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

Components shown: HDFS, YARN, MapReduce, Tez, Spark (Shark, Streaming, MLlib, GraphX), Pig, Hive, Crunch, Mahout, Giraph, Hamster, HBase, Phoenix, SolrCloud, Impala, HAWQ, MADlib, PivotalR, GemFire with Tachyon, SpringXD, Sqoop, Flume, Oozie, ZooKeeper, Coordination and workflow management, Hadoop UI, Hue, Command Center

Page 59: Apache Spark: killer or savior of Apache Hadoop?

Spark recap

• Is it “Big Data”? (Yes)

• Is it “Hadoop”? (No)

• It’s one of those “in memory” things, right? (Yes)

• JVM, Java, Scala? (All)

• Is it real, or just another shiny technology with a long but ultimately small tail? (Yes and ?)

Page 60: Apache Spark: killer or savior of Apache Hadoop?

A NEW PLATFORM FOR A NEW ERA


Page 61: Apache Spark: killer or savior of Apache Hadoop?

Credits

• Wikipedia and Dilbert.com

• Apache Software Foundation

• Scott Deeg

• Milind Bhandarkar

• Susheel Kaushik

• Mak Gokhale

Page 62: Apache Spark: killer or savior of Apache Hadoop?

Questions?