Fast and Expressive Big Data Analytics with Python


Post on 31-Dec-2015



TRANSCRIPT

Page 1: Fast and Expressive Big Data Analytics with Python

Matei Zaharia

Fast and Expressive Big Data Analytics with Python

UC BERKELEY

spark-project.org

UC Berkeley / MIT

Page 2: Fast and Expressive Big Data Analytics with Python

What is Spark?

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
»In-memory computing primitives
»General computation graphs

Improves usability through:
»Rich APIs in Scala, Java, Python
»Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code

Page 3: Fast and Expressive Big Data Analytics with Python

Project History

Started in 2009, open sourced 2010

17 companies now contributing code
»Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …

Entered Apache incubator in June

Python API added in February

Page 4: Fast and Expressive Big Data Analytics with Python

An Expanding Stack

Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

Spark

Spark Streaming (real-time)

GraphX (graph)

Shark (SQL)

MLbase (machine learning)

More details: amplab.berkeley.edu

Page 5: Fast and Expressive Big Data Analytics with Python

This Talk

Spark programming model

Examples

Demo

Implementation

Trying it out

Page 6: Fast and Expressive Big Data Analytics with Python

Why a New Programming Model?

MapReduce simplified big data processing, but users quickly found two problems:

Programmability: tangle of map/reduce functions

Speed: MapReduce inefficient for apps that share data across multiple steps

»Iterative algorithms, interactive queries

Page 7: Fast and Expressive Big Data Analytics with Python

Data Sharing in MapReduce

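[Figure: iterative jobs (iter. 1, iter. 2, . . .) perform an HDFS read and an HDFS write around every iteration; interactive queries (query 1, query 2, query 3, . . .) each perform their own HDFS read of the input to produce result 1, result 2, result 3.]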

Slow due to data replication and disk I/O

Page 8: Fast and Expressive Big Data Analytics with Python

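What We’d Like

[Figure: iterations (iter. 1, iter. 2, . . .) pass data through distributed memory; for interactive use, the input goes through one-time processing into distributed memory, which then serves query 1, query 2, query 3, . . .]

10-100× faster than network and disk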

Page 9: Fast and Expressive Big Data Analytics with Python

Spark Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
»Collections of objects that can be stored in memory or disk across a cluster
»Built via parallel transformations (map, filter, …)
»Automatically rebuilt on failure
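The idea of writing a program as transformations on a dataset can be sketched in plain Python, without a cluster. The `ToyDataset` class below is hypothetical and is not part of Spark; it only illustrates that transformations such as map and filter can record what to do, with nothing computed until a result is requested.

```python
# Hypothetical single-machine illustration; ToyDataset is not a Spark class.
class ToyDataset:
    def __init__(self, source, steps=()):
        self.source = source      # the original input collection
        self.steps = steps        # recorded transformations, in order

    def map(self, func):
        # Record the transformation instead of running it now.
        return ToyDataset(self.source, self.steps + (("map", func),))

    def filter(self, pred):
        return ToyDataset(self.source, self.steps + (("filter", pred),))

    def collect(self):
        # Only here is the recorded chain applied to the source data.
        items = list(self.source)
        for kind, func in self.steps:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


nums = ToyDataset([1, 2, 3, 4, 5])
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()
print(result)  # [4, 16]
```

Because each dataset keeps its source and its list of steps, the same result can be produced again at any time, which is the property the "rebuilt on failure" bullet relies on.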

Page 10: Fast and Expressive Big Data Analytics with Python

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")  # Base RDD

errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD

messages = errors.map(lambda s: s.split("\t")[2])

messages.cache()

messages.filter(lambda s: "foo" in s).count()  # Action

messages.filter(lambda s: "bar" in s).count()

. . .

[Figure: the Driver sends tasks to three Workers and receives results; each Worker holds one block of the input (Block 1, Block 2, Block 3) and one cache (Cache 1, Cache 2, Cache 3).]

Result: full-text search of Wikipedia in 2 sec (vs 30 sec for on-disk data)

Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)
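To see what each step of the log-mining pipeline produces, here is a local stand-in that uses an ordinary Python list instead of an RDD. The sample log lines are made up for illustration, and the tab-separated layout (level, date, message) is an assumption based on the `split("\t")[2]` call in the example.

```python
# Local stand-in for the slide's pipeline, using a Python list instead of an RDD.
# The sample log lines are invented for illustration.
log_lines = [
    "ERROR\t2013-01-01\tfoo failed to start",
    "INFO\t2013-01-01\tfoo started",
    "ERROR\t2013-01-02\tbar timed out",
    "ERROR\t2013-01-03\tfoo and bar disagree",
]

# Keep only the error lines (the filter step).
errors = [s for s in log_lines if s.startswith("ERROR")]

# Field 2 (zero-based) of each tab-separated line is the message text (the map step).
messages = [s.split("\t")[2] for s in errors]  # stays in memory, like cache()

# Each query scans the in-memory messages rather than re-reading the log.
foo_count = sum(1 for s in messages if "foo" in s)
bar_count = sum(1 for s in messages if "bar" in s)
print(foo_count, bar_count)  # 2 2
```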

Page 11: Fast and Expressive Big Data Analytics with Python

Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

messages = textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split("\t")[2])

[Figure: lineage chain] HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)
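Lineage-based recovery can be sketched on a single machine. The `LineageDataset` class below is hypothetical and is not how Spark implements RDDs; it only shows that if every derived dataset remembers its parent and the function applied to it, a lost partition can be recomputed on demand.

```python
# Hypothetical sketch of lineage-based recovery; not Spark's implementation.
class LineageDataset:
    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # source data (only set on the base dataset)
        self.parent = parent          # dataset this one was derived from
        self.func = func              # per-partition function applied to the parent
        self.cache = {}               # partition index -> computed items

    def transform(self, func):
        return LineageDataset(None, parent=self, func=func)

    def compute(self, index):
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            items = list(self.partitions[index])
        else:
            # Follow the lineage: derive this partition from the parent's.
            items = self.func(self.parent.compute(index))
        self.cache[index] = items
        return items


log = LineageDataset([
    ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    ["ERROR\tweb\trefused"],
])
errors = log.transform(lambda part: [s for s in part if "ERROR" in s])
messages = errors.transform(lambda part: [s.split("\t")[2] for s in part])

before = messages.compute(0)
del messages.cache[0]         # simulate losing a cached partition
after = messages.compute(0)   # rebuilt from its parent via the recorded function
print(before, after)
```

Only the lost partition is recomputed; partition 1 is never touched by the recovery.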

Page 12: Fast and Expressive Big Data Analytics with Python

Example: Logistic Regression

Goal: find line separating two sets of points

[Figure: scatter plot of positive (+) and negative (–) points, showing a random initial line and the target separating line.]

Page 13: Fast and Expressive Big Data Analytics with Python

Example: Logistic Regression

import numpy
from numpy import exp

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    # Gradient of the logistic loss, summed over all points (labels p.y are +1 or -1)
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)
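The same map-then-reduce structure can be tried locally without Spark or NumPy. The sketch below uses only the standard library and a handful of made-up 2-D points; the helper names (`dot`, `point_gradient`, `add`) are hypothetical. It assumes labels of +1 and -1 and uses the gradient of the logistic loss log(1 + exp(-y · w·x)).

```python
import math
from functools import reduce

# Toy 2-D points (x, y) with labels y in {+1, -1}; invented for illustration.
points = [
    ((2.0, 1.0), 1), ((1.5, 2.0), 1), ((3.0, 2.5), 1),
    ((-1.0, -2.0), -1), ((-2.0, -1.5), -1), ((-2.5, -3.0), -1),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def point_gradient(w, x, y):
    # Gradient of log(1 + exp(-y * w.x)) with respect to w.
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return tuple(scale * xi for xi in x)

def add(a, b):
    return tuple(ai + bi for ai, bi in zip(a, b))

w = (0.0, 0.0)
for _ in range(20):
    # "map" computes one gradient per point, "reduce" sums them.
    gradient = reduce(add, map(lambda p: point_gradient(w, p[0], p[1]), points))
    w = tuple(wi - gi for wi, gi in zip(w, gradient))

# A point is on the correct side of the line when y * (w . x) is positive.
correct = sum(1 for x, y in points if y * dot(w, x) > 0)
print("Final w:", w, "correctly separated:", correct)
```

In the Spark version, only the map and reduce run on the cluster; the update of `w` happens in the driver program on each iteration.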

Page 14: Fast and Expressive Big Data Analytics with Python

Logistic Regression Performance

[Figure: bar chart of running time (s) versus number of iterations (1, 5, 10, 20, 30) for Hadoop and PySpark.]

Hadoop: 110 s / iteration

PySpark: first iteration 80 s, further iterations 5 s
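The per-iteration figures on this chart can be turned into total running times with a little arithmetic. This sketch assumes that the 110 s figure refers to Hadoop and the 80 s / 5 s figures to PySpark, as the chart layout suggests; the function names are hypothetical.

```python
# Per-iteration costs read off the chart annotations (attribution is assumed).
HADOOP_PER_ITERATION = 110      # seconds for every iteration
PYSPARK_FIRST_ITERATION = 80    # seconds for the first pass
PYSPARK_LATER_ITERATION = 5     # seconds for each further pass

def hadoop_total(iterations):
    return HADOOP_PER_ITERATION * iterations

def pyspark_total(iterations):
    # Valid for iterations >= 1: one slow first pass, then fast passes.
    return PYSPARK_FIRST_ITERATION + PYSPARK_LATER_ITERATION * (iterations - 1)

for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), pyspark_total(n))
```

At 30 iterations this gives 3300 s against 225 s, so the gap widens with every additional iteration.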

Page 15: Fast and Expressive Big Data Analytics with Python

Demo

Page 16: Fast and Expressive Big Data Analytics with Python

Supported Operators

map

filter

groupBy

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

flatMap

take

first

partitionBy

pipe

distinct

save

...
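For readers unfamiliar with the key-based operators, the sketch below imitates flatMap, groupByKey, and reduceByKey on plain Python lists. The helper functions are hypothetical single-machine stand-ins, not PySpark code, and they ignore partitioning entirely.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

# Local, single-machine stand-ins; these helpers are hypothetical, not PySpark.
def flat_map(func, items):
    # Apply func to each item and concatenate the resulting sequences.
    return list(chain.from_iterable(func(x) for x in items))

def group_by_key(pairs):
    # Collect all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(func, pairs):
    # Combine the values of each key with a two-argument function.
    return {k: reduce(func, vs) for k, vs in group_by_key(pairs).items()}

lines = ["to be or", "not to be"]
words = flat_map(lambda line: line.split(), lines)
pairs = [(word, 1) for word in words]  # the "map" step
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```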

Page 17: Fast and Expressive Big Data Analytics with Python

Spark Community

1000+ meetup members

60+ contributors

17 companies contributing

Page 18: Fast and Expressive Big Data Analytics with Python

This Talk

Spark programming model

Examples

Demo

Implementation

Trying it out

Page 19: Fast and Expressive Big Data Analytics with Python

Overview

Spark core is written in Scala

PySpark calls existing scheduler, cache and networking layer (2K-line wrapper)

No changes to Python

[Figure: architecture diagram. Your app uses PySpark on top of the Spark client; each of two Spark workers runs Python child processes.]

Page 20: Fast and Expressive Big Data Analytics with Python


Main PySpark author: Josh Rosen (cs.berkeley.edu/~joshrosen)

Page 21: Fast and Expressive Big Data Analytics with Python

Object Marshaling

Uses pickle library for both communication and cached data

»Much cheaper than Python objects in RAM

Lambda marshaling library by PiCloud
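Both points on this slide can be checked with the standard library. The sketch below pickles a list of made-up records, compares the size of the pickled bytes with a rough estimate of the live objects' footprint, and then shows that the standard `pickle` module cannot serialize a lambda, which is presumably why a separate lambda marshaling library is needed. The size estimate via `sys.getsizeof` is approximate and specific to CPython.

```python
import pickle
import sys

# A list of (float, float) records, similar in spirit to cached data points.
records = [(i * 0.5, i * 1.5) for i in range(1000)]

# Round trip: what would be sent between processes or kept in a cache.
blob = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

# Rough in-memory footprint: the list, each tuple, and each float object.
in_memory = sys.getsizeof(records) + sum(
    sys.getsizeof(t) + sum(sys.getsizeof(v) for v in t) for t in records
)
print(len(blob), in_memory)

# The standard pickle module stores functions by name, so a lambda fails.
try:
    pickle.dumps(lambda x: 2 * x)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)
```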

Page 22: Fast and Expressive Big Data Analytics with Python

Job Scheduler

Supports general operator graphs

Automatically pipelines functions

Aware of data locality and partitioning

[Figure: operator graph over datasets A to G connected by groupBy, map, union, and join, divided into Stage 1, Stage 2, and Stage 3; a legend marks cached data partitions.]
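Pipelining means that consecutive element-wise functions run back to back on each element instead of producing a full intermediate dataset between them. The sketch below illustrates the idea with Python generators; it is a hypothetical illustration, not Spark's scheduler code, and the `calls` list exists only to record the order in which the functions run.

```python
# Hypothetical illustration of pipelining; not Spark's scheduler code.
calls = []

def parse(line):
    calls.append(("parse", line))
    return int(line)

def double(n):
    calls.append(("double", n))
    return 2 * n

def pipelined(lines):
    # Generators chain the two functions so each element passes through both
    # before the next element is read; no intermediate list is built.
    parsed = (parse(line) for line in lines)
    return (double(n) for n in parsed)

result = list(pipelined(["1", "2", "3"]))
print(result)     # [2, 4, 6]
print(calls[:2])  # [('parse', '1'), ('double', 1)]
```

The recorded call order shows the first element being parsed and doubled before the second element is parsed at all.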

Page 23: Fast and Expressive Big Data Analytics with Python

Interoperability

Runs in standard CPython, on Linux / Mac

»Works fine with extensions, e.g. NumPy

Input from local file system, NFS, HDFS, S3

»Only text files for now

Works in IPython, including notebook

Works in doctests – see our tests!

Page 24: Fast and Expressive Big Data Analytics with Python

Getting Started

Visit spark-project.org for video tutorials, online exercises, docs

Easy to run in local mode (multicore), standalone clusters, or EC2

Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

Page 25: Fast and Expressive Big Data Analytics with Python

Getting Started

Easiest way to learn is the shell:

$ ./pyspark

>>> nums = sc.parallelize([1,2,3]) # make RDD from array

>>> nums.count()
3

>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]

Page 26: Fast and Expressive Big Data Analytics with Python

Conclusion

PySpark provides a fast and simple way to analyze big datasets from Python

Learn more or contribute at spark-project.org

Look for our training camp on August 29-30!

My email: [email protected]

Page 27: Fast and Expressive Big Data Analytics with Python

Behavior with Not Enough RAM

[Figure: bar chart of iteration time (s) versus % of working set in memory]

Cache disabled: 68.8 s
25%: 58.1 s
50%: 40.7 s
75%: 29.7 s
Fully cached: 11.5 s

Page 28: Fast and Expressive Big Data Analytics with Python

The Rest of the Stack

Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

Spark

Spark Streaming (real-time)

GraphX (graph)

Shark (SQL)

MLbase (machine learning)

More details: amplab.berkeley.edu

Page 29: Fast and Expressive Big Data Analytics with Python

Performance Comparison

[Figure: three bar charts.
SQL: response time (s) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem).
Streaming: throughput (MB/s/node) for Storm and Spark.
Graph: response time (min) for Hadoop, Giraph, GraphLab, GraphX.]