Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)Institute of Computer Science
University of Hildesheim, Germany
Apache Spark
Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Apache Spark 1 / 26
Outline
1. What is Spark
2. Working with Spark
   2.1 Spark Context
   2.2 Resilient Distributed Datasets
3. MLLib: Machine Learning with Spark
Spark Overview
Apache Spark is an open-source framework for large-scale data processing and analysis.
Main Ideas:
- Processing occurs where the data resides
- Avoids moving data over the network
- Works with the data in memory
Technical details:
- Written in Scala
- Works seamlessly with Java and Python
- Developed at UC Berkeley
Apache Spark Stack
Data platform: distributed file system / database
- Ex: HDFS, HBase, Cassandra

Execution environment: single machine or a cluster
- Ex: Standalone, EC2, YARN, Mesos

Spark Core: the Spark API

Spark Ecosystem: libraries of common algorithms
- Ex: MLLib, GraphX, Streaming
Apache Spark Ecosystem
How to use Spark
Spark can be used through:
- The Spark Shell
  - Available in Python and Scala
  - Useful for learning the framework
- Spark Applications
  - Available in Python, Java and Scala
  - For "serious" large-scale processing
2. Working with Spark
Working with Spark
Working with Spark requires accessing a Spark Context:
- Main entry point to the Spark API
- Already preconfigured in the Shell

Most of the work in Spark is a set of operations on Resilient Distributed Datasets (RDDs):
- Main data abstraction
- The data used and generated by the application is stored as RDDs
Spark Interactive Shell (Python)
$ ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkContext available as sc, SQLContext available as sqlContext.
>>>
Spark Context
The Spark Context is the main entry point to the Spark functionality.
- It represents the connection to a Spark cluster
- Allows creating RDDs
- Allows broadcasting variables to the cluster
- Allows creating accumulators
Resilient Distributed Datasets (RDDs)
A Spark application stores data as RDDs.

Resilient → if data in memory is lost, it can be recreated (fault tolerance)
Distributed → stored in memory across different machines
Dataset → data coming from a file or generated by the application

A Spark program is a sequence of operations on RDDs.

RDDs are immutable: operations on RDDs may create new RDDs but never change them.
Resilient Distributed Datasets (RDDs)

(figure: an RDD is a collection of data elements)
RDD elements can be stored on different machines (transparent to the developer), and the data can have various data types.
RDD Data types
An element of an RDD can be of any type as long as it is serializable.

Examples:
- Primitive data types: integers, characters, strings, floating point numbers, ...
- Sequences: lists, arrays, tuples, ...
- Pair RDDs: key-value pairs
- Serializable Scala/Java objects

A single RDD may have elements of different types. Some specific element types have additional functionality.
Example: Text file to RDD
File mydiary.txt:

    I had breakfast this morning.
    The coffee was really good.
    I didn't like the bread though.
    But I had cheese.
    Oh I love cheese.

Loaded into the RDD mydata, each line of the file becomes one RDD element.
RDD operations
There are two types of RDD operations:
- Actions: return a value based on the RDD. Examples:
  - count(): returns the number of elements in the RDD
  - first(): returns the first element of the RDD
  - take(n): returns an array with the first n elements of the RDD
- Transformations: create a new RDD based on the current one. Examples:
  - filter: returns the elements of an RDD which match a given criterion
  - map: applies a given function to each RDD element
  - reduce: aggregates all RDD elements with a given function (strictly speaking an action, since it returns a single value rather than an RDD)
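The split between lazy transformations and value-returning actions can be sketched in plain Python. The `ToyRDD` class below is made up purely for illustration (it is not the Spark API): transformations only build up a deferred computation, and nothing runs until an action is called.

```python
# A toy, single-machine sketch of the action/transformation split.
# ToyRDD and its methods are illustrative only, not Spark's API.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # deferred: nothing runs until an action

    # Transformations: return a new ToyRDD without touching the data
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    # Actions: evaluate the chain and return a plain value
    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]

    def take(self, n):
        return self._compute()[:n]

rdd = ToyRDD(lambda: ["a", "bb", "ccc"])
lengths = rdd.map(len)     # no computation happens here
print(lengths.take(2))     # the action triggers evaluation: [1, 2]
```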
Actions vs. Transformations
(figure: applying an Action to an RDD yields a Value; applying a Transformation to a Base RDD yields a New RDD)
Actions examples
With mydiary.txt loaded into the RDD mydata as before:

>>> mydata = sc.textFile("mydiary.txt")
>>> mydata.count()
5
>>> mydata.first()
u'I had breakfast this morning.'
>>> mydata.take(2)
[u'I had breakfast this morning.', u'The coffee was really good.']
Transformation examples
(figure: filter(lambda line: "I " in line) keeps the four lines of mydata that contain "I ", producing the RDD filtered; map(lambda line: line.upper()) upper-cases every element of filtered, producing the RDD filterMap)
Transformations examples
>>> filtered = mydata.filter(lambda line: "I " in line)
>>> filtered.count()
4
>>> filtered.take(4)
[u'I had breakfast this morning.',
 u"I didn't like the bread though.",
 u'But I had cheese.',
 u'Oh I love cheese.']

>>> filterMap = filtered.map(lambda line: line.upper())
>>> filterMap.count()
4
>>> filterMap.take(4)
[u'I HAD BREAKFAST THIS MORNING.',
 u"I DIDN'T LIKE THE BREAD THOUGH.",
 u'BUT I HAD CHEESE.',
 u'OH I LOVE CHEESE.']
Operations on specific types

Numeric RDDs have special operations:
- mean()
- min()
- max()
- ...
>>> linelens = mydata.map(lambda line: len(line))
>>> linelens.collect()
[29, 27, 31, 17, 17]
>>> linelens.mean()
24.2
>>> linelens.min()
17
>>> linelens.max()
31
>>> linelens.stdev()
6.0133185513491636
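The numbers above can be checked with plain Python (no Spark needed). One detail worth noting: Spark's stdev() is the population standard deviation, i.e. it divides by n rather than n − 1.

```python
# Plain-Python check of the numeric RDD results above.
# Spark's stdev() is the population standard deviation (divide by n).
import math

linelens = [29, 27, 31, 17, 17]      # len() of each line of mydiary.txt
mean = sum(linelens) / len(linelens)  # 24.2
var = sum((x - mean) ** 2 for x in linelens) / len(linelens)
stdev = math.sqrt(var)                # ~6.0133
print(mean, min(linelens), max(linelens), round(stdev, 4))
```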
Operations on Key-Value Pairs
Pair RDDs contain two-element tuples: (K, V)
- Keys and values can be of any type
- Extremely useful for implementing MapReduce algorithms

Examples of operations:
- groupByKey
- reduceByKey
- aggregateByKey
- sortByKey
- ...
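The semantics of the two most common pair operations can be sketched in plain Python over a list of (key, value) tuples. The helper names `group_by_key` and `reduce_by_key` are made up for illustration; they mimic what Spark's groupByKey and reduceByKey compute, without the distribution.

```python
# Plain-Python sketch of groupByKey / reduceByKey semantics
# (illustrative helpers, not the Spark API).
from collections import defaultdict
from functools import reduce

pairs = [("love", 1), ("ain't", 1), ("love", 2), ("crying", 1), ("love", 2)]

def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)  # e.g. 'love' maps to [1, 2, 2]

def reduce_by_key(pairs, f):
    # combine the values of each key two at a time, like reduceByKey
    return {k: reduce(f, vs) for k, vs in group_by_key(pairs).items()}

print(reduce_by_key(pairs, lambda x, y: x + y))
```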
Word Count Example
Map:
- Input: document-word list pairs
- Output: word-count pairs

  (d_k, "w_1, ..., w_m") ↦ [(w_i, c_i)]

Reduce:
- Input: word-(count list) pairs
- Output: word-count pairs

  (w_i, [c_i]) ↦ (w_i, Σ_{c ∈ [c_i]} c)
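The two steps above can be rendered directly in plain Python. This is a sketch only: the toy documents and the dict-based shuffle stand in for what a cluster would do.

```python
# Minimal plain-Python rendering of the word-count map and reduce steps.
from collections import Counter, defaultdict

docs = {"d1": "love ain't no stranger", "d2": "crying in the rain"}

# Map: each document emits (word, count) pairs
mapped = [(w, c) for text in docs.values()
                 for w, c in Counter(text.split()).items()]

# Shuffle: gather the count list of each word
shuffled = defaultdict(list)
for w, c in mapped:
    shuffled[w].append(c)

# Reduce: sum each word's count list
counts = {w: sum(cs) for w, cs in shuffled.items()}
print(counts)
```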
Word Count Example
Input documents:
(d1, "love ain't no stranger"), (d2, "crying in the rain"), (d3, "looking for love"), (d4, "I'm crying"), (d5, "the deeper the love"), (d6, "is this love"), (d7, "Ain't no love")

Mapper output, word-count pairs:
(love, 1), (ain't, 1), (stranger, 1), (crying, 1), (rain, 1), (looking, 1), (love, 2), (crying, 1), (deeper, 1), (this, 1), (love, 2), (ain't, 1)

Reducer output, after grouping by word:
(love, 5), (stranger, 1), (crying, 2), (ain't, 2), (rain, 1), (looking, 1), (deeper, 1), (this, 1)
Word Count on Spark
Word count as a chain of RDD transformations on mydata:
- flatMap splits each line into words: I, had, breakfast, this, morning., ...
- map pairs each word with a count of 1: (I,1), (had,1), (breakfast,1), (this,1), (morning.,1), ...
- reduceByKey sums the counts per word: (I,4), (had,2), (breakfast,1), (this,1), (morning.,1), ...
>>> counts = mydata.flatMap(lambda line: line.split(" ")) \
...                .map(lambda word: (word, 1)) \
...                .reduceByKey(lambda x, y: x + y)
ReduceByKey
.reduceByKey(lambda x, y: x + y)
reduceByKey works a little differently from the MapReduce reduce function:
- It takes a function f of two arguments and combines two values at a time associated with the same key
- f must be commutative: f(x, y) = f(y, x)
- f must be associative: f(f(x, y), z) = f(x, f(y, z))

Spark does not guarantee in which order the reduceByKey combinations are executed!
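A quick plain-Python experiment shows why these properties matter: folding the same values in two different orders gives the same result for a commutative, associative function like addition, but not for subtraction.

```python
# Why reduceByKey needs a commutative, associative combine function:
# Spark may pair up a key's values in any order. Fold the same values
# in two orders and compare.
from functools import reduce

values = [3, 1, 4, 1, 5]

add = lambda x, y: x + y  # commutative and associative: safe
sub = lambda x, y: x - y  # neither: unsafe for reduceByKey

left = reduce(add, values)                  # ((((3+1)+4)+1)+5)
right = reduce(add, list(reversed(values))) # ((((5+1)+4)+1)+3)
assert left == right == 14                  # same answer either way

# Subtraction depends on the order: -8 one way, -4 the other
print(reduce(sub, values), reduce(sub, list(reversed(values))))
```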
Considerations

Spark provides a much more efficient MapReduce implementation than Hadoop:
- Higher-level API
- In-memory storage (less I/O overhead)
- Chaining MapReduce operations is simplified: a sequence of MapReduce passes can be done in one job

Spark vs. Hadoop on training a logistic regression model:
Source: Apache Spark. https://spark.apache.org/
3. MLLib: Machine Learning with Spark
Overview
MLLib is Spark's machine learning library, containing implementations for:
- Computing basic statistics on datasets
- Classification and regression
- Collaborative filtering
- Clustering
- Feature extraction and dimensionality reduction
- Frequent pattern mining
- Optimization algorithms
Logistic Regression with MLLib
Import the necessary packages:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD

Read the data (LibSVM format):

dataset = MLUtils.loadLibSVMFile(sc,
    "data/mllib/sample_libsvm_data.txt")
Train the Model:
model = LogisticRegressionWithSGD.train(dataset)
Evaluate:

labelsAndPreds = dataset.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p) \
                         .count() / float(dataset.count())
print("Training Error = " + str(trainErr))
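To make the SGD in LogisticRegressionWithSGD less of a black box, here is a tiny pure-Python sketch of logistic regression trained by stochastic gradient descent. The toy one-feature dataset, learning rate, and step count are all made-up assumptions for illustration; MLLib's actual implementation is distributed and more elaborate.

```python
# Pure-Python sketch of logistic regression via SGD on a toy 1-d dataset
# (illustration only; not MLLib's implementation).
import math, random

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label)

w, b = 0.0, 0.0
lr = 0.1
random.seed(0)
for _ in range(1000):
    x, y = random.choice(data)                    # one example per step (SGD)
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))      # sigmoid prediction
    w -= lr * (p - y) * x                          # log-loss gradient step
    b -= lr * (p - y)

preds = [1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0
         for x, _ in data]
train_err = sum(p != y for p, (_, y) in zip(preds, data)) / float(len(data))
print("Training Error = " + str(train_err))  # 0.0 on this separable toy set
```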