Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)Institute of Computer Science
University of Hildesheim, Germany
Apache Spark
Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Apache Spark 1 / 26
Outline
1. What is Spark
2. Working with Spark
   2.1 Spark Context
   2.2 Resilient Distributed Datasets
3. MLLib: Machine Learning with Spark
Spark Overview
Apache Spark is an open-source framework for large-scale data processing and analysis.
Main Ideas:
- Processing occurs where the data resides
- Avoids moving data over the network
- Works with the data in memory
Technical details:
- Written in Scala
- Works seamlessly with Java and Python
- Developed at UC Berkeley
Apache Spark Stack
Data platform: distributed file system / database
- Ex: HDFS, HBase, Cassandra

Execution environment: single machine or a cluster
- Ex: Standalone, EC2, YARN, Mesos

Spark Core: the Spark API

Spark Ecosystem: libraries of common algorithms
- Ex: MLLib, GraphX, Streaming
Apache Spark Ecosystem
How to use Spark
Spark can be used through:
- The Spark Shell
  - Available in Python and Scala
  - Useful for learning the framework
- Spark Applications
  - Available in Python, Java and Scala
  - For "serious" large-scale processing
2. Working with Spark
Working with Spark
Working with Spark requires accessing a Spark Context:
- Main entry point to the Spark API
- Already preconfigured in the Shell

Most of the work in Spark is a set of operations on Resilient Distributed Datasets (RDDs):
- Main data abstraction
- The data used and generated by the application is stored as RDDs
Spark Interactive Shell (Python)
$ ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkContext available as sc, SQLContext available as sqlContext.
>>>
Spark Context
The Spark Context is the main entry point to the Spark functionality.
- It represents the connection to a Spark cluster
- Allows creating RDDs
- Allows broadcasting variables to the cluster
- Allows creating accumulators
Resilient Distributed Datasets (RDDs)
A Spark application stores data as RDDs.

Resilient → if data in memory is lost, it can be recreated (fault tolerance)
Distributed → stored in memory across different machines
Dataset → data coming from a file or generated by the application

A Spark program is a sequence of operations on RDDs.

RDDs are immutable: operations on RDDs may create new RDDs but never change them.
Resilient Distributed Datasets (RDDs)

(figure: an RDD is a collection of data elements)
RDD elements can be stored on different machines (transparent to the developer), and the data can have various data types.
RDD Data types
An element of an RDD can be of any type as long as it is serializable.

Examples:
- Primitive data types: integers, characters, strings, floating point numbers, ...
- Sequences: lists, arrays, tuples, ...
- Pair RDDs: key-value pairs
- Serializable Scala/Java objects

A single RDD may have elements of different types. Some specific element types have additional functionality.
Example: Text file to RDD
File mydiary.txt:

    I had breakfast this morning.
    The coffee was really good.
    I didn't like the bread though.
    But I had cheese.
    Oh I love cheese.

Loaded into the RDD mydata, each line of the file becomes one RDD element.
RDD operations
There are two types of RDD operations:
- Actions: return a value based on the RDD. Examples:
  - count(): returns the number of elements in the RDD
  - first(): returns the first element of the RDD
  - take(n): returns an array with the first n elements of the RDD
- Transformations: create a new RDD based on the current one. Examples:
  - filter: returns the elements of an RDD which match a given criterion
  - map: applies a given function to each RDD element
  - reduce: aggregates all RDD elements with a given function (strictly speaking an action, since it returns a single value rather than an RDD)
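The split between lazy transformations and value-returning actions can be sketched in plain Python. The `ToyRDD` class below is made up purely for illustration (it is not the Spark API): transformations only build up a deferred computation, and nothing runs until an action is called.

```python
# A toy, single-machine sketch of the action/transformation split.
# ToyRDD and its methods are illustrative only, not Spark's API.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # deferred: nothing runs until an action

    # Transformations: return a new ToyRDD without touching the data
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    # Actions: evaluate the chain and return a plain value
    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]

    def take(self, n):
        return self._compute()[:n]

rdd = ToyRDD(lambda: ["a", "bb", "ccc"])
lengths = rdd.map(len)     # no computation happens here
print(lengths.take(2))     # the action triggers evaluation: [1, 2]
```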
Actions vs. Transformations
(figure: applying an Action to an RDD yields a Value; applying a Transformation to a Base RDD yields a New RDD)
Actions examples
With mydiary.txt loaded into the RDD mydata as before:

>>> mydata = sc.textFile("mydiary.txt")
>>> mydata.count()
5
>>> mydata.first()
u'I had breakfast this morning.'
>>> mydata.take(2)
[u'I had breakfast this morning.', u'The coffee was really good.']
Transformation examples
(figure: filter(lambda line: "I " in line) keeps the four lines of mydata that contain "I ", producing the RDD filtered; map(lambda line: line.upper()) upper-cases every element of filtered, producing the RDD filterMap)
Transformations examples
>>> filtered = mydata.filter(lambda line: "I " in line)
>>> filtered.count()
4
>>> filtered.take(4)
[u'I had breakfast this morning.',
 u"I didn't like the bread though.",
 u'But I had cheese.',
 u'Oh I love cheese.']

>>> filterMap = filtered.map(lambda line: line.upper())
>>> filterMap.count()
4
>>> filterMap.take(4)
[u'I HAD BREAKFAST THIS MORNING.',
 u"I DIDN'T LIKE THE BREAD THOUGH.",
 u'BUT I HAD CHEESE.',
 u'OH I LOVE CHEESE.']
Operations on specific types

Numeric RDDs have special operations:
- mean()
- min()
- max()
- ...
>>> linelens = mydata.map(lambda line: len(line))
>>> linelens.collect()
[29, 27, 31, 17, 17]
>>> linelens.mean()
24.2
>>> linelens.min()
17
>>> linelens.max()
31
>>> linelens.stdev()
6.0133185513491636
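The numbers above can be checked with plain Python (no Spark needed). One detail worth noting: Spark's stdev() is the population standard deviation, i.e. it divides by n rather than n − 1.

```python
# Plain-Python check of the numeric RDD results above.
# Spark's stdev() is the population standard deviation (divide by n).
import math

linelens = [29, 27, 31, 17, 17]      # len() of each line of mydiary.txt
mean = sum(linelens) / len(linelens)  # 24.2
var = sum((x - mean) ** 2 for x in linelens) / len(linelens)
stdev = math.sqrt(var)                # ~6.0133
print(mean, min(linelens), max(linelens), round(stdev, 4))
```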
Operations on Key-Value Pairs
Pair RDDs contain two-element tuples: (K, V)
- Keys and values can be of any type
- Extremely useful for implementing MapReduce algorithms

Examples of operations:
- groupByKey
- reduceByKey
- aggregateByKey
- sortByKey
- ...
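The semantics of the two most common pair operations can be sketched in plain Python over a list of (key, value) tuples. The helper names `group_by_key` and `reduce_by_key` are made up for illustration; they mimic what Spark's groupByKey and reduceByKey compute, without the distribution.

```python
# Plain-Python sketch of groupByKey / reduceByKey semantics
# (illustrative helpers, not the Spark API).
from collections import defaultdict
from functools import reduce

pairs = [("love", 1), ("ain't", 1), ("love", 2), ("crying", 1), ("love", 2)]

def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)  # e.g. 'love' maps to [1, 2, 2]

def reduce_by_key(pairs, f):
    # combine the values of each key two at a time, like reduceByKey
    return {k: reduce(f, vs) for k, vs in group_by_key(pairs).items()}

print(reduce_by_key(pairs, lambda x, y: x + y))
```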
Word Count Example
Map:
- Input: document-word list pairs
- Output: word-count pairs

  (d_k, "w_1, ..., w_m") ↦ [(w_i, c_i)]

Reduce:
- Input: word-(count list) pairs
- Output: word-count pairs

  (w_i, [c_i]) ↦ (w_i, Σ_{c ∈ [c_i]} c)
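The two steps above can be rendered directly in plain Python. This is a sketch only: the toy documents and the dict-based shuffle stand in for what a cluster would do.

```python
# Minimal plain-Python rendering of the word-count map and reduce steps.
from collections import Counter, defaultdict

docs = {"d1": "love ain't no stranger", "d2": "crying in the rain"}

# Map: each document emits (word, count) pairs
mapped = [(w, c) for text in docs.values()
                 for w, c in Counter(text.split()).items()]

# Shuffle: gather the count list of each word
shuffled = defaultdict(list)
for w, c in mapped:
    shuffled[w].append(c)

# Reduce: sum each word's count list
counts = {w: sum(cs) for w, cs in shuffled.items()}
print(counts)
```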
Word Count Example
Input documents:
(d1, "love ain't no stranger"), (d2, "crying in the rain"), (d3, "looking for love"), (d4, "I'm crying"), (d5, "the deeper the love"), (d6, "is this love"), (d7, "Ain't no love")

Mapper output, word-count pairs:
(love, 1), (ain't, 1), (stranger, 1), (crying, 1), (rain, 1), (looking, 1), (love, 2), (crying, 1), (deeper, 1), (this, 1), (love, 2), (ain't, 1)

Reducer output, after grouping by word:
(love, 5), (stranger, 1), (crying, 2), (ain't, 2), (rain, 1), (looking, 1), (deeper, 1), (this, 1)
Word Count on Spark
Word count as a chain of RDD transformations on mydata:
- flatMap splits each line into words: I, had, breakfast, this, morning., ...
- map pairs each word with a count of 1: (I,1), (had,1), (breakfast,1), (this,1), (morning.,1), ...
- reduceByKey sums the counts per word: (I,4), (had,2), (breakfast,1), (this,1), (morning.,1), ...
>>> counts = mydata.flatMap(lambda line: line.split(" ")) \
...                .map(lambda word: (word, 1)) \
...                .reduceByKey(lambda x, y: x + y)
ReduceByKey
.reduceByKey(lambda x, y: x + y)
reduceByKey works a little differently from the MapReduce reduce function:
- It takes a function f of two arguments and combines two values at a time associated with the same key
- f must be commutative: f(x, y) = f(y, x)
- f must be associative: f(f(x, y), z) = f(x, f(y, z))

Spark does not guarantee in which order the reduceByKey combinations are executed!
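A quick plain-Python experiment shows why these properties matter: folding the same values in two different orders gives the same result for a commutative, associative function like addition, but not for subtraction.

```python
# Why reduceByKey needs a commutative, associative combine function:
# Spark may pair up a key's values in any order. Fold the same values
# in two orders and compare.
from functools import reduce

values = [3, 1, 4, 1, 5]

add = lambda x, y: x + y  # commutative and associative: safe
sub = lambda x, y: x - y  # neither: unsafe for reduceByKey

left = reduce(add, values)                  # ((((3+1)+4)+1)+5)
right = reduce(add, list(reversed(values))) # ((((5+1)+4)+1)+3)
assert left == right == 14                  # same answer either way

# Subtraction depends on the order: -8 one way, -4 the other
print(reduce(sub, values), reduce(sub, list(reversed(values))))
```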
Considerations

Spark provides a much more efficient MapReduce implementation than Hadoop:
- Higher-level API
- In-memory storage (less I/O overhead)
- Chaining MapReduce operations is simplified: a sequence of MapReduce passes can be done in one job

Spark vs. Hadoop on training a logistic regression model:
Source: Apache Spark. https://spark.apache.org/
3. MLLib: Machine Learning with Spark
Overview
MLLib is Spark's machine learning library, containing implementations for:
- Computing basic statistics on datasets
- Classification and regression
- Collaborative filtering
- Clustering
- Feature extraction and dimensionality reduction
- Frequent pattern mining
- Optimization algorithms
Logistic Regression with MLLib
Import the necessary packages:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD

Read the data (LibSVM format):

dataset = MLUtils.loadLibSVMFile(sc,
    "data/mllib/sample_libsvm_data.txt")
Train the Model:
model = LogisticRegressionWithSGD.train(dataset)
Evaluate:

labelsAndPreds = dataset.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p) \
                         .count() / float(dataset.count())
print("Training Error = " + str(trainErr))
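To make the SGD in LogisticRegressionWithSGD less of a black box, here is a tiny pure-Python sketch of logistic regression trained by stochastic gradient descent. The toy one-feature dataset, learning rate, and step count are all made-up assumptions for illustration; MLLib's actual implementation is distributed and more elaborate.

```python
# Pure-Python sketch of logistic regression via SGD on a toy 1-d dataset
# (illustration only; not MLLib's implementation).
import math, random

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label)

w, b = 0.0, 0.0
lr = 0.1
random.seed(0)
for _ in range(1000):
    x, y = random.choice(data)                    # one example per step (SGD)
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))      # sigmoid prediction
    w -= lr * (p - y) * x                          # log-loss gradient step
    b -= lr * (p - y)

preds = [1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0
         for x, _ in data]
train_err = sum(p != y for p, (_, y) in zip(preds, data)) / float(len(data))
print("Training Error = " + str(train_err))  # 0.0 on this separable toy set
```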