TRANSCRIPT
BIG DATA PROCESSING WITH APACHE SPARK
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
WHAT IS BIG DATA?
Terabytes of Data
Petabytes of Data
Exabytes of Data
Yottabytes of Data
Brontobytes of Data
Geobytes of Data
WHERE DOES BIG DATA COME FROM?
A huge amount of data is created every day!
It comes from Us!
Non-digitized processes are becoming digitized
The Digital India Programme aims to transform India into a digitally empowered
society and knowledge economy
EXAMPLES OF DIGITIZATION
Online banking
Online shopping
E-learning
Emails
Social media
Decreases in the cost of storage and data-capture technology have opened up a new era of data revolution
TRENDS IN BIG DATA
Digitalization of virtually everything: e.g. One’s personal life
DATA TYPES
Structured: Database, Data warehouse, Enterprise systems
Unstructured: Analog data, GPS tracking, Audio/Video streams, Text files
Semi-Structured: XML, Email, EDI
KEY ENABLERS OF BIG DATA
Increase in storage capacities
Increase in processing power
Availability of Data
FEATURES OF BIG DATA GENERATED
Digitally generated
Passively produced
Automatically collected
Geographically or temporally trackable
Continuously analyzed
DIMENSIONS OF BIG DATA
Volume: Every minute, 72 hours of video are uploaded to YouTube
Variety: Excel tables & databases (structured); pure text, photo, audio, video, web, GPS data, sensor data, documents, SMS, etc. New data formats for new applications
Velocity: Batch processing is not always possible as data is streamed
Veracity/variability: Uncertainty inherent within some types of data
Value: Economic/business value of different data may vary
CHALLENGES IN BIG DATA
Capture
Storage
Search
Sharing
Transfer
Analysis
Visualization
NEED FOR BIG DATA ANALYTICS
Big Data needs to be captured, stored, organized and analyzed
It is large & complex: it cannot be managed with current methodologies or data mining tools
THEN:
Data warehousing, data mining & database technologies
Did not analyze email, PDF and video files
Prediction based on data
NOW:
Works with huge amounts of data
Analyzes semi-structured and unstructured data
Accesses and stores all the huge data created
BIG DATA ANALYTICS
Big Data analytics refers to tools and methodologies that aim to transform massive quantities of raw data into “data about data” for analytical purposes
Discovery of meaningful patterns in data
Used for decision making
EXCAVATING HIDDEN TREASURES FROM BIG DATA
Insights into data can provide business advantage
Some key early indications can mean fortunes to business
More precise analysis with more data
Integrate Big Data with traditional data: Enhance business intelligence analysis
UNSTRUCTURED DATA TYPES
Email and other forms of electronic communication
Web-based content (click streams, social media)
Digitized audio and video
Machine-generated data (RFID, GPS, sensor-generated data, log files) and IoT
APPLICATIONS OF BIG DATA ANALYSIS
Business: Customer personalization, customer needs
Technology: Reduce process time
Health: DNA mining to detect hereditary diseases
Smart cities: Cities with good economic development and high quality of life could be analyzed
Oil and Gas: Analyzing sensor-generated data for production optimization, cost management, risk management, and healthy and safe drilling
Telecommunications: Network analytics and optimization from device, sensor and GPS to enhance social and promotion opportunities
OPPORTUNITIES BIG DATA OFFERS
Early warning
Real-time awareness
Real-time feedback
CHALLENGES IN BIG DATA
Heterogeneity and incompleteness
Scale
Timeliness
Privacy
Human collaboration
BIG DATA AND CLOUD: CONVERGING TECHNOLOGIES
Big data: Extracting value out of “variety, velocity and volume” from unstructured information available
Cloud: On demand, elastic, scalable pay per use self service model
ANSWER THESE BEFORE MOVING TO BIG DATA ANALYSIS
Do you have an effective big data problem?
Can the business benefit from using Big Data?
Do your data volumes really require these distributed mechanisms?
TECHNOLOGY TO HANDLE BIG DATA
Google was the first company to effectively use big data
Engineers at Google created massively distributed systems
Collected and analyzed massive collections of web pages & relationships between them and created “Google Search Engine” capable of querying billions of pages
FIRST GENERATION OF DISTRIBUTED SYSTEMS
Proprietary
Custom Hardware and software
Centralized data
Hardware based fault recovery
Eg: Teradata, Netezza etc
SECOND GENERATION OF DISTRIBUTED SYSTEMS
Open source
Commodity hardware
Distributed data
Software based fault recovery
Eg : Hadoop, HPCC
APACHE HADOOP
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
HADOOP – KEY CHARACTERISTICS
HADOOP CORE COMPONENTS
HDFS ARCHITECTURE
SECONDARY NAMENODE
HADOOP CLUSTER ARCHITECTURE
HADOOP ECOSYSTEM
HADOOP CLUSTER MODES
MAP REDUCE PROGRAMMING
MAP REDUCE FLOW
EXISTING HADOOP CUSTOMERS
HADOOP VERSIONS
WHY WE NEED NEW GENERATION?
A lot has changed since 2000
Both hardware and software have gone through changes
Big data has become a necessity now
Let's look at what has changed over the decade
CHANGES IN HARDWARE
State of hardware in 2000:
Disk was cheap, so disk was the primary source of data
Network was costly, so data locality mattered
RAM was very costly
Single-core machines were dominant
State of hardware now:
RAM is the king: RAM is the primary source of data and disk is used as a fallback
Network is speedier
Multi-core machines are commonplace
SHORTCOMINGS OF SECOND GENERATION
Batch processing is primary objective
Not designed to change depending upon use cases
Tight coupling between API and run time
Do not exploit new hardware capabilities
Too complex
MAPREDUCE LIMITATIONS
If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence.
Each of those jobs has high latency, and none can start until the previous job has finished completely.
The output data of each step has to be stored in the distributed file system before the next step can begin.
Hence, this approach tends to be slow due to replication & disk storage (a Spark contrast is sketched below).
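As an illustrative contrast (not from the original slides), here is a minimal PySpark sketch of the same kind of multi-step job expressed as one pipeline, with no intermediate job output written to HDFS; the input path and threshold are hypothetical.
lines = sc.textFile("./bigdata.txt")                    # hypothetical input file
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)          # step 1: word count
frequent = counts.filter(lambda pair: pair[1] > 10)     # step 2: keep frequent words
frequent.take(5)                                        # intermediate data stays in memory, not on HDFS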
HADOOP VS SPARK
HADOOP:
Stores data on disk
Commodity hardware can be utilized
Uses replication to achieve fault tolerance
Speed of processing is less due to disk read/write
Supports only Java & R
Everything is just Map and Reduce
Data should be in HDFS
SPARK:
Stores data in memory (RAM)
Needs high-end systems with more RAM
Uses different data storage models to achieve fault tolerance (e.g., RDD)
100x faster than Hadoop
Supports Java, Python, R, Scala, etc.; ease of programming is high
Supports Map, Reduce, SQL, Streaming, etc.
Data can be in HDFS, Cassandra, HBase or S3; runs on Hadoop, Cloud, Mesos or standalone
THIRD GENERATION DISTRIBUTED SYSTEMS
Handle both batch processing and real time
Exploit RAM as much as disk
Multiple core aware
Do not reinvent the wheel
They use
HDFS for storage
Apache Mesos / YARN for distribution
Plays well with Hadoop
APACHE SPARK
Open source Big Data processing framework
Apache Spark started as a research project at UC Berkeley in the AMPLab (now Databricks), which focuses on big data analytics.
Open sourced in early 2010.
Many of the ideas behind the system are presented in various research papers.
SPARK TIMELINE
SPARK FEATURES
Spark gives us a comprehensive, unified framework
Manage big data processing requirements with data sets that are diverse in nature (text data, graph data, etc.)
and in source (batch vs. real-time streaming data)
Spark lets you quickly write applications in Java, Scala, or Python.
DIRECTED ACYCLIC GRAPH (DAG)
Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern.
It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data (see the sketch below).
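As a minimal sketch (the file path and column layout are hypothetical), the pipeline below is what Spark represents as a DAG; caching the intermediate RDD lets two different jobs share the same in-memory data.
events = sc.textFile("./events.csv")                       # hypothetical input
parsed = events.map(lambda line: line.split(",")).cache()  # shared in-memory intermediate result
by_user = parsed.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)
errors = parsed.filter(lambda f: f[2] == "ERROR")          # assumes a third column holding a status
by_user.take(5)     # job 1: walks the DAG, materializes and caches `parsed`
errors.count()      # job 2: reuses the cached `parsed` instead of re-reading the file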
UNIFIED PLATFORM FOR BIG DATA APPS
WHY UNIFICATION MATTERS?
Good for developers : One platform to learn
Good for users : Take apps every where
Good for distributors : More apps
UNIFICATION BRINGS ONE ABSTRACTION
All the different processing systems in Spark share the same abstraction, called the RDD
RDD stands for Resilient Distributed Dataset
As they share the same abstraction, you can mix and match different kinds of processing in the same application (see the sketch below)
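For example, a minimal sketch (Spark 1.x-era API, hypothetical data) of mixing plain RDD processing with Spark SQL in one application, since both work over the same RDD abstraction:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
people = sc.parallelize(['nithin,25', 'appu,40'])          # ordinary RDD processing
rows = people.map(lambda l: Row(name=l.split(',')[0], age=int(l.split(',')[1])))
df = sqlContext.createDataFrame(rows)
df.registerTempTable('people')
sqlContext.sql('SELECT name FROM people WHERE age > 30').collect()   # SQL on the same data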
SPAM DETECTION
RUNS EVERYWHERE
You can run Spark on top of any distributed system
It can run on
Hadoop 1.x
Hadoop 2.x
Apache Mesos
Its own cluster
It's just a user-space library
SMALL AND SIMPLE
Apache Spark is highly modular
The original version contained only 1,600 lines of Scala code
The Apache Spark API is extremely simple compared to the Java API of MapReduce
API is concise and consistent
SPARK ARCHITECTURE
DATA STORAGE
Spark uses HDFS file system for data storage purposes.
It works with any Hadoop compatible data source including HDFS, HBase, Cassandra, etc.
API
The API enables application developers to create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages.
RESOURCE MANAGEMENT
Spark can be deployed as a stand-alone server, or it can run on a distributed computing framework like Mesos or YARN
SPARK RUNNING ARCHITECTURE
SPARK RUNNING ARCHITECTURE
Connects to a cluster manager, which allocates resources across applications
Acquires executors on cluster nodes: worker processes that run computations and store data
Sends app code to the executors
Sends tasks for the executors to run (a minimal sketch of the connection step follows)
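A minimal sketch of that connection step; the application name and master URL are hypothetical (it could equally be 'local[*]', 'yarn-client' or 'mesos://...'):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("demo")                  # hypothetical application name
        .setMaster("spark://master:7077"))   # hypothetical standalone cluster manager URL
sc = SparkContext(conf=conf)                 # connects to the cluster manager and acquires executors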
SPARK RUNNING ARCHITECTURE
[Diagram: the driver program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); …) runs in the Spark client (app master), which holds the RDD graph, scheduler, block tracker and shuffle tracker; it talks to the cluster manager, which launches Spark workers running task threads and block managers on top of HDFS, HBase, …]
SCHEDULING PROCESS
RDD Objects: build the DAG of operators (agnostic to operators, doesn't know about stages)
DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready
TaskScheduler: launches the tasks of each TaskSet via the cluster manager and retries failed or straggling tasks; if a stage fails, it is resubmitted
Worker: executes tasks on threads and stores and serves blocks through its block manager
RDD - RESILIENT DISTRIBUTED DATASET
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel
A big collection of data having the following properties (see the sketch below):
Immutable
Lazy evaluated
Cacheable
Type inferred
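A minimal sketch (hypothetical data) of the first three properties: a transformation returns a new RDD and leaves the original unchanged, and an RDD can be cached.
nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)   # returns a new RDD; `nums` itself is never modified
doubled.cache()                       # cacheable: keep the computed partitions in memory
nums.collect()                        # [1, 2, 3, 4, 5]
doubled.collect()                     # [2, 4, 6, 8, 10]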
RDD - RESILIENT DISTRIBUTED DATASET –TWO TYPES
Parallelized collections – take an existing Scala collection and run functions on it in parallel
Hadoop datasets / files – run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop
SPARK COMPONENTS & ECOSYSTEM
Spark driver (context)
Spark DAG scheduler
Cluster management systems: YARN, Apache Mesos
Data sources: In-memory, HDFS, NoSQL
ECOSYSTEM OF HADOOP & SPARK
CONTRIBUTORS PER MONTH TO SPARK
SPARK – STACK OVERFLOW ACTIVITY
IN MEMORY
In Spark, you can cache HDFS data in the main memory of worker nodes (see the sketch below)
Spark analysis can be executed directly on the in-memory data
Shuffling can also be done from memory
Fault tolerant
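A minimal sketch of caching HDFS data in worker memory; the HDFS path is hypothetical.
logs = sc.textFile("hdfs:///data/app/logs")          # hypothetical HDFS path
logs.cache()                                          # ask Spark to keep this RDD in executor memory
logs.filter(lambda line: "ERROR" in line).count()     # first action reads HDFS and fills the cache
logs.filter(lambda line: "WARN" in line).count()      # later analyses run against the in-memory data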
INTEGRATION WITH HADOOP
No separate storage layer
Integrates well with HDFS
Can run on Hadoop 1.0 and Hadoop 2.0 YARN
Excellent integration with ecosystem projects like Apache Hive, HBase, etc.
MULTI LANGUAGE API
Written in Scala but API is not limited to it
Offers API in
Scala
Java
Python
You can also do SQL using SparkSQL
SPARK – OPEN SOURCE ECOSYSTEM
SPARK SORT RECORD
PYTHON EXAMPLES
WRITE
f = open('demo.txt', 'a')
f.write('I am trying to write a file')
f.close()
READ
f = open('demo.txt', 'r')
data = f.read()
print(data)
RDD CREATION – FROM COLLECTIONS
Creating a Collection:
A = range(1, 100000)
print(A)
Creating an RDD from the Collection:
raw_data = sc.parallelize(A)
Count the number of elements in the RDD:
raw_data.count()
To view the sample data:
raw_data.take(5)
RDD CREATION – FROM FILES
Getting the data file:
import urllib
f = urllib.urlretrieve("https://sparksarith.azurewebsites.net/Sarith/test.csv", "tv.csv")
Creating an RDD from a file:
data_file = "./tv.csv"
raw_data = sc.textFile(data_file)
Count the number of lines in the loaded file:
raw_data.count()
To view the sample data:
raw_data.take(5)
IMMUTABILITY
Immutability means that once created, it never changes
Big data is by default immutable in nature
Immutability helps to parallelize and to cache
const int a = 0;  // immutable
int b = 0;        // mutable
b++;              // in-place update
c = a + 1;        // copy
Immutability is about value not about reference
IMMUTABILITY IN COLLECTIONS
Mutable:
var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) { collection[i] += 1; }
Uses a loop for updating; the collection is updated in place
Immutable:
val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)
Uses a transformation for change; creates a new copy of the collection, leaving the original intact
CHALLENGES OF IMMUTABILITY
Immutability is great for parallelism but not good for space
Doing multiple transformations results in:
Multiple copies of data
Multiple passes over data
In big data, multiple copies and multiple passes will have poor performance characteristics.
LET’S GET LAZY
Laziness means not computing a transformation until it is needed
Laziness defers evaluation
Laziness allows separating execution from evaluation (see the sketch below)
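A minimal sketch (hypothetical data) of deferred evaluation: nothing runs until an action is called.
nums = sc.parallelize(range(1, 1001))
squares = nums.map(lambda x: x * x)            # transformation: nothing is computed yet
evens = squares.filter(lambda x: x % 2 == 0)   # still nothing computed
evens.count()                                  # action: the whole chain executes now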
LAZINESS AND IMMUTABILITY
You can be lazy only if the underneath data is immutable
You cannot combine transformations if a transformation has side effects
So combining laziness and immutability gives better performance and distributed processing.
CHALLENGES OF LAZINESS
Laziness poses challenges in terms of data type
If laziness defers execution, determining the type of the variable becomes challenging
If we can't determine the right type, semantic issues can slip through
Running big data programs and getting semantic errors is not fun.
TRANSFORMATIONS
Transformations are operations on an RDD that return a new RDD
By using the map transformation in Spark, we can apply a function to every element in our RDD
Collect will get all the elements in the RDD into memory to work with them
csv_data = raw_data.map(lambda x: x.split(','))
all_data = csv_data.collect()
all_data
len(all_data)
SET OPERATIONS ON RDD
Spark supports many of the operations we have on mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets
Union of RDDs doesn't remove duplicates
a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 6]
dist_a = sc.parallelize(a)
dist_b = sc.parallelize(b)
subtract_data = dist_a.subtract(dist_b)
subtract_data.take(10)
union_data = dist_a.union(dist_b)
union_data.take(10)       # [1, 2, 3, 4, 5, 1, 2, 3, 6]
distinct_data = union_data.distinct()
distinct_data.take(10)    # [2, 4, 6, 1, 3, 5]
KEY VALUE PAIRS - RDD
Spark provides specific functions to deal with RDDs whose elements are key/value pairs
They are commonly used for grouping and aggregations
data = ['nithin,25', 'appu,40', 'anil,20', 'nithin,35', 'anil,30', 'anil,50']
raw_data = sc.parallelize(data)
raw_data.collect()
key_value = raw_data.map(lambda line: (line.split(',')[0], int(line.split(',')[1])))
grouped_data = key_value.reduceByKey(lambda x, y: x + y)
grouped_data.collect()
grouped_data.keys().collect()
grouped_data.values().collect()
sorted_data = grouped_data.sortByKey()
sorted_data.collect()
CACHING
Immutable data allows you to cache data for a long time
Lazy transformations allow data to be recreated on failure
Transformations can also be saved
Caching data improves execution engine performance
raw_data.cache()
raw_data.collect()
SAVING YOUR DATA
saveAsTextFile(path) is used for storing the RDD on your hard disk
The path is a directory, and Spark will output multiple files under that directory; this allows Spark to write the output from multiple nodes
raw_data.saveAsTextFile('opt')
SPARK EXECUTION MODEL
Create DAG of RDDs to represent computation
Create logical execution plan for DAG
Schedule and execute individual tasks
STEP 1: CREATE RDDS
DEPENDENCY TYPES
“Narrow” dependencies: map, filter; union; join with inputs co-partitioned
“Wide” (shuffle) dependencies: groupByKey; join with inputs not co-partitioned
STEP 2: CREATE EXECUTION PLAN
Pipeline as much as possible
Split into “stages” based on the need to reorganize data (see the sketch below)
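One way to see those stage boundaries is the lineage string Spark exposes; a sketch, reusing the input path from the earlier example, where the wide dependency introduced by reduceByKey starts a new stage.
pairs = sc.textFile("./bigdata.txt").map(lambda line: (line.split(',')[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide (shuffle) dependency, hence a new stage
print(counts.toDebugString())                    # prints the RDD lineage; indentation marks stage boundaries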
STEP 3: SCHEDULE TASKS
Split each stage into tasks
A task is data + computation
Execute all tasks within a stage before moving on
SPARK SUBMIT
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
raw_data = sc.textFile("./bigdata.txt")
shows = raw_data.map(lambda line: (line.split(',')[4],1))
shows.take(5)
WHO ARE USING SPARK
SPARK INSTALLATION
INSTALL JDK
sudo apt-get install openjdk-7-jdk
INSTALL SCALA
sudo apt-get install scala
INSTALLING MAVEN
wget http://mirrors.sonic.net/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
tar -zxf apache-maven-3.3.3-bin.tar.gz
sudo cp -R apache-maven-3.3.3 /usr/local
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/bin/mvn
mvn -v
SPARK INSTALLATION
INSTALLING GIT
sudo apt-get install git
CLONE SPARK PROJECT FROM GITHUB
git clone https://github.com/apache/spark.git
cd spark
BUILD SPARK PROJECT
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
For starting the Spark cluster: ./sbin/start-all.sh
For starting the shell: ./bin/pyspark (a quick sanity check follows)
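Once the pyspark shell is up (it already provides sc), a quick sanity-check sketch, not part of the original slides:
sc.parallelize(range(10)).map(lambda x: x * 2).collect()   # should return [0, 2, 4, ..., 18]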
REFERENCES
1. “Data Mining and Data Warehousing”, M.Sudheep Elayidom, SOE, CUSAT
2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award.
3. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
4. “Apache Hadoop”, https://hadoop.apache.org/
5. “Apache Spark”, http://spark.apache.org/
6. “Spark: Cluster Computing with Working Sets”. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
CREDITS
Dr. M Sudheep Elayidom, Associate Professor, Div Of Computer Science & Engg, SOE, CUSAT
Nithink K Anil, Quantiph, Mumbai, Maharashtra, India
Lija Mohan, Div Of Computer Science & Engg, SOE, CUSAT