TRANSCRIPT
BIG DATA PROCESSING WITH APACHE SPARK
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
WHAT IS BIG DATA?
Terabytes of Data
Petabytes of Data
Exabytes of Data
Yottabytes of Data
Brontobytes of Data
Geobytes of Data
WHERE DOES BIG DATA COME FROM?
A huge amount of data is created every day!
It comes from Us!
Non-digitized processes are becoming digitized
The Digital India Programme aims to transform India into a digitally empowered
society and knowledge economy
EXAMPLES OF DIGITIZATION
Online banking
Online shopping
E-learning
Emails
Social media
Decreases in the cost of storage and data-capture technology have opened up a new era of data revolution
TRENDS IN BIG DATA
Digitalization of virtually everything: e.g. One’s personal life
DATA TYPES
Structured: Database, Data warehouse, Enterprise systems
Unstructured: Analog data, GPS tracking, Audio/Video streams, Text files
Semi-Structured: XML, Email, EDI
KEY ENABLERS OF BIG DATA
Increase in storage capacities
Increase in processing power
Availability of Data
FEATURES OF BIG DATA GENERATED
Digitally generated
Passively produced
Automatically collected
Geographically or temporally trackable
Continuously analyzed
DIMENSIONS OF BIG DATA
Volume: Every minute, 72 hours of video are uploaded to YouTube
Variety: Excel tables & databases (structured); pure text, photo, audio, video, web, GPS data, sensor data, documents, SMS, etc. New data formats for new applications
Velocity: Batch processing is not always possible as data is streamed
Veracity/variability: Uncertainty inherent within some types of data
Value: Economic/business value of different data may vary
CHALLENGES IN BIG DATA
Capture
Storage
Search
Sharing
Transfer
Analysis
Visualization
NEED FOR BIG DATA ANALYTICS
Big Data needs to be captured, stored, organized and analyzed
It is large & complex: it cannot be managed with current methodologies or data mining tools
THEN:
Data warehousing, data mining & database technologies
Did not analyze email, PDF and video files
Prediction based on data
NOW:
Works with huge amounts of data
Analyzes semi-structured and unstructured data
Accesses and stores all the huge data created
BIG DATA ANALYTICS
Big Data analytics refers to tools and methodologies that aim to transform massive quantities of raw data into “data about data” for analytical purposes
Discovery of meaningful patterns in data
Used for decision making
EXCAVATING HIDDEN TREASURES FROM BIG DATA
Insights into data can provide business advantage
Some key early indications can mean fortunes to business
More precise analysis with more data
Integrate Big Data with traditional data: Enhance business intelligence analysis
UNSTRUCTURED DATA TYPES
Email and other forms of electronic communication
Web-based content (click streams, social media)
Digitized audio and video
Machine-generated data (RFID, GPS, sensor-generated data, log files) and IoT
APPLICATIONS OF BIG DATA ANALYSIS
Business: Customer personalization, customer needs
Technology: Reduce process time
Health: DNA mining to detect hereditary diseases
Smart cities: Cities with good economic development and high quality of life could be analyzed
Oil and Gas: Analyzing sensor-generated data for production optimization, cost management, risk management, and healthy and safe drilling
Telecommunications: Network analytics and optimization from device, sensor and GPS to enhance social and promotion opportunities
OPPORTUNITIES BIG DATA OFFERS
Early warning
Real-time awareness
Real-time feedback
CHALLENGES IN BIG DATA
Heterogeneity and incompleteness
Scale
Timeliness
Privacy
Human collaboration
BIG DATA AND CLOUD: CONVERGING TECHNOLOGIES
Big data: Extracting value out of “variety, velocity and volume” from unstructured information available
Cloud: On demand, elastic, scalable pay per use self service model
ANSWER THESE BEFORE MOVING TO BIG DATA ANALYSIS
Do you have an effective big data problem?
Can the business benefit from using Big Data?
Do your data volumes really require these distributed mechanisms?
TECHNOLOGY TO HANDLE BIG DATA
Google was the first company to effectively use big data
Engineers at Google created massively distributed systems
Collected and analyzed massive collections of web pages & relationships between them and created “Google Search Engine” capable of querying billions of pages
FIRST GENERATION OF DISTRIBUTED SYSTEMS
Proprietary
Custom Hardware and software
Centralized data
Hardware based fault recovery
Eg: Teradata, Netezza etc
SECOND GENERATION OF DISTRIBUTED SYSTEMS
Open source
Commodity hardware
Distributed data
Software based fault recovery
Eg : Hadoop, HPCC
APACHE HADOOP
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
HADOOP – KEY CHARACTERISTICS
HADOOP CORE COMPONENTS
HDFS ARCHITECTURE
SECONDARY NAMENODE
HADOOP CLUSTER ARCHITECTURE
HADOOP ECOSYSTEM
HADOOP CLUSTER MODES
MAP REDUCE PROGRAMMING
MAP REDUCE FLOW
EXISTING HADOOP CUSTOMERS
HADOOP VERSIONS
WHY WE NEED NEW GENERATION?
A lot has changed since 2000
Both hardware and software have gone through changes
Big data has become a necessity now
Let's look at what has changed over the decade
CHANGES IN HARDWARE
State of hardware in 2000:
Disk was cheap, so disk was the primary source of data
Network was costly, so data locality mattered
RAM was very costly
Single-core machines were dominant
State of hardware now:
RAM is the king: RAM is the primary source of data and disk is used as a fallback
Network is speedier
Multi-core machines are commonplace
SHORTCOMINGS OF SECOND GENERATION
Batch processing is primary objective
Not designed to change depending upon use cases
Tight coupling between API and run time
Do not exploit new hardware capabilities
Too complex
MAPREDUCE LIMITATIONS
If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence.
Each of those jobs has high latency, and none can start until the previous job has finished completely.
The output data of each step has to be stored in the distributed file system before the next step can begin.
Hence, this approach tends to be slow due to replication & disk storage (a Spark contrast is sketched below).
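As an illustrative contrast (not from the original slides), here is a minimal PySpark sketch of the same kind of multi-step job expressed as one pipeline, with no intermediate job output written to HDFS; the input path and threshold are hypothetical.
lines = sc.textFile("./bigdata.txt")                    # hypothetical input file
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)          # step 1: word count
frequent = counts.filter(lambda pair: pair[1] > 10)     # step 2: keep frequent words
frequent.take(5)                                        # intermediate data stays in memory, not on HDFS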
HADOOP VS SPARK
HADOOP:
Stores data on disk
Commodity hardware can be utilized
Uses replication to achieve fault tolerance
Speed of processing is less due to disk read/write
Supports only Java & R
Everything is just Map and Reduce
Data should be in HDFS
SPARK:
Stores data in memory (RAM)
Needs high-end systems with more RAM
Uses different data storage models to achieve fault tolerance (e.g., RDD)
100x faster than Hadoop
Supports Java, Python, R, Scala, etc.; ease of programming is high
Supports Map, Reduce, SQL, Streaming, etc.
Data can be in HDFS, Cassandra, HBase or S3; runs on Hadoop, Cloud, Mesos or standalone
THIRD GENERATION DISTRIBUTED SYSTEMS
Handle both batch processing and real time
Exploit RAM as much as disk
Multiple core aware
Do not reinvent the wheel
They use
HDFS for storage
Apache Mesos / YARN for distribution
Plays well with Hadoop
APACHE SPARK
Open source Big Data processing framework
Apache Spark started as a research project at UC Berkeley in the AMPLab (now Databricks), which focuses on big data analytics.
Open sourced in early 2010.
Many of the ideas behind the system are presented in various research papers.
SPARK TIMELINE
SPARK FEATURES
Spark gives us a comprehensive, unified framework
Manage big data processing requirements with data sets that are diverse in nature (text data, graph data, etc.)
and in source (batch vs. real-time streaming data)
Spark lets you quickly write applications in Java, Scala, or Python.
DIRECTED ACYCLIC GRAPH (DAG)
Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern.
It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data (see the sketch below).
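As a minimal sketch (the file path and column layout are hypothetical), the pipeline below is what Spark represents as a DAG; caching the intermediate RDD lets two different jobs share the same in-memory data.
events = sc.textFile("./events.csv")                       # hypothetical input
parsed = events.map(lambda line: line.split(",")).cache()  # shared in-memory intermediate result
by_user = parsed.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)
errors = parsed.filter(lambda f: f[2] == "ERROR")          # assumes a third column holding a status
by_user.take(5)     # job 1: walks the DAG, materializes and caches `parsed`
errors.count()      # job 2: reuses the cached `parsed` instead of re-reading the file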
UNIFIED PLATFORM FOR BIG DATA APPS
WHY UNIFICATION MATTERS?
Good for developers : One platform to learn
Good for users : Take apps every where
Good for distributors : More apps
UNIFICATION BRINGS ONE ABSTRACTION
All the different processing systems in Spark share the same abstraction, called the RDD
RDD stands for Resilient Distributed Dataset
As they share the same abstraction, you can mix and match different kinds of processing in the same application (see the sketch below)
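For example, a minimal sketch (Spark 1.x-era API, hypothetical data) of mixing plain RDD processing with Spark SQL in one application, since both work over the same RDD abstraction:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
people = sc.parallelize(['nithin,25', 'appu,40'])          # ordinary RDD processing
rows = people.map(lambda l: Row(name=l.split(',')[0], age=int(l.split(',')[1])))
df = sqlContext.createDataFrame(rows)
df.registerTempTable('people')
sqlContext.sql('SELECT name FROM people WHERE age > 30').collect()   # SQL on the same data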
SPAM DETECTION
RUNS EVERYWHERE
You can run Spark on top of any distributed system
It can run on
Hadoop 1.x
Hadoop 2.x
Apache Mesos
Its own cluster
It's just a user-space library
SMALL AND SIMPLE
Apache Spark is highly modular
The original version contained only 1,600 lines of Scala code
The Apache Spark API is extremely simple compared to the Java API of MapReduce
API is concise and consistent
SPARK ARCHITECTURE
DATA STORAGE
Spark uses HDFS file system for data storage purposes.
It works with any Hadoop compatible data source including HDFS, HBase, Cassandra, etc.
API
The API enables application developers to create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages.
RESOURCE MANAGEMENT
Spark can be deployed as a stand-alone server, or it can run on a distributed computing framework like Mesos or YARN
SPARK RUNNING ARCHITECTURE
SPARK RUNNING ARCHITECTURE
Connects to a cluster manager, which allocates resources across applications
Acquires executors on cluster nodes: worker processes that run computations and store data
Sends app code to the executors
Sends tasks for the executors to run (a minimal sketch of the connection step follows)
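A minimal sketch of that connection step; the application name and master URL are hypothetical (it could equally be 'local[*]', 'yarn-client' or 'mesos://...'):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("demo")                  # hypothetical application name
        .setMaster("spark://master:7077"))   # hypothetical standalone cluster manager URL
sc = SparkContext(conf=conf)                 # connects to the cluster manager and acquires executors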
SPARK RUNNING ARCHITECTURE
[Diagram: the driver program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); …) runs in the Spark client (app master), which holds the RDD graph, scheduler, block tracker and shuffle tracker; it talks to the cluster manager, which launches Spark workers running task threads and block managers on top of HDFS, HBase, …]
SCHEDULING PROCESS
RDD Objects: build the DAG of operators (agnostic to operators, doesn't know about stages)
DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready
TaskScheduler: launches the tasks of each TaskSet via the cluster manager and retries failed or straggling tasks; if a stage fails, it is resubmitted
Worker: executes tasks on threads and stores and serves blocks through its block manager
RDD - RESILIENT DISTRIBUTED DATASET
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel
A big collection of data having the following properties (see the sketch below):
Immutable
Lazy evaluated
Cacheable
Type inferred
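A minimal sketch (hypothetical data) of the first three properties: a transformation returns a new RDD and leaves the original unchanged, and an RDD can be cached.
nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)   # returns a new RDD; `nums` itself is never modified
doubled.cache()                       # cacheable: keep the computed partitions in memory
nums.collect()                        # [1, 2, 3, 4, 5]
doubled.collect()                     # [2, 4, 6, 8, 10]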
RDD - RESILIENT DISTRIBUTED DATASET –TWO TYPES
Parallelized collections – take an existing Scala collection and run functions on it in parallel
Hadoop datasets / files – run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop
SPARK COMPONENTS & ECOSYSTEM
Spark driver (context)
Spark DAG scheduler
Cluster management systems: YARN, Apache Mesos
Data sources: In-memory, HDFS, NoSQL
ECOSYSTEM OF HADOOP & SPARK
CONTRIBUTORS PER MONTH TO SPARK
SPARK – STACK OVERFLOW ACTIVITY
IN MEMORY
In Spark, you can cache HDFS data in the main memory of worker nodes (see the sketch below)
Spark analysis can be executed directly on the in-memory data
Shuffling can also be done from memory
Fault tolerant
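A minimal sketch of caching HDFS data in worker memory; the HDFS path is hypothetical.
logs = sc.textFile("hdfs:///data/app/logs")          # hypothetical HDFS path
logs.cache()                                          # ask Spark to keep this RDD in executor memory
logs.filter(lambda line: "ERROR" in line).count()     # first action reads HDFS and fills the cache
logs.filter(lambda line: "WARN" in line).count()      # later analyses run against the in-memory data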
INTEGRATION WITH HADOOP
No separate storage layer
Integrates well with HDFS
Can run on Hadoop 1.0 and Hadoop 2.0 YARN
Excellent integration with ecosystem projects like Apache Hive, HBase, etc.
MULTI LANGUAGE API
Written in Scala but API is not limited to it
Offers API in
Scala
Java
Python
You can also do SQL using SparkSQL
SPARK – OPEN SOURCE ECOSYSTEM
SPARK SORT RECORD
PYTHON EXAMPLES
WRITE
f = open('demo.txt', 'a')
f.write('I am trying to write a file')
f.close()
READ
f = open('demo.txt', 'r')
data = f.read()
print(data)
RDD CREATION – FROM COLLECTIONS
Creating a Collection:
A = range(1, 100000)
print(A)
Creating an RDD from the Collection:
raw_data = sc.parallelize(A)
Count the number of elements in the RDD:
raw_data.count()
To view the sample data:
raw_data.take(5)
RDD CREATION – FROM FILES
Getting the data file:
import urllib
f = urllib.urlretrieve("https://sparksarith.azurewebsites.net/Sarith/test.csv", "tv.csv")
Creating an RDD from a file:
data_file = "./tv.csv"
raw_data = sc.textFile(data_file)
Count the number of lines in the loaded file:
raw_data.count()
To view the sample data:
raw_data.take(5)
IMMUTABILITY
Immutability means that once created, it never changes
Big data is by default immutable in nature
Immutability helps to parallelize and to cache
const int a = 0;  // immutable
int b = 0;        // mutable
b++;              // in-place update
c = a + 1;        // copy
Immutability is about value not about reference
IMMUTABILITY IN COLLECTIONS
Mutable:
var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) { collection[i] += 1; }
Uses a loop for updating; the collection is updated in place
Immutable:
val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)
Uses a transformation for change; creates a new copy of the collection, leaving the original intact
CHALLENGES OF IMMUTABILITY
Immutability is great for parallelism but not good for space
Doing multiple transformations results in:
Multiple copies of data
Multiple passes over data
In big data, multiple copies and multiple passes will have poor performance characteristics.
LET’S GET LAZY
Laziness means not computing a transformation until it is needed
Laziness defers evaluation
Laziness allows separating execution from evaluation (see the sketch below)
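A minimal sketch (hypothetical data) of deferred evaluation: nothing runs until an action is called.
nums = sc.parallelize(range(1, 1001))
squares = nums.map(lambda x: x * x)            # transformation: nothing is computed yet
evens = squares.filter(lambda x: x % 2 == 0)   # still nothing computed
evens.count()                                  # action: the whole chain executes now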
LAZINESS AND IMMUTABILITY
You can be lazy only if the underneath data is immutable
You cannot combine transformations if a transformation has side effects
So combining laziness and immutability gives better performance and distributed processing.
CHALLENGES OF LAZINESS
Laziness poses challenges in terms of data type
If laziness defers execution, determining the type of the variable becomes challenging
If we can't determine the right type, semantic issues can slip through
Running big data programs and getting semantic errors is not fun.
TRANSFORMATIONS
Transformations are operations on an RDD that return a new RDD
By using the map transformation in Spark, we can apply a function to every element in our RDD
Collect will get all the elements in the RDD into memory to work with them
csv_data = raw_data.map(lambda x: x.split(','))
all_data = csv_data.collect()
all_data
len(all_data)
SET OPERATIONS ON RDD
Spark supports many of the operations we have on mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets
Union of RDDs doesn't remove duplicates
a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 6]
dist_a = sc.parallelize(a)
dist_b = sc.parallelize(b)
subtract_data = dist_a.subtract(dist_b)
subtract_data.take(10)
union_data = dist_a.union(dist_b)
union_data.take(10)       # [1, 2, 3, 4, 5, 1, 2, 3, 6]
distinct_data = union_data.distinct()
distinct_data.take(10)    # [2, 4, 6, 1, 3, 5]
KEY VALUE PAIRS - RDD
Spark provides specific functions to deal with RDDs whose elements are key/value pairs
They are commonly used for grouping and aggregations
data = ['nithin,25', 'appu,40', 'anil,20', 'nithin,35', 'anil,30', 'anil,50']
raw_data = sc.parallelize(data)
raw_data.collect()
key_value = raw_data.map(lambda line: (line.split(',')[0], int(line.split(',')[1])))
grouped_data = key_value.reduceByKey(lambda x, y: x + y)
grouped_data.collect()
grouped_data.keys().collect()
grouped_data.values().collect()
sorted_data = grouped_data.sortByKey()
sorted_data.collect()
CACHING
Immutable data allows you to cache data for a long time
Lazy transformations allow data to be recreated on failure
Transformations can also be saved
Caching data improves execution engine performance
raw_data.cache()
raw_data.collect()
SAVING YOUR DATA
saveAsTextFile(path) is used for storing the RDD on your hard disk
The path is a directory, and Spark will output multiple files under that directory; this allows Spark to write the output from multiple nodes
raw_data.saveAsTextFile('opt')
SPARK EXECUTION MODEL
Create DAG of RDDs to represent computation
Create logical execution plan for DAG
Schedule and execute individual tasks
STEP 1: CREATE RDDS
DEPENDENCY TYPES
“Narrow” dependencies: map, filter; union; join with inputs co-partitioned
“Wide” (shuffle) dependencies: groupByKey; join with inputs not co-partitioned
STEP 2: CREATE EXECUTION PLAN
Pipeline as much as possible
Split into “stages” based on the need to reorganize data (see the sketch below)
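One way to see those stage boundaries is the lineage string Spark exposes; a sketch, reusing the input path from the earlier example, where the wide dependency introduced by reduceByKey starts a new stage.
pairs = sc.textFile("./bigdata.txt").map(lambda line: (line.split(',')[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide (shuffle) dependency, hence a new stage
print(counts.toDebugString())                    # prints the RDD lineage; indentation marks stage boundaries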
STEP 3: SCHEDULE TASKS
Split each stage into tasks
A task is data + computation
Execute all tasks within a stage before moving on
SPARK SUBMIT
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
raw_data = sc.textFile("./bigdata.txt")
shows = raw_data.map(lambda line: (line.split(',')[4],1))
shows.take(5)
WHO ARE USING SPARK
SPARK INSTALLATION
INSTALL JDK
sudo apt-get install openjdk-7-jdk
INSTALL SCALA
sudo apt-get install scala
INSTALLING MAVEN
wget http://mirrors.sonic.net/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
tar -zxf apache-maven-3.3.3-bin.tar.gz
sudo cp -R apache-maven-3.3.3 /usr/local
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/bin/mvn
mvn -v
SPARK INSTALLATION
INSTALLING GIT
sudo apt-get install git
CLONE SPARK PROJECT FROM GITHUB
git clone https://github.com/apache/spark.git
cd spark
BUILD SPARK PROJECT
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
For starting the Spark cluster: ./sbin/start-all.sh
For starting the shell: ./bin/pyspark (a quick sanity check follows)
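Once the pyspark shell is up (it already provides sc), a quick sanity-check sketch, not part of the original slides:
sc.parallelize(range(10)).map(lambda x: x * 2).collect()   # should return [0, 2, 4, ..., 18]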
REFERENCES
1. “Data Mining and Data Warehousing”, M.Sudheep Elayidom, SOE, CUSAT
2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award.
3. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
4. “Apache Hadoop”, https://hadoop.apache.org/
5. “Apache Spark”, http://spark.apache.org/
6. “Spark: Cluster Computing with Working Sets”. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
CREDITS
Dr. M Sudheep Elayidom, Associate Professor, Div Of Computer Science & Engg, SOE, CUSAT
Nithink K Anil, Quantiph, Mumbai, Maharashtra, India
Lija Mohan, Div Of Computer Science & Engg, SOE, CUSAT