Hadoop Tutorials: Spark - CERN
TRANSCRIPT
Hadoop Tutorials: Spark
Kacper Surdy
Prasanth Kothuri
About the tutorial
• The third session in the Hadoop tutorial series
  • this time given by Kacper and Prasanth
• Session fully dedicated to the Spark framework
  • Extensively discussed
  • Actively developed
  • Used in production
• Mixture of a talk and hands-on exercises
What is Spark
• A framework for performing distributed computations
• Scalable, applicable for processing TBs of data
• Easy programming interface
• Supports Java, Scala, Python, R
• Varied APIs: DataFrames, SQL, MLlib, Streaming (see the sketch below)
• Multiple cluster deployment modes
• Multiple data sources: HDFS, Cassandra, HBase, S3
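To give a feel for these APIs, here is a minimal sketch of the DataFrame and SQL interfaces; it assumes a Spark 2.x SparkSession and a hypothetical input file people.json with name and age fields:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: people.json is a hypothetical input file
val spark = SparkSession.builder().appName("api-sketch").getOrCreate()

// DataFrame API
val people = spark.read.json("people.json")
people.filter("age > 30").show()

// Equivalent query through the SQL API
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```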
Compared to Impala
• Similar concept of the workload distribution
• Overlap in SQL functionalities
• Spark is more flexible and offers a richer API
• Impala is fine-tuned for SQL queries
[Diagram: a cluster of nodes (Node 1 … Node X), each with its own CPU, memory, and disks, connected by an interconnect network]
Evolution from MapReduce
[Timeline, based on a Databricks slide, spanning 2002-2014: MapReduce at Google, the MapReduce paper, Hadoop at Yahoo!, the Hadoop Summit, the Spark paper, and Apache Spark becoming a top-level Apache project]
Deployment modes
Local mode
• Makes use of multiple cores/CPUs with thread-level parallelism
Cluster modes:
• Standalone
• Apache Mesos
• Hadoop YARN - typical for Hadoop clusters, with centralised resource management (see the sketch below)
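As an illustration of how a deployment mode is selected, a minimal sketch follows that sets the master URL when creating the Spark context; it assumes Spark 2.x, and the host names are placeholders rather than CERN settings:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the master URL decides where the computation runs
val conf = new SparkConf()
  .setAppName("deployment-modes-sketch")  // hypothetical application name
  .setMaster("local[4]")                  // local mode: 4 worker threads
//.setMaster("spark://master:7077")       // standalone cluster (placeholder host)
//.setMaster("mesos://master:5050")       // Apache Mesos (placeholder host)
//.setMaster("yarn")                      // Hadoop YARN
val sc = new SparkContext(conf)
```

In practice the master is usually passed on the command line (e.g. to spark-submit) rather than hard-coded.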
How to use it
• Interactive shell (terminal): spark-shell, pyspark
• Job submission (terminal): spark-submit (also used for Python)
• Notebooks (web interface): Jupyter, Zeppelin, SWAN (centralised CERN Jupyter service)
Example – Pi estimation
The estimate relies on the fact that the fraction of uniformly random points (x, y) in the unit square that satisfy x*x + y*y < 1 approximates the area of a quarter circle, pi/4.

import scala.math.random

val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
https://en.wikipedia.org/wiki/Monte_Carlo_method
Example – Pi estimation
• Log in to one of the cluster nodes haperf1[01-12]
• Start spark-shell
• Copy or retype the content of /afs/cern.ch/user/k/kasurdy/public/pi.spark
Spark concepts and terminology
Transformations, actions (1)
• Transformations define how you want to transform your data. The result of a transformation of a Spark dataset is another Spark dataset. They do not trigger a computation.
examples: map, filter, union, distinct, join
• Actions are used to collect the result of a previously defined dataset. The result is usually an array or a number. They start the computation (see the sketch below).
examples: reduce, collect, count, first, take
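A minimal sketch of this laziness, assuming the SparkContext sc provided by spark-shell: the transformations only describe the computation, and nothing runs until the final action.

```scala
// Sketch only: transformations are lazy, the action triggers the job
val rdd = sc.parallelize(1 to 10)     // source dataset
val evens = rdd.filter(_ % 2 == 0)    // transformation: nothing computed yet
val doubled = evens.map(_ * 2)        // still lazy
val total = doubled.reduce(_ + _)     // action: runs the computation
println(total)                        // prints 60
```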
Transformations, actions (2)
[Diagram: the input range 1, 2, 3, …, n-1, n from the Pi estimation code is turned by the map function into the samples 0, 1, 1, …, 1, 0, which the reduce function sums into count; with count = 157018 and n = 200000 this gives Pi ≈ 4.0 * 157018 / 200000 = 3.14036]
Driver, worker, executor
[Diagram: the Pi estimation code runs in the Driver, whose SparkContext connects to a Cluster Manager; the Cluster Manager schedules Executors on Worker nodes, typically on a cluster, and the executors perform the distributed work]
Stages, tasks
• Task - A unit of work that will be sent to one executor
• Stage - Each job gets divided into smaller sets of tasks, called stages, that depend on each other; the stage boundaries act as synchronization points (see the sketch below)
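A small word-count sketch, assuming the SparkContext sc from spark-shell: the shuffle required by reduceByKey splits the job into two stages, with the stage boundary acting as the synchronization point.

```scala
// Sketch only: narrow transformations stay in one stage,
// the shuffle for reduceByKey starts a new one
val lines = sc.parallelize(Seq("spark is fast", "spark is fun"))
val counts = lines
  .flatMap(_.split(" "))       // stage 1
  .map(word => (word, 1))      // stage 1
  .reduceByKey(_ + _)          // shuffle boundary: stage 2
counts.collect().foreach(println)  // action: triggers the job
```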
Your job as a directed acyclic graph (DAG)
https://dwtobigdata.wordpress.com/2015/09/29/etl-with-apache-spark/
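The lineage behind this DAG can also be inspected from code; a quick sketch, assuming the SparkContext sc from spark-shell:

```scala
// Sketch only: toDebugString prints an RDD's lineage,
// with shuffle boundaries shown by indentation
val rdd = sc.parallelize(1 to 100)
  .map(x => (x % 10, x))
  .reduceByKey(_ + _)
println(rdd.toDebugString)
```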
Monitoring pages
• If you can, scroll the console output up to get an application ID (e.g. application_1468968864481_0070) and open
http://haperf100.cern.ch:8088/proxy/application_XXXXXXXXXXXXX_XXXX/
• Otherwise use the link below to find your application on the list
http://haperf100.cern.ch:8088/cluster
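If the console output is gone, the application ID can also be read from the running SparkContext; a small sketch for spark-shell, where sc is predefined:

```scala
// Prints the ID of the current application, e.g. application_1468968864481_0070
println(sc.applicationId)
```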
Monitoring pages
[Screenshot: the Spark/YARN monitoring web interface]
Data APIs