Python and Big Data - An Introduction to Spark (PySpark)
TRANSCRIPT
Python and Big data - An Introduction to Spark (PySpark)
Hitesh Dharmdasani
About me
• Security Researcher, Malware Reversing Engineer, Developer
• GIT > GMU > Berkeley > FireEye > On Stage
• Bootstrapping a few ideas
• Hiring!
(Venn diagram: "Me" sits at the intersection of Information Security, Big Data, and Machine Learning)
What we will talk about
• What is Spark?
• How does Spark do things?
• PySpark and data processing primitives
• Example demo - playing with network logs
• Streaming and machine learning in Spark
• When to use Spark
http://bit.do/PyBelgaumSpark
http://tinyurl.com/PyBelgaumSpark
What will we NOT talk about
• Writing production level jobs
• Fine Tuning Spark
• Integrating Spark with Kafka and the like
• Nooks and crannies of Spark
• But glad to talk about it offline
The Common Scenario
Some Data (NTFS, NFS, HDFS, Amazon S3 …)
Python
Process 1 Process 2 Process 3 Process 4 Process 5 …
You write one job, then chunk, cut, slice, and dice the data.
Compute where the data is
• Paradigm shift in computing
• Don't load all the data into one place and do operations
• State your operations and send code to the machine
• Sending code to the machine >>> getting data over the network
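The idea can be pictured with a toy pure-Python sketch: the "nodes" below are just lists, and the shipped "code" is an ordinary function, so only small results ever cross the network.

```python
# Toy illustration: each "node" holds its own shard of the data.
nodes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def job(shard):
    # The code we ship to each node: summarise locally,
    # returning only a small result.
    return sum(shard)

# Only the tiny per-node results travel back, never the raw data.
partial_sums = [job(shard) for shard in nodes]  # [6, 15, 24]
total = sum(partial_sums)                       # 45
```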
MapReduce

public static class MyFirstMapper { public void map() { ... } }
public static class MyFirstReducer { public void reduce() { ... } }
public static class MySecondMapper { public void map() { ... } }
public static class MySecondReducer { public void reduce() { ... } }

Job job = new Job(conf, "First");
job.setMapperClass(MyFirstMapper.class);
job.setReducerClass(MyFirstReducer.class);

/* Job 1 goes to disk */

if (job.isSuccessful()) {
    Job job2 = new Job(conf, "Second");
    job2.setMapperClass(MySecondMapper.class);
    job2.setReducerClass(MySecondReducer.class);
}
This also looks ugly if you ask me!
What is Spark?
• Open-source, lightning-fast cluster computing
• Focus on Speed and Scale
• Developed at the AMPLab, UC Berkeley, by Matei Zaharia
• Most active Apache Project in 2014 (Even more than Hadoop)
• Recently beat MapReduce in sorting 100TB of data by being 3X faster and using 10X fewer machines
What is Spark?
(Diagram: the Spark stack - language APIs (Java, Python, Scala) on top of Spark, with libraries (MLlib, Streaming, ETL, SQL, GraphX, …) alongside, all over storage such as NTFS, NFS, HDFS, Amazon S3, …)
What is Spark?
(Diagram: Spark over storage - NTFS, NFS, HDFS, Amazon S3, …)
• Inherently distributed
• Computation happens where the data resides
What is different from MapReduce?
• Uses main memory for caching
• Dataset is partitioned and stored in RAM/Disk for iterative queries
• Large speedups for iterative operations when in-memory caching is used
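The benefit of in-memory caching shows up when the same dataset is read repeatedly, as iterative algorithms do. A toy pure-Python sketch (the load counter simulates an expensive disk read; names are illustrative):

```python
loads = 0

def load_dataset():
    # Stand-in for an expensive read from disk/HDFS.
    global loads
    loads += 1
    return list(range(1000))

# Without caching: each of the 10 iterations re-reads the data.
for _ in range(10):
    data = load_dataset()
uncached_loads = loads          # 10

# With caching (the analogue of rdd.cache()): load once,
# then every iteration is served from memory.
loads = 0
data = load_dataset()
total = 0
for _ in range(10):
    total += sum(data)
cached_loads = loads            # 1
```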
Spark Internals - The Init
• Creating a SparkContext
• It is Spark's gateway to access the cluster
• In interactive mode, the SparkContext is created as 'sc'

$ pyspark
...
SparkContext available as sc.
>>> sc
<pyspark.context.SparkContext at 0xdeadbeef>
Spark Internals - The Key Idea

Resilient Distributed Datasets (RDDs)
• Basic unit of abstraction over data
• Immutable
• Persistence

>>> data = [90, 14, 20, 86, 43, 55, 30, 94]
>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:364
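parallelize splits the collection into partitions that are distributed across the cluster. A rough local sketch of that slicing (simplified; Spark's actual slicing logic differs in detail):

```python
def partition(data, num_partitions):
    # Mimic how parallelize slices a collection into roughly
    # equal partitions (simplified illustration only).
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

data = [90, 14, 20, 86, 43, 55, 30, 94]
parts = partition(data, 4)
# parts == [[90, 14], [20, 86], [43, 55], [30, 94]]
```

Each partition can then live on a different machine, and transformations run against each partition locally.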
Spark Internals
Operations on RDDs - Transformations & Actions
Spark Internals
(Diagram: Spark Context → File/Collection → Transformations → RDD)
Spark Internals
Lazy Evaluation
Now what?
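Transformations only describe work; nothing runs until an action forces the pipeline. A pure-Python analogue of this laziness uses generator expressions:

```python
data = [90, 14, 20, 86, 43, 55, 30, 94]

# Generator expressions, like RDD transformations, describe work
# without doing it: nothing below has run yet.
incremented = (x + 1 for x in data)
evens = (x for x in incremented if x % 2 == 0)

# Only forcing the pipeline (the analogue of an action such as
# collect()) triggers the computation.
result = list(evens)   # [44, 56]
```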
Spark Internals
(Diagram: Spark Context → File/Collection → Transformations → RDD → Actions)
Spark Internals - Transformation Operations on RDDs

Map:

def mapFunc(x):
    return x + 1

rdd_2 = rdd_1.map(mapFunc)

Filter:

def filterFunc(x):
    if x % 2 == 0:
        return True
    else:
        return False

rdd_2 = rdd_1.filter(filterFunc)
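These transformations mirror Python's built-in map and filter, applied across the cluster instead of a local list. A local sketch of the same functions on the sample data from earlier:

```python
def mapFunc(x):
    return x + 1

def filterFunc(x):
    return x % 2 == 0

data = [90, 14, 20, 86, 43, 55, 30, 94]

# Local equivalents of rdd.map(mapFunc) and the chained
# rdd.map(mapFunc).filter(filterFunc).
mapped = list(map(mapFunc, data))                    # [91, 15, 21, 87, 44, 56, 31, 95]
both = list(filter(filterFunc, map(mapFunc, data)))  # [44, 56]
```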
Spark Internals - Transformation Operations on RDDs
• map • filter • flatMap • mapPartitions • mapPartitionsWithIndex • sample • union • intersection • distinct • groupByKey
Spark Internals

>>> increment_rdd = distData.map(mapFunc)
>>> increment_rdd.collect()
[91, 15, 21, 87, 44, 56, 31, 95]
>>>
>>> increment_rdd.filter(filterFunc).collect()
[44, 56]

OR

>>> distData.map(mapFunc).filter(filterFunc).collect()
[44, 56]
Spark Internals
Fault Tolerance and Lineage
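Rather than replicating data, an RDD remembers the lineage of transformations that produced it, so a lost partition can be recomputed from its source. A conceptual pure-Python sketch (the class and its methods are illustrative, not Spark's API):

```python
class TinyRDD:
    """Toy illustration of lineage: store the recipe, not the result."""

    def __init__(self, source, transforms=()):
        self.source = source          # original partition data
        self.transforms = transforms  # lineage: the chain of functions

    def map(self, f):
        # A transformation returns a new immutable "RDD" whose
        # lineage is one function longer; nothing is computed here.
        return TinyRDD(self.source, self.transforms + (f,))

    def compute(self):
        # Replay the lineage from the source -- conceptually what
        # Spark does to rebuild a partition lost to a node failure.
        data = self.source
        for f in self.transforms:
            data = [f(x) for x in data]
        return data

rdd = TinyRDD([90, 14, 20]).map(lambda x: x + 1).map(lambda x: x * 2)
result = rdd.compute()   # [182, 30, 42]
```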
Moving to the Terminal
Spark Streaming
(Diagram: sources such as Kafka, Flume, HDFS, Twitter, and ZeroMQ feed Spark Streaming as RDDs; results are written to sinks such as HDFS, Cassandra, NFS, and text files)
MLlib
• Machine learning primitives in Spark
• Provides training and classification at scale
• Exploits Spark's ability for iterative computation (linear regression, random forests)
• Currently the most active area of work within Spark
How can I use all this?
(Diagram: tweets are loaded from HDFS into Spark + MLlib to train a model on bad tweets; Spark Streaming then classifies live tweets as good or bad, and the bad ones are reported to Twitter)
To Spark or not to Spark
• Iterative computations
• "Don't fix something that is not broken"
• Lower learning barrier
• Large one-time compute
• Single MapReduce operation