Using PySpark to Process Boat Loads of Data
Robert Dempsey, CEO Atlantic Dominion Solutions
We’ve mastered three jobs so you can focus on one - growing your business.
The Three Jobs
At Atlantic Dominion Solutions we perform three functions for our customers:
Consulting: we assess and advise in the areas of technology, team and process to determine how machine learning can have the biggest impact on your business.
Implementation: after a strategy session to determine the work you need we get to work using our proven methodology and begin delivering smarter applications.
Training: continuous improvement requires continuous learning. We provide both on-premises and online training.
Writing the Book
Co-authoring the book Building Machine Learning Pipelines.
Written for software developers and data scientists, Building Machine Learning Pipelines teaches the skills required to create and use the infrastructure needed to run modern intelligent systems.
machinelearningpipelines.com
Robert Dempsey, CEO
Professional: Software Engineer
Author: Books and online courses
Instructor: Lotus Guides, District Data Labs
Owner: Atlantic Dominion Solutions, LLC
What You Can Expect Today
MTAC Framework™
Mindset
Toolbox
Application
Communication
Core Principles
1. When acquiring knowledge, start by going wide instead of deep.
2. Always focus on what's important to people rather than just the technology.
3. Be able to clearly communicate what you know with others.
MTAC Framework™ Applied
Mindset: use-case centric example
Toolbox: Python, PySpark, Docker
Application: Code & Analysis
Communication: Q&A
Mindset
Keep It Simple
Image: Jesse van Dijk : http://jessevandijkart.com/the-labyrinth-of-tsan-kamal/
Solve the Problem
Image: Paulo : https://paullus23.deviantart.com/art/Bliss-soccer-field-326563199
Explain It, Simply
Break Through
Use Case
Got Clean Air?
Got Clean Air?
• Clean air is important.
• Toxic pollutants are known or suspected of causing cancer, reproductive effects, birth defects, and adverse environmental effects.
Questions to Answer
1. Which state has the highest level of pollutants?
2. Which county has the highest level of pollutants?
3. What are the top 5 pollutants by unit of measure?
4. What are the trends of pollutants by state over time?
Toolbox
Python
Spark
The Core of Spark
• A computational engine that schedules, distributes and monitors computational tasks running on a cluster
Higher Level Tools
• Spark SQL: SQL and structured data
• MLlib: machine learning
• GraphX: graph processing
• Spark Streaming: process streaming data
Storage
• Local file system
• Amazon S3
• Cassandra
• Hive
• HBase
• File formats:
  • Text files
  • Sequence files
  • Avro
  • Parquet
  • Hadoop Input Format
Hadoop?
• Not necessary, but…
• If you have multiple nodes you need a resource manager like YARN or Mesos
• You'll need access to distributed storage like HDFS, Amazon S3 or Cassandra
PySpark
What Is PySpark?
• An API that exposes the Spark programming model to Python
• Built on top of Spark's Java API
• Data is processed with Python and cached/shuffled in the JVM
• Driver programs
Driver Programs
• Launch parallel operations on a cluster
• Contain application functions
• Define distributed datasets
• Access Spark through a SparkContext
• Use Py4J to launch a JVM and create a JavaSparkContext
When to Use It
• When you need to…
• Process boat loads of data (TB)
• Perform operations that require all the data to be in memory (machine learning)
• Efficiently process streaming data
• Create an overly complicated use case to present at a meetup
Docker
Docker
• Software container platform
• Containers package only the application (no guest OS)
• Can be deployed anywhere with the same CPU architecture (x86-64, ARM)
• Available for *nix, Mac, Windows
Container Architecture
Application
PySpark in Data Architectures
Architecture #1
Data flow: Agent → File System → Apache Spark → File System → Agent → ES
Architecture #2
Data flow: Agents → S3 → Apache Spark → S3 → Athena
Architecture #3
Data flow: Agents → Apache Kafka → Apache Spark → ES / S3 / HDFS / HBase
What We’ll Build (Simple)
Data flow: Agent → File System → Apache Spark → File System
Python
• Analysis
• Visualization
• Code in our Spark jobs
Spark
• By using PySpark
PySpark
• Process all the data!
• Perform aggregations
Docker
• Run Spark in a Docker container.
• So you don’t have to install anything.
Code Time!
README
• https://github.com/rdempsey/pyspark-for-data-processing
• Create a virtual environment (Anaconda)
• Install dependencies
• Run docker-compose to create the Spark containers
• Run a script (or all of them!) per the README
Dive In
• Data explorer notebook
• Q1 - Most polluted state
• Q2 - Most polluted county
• Q3 - Top pollutants by unit of measure
• Q4 - Pollutants over time
Communication
Q&A
Early Bird Specials!
Intro to Data Science for Software Engineers
Goes live October 23, 2017
Normally: $97
Pre-Launch: $47
http://lotusguides.com
Where to Find Me
Website
Lotus Guides
Github
robertwdempsey.com
lotusguides.com
robertwdempsey
rdempsey
rdempsey
Thank You!