Apache Spark: The modern data analytics platform
TRANSCRIPT
![Page 1: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/1.jpg)
Apache Spark
Mate Gulyas
![Page 2: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/2.jpg)
WHY DO WE DO IT?
54% average viewability**
36% non-human visitors*
What percentage of digital ads reach people?
Clickbots, botnets
Invisible, hidden ads
Transparency in the market
*http://technorati.com/iab-keynote-36-percent-ad-traffic-from-bots-and-threatening-industry/
**http://www.statista.com/statistics/255061/viewability-rates-for-rich-media-ads-worldwide-by-industry/
![Page 3: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/3.jpg)
JavaScript Segmentation
Behaviour analysis
WHAT DO WE DO?
![Page 4: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/4.jpg)
Distributed data processing
WHAT DO WE NEED?
Average-size client: 30 GB / day
900 GB / month
20 average-size clients: 600 GB / day
18 TB / month
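The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 30-day month and the decimal terabyte (1 TB = 1000 GB) are assumptions that happen to reproduce the slide's numbers, and the constant and function names are made up for the example.

```python
# Figures taken from the slide.
GB_PER_DAY_PER_CLIENT = 30
CLIENTS = 20

# Assumptions, not stated on the slide.
DAYS_PER_MONTH = 30   # a 30-day month
GB_PER_TB = 1000      # decimal terabytes


def monthly_gb(gb_per_day, days=DAYS_PER_MONTH):
    """Scale a daily volume in GB to a monthly volume in GB."""
    return gb_per_day * days


single_client_month_gb = monthly_gb(GB_PER_DAY_PER_CLIENT)
all_clients_day_gb = GB_PER_DAY_PER_CLIENT * CLIENTS
all_clients_month_tb = monthly_gb(all_clients_day_gb) / GB_PER_TB

print(single_client_month_gb, "GB / month for one client")
print(all_clients_day_gb, "GB / day for", CLIENTS, "clients")
print(all_clients_month_tb, "TB / month for", CLIENTS, "clients")
```

Under those assumptions the results match the slide: 900 GB / month for one client, and 600 GB / day or 18 TB / month for twenty.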
![Page 5: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/5.jpg)
Recurring data transformations
WHAT DO WE NEED?
![Page 6: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/6.jpg)
Interactive / Batch / Streaming / SQL / Graph processing
IT’S SO COOL
![Page 7: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/7.jpg)
In-memory
WHY SPARK?
![Page 8: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/8.jpg)
Productive API
WHY SPARK?
![Page 9: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/9.jpg)
Multiple languages
WHY SPARK?
![Page 10: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/10.jpg)
Active community
WHY SPARK?
![Page 11: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/11.jpg)
friendly
WHY SPARK?
developer
analyst
CFO
![Page 12: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/12.jpg)
friendly
WHY SPARK?
developer
analyst
CFO
![Page 13: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/13.jpg)
friendly
WHY SPARK?
developer
analyst
CFO
![Page 14: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/14.jpg)
Resilient Distributed Dataset (RDD)
ONE THING TO REMEMBER
![Page 15: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/15.jpg)
RDD
![Page 16: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/16.jpg)
IT RUNS ON
Mesos
YARN
Standalone
AWS EC2
![Page 17: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/17.jpg)
IT GETS DATA FROM
Amazon S3, HDFS, Cassandra, Hive, HBase, Tachyon, Local Filesystem, ODBC databases, etc.
![Page 18: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/18.jpg)
Batch processing
THE OLD WAY
![Page 19: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/19.jpg)
Interactive analytics
THE NEW WAY
![Page 20: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/20.jpg)
SPARK WITH IPYTHON
![Page 21: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/21.jpg)
Spark SQL
THE NOT THAT OLD WAY
![Page 22: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/22.jpg)
{"name": "Mate Gulyas", "twitter": "gulyasm"}
{"name": "John Doe", "email": "[email protected]"}
{"name": "Jane Doe", "email": "[email protected]"}

// hiveCtx is assumed to be an existing HiveContext
val input = hiveCtx.jsonFile("example.json")
// Register the data under the same table name the query reads from
input.registerAsTable("users")
hiveCtx.sql("SELECT name, twitter FROM users")
SQL WITH JSON
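To make the result of that query concrete, here is a plain-Python sketch of the same projection over the three records. It does not use Spark; the `select` helper is hypothetical and only mimics `SELECT name, twitter`. It assumes that a field missing from a record comes back as a null, which is how Spark SQL treats fields absent from a JSON record. The e-mail values are copied as they appear in the transcript.

```python
import json

# The three records from the slide, one JSON object per line.
LINES = [
    '{"name": "Mate Gulyas", "twitter": "gulyasm"}',
    '{"name": "John Doe", "email": "[email protected]"}',
    '{"name": "Jane Doe", "email": "[email protected]"}',
]


def select(lines, columns):
    """Project each JSON record onto `columns`; missing fields become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(tuple(record.get(column) for column in columns))
    return rows


rows = select(LINES, ["name", "twitter"])
for row in rows:
    print(row)
```

Only the first record carries a `twitter` field, so the other two rows hold `None` in that column.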
![Page 23: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/23.jpg)
Spark Streaming
THE LOW LATENCY WAY
![Page 24: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/24.jpg)
DStream
![Page 25: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/25.jpg)
MLlib
THE SKYNET WAY
![Page 26: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/26.jpg)
GraphX
I LOVE GRAPHS
![Page 27: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/27.jpg)
Third party modules
THE OTHERS' WAY
![Page 28: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/28.jpg)
On-premises
AWS
Databricks Cloud
BUT… WHERE TO GO?
![Page 29: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/29.jpg)
TAKEAWAY I
Spark can provide one platform to cover most use cases in data analytics.
![Page 30: Apache Spark: The modern data analytics platform](https://reader034.vdocuments.net/reader034/viewer/2022042602/55b6d3c8bb61eb93418b48cb/html5/thumbnails/30.jpg)
TAKEAWAY II
A productive, fast data processing framework that helps you minimize time to business impact.