spark · 2018-03-24 · summit -1,164 participants from over 453 companies attended -spark training...


Post on 06-Jun-2020

Spark

Spark
- Summit
- News
- Basics
- Advanced
- Subprojects
- Use Cases
- Resources

Summit
- 1,164 participants from over 453 companies attended
- Spark Training sold out at 300 participants
- 31 organizations sponsored the event
- 12 keynotes and 52 community presentations were given

News
- Project
- Databricks

Project
- 1.0.0 release
- Graduated incubator
- Very active community

Very active community
- Top three Apache projects
- Most active Big Data project
- > 50 companies
- > 250 contributors
- > 175,000 LOC

Databricks
- Certification
- Cloud

Certification
- Every certified app will run on every certified distribution
- Distribution Partners
- App Partners

Distribution Partners
- Cloudera
- MapR
- Hortonworks
- Pivotal
- IBM
- Amazon Web Services
- SAP

App Partners
- Alteryx
- DataStax
- 0xdata
- Typesafe
- Zoomdata

Cloud
- Vision: Make Big Data Easy!
- Product: Badass
- Hosted Platform
- Cluster Management
- Interactive Workspace

Interactive Workspace
- Notebooks
- Dashboards
- Jobs

Dashboards
- WYSIWYG Builder
- Interactive plots
- One-click publishing

Spark Basics
- Execution
- RDDs
- Caching
- Broadcast
- Languages

Execution
- Apply Functional Operators across Distributed Collections
- Master / Worker
- Lazy
- Parallelize with Threads first
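
The lazy, functional execution model on this slide can be illustrated with a small sketch. This is plain Python, not the Spark API: `LazyCollection`, its list-of-lists "partitions", and the thread pool standing in for workers are all assumptions made for illustration. Transformations only record an operator; nothing runs until an action (`collect`) is called.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyCollection:
    """Hypothetical stand-in for a distributed collection (not Spark's RDD)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per simulated worker
        self.ops = tuple(ops)         # recorded operators, not yet applied

    def map(self, fn):
        # A transformation records the operator and returns a new collection.
        return LazyCollection(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self.partitions, self.ops + (("filter", pred),))

    def _run_partition(self, part):
        # Apply the recorded operators, in order, to one partition.
        for kind, fn in self.ops:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        # The action triggers execution: one thread per partition.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            results = list(pool.map(self._run_partition, self.partitions))
        return [x for part in results for x in part]


numbers = LazyCollection([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = evens_squared.collect()  # work happens only here
```

Building `evens_squared` does no computation; the two operators are applied per partition when `collect()` runs.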

RDDs
- Interface for dataset
- Backed by anything
- Any InputFormat class
- HDFS default

Caching
- Store intermediate results in memory
- Partition-locality
- Significant speed-up for iterative algorithms
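
Why caching helps iterative algorithms can be shown with a rough sketch. Everything here is hypothetical and written in plain Python rather than Spark: `CachedDataset`, its `compute_calls` counter, and the `run` helper exist only to count how often a partition is recomputed with and without an in-memory cache.

```python
class CachedDataset:
    """Hypothetical sketch: partitions derived by a (possibly costly) function."""

    def __init__(self, partitions, compute):
        self.partitions = partitions
        self.compute = compute
        self.compute_calls = 0  # how many times a partition was recomputed
        self._cache = None      # partition index -> computed rows, once enabled

    def cache(self):
        # Opt in to keeping computed partitions in memory.
        self._cache = {}
        return self

    def partition(self, i):
        if self._cache is not None and i in self._cache:
            return self._cache[i]
        self.compute_calls += 1
        rows = [self.compute(x) for x in self.partitions[i]]
        if self._cache is not None:
            self._cache[i] = rows
        return rows

    def total(self):
        return sum(sum(self.partition(i)) for i in range(len(self.partitions)))


def run(dataset, iterations):
    # An "iterative algorithm" that scans the whole dataset on every pass.
    for _ in range(iterations):
        dataset.total()
    return dataset.compute_calls


raw = [[1, 2], [3, 4], [5, 6]]
uncached_calls = run(CachedDataset(raw, lambda x: x * 10), iterations=5)
cached_calls = run(CachedDataset(raw, lambda x: x * 10).cache(), iterations=5)
```

Without the cache every pass recomputes every partition; with it, each partition is computed once and reused on later passes.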

Broadcast
- Send immutable object to all workers
- Similar to DistributedCache in MapReduce

Languages
- Scala
- Python
- Java 7
- Java 8
- R
- Clojure

Advanced
- Partitioning
- Persistence Options
- Checkpointing
- Accumulators
- Optimizations

Subprojects
- SparkSQL
- Tachyon
- Spark Streaming
- MLlib
- GraphX
- BlinkDB
- Spark Job Server

SparkSQL
- Replaces Shark
- Core
- Catalyst
- Libraries

Core
- SchemaRDDs
- Query Execution
- Caching

Catalyst
- Relational algebra
- Expressions / UDFs
- Query Planning
- Optimizer

Libraries
- POJOs
- JDBC
- JSON
- Parquet
- Hive

Hive
- Catalog info from Metastore
- Helps connect UIs like MicroStrategy / Tableau
- Wrappers for UDFs, UDAFs, UDTFs
- Supports TRANSFORM
- Supports SerDes

Tachyon
- In Memory (Off-Heap) Distributed Datastore
- Change URI from hdfs:// to tachyon://
- Share datasets between jobs without HDFS
- Helps scaling by off-loading allocation responsibility and GC pauses from executor processes

Spark Streaming
- Real-time streams
- Micro-batching
- Windowed Computations
- Lambda Architecture
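
Micro-batching and windowed computations can be sketched roughly as follows. This is plain Python and not the Spark Streaming API: the function names, the `(timestamp, value)` event shape, and the choice to express the window as a number of batches are all assumptions for illustration.

```python
from collections import Counter, deque


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    if not events:
        return []
    events = sorted(events, key=lambda e: e[0])
    first = int(events[0][0] // batch_seconds)
    last = int(events[-1][0] // batch_seconds)
    batches = [[] for _ in range(last - first + 1)]
    for ts, value in events:
        batches[int(ts // batch_seconds) - first].append(value)
    return batches


def windowed_counts(batches, window_batches):
    """Count values over a sliding window covering the last N batches."""
    window = deque(maxlen=window_batches)  # oldest batch falls out automatically
    out = []
    for batch in batches:
        window.append(Counter(batch))
        out.append(sum(window, Counter()))
    return out


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "a"), (2.7, "c")]
batches = micro_batches(events, batch_seconds=1)
counts = windowed_counts(batches, window_batches=2)
```

Each batch is processed as a small, ordinary collection; the window result for a batch combines it with the batch before it.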

MLlib
- Summary statistics
- Regression
- Classification
- Clustering
- Collaborative Filtering
- Optimization
- Dimensionality Reduction

GraphX
- Graph, VertexRDD, EdgeRDD objects and operations
- Pregel API
- mapReduceTriplets List<V,E,V>
- Graph analytics libraries
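
The Pregel-style, vertex-centric model named above can be sketched as a loop of supersteps. This is a single-machine Python illustration, not the GraphX API: the `pregel` function, its parameter names (chosen to loosely echo a vertex program, a send function, and a merge function), and the connected-components example are assumptions.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_supersteps=20):
    """Minimal vertex-centric loop.

    vertices: dict of vertex id -> value; edges: list of (src, dst) pairs.
    """
    # Superstep 0: every vertex receives the initial message.
    values = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(values)
    for _ in range(max_supersteps):
        inbox = {}
        for src, dst in edges:
            # Only edges touching a vertex that changed last round can send.
            if src not in active and dst not in active:
                continue
            for target, msg in send_msg(src, values[src], dst, values[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight: the computation has converged
        for v, msg in inbox.items():
            values[v] = vprog(v, values[v], msg)
        active = set(inbox)
    return values


def send_smaller_label(src, src_label, dst, dst_label):
    # Propagate the smaller component label across the edge, in either direction.
    if src_label < dst_label:
        return [(dst, src_label)]
    if dst_label < src_label:
        return [(src, dst_label)]
    return []


# Connected components: each vertex ends up labelled with the smallest id it can reach.
vertices = {v: v for v in [1, 2, 3, 4, 5]}
edges = [(1, 2), (2, 3), (4, 5)]
components = pregel(
    vertices, edges, float("inf"),
    vprog=lambda v, label, msg: min(label, msg),
    send_msg=send_smaller_label,
    merge_msg=min,
)
```

Vertices 1, 2 and 3 converge to label 1, while 4 and 5 converge to label 4.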

Graph analytics libraries
- ConnectedComponents
- PageRank
- TriangleCount
- ShortestPaths
- SVDPlusPlus

BlinkDB
- Get estimated results
- Time bound
- Error bound
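
The idea of an estimated result with an error bound can be illustrated with simple sampling. This sketch is not BlinkDB's implementation or query syntax: `approximate_mean`, the uniform random sample, and the normal-approximation 95% margin (`z=1.96`) are assumptions chosen to show the trade of accuracy for less data scanned.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform sample; return (estimate, margin)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


population = [float(i % 100) for i in range(100_000)]  # true mean is 49.5
estimate, margin = approximate_mean(population, sample_size=1_000)
# Read as: the mean is roughly `estimate`, plus or minus `margin`.
```

Only 1% of the rows are read here; a larger sample shrinks the margin at the cost of more work, which is the trade-off an error bound or time bound expresses.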

Spark Job Server
- Runs multiple jobs / contexts in same process
- Allows for RDD Caching / Sharing between jobs
- Job Persistence

Use Cases
- Spotify
- Real-time Auctions: ShareThrough
- Real-time Recommendations: Graphflow
- Cancer Genomics: AMPLab
- Malware Detection: F-Secure
- Media Distribution Analytics: NBC Universal
- Personal Fitness: Jawbone
- Neuroscience: HHMI

Resources
- Code
- Event
- Technology
- Videos

Code
- https://github.com/apache/spark

Event
- spark-summit.org
- http://arjon.es/2014/06/30/spark-summit-2014-day-1/
- https://www.crowdchat.net/chat/c3BvdF9vYmpfODc=.
- https://nathanbrixius.wordpress.com/2014/07/02/spark-summit-keynote-notes/
- http://thomaswdinsmore.com/2014/07/03/spark-summit-2014-roundup/

Technology
- Learning Spark (O'Reilly eBook)
- www.spark-stack.org
- ampcamp.berkeley.edu
- https://amplab.cs.berkeley.edu/2013/10/23/got-a-minute-spin-up-a-spark-cluster-on-your-laptop-with-docker/

YouTube
- AMPLab: https://www.youtube.com/channel/UCWudC4d9i-2yxR5tuen-Nuw
- Databricks: https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
- Apache Spark: https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w