Advanced Apache Spark Meetup: Spark SQL + DataFrames + Catalyst Optimizer + Data Sources API


Page 1: Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Data Sources API

Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Sept 21, 2015

Power of data. Simplicity of design. Speed of innovation.

Page 2: Meetup Housekeeping

Page 3: Announcements

Patrick McFadin, Evangelist, DataStax

Steve Beier, Boss Man, IBM Spark Tech Center

Page 4: Who Am I? (Not a Photographer or Model)

Streaming Platform Engineer
Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center

Page 5: Last Meetup (Spark Wins 100 TB Daytona GraySort)

On-disk only, in-memory caching disabled!
sortbenchmark.org/ApacheSpark2014.pdf

Page 6: Meetup Metrics

Total Spark Experts: ~1000 (+20%)
Mean RSVPs per Meetup: ~300
Mean Attendance: ~50-60% of RSVPs
Donations: $0 (-100%). This is good!

"Your money is no good here." (Lloyd from The Shining) <--- eek!

Page 7: Meetup Updates

Talking with other Spark Meetup Groups
  Potential mergers and/or hostile takeovers!
New Sponsors!!
Looking for more South Bay/Peninsula Hosts
  Required: Food, Beer/Soda/Water, Air Conditioning
  Optional: A/V Recording and Live Stream
We're trying out new PowerPoint Animations. Please be patient!

Page 8: Constructive Criticism from Previous Attendees

"Chris, you're like a fat version of an already-fat Erlich from Silicon Valley - except not funny."

"Chris, your voice is so annoying that it actually woke me from the sleep induced by your boring content."

Page 9: Freg-a-palooza Upcoming World Tour

① New York Strata (Sept 29th - Oct 1st)
② London Spark Meetup (Oct 12th)
③ Scotland Data Science Meetup (Oct 13th)
④ Dublin Spark Meetup (Oct 15th)
⑤ Barcelona Spark Meetup (Oct 20th)
⑥ Madrid Spark Meetup (Oct 22nd)
⑦ Amsterdam Spark Summit (Oct 27th - Oct 29th)
⑧ Delft Dutch Data Science Meetup (Oct 29th)
⑨ Brussels Spark Meetup (Oct 30th)
⑩ Zurich Big Data Developers Meetup (Nov 2nd)

High probability I'll end up in jail

Page 10: Topics of this Talk

① DataFrames
② Catalyst Optimizer and Query Plans
③ Data Sources API
④ Creating and Contributing a Custom Data Source
⑤ Partitions, Pruning, Pushdowns
⑥ Native + Third-Party Data Source Impls
⑦ Spark SQL Performance Tuning

Page 11: DataFrames

Inspired by R and pandas DataFrames
Cross-language support: SQL, Python, Scala, Java, R
Levels the performance of Python, Scala, Java, and R
  Generates JVM bytecode vs. serializing/pickling objects to Python
A DataFrame is a container for a logical plan
  Transformations are lazy and represented as a tree
  The Catalyst Optimizer creates the physical plan
DataFrame.rdd returns the underlying RDD if needed
Custom UDFs via registerFunction()
New, experimental UDAF support

Use DataFrames instead of RDDs!!
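
A minimal sketch of this workflow (Spark 1.5-era API): load the demo ratings data, register a custom UDF, and query it. The isHighRating function and file path are hypothetical; note that registerFunction() is the Python API, while Scala uses sqlContext.udf.register:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("dataframes-demo"))
  val sqlContext = new SQLContext(sc)

  // Load the demo ratings dataset as a DataFrame (path is hypothetical)
  val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")

  // Register a custom UDF (Scala equivalent of Python's registerFunction())
  sqlContext.udf.register("isHighRating", (rating: Int) => rating >= 8)

  ratingsDF.registerTempTable("ratings")
  sqlContext.sql("SELECT UserID, ProfileID FROM ratings WHERE isHighRating(Rating)").show()

  // Drop down to the underlying RDD only when truly needed
  val ratingsRDD = ratingsDF.rdd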

Page 12: Catalyst Optimizer

Converts the logical plan to a physical plan
Manipulates & optimizes the DataFrame transformation tree
  Subquery elimination: use aliases to collapse subqueries
  Constant folding: replace expressions with constants
  Simplify filters: remove unnecessary filters
  Predicate/filter pushdowns: avoid unnecessary data loads
  Projection collapsing: avoid unnecessary projections
Hooks for custom rules
  Rules = Scala case classes
  Implement o.a.s.sql.catalyst.rules.Rule
  Apply to any plan stage
  val newPlan = MyFilterRule(analyzedPlan)
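
As a hedged sketch of such a hook (internal Catalyst API, Spark 1.5 era): MyFilterRule is the hypothetical rule named on the slide, here written to remove Filter nodes whose condition is the literal true:

  import org.apache.spark.sql.catalyst.expressions.Literal
  import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
  import org.apache.spark.sql.catalyst.rules.Rule

  case object MyFilterRule extends Rule[LogicalPlan] {
    // Collapse a Filter node whose condition is the literal `true`
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case Filter(condition, child) if condition == Literal(true) => child
    }
  }

  // Apply the rule by hand to an analyzed plan (df is any DataFrame)
  val analyzedPlan = df.queryExecution.analyzed
  val newPlan = MyFilterRule(analyzedPlan)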

Page 13: Plan Debugging

  gendersCsvDF.select($"id", $"gender")
    .filter("gender != 'F'")
    .filter("gender != 'M'")
    .explain(true)

Requires explain(true)

  DataFrame.queryExecution.logical
  DataFrame.queryExecution.analyzed
  DataFrame.queryExecution.optimizedPlan
  DataFrame.queryExecution.executedPlan
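
For example, each stage of planning can also be printed directly (df is any DataFrame):

  println(df.queryExecution.logical)        // parsed logical plan
  println(df.queryExecution.analyzed)       // after analysis/resolution
  println(df.queryExecution.optimizedPlan)  // after Catalyst optimization
  println(df.queryExecution.executedPlan)   // physical plan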

Page 14: Plan Visualization & Join/Aggregation Metrics (New in Spark 1.5!)

[Slide shows a plan visualization with callouts:]
  Effectiveness of Filter
  Cost-based Optimization is Applied
  Peak Memory for Joins and Aggs
  Optimized CPU-cache-aware Binary Format Minimizes GC & Improves Join Perf (Project Tungsten)

Page 15: Data Sources API

Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface)
  ExplainCommand (impl: case class)
  CacheTableCommand (impl: case class)

Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class)
  TableScan (impl: returns all rows)
  PrunedFilteredScan (impl: column pruning and predicate pushdown)
  InsertableRelation (impl: insert or overwrite data using SaveMode)

Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class for all filter pushdowns for this data source)
  EqualTo
  GreaterThan
  StringStartsWith

Page 16: Creating a Custom Data Source

Study Existing Native and Third-Party Data Source Impls

Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
  class JDBCRelation extends BaseRelation
    with PrunedFilteredScan with InsertableRelation

Third-Party: Cassandra (o.a.s.sql.cassandra)
  class CassandraSourceRelation extends BaseRelation
    with PrunedFilteredScan with InsertableRelation
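
A minimal sketch of the skeleton (Spark 1.5-era sources API); the class names and inline data here are hypothetical, not from an existing connector:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
  import org.apache.spark.sql.types._

  // Spark looks for a class named DefaultSource in the package passed to .format(...)
  class DefaultSource extends RelationProvider {
    override def createRelation(sqlContext: SQLContext,
                                parameters: Map[String, String]): BaseRelation =
      new GendersRelation(sqlContext)
  }

  class GendersRelation(val sqlContext: SQLContext)
      extends BaseRelation with TableScan {

    override def schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("gender", StringType, nullable = true)))

    // TableScan is the simplest hook: return all rows; mix in
    // PrunedFilteredScan / InsertableRelation for pruning, pushdown, and writes
    override def buildScan(): RDD[Row] =
      sqlContext.sparkContext.parallelize(Seq(Row(1L, "F"), Row(2L, "M")))
  }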

Page 17: Contributing a Custom Data Source

spark-packages.org
  Managed by Databricks
  Contains links to externally-managed GitHub projects
  Ratings and comments
  Spark version requirements of each package

Examples:
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift

Page 18: Partitions, Pruning, Pushdowns

Page 19: Demo Dataset (from previous Spark After Dark talks)

RATINGS
========
UserID,ProfileID,Rating (1-10)

GENDERS
========
UserID,Gender (M,F,U)

<-- Totally Anonymous -->

Page 20: Partitions

Partition based on data usage patterns:
  /root/gender=M/…
  /root/gender=F/…   <-- Use case: access users by gender
  /root/gender=U/…

Partition Discovery
  On read, infer partitions from the organization of the data (i.e. gender=F)

Dynamic Partitions
  Upon insert, dynamically create partitions
  Specify the field to use for each partition (i.e. gender)
  SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
  DF:  gendersDF.write.format("parquet").partitionBy("gender").save(…)
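
A brief sketch of what the dynamic-partition write above produces on disk (paths and layout illustrative):

  // Dynamic-partition write: one directory per distinct gender value
  gendersDF.write.format("parquet")
    .partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")

  // Resulting layout (illustrative):
  //   .../genders.parquet/gender=M/part-*.parquet
  //   .../genders.parquet/gender=F/part-*.parquet
  //   .../genders.parquet/gender=U/part-*.parquet

  // On read, partition discovery infers `gender` as a column from the paths
  val discoveredDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")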

Page 21: Pruning

Partition Pruning
  Filter out entire partitions of rows on partitioned data
  SELECT id, gender FROM genders WHERE gender = 'U'

Column Pruning
  Filter out entire columns for all rows if not required
  Extremely useful for columnar storage formats (Parquet, ORC)
  SELECT id, gender FROM genders
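
To see both prunings together, a hedged sketch (Spark 1.5-era API; path hypothetical): filter on the partition column, select a subset of columns, and inspect the plan:

  import sqlContext.implicits._

  val unknownGenders = sqlContext.read
    .parquet("file:/root/pipeline/datasets/dating/genders.parquet")
    .filter($"gender" === "U")  // partition pruning: only gender=U directories are read
    .select($"id")              // column pruning: only the id column is materialized

  unknownGenders.explain(true)  // the physical plan should show the pruned scan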

Page 22: Pushdowns

Predicate (aka Filter) Pushdowns
  A predicate returns {true, false} for a given function/condition
  Filters rows as deep into the data source as possible
  The data source must implement PrunedFilteredScan
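
A hedged sketch of the PrunedFilteredScan hook (Spark 1.5-era sources API); GendersRelation and its inline data are hypothetical:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types._

  class GendersRelation(val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

    override def schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("gender", StringType, nullable = true)))

    private val data = Seq((1L, "F"), (2L, "M"), (3L, "U"))

    override def buildScan(requiredColumns: Array[String],
                           filters: Array[Filter]): RDD[Row] = {
      // Predicate pushdown: apply pushed-down filters as early as possible;
      // any filter we don't handle is simply re-applied by Spark afterwards
      val kept = data.filter { case (_, gender) =>
        filters.forall {
          case EqualTo("gender", value) => gender == value
          case _                        => true
        }
      }
      // Column pruning: emit only the requested columns, in the requested order
      val rows = kept.map { case (id, gender) =>
        Row.fromSeq(requiredColumns.map {
          case "id"     => id
          case "gender" => gender
        })
      }
      sqlContext.sparkContext.parallelize(rows)
    }
  }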

Page 23: Native Spark SQL Data Sources

Page 24: Spark SQL Native Data Sources - Source Code

Page 25: JSON Data Source

DataFrame:
  val ratingsDF = sqlContext.read.format("json")
    .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")

  // or, the convenience method:
  val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL:
  CREATE TABLE genders USING json
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

Page 26: JDBC Data Source

Add the JDBC driver to the Spark JVM system classpath:
  $ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame:
  val jdbcConfig = Map(
    "driver"  -> "org.postgresql.Driver",
    "url"     -> "jdbc:postgresql:hostname:port/database",
    "dbtable" -> "schema.tablename")

  val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()

SQL:
  CREATE TABLE genders USING jdbc
  OPTIONS (url, dbtable, driver, …)

Page 27: Parquet Data Source

Configuration:
  spark.sql.parquet.filterPushdown=true
  spark.sql.parquet.mergeSchema=true
  spark.sql.parquet.cacheMetadata=true
  spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames:
  val gendersDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")
  gendersDF.write.format("parquet").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL:
  CREATE TABLE genders USING parquet
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

Page 28: ORC Data Source

Configuration:
  spark.sql.orc.filterPushdown=true

DataFrames:
  val gendersDF = sqlContext.read.format("orc")
    .load("file:/root/pipeline/datasets/dating/genders")
  gendersDF.write.format("orc").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders")

SQL:
  CREATE TABLE genders USING orc
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders")

Page 29: Third-Party Data Sources

spark-packages.org

Page 30: CSV Data Source (Databricks)

Github: https://github.com/databricks/spark-csv
Maven: com.databricks:spark-csv_2.10:1.2.0

Code:
  val gendersCsvDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
    .toDF("id", "gender")  // toDF() defines the column names

Page 31: Avro Data Source (Databricks)

Github: https://github.com/databricks/spark-avro
Maven: com.databricks:spark-avro_2.10:2.0.1

Code:
  val df = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("file:/root/pipeline/datasets/dating/gender.avro")

Page 32: Redshift Data Source (Databricks)

Github: https://github.com/databricks/spark-redshift
Maven: com.databricks:spark-redshift:0.5.0

Code:
  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
    .option("query", "select x, count(*) from my_table group by x")
    .option("tempdir", "s3n://tmpdir")
    .load()

Copies to S3 for fast, parallel reads vs. the single-Redshift-master bottleneck

Page 33: ElasticSearch Data Source (Elastic.co)

Github: https://github.com/elastic/elasticsearch-hadoop
Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code:
  val esConfig = Map("pushdown" -> "true",
                     "es.nodes" -> "<hostname>",
                     "es.port"  -> "<port>")

  df.write.format("org.elasticsearch.spark.sql")
    .mode(SaveMode.Overwrite)
    .options(esConfig)
    .save("<index>/<document>")

Page 34: Cassandra Data Source (DataStax)

Github: https://github.com/datastax/spark-cassandra-connector
Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code:
  ratingsDF.write.format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace" -> "dating", "table" -> "ratings"))
    .save()

Page 35: REST Data Source (Databricks)

Coming Soon!
https://github.com/databricks/spark-rest?

Michael Armbrust, Spark SQL Lead @ Databricks

Page 36: DynamoDB Data Source (IBM Spark Tech Center)

Coming Soon!
https://github.com/cfregly/spark-dynamodb

[Photos: Me, Erlich]

Page 37: Spark SQL Performance Tuning (o.a.s.sql.SQLConf)

spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects a column codec based on the data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOM; improves compression and GC
spark.sql.inMemoryPartitionPruning=true
  Enable partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code gen for CPU and memory optimizations (Tungsten, aka Unsafe Mode)
spark.sql.shuffle.partitions
  Increase from the default of 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based, physical-plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge join (vs. hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
  Enable automatic partition discovery when loading data
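
A minimal sketch of applying a few of these knobs per session (values are illustrative, not recommendations):

  // Spark 1.5-era API: set Spark SQL configs via the SQLContext
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")
  sqlContext.setConf("spark.sql.tungsten.enabled", "true")
  // e.g. raise the broadcast-join threshold to ~50 MB from the 10 MB default
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)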

Page 38: Related Links

https://github.com/datastax/spark-cassandra-connector
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
https://github.com/phatek-dev/anatomy_of_spark_dataframe_api/
https://databricks.com/blog/…

Page 39: Upcoming Advanced Apache Spark Meetups

Nov 12th, 2015: Project Tungsten - Data Structs & Algos for CPU & Memory Optimization
Jan 14th, 2016: Text-based Advanced Analytics and Machine Learning
Feb 16th, 2016: ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me
Mar 24th, 2016: Spark Internals Deep Dive
Apr 21st, 2016: Spark SQL Catalyst Optimizer Deep Dive

Page 40: Thank You!

Special Thanks to DataStax!!
IBM Spark Tech Center is Hiring!
Only Fun, Collaborative People - No Erlichs!

Sign up for our newsletter at

Power of data. Simplicity of design. Speed of innovation.

Page 41: IBM Spark

Power of data. Simplicity of design. Speed of innovation.