Advanced Apache Spark Meetup: Spark SQL + DataFrames + Catalyst Optimizer + Data Sources API


Post on 07-Jan-2017


IBM | spark.tc
Advanced Apache Spark Meetup
Spark SQL + DataFrames + Catalyst + Data Sources API
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Sept 21, 2015

Power of data. Simplicity of design. Speed of innovation.

Meetup Housekeeping


Announcements
Patrick McFadin, Evangelist, DataStax
Steve Beier, Boss Man, IBM Spark Tech Center


Who am I?
Streaming Platform Engineer (Not a Photographer or Model)
Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center


Last Meetup (Spark Wins 100 TB Daytona GraySort)

On-disk only, in-memory caching disabled!
sortbenchmark.org/ApacheSpark2014.pdf


Meetup Metrics
Total Spark Experts: ~1000 (+20%)
Mean RSVPs per Meetup: ~300
Mean Attendance: ~50-60% of RSVPs
Donations: $0 (-100%)
This is good! Your money is no good here.

Lloyd from The Shining

… "true", "es.nodes" -> "", "es.port" -> "")
df.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Overwrite)
  .options(esConfig)
  .save("/")


Cassandra Data Source (DataStax)

GitHub
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "dating", "table" -> "ratings"))
  .save()
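The write shown on this slide has a read-side counterpart through the same Data Sources API. The following is a minimal sketch, assuming the connector is on the classpath, a Cassandra cluster is reachable, and the `dating.ratings` table already exists; the object and method names (`CassandraReadSketch`, `readRatings`) are hypothetical and not from the slides.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object CassandraReadSketch {
  // Same keyspace/table options as the write on the slide.
  val ratingsOptions: Map[String, String] =
    Map("keyspace" -> "dating", "table" -> "ratings")

  // Hypothetical helper: load the Cassandra table as a DataFrame.
  // Requires the spark-cassandra-connector and a reachable cluster.
  def readRatings(sqlContext: SQLContext): DataFrame =
    sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(ratingsOptions)
      .load()
}
```

Because the format name and options are the same on both sides, one options map can be shared between the read and the write.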


REST Data Source (Databricks)

Coming Soon!
https://github.com/databricks/spark-rest?

Michael Armbrust, Spark SQL Lead @ Databricks


DynamoDB Data Source (IBM Spark Tech Center)

Coming Soon!
https://github.com/cfregly/spark-dynamodb

Me
Erlich


SparkSQL Performance Tuning (oas.sql.SQLConf)

spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects column codec based on data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOM; improves compression and GC
spark.sql.inMemoryPartitionPruning=true
  Enable partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code gen for CPU and memory optimizations (Tungsten, aka Unsafe mode)
spark.sql.shuffle.partitions
  Increase from default 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based, physical plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge (vs. hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
  Enable automatic partition discovery when loading data
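Settings like these can be applied at runtime with `SQLContext.setConf`. The following is a minimal sketch against the Spark 1.x `SQLContext` API, run in local mode; the values are illustrative assumptions rather than recommendations from the slides, and the names `SqlTuningSketch` and `applyTuning` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlTuningSketch {
  // Illustrative values only; none of these numbers come from the slides.
  val tuning: Map[String, String] = Map(
    "spark.sql.inMemoryColumnarStorage.compressed" -> "true",
    "spark.sql.inMemoryColumnarStorage.batchSize"  -> "20000",
    "spark.sql.shuffle.partitions"                 -> "400",
    // Threshold is expressed in bytes (here 50 MB).
    "spark.sql.autoBroadcastJoinThreshold"         -> (50L * 1024 * 1024).toString
  )

  // Apply every setting to an existing SQLContext.
  def applyTuning(sqlContext: SQLContext): Unit =
    tuning.foreach { case (key, value) => sqlContext.setConf(key, value) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sql-tuning-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    applyTuning(sqlContext)
    // Read one setting back to confirm it was applied.
    println(sqlContext.getConf("spark.sql.shuffle.partitions"))
    sc.stop()
  }
}
```

The same keys can also be passed at launch time, for example via `--conf` on `spark-submit`, which keeps tuning out of application code.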


Related Links
https://github.com/datastax/spark-cassandra-connector
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
https://github.com/phatek-dev/anatomy_of_spark_dataframe_api/
https://databricks.com/blog/


Upcoming Advanced Apache Spark Meetups
Project Tungsten Data Structs & Algos for CPU & Memory Optimization
  Nov 12th, 2015
Text-based Advanced Analytics and Machine Learning
  Jan 14th, 2016
ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me
  Feb 16th, 2016
Spark Internals Deep Dive
  Mar 24th, 2016
Spark SQL Catalyst Optimizer Deep Dive
  Apr 21st, 2016

Special Thanks to DataStax!!

IBM Spark Tech Center is Hiring! Only Fun, Collaborative People - No Erlichs!

IBM | spark.tc
Sign up for our newsletter at

Thank You!
