DataEngConf SF16 - Spark SQL Workshop
TRANSCRIPT
![Page 1: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/1.jpg)
Agenda
● Brief Review of Spark (15 min)
● Intro to Spark SQL (30 min)
● Code session 1: Lab (45 min)
● Break (15 min)
● Intermediate Topics in Spark SQL (30 min)
● Code session 2: Quiz (30 min)
● Wrap up (15 min)
![Page 2: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/2.jpg)
Spark Review
By Aaron Merlob
![Page 3: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/3.jpg)
Apache Spark
● Open-source cluster computing framework
● “Successor” to Hadoop MapReduce
● Supports Scala, Java, and Python!
https://en.wikipedia.org/wiki/Apache_Spark
![Page 4: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/4.jpg)
Spark Core + Libraries
https://spark.apache.org
![Page 5: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/5.jpg)
RDD - Main Abstraction
Resilient Distributed Dataset
● Distributed collection
● Fault-tolerant
● Parallel operation - partitioned
● Many data sources
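The idea of a partitioned collection can be sketched with plain Scala collections (a hypothetical analogy, not the Spark API): the data is split into partitions and the same operation is applied to each partition, which is what Spark parallelizes across a cluster.

```scala
// Hypothetical plain-Scala sketch of an RDD's structure (not Spark API):
// a collection split into partitions, with the same work applied per partition.
val data = (1 to 10).toList
val partitions = data.grouped(3).toList          // 4 partitions: [1,2,3] ... [10]
val doubledParts = partitions.map(_.map(_ * 2))  // per-partition work (parallel in Spark)
val collected = doubledParts.flatten             // analogous to rdd.collect()
```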
![Page 6: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/6.jpg)
Immutable
● Immutable
● Lazily Evaluated
● Cachable
● Type Inferred
![Page 7: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/7.jpg)
Lazily Evaluated
How Good Is Aaron’s Presentation?
● Immutable
● Lazily Evaluated
● Cachable
● Type Inferred
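Lazy evaluation can be demonstrated with a plain Scala view (an analogy, not Spark): transformations are only recorded, and no work happens until an "action" forces the result.

```scala
// Sketch of lazy evaluation using a Scala view (analogy, not Spark):
// the map is declared but does not run until an "action" forces it.
var evals = 0
val pipeline = (1 to 5).view.map { x => evals += 1; x * x }
val evalsBeforeAction = evals    // still 0: nothing computed yet
val total = pipeline.sum         // the "action" forces evaluation
val evalsAfterAction = evals     // now 5: one evaluation per element
```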
![Page 8: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/8.jpg)
Cachable
● Immutable
● Lazily Evaluated
● Cachable
● Type Inferred
![Page 9: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/9.jpg)
Type Inferred (Scala)
● Immutable
● Lazily Evaluated
● Cachable
● Type Inferred
![Page 10: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/10.jpg)
RDD Operations
● Transformations
● Actions
![Page 11: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/11.jpg)
Cache & Persist
Transformed RDDs are recomputed on each action.
Store RDDs in memory using cache() (or persist()).
![Page 12: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/12.jpg)
SparkContext
● Your way to get data into/out of RDDs
● Given as ‘sc’ when you launch the Spark shell
For example: sc.parallelize()
![Page 13: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/13.jpg)
Transformation vs. Action?
val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()
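The word count above can be checked with plain Scala collections, with no Spark needed; this mirrors what flatMap, map, and reduceByKey compute on the RDD (groupBy plus a per-key sum stands in for reduceByKey).

```scala
// Plain-Scala check of the RDD word count above (no Spark required).
val data = Seq("Aaron Aaron", "Aaron Brian", "Charlie", "")
val words = data.flatMap(_.split(" "))                       // "" splits to one empty token
val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
val containsLowerA = counts.count { case (w, _) => w.contains("a") }
val moreThanTwo = counts.count { case (_, c) => c > 2 }
```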
![Page 14: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/14.jpg)
![Page 15: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/15.jpg)
![Page 16: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/16.jpg)
![Page 17: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/17.jpg)
Transformation vs. Action?
val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", "" ))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2).cache()
result.filter( kv => kv._1.contains("a") ).count()
result.filter{ case (k, v) => v > 2 }.count()
![Page 18: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/18.jpg)
Spark SQL
By Aaron Merlob
![Page 19: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/19.jpg)
Spark SQL
RDDs with Schemas!
![Page 20: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/20.jpg)
Schemas = Table Names + Column Names + Column Types = Metadata
![Page 21: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/21.jpg)
Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ ??
  ○ ??
  ○ ??
![Page 22: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/22.jpg)
Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ Make your data more structured
  ○ Reduce future flexibility (app is more fragile)
  ○ Y2K
![Page 23: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/23.jpg)
HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
![Page 24: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/24.jpg)
FYI - a less preferred alternative: org.apache.spark.sql.SQLContext
![Page 25: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/25.jpg)
DataFrames
Primary abstraction in Spark SQL
● Evolved from SchemaRDD
● Exposes functionality via SQL or DF API
● SQL for developer productivity (ETL, BI, etc.)
● DF for data scientist productivity (R / Pandas)
![Page 26: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/26.jpg)
Live Coding - Spark-Shell
Maven packages for CSV and Avro:
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1

spark-shell --packages $SPARK_PKGS
![Page 27: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/27.jpg)
Live Coding - Loading CSV
val path = "AAPL.csv"
val df = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load(path)
df.registerTempTable("stocks")
![Page 28: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/28.jpg)
Caching
If I run a query twice, how many times will the data be read from disk?
![Page 29: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/29.jpg)
Caching
If I run a query twice, how many times will the data be read from disk?
1. RDDs are lazy.
2. Therefore the data will be read twice.
3. Unless you cache the RDD, all transformations in the RDD will execute on each action.
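The effect of caching can be sketched as memoization in plain Scala (readFromDisk is a hypothetical stand-in for the data source, not Spark API): without a cache, every "action" re-runs the whole pipeline from the source; with one, the data is materialized once and reused.

```scala
// Sketch: caching as memoization (plain Scala; readFromDisk is hypothetical).
var reads = 0
def readFromDisk(): Seq[Int] = { reads += 1; Seq(1, 2, 3) }

def queryUncached(): Int = readFromDisk().sum  // re-reads the source on every call
queryUncached()
queryUncached()
val readsUncached = reads                      // 2: one read per query

val cached = readFromDisk()                    // "cache": materialize once (3rd read)
def queryCached(): Int = cached.sum
queryCached()
queryCached()
val readsTotal = reads                         // still 3: cached queries read nothing
```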
![Page 30: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/30.jpg)
Caching Tables
sqlContext.cacheTable("stocks")

Particularly useful when using Spark SQL to explore data, and when your data is on S3.

sqlContext.uncacheTable("stocks")
![Page 31: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/31.jpg)
Caching in SQL

| SQL Command | Speed |
|---|---|
| `CACHE TABLE sales;` | Eagerly |
| `CACHE LAZY TABLE sales;` | Lazily |
| `UNCACHE TABLE sales;` | Eagerly |
![Page 32: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/32.jpg)
Caching Comparison
Caching Spark SQL DataFrames vs. caching plain non-DataFrame RDDs:
● RDDs are cached at the level of individual records.
● DataFrames know more about the data.
● DataFrames are cached using an in-memory columnar format.
![Page 33: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/33.jpg)
Caching Comparison
What is the difference between these?
(a) sqlContext.cacheTable("df_table")
(b) df.cache
(c) sqlContext.sql("CACHE TABLE df_table")
![Page 34: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/34.jpg)
Lab 1
Spark SQL Workshop
![Page 35: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/35.jpg)
Spark SQL, the Sequel
By Aaron Merlob
![Page 36: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/36.jpg)
Live Coding - Filetype ETL
● Read in a CSV
● Export as JSON or Parquet
● Read JSON
![Page 37: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/37.jpg)
Live Coding - Common
● Show
● Sample
● Take
● First
![Page 38: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/38.jpg)
Read Formats

| Format | Read |
|---|---|
| Parquet | sqlContext.read.parquet(path) |
| ORC | sqlContext.read.orc(path) |
| JSON | sqlContext.read.json(path) |
| CSV | sqlContext.read.format("com.databricks.spark.csv").load(path) |
![Page 39: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/39.jpg)
Write Formats

| Format | Write |
|---|---|
| Parquet | df.write.parquet(path) |
| ORC | df.write.orc(path) |
| JSON | df.write.json(path) |
| CSV | df.write.format("com.databricks.spark.csv").save(path) |

(Note: write is a method on the DataFrame, not on the SQLContext.)
![Page 40: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/40.jpg)
Schema Inference
Infer schema of JSON files:
● By default it scans the entire file.
● It finds the broadest type that will fit each field.
● This is an RDD operation, so it happens fast.
Infer schema of CSV files:
● The CSV parser uses the same logic as the JSON parser.
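The idea of "the broadest type that will fit a field" can be sketched in plain Scala (inferType is a hypothetical helper, not the actual Spark inference code): try the narrowest type first, widening Int → Double → String until every value fits.

```scala
import scala.util.Try

// Hypothetical sketch of type widening during schema inference (not Spark code):
// pick the narrowest of IntegerType / DoubleType / StringType that fits all values.
def inferType(values: Seq[String]): String = {
  def allInt    = values.forall(v => Try(v.toInt).isSuccess)
  def allDouble = values.forall(v => Try(v.toDouble).isSuccess)
  if (allInt) "IntegerType"
  else if (allDouble) "DoubleType"
  else "StringType"
}
```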
![Page 41: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/41.jpg)
User Defined Functions
How do you apply a “UDF”?
● Import types (StringType, IntegerType, etc.)
● Create the UDF (in Scala)
● Apply the function (in SQL)
Notes:
● UDFs can take single or multiple arguments
● registerFunction takes an optional second argument: the return type
![Page 42: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/42.jpg)
Live Coding - UDF
● Import types (StringType, IntegerType, etc.)
● Create the UDF (in Scala)
● Apply the function (in SQL)
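The UDF flow above can be sketched with plain Scala (the registry map and names are illustrative, not Spark API): define a function, register it under a string name, then apply it to each row by that name, much as a registered UDF is invoked from SQL.

```scala
// Plain-Scala analogue of the UDF flow (illustrative, not Spark API).
val toUpper: String => String = _.toUpperCase       // 1. create the function
val registry = Map("to_upper" -> toUpper)           // 2. register it under a name
val rows = Seq("aapl", "goog")
val upper = rows.map(registry("to_upper"))          // 3. apply it per row, by name
```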
![Page 43: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/43.jpg)
Live Coding - Autocomplete
Find all types available for SQL schemas + UDFs.
Types and their meanings:
StringType = String
IntegerType = Int
DoubleType = Double
![Page 44: DataEngConf SF16 - Spark SQL Workshop](https://reader031.vdocuments.net/reader031/viewer/2022021815/587110531a28abac6d8b598b/html5/thumbnails/44.jpg)
The Spark UI is available on port 4040.