Higher Order Functions
Herman van Hövell · @westerflyer · 2018-10-03, London Spark Summit EU 2018
About Me
- Software Engineer @Databricks Amsterdam office
- Apache Spark Committer and PMC member
- In a previous life: Data Engineer & Data Analyst
Complex Data
Complex data types in Spark SQL
- Struct. For example: struct(a: Int, b: String)
- Array. For example: array(a: Int)
- Map. For example: map(key: String, value: Int)
This provides primitives to build tree-based data models:
- High expressiveness. Often alleviates the need for 'flat-earth' multi-table designs.
- More natural, reality-like data models.
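For illustration, a minimal sketch (assuming a spark-shell style session, so spark and its implicits are in scope) of building all three types from flat columns:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical flat columns, combined into the three complex types
val df = Seq((1, "x", "key", 10)).toDF("a", "b", "k", "v")

df.select(
  struct($"a", $"b").as("s"), // struct(a: Int, b: String)
  array($"a").as("arr"),      // array(a: Int)
  map($"k", $"v").as("m")     // map(key: String, value: Int)
).printSchema()
```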
Complex Data - Tweet JSON

```json
{
  "created_at": "Wed Oct 03 11:41:57 +0000 2018",
  "id_str": "994633657141813248",
  "text": "Looky nested data #spark #sseu",
  "display_text_range": [0, 140],
  "user": {
    "id_str": "12343453",
    "screen_name": "Westerflyer"
  },
  "extended_tweet": {
    "full_text": "Looky nested data #spark #sseu",
    "display_text_range": [0, 249],
    "entities": {
      "hashtags": [{
        "text": "spark",
        "indices": [211, 225]
      }, {
        "text": "sseu",
        "indices": [239, 249]
      }]
    }
  }
}
```
adapted from: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
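Spark's JSON reader infers a nested schema for documents like this. A minimal sketch, assuming the tweets are stored one object per line in a hypothetical tweets.json file:

```scala
// Infer the nested schema straight from the JSON file
val tweets = spark.read.json("tweets.json")
tweets.printSchema()
```

This prints the schema below: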
```
root
 |-- created_at: string (nullable = true)
 |-- id_str: string (nullable = true)
 |-- text: string (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- id_str: string (nullable = true)
 |    |-- screen_name: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- extended_tweet: struct (nullable = true)
 |    |-- full_text: string (nullable = true)
 |    |-- display_text_range: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- entities: struct (nullable = true)
 |    |    |-- hashtags: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |    |-- text: string (nullable = true)
```
Manipulating Complex Data
Structs are easy :)
Maps/Arrays not so much...
- Easy to read single values/retrieve keys
- Hard to transform or to summarize
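Reading single values really is easy; a sketch, reusing the hypothetical tweets DataFrame from above:

```scala
// Struct fields and individual array elements are easy to reach
tweets.select(
  $"user.screen_name",                                              // struct field access
  $"extended_tweet.entities.hashtags".getItem(0).getField("text")   // text of the first hashtag
).show()
```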
Transforming an Array

Let's say we want to add 1 to every element of the vals field of every row in an input table.
| Id | Vals      |
|----|-----------|
| 1  | [1, 2, 3] |
| 2  | [4, 5, 6] |

becomes

| Id | Vals      |
|----|-----------|
| 1  | [2, 3, 4] |
| 2  | [5, 6, 7] |
How would we do this?
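To follow along, a sketch that builds this toy table (the name input_tbl matches the queries below):

```scala
import spark.implicits._

// A two-row toy table matching the example above
Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5, 6)))
  .toDF("id", "vals")
  .createOrReplaceTempView("input_tbl")
```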
Transforming an Array
Option 1 - Explode and Collect

```sql
select id,
       collect_list(val + 1) as vals
from (select id,
             explode(vals) as val
      from input_tbl) x
group by id
```
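For comparison, the same explode-and-collect expressed with the DataFrame API (a sketch); it compiles down to the same shuffle-heavy plan shown further below:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Explode-and-collect via the DataFrame API
val result = spark.table("input_tbl")
  .select($"id", explode($"vals").as("val"))
  .groupBy($"id")
  .agg(collect_list($"val" + 1).as("vals"))
```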
Transforming an Array
Option 1 - Explode and Collect - 1. Explode

explode(vals) turns each array element into its own row:
| Id | Vals      |
|----|-----------|
| 1  | [1, 2, 3] |
| 2  | [4, 5, 6] |

becomes

| Id | Val |
|----|-----|
| 1  | 1   |
| 1  | 2   |
| 1  | 3   |
| 2  | 4   |
| 2  | 5   |
| 2  | 6   |
Transforming an Array
Option 1 - Explode and Collect - 2. Collect

collect_list(val + 1) adds 1 to each value and collects the rows back into arrays, grouping by id:
| Id | Val   |
|----|-------|
| 1  | 1 + 1 |
| 1  | 2 + 1 |
| 1  | 3 + 1 |
| 2  | 4 + 1 |
| 2  | 5 + 1 |
| 2  | 6 + 1 |

becomes

| Id | Vals      |
|----|-----------|
| 1  | [2, 3, 4] |
| 2  | [5, 6, 7] |
Transforming an Array
Option 1 - Explode and Collect - Complexity

```
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
   +- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
      +- Generate explode(vals), [id], false, [val]
         +- FileScan parquet default.input_tbl
```
This plan has two problems:
- It shuffles the data around, which is very expensive
- collect_list does not respect pre-existing ordering
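You can reproduce a plan like this yourself with explain():

```scala
// Print the physical plan for the explode-and-collect query
spark.sql("""
  select id, collect_list(val + 1) as vals
  from (select id, explode(vals) as val from input_tbl) x
  group by id
""").explain()
```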
Transforming an Array
Option 1 - Explode and Collect - Pitfalls

Keys need to be unique: duplicate ids are merged into a single array, with no guaranteed ordering:

| Id | Vals      |
|----|-----------|
| 1  | [1, 2, 3] |
| 1  | [4, 5, 6] |

becomes

| Id | Vals               |
|----|--------------------|
| 1  | [5, 6, 7, 2, 3, 4] |

Values need to have data: rows with a null array silently disappear:

| Id | Vals      |
|----|-----------|
| 1  | null      |
| 2  | [4, 5, 6] |

becomes

| Id | Vals      |
|----|-----------|
| 2  | [5, 6, 7] |
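The null pitfall is easy to reproduce; a sketch using a hypothetical pitfall_tbl:

```scala
import spark.implicits._

// Row 1 has a null array: explode produces no rows for it,
// so its id disappears from the aggregated result entirely
Seq((1, null.asInstanceOf[Seq[Int]]), (2, Seq(4, 5, 6)))
  .toDF("id", "vals")
  .createOrReplaceTempView("pitfall_tbl")

spark.sql("""
  select id, collect_list(val + 1) as vals
  from (select id, explode(vals) as val from pitfall_tbl) x
  group by id
""").show() // only id 2 survives
```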
Transforming an Array
Option 2 - Scala UDF

```scala
def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
  addOne(_: Seq[Int]): Seq[Int])

val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
```
Transforming an Array
Option 2 - Scala UDF

Pros
- Is faster than Explode & Collect
- Does not suffer from correctness pitfalls

Cons
- Is relatively slow; we need to do a lot of serialization
- You need to register UDFs per type (see the sketch below)
- Does not work for SQL
- Clunky
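The "register per type" con is felt quickly: the Seq[Int] version above will typically not accept an array&lt;bigint&gt; column, so each element type needs its own registration. A sketch with a hypothetical plusOneLong:

```scala
// Same logic again, just for Long elements
def addOneLong(values: Seq[Long]): Seq[Long] = {
  values.map(value => value + 1L)
}
val plusOneLong = spark.udf.register("plusOneLong",
  addOneLong(_: Seq[Long]): Seq[Long])
```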
When are you going to talk about Higher Order Functions?
Higher Order Functions
Let's take another look at Option 2 - Scala UDF:

```scala
def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
  addOne(_: Seq[Int]): Seq[Int])

val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
```
Look closely at the body of addOne: values.map is the Higher Order Function, and value => value + 1 is the anonymous 'lambda' function it applies to every element.

Can we do the same for Spark SQL?
Higher Order Functions in Spark SQL
```sql
select id, transform(vals, val -> val + 1) as vals
from input_tbl
```

- Spark SQL native code: fast & no serialization needed
- Works for SQL
transform is the Higher Order Function: it takes an input array and an expression, and applies this expression to each element in the array.
val -> val + 1 is the anonymous 'lambda' function: the operation that is applied to each value in the array. It consists of two components, separated by the -> symbol:
1. the argument list, and
2. the expression used to calculate the new value.
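The lambda is not limited to a single argument: in Spark 2.4, transform also accepts a two-argument lambda, where the second argument is the element's 0-based index. A sketch:

```scala
// (val, i) -> ...: the second lambda argument is the 0-based element index
spark.sql("""
  select id, transform(vals, (val, i) -> val + i) as vals
  from input_tbl
""").show()
```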
Higher Order Functions in Spark SQL

Nesting:

```sql
select id,
       transform(vals, val ->
         transform(val, e -> e + 1)) as vals
from nested_input_tbl
```

Capture:

```sql
select id,
       ref_value,
       transform(vals, val -> ref_value + val) as vals
from nested_input_tbl
```
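A sketch of capture in action, using a hypothetical capture_tbl; note how ref_value, a regular column, is referenced from inside the lambda:

```scala
import spark.implicits._

// ref_value is captured from the enclosing row inside the lambda
Seq((1, 100, Seq(1, 2, 3)))
  .toDF("id", "ref_value", "vals")
  .createOrReplaceTempView("capture_tbl")

spark.sql("""
  select id, ref_value,
         transform(vals, val -> ref_value + val) as vals
  from capture_tbl
""").show() // vals becomes [101, 102, 103]
```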
Didn’t you say these were faster?
Performance
Higher Order Functions in Spark SQL
Spark 2.4 will ship with the following higher order functions:

Array:
- transform
- filter
- exists
- aggregate/reduce
- zip_with

Map:
- transform_keys
- transform_values
- map_filter
- map_zip_with

A lot of new collection-based expressions were also added...
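A few of the array functions in action (sketches; results noted in comments):

```scala
// filter keeps elements matching the predicate
spark.sql("select filter(array(1, 2, 3), x -> x % 2 = 1)").show()            // [1, 3]
// exists checks whether any element matches
spark.sql("select exists(array(1, 2, 3), x -> x > 2)").show()                // true
// aggregate folds the array into a single value from a start value
spark.sql("select aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x)").show() // 6
```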
Future work
Disclaimer: All of this is speculative and has not been discussed on the Dev list!
Arrays and Maps have received a lot of love. However, working with wide struct fields is still non-trivial (a lot of typing). We can do better here:
- The following Dataset functions should work for nested fields:
  - withColumn()
  - withColumnRenamed()
- The following functions should be added for struct fields:
  - select()
  - withColumn()
  - withColumnRenamed()
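For reference, a sketch of what one of these currently takes. Renaming a single nested field (here user.screen_name in the hypothetical tweets DataFrame from earlier) means rebuilding the whole enclosing struct by hand:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Rename user.screen_name to user.handle by reconstructing the struct
val renamed = tweets.withColumn("user",
  struct(
    $"user.id_str".as("id_str"),
    $"user.screen_name".as("handle")))
```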
Questions?