building machine learning algorithms on apache spark with william benton
TRANSCRIPT
![Page 1: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/1.jpg)
Building machine learning algorithms on Apache Spark
William Benton (@willb) Red Hat, Inc.
Session hashtag: #EUds5
![Page 2: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/2.jpg)
#EUds5
Motivation
2
![Page 3: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/3.jpg)
#EUds5
Motivation
3
![Page 4: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/4.jpg)
#EUds5
Motivation
4
![Page 5: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/5.jpg)
#EUds5
Motivation
5
![Page 6: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/6.jpg)
#EUds5
ForecastIntroducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
6
![Page 7: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/7.jpg)
#EUds5
Introducing self-organizing maps
![Page 8: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/8.jpg)
#EUds5 8
![Page 9: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/9.jpg)
#EUds5 8
![Page 10: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/10.jpg)
#EUds5
Training self-organizing maps
9
![Page 11: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/11.jpg)
#EUds5
Training self-organizing maps
10
![Page 12: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/12.jpg)
#EUds5
Training self-organizing maps
11
![Page 13: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/13.jpg)
#EUds5
Training self-organizing maps
11
![Page 14: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/14.jpg)
#EUds5
Training self-organizing maps
12
while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
![Page 15: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/15.jpg)
#EUds5
Training self-organizing maps
12
while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training set in random order
![Page 16: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/16.jpg)
#EUds5
Training self-organizing maps
12
while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training set in random order
the neighborhood size controls how much of the map around
the BMU is affected
![Page 17: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/17.jpg)
#EUds5
Training self-organizing maps
12
while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training set in random order
the neighborhood size controls how much of the map around
the BMU is affected
the learning rate controls how much closer to the example each unit gets
![Page 18: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/18.jpg)
#EUds5
Parallel implementations for partitioned collections
![Page 19: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/19.jpg)
#EUds5
Historical aside: Amdahl’s Law
14
11 - plim So =
sp —> ∞
![Page 20: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/20.jpg)
#EUds5
What forces serial execution?
15
![Page 21: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/21.jpg)
#EUds5
What forces serial execution?
16
![Page 22: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/22.jpg)
#EUds5
What forces serial execution?
17
state[t+1] = combine(state[t], x)
![Page 23: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/23.jpg)
#EUds5
What forces serial execution?
18
state[t+1] = combine(state[t], x)
![Page 24: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/24.jpg)
#EUds5
What forces serial execution?
19
f1: (T, T) => T f2: (T, U) => T
![Page 25: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/25.jpg)
#EUds5
What forces serial execution?
19
f1: (T, T) => T f2: (T, U) => T
![Page 26: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/26.jpg)
#EUds5
What forces serial execution?
19
f1: (T, T) => T f2: (T, U) => T
![Page 27: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/27.jpg)
#EUds5
How can we fix these?
20
a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
![Page 28: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/28.jpg)
#EUds5
How can we fix these?
21
a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
![Page 29: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/29.jpg)
#EUds5
How can we fix these?
22
a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
![Page 30: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/30.jpg)
#EUds5
How can we fix these?
23
a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
![Page 31: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/31.jpg)
#EUds5
How can we fix these?
24
L-BGFSSGD
a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
![Page 32: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/32.jpg)
#EUds5
How can we fix these?
25
SGD L-BGFS
a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
There will be examples of each of these approaches for many problems in the literature and in open-source code!
![Page 33: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/33.jpg)
#EUds5
Implementing atop RDDsWe’ll start with a batch implementation of our technique:
26
for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
![Page 34: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/34.jpg)
#EUds5
Implementing atop RDDs
27
for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
Each batch produces a model that can be averaged with other models
![Page 35: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/35.jpg)
#EUds5
Implementing atop RDDs
28
for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
Each batch produces a model that can be averaged with other models
partition
![Page 36: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/36.jpg)
#EUds5
Implementing atop RDDs
29
for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
This won’t always work!
![Page 37: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/37.jpg)
#EUds5
An implementation template
30
var nextModel = initialModel for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState)
}
![Page 38: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/38.jpg)
#EUds5
An implementation template
31
var nextModel = initialModel for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState)
}
var nextModel = initialModel for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState)
}
“fold”: update the state for this partition with a
single new example
![Page 39: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/39.jpg)
#EUds5
An implementation template
32
var nextModel = initialModel for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState)
}
var nextModel = initialModel for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState)
}“reduce”: combine the
states from two partitions
![Page 40: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/40.jpg)
#EUds5
An implementation template
33
var nextModel = initialModel for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState)
}
this will cause the model object to be serialized with the closure
nextModel
![Page 41: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/41.jpg)
#EUds5
An implementation template
34
var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist }
broadcast the current working model for this iterationvar nextModel = initialModel
for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } get the value of the
broadcast variable
![Page 42: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/42.jpg)
#EUds5
An implementation template
35
var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } remove the stale
broadcasted model
var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist
![Page 43: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/43.jpg)
#EUds5
An implementation template
36
var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist }
var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist }
the wrong implementation of the right interface
![Page 44: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/44.jpg)
#EUds5
workersdriver (aggregate)
Implementing on RDDs
37
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
![Page 45: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/45.jpg)
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
![Page 46: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/46.jpg)
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
![Page 47: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/47.jpg)
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
39
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
![Page 48: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/48.jpg)
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
40
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
![Page 49: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/49.jpg)
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
41
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕
⊕
⊕
⊕
![Page 50: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/50.jpg)
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
42
⊕
⊕
⊕
⊕
⊕
⊕
⊕ ⊕driver (treeAggregate)
⊕
⊕
⊕
⊕
![Page 51: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/51.jpg)
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
43
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕ ⊕
![Page 52: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/52.jpg)
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
44
⊕ ⊕
![Page 53: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/53.jpg)
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕
![Page 54: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/54.jpg)
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕⊕
![Page 55: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/55.jpg)
#EUds5
Beyond the RDD: Data frames and ML Pipelines
![Page 56: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/56.jpg)
#EUds5
RDDs: some good parts
47
val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
![Page 57: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/57.jpg)
#EUds5
RDDs: some good parts
48
val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
doesn’t compile
![Page 58: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/58.jpg)
#EUds5
RDDs: some good parts
49
val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
doesn’t compile
![Page 59: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/59.jpg)
#EUds5
RDDs: some good parts
50
val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
doesn’t compile
crashes at runtime
![Page 60: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/60.jpg)
#EUds5
RDDs: some good parts
51
rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) }
val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
![Page 61: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/61.jpg)
#EUds5
RDDs: some good parts
52
rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) }
val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
![Page 62: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/62.jpg)
#EUds5
RDDs versus query planning
53
val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.cartesian(numbers2) .map((x, y) => (x, y, expensive(x, y))) .filter((x, y, _) => isPrime(x), isPrime(y))
![Page 63: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/63.jpg)
#EUds5
RDDs versus query planning
54
val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.filter(isPrime(_)) .cartesian(numbers2.filter(isPrime(_))) .map((x, y) => (x, y, expensive(x, y)))
![Page 64: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/64.jpg)
#EUds5
RDDs and the JVM heap
55
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
![Page 65: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/65.jpg)
#EUds5
RDDs and the Java heap
56
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class pointer flags size locks element pointer element pointer
class pointer flags size locks 1.0
class pointer flags size locks 3.0 4.0
2.0
![Page 66: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/66.jpg)
#EUds5
RDDs and the Java heap
57
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class pointer flags size locks element pointer element pointer
class pointer flags size locks 1.0
class pointer flags size locks 3.0 4.0
2.0 32 bytes of data…
![Page 67: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/67.jpg)
#EUds5
RDDs and the Java heap
58
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class pointer flags size locks element pointer element pointer
class pointer flags size locks 1.0
class pointer flags size locks 3.0 4.0
2.0
…and 64 bytes of overhead!
32 bytes of data…
![Page 68: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/68.jpg)
#EUds5
ML pipelines: a quick example
59
from pyspark.ml.clustering import KMeans
K, SEED = 100, 0xdea110c8
randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features") model = kmeans.fit(randomDF) withPredictions = model.transform(randomDF).select("x", "y", "prediction")
![Page 69: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/69.jpg)
#EUds5
Working with ML pipelines
60
estimator.fit(df)
![Page 70: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/70.jpg)
#EUds5
Working with ML pipelines
61
estimator.fit(df) model.transform(df)
![Page 71: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/71.jpg)
#EUds5
Working with ML pipelines
62
model.transform(df)
![Page 72: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/72.jpg)
#EUds5
Working with ML pipelines
63
model.transform(df)
![Page 73: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/73.jpg)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
![Page 74: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/74.jpg)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
![Page 75: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/75.jpg)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
outputCol
![Page 76: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/76.jpg)
#EUds5
Defining parameters
65
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
![Page 77: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/77.jpg)
#EUds5
Defining parameters
66
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
![Page 78: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/78.jpg)
#EUds5
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
Defining parameters
67
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
![Page 79: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/79.jpg)
#EUds5
Defining parameters
68
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
![Page 80: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/80.jpg)
#EUds5
Defining parameters
69
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
![Page 81: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/81.jpg)
#EUds5
Defining parameters
70
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1))
final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
![Page 82: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/82.jpg)
#EUds5
Don’t repeat yourself
71
/** * Common params for KMeans and KMeansModel */ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol { /* ... */ }
![Page 83: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/83.jpg)
#EUds5
Estimators and transformers
72
estimator.fit(df)
![Page 84: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/84.jpg)
#EUds5
Estimators and transformers
72
estimator.fit(df)
![Page 85: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/85.jpg)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
![Page 86: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/86.jpg)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
![Page 87: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/87.jpg)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
![Page 88: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/88.jpg)
#EUds5
Validate and transform at once
73
def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
![Page 89: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/89.jpg)
#EUds5
Validate and transform at once
74
def transformSchema(schema: StructType): StructType = { // check that the input columns exist… require(schema.fieldNames.contains($(featuresCol))) // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
![Page 90: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/90.jpg)
#EUds5
Validate and transform at once
75
def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type schema($(featuresCol)) match { case sf: StructField => require(sf.dataType.equals(VectorType)) } // ...and that the output columns don’t exist // ...and then make a new schema }
![Page 91: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/91.jpg)
#EUds5
Validate and transform at once
76
def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist require(!schema.fieldNames.contains($(predictionCol))) require(!schema.fieldNames.contains($(similarityCol))) // ...and then make a new schema }
![Page 92: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/92.jpg)
#EUds5
Validate and transform at once
77
def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema schema.add($(predictionCol), "int") .add($(similarityCol), "double") }
![Page 93: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/93.jpg)
#EUds5
Training on data frames
78
def fit(examples: DataFrame) = { import examples.sparkSession.implicits._ import org.apache.spark.ml.linalg.{Vector=>SV} val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv } /* construct a model object with the result of training */ new SOMModel(train(dfexamples, $(x), $(y))) }
![Page 94: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/94.jpg)
#EUds5
Practical considerationsand key takeaways
![Page 95: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/95.jpg)
#EUds5
Retire your visibility hacks
80
package org.apache.spark.ml.hacks
object Hacks { import org.apache.spark.ml.linalg.VectorUDT val vectorUDT = new VectorUDT }
![Page 96: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/96.jpg)
#EUds5
Retire your visibility hacks
81
package org.apache.spark.ml.linalg
/* imports, etc., are elided ... */ @Since("2.0.0") @DeveloperApi object SQLDataTypes { val VectorType: DataType = new VectorUDT val MatrixType: DataType = new MatrixUDT }
![Page 97: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/97.jpg)
#EUds5
Caching training data
82
val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }
/* actually train here */
if (wasUncached) { examples.unpersist() }
![Page 98: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/98.jpg)
#EUds5
Improve serial execution timesAre you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms.
Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication.
Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library.
83
![Page 99: Building Machine Learning Algorithms on Apache Spark with William Benton](https://reader034.vdocuments.net/reader034/viewer/2022051521/5aaa86217f8b9af9198b46c7/html5/thumbnails/99.jpg)
#EUds5
Key takeawaysThere are several techniques you can use to develop parallel implementations of machine learning algorithms.
The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you’re developing libraries for Spark.
As a library developer, you might need to rely on developer APIs and dive in to Spark’s source code, but things are getting easier with each release!
84