Classical Distributed Computing Studies. Washington DC Apache Spark Interactive Meetup, 2015-09-22
Classical Distributed
Computing Studies
title inspired by http://prog21.dadgum.com/210.html
Can Catalyst save us
from Amdahl's Law?
(Sorry, no.)
Gene Amdahl
born 1922
in South Dakota
4
(CC BY 2.0) https://www.flickr.com/photos/mwichary/
5
WWII Naval Veteran
then went to SD State
then only got into Wisconsin for theoretical physics
6
While working with slide rules on physics calculations he
thought the whole thing could be faster if he made a computer
to do it.
7
So he did.
WISC
Wisconsin
Integrally
Synchronized
Computer
(CC BY 2.0) https://www.flickr.com/photos/pargon/
8
6J6 and 12AU7 Vacuum
Tubes
Magnetic Drum Memory
CC BY 2.0 https://www.flickr.com/photos/mwichary/
9
First non-Government
sponsored computer.
CC BY 2.0 https://www.flickr.com/photos/mwichary/
10
Invented floating point
CC BY 2.0 https://www.flickr.com/photos/mwichary/
11
When he filed a patent on Floating Point he found out that
von Neumann had already done so.
http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis
12
http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis.pdf
13
Hired immediately by IBM and worked on
the arithmetic unit for the IBM 360
14
Worked on STRETCH
the first transistorized IBM computer
via https://en.wikipedia.org/wiki/IBM_7030_Stretch
15
photo CC by https://www.flickr.com/photos/jurvetson/
Then founded Amdahl Corporation in partnership with Fujitsu
Air-cooled Amdahl 470
The first clone of the IBM S/370!
16
Memo while still at IBM:
Validity of the single processor approach to achieving large
scale computing capabilities
Creates what is known as Amdahl’s Law
17
No equation in the memo, which has led to it
being written many different ways.
But it’s easiest to understand graphically.
18
[Figure: run time split into "parallelizable" and "serial" segments, comparing total run time on 1 processor with total run time with infinite parallelization]
19
If you're familiar with the Critical Path Method from
business or operations research
or if you’ve ever worked in a restaurant
or on an assembly line
Amdahl’s law should be common sense
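Since the memo itself has no equation, one common way to write the idea is speedup = 1 / (s + p / n), where s is the serial fraction, p = 1 - s is the parallelizable fraction, and n is the number of processors. A minimal Python sketch, assuming that form (the function and parameter names are illustrative and do not come from the memo):

```python
import math


def amdahl_speedup(parallel_fraction: float, workers: float) -> float:
    """Speedup over one processor when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part never shrinks; only the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup with infinite parallelization."""
    serial_fraction = 1.0 - parallel_fraction
    return math.inf if serial_fraction == 0 else 1.0 / serial_fraction


if __name__ == "__main__":
    for n in (1, 2, 10, 100, 1000):
        print(f"{n:>5} workers: {amdahl_speedup(0.9, n):.2f}x")
    print(f"limit: {max_speedup(0.9):.2f}x")
```

With 90% of the work parallelizable, the speedup approaches but never passes 10x, however many workers are added.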
Now some other
historical notes
eventually tying to
Spark. :)
21
Rear Admiral Grace Hopper
1906-1992
https://www.youtube.com/watch?v=JEpsKnWZrJ8
22
what do nanoseconds look like?
23
Table from Amdahl’s PhD Thesis
(1952)
24
https://gist.github.com/jboner/2841832
Latency Comparison Numbers
--------------------------
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns             14x L1 cache
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns             20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy              3,000 ns
Send 1K bytes over 1 Gbps network        10,000 ns   0.01 ms
Read 4K randomly from SSD*              150,000 ns   0.15 ms
Read 1 MB sequentially from memory      250,000 ns   0.25 ms
Round trip within same datacenter       500,000 ns    0.5 ms
Read 1 MB sequentially from SSD*      1,000,000 ns      1 ms   4X memory
Disk seek                            10,000,000 ns     10 ms   20x datacenter roundtrip
Read 1 MB sequentially from disk     20,000,000 ns     20 ms   80x memory, 20X SSD
Send packet CA->Netherlands->CA     150,000,000 ns    150 ms
25
L1 cache reference : 0:00:01
Branch mispredict : 0:00:10
L2 cache reference : 0:00:14
Mutex lock/unlock : 0:00:50
Main memory reference : 0:03:20
Compress 1K bytes with Zippy : 1:40:00
Send 1K bytes over 1 Gbps network : 5:33:20
Read 4K randomly from SSD : 3 days, 11:20:00
Read 1 MB sequentially from memory : 5 days, 18:53:20
Round trip within same datacenter : 11 days, 13:46:40
Read 1 MB sequentially from SSD : 23 days, 3:33:20
Disk seek : 231 days, 11:33:20
Read 1 MB sequentially from disk : 462 days, 23:06:40
Send packet CA->Netherlands->CA : 3472 days, 5:20:00
comment from https://gist.github.com/kofemann
“humanized scale” where one L1 cache reference (0.5 ns) = 1 s
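The humanized figures above match a scale in which one L1 cache reference (0.5 ns) lasts one second. A short Python sketch that reproduces a few rows under that assumption (the constant and function names are illustrative):

```python
from datetime import timedelta

# Assumption: the scale is anchored so that one L1 cache reference takes 1 s.
L1_CACHE_NS = 0.5


def humanize(latency_ns: float) -> str:
    """Rescale a latency in nanoseconds to the humanized scale."""
    return str(timedelta(seconds=latency_ns / L1_CACHE_NS))


if __name__ == "__main__":
    samples = {
        "L1 cache reference": 0.5,
        "Main memory reference": 100,
        "Read 4K randomly from SSD": 150_000,
        "Disk seek": 10_000_000,
        "Send packet CA->Netherlands->CA": 150_000_000,
    }
    for name, ns in samples.items():
        print(f"{name} : {humanize(ns)}")
```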
26
American Documentation
Volume 20, Issue 1, pages 21–26, January 1969
27
What computerization and statistics
can add...
28
Karen Spärck Jones FBA
(1935-2007)
29
Invented Inverse Document
Frequency
http://nlp.cs.swarthmore.edu/~richardw/papers/sparckjones1972-statistical.pdf
“The specificity of a term can be
quantified as an inverse function of
the number of documents in which it
occurs.”
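As a rough illustration of the quoted idea, here is a minimal Python sketch using the now-common log(N / df) form, where N is the number of documents and df is the number of documents containing the term. This form is an assumption for illustration and may differ from the exact weighting in the 1972 paper. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each term to log(N / df), where df counts documents containing it."""
    total = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        # A term counts once per document, however often it repeats there.
        document_frequency.update(set(tokens))
    return {
        term: math.log(total / count)
        for term, count in document_frequency.items()
    }


if __name__ == "__main__":
    docs = [
        ["spark", "sql", "catalyst"],
        ["spark", "rdd"],
        ["spark", "sql"],
    ]
    for term, weight in sorted(inverse_document_frequency(docs).items()):
        print(f"{term:>8}: {weight:.3f}")
```

A term that appears in every document gets weight 0, and rarer terms get larger weights, which is the "specificity" the quote describes.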
SparkSQL
31
The Promise of SparkSQL
(the catalyst planner)
32
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;
33
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;
an imaginary SQL statement that could be parallelized
35
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;
But what if Customers is on your local HDFS and Orders is in
a data center at your warehouse?
36
Computerized query planning is the future, but for the time
being you, the user, are going to have to recognize your own
latency issues.
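One way to recognize a latency issue is a back-of-envelope estimate from the Latency Comparison Numbers. The Python sketch below is only that: it assumes cost scales linearly with size, ignores round trips and overlap, and treats 1 MB over the network as 1,000 times the "Send 1K bytes over 1 Gbps network" figure. The names are illustrative.

```python
# Nanoseconds to move 1 MB, taken from the Latency Comparison Numbers.
NS_PER_MB = {
    "memory": 250_000,
    "ssd": 1_000_000,
    "disk": 20_000_000,
    # Assumption: 1 MB = 1,000 x "Send 1K bytes over 1 Gbps network".
    "network_1gbps": 10_000 * 1_000,
}


def transfer_seconds(megabytes: float, medium: str) -> float:
    """Rough time to read or send `megabytes` sequentially via `medium`."""
    return megabytes * NS_PER_MB[medium] / 1e9


if __name__ == "__main__":
    table_size_mb = 1_000  # hypothetical size of one side of the join
    for medium in NS_PER_MB:
        seconds = transfer_seconds(table_size_mb, medium)
        print(f"{medium:>14}: {seconds:8.2f} s for {table_size_mb} MB")
```

Even this crude estimate shows why it matters where each table in a join lives: the same data costs far more to pull across a network or off disk than to read from memory.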
37
Quick fix
CACHE [LAZY] TABLE [AS SELECT]
40
Premature optimization is the root of
all evil
- Donald Knuth (misquoted)
41
We should forget about small efficiencies, say about
97% of the time: premature optimization is the root of
all evil.
Yet we should not pass up our opportunities in that
critical 3%.
A good programmer will not be lulled into
complacency by such reasoning, he will be wise to
look carefully at the critical code; but only after that
code has been identified.
Donald Knuth
ACM Computing Surveys, Vol 6, No. 4, Dec. 1974
Structured Programming with go to Statements
Thank You