in the once upon a time days of the first age of magic, the prudentwcohen/10-405/workflows-3.pdf ·...

64
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go-- once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again: The first hint Mr. Slippery had that his own True Name might be known-- and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.… This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.

Upload: others

Post on 22-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:

The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…

This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.

Page 2: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

GUINEA PIG: A WORKFLOW PACKAGE FOR PYTHON

2

Page 3: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Full Implementation

3

docid term v(d,w)d137 aardvark 0.645d137 found 0.083d137 I 0.004d137 farmville 0.356d138 when …d138 … …

Page 4: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

DB2

DB2

Motivation• Integrating data is important• Data from different sources

may not have consistent object identifiers– Especially automatically-

constructed ones• But databases will have

human-readable names and/or descriptions for the objects

• But matching names and descriptions is tricky….

DB1 DB2 …

DB1

Page 5: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

One solution: Soft (Similarity) joins

• A similarity join of two sets A and B is– an ordered list of triples (sij,ai,bj) such that

• ai is from A• bj is from B• sij is the similarity of ai and bj

• the triples are in descending order

• the list is either the top K triples by sij or ALL triples with sij>L … or sometimes some approximation of these….

Page 6: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Example: soft joins/similarity joins

… …

Input: Two Different Lists of Entity Names

Page 7: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Example: soft joins/similarity joinsOutput: Pairs of Names Ranked by Similarity

identical

similar

less similar

Page 8: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Semantic Joiningwith Multiscale Statistics

William CohenKatie Rivard, Dana Attias-Moshevitz

CMU

Page 9: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

A parallel workflow for TFIDF similarity joins

9

Page 10: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Parallel Inverted Index Softjoin - 1

Statistics for computing TFIDF with IDFs local to each relation

sumSquareWeights

rel docid term v(d,w)a a137 gateway 0.945a a137 nwr 0.055a a138 acadia 0.915a a138 np 0.085b b013 gateway 0.845b b013 np 0.155b … … …

Page 11: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Parallel Inverted Index Softjoin - 2

What’sthealgorithm?• Step1:createdocumentvectorsas(Cd,d,term,weight)tuples

• Step2:jointhetuplesfromAandB:onesortandreduce• Givesyoutuples(a,b,term,w(a,term)*w(b,term))

• Step3:groupthecommontermsby(a,b)andreducetoaggregatethecomponentsofthesum

𝑠𝑖𝑚 𝒗, 𝒗′ = 𝒗 N 𝒗′ =O𝒗 𝑖 𝒗′(𝑖)�

Page 12: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Making the algorithm smarter….

Page 13: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Inverted Index Softjoin - 2

we should make a smart choice about which terms to use

Page 14: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Adding heuristics to the soft join - 1

Page 15: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Adding heuristics to the soft join - 2

Page 16: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Adding heuristics

• Parks: – input 40k– data 60k– docvec 102k– softjoin

• 539k tokens• 508k documents• 0 errors in top 50

• w/heuristics:– input40k–data60k–docvec 102k– softjoin

• 32ktokens• 24kdocuments• 3errorsintop50• <400usefulterms

Page 17: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Adding heuristics

• SO vs Wikipedia: – input 612M– docvec 1050M– softjoin 67M

• withheuristics– input612M–docvec 1050M– softjoin 9.1M

Page 18: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

PageRank

18

Page 19: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Google’s PageRank

web site xxx

web site yyyy

web site a b c d e f g

web

site

pdq pdq ..

web site yyyy

web site a b c d e f g

web site xxx

Inlinks are “good” (recommendations)

Inlinks from a “good” site are better than inlinksfrom a “bad” site

but inlinks from sites with many outlinks are not a “good”...

“Good “ and “bad” are relative.

web site xxx

19

Page 20: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Google’s PageRank

web site xxx

web site yyyy

web site a b c d e f g

web

site

pdq pdq ..

web site yyyy

web site a b c d e f g

web site xxx Imagine a “pagehopper”

that always either

• follows a random link, or

• jumps to random page

20

Page 21: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Google’s PageRank(Brin & Page, http://www-db.stanford.edu/~backrub/google.html)

web site xxx

web site yyyy

web site a b c d e f g

web

site

pdq pdq ..

web site yyyy

web site a b c d e f g

web site xxx Imagine a “pagehopper”

that always either

• follows a random link, or

• jumps to random page

PageRank ranks pages by the amount of time the pagehopper spends on a page:

• or, if there were many pagehoppers, PageRank is the expected “crowd size” 21

Page 22: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

PageRank in Memory

• Let u = (1/N, …, 1/N)– dimension = #nodes N

• Let A = adjacency matrix: [aij=1 ó i links to j]• Let W = [wij = aij/outdegree(i)]

– wij is probability of jump from i to j• Let v0 = (1,1,….,1)

– or anything else you want• Repeat until converged:

– Let vt+1 = cu + (1-c) vtW• c is probability of jumping “anywhere randomly”

22

Page 23: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Streaming PageRank

Page 24: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Streaming PageRank

• Assume we can store v but not W in memory• Repeat until converged:

– Let vt+1 = cu + (1-c) vtW

• Store A as a row matrix: each line is– i ji,1,…,ji,d [the neighbors of i]

• Store v’ and v in memory: v’ starts out as cu• For each line “i ji,1,…,ji,d “

– For each j in ji,1,…,ji,d• v’[j] += (1-c)v[i]/d

Everything needed for update is right there in row….

24

Page 25: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Streaming PageRank• Assume we can store one row of W in memory at a

time, but not v – i.e., links on a web page fit in memory, not pageRank for the whole web.

• Repeat until converged:– Let vt+1 = cu + (1-c) vtW

• Store A as a row matrix: each line is– i ji,1,…,ji,d [the neighbors of i]

• v’ starts out as cu• For each line “i ji,1,…,ji,d “

– For each j in ji,1,…,ji,d• v’[j] += (1-c)v[i]/d

25

We need to convert these counter updates to messages, like we did for naïve Bayes

Like in naïve Bayes: a document fits, but the

model doesn’t

Page 26: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Streaming PageRank• Assume we can store one row of W in memory at a

time, but not v – i.e., links on a web page fit in memory, not pageRank for the whole web.

• Repeat until converged:– Let vt+1 = cu + (1-c) vtW

• Store A as a row matrix: each line is– i ji,1,…,ji,d [the neighbors of i]

• v’ starts out as cu• For each line “i ji,1,…,ji,d “

– For each j in ji,1,…,ji,d• v’[j] += (1-c)v[i]/d

26

Streaming: if we know c, v[i] and have the linked-to j’s then• we can compute d• we can produce messages

saying how much to increment each v[j]

Then we sort the messages by v[j] and add up all the incrementals• and somehow also add in

c/N

Page 27: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Streaming PageRank• Assume we can store one row of W in memory at a

time, but not v – i.e., links on a web page fit in memory, not pageRank for the whole web.

• Repeat until converged:– Let vt+1 = cu + (1-c) vtW

• Store A as a row matrix: each line is– i ji,1,…,ji,d [the neighbors of i]

• v’ starts out as cu• For each line “i ji,1,…,ji,d “

– For each j in ji,1,…,ji,d• v’[j] += (1-c)v[i]/d

27

Streaming: if we know c, v[i] and have the linked-to j’s then• we can compute d• we can produce messages

saying how much to increment each v[j]

Then we sort the messages by v[j] and add up all the incrementals• and somehow also add in

c/N

So we need to stream thru some structure that has v[i] plus all the outlinks.

This is a copy of the graph with current pageRankestimates (v) for each node attached to the node.

Page 28: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

PageRank in Dataflow Languages

28

Page 29: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Iteration in Dataflow

• PIG and Guinea Pig are pure data flow languages– no conditionals– no iteration

• To loop you need to embed a dataflow program into a ‘real’ program

29

Page 30: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

30

Recap: a Gpig program….

There’s no “program” object, and no obvious way to write a main that will call a program

Page 31: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Calling GuineaPigprogrammaticallySomewhere in the execution

of the plan we will call this script with special arguments that tell it to do a substep of the plan.

So we need a Python mainthat will do that right

But I also want to write an actual main program so….

Creatiing a Planner object I can call (not a subclass) – remember to call setup()

31

Page 32: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Calling GuineaPigprogrammatically Convert from

initial format and assign

initial pagerankvector v

create the planner

Move initial pageranks + graph to tmp

file

Compute updated

pageranks

Move new pageranks to tmp

file for next iteration of loop32

Page 33: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

lots and lots of i/o happening here…

rowininitGraph:(i,[j1,….,jd])

33

Page 34: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

note: there is lots of i/o happening here… 34

Page 35: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

An example from Ron Bekkerman

35

Page 36: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Example: k-means clustering

• An EM-like algorithm:• Initialize k cluster centroids• E-step: associate each data instance with the

closest centroid– Find expected values of cluster

assignments given the data and centroids• M-step: recalculate centroids as an average of

the associated data instances– Find new centroids that maximize that

expectation36

Page 37: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

k-means Clustering

centroids

37

Page 38: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Parallelizing k-means

38

Page 39: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Parallelizing k-means

39

Page 40: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Parallelizing k-means

40

Page 41: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

k-means on MapReduce

• Mappers read data portions and centroids• Mappers assign data instances to clusters• Mappers compute new local centroids

and local cluster sizes• Reducers aggregate local centroids

(weighted by local cluster sizes) into new global centroids

• Reducers write the new centroids

Panda et al, Chapter 2

41

Page 42: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

k-means in Apache Pig: input data

• Assume we need to cluster documents– Stored in a 3-column table D:

• Initial centroids are k randomly chosen docs– Stored in table C in the same format as

above

Document Word Count

doc1 Carnegie 2

doc1 Mellon 2

42

Page 43: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

D_C =JOIN C BY w,D BY w;PROD =FOREACH D_C GENERATE d,c,id *ic AS idic ;

PRODg =GROUP PROD BY (d,c);DOT_PROD =FOREACH PRODg GENERATE d,c,SUM(idic)AS dXc;

SQR =FOREACH C GENERATE c,ic *ic AS ic2;SQRg =GROUP SQR BY c;LEN_C=FOREACH SQRgGENERATE c,SQRT(SUM(ic2))AS lenc;

DOT_LEN =JOIN LEN_CBY c,DOT_PROD BY c;SIM =FOREACH DOT_LEN GENERATE d,c,dXc /lenc;

SIMg =GROUP SIM BY d;CLUSTERS=FOREACH SIMg GENERATE TOP(1,2,SIM);

k-means in Apache Pig: E-step

( )åå

Î

Î×

=

cwwc

dwwc

wd

cdi

iic

2maxarg

43

Page 44: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

D_C =JOIN C BY w,D BY w;PROD =FOREACH D_C GENERATE d,c,id *ic AS idic ;

PRODg =GROUP PROD BY (d,c);DOT_PROD =FOREACH PRODg GENERATE d,c,SUM(idic)AS dXc;

SQR =FOREACH C GENERATE c,ic *ic AS ic2;SQRg =GROUP SQR BY c;LEN_C=FOREACH SQRgGENERATE c,SQRT(SUM(ic2))AS lenc;

DOT_LEN =JOIN LEN_CBY c,DOT_PROD BY c;SIM =FOREACH DOT_LEN GENERATE d,c,dXc /lenc;

SIMg =GROUP SIM BY d;CLUSTERS=FOREACH SIMg GENERATE TOP(1,2,SIM);

k-means in Apache Pig: E-step

( )åå

Î

Î×

=

cwwc

dwwc

wd

cdi

iic

2maxarg

44

Page 45: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

D_C =JOIN C BY w,D BY w;PROD =FOREACH D_C GENERATE d,c,id *ic AS idic ;

PRODg =GROUP PROD BY (d,c);DOT_PROD =FOREACH PRODg GENERATE d,c,SUM(idic)AS dXc;

SQR =FOREACH C GENERATE c,ic *ic AS ic2;SQRg =GROUP SQR BY c;LEN_C=FOREACH SQRgGENERATE c,SQRT(SUM(ic2))AS lenc;

DOT_LEN =JOIN LEN_CBY c,DOT_PROD BY c;SIM =FOREACH DOT_LEN GENERATE d,c,dXc /lenc;

SIMg =GROUP SIM BY d;CLUSTERS=FOREACH SIMg GENERATE TOP(1,2,SIM);

k-means in Apache Pig: E-step

( )åå

Î

Î×

=

cwwc

dwwc

wd

cdi

iic

2maxarg

45

Page 46: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

D_C =JOIN C BY w,D BY w;PROD =FOREACH D_C GENERATE d,c,id *ic AS idic ;

PRODg =GROUP PROD BY (d,c);DOT_PROD =FOREACH PRODg GENERATE d,c,SUM(idic)AS dXc;

SQR =FOREACH C GENERATE c,ic *ic AS ic2;SQRg =GROUP SQR BY c;LEN_C=FOREACH SQRgGENERATE c,SQRT(SUM(ic2))AS lenc;

DOT_LEN =JOIN LEN_CBY c,DOT_PROD BY c;SIM =FOREACH DOT_LEN GENERATE d,c,dXc /lenc;

SIMg =GROUP SIM BY d;CLUSTERS=FOREACH SIMg GENERATE TOP(1,2,SIM);

k-means in Apache Pig: E-step

( )åå

Î

Î×

=

cwwc

dwwc

wd

cdi

iic

2maxarg

46

Page 47: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

D_C =JOIN C BY w,D BY w;PROD =FOREACH D_C GENERATE d,c,id *ic AS idic ;

PRODg =GROUP PROD BY (d,c);DOT_PROD =FOREACH PRODg GENERATE d,c,SUM(idic)AS dXc;

SQR =FOREACH C GENERATE c,ic *ic AS ic2;SQRg =GROUP SQR BY c;LEN_C=FOREACH SQRgGENERATE c,SQRT(SUM(ic2))AS lenc;

DOT_LEN =JOIN LEN_CBY c,DOT_PROD BY c;SIM =FOREACH DOT_LEN GENERATE d,c,dXc /lenc;

SIMg =GROUP SIM BY d;CLUSTERS=FOREACH SIMg GENERATE TOP(1,2,SIM);

k-means in Apache Pig: E-step

( )åå

Î

Î×

=

cwwc

dwwc

wd

cdi

iic

2maxarg

47

Page 48: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

k-means in Apache Pig: E-step

D_C =JOIN C BY w,D BY w;PROD =FOREACH D_C GENERATE d,c,id *ic AS idic ;

PRODg =GROUP PROD BY (d,c);DOT_PROD =FOREACH PRODg GENERATE d,c,SUM(idic)AS dXc;

SQR =FOREACH C GENERATE c,ic *ic AS ic2;SQRg =GROUP SQR BY c;LEN_C=FOREACH SQRgGENERATE c,SQRT(SUM(ic2))AS lenc;

DOT_LEN =JOIN LEN_CBY c,DOT_PROD BY c;SIM =FOREACH DOT_LEN GENERATE d,c,dXc /lenc;

SIMg =GROUP SIM BY d;CLUSTERS=FOREACH SIMg GENERATE TOP(1,2,SIM);

48

Page 49: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

k-means in Apache Pig: M-step

D_C_W= JOIN CLUSTERSBY d, DBY d;

D_C_Wg = GROUP D_C_WBY (c, w);SUMS= FOREACH D_C_Wg GENERATE c, w, SUM(id)AS sum;

D_C_Wgg =GROUP D_C_WBY c;SIZES=FOREACH D_C_WggGENERATE c,COUNT(D_C_W) AS size;

SUMS_SIZES=JOIN SIZESBY c, SUMSBY c;C=FOREACH SUMS_SIZESGENERATE c,w, sum/sizeAS ic ;

Finally - embed in Java (or Python or ….) to do the looping

49

Page 50: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

The problem with k-means in Hadoop

I/O costs

50

Page 51: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Data is read, and model is written, with every iteration

• Mappers read data portions and centroids• Mappers assign data instances to clusters• Mappers compute new local centroids

and local cluster sizes• Reducers aggregate local centroids

(weighted by local cluster sizes) into new global centroids

• Reducers write the new centroids

Panda et al, Chapter 2

51

Page 52: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark: Another Dataflow Language

52

Page 53: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark• Hadoop: Too much typing

– programs are not concise• Hadoop: Too low level

– missing abstractions– hard to specify a workflow

• Pig, Guinea Pig, … address these problems: also Spark• Not well suited to iterative operations

– E.g., PageRank, E/M, k-means clustering, …– Spark lowers cost of repeated reads

53

Set of concise dataflow operations (“transformation”)

Dataflow operations are embedded in an API together with “actions”

Spark: Sharded files are replaced by “RDDs” – resilient distributed datasets. RDDs can be cached in cluster memory and recreated to recover from error

Page 54: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples

54

spark is a spark context object

Page 55: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples

55

errors is a transformation, and

thus a data strucurethat explains HOW

to do somethingcount() is an action:

it will actually execute the plan for errors and return a

value.

errors.filter() is a transformation

collect() is an action

everything is sharded, like inHadoop and GuineaPig

Page 56: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples

56

# modify errors to be stored in cluster memory

subsequent actions will be

much faster

everything is sharded … and the shards are stored in memory of worker machines not local disk (if possible)

You can also persist() an RDD on disk, whichis like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.

Page 57: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples

57

# modify errors to be stored in cluster memory

everything is sharded … and the shards are stored in memory of worker machines not local disk (if possible)

You can also persist() an RDD on disk, whichis like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.

persist-on-disk works because the RDD is read-only

(immutable)

Page 58: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples: wordcount

58

the action transformation on (key,value) pairs , which are special

Page 59: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples: batch logistic regression

59

reduce is an action – it produces a numpy vector

p.x and w are vectors, from the numpy package

p.x and w are vectors, from the numpypackage. Python

overloads operations like * and + for vectors.

Page 60: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples: batch logistic regression

Important note: numpy vectors/matrices are not just “syntactic sugar”. • They are much more compact than something like a list of python floats.• numpy operations like dot, *, + are calls to optimized C code• a little python logic around a lot of numpy calls is pretty efficient

60

Page 61: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples: batch logistic regression

61

w is defined outside the lambda function, but

used inside itSo: python builds a closure – code including the current value of w –

and Spark ships it off to each worker. So w is copied, and must

be read-only.

Page 62: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark examples: batch logistic regression

62

dataset of points is cached in cluster memory to

reduce i/o

Page 63: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark logistic regression example

63

Page 64: In the once upon a time days of the First Age of Magic, the prudentwcohen/10-405/workflows-3.pdf · 2018. 2. 7. · In the once upon a time days of the First Age of Magic, the prudent

Spark

64