In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:
The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…
This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.
GUINEA PIG: A WORKFLOW PACKAGE FOR PYTHON
Full Implementation
docid  term       v(d,w)
d137   aardvark   0.645
d137   found      0.083
d137   I          0.004
d137   farmville  0.356
d138   when       …
d138   …          …
Motivation
• Integrating data is important
• Data from different sources may not have consistent object identifiers
  – Especially automatically-constructed ones
• But databases will have human-readable names and/or descriptions for the objects
• But matching names and descriptions is tricky….
(figure: two databases DB1 and DB2 with overlapping objects)
One solution: Soft (Similarity) Joins
• A similarity join of two sets A and B is an ordered list of triples (sij, ai, bj) such that
  – ai is from A
  – bj is from B
  – sij is the similarity of ai and bj
  – the triples are in descending order of sij
  – the list is either the top K triples by sij, or ALL triples with sij > L … or sometimes some approximation of these….
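As a concrete (if naive) rendering of this definition, the sketch below computes all pairwise similarities and keeps the top K. The token-overlap `similarity` function is just one illustrative choice, not the TFIDF weighting developed later in these slides:

```python
from itertools import product

def similarity(a, b):
    """Jaccard overlap of the token sets of two names (one simple choice of sim)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def soft_join(A, B, K=None, L=None):
    """Return (sij, ai, bj) triples in descending order of similarity,
    keeping either the top K triples or all triples with sij > L."""
    triples = sorted(
        ((similarity(a, b), a, b) for a, b in product(A, B)),
        reverse=True)
    if L is not None:
        triples = [t for t in triples if t[0] > L]
    if K is not None:
        triples = triples[:K]
    return triples

pairs = soft_join(["acadia national park", "gateway arch"],
                  ["gateway arch nhs", "acadia np"], K=2)
```

The quadratic all-pairs loop is exactly what the inverted-index workflow later in this section avoids.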
Example: soft joins/similarity joins
Input: Two Different Lists of Entity Names
Output: Pairs of Names Ranked by Similarity
(figure: matched pairs labeled identical, similar, and less similar)
Semantic Joining with Multiscale Statistics
William Cohen, Katie Rivard, Dana Attias-Moshevitz
CMU
A parallel workflow for TFIDF similarity joins
Parallel Inverted Index Softjoin - 1
Statistics for computing TFIDF with IDFs local to each relation
sumSquareWeights
rel  docid  term     v(d,w)
a    a137   gateway  0.945
a    a137   nwr      0.055
a    a138   acadia   0.915
a    a138   np       0.085
b    b013   gateway  0.845
b    b013   np       0.155
b    …      …        …
Parallel Inverted Index Softjoin - 2
What’s the algorithm?
• Step 1: create document vectors as (Cd, d, term, weight) tuples
• Step 2: join the tuples from A and B: one sort and reduce
  – Gives you tuples (a, b, term, w(a,term) · w(b,term))
• Step 3: group the common terms by (a, b) and reduce to aggregate the components of the sum

sim(v, v′) = v · v′ = Σ_i v(i) · v′(i)
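A single-machine sketch of the three steps above, with Python dictionaries standing in for the sort-and-reduce passes (the tiny weight values are made up for illustration):

```python
from collections import defaultdict

# Step 1: document vectors as (relation, doc, term, weight) tuples;
# the TFIDF weights are assumed to be already computed.
tuples_a = [("a", "a137", "gateway", 0.945), ("a", "a137", "nwr", 0.055)]
tuples_b = [("b", "b013", "gateway", 0.845), ("b", "b013", "np", 0.155)]

# Step 2: join the tuples from A and B on the term
# (one sort-and-reduce in the parallel version).
index_b = defaultdict(list)
for _, b, term, w in tuples_b:
    index_b[term].append((b, w))

products = []   # tuples (a, b, term, w(a,term) * w(b,term))
for _, a, term, wa in tuples_a:
    for b, wb in index_b.get(term, []):
        products.append((a, b, term, wa * wb))

# Step 3: group the common terms by (a, b) and sum
# the components of the dot product.
sim = defaultdict(float)
for a, b, _, p in products:
    sim[(a, b)] += p
```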
Making the algorithm smarter….
Inverted Index Softjoin - 2
We should make a smart choice about which terms to use.
Adding heuristics to the soft join - 1
Adding heuristics to the soft join - 2
Adding heuristics
• Parks:
  – input 40k, data 60k, docvec 102k
  – softjoin: 539k tokens, 508k documents, 0 errors in top 50
• w/heuristics:
  – input 40k, data 60k, docvec 102k
  – softjoin: 32k tokens, 24k documents, 3 errors in top 50, <400 useful terms
Adding heuristics
• SO vs Wikipedia:
  – input 612M, docvec 1050M, softjoin 67M
• with heuristics:
  – input 612M, docvec 1050M, softjoin 9.1M
PageRank
Google’s PageRank
(figure: a small graph of web sites linking to one another)
• Inlinks are “good” (recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site
• but inlinks from sites with many outlinks are not as “good”…
• “Good” and “bad” are relative.
Google’s PageRank
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to a random page
Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
PageRank ranks pages by the amount of time the pagehopper spends on a page:
• or, if there were many pagehoppers, PageRank is the expected “crowd size”
PageRank in Memory
• Let u = (1/N, …, 1/N)
  – dimension = #nodes N
• Let A = adjacency matrix: [a(i,j) = 1 ⟺ i links to j]
• Let W = [w(i,j) = a(i,j) / outdegree(i)]
  – w(i,j) is the probability of a jump from i to j
• Let v0 = (1, 1, …, 1)
  – or anything else you want
• Repeat until converged:
  – Let v(t+1) = c·u + (1 − c)·v(t)·W
    • c is the probability of jumping “anywhere randomly”
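A minimal in-memory version of this loop (pure Python, with the adjacency stored as out-link lists; it assumes every node has at least one out-link, and starts from the uniform vector rather than all-ones):

```python
def pagerank(links, c=0.15, iters=50):
    """Power iteration: v <- c*u + (1-c)*v*W, where W is the
    out-degree-normalized adjacency matrix (here: a dict of out-link lists)."""
    nodes = list(links)
    N = len(nodes)
    v = {i: 1.0 / N for i in nodes}          # start from the uniform vector
    for _ in range(iters):
        v_next = {i: c / N for i in nodes}   # the c*u teleport term
        for i in nodes:
            out = links[i]
            for j in out:                    # spread v[i] over i's out-links
                v_next[j] += (1 - c) * v[i] / len(out)
        v = v_next
    return v

ranks = pagerank({"x": ["y"], "y": ["x", "z"], "z": ["x"]})
```

Here "x" has the most (and best) inlinks, so it ends up with the largest score.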
Streaming PageRank
• Assume we can store v but not W in memory
• Repeat until converged:
  – Let v(t+1) = c·u + (1 − c)·v(t)·W
• Store A as a row matrix: each line is
  – i  j(i,1), …, j(i,d)  [the neighbors of i]
• Store v′ and v in memory: v′ starts out as c·u
• For each line “i  j(i,1), …, j(i,d)”:
  – For each j in j(i,1), …, j(i,d):
    • v′[j] += (1 − c)·v[i]/d
Everything needed for the update is right there in the row….
Streaming PageRank
• Assume we can store one row of W in memory at a time, but not v
  – i.e., the links on a web page fit in memory, but not the PageRank for the whole web
• Repeat until converged:
  – Let v(t+1) = c·u + (1 − c)·v(t)·W
• Store A as a row matrix: each line is
  – i  j(i,1), …, j(i,d)  [the neighbors of i]
• v′ starts out as c·u
• For each line “i  j(i,1), …, j(i,d)”:
  – For each j in j(i,1), …, j(i,d):
    • v′[j] += (1 − c)·v[i]/d
We need to convert these counter updates to messages, like we did for naïve Bayes. Like in naïve Bayes: a document fits in memory, but the model doesn’t.
Streaming: if we know c and v[i], and have the linked-to j’s, then
• we can compute d
• we can produce messages saying how much to increment each v[j]
Then we sort the messages by j and add up all the increments
• and somehow also add in c/N
So we need to stream through some structure that has v[i] plus all the outlinks. This is a copy of the graph with the current PageRank estimates (v) for each node attached to the node.
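One update sweep of that message-passing scheme might look like this in Python (the three-node graph and the row format are made up for illustration):

```python
from collections import defaultdict

c, N = 0.15, 3
# one row per node: (i, v[i], outlinks) - the graph with the
# current PageRank estimates attached to each node
rows = [("x", 1/3, ["y"]), ("y", 1/3, ["x", "z"]), ("z", 1/3, ["x"])]

# Pass 1 (streaming): emit one message per outlink saying
# how much to increment v[j]
messages = []
for i, vi, outlinks in rows:
    d = len(outlinks)                     # d is computable from the row itself
    for j in outlinks:
        messages.append((j, (1 - c) * vi / d))

# Pass 2 (sort by j and reduce): add up the increments,
# and also add in the c/N teleport term
v_next = defaultdict(lambda: c / N)
for j, inc in sorted(messages):
    v_next[j] += inc
```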
PageRank in Dataflow Languages
Iteration in Dataflow
• PIG and Guinea Pig are pure dataflow languages
  – no conditionals
  – no iteration
• To loop you need to embed a dataflow program into a ‘real’ program
Recap: a GPig program….
There’s no “program” object, and no obvious way to write a main that will call a program.
Calling Guinea Pig programmatically
Somewhere in the execution of the plan we will call this script with special arguments that tell it to do a substep of the plan. So we need a Python main that will do that right. But I also want to write an actual main program, so…. Create a Planner object I can call (not a subclass) – remember to call setup().
Calling Guinea Pig programmatically
• Convert from the initial format and assign the initial PageRank vector v
• Create the planner
• Move the initial PageRanks + graph to a tmp file
• Compute updated PageRanks
• Move the new PageRanks to a tmp file for the next iteration of the loop
lots and lots of i/o happening here…
row in initGraph: (i, [j1, …, jd])
note: there is lots of i/o happening here…
An example from Ron Bekkerman
Example: k-means clustering
An EM-like algorithm:
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
  – Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
  – Find new centroids that maximize that expectation
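The E- and M-steps above can be sketched in a few lines of Python (single machine, squared Euclidean distance; the toy points below are made up):

```python
def kmeans(points, centroids, iters=10):
    """E-step: assign each point to the closest centroid.
       M-step: recompute each centroid as the mean of its points."""
    for _ in range(iters):
        # E-step: associate each data instance with the closest centroid
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # M-step: recalculate centroids as an average of the assigned points
        # (a cluster that loses all its points keeps its old centroid)
        for i, pts in clusters.items():
            if pts:
                centroids[i] = tuple(sum(xs) / len(pts) for xs in zip(*pts))
    return centroids

cents = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])
```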
k-means Clustering
(figure: data points and centroids)
Parallelizing k-means
(figures)
k-means on MapReduce
• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids
Panda et al, Chapter 2
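Those bullets map onto mapper/reducer functions roughly as follows. This is a single-process sketch in which (local sum, local size) pairs stand in for the locally-weighted centroids:

```python
from collections import defaultdict

def mapper(shard, centroids):
    """Each mapper reads its data portion plus the centroids, assigns points
    to clusters, and emits per-cluster (local sum, local size)."""
    sums = defaultdict(lambda: [0.0, 0.0])
    sizes = defaultdict(int)
    for x, y in shard:
        i = min(range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2
                            + (y - centroids[i][1]) ** 2)
        sums[i][0] += x
        sums[i][1] += y
        sizes[i] += 1
    return [(i, (tuple(s), sizes[i])) for i, s in sums.items()]

def reducer(pairs):
    """Reducers aggregate the local sums (weighted by local cluster sizes)
    into new global centroids."""
    tot = defaultdict(lambda: [0.0, 0.0, 0])
    for i, ((sx, sy), n) in pairs:
        tot[i][0] += sx; tot[i][1] += sy; tot[i][2] += n
    return {i: (sx / n, sy / n) for i, (sx, sy, n) in tot.items()}

shards = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
cents = [(0, 0), (10, 10)]
new_cents = reducer(p for s in shards for p in mapper(s, cents))
```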
k-means in Apache Pig: input data
• Assume we need to cluster documents
  – Stored in a 3-column table D:

    Document  Word      Count
    doc1      Carnegie  2
    doc1      Mellon    2

• Initial centroids are k randomly chosen docs
  – Stored in table C in the same format as above
k-means in Apache Pig: E-step

c(d) = argmax_c ( Σ_{w∈d} i(d,w) · i(c,w) ) / √( Σ_{w∈c} i(c,w)² )

D_C      = JOIN C BY w, D BY w;
PROD     = FOREACH D_C GENERATE d, c, id * ic AS idic;
PRODg    = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;
SQR      = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg     = GROUP SQR BY c;
LEN_C    = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;
DOT_LEN  = JOIN LEN_C BY c, DOT_PROD BY c;
SIM      = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;
SIMg     = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);
k-means in Apache Pig: M-step

D_C_W      = JOIN CLUSTERS BY d, D BY d;
D_C_Wg     = GROUP D_C_W BY (c, w);
SUMS       = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;
D_C_Wgg    = GROUP D_C_W BY c;
SIZES      = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C          = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;

Finally – embed in Java (or Python or ….) to do the looping
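The embedding could be as simple as the driver loop below. The `run_e_m_step` body is a stub (a real driver would shell out to Pig, substituting the centroid file name as a parameter), so the file names and the Pig invocation mentioned in the comment are hypothetical:

```python
import os
import shutil
import tempfile

def run_e_m_step(centroids_file, data_file, out_file):
    """Stand-in for one Pig E-step + M-step run. A real driver would shell
    out, e.g. subprocess.check_call(["pig", "-param", "C=" + centroids_file,
    "-param", "D=" + data_file, "kmeans_step.pig"]) - names hypothetical.
    This stub just copies the centroids so the loop below is runnable."""
    shutil.copy(centroids_file, out_file)

workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "D.txt")
cents = os.path.join(workdir, "C.txt")
with open(data, "w") as f:
    f.write("doc1\tCarnegie\t2\n")
shutil.copy(data, cents)              # initial centroids: randomly chosen docs

for t in range(3):                    # the iteration Pig itself cannot express
    new_cents = os.path.join(workdir, "C_%d.txt" % (t + 1))
    run_e_m_step(cents, data, new_cents)  # one E-step + M-step dataflow
    cents = new_cents                 # new centroids feed the next iteration
```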
The problem with k-means in Hadoop
I/O costs: data is read, and the model is written, with every iteration.
Spark: Another Dataflow Language
Spark
• Hadoop: too much typing
  – programs are not concise
• Hadoop: too low level
  – missing abstractions
  – hard to specify a workflow
• Pig, Guinea Pig, … address these problems; so does Spark
• Hadoop and Pig-style dataflow are not well suited to iterative operations
  – e.g., PageRank, E/M, k-means clustering, …
  – Spark lowers the cost of repeated reads
Spark provides a set of concise dataflow operations (“transformations”).
Dataflow operations are embedded in an API together with “actions”.
Sharded files are replaced by “RDDs” – resilient distributed datasets. RDDs can be cached in cluster memory and recreated to recover from errors.
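The transformation/action split can be mimicked with Python generators, which is a useful mental model even though it is only an analogy, not the real Spark API: building the recipe does no work; forcing it does.

```python
log = ["INFO ok", "ERROR disk full", "INFO ok", "ERROR timeout"]

# "transformation": building this generator does no work yet - it only
# records HOW to produce the data, like errors = lines.filter(...)
errors = (line for line in log if line.startswith("ERROR"))

# "action": iterating actually executes the plan, like errors.count()
count = sum(1 for _ in errors)
```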
Spark examples
spark is a spark context object
Spark examples
errors is a transformation, and thus a data structure that explains HOW to do something.
count() is an action: it will actually execute the plan for errors and return a value.
errors.filter() is a transformation; collect() is an action.
Everything is sharded, like in Hadoop and GuineaPig.
Spark examples
# modify errors to be stored in cluster memory
Subsequent actions will be much faster.
Everything is sharded … and the shards are stored in the memory of worker machines, not local disk (if possible).
You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.
Spark examples
persist-on-disk works because the RDD is read-only (immutable)
Spark examples: wordcount
transformations on (key, value) pairs, which are special, plus the final action
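The slide’s code isn’t reproduced in this transcript, but the classic Spark wordcount pipeline (flatMap to words, map to (word, 1), reduceByKey with +) can be simulated in plain Python:

```python
from collections import defaultdict

lines = ["a rose is a rose", "is a rose"]

# flatMap: each line -> many words
words = (w for line in lines for w in line.split())

# map: word -> (word, 1)
pairs = ((w, 1) for w in words)

# reduceByKey(lambda a, b: a + b): combine all values that share a key
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n
```

In Spark the reduceByKey step is what shuffles the (key, value) pairs so that all counts for a word land on one worker.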
Spark examples: batch logistic regression
reduce is an action – it produces a numpy vector
p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
Spark examples: batch logistic regression
Important note: numpy vectors/matrices are not just “syntactic sugar”.
• They are much more compact than something like a list of Python floats.
• numpy operations like dot, *, and + are calls to optimized C code.
• A little Python logic around a lot of numpy calls is pretty efficient.
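A pure-Python rendering of the batch gradient computation in these slides: each pass maps a per-point gradient term over the (cached) points and reduces with +. The per-point expression matches the well-known Spark logistic regression example; the toy data and learning rate are made up, and in the real example `p.x` and `w` would be numpy vectors.

```python
import math

def grad_term(x, y, w):
    """One point's contribution to the batch gradient of the logistic loss:
    (1/(1 + exp(-y * (w . x))) - 1) * y * x."""
    margin = sum(wi * xi for wi, xi in zip(w, x))
    coeff = (1.0 / (1.0 + math.exp(-y * margin)) - 1.0) * y
    return [coeff * xi for xi in x]

points = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]   # (p.x, p.y) pairs
w = [0.0, 0.0]
for _ in range(100):                   # each pass = one map + one reduce
    g = [0.0, 0.0]
    for x, y in points:                # "map" the gradient term over points...
        for k, gk in enumerate(grad_term(x, y, w)):
            g[k] += gk                 # ...and "reduce" with +
    w = [wk - 0.5 * gk for wk, gk in zip(w, g)]
```

After training, w points toward the positive class (dimension 0) and away from the negative one (dimension 1).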
Spark examples: batch logistic regression
w is defined outside the lambda function, but used inside it. So: Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.
Spark examples: batch logistic regression
the dataset of points is cached in cluster memory to reduce i/o
Spark logistic regression example
Spark