In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:
The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…
This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.
GUINEA PIG: A WORKFLOW PACKAGE FOR PYTHON
Full Implementation
docid  term       v(d,w)
d137   aardvark   0.645
d137   found      0.083
d137   I          0.004
d137   farmville  0.356
d138   when       …
d138   …          …
Motivation
• Integrating data is important
• Data from different sources may not have consistent object identifiers
  – Especially automatically-constructed ones
• But databases will have human-readable names and/or descriptions for the objects
• But matching names and descriptions is tricky….
(figure: two databases DB1 and DB2 with overlapping objects)
One solution: Soft (Similarity) Joins
• A similarity join of two sets A and B is an ordered list of triples (sij, ai, bj) such that
  – ai is from A
  – bj is from B
  – sij is the similarity of ai and bj
  – the triples are in descending order of sij
  – the list is either the top K triples by sij, or ALL triples with sij > L … or sometimes some approximation of these….
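As a concrete (if naive) rendering of this definition, the sketch below computes all pairwise similarities and keeps the top K. The token-overlap `similarity` function is just one illustrative choice, not the TFIDF weighting developed later in these slides:

```python
from itertools import product

def similarity(a, b):
    """Jaccard overlap of the token sets of two names (one simple choice of sim)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def soft_join(A, B, K=None, L=None):
    """Return (sij, ai, bj) triples in descending order of similarity,
    keeping either the top K triples or all triples with sij > L."""
    triples = sorted(
        ((similarity(a, b), a, b) for a, b in product(A, B)),
        reverse=True)
    if L is not None:
        triples = [t for t in triples if t[0] > L]
    if K is not None:
        triples = triples[:K]
    return triples

pairs = soft_join(["acadia national park", "gateway arch"],
                  ["gateway arch nhs", "acadia np"], K=2)
```

The quadratic all-pairs loop is exactly what the inverted-index workflow later in this section avoids.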
Example: soft joins/similarity joins
Input: Two Different Lists of Entity Names
Output: Pairs of Names Ranked by Similarity
(figure: matched pairs labeled identical, similar, and less similar)
Semantic Joining with Multiscale Statistics
William Cohen, Katie Rivard, Dana Attias-Moshevitz
CMU
A parallel workflow for TFIDF similarity joins
Parallel Inverted Index Softjoin - 1
Statistics for computing TFIDF with IDFs local to each relation
sumSquareWeights
rel  docid  term     v(d,w)
a    a137   gateway  0.945
a    a137   nwr      0.055
a    a138   acadia   0.915
a    a138   np       0.085
b    b013   gateway  0.845
b    b013   np       0.155
b    …      …        …
Parallel Inverted Index Softjoin - 2
What’s the algorithm?
• Step 1: create document vectors as (Cd, d, term, weight) tuples
• Step 2: join the tuples from A and B: one sort and reduce
  – Gives you tuples (a, b, term, w(a,term) · w(b,term))
• Step 3: group the common terms by (a, b) and reduce to aggregate the components of the sum

sim(v, v′) = v · v′ = Σ_i v(i) · v′(i)
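A single-machine sketch of the three steps above, with Python dictionaries standing in for the sort-and-reduce passes (the tiny weight values are made up for illustration):

```python
from collections import defaultdict

# Step 1: document vectors as (relation, doc, term, weight) tuples;
# the TFIDF weights are assumed to be already computed.
tuples_a = [("a", "a137", "gateway", 0.945), ("a", "a137", "nwr", 0.055)]
tuples_b = [("b", "b013", "gateway", 0.845), ("b", "b013", "np", 0.155)]

# Step 2: join the tuples from A and B on the term
# (one sort-and-reduce in the parallel version).
index_b = defaultdict(list)
for _, b, term, w in tuples_b:
    index_b[term].append((b, w))

products = []   # tuples (a, b, term, w(a,term) * w(b,term))
for _, a, term, wa in tuples_a:
    for b, wb in index_b.get(term, []):
        products.append((a, b, term, wa * wb))

# Step 3: group the common terms by (a, b) and sum
# the components of the dot product.
sim = defaultdict(float)
for a, b, _, p in products:
    sim[(a, b)] += p
```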
Making the algorithm smarter….
Inverted Index Softjoin - 2
We should make a smart choice about which terms to use.
Adding heuristics to the soft join - 1
Adding heuristics to the soft join - 2
Adding heuristics
• Parks:
  – input 40k, data 60k, docvec 102k
  – softjoin: 539k tokens, 508k documents, 0 errors in top 50
• w/heuristics:
  – input 40k, data 60k, docvec 102k
  – softjoin: 32k tokens, 24k documents, 3 errors in top 50, <400 useful terms
Adding heuristics
• SO vs Wikipedia:
  – input 612M, docvec 1050M, softjoin 67M
• with heuristics:
  – input 612M, docvec 1050M, softjoin 9.1M
PageRank
Google’s PageRank
(figure: a small graph of web sites linking to one another)
• Inlinks are “good” (recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site
• but inlinks from sites with many outlinks are not as “good”…
• “Good” and “bad” are relative.
Google’s PageRank
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to a random page
Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
PageRank ranks pages by the amount of time the pagehopper spends on a page:
• or, if there were many pagehoppers, PageRank is the expected “crowd size”
PageRank in Memory
• Let u = (1/N, …, 1/N)
  – dimension = #nodes N
• Let A = adjacency matrix: [a(i,j) = 1 ⟺ i links to j]
• Let W = [w(i,j) = a(i,j) / outdegree(i)]
  – w(i,j) is the probability of a jump from i to j
• Let v0 = (1, 1, …, 1)
  – or anything else you want
• Repeat until converged:
  – Let v(t+1) = c·u + (1 − c)·v(t)·W
    • c is the probability of jumping “anywhere randomly”
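A minimal in-memory version of this loop (pure Python, with the adjacency stored as out-link lists; it assumes every node has at least one out-link, and starts from the uniform vector rather than all-ones):

```python
def pagerank(links, c=0.15, iters=50):
    """Power iteration: v <- c*u + (1-c)*v*W, where W is the
    out-degree-normalized adjacency matrix (here: a dict of out-link lists)."""
    nodes = list(links)
    N = len(nodes)
    v = {i: 1.0 / N for i in nodes}          # start from the uniform vector
    for _ in range(iters):
        v_next = {i: c / N for i in nodes}   # the c*u teleport term
        for i in nodes:
            out = links[i]
            for j in out:                    # spread v[i] over i's out-links
                v_next[j] += (1 - c) * v[i] / len(out)
        v = v_next
    return v

ranks = pagerank({"x": ["y"], "y": ["x", "z"], "z": ["x"]})
```

Here "x" has the most (and best) inlinks, so it ends up with the largest score.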
Streaming PageRank
• Assume we can store v but not W in memory
• Repeat until converged:
  – Let v(t+1) = c·u + (1 − c)·v(t)·W
• Store A as a row matrix: each line is
  – i  j(i,1), …, j(i,d)  [the neighbors of i]
• Store v′ and v in memory: v′ starts out as c·u
• For each line “i  j(i,1), …, j(i,d)”:
  – For each j in j(i,1), …, j(i,d):
    • v′[j] += (1 − c)·v[i]/d
Everything needed for the update is right there in the row….
Streaming PageRank
• Assume we can store one row of W in memory at a time, but not v
  – i.e., the links on a web page fit in memory, but not the PageRank for the whole web
• Repeat until converged:
  – Let v(t+1) = c·u + (1 − c)·v(t)·W
• Store A as a row matrix: each line is
  – i  j(i,1), …, j(i,d)  [the neighbors of i]
• v′ starts out as c·u
• For each line “i  j(i,1), …, j(i,d)”:
  – For each j in j(i,1), …, j(i,d):
    • v′[j] += (1 − c)·v[i]/d
We need to convert these counter updates to messages, like we did for naïve Bayes. Like in naïve Bayes: a document fits in memory, but the model doesn’t.
Streaming: if we know c and v[i], and have the linked-to j’s, then
• we can compute d
• we can produce messages saying how much to increment each v[j]
Then we sort the messages by j and add up all the increments
• and somehow also add in c/N
So we need to stream through some structure that has v[i] plus all the outlinks. This is a copy of the graph with the current PageRank estimates (v) for each node attached to the node.
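One update sweep of that message-passing scheme might look like this in Python (the three-node graph and the row format are made up for illustration):

```python
from collections import defaultdict

c, N = 0.15, 3
# one row per node: (i, v[i], outlinks) - the graph with the
# current PageRank estimates attached to each node
rows = [("x", 1/3, ["y"]), ("y", 1/3, ["x", "z"]), ("z", 1/3, ["x"])]

# Pass 1 (streaming): emit one message per outlink saying
# how much to increment v[j]
messages = []
for i, vi, outlinks in rows:
    d = len(outlinks)                     # d is computable from the row itself
    for j in outlinks:
        messages.append((j, (1 - c) * vi / d))

# Pass 2 (sort by j and reduce): add up the increments,
# and also add in the c/N teleport term
v_next = defaultdict(lambda: c / N)
for j, inc in sorted(messages):
    v_next[j] += inc
```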
PageRank in Dataflow Languages
Iteration in Dataflow
• PIG and Guinea Pig are pure dataflow languages
  – no conditionals
  – no iteration
• To loop you need to embed a dataflow program into a ‘real’ program
Recap: a GPig program….
There’s no “program” object, and no obvious way to write a main that will call a program.
Calling Guinea Pig programmatically
Somewhere in the execution of the plan we will call this script with special arguments that tell it to do a substep of the plan. So we need a Python main that will do that right. But I also want to write an actual main program, so…. Create a Planner object I can call (not a subclass) – remember to call setup().
Calling Guinea Pig programmatically
• Convert from the initial format and assign the initial PageRank vector v
• Create the planner
• Move the initial PageRanks + graph to a tmp file
• Compute updated PageRanks
• Move the new PageRanks to a tmp file for the next iteration of the loop
lots and lots of i/o happening here…
row in initGraph: (i, [j1, …, jd])
note: there is lots of i/o happening here…
An example from Ron Bekkerman
Example: k-means clustering
An EM-like algorithm:
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
  – Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
  – Find new centroids that maximize that expectation
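The E- and M-steps above can be sketched in a few lines of Python (single machine, squared Euclidean distance; the toy points below are made up):

```python
def kmeans(points, centroids, iters=10):
    """E-step: assign each point to the closest centroid.
       M-step: recompute each centroid as the mean of its points."""
    for _ in range(iters):
        # E-step: associate each data instance with the closest centroid
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # M-step: recalculate centroids as an average of the assigned points
        # (a cluster that loses all its points keeps its old centroid)
        for i, pts in clusters.items():
            if pts:
                centroids[i] = tuple(sum(xs) / len(pts) for xs in zip(*pts))
    return centroids

cents = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])
```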
k-means Clustering
(figure: data points and centroids)
Parallelizing k-means
(figures)
k-means on MapReduce
• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids
Panda et al, Chapter 2
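Those bullets map onto mapper/reducer functions roughly as follows. This is a single-process sketch in which (local sum, local size) pairs stand in for the locally-weighted centroids:

```python
from collections import defaultdict

def mapper(shard, centroids):
    """Each mapper reads its data portion plus the centroids, assigns points
    to clusters, and emits per-cluster (local sum, local size)."""
    sums = defaultdict(lambda: [0.0, 0.0])
    sizes = defaultdict(int)
    for x, y in shard:
        i = min(range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2
                            + (y - centroids[i][1]) ** 2)
        sums[i][0] += x
        sums[i][1] += y
        sizes[i] += 1
    return [(i, (tuple(s), sizes[i])) for i, s in sums.items()]

def reducer(pairs):
    """Reducers aggregate the local sums (weighted by local cluster sizes)
    into new global centroids."""
    tot = defaultdict(lambda: [0.0, 0.0, 0])
    for i, ((sx, sy), n) in pairs:
        tot[i][0] += sx; tot[i][1] += sy; tot[i][2] += n
    return {i: (sx / n, sy / n) for i, (sx, sy, n) in tot.items()}

shards = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
cents = [(0, 0), (10, 10)]
new_cents = reducer(p for s in shards for p in mapper(s, cents))
```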
k-means in Apache Pig: input data
• Assume we need to cluster documents
  – Stored in a 3-column table D:

    Document  Word      Count
    doc1      Carnegie  2
    doc1      Mellon    2

• Initial centroids are k randomly chosen docs
  – Stored in table C in the same format as above
k-means in Apache Pig: E-step

c(d) = argmax_c ( Σ_{w∈d} i(d,w) · i(c,w) ) / √( Σ_{w∈c} i(c,w)² )

D_C      = JOIN C BY w, D BY w;
PROD     = FOREACH D_C GENERATE d, c, id * ic AS idic;
PRODg    = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;
SQR      = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg     = GROUP SQR BY c;
LEN_C    = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;
DOT_LEN  = JOIN LEN_C BY c, DOT_PROD BY c;
SIM      = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;
SIMg     = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);
k-means in Apache Pig: M-step

D_C_W      = JOIN CLUSTERS BY d, D BY d;
D_C_Wg     = GROUP D_C_W BY (c, w);
SUMS       = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;
D_C_Wgg    = GROUP D_C_W BY c;
SIZES      = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C          = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;

Finally – embed in Java (or Python or ….) to do the looping
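The embedding could be as simple as the driver loop below. The `run_e_m_step` body is a stub (a real driver would shell out to Pig, substituting the centroid file name as a parameter), so the file names and the Pig invocation mentioned in the comment are hypothetical:

```python
import os
import shutil
import tempfile

def run_e_m_step(centroids_file, data_file, out_file):
    """Stand-in for one Pig E-step + M-step run. A real driver would shell
    out, e.g. subprocess.check_call(["pig", "-param", "C=" + centroids_file,
    "-param", "D=" + data_file, "kmeans_step.pig"]) - names hypothetical.
    This stub just copies the centroids so the loop below is runnable."""
    shutil.copy(centroids_file, out_file)

workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "D.txt")
cents = os.path.join(workdir, "C.txt")
with open(data, "w") as f:
    f.write("doc1\tCarnegie\t2\n")
shutil.copy(data, cents)              # initial centroids: randomly chosen docs

for t in range(3):                    # the iteration Pig itself cannot express
    new_cents = os.path.join(workdir, "C_%d.txt" % (t + 1))
    run_e_m_step(cents, data, new_cents)  # one E-step + M-step dataflow
    cents = new_cents                 # new centroids feed the next iteration
```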
The problem with k-means in Hadoop
I/O costs: data is read, and the model is written, with every iteration.
Spark: Another Dataflow Language
Spark
• Hadoop: too much typing
  – programs are not concise
• Hadoop: too low level
  – missing abstractions
  – hard to specify a workflow
• Pig, Guinea Pig, … address these problems; so does Spark
• Hadoop and Pig-style dataflow are not well suited to iterative operations
  – e.g., PageRank, E/M, k-means clustering, …
  – Spark lowers the cost of repeated reads
Spark provides a set of concise dataflow operations (“transformations”).
Dataflow operations are embedded in an API together with “actions”.
Sharded files are replaced by “RDDs” – resilient distributed datasets. RDDs can be cached in cluster memory and recreated to recover from errors.
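The transformation/action split can be mimicked with Python generators, which is a useful mental model even though it is only an analogy, not the real Spark API: building the recipe does no work; forcing it does.

```python
log = ["INFO ok", "ERROR disk full", "INFO ok", "ERROR timeout"]

# "transformation": building this generator does no work yet - it only
# records HOW to produce the data, like errors = lines.filter(...)
errors = (line for line in log if line.startswith("ERROR"))

# "action": iterating actually executes the plan, like errors.count()
count = sum(1 for _ in errors)
```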
Spark examples
spark is a spark context object
Spark examples
errors is a transformation, and thus a data structure that explains HOW to do something.
count() is an action: it will actually execute the plan for errors and return a value.
errors.filter() is a transformation; collect() is an action.
Everything is sharded, like in Hadoop and GuineaPig.
Spark examples
# modify errors to be stored in cluster memory
Subsequent actions will be much faster.
Everything is sharded … and the shards are stored in the memory of worker machines, not local disk (if possible).
You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.
Spark examples
persist-on-disk works because the RDD is read-only (immutable)
Spark examples: wordcount
transformations on (key, value) pairs, which are special, plus the final action
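The slide’s code isn’t reproduced in this transcript, but the classic Spark wordcount pipeline (flatMap to words, map to (word, 1), reduceByKey with +) can be simulated in plain Python:

```python
from collections import defaultdict

lines = ["a rose is a rose", "is a rose"]

# flatMap: each line -> many words
words = (w for line in lines for w in line.split())

# map: word -> (word, 1)
pairs = ((w, 1) for w in words)

# reduceByKey(lambda a, b: a + b): combine all values that share a key
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n
```

In Spark the reduceByKey step is what shuffles the (key, value) pairs so that all counts for a word land on one worker.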
Spark examples: batch logistic regression
reduce is an action – it produces a numpy vector
p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
Spark examples: batch logistic regression
Important note: numpy vectors/matrices are not just “syntactic sugar”.
• They are much more compact than something like a list of Python floats.
• numpy operations like dot, *, and + are calls to optimized C code.
• A little Python logic around a lot of numpy calls is pretty efficient.
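A pure-Python rendering of the batch gradient computation in these slides: each pass maps a per-point gradient term over the (cached) points and reduces with +. The per-point expression matches the well-known Spark logistic regression example; the toy data and learning rate are made up, and in the real example `p.x` and `w` would be numpy vectors.

```python
import math

def grad_term(x, y, w):
    """One point's contribution to the batch gradient of the logistic loss:
    (1/(1 + exp(-y * (w . x))) - 1) * y * x."""
    margin = sum(wi * xi for wi, xi in zip(w, x))
    coeff = (1.0 / (1.0 + math.exp(-y * margin)) - 1.0) * y
    return [coeff * xi for xi in x]

points = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]   # (p.x, p.y) pairs
w = [0.0, 0.0]
for _ in range(100):                   # each pass = one map + one reduce
    g = [0.0, 0.0]
    for x, y in points:                # "map" the gradient term over points...
        for k, gk in enumerate(grad_term(x, y, w)):
            g[k] += gk                 # ...and "reduce" with +
    w = [wk - 0.5 * gk for wk, gk in zip(w, g)]
```

After training, w points toward the positive class (dimension 0) and away from the negative one (dimension 1).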
Spark examples: batch logistic regression
w is defined outside the lambda function, but used inside it. So: Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.
Spark examples: batch logistic regression
the dataset of points is cached in cluster memory to reduce i/o
Spark logistic regression example
Spark