Thuong-Cang Phan ([email protected]), Laurent d'Orazio ([email protected]), Philippe Rigaux ([email protected])
Cloud-I '13, VLDB Workshop 2013, August 26th, Trento, Italy.
ISIMA UMR 6158 CNRS
Context
MapReduce: a popular big data processing framework; its basic complex operations are used extensively and expensively
Join operations: R1(X1) ⋈ R2(X2) ⋈ ... ⋈ Rn(Xn)
Big join:
  an important operation for efficient data analysis and query evaluation
  NOT a straightforward implementation in MapReduce
  compiled to MapReduce job(s)
  Join algorithms: Map-side join, Reduce-side join, Broadcast join, etc.
Too much unnecessary intermediate data generated in the map phase
Problem: Intermediate data in Join
Reduce-Side Join: P(.., sch-id) ⋈ S(sch-id, ...)
[Figure: data flow input → map → shuffle → reduce → output for the join of P (Person.dat) and S (School.dat). Callout: "300,000,000,... Don't be wasting ..."]
Input P (Person.dat): Philippe::001::5::9783, Dominique::661::3::9702, Baraa::661::3::9796, Cang::661::5::9789, Laurent::333::4::9785
Input S (School.dat): 001::CNAM, 002::Cergy-Pontoise, 003::Blaise Pascal, 004::CanTho, 006::Paris Sud 11
Map: emits pairs (key, tagged record), e.g. (001, P:Philippe::001::5::9783), (661, P:Dominique::661::3::9702), (001, S:001::CNAM), (002, S:002::Cergy-Pontoise)
Shuffle: groups by join key, e.g. (001, [P:Philippe::001::…], [S:001::CNAM]), (661, [P:Dominique::661::], [P:Baraa::661::3::…], [P:Cang::661::5::…]), (003, [S:003::Blaise Pascal]), (333, [P:Laurent::33…])
Reduce: buffers records into two sets according to the table tag, then computes the cross-product
Output: (Philippe, CN…)
Drawback: many tuples don't actually participate in the join operation.
They significantly increase the costs: I/O operations for intermediate results, communication cost.
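The reduce-side join on this slide can be simulated in memory. The following is a minimal Python sketch, not the authors' Hadoop code: the function names `map_phase` and `reduce_side_join` are illustrative assumptions, and only the Person/School records come from the slide. It counts how many intermediate pairs the map phase emits and how many of them are actually needed for the join result.

```python
from collections import defaultdict

PERSONS = [
    "Philippe::001::5::9783",
    "Dominique::661::3::9702",
    "Baraa::661::3::9796",
    "Cang::661::5::9789",
    "Laurent::333::4::9785",
]
SCHOOLS = [
    "001::CNAM",
    "002::Cergy-Pontoise",
    "003::Blaise Pascal",
    "004::CanTho",
    "006::Paris Sud 11",
]

def map_phase(persons, schools):
    """Emit (join key, (table tag, record)) for every input record."""
    pairs = []
    for rec in persons:
        pairs.append((rec.split("::")[1], ("P", rec)))  # sch-id is the 2nd field of P
    for rec in schools:
        pairs.append((rec.split("::")[0], ("S", rec)))  # sch-id is the 1st field of S
    return pairs

def reduce_side_join(persons, schools):
    pairs = map_phase(persons, schools)
    # Shuffle: group the tagged records by join key.
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    # Reduce: split each group by table tag, then take the cross product.
    output = []
    for key in sorted(groups):
        p_recs = [rec for tag, rec in groups[key] if tag == "P"]
        s_recs = [rec for tag, rec in groups[key] if tag == "S"]
        output.extend((key, p, s) for p in p_recs for s in s_recs)
    return pairs, output

pairs, output = reduce_side_join(PERSONS, SCHOOLS)
needed = sum(1 for key, _ in pairs if key == "001")
print(len(pairs), "intermediate pairs emitted,", needed, "needed for the result")
print(output)
```

With this input only key 001 appears in both datasets, so most of the shuffled pairs are discarded in the reduce phase, which is the waste the slide points at.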
Proposed Solution
P = {(001, Philippe), (661, Barra), (333, Laurent)}
S = {(001, CNAM), (002, Lyon I), (003, BP)}
P ∩ S = {001}; join result: {(001, Philippe, CNAM)}
P ∩ S = (P ∪ S) \ (P ∆ S): the intersection filter
Contributions:
(a) three approaches to the intersection filter, which approximates the intersection of datasets;
(b) the feasibility of our approaches used in two-way joins;
(c) the advantage of the intersection filter for important join cases;
(d) the considerable efficiency of the intersection filter as compared with basic filters in join operations.
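As a small worked example of the identity behind the intersection filter, the sketch below applies it to exact Python sets built from the P and S tuples on this slide. The variable names are illustrative; a real intersection filter approximates these sets with Bloom filters rather than storing the keys.

```python
# Join keys and payloads taken from the slide's example.
P = {"001": "Philippe", "661": "Barra", "333": "Laurent"}
S = {"001": "CNAM", "002": "Lyon I", "003": "BP"}

p_keys, s_keys = set(P), set(S)

union = p_keys | s_keys      # P ∪ S
sym_diff = p_keys ^ s_keys   # P ∆ S
common = union - sym_diff    # (P ∪ S) \ (P ∆ S)

# Only tuples whose key is in the intersection can contribute to the join.
joined = [(key, P[key], S[key]) for key in sorted(common)]
print(common, joined)
```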
Content
  Join algorithms in MapReduce
  Modeling Intersection Filter (I.F)
  Optimization of two-way join using I.F
  Advantage of I.F for important join cases
  Cost analysis and experimental evaluation
Join Algorithms in MapReduce
Reduce-side Join: The actual join happens on the Reduce side of the framework. The 'map' phase only pre-processes the tuples of the two datasets to organize them in terms of the join key.
Map-side Join: It is carried out on Mapper nodes. Both input datasets for each map task must already be partitioned and sorted by the same join key.
Broadcast Join: Mappers load the small dataset into memory and call the map function to join each tuple from the bigger dataset.
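The broadcast join described above amounts to a hash join in which every mapper holds the small dataset in memory. A minimal single-process sketch, assuming tuples are plain Python tuples and the key extractors are passed in as functions (both are illustrative choices, not part of the deck):

```python
def broadcast_join(small, big, small_key, big_key):
    """Join `big` against an in-memory hash table built from `small`."""
    table = {}
    for rec in small:  # done once per mapper: load the small dataset
        table.setdefault(small_key(rec), []).append(rec)
    for rec in big:    # one map call per tuple of the bigger dataset
        for match in table.get(big_key(rec), []):
            yield rec, match

schools = [("001", "CNAM"), ("003", "Blaise Pascal")]
persons = [("Philippe", "001"), ("Cang", "661"), ("Laurent", "333")]

result = list(broadcast_join(schools, persons,
                             small_key=lambda s: s[0],
                             big_key=lambda p: p[1]))
print(result)
```

No shuffle is needed in this variant, but it only applies when the small dataset fits in each mapper's memory.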
Bloom Filter (BF)
A Bloom filter [Burton Howard Bloom, 1970] is a space-efficient probabilistic data structure used to test membership in a set with a small rate of false positives (a false positive probability).
A BF representing a static set S = {e1, e2, …, en} of n elements consists of an array of m bits and a group of k independent hash functions h1, …, hk with the range {1, …, m}.
[Figure: A Bloom filter.]
No false negatives: z is definitely not a member.
False positives: y is probably a member (may be wrong).
\[ \Pr[\text{bit is still 0}] = p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \]
\[ \Pr[\text{false pos}] = f = (1 - p)^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k \]
The optimum is found at k = (ln 2) m/n, p = 1/2, by taking the derivative of f.
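A minimal Bloom filter sketch in Python, to make the definition concrete. It assumes the k hash functions can be simulated by salting SHA-256 with the function index, and it indexes bits from 0 to m-1 rather than 1 to m; both are implementation choices made here, not something the slide specifies. The two helper functions evaluate the standard false positive probability (1 - (1 - 1/m)^{kn})^k and the optimal k = (ln 2) m/n.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indices(self, item):
        # Assumption: k "independent" hash functions are simulated by
        # salting SHA-256 with the index of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # True means "probably a member"; False means "definitely not".
        return all(self.bits[idx] for idx in self._indices(item))

def false_positive_rate(m, k, n):
    """f = (1 - (1 - 1/m)^(k*n))^k for n inserted elements."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def optimal_k(m, n):
    """Number of hash functions minimising f: k = (ln 2) * m / n."""
    return max(1, round(math.log(2) * m / n))

keys = [f"key{i}" for i in range(100)]
bf = BloomFilter(m=1024, k=optimal_k(1024, len(keys)))
for key in keys:
    bf.add(key)
print(bf.k, false_positive_rate(bf.m, bf.k, len(keys)))
```

Every inserted key is reported as present (no false negatives); keys that were never inserted are rejected except with the false positive probability computed above.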
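Partitioned Bloom Filter (PBF)
Insert x: BF(S) ← x; the k hash functions h1(), h2(), …, hk() encode the k bit indices to set.
k partitions of length m/k bits; each hash function sets one bit in its own partition.
\[ \Pr[\text{bit is still 0}] = p_p = \left(1 - \frac{k}{m}\right)^{n} \le \left(1 - \frac{1}{m}\right)^{kn} = p \]
\[ \Pr[\text{false pos}] = f_p = (1 - p_p)^k = \left(1 - \left(1 - \frac{k}{m}\right)^{n}\right)^k \ge f \]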
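A partitioned Bloom filter can be sketched the same way: the m bits are split into k partitions of m/k bits and hash function i only touches partition i. As before, salting SHA-256 with the function index is an assumption made for this sketch. The two rate functions compare the partitioned probability (1 - (1 - k/m)^n)^k with the unpartitioned one (1 - (1 - 1/m)^{kn})^k.

```python
import hashlib

class PartitionedBloomFilter:
    def __init__(self, m, k):
        if m % k != 0:
            raise ValueError("m must be a multiple of k")
        self.k = k
        self.width = m // k  # bits per partition
        self.partitions = [[False] * self.width for _ in range(k)]

    def _slots(self, item):
        # Hash function i selects one bit inside partition i only.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for part, idx in self._slots(item):
            self.partitions[part][idx] = True

    def __contains__(self, item):
        return all(self.partitions[part][idx] for part, idx in self._slots(item))

def pbf_false_positive(m, k, n):
    return (1 - (1 - k / m) ** n) ** k

def bf_false_positive(m, k, n):
    return (1 - (1 - 1 / m) ** (k * n)) ** k

pbf = PartitionedBloomFilter(m=1024, k=8)
for key in ("001", "661", "333"):
    pbf.add(key)
print(pbf_false_positive(1024, 8, 100), bf_false_positive(1024, 8, 100))
```

Because (1 - 1/m)^k >= 1 - k/m, a bit is slightly more likely to be set in the partitioned layout, so its false positive probability is never lower than that of the unpartitioned filter with the same m, k and n.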
Modeling Intersection Filter
Three approaches to building the intersection filter
R ∩ S = (R ∪ S) \ (R ∆ S) = (R ∪ S) \ ((R \ S) ∪ (S \ R))
(1) A pair of Bloom filters: BF(R), BF(S)
(2) Unpartitioned BF intersection
(3) Partitioned BF intersection
BF(R ∩ S) = BF(R) ∩ BF(S) with probability
\[ \left(1 - \frac{1}{m}\right)^{k\,|R - R \cap S| \,\cdot\, k\,|S - R \cap S|} \]
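The first two approaches can be sketched with Bloom filters stored as Python integers used as bitsets. This is an illustrative sketch under stated assumptions: the filter size `M`, the hash count `K` and the salted SHA-256 hashing are choices made here, and the key lists reuse the Person/School join keys from the earlier example. Approach (1) keeps two filters and lets each dataset probe the other's filter; approach (2) builds one filter as the bitwise AND of BF(R) and BF(S).

```python
import hashlib

M, K = 256, 3  # illustrative filter size and number of hash functions

def indices(item):
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]

def bloom(items):
    bits = 0
    for item in items:
        for idx in indices(item):
            bits |= 1 << idx
    return bits

def might_contain(bits, item):
    return all((bits >> idx) & 1 for idx in indices(item))

R_keys = ["001", "661", "333"]
S_keys = ["001", "002", "003", "004", "006"]

bf_r, bf_s = bloom(R_keys), bloom(S_keys)

# Approach (1): a pair of filters; R tuples probe BF(S), S tuples probe BF(R).
r_kept_pair = [k for k in R_keys if might_contain(bf_s, k)]
s_kept_pair = [k for k in S_keys if might_contain(bf_r, k)]

# Approach (2): one intersection filter, the bitwise AND of both filters.
inter = bf_r & bf_s
r_kept = [k for k in R_keys if might_contain(inter, k)]
s_kept = [k for k in S_keys if might_contain(inter, k)]
print(r_kept, s_kept)
```

Approach (3) would apply the same AND partition by partition on two partitioned filters built with the same k and m. In all cases a key present in both datasets always survives, while keys present in only one dataset are dropped unless they hit a false intersection.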
The false intersection probability
THEOREM 1. A false intersection by a pair of Bloom filters is identified with one of the probabilities
\[ f_{pair(R)} = \left(1 - \left(1 - \frac{1}{m_1}\right)^{k_1 |R|}\right)^{k_1}, \qquad f_{pair(S)} = \left(1 - \left(1 - \frac{1}{m_2}\right)^{k_2 |S|}\right)^{k_2} \]
THEOREM 2. A false intersection by intersecting unpartitioned filters is identified with probability
\[ f_{BF} = \left(1 - \left(1 - \frac{1}{m}\right)^{k |R|}\right)^{k} \left(1 - \left(1 - \frac{1}{m}\right)^{k |S|}\right)^{k} \]
THEOREM 3. A false intersection by intersecting partitioned filters is identified with probability
\[ f_{PBF} = \left(1 - \left(1 - \frac{k}{m}\right)^{|R|}\right)^{k} \left(1 - \left(1 - \frac{k}{m}\right)^{|S|}\right)^{k} \]
THEOREM 4. The false intersection probability of the unpartitioned filter intersection is less than the false intersection probability of the partitioned filter intersection:
\[ f_{BF} < f_{PBF} \]
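Theorem 4 can be checked numerically. The sketch below assumes Theorems 2 and 3 take the form f_BF = (1 - (1 - 1/m)^{k|R|})^k (1 - (1 - 1/m)^{k|S|})^k and f_PBF = (1 - (1 - k/m)^{|R|})^k (1 - (1 - k/m)^{|S|})^k; the parameter values are arbitrary illustrations, not figures from the deck.

```python
def f_bf(m, k, r, s):
    """False intersection probability, unpartitioned filters (assumed form)."""
    return ((1 - (1 - 1 / m) ** (k * r)) ** k) * ((1 - (1 - 1 / m) ** (k * s)) ** k)

def f_pbf(m, k, r, s):
    """False intersection probability, partitioned filters (assumed form)."""
    return ((1 - (1 - k / m) ** r) ** k) * ((1 - (1 - k / m) ** s) ** k)

# Illustrative sizes: m bits, k hash functions, |R| and |S| distinct join keys.
m, k, r, s = 8192, 4, 500, 800
print(f_bf(m, k, r, s), f_pbf(m, k, r, s))
```

The ordering follows from (1 - 1/m)^k > 1 - k/m for k > 1: each factor of f_BF is smaller than the corresponding factor of f_PBF.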
Advantage of I.F for important join cases
Chain Join: R1(x1, x2) ⨝ R2(x2, x3) ⨝ R3(x3, x4) ⨝ ... ⨝ Rn(xn, xn+1)
Execution of a chain join using a Bloomjoin cascade: R2, R3, ..., Rn have not been filtered.
Execution of a chain join using a cascade of intersection filter joins: R2, R3, ..., Rn have been filtered.
Advantage of I.F for important join cases
I.F-based optimization of a chain join
Extended intersection filter (E.I.F): includes an array of Bloom filters hashed on different join keys. Each tuple of a dataset may contain a few join keys linking to others. The tuple is eliminated if at least one of its join keys, xi, is not a member of a component filter BFi of the extended filter.
[Figure: Extended intersection filter (E.I.F) with component filters BF(R1.x1), ..., BF(Rk.xk). A tuple t(x1, x2, x3, .., xk, .., xn) passes the E.I.F only if t(xi) ∈ BFi for all i = 1, .., k.]
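The membership rule of the extended intersection filter can be sketched as a collection of component filters, one per join key, where a tuple survives only if every one of its join keys passes its component. The class name `ExtendedIntersectionFilter`, the bitset representation and the sample relations are assumptions made for this illustration; each component here is the AND of the Bloom filters of the two relations sharing that key.

```python
import hashlib

M, K = 512, 3  # illustrative filter size and number of hash functions

def _indices(value):
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def bloom(values):
    bits = 0
    for value in values:
        for idx in _indices(value):
            bits |= 1 << idx
    return bits

def probe(bits, value):
    return all((bits >> idx) & 1 for idx in _indices(value))

class ExtendedIntersectionFilter:
    def __init__(self, components):
        # components: {position of the join key in the tuple: component bitset}
        self.components = dict(components)

    def accepts(self, tup):
        # Eliminate the tuple if any join key fails its component filter.
        return all(probe(bits, tup[pos]) for pos, bits in self.components.items())

# Chain R1(x1, x2) ⨝ R2(x2, x3) ⨝ R3(x3, x4): filter the tuples of R2.
R1 = [("a", "k1"), ("b", "k2")]
R2 = [("k1", "j1"), ("k2", "j9"), ("k7", "j1")]
R3 = [("j1", "z"), ("j2", "y")]

eif_r2 = ExtendedIntersectionFilter({
    0: bloom(x2 for _, x2 in R1) & bloom(x2 for x2, _ in R2),  # component on x2
    1: bloom(x3 for _, x3 in R2) & bloom(x3 for x3, _ in R3),  # component on x3
})
kept = [t for t in R2 if eif_r2.accepts(t)]
print(kept)
```

The tuple ("k1", "j1") matches on both keys and always survives; the other two tuples each have one key missing from a neighbouring relation and are dropped unless a false positive occurs.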
Advantage of I.F for important join cases
Chain Join: optimization of a chain join with extended intersection filters
[Figure: join tree. R1 and R2 are filtered with BF(R1.x2) and BF(R2.x2) and joined on x2 into R1,2; R1,2 and R3 are filtered with BF(R1,2.x3) and BF(R3.x3) and joined into R1,2,3; BF(R4.x4) applies at the next level; at the top, R1,2,..,n‐1 and Rn are joined on xn using BF(R1,2,..,n‐1.xn).]
NO redundant data in intermediate join results
Advantage of I.F for important join cases
Chain Join: optimization of a chain join with extended intersection filters
Three-way join reduces the number of intermediate join jobs
Advantage of I.F for important join cases
Star Join: optimization of a star join with extended intersection filters
[Figure: star join with R0 at the centre, joined to R1, R2, ..., Rn on the key pairs (x1, x'1), (x2, x'2), ..., (xn, x'n).]
E.I.F reduces the number of intermediate join jobs to zero, NO redundant data.
Cost Analysis for Two-way Join
Cost model
The total cost of the join operation:
\[ C = C_{pre} + C_{read} + C_{sort} + C_{tr} + C_{write} \]
where
\[ C_{read} = c_r |R| + c_r |S|, \qquad C_{write} = c_r |O|, \qquad C_{tr} = c_t |D| \]
\[ C_{sort} = c_l |D| \cdot 2\left(\lceil \log_B |D| - \log_B (mp_1 + mp_2) \rceil + \lceil \log_B (mp_1 + mp_2) \rceil\right) \quad [8] \]
\[ C_{pre} = C_{read} + 2\, c_t\, m\, t + c_t\, m\, r\, t + a \]
a = c_t · m · r · t for the first approach, otherwise a = 0
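The cost model can be turned into a small calculator. This is a sketch under stated assumptions: the slide does not define the symbols, so c_r, c_t, c_l, m, t, r, B, mp1 and mp2 are treated here as opaque numeric parameters; the square brackets in C_sort are read as ceilings, as in the MRShare cost model [8]; and the case without any Bloom filter is given a pre-processing cost of 0, following Theorem 7. All sample values are invented for illustration.

```python
import math

def sort_cost(c_l, d, B, mp1, mp2):
    """C_sort = c_l * |D| * 2 * (ceil(log_B|D| - log_B(mp1+mp2)) + ceil(log_B(mp1+mp2)))."""
    tasks = mp1 + mp2
    passes = (math.ceil(math.log(d, B) - math.log(tasks, B))
              + math.ceil(math.log(tasks, B)))
    return c_l * d * 2 * passes

def pre_cost(c_r, c_t, size_r, size_s, m, t, r, approach):
    """C_pre for approaches 1..4; approach 5 (no Bloom filter) costs 0."""
    if approach == 5:
        return 0
    c_read = c_r * size_r + c_r * size_s
    a = c_t * m * r * t if approach == 1 else 0  # extra term, first approach only
    return c_read + 2 * c_t * m * t + c_t * m * r * t + a

def total_cost(c_r, c_t, c_l, size_r, size_s, size_o, d, m, t, r, B, mp1, mp2, approach):
    c_read = c_r * size_r + c_r * size_s
    c_write = c_r * size_o
    c_tr = c_t * d
    return (pre_cost(c_r, c_t, size_r, size_s, m, t, r, approach)
            + c_read + sort_cost(c_l, d, B, mp1, mp2) + c_tr + c_write)

# Hypothetical parameter values, chosen only to exercise the formulas.
params = dict(c_r=2, c_t=3, c_l=1, size_r=1000, size_s=2000, size_o=100,
              d=6000, m=10, t=4, r=2, B=10, mp1=3, mp2=2)
print(total_cost(approach=1, **params), total_cost(approach=2, **params))
```

In practice |D| differs between approaches, since a better filter shrinks the intermediate data; the sketch keeps it as an input so that effect can be explored by varying `d`.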
Cost Analysis for Two-way Join
Cost comparison of approaches
The size of intermediate data |D| with the false intersection probability is given by equations (1) to (5) [equation bodies not preserved in this transcript], where
equation (1) is for the pair of filters (approach 1),
equation (2) for the unpartitioned intersection filter (approach 2),
equation (3) for the partitioned intersection filter (approach 3),
equation (4) for a filter BF(R), and
equation (5) for the case without a Bloom filter.
Cost Analysis for Two-way Join
THEOREM 5. The join operation using the intersection filter is more efficient than using a basic Bloom filter because it produces less redundant and intermediate data than the latter. Additionally, we can derive the following comparison for |D|:
\[ |D|_1 \le |D|_2 < |D|_3 < |D|_4 < |D|_5 \]
where |D|_i is the intermediate data size given by equation (i), i = 1..5.
THEOREM 6. The total cost of the join operation for our approaches satisfies
\[ C_1 \le C_2 < C_3 < C_4 < C_5 \]
where C_i is the total cost in the case of equation (i), i = 1..5.
THEOREM 7. The total cost to perform the pre-processing step is
\[ C_{pre} = \begin{cases} C_{read} + 2\, c_t\, m\, t + 2\, c_t\, m\, r\, t, & \text{in case of (1)} \\ C_{read} + 2\, c_t\, m\, t + c_t\, m\, r\, t, & \text{in case of (2), (3), (4)} \\ 0, & \text{in case of (5)} \end{cases} \]
Conclusion
Three approaches for building the intersection filter
Their efficiency in joins is better than that of other solutions
Their advantage for important join cases
Although the intersection filter has false positives and an extra cost for the pre-processing step, its efficiency in space-saving and filtering often outweighs these drawbacks.
The system becomes inefficient if t and r are large or if there is very little redundant data in the join operation.
Future work
Implementation of general multiway joins, especially a cascade of map-side joins.
Recursive joins.
A complete optimizer for choosing the best join implementation in MapReduce.
References
[1] Bloom, B.H. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM.
[2] Broder, A. and Mitzenmacher, M. 2004. Network Applications of Bloom Filters: A Survey. Internet Mathematics.
[3] Lee, T., Kim, K. and Kim, H.-J. 2012. Join processing using Bloom filter in MapReduce. Proceedings of the 2012 ACM Research in Applied Computation Symposium (New York).
[4] White, T. 2010. Hadoop: The Definitive Guide, 2nd Edition. O'Reilly.
[5] PUMA: Purdue MapReduce Benchmarks Suite: http://web.ics.purdue.edu/~fahmad/benchmarks.htm.
[6] Afrati, F.N. and Ullman, J.D. 2010. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology (EDBT '10).
[7] Michael, L., Nejdl, W., Papapetrou, O. and Siberski, W. 2007. Improving distributed join efficiency with extended bloom filter operations. 21st International Conference on Advanced Information Networking and Applications, 2007. AINA '07.
[8] Nykiel, T., Potamias, M., Mishra, C., Kollios, G. and Koudas, N. 2010. MRShare: sharing across multiple queries in MapReduce. Proc. VLDB Endow. 3, 1-2 (Sep. 2010), 494–505.