marimba: abelian - - tu k · pdf filemarimba: a mapreduce-based framework for incremental...
TRANSCRIPT
Marimba:A MapReduce-based Framework for
Incremental Recomputations
Johannes Schildgen ([email protected]) TU Kaiserslautern, Germany
Research Questions• How to avoid full recomputations?• Can a MapReduce job reuse its own result from a
former computation?
BackgroundMany MapReduce jobs for analyzing Big Data require many hours andhave to be repeated again and again because the base data changescontinuously. With simple MapReduce every job has to be executed fromscratch and cannot benefit from previous computations.
What We Can Learn From Materialized ViewsMapReduce jobs and materialized views in relational DBMS have a lotin common. A long running query is executed periodically to calculatea result. When this result can be computed by only analyzing changesthe in base data (deltas) as well as the result of the formercomputation, a materialized view is called self-maintainable. Ourapproach to make MapReduce jobs self-maintainable is based on twopopular concepts, Overwrite and Increment Installation:
Marimba• Simply write Map & Reduce functions,• as intermediate value type use an Abelian type,• the Marimba framework will only read inserted
and deleted data,• deleted values will be inverted,• the changes will be aggrega-
ted on the previous result.• Generic combiner
(simply aggregates values)• For Overwrite Installation, a
Deserialize method has to be defined to convert the former result into Abelian objects.
ExamplesWordCount• map(linenum, line):
∀word in line: emit(word, 1)• invert: ℤ → ℤ ≔ ⋅ −1 ℤ
• aggregate: ℤ × ℤ → ℤ ≔ +ℤ• identity element: 0ℤ
Friends-Of-Friends Other
Results
WordCount Reverse Web-Link Graph
on Big Data
Map
ΔMapΔ
+= Mapinvert
Aggregate Reduce
(dotted arrows: user-defined functions)
removed
edge
added
edge
Former
Result
• Reverse Web-Link Graph• N-Grams• PageRank• …M
arim
ba
Had
oop
map() reduce()
Abelianinvert()
aggregate()
neutral()
HBase
deserialize()
HDFS
Map (+ invert)
Inserted
i got rhythm
i got music
Deleted
rhythm of life
Former Result
got, 8
life, 1
melody 1
of, 5
rhythm 12
got, 10
i, 2
melody 1
music, 1
of, 4
rhythm 12
i, 1
got, 1
rhythm, 1
i, 1
got, 1,
music, 1
rhythm, -1
of, -1
life, -1
got, 8
life, 1
melody 1
of, 5
rhythm 12
00:00
00:10
00:20
00:30
00:40
00:50
01:00
01:10
0% 10% 20% 30% 40% 50% 60% 70%
Tim
e [
hh
:mm
]
Changed Data
FULL INCDEC OVERWRITE
00:00
00:20
00:40
01:00
01:20
01:40
02:00
02:20
0% 10% 20% 30% 40%
Tim
e [
hh
:mm
]
Changed Data
FULL INCDEC OVERWRITE