marimba: abelian - - tu k · pdf filemarimba: a mapreduce-based framework for incremental...

1
Marimba: A MapReduce - based Framework for Incremental Recomputations Johannes Schildgen ([email protected] - kl.de) TU Kaiserslautern, Germany Research Questions How to avoid full recomputations? Can a MapReduce job reuse its own result from a former computation? Background Many MapReduce jobs for analyzing Big Data require many hours and have to be repeated again and again because the base data changes continuously. With simple MapReduce every job has to be executed from scratch and cannot benefit from previous computations. What We Can Learn From Materialized Views MapReduce jobs and materialized views in relational DBMS have a lot in common. A long running query is executed periodically to calculate a result. When this result can be computed by only analyzing changes the in base data (deltas) as well as the result of the former computation, a materialized view is called self-maintainable. Our approach to make MapReduce jobs self-maintainable is based on two popular concepts, Overwrite and Increment Installation: Marimba Simply write Map & Reduce functions, as intermediate value type use an Abelian type, the Marimba framework will only read inserted and deleted data, deleted values will be inverted, the changes will be aggrega- ted on the previous result. Generic combiner (simply aggregates values) For Overwrite Installation, a Deserialize method has to be defined to convert the former result into Abelian objects. Examples WordCount map(linenum, line): word in line: emit(word, 1) invert: ℤ→ℤ≔⋅ −1 aggregate: ℤ×ℤ→ℤ ≔ + identity element: 0 Friends-Of-Friends Other Results WordCount Reverse Web-Link Graph on Big Data Map Δ Map Δ += Map invert Aggregate Reduce (dotted arrows: user-defined functions) removed edge added edge Former Result Reverse Web-Link Graph N-Grams PageRank Marimba Hadoop map() reduce() Abelian invert() aggregate() neutral() HBase deserialize() HDFS Map (+ invert) Inserted i got rhythm i got music Deleted rhythm of life Former Result got, 8 life, 1 melody 1 of, 5 rhythm 12 got, 10 i, 2 melody 1 music, 1 of, 4 rhythm 12 i, 1 got, 1 rhythm, 1 i, 1 got, 1, music, 1 rhythm, -1 of, -1 life, -1 got, 8 life, 1 melody 1 of, 5 rhythm 12 00:00 00:10 00:20 00:30 00:40 00:50 01:00 01:10 0% 10% 20% 30% 40% 50% 60% 70% Time [hh:mm] Changed Data FULL INCDEC OVERWRITE 00:00 00:20 00:40 01:00 01:20 01:40 02:00 02:20 0% 10% 20% 30% 40% Time [hh:mm] Changed Data FULL INCDEC OVERWRITE

Upload: ledang

Post on 10-Mar-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Marimba: Abelian - - TU K · PDF fileMarimba: A MapReduce-based Framework for Incremental Recomputations Johannes Schildgen (schildgen@cs.uni-kl.de) TU Kaiserslautern, Germany Research

Marimba:A MapReduce-based Framework for

Incremental Recomputations

Johannes Schildgen ([email protected]) TU Kaiserslautern, Germany

Research Questions• How to avoid full recomputations?• Can a MapReduce job reuse its own result from a

former computation?

BackgroundMany MapReduce jobs for analyzing Big Data require many hours andhave to be repeated again and again because the base data changescontinuously. With simple MapReduce every job has to be executed fromscratch and cannot benefit from previous computations.

What We Can Learn From Materialized ViewsMapReduce jobs and materialized views in relational DBMS have a lotin common. A long running query is executed periodically to calculatea result. When this result can be computed by only analyzing changesthe in base data (deltas) as well as the result of the formercomputation, a materialized view is called self-maintainable. Ourapproach to make MapReduce jobs self-maintainable is based on twopopular concepts, Overwrite and Increment Installation:

Marimba• Simply write Map & Reduce functions,• as intermediate value type use an Abelian type,• the Marimba framework will only read inserted

and deleted data,• deleted values will be inverted,• the changes will be aggrega-

ted on the previous result.• Generic combiner

(simply aggregates values)• For Overwrite Installation, a

Deserialize method has to be defined to convert the former result into Abelian objects.

ExamplesWordCount• map(linenum, line):

∀word in line: emit(word, 1)• invert: ℤ → ℤ ≔ ⋅ −1 ℤ

• aggregate: ℤ × ℤ → ℤ ≔ +ℤ• identity element: 0ℤ

Friends-Of-Friends Other

Results

WordCount Reverse Web-Link Graph

on Big Data

Map

ΔMapΔ

+= Mapinvert

Aggregate Reduce

(dotted arrows: user-defined functions)

removed

edge

added

edge

Former

Result

• Reverse Web-Link Graph• N-Grams• PageRank• …M

arim

ba

Had

oop

map() reduce()

Abelianinvert()

aggregate()

neutral()

HBase

deserialize()

HDFS

Map (+ invert)

Inserted

i got rhythm

i got music

Deleted

rhythm of life

Former Result

got, 8

life, 1

melody 1

of, 5

rhythm 12

got, 10

i, 2

melody 1

music, 1

of, 4

rhythm 12

i, 1

got, 1

rhythm, 1

i, 1

got, 1,

music, 1

rhythm, -1

of, -1

life, -1

got, 8

life, 1

melody 1

of, 5

rhythm 12

00:00

00:10

00:20

00:30

00:40

00:50

01:00

01:10

0% 10% 20% 30% 40% 50% 60% 70%

Tim

e [

hh

:mm

]

Changed Data

FULL INCDEC OVERWRITE

00:00

00:20

00:40

01:00

01:20

01:40

02:00

02:20

0% 10% 20% 30% 40%

Tim

e [

hh

:mm

]

Changed Data

FULL INCDEC OVERWRITE