MapReduce over Lustre
Description: integrating Hadoop with Lustre

MapReduce over Lustre report
David Luan, Simon Huang, GaoShengGong
2008.10 ~ 2009.6
Outline
• Early research, analysis
• Platform design & improvement
• Test cases, test process design
• Result analysis
• Related work (GFS-like redundancy)
• White paper & conclusion
Early research, analysis
• HDFS and Lustre overall benchmark tests
  – IOzone
  – IOR
  – WebDAV (an indirect way to mount HDFS) ★
• Hadoop platform overview
  – MapReduce
  – Three kinds of Hadoop I/O
  – Shortcomings & bottlenecks
• Lustre platform
  – Module analysis
  – Shortcomings
Early research, analysis
Overall Benchmark tests
Early research, analysis
[Figure: MapReduce flow. The input is split into Key-Value pairs; Map is called for each K-V pair, and each Map produces a new set of K-V pairs. The intermediate pairs are sorted by key, then Reduce(K, V[…]) is called once for each distinct key and produces one K-V pair per distinct key. The output is a set of Key-Value pairs.]
MapReduce Flow
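To make the flow concrete, here is a small, self-contained Java illustration of the same pipeline (split into K-V pairs, map, sort/group by key, reduce once per distinct key). It is only a data-flow sketch, not Hadoop code; word counting is used because it is the simplest example.

```java
// Self-contained illustration of the MapReduce data flow (not Hadoop code):
// split the input into K-V pairs, call map on each, sort/group by key,
// then call reduce once per distinct key.
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

public class MapReduceFlow {
    // map: one input record -> a new set of K-V pairs (here: word -> 1)
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: one distinct key and all its values -> one K-V pair (here: word -> count)
    static Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return new SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        String[] input = {"a b a", "b c"};                     // the "input splits"
        Map<String, List<Integer>> grouped = new TreeMap<>();  // sort phase: group by key
        for (String line : input) {
            for (Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        for (Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(reduce(e.getKey(), e.getValue()));  // a=2, b=2, c=1
        }
    }
}
```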
Early research, analysis
Hadoop I/O phases
[Figure: Hadoop + HDFS I/O phases. Submit the job, then split it and distribute the tasks: MapTask {1, 2, ... n}, ReduceTask {1, ... m}. Each MapTask i on Node i reads InputSplit i from HDFS (HDFS read) and writes intermediate results to the local Linux FS (local R/W); ReduceTasks a..m on Nodes a..m shuffle the map results over HTTP and write their results back to HDFS (HDFS write). Phases: Map Read, Local Read/Write, HTTP shuffle, Reduce Write.]
Early research, analysis
• Hadoop + HDFS
  – Job/Task-level parallelism
  – Compute and storage tightly coupled
  – HDFS prefers huge files
  – Applications are limited (jobs can be hard to split)
  – Typical apps: distributed grep, distributed sort, log processing, data warehousing
• Lustre
  – I/O-level parallelism
  – Compute and storage loosely coupled
  – POSIX-compatible applications
  – Typical apps: supercomputers
Platform comparison
Early research, analysis
• HDFS shortcomings
  – Metadata design
  – No parallel I/O
  – Not general-purpose (designed for MapReduce)
• Lustre shortcomings
  – Inadequate reliability
  – Inadequate stability
  – No native redundancy
Shortcomings comparison
Outline
• Early research, analysis
• Platform design & improvement
• Test cases, test process design
• Result analysis
• Related work (GFS-like redundancy)
• White paper & conclusion
Platform design & improvement
Two ways:
① Java wrapper for liblustre (without the Lustre client)
  Motivation: design a way to merge the two systems. Implement Hadoop's FileSystem interface with a Java wrapper, so MapReduce can work without the Lustre client. (A rough skeleton of this idea follows below.)
  Result: hit an impasse.
② Use the Lustre client
  Design & improvement.
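For illustration only, a skeleton of approach ① might look like the sketch below: a Hadoop FileSystem subclass whose operations are bound to liblustre through JNI. The class name, native-method names and loaded library name are hypothetical, only two FileSystem methods are shown, and the JNI layer is exactly where the symbol-clash problem described on the next slide appears.

```java
// Hypothetical sketch of approach ①: a Hadoop FileSystem backed by liblustre via JNI.
// Class and native-method names are illustrative, not the actual implementation.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public abstract class LustreFileSystem extends FileSystem {
    static {
        // liblustre would have to be loaded as a shared library here;
        // this is the step where JNI symbol resolution caused trouble.
        System.loadLibrary("lustre");
    }

    // Hypothetical JNI bindings onto liblustre's file operations.
    private native long lustreOpen(String path, int flags);
    private native int  lustreRead(long fd, byte[] buf, int off, int len);
    private native void lustreClose(long fd);

    private URI uri;

    @Override
    public URI getUri() {
        return uri;
    }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        // Would wrap lustreOpen/lustreRead in a seekable stream; create(),
        // rename(), delete(), listStatus(), mkdirs(), etc. would be bound
        // to liblustre the same way.
        throw new UnsupportedOperationException("sketch only");
    }
}
```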
Platform design & improvement
Java wrapper: hit an impasse -_-
• JNI calls into liblustre.so fail: Java's JNI mis-links functions whose names collide with system calls (such as mount, read, write, etc.). If we instead call the static library (liblustre.a) from C and compile it into an executable program, it works fine.
• liblustre's other problems:
  – The Lustre wiki does not recommend using liblustre.
  – When it is used, liblustre.a should be used instead of liblustre.so.
  – liblustre depends on the gcc version.
Platform design & improvement
Advantages for each task (with Lustre):
  – Decentralized I/O
  – Lustre can write in parallel
  – Lustre is a general-purpose file system
  – Great for non-splittable jobs
Platform design (1): advantages
Platform design & improvement
Platform design (2) modules
[Figure: module layout. The Master node runs the JobTracker / application launcher on a Lustre client; jobs are submitted to it, and it distributes tasks and heartbeats to the Slave nodes. Each Slave runs a TaskTracker on a Lustre client. All Lustre clients talk to the MGS/MDS and to the OSS servers, each of which serves several OSTs backed by disks.]
Platform design & improvement
[Figure: Hadoop + Lustre read/write flow. Submit the job, then split it and distribute the tasks: MapTask {1, 2, ... n}, ReduceTask {1, ... m}. MapTasks 1..n run on Nodes 1..n and ReduceTasks a..m on Nodes a..m; every node has a Lustre client, and all reads and writes go directly through Lustre (Lustre R/W) rather than through HDFS plus local disks.]
Platform design (3) read/write
Platform design & improvement
• Use hardlinks instead of the HTTP shuffle before a ReduceTask starts [1]
  – Decentralizes network bandwidth usage
  – Delays the ReduceTask's actual read/write
  (see the hardlink sketch after this slide)
• Use Lustre block (stripe) location info to distribute tasks [2]
  – "Move the compute to its data"
  – Saves network bandwidth
  – A Java child thread runs a shell command to fetch the location info (details in the white paper; see the sketch after the next slide)
Platform improvement 1
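A minimal sketch of the hardlink idea, assuming a Lustre mount shared by all nodes: instead of pulling a map output over HTTP, the reducer hard-links it into its own directory and only touches the data when the reduce phase actually reads it. Paths and names are illustrative, and the 2009-era code would not have used java.nio.file (Java 7+), but the idea is the same.

```java
// Illustrative sketch only: link a map output on a shared Lustre mount into the
// reducer's working directory instead of fetching it over HTTP.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HardlinkShuffle {
    /**
     * "Shuffle" one map output by creating a hardlink in the reducer's
     * directory. The reducer only touches the data when it actually reads
     * the linked file, so the real I/O is delayed until reduce time.
     */
    public static Path linkMapOutput(Path mapOutput, Path reduceDir) throws IOException {
        Files.createDirectories(reduceDir);
        Path link = reduceDir.resolve(mapOutput.getFileName());
        // Both paths must be on the same Lustre file system for a hardlink to work.
        return Files.createLink(link, mapOutput);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical paths on a shared Lustre mount.
        Path mapOut = Paths.get("/mnt/lustre/job_0001/map_0003/part-0");
        Path reduceDir = Paths.get("/mnt/lustre/job_0001/reduce_0000/inputs");
        System.out.println("linked: " + linkMapOutput(mapOut, reduceDir));
    }
}
```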
Platform design & improvement
Platform improvement 2
[Figure: the same module layout as in "Platform design (2) modules", annotated with the two improvements: add the location info as a scheduling parameter, and use hardlinks to delay the shuffle phase.]
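A rough sketch of the "Java child thread runs a shell to fetch the location info" step: launch lfs getstripe in a child process and collect its output, which lists stripe parameters and the OST objects (obdidx) a file is striped over. The exact command-line options and parsing used in the white paper may differ.

```java
// Illustrative sketch: ask Lustre which OSTs hold a file's stripes by running
// `lfs getstripe` in a child process and collecting its output. The real
// implementation described in the white paper may parse different fields.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class LustreLocationInfo {
    /** Returns the raw `lfs getstripe` output lines for a file on Lustre. */
    public static List<String> getStripeInfo(String lustrePath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("lfs", "getstripe", lustrePath);
        pb.redirectErrorStream(true);
        Process p = pb.start();

        List<String> lines = new ArrayList<String>();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        for (String line; (line = r.readLine()) != null; ) {
            lines.add(line); // stripe count/size, then one obdidx (OST index) per object
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input split on a Lustre mount.
        for (String line : getStripeInfo("/mnt/lustre/input/part-0")) {
            System.out.println(line);
        }
    }
}
```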
Outline
• Early research, analysis
• Platform design & improvement
• Test cases, test process design
• Result analysis
• Related work
• White paper & conclusion
Test case design (two kinds of applications)
① Statistics applications (search, log processing, etc.)
  – Fine-grained tasks (many small tasks per job)
  – MapTask intermediate result is small
② Poorly splittable, highly complex applications
  – Coarse-grained tasks (few large tasks per job)
  – MapTask intermediate result is big
  – Each task is compute-intensive
  – Each task needs big I/O
Test cases, test process design
Highly complex, poorly splittable applications:
  – intermediate result is big
  – each task is compute-intensive
Test cases, test process design
Test cases:
• Statistics application: WordCount. This test reads text files and counts each word; the output contains a word and its count, separated by a tab. (A sketch of the classic mapper/reducer follows this list.)
• Poorly splittable application: BigMapOutput. A map/reduce program that works on a very big, non-splittable file; its map and reduce tasks just read the input and output the same data unchanged.
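For reference, WordCount here is essentially the stock Hadoop example; against the old org.apache.hadoop.mapred API of that era it looks roughly like this (job/driver setup omitted).

```java
// Minimal WordCount mapper and reducer, roughly the stock Hadoop example
// against the old org.apache.hadoop.mapred API of that era; job setup omitted.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);   // emit (word, 1) for every token
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // sum the 1s emitted for this word
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
```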
Test cases, test process design
Test results:
• Overall execution time
• Time of each phase
  – Map Read phase (the most time-consuming one for Lustre)
  – Local read/write and HTTP (shuffle) phase
  – Reduce write phase
Test cases, test process design
Test scenarios:
• No optimization
• Use hardlink
• Use hardlink and location info
• Lustre tuning
  – Stripe size = ?
  – Stripe count = ?
Outline
• Early research, analysis
• Platform design & improvement
• Test cases, test process design
• Result analysis
• Related work (GFS-like redundancy)
• White paper & conclusion
Result analysis
• Result analysis
• Conclusion
Result analysis
• Test 1: WordCount with one big file
  – Processes one big text file (6 GB)
  – Block size = 32 MB
  – Reduce tasks = 0.95 (or 1.75) * 2 * 7 = 13 (the standard Hadoop heuristic: 0.95 or 1.75 times the total number of reduce slots in the cluster, here 2 * 7)
Result analysis
• Test 2: WordCount with many small files
  – Processes a large number of small files (10,000)
  – Reduce tasks = 0.95 * 2 * 7 = 13
Result analysis
• Test 3: BigMapOutput with one big file
  – Result 1
  – Result 2 (fresh memory)
  – Result 3 (mapred.local.dir set to its default value)
Result analysis
• Test 4: BigMapOutput with hardlink
• Test 5: BigMapOutput with hardlink & location information
Result analysis
• Test 6: BigMapOutput, Map Read phase
• Conclusion: Map Read is the most time-consuming part ★
Result analysis
Conclusion 1: Hadoop + HDFS
[Figure: the Hadoop + HDFS I/O flow from the "Hadoop I/O phases" slide, repeated. Phases: Map Read (HDFS read), Local Read/Write of intermediate data, HTTP shuffle, Reduce Write (HDFS write).]
Result analysis
Conclusion 2: Hadoop + Lustre
[Figure: the Hadoop + Lustre read/write flow from "Platform design (3)", repeated: every MapTask and ReduceTask reads and writes through a Lustre client.]
HDFS block location info fits Hadoop's task-distribution algorithm better than Lustre stripe info does; this is what makes Map Read the most time-consuming part.
Result analysis
• Dig into the logs (per-task execution time of the map-read phase)
Outline
• Early research, analysis
• Platform design & improvement
• Test cases, test process design
• Result analysis
• Related work (GFS-like redundancy)
• White paper & conclusion
Related work
GFS-like redundancy design
• Motivation:
  – Lustre has no native redundancy
  – RAID is expensive
  – A potential new feature for Lustre
• Code analysis & HLD (high-level design)
• Design challenges
Related work
Lustre's inode: Inode(*, *, *, …, {obj1, obj2, obj3, …}), i.e. the ordinary inode fields plus the array of OST objects the file is striped over.
Related work
Raw HLD thinking 1
• Modified inode structure: make the inode contain 3 object arrays
  Inode(*, *, *, …, {obj11, obj12, obj13, …}, {obj21, obj22, obj23, …}, {obj31, obj32, obj33, …})
• File read: the client reads the first group; if the first is damaged, it reads the second, and so on (see the sketch after this slide).
• File write: the client writes the three object arrays one by one.
• File consistency: all done by the client.
• Streaming replication (-_-): Client -> {OST, …} -> {OST, …} -> {OST, …}, like a write chain; some work shifts to the OSTs.
• Automatic recovery (-_-): if one group of a file's objects is damaged, the system automatically recovers it from another backup copy.
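Purely as an illustration of the read-fallback bullet above (Lustre itself is C code in the client/kernel, so this is not a proposed implementation), the client-side logic can be sketched as: try the object groups in order and move to the next group when one is damaged. All types and names here are hypothetical.

```java
// Illustrative sketch only of the read-fallback idea: try each replica group
// in order and fall back to the next one if the current group is damaged.
// This is not Lustre code; all types and names here are hypothetical.
import java.io.IOException;
import java.util.List;

public class ReplicaGroupReader {
    /** Hypothetical handle for one group of objects ({obj_i1, obj_i2, ...}). */
    public interface ObjectGroup {
        byte[] read(long offset, int length) throws IOException;
    }

    private final List<ObjectGroup> groups; // e.g. 3 groups, as in the HLD sketch

    public ReplicaGroupReader(List<ObjectGroup> groups) {
        this.groups = groups;
    }

    /** Read from the first healthy group; fall back on damage (IOException). */
    public byte[] read(long offset, int length) throws IOException {
        IOException last = null;
        for (ObjectGroup group : groups) {
            try {
                return group.read(offset, length);
            } catch (IOException damaged) {
                last = damaged; // this group is damaged, try the next one
            }
        }
        throw new IOException("all replica groups damaged", last);
    }
}
```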
Related work
Raw HLD thinking 2: challenges
• No rack information (where to place the redundant copies?)
• An OST cannot write to another OST (how to build the streaming redundancy chain?)
• File consistency
• Lustre is changing fast (pools, etc.)
• The internship is time-limited
Outline
• Early research, analysis
• Platform design & improvement
• Test cases, test process design
• Result analysis
• Related work
• White paper & conclusion
White paper & conclusion
• White paper (hadoop_lustre_wp_v0.4.2.pdf)
• Thanks
White paper & conclusion
Thanks a lot to our mentors and manager
• Mentors: WangDi, HuangHua
• Manager: Nirant Puntambekar