Programming in Hadoop
Guangda Hu, Huayang Guo
Hadoop Overview
• About Hadoop
  – Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.
Hadoop Overview
• Architecture
  – HDFS (Hadoop Distributed File System)
  – JobTracker
  – TaskTracker
Hadoop Overview
• Mechanism
  – Map and Reduce
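The Map and Reduce mechanism can be sketched in a few lines without a cluster. This is a minimal in-memory illustration in Python, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the grouping that Hadoop performs across nodes is simulated here with a dictionary.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop map reduce", "map and reduce"]))
```

On a real cluster the map calls run in parallel on different nodes and only the grouped pairs travel over the network, which is why the two steps are kept as separate functions.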
Hadoop Overview
• Applications
  – Facebook (Hadoop, Hive, Scribe)
  – Yahoo! (Hadoop in Yahoo Search)
  – Veritas (San Point Direct, Veritas File System)
  – IBM Transarc (Andrew File System)
  – UW Computer Science Alumni (Condor Project)
Our Work
• Set up the running environment
  – Single node setup
  – Multi-node cluster setup
  – Network access
• Experiments and analysis
  – Word count
  – Integration
  – Largest number
Environment Setup
• Hardware
  – Two multi-core machines running Linux
  – Ethernet connection
• Software
  – Ubuntu 9.04
  – Hadoop 0.20.1
  – Five virtual machines on VirtualBox
Environment Setup
• Cluster structure
  – Two machines
    • 166.111.69.85
    • 59.66.132.161
  – One master node
  – Three slave nodes
Experiments
• Benchmark
  – Word count (default example)
  – Super word count (SuperWordCount.java)
  – Integration (Integration.java)
  – Largest numbers (LargestGen.java)
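The slides do not show the source of Integration.java or LargestGen.java, so the following is only a guess at how such benchmarks could be phrased as map and reduce steps. It is a minimal Python sketch under stated assumptions: the midpoint rule, the way the interval is split, and all function names are hypothetical and are not taken from the authors' code.

```python
def split_interval(a, b, parts):
    """Cut [a, b] into equal subintervals, one per (simulated) map task."""
    width = (b - a) / parts
    return [(a + i * width, a + (i + 1) * width) for i in range(parts)]


def map_integrate(f, lo, hi, steps):
    """Map step: midpoint-rule estimate of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h


def integrate(f, a, b, parts=4, steps=1000):
    """Reduce step: add up the partial integrals from all map tasks."""
    partials = [map_integrate(f, lo, hi, steps)
                for lo, hi in split_interval(a, b, parts)]
    return sum(partials)


def largest(chunks):
    """Map: local maximum of each non-empty chunk. Reduce: maximum of those."""
    local_maxima = [max(chunk) for chunk in chunks if chunk]
    return max(local_maxima)


if __name__ == "__main__":
    print(integrate(lambda x: x * x, 0.0, 3.0))
    print(largest([[3, 9, 2], [], [7, 11]]))
```

Both problems fit the model because the combining operation (addition, maximum) is associative, so partial results can be computed on any node and merged in any order.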
Benchmark Analysis
[Figure: Time versus Computation, four nodes]
Benchmark Analysis
[Figure: Time versus Computation, two nodes]
Benchmark Analysis
• More experiments
Files   Computation   Time (s)
24      2.4 × 10^9    102
120     2.4 × 10^9    179
Nodes   Files   Slope (s per 10^9)
4       24      ≈ 30
2       24      ≈ 40
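The numbers in the two tables above support a quick back-of-the-envelope reading. The Python sketch below only restates the arithmetic. Treating the extra time as linear in the number of files is an assumption made here from two data points, not a result reported in the slides.

```python
# Figures copied from the two tables above.
slope_4_nodes = 30.0    # seconds per 10^9 units of computation, 4 nodes, 24 files
slope_2_nodes = 40.0    # seconds per 10^9 units of computation, 2 nodes, 24 files
time_24_files = 102.0   # seconds for 2.4 * 10^9 units of computation, 24 files
time_120_files = 179.0  # seconds for the same computation, 120 files

# Doubling the node count reduces the marginal cost by this factor.
speedup = slope_2_nodes / slope_4_nodes

# Assumed-linear cost of each additional input file at fixed computation.
extra_per_file = (time_120_files - time_24_files) / (120 - 24)

if __name__ == "__main__":
    print(f"speedup from 2 to 4 nodes: {speedup:.2f}x")
    print(f"extra time per additional file: {extra_per_file:.2f} s")
```

Going from two to four nodes gives roughly a 1.33x improvement in slope rather than 2x, and splitting the same work across five times as many files adds about 0.8 s per extra file.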
Challenges & Lessons Learned
• Network & virtual cluster communication
• Hadoop technique survey
• Cooperation
Thanks