Programming in Hadoop
Guangda HU (tarlou.gd@gmail.com), Huayang GUO (dragonghy@gmail.com)

Upload: grant-moody

Post on 17-Jan-2016


Page 2: Programming in Hadoop Guangda HU tarlou.gd@gmail.com Huayang GUO dragonghy@gmail.com

Hadoop Overview

• About Hadoop
  – Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.

Hadoop Overview

• Architecture
  – HDFS (Hadoop Distributed File System)
  – Job Tracker
  – Task Tracker

Hadoop Overview

• Mechanism
  – Map and Reduce
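To make the map and reduce steps concrete, here is a minimal in-memory sketch of word count in plain Java. It is an illustration under stated assumptions, not the code used in the talk: the class name WordCountSketch and its methods are hypothetical, and the Hadoop runtime (input splits, shuffle over the network, HDFS) is replaced by ordinary collections.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of map, shuffle, and reduce (no Hadoop dependency). */
final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: sum all counts that were emitted for one word. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Driver: map every line, group values by key (the shuffle), then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two sample input lines stand in for the records of an input split.
        System.out.println(run(Arrays.asList("hello hadoop", "hello map reduce")));
    }
}
```

In Hadoop itself the two phases are written as Mapper and Reducer classes, and the framework performs the grouping between them across the cluster.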

Hadoop Overview

• Applications
  – Facebook (Hadoop, Hive, Scribe)
  – Yahoo! (Hadoop in Yahoo Search)
  – Veritas (San Point Direct, Veritas File System)
  – IBM Transarc (Andrew File System)
  – UW Computer Science Alumni (Condor Project)

Our Work

• Set up the running environment
  – Single-node setup
  – Multi-node cluster setup
  – Network access

• Experiments and analysis
  – Word count
  – Integration
  – Largest number

Environment Setup

• Hardware
  – Two multi-core machines running Linux
  – Ethernet connection

• Software
  – Ubuntu 9.04
  – Hadoop 0.20.1
  – Five virtual machines on VirtualBox

Environment Setup

• Cluster structure
  – Two machines
    • 166.111.69.85
    • 59.66.132.161
  – One master node
  – Three slave nodes

Experiments

• Benchmark
  – Word count (default example)
  – Super word count (SuperWordCount.java)
  – Integration (Integration.java)
  – Largest numbers (LargestGen.java)
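The slides do not show the contents of Integration.java or LargestGen.java, so the following is only a sketch of how such numeric benchmarks could be split into map and reduce steps. Everything here is an assumption made for illustration: the class name BenchmarkSketch, the midpoint rule, and the method names are hypothetical, and the work runs in one process rather than on a cluster.

```java
import java.util.List;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical in-memory sketch of two numeric benchmarks as map and reduce steps. */
final class BenchmarkSketch {

    /** Map step for integration: midpoint-rule sum of f over one sub-interval [lo, hi]. */
    static double integrateSlice(DoubleUnaryOperator f, double lo, double hi, int steps) {
        double width = (hi - lo) / steps;
        double sum = 0.0;
        for (int i = 0; i < steps; i++) {
            double x = lo + (i + 0.5) * width;  // midpoint of the i-th step
            sum += f.applyAsDouble(x) * width;
        }
        return sum;
    }

    /** Splits [a, b] into one slice per map task and adds the partial sums (the reduce step). */
    static double integrate(DoubleUnaryOperator f, double a, double b,
                            int tasks, int stepsPerTask) {
        double sliceWidth = (b - a) / tasks;
        double total = 0.0;
        for (int t = 0; t < tasks; t++) {
            double lo = a + t * sliceWidth;
            total += integrateSlice(f, lo, lo + sliceWidth, stepsPerTask);
        }
        return total;
    }

    /** Map step for largest number: local maximum of one input split. */
    static long localMax(long[] split) {
        long best = Long.MIN_VALUE;
        for (long v : split) {
            best = Math.max(best, v);
        }
        return best;
    }

    /** Reduce step: the global maximum is the maximum of the local maxima. */
    static long largest(List<long[]> splits) {
        long best = Long.MIN_VALUE;
        for (long[] split : splits) {
            best = Math.max(best, localMax(split));
        }
        return best;
    }
}
```

Both problems suit this model because the reduce operation (addition, maximum) is associative, so partial results from each map task can be combined in any order.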

Benchmark Analysis

[Figure: Four nodes. x-axis: Computation (ticks from 0 to 1.5 × 10^10); y-axis: Time (ticks from 0 to 300).]

Benchmark Analysis

[Figure: Two nodes. x-axis: Computation (ticks from 0 to 1.5 × 10^10); y-axis: Time (ticks from 0 to 300).]

Benchmark Analysis

• More experiments

Files    Computation    Time (s)
24       2.4 × 10^9     102
120      2.4 × 10^9     179

Nodes    Files    Slope (s per 10^9)
4        24       ≈ 30
2        24       ≈ 40
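A rough reading of these numbers, assuming time grows linearly with computation (as the slope column suggests) and ignoring any fixed startup cost. Here C denotes the amount of computation and T_n the time on n nodes; both symbols are introduced only for this sketch.

```latex
\begin{align*}
T_4(C) &\approx 30 \cdot \frac{C}{10^{9}}\ \text{s}, \qquad
T_2(C) \approx 40 \cdot \frac{C}{10^{9}}\ \text{s} \\
\text{speedup}_{2 \to 4} &\approx \frac{40}{30} \approx 1.33
  \qquad (\text{ideal: } 4/2 = 2) \\
\text{extra time per additional file} &\approx \frac{179 - 102}{120 - 24}
  = \frac{77}{96} \approx 0.80\ \text{s}
\end{align*}
```

On this reading, doubling the node count gives about two thirds of the ideal speedup, and splitting the same computation across more input files adds a noticeable per-file cost.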

Challenges & Lessons Learned

• Network & virtual cluster communication
• Hadoop technique survey
• Cooperation


Thanks