Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
Introduction to HDFS and MapReduce
Thursday, January 10, 13
Who Am I
- Ryan Tabora
- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
Confidential Think Big Analytics
• One of Silicon Valley’s fastest-growing Big Data start-ups
• 100% focus on Big Data consulting & Data Science solution services
• Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture
  - C-bridge Internet Solutions (CBIS) founder 1996 & executives, IPO 1999
• Clients: 40+
• North America locations
  - US East: Boston, New York, Washington D.C.
  - US Central: Chicago, Austin
  - US West: HQ Mountain View, San Diego, Salt Lake City
• EMEA & APAC
Think Big is the leading professional services firm that’s purpose-built for Big Data.
Think Big Recognized as a Top Pure-Play Big Data Vendor
Source: Forbes February 2012
Agenda
- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions
Big Data
A Data Shift...
Source: EMC Digital Universe Study*
Motivation
“Simple algorithms and lots of data trump complex models.”
Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
Pioneers
• Google and Yahoo:
- Index 850+ million websites, over one trillion URLs.
• Facebook ad targeting:
- 840+ million users, > 50% of whom are active daily.
Hadoop Ecosystem
Common Tool?
• Hadoop
- Cluster: distributed computing platform.
- Commodity*, server-class hardware.
- Extensible Platform.
Hadoop Origins
• MapReduce and Google File System (GFS) pioneered at Google.
• Hadoop is the commercially-supported open-source equivalent.
What Is Hadoop?
• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• The term Hadoop generally refers to a toolset, not a single tool.
Why Hadoop?
• Handles unstructured to semi-structured to structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.
The Hadoop Ecosystem
• HDFS - Hadoop Distributed File System.
• MapReduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
• Pig - A top-down scripting language to manipulate data.
• HBase - A NoSQL, non-sequential data store.
HDFS
What Is HDFS?
• Hadoop Distributed File System.
• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/Slave architecture.
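To make the block and replication bullets concrete, here is a minimal sketch in plain Java. It is a toy model, not HDFS code: the class and method names are made up for illustration, the 64 MB block size and replication factor of 3 are assumed values, and the round-robin placement is a deliberate simplification rather than the placement policy HDFS actually uses.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPlacementSketch {

    /** Number of fixed-size blocks needed to hold a file (ceiling division). */
    public static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /**
     * Toy round-robin placement: replica r of block b goes to node ((b + r) mod nodeCount) + 1.
     * Each inner list holds the 1-based node numbers that store one block.
     */
    public static List<List<Integer>> placeReplicas(long blocks, int replication, int nodeCount) {
        if (replication > nodeCount) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((int) ((b + r) % nodeCount) + 1);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values for illustration: 150 MB file, 64 MB blocks, 3 replicas, 6 nodes.
        long blocks = blockCount(150 * mb, 64 * mb);
        System.out.println("blocks: " + blocks);
        System.out.println("placement: " + placeReplicas(blocks, 3, 6));
    }
}
```

Because every block lives on several nodes, losing one node still leaves other copies of each block it held, which is the durability point on the slide.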
HDFS Traits
• Not fully POSIX compliant.
• No file updates.
• Write once, read many times.
• Large blocks, sequential read patterns.
• Designed for batch processing.
HDFS Master
• NameNode
- Runs on a single node as a master process
‣ Holds file metadata (which blocks are where)
‣ Directs client access to files in HDFS
• SecondaryNameNode
- Not a hot failover
- Maintains a copy of the NameNode metadata
HDFS Slaves
• DataNode
- Generally runs on all nodes in the cluster
‣ Block creation/replication/deletion/reads
‣ Takes orders from the NameNode
HDFS Illustrated
[Figure, animated over several slides: a client puts a file ("Put File") into a cluster with one NameNode and six DataNodes (DataNode 1 through DataNode 6).]
Power of Hadoop
[Figure, animated over several slides: a client reads a file ("Read File") from the cluster of one NameNode and several DataNodes. DataNode 1 is absent from the later frames.]
Read throughput = Transfer Rate x Number of Machines*
100 MB/s x 3 = 300 MB/s
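The arithmetic on this slide can be checked with a short sketch. It assumes an idealized model in which every machine streams its own blocks at the full rate with no network or disk contention; the 3000 MB file size and all names are hypothetical, chosen only to show the effect on read time.

```java
public class ParallelReadSketch {

    /** Aggregate throughput when each machine reads its own blocks in parallel. */
    public static double aggregateThroughputMBps(double perMachineMBps, int machines) {
        return perMachineMBps * machines;
    }

    /** Seconds needed to read a file of the given size at the given throughput. */
    public static double readTimeSeconds(double fileSizeMB, double throughputMBps) {
        return fileSizeMB / throughputMBps;
    }

    public static void main(String[] args) {
        double perMachine = 100.0;                           // MB/s, from the slide
        double aggregate = aggregateThroughputMBps(perMachine, 3);
        double fileSizeMB = 3000.0;                          // hypothetical file size

        System.out.println("aggregate MB/s: " + aggregate);
        System.out.println("one machine, seconds: " + readTimeSeconds(fileSizeMB, perMachine));
        System.out.println("three machines, seconds: " + readTimeSeconds(fileSizeMB, aggregate));
    }
}
```

Under these assumptions throughput grows with the number of machines, so read time shrinks by the same factor.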
HDFS Shell
• Easy to use command line interface.
• Create, copy, move, and delete files.
• Administrative duties - chmod, chown, chgrp.
• Set replication factor for a file.
• Head, tail, cat to view files.
MapReduce in Hadoop
MapReduce Basics
• Logical functions: Mappers and Reducers.
• Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
• Hadoop handles distributing the Map and Reduce tasks across the cluster.
• Typically batch oriented.
MapReduce Daemons
• JobTracker (Master)
- Manages MapReduce jobs: assigns tasks to nodes and handles task failures
• TaskTracker (Slave)
- Creates individual map and reduce tasks
- Reports task status to JobTracker
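The master/slave split between the two daemons can be sketched as a toy scheduler in plain Java. This is not how Hadoop implements the JobTracker: the round-robin policy, the "failed tracker" set, and every name below are assumptions made for illustration. It only shows the idea that the master hands tasks to workers and routes around workers that have failed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TaskAssignmentSketch {

    /**
     * Assigns each task to the next tracker in round-robin order.
     * A tracker listed in failedTrackers is skipped and the task goes to the next one.
     * Returns one "task -> tracker" line per task.
     */
    public static List<String> assign(List<String> tasks, List<String> trackers,
                                      Set<String> failedTrackers) {
        List<String> assignments = new ArrayList<>();
        int next = 0;
        for (String task : tasks) {
            String chosen = null;
            // Try each tracker at most once per task.
            for (int attempt = 0; attempt < trackers.size() && chosen == null; attempt++) {
                String candidate = trackers.get(next % trackers.size());
                next++;
                if (!failedTrackers.contains(candidate)) {
                    chosen = candidate;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("no healthy tracker for " + task);
            }
            assignments.add(task + " -> " + chosen);
        }
        return assignments;
    }

    public static void main(String[] args) {
        List<String> tasks = Arrays.asList("map-0", "map-1", "map-2", "reduce-0");
        List<String> trackers = Arrays.asList("tt1", "tt2", "tt3");
        Set<String> failed = new HashSet<>(Arrays.asList("tt2")); // hypothetical failure
        for (String line : assign(tasks, trackers, failed)) {
            System.out.println(line);
        }
    }
}
```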
MapReduce in Hadoop
Let’s look at how MapReduce actually works in Hadoop,
using WordCount.
We need to convert the Input into the Output.
[Figure, built up over several slides: Input, Mappers, Sort/Shuffle, Reducers, Output]
Input
(doc1, "…"), (doc2, "…"), (doc3, ""), (doc4, "…")
Document text: "Hadoop uses MapReduce", "There is a Map phase", "There is a Reduce phase"
Mappers
(hadoop, 1) (uses, 1) (mapreduce, 1)
(there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)
(there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
Sort, Shuffle
0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
r-z: (reduce, [1]), (there, [1,1]), (uses, [1])
Reducers
Output
a 2, hadoop 1, is 2
map 1, mapreduce 1, phase 2
reduce 1, there 2, uses 1
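The key ranges in the figure (0-9 and a-l, m-q, r-z) decide which reducer receives each key. Below is a minimal sketch of that routing in plain Java. The class name and the fallback for other characters are assumptions; note also that Hadoop's default partitioner hashes the key, so the range scheme here follows the figure rather than the default behaviour.

```java
public class RangePartitionerSketch {

    /** Returns the reducer index (0, 1, or 2) for a key, using the figure's key ranges. */
    public static int partition(String key) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("empty key");
        }
        char first = Character.toLowerCase(key.charAt(0));
        if (first >= 'm' && first <= 'q') {
            return 1; // m-q
        }
        if (first >= 'r' && first <= 'z') {
            return 2; // r-z
        }
        return 0; // 0-9 and a-l (in this sketch, also any other character)
    }

    public static void main(String[] args) {
        String[] keys = {"a", "hadoop", "is", "map", "mapreduce", "phase", "reduce", "there", "uses"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partition(key));
        }
    }
}
```

Every occurrence of the same key lands on the same reducer, which is what lets one reducer see all the counts for a word.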
Map:
• Transform one input to 0-N outputs.
Reduce:
• Collect multiple inputs into one output.
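The whole WordCount flow can be simulated in memory with plain Java collections. This is a single-process sketch, not Hadoop code: there is no cluster, the class and method names are made up, and a sorted map stands in for the shuffle/sort step. The input uses the three sentences from the slides plus one empty document.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    /** Map: one input document becomes zero or more (word, 1) pairs. */
    public static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.split("\\s+")) {
            if (token.length() > 0) {
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    /** Shuffle/sort: group all values by key, with keys in sorted order. */
    public static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: the list of values for one key collapses into a single count. */
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map, shuffle, and reduce over all documents and returns word -> count. */
    public static TreeMap<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String document : documents) {
            allPairs.addAll(map(document));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(allPairs).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
            "Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase");
        System.out.println(wordCount(documents));
    }
}
```

The empty document shows the "0-N outputs" case: its map call emits nothing, and the job still completes normally.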
Cluster View of MapReduce
[Figure, animated over several slides: a JobTracker and a NameNode alongside three worker nodes, each running a TaskTracker and a DataNode. A jar holding the map (M) and reduce (R) code is submitted to the cluster. The frames are labelled, in order:]
• Map Phase: map tasks (M) run on the worker nodes and emit key-value pairs (k,v). Intermediate data is stored locally.
• Shuffle/Sort: the key-value pairs are moved between the worker nodes.
• Reduce Phase: reduce tasks (R) run on the worker nodes.
• Job Complete!
The Hadoop Java API
MapReduce in Java
Let’s look at WordCount written in the MapReduce Java API.
Map Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Output key and value objects, reused for every emitted pair.
  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Split the document on runs of whitespace.
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        // Emit (lowercased word, 1).
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}
Let’s drill into this code...
Mapper class with four type parameters: the input key and value types, then the output key and value types.
Output key-value objects we’ll reuse.
Map method with input, output “collector”, and reporting object.
Tokenize the line, “collect” each (word, 1)
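The tokenize-and-collect step can be tried without a cluster. The sketch below is plain Java with no Hadoop classes: `MapSketch` and `mapLine` are hypothetical names, and a `List` of entries stands in for the `OutputCollector`. It only illustrates the logic of the map method, not how Hadoop invokes it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mapper: plain Java, no Hadoop classes.
final class MapSketch {

    // Splits one line on runs of whitespace and "collects" a (word, 1)
    // pair for every non-empty token, lowercased as in the slide code.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> collected = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (token.length() > 0) {
                collected.add(new SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // Prints [hadoop=1, uses=1, hdfs=1, hadoop=1]
        System.out.println(mapLine("Hadoop uses HDFS  hadoop"));
    }
}
```

Note that the same word is emitted more than once. The mapper does no counting; that is left to the reducer.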
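import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Add up every count emitted for this word.
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    // Emit (word, total).
    output.collect(key, new IntWritable(count));
  }
}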
Reduce Code
Let’s drill into this code...
Reducer class with four type parameters: the input key and value types, then the output key and value types.
Reduce method with input, output “collector”, and reporting object.
Sum the counts for each word and emit (word, N).
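The summing step can also be sketched in plain Java. `ReduceSketch` and `reduceCounts` are hypothetical names, and an `Iterator<Integer>` stands in for the `Iterator<IntWritable>` the framework passes in. The sketch assumes the counts for one word have already been grouped together, which is what the framework does before it calls reduce.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the reducer: plain Java, no Hadoop classes.
final class ReduceSketch {

    // Mirrors the body of reduce(): walks the counts for one word and sums them.
    static int reduceCounts(Iterator<Integer> counts) {
        int count = 0;
        while (counts.hasNext()) {
            count += counts.next();
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed input: three (hadoop, 1) pairs, already grouped by key.
        List<Integer> countsForHadoop = Arrays.asList(1, 1, 1);
        System.out.println("hadoop\t" + reduceCounts(countsForHadoop.iterator()));
    }
}
```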
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down scripting language for manipulating data.
• HBase - A NoSQL data store with non-sequential (random) access.
Other Options
Conclusions
• A cost-effective, scalable way to:
- Store massive data sets.
- Perform arbitrary analyses on those data sets.
Hadoop Benefits
• Offers a variety of tools for:
- Application development.
- Integration with other platforms (e.g., databases).
Hadoop Tools
• A rich, open-source ecosystem.
- Free to use.
- Commercially-supported distributions.
Hadoop Distributions
Thank You!
- Feel free to contact me at
- Or our solutions consultant
- As always, THINK BIG!
Bonus Content
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down scripting language for manipulating data.
• HBase - A NoSQL data store with non-sequential (random) access.
The Hadoop Ecosystem
Hive: SQL for Hadoop
Hive
Let’s look at WordCount written in Hive, the SQL for Hadoop.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Let’s drill into this code...
Create a table to hold the raw text we’re counting. Each line of text is stored as a row with a single “column”.
Load the text in the “docs” directory into the table.
Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly.
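For readers who think in Java rather than SQL, the nested query can be read as a small stream pipeline. This is only an analogy in plain Java, not how Hive executes the statement. `HiveQuerySketch` and `wordCounts` are hypothetical names, and a `TreeMap` stands in for GROUP BY followed by ORDER BY. In this Java version, splitting on a single whitespace character means consecutive spaces yield empty tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical Java analogy for the Hive WordCount query.
final class HiveQuerySketch {

    static Map<String, Long> wordCounts(List<String> docs) {
        return docs.stream()
            // Inner query: explode(split(line, '\s')) turns each line into one row per word.
            .flatMap(line -> Arrays.stream(line.split("\\s")))
            // Outer query: GROUP BY word with count(1); the TreeMap keeps keys sorted (ORDER BY word).
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Prints {hadoop=1, hive=2, pig=1}
        System.out.println(wordCounts(Arrays.asList("hadoop hive", "hive pig")));
    }
}
```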
Hive
Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!!
Pig: Data Flow for Hadoop
Pig
Let’s look at WordCount written in Pig, the Data Flow language for Hadoop.
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Let’s drill into this code...
Like the Hive example, load the “docs” content; each line is a “field”.
Tokenize each line into words (a bag, in Pig’s terms) and “flatten” the bag into separate records.
Collect the same words together.
Count each word.
Save the results. Profit!
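Each Pig statement names an intermediate relation, and the shape of those relations is easy to miss on a first read. The sketch below mirrors the flow in plain Java, one method per statement, so the intermediate values can be inspected. `PigFlowSketch` and its method names are hypothetical, whitespace splitting is a simplification (Pig’s TOKENIZE uses its own set of delimiters), and the `TreeMap` is used only to make the printed order predictable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical Java analogy for the Pig WordCount data flow.
final class PigFlowSketch {

    // words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
    static List<String> tokenizeAndFlatten(List<String> inpt) {
        List<String> words = new ArrayList<>();
        for (String line : inpt) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token); // one record per word
                }
            }
        }
        return words;
    }

    // grpd = GROUP words BY word;  each group key maps to a "bag" of matching records.
    static Map<String, List<String>> groupByWord(List<String> words) {
        Map<String, List<String>> grpd = new TreeMap<>();
        for (String word : words) {
            grpd.computeIfAbsent(word, k -> new ArrayList<>()).add(word);
        }
        return grpd;
    }

    // cntd = FOREACH grpd GENERATE group, COUNT(words);
    static Map<String, Long> countGroups(Map<String, List<String>> grpd) {
        Map<String, Long> cntd = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grpd.entrySet()) {
            cntd.put(group.getKey(), (long) group.getValue().size());
        }
        return cntd;
    }

    public static void main(String[] args) {
        List<String> inpt = Arrays.asList("hadoop pig", "pig hive");
        Map<String, List<String>> grpd = groupByWord(tokenizeAndFlatten(inpt));
        System.out.println(grpd);              // {hadoop=[hadoop], hive=[hive], pig=[pig, pig]}
        System.out.println(countGroups(grpd)); // {hadoop=1, hive=1, pig=2}
    }
}
```

The grouped relation still holds every record; only the final step reduces each bag to a number.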
Pig
Pig and Hive overlap, but Pig is popular for ETL work such as data transformation, cleansing, and ingestion.
Questions?