2013 10 21_db_lecture_06_hdfs
DESCRIPTION
TRANSCRIPT
![Page 2: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/2.jpg)
Intro
• Hadoop ecosystem
• History
• Setups
2
![Page 3: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/3.jpg)
HDFS
• Features
• Architecture
• NameNode
• DataNode
• Data flow
• Tools
• Performance
3
![Page 4: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/4.jpg)
Hadoop ecosystem
4
![Page 5: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/5.jpg)
Doug Cutting
5
![Page 6: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/6.jpg)
History
• 2002: Apache Nutch
• 2003: GFS
• 2004: Nutch Distributed Filesystem (NDFS)
• 2005: MapReduce
• 2006: subproject of Lucene => Hadoop
6
![Page 7: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/7.jpg)
History
• 2006: Yahoo! Research cluster: 300 nodes
• 2006: BigTable paper
• 2007: Initial HBase prototype
• 2007: First usable HBase (Hadoop 0.15.0)
• 2008: 10,000-core Hadoop cluster
7
![Page 8: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/8.jpg)
Яндекс
• 2 MapReduce clusters MapReduce (50 nodes, 100 nodes)
• 2 HBase clusters
• 500 nodes cluster is on its way :)
• Fact extraction, data mining, statistics, reporting, analytics
8
![Page 9: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/9.jpg)
• 2 major clusters (1100 and 300 nodes)
• Messages; machine learning, log storage, reporting, analytics
• 8B messages/day, 2Pb online data, growing 250 Tb/ months
9
![Page 10: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/10.jpg)
EBay
• 532 nodes cluster (8 * 532 cores, 5.3PB).
• Heavy usage of MapReduce, HBase, Pig, Hive
• Search optimization and Research
10
![Page 11: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/11.jpg)
11
![Page 12: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/12.jpg)
HDFS
12
![Page 13: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/13.jpg)
Features
13
![Page 14: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/14.jpg)
Features
• Distributed file system
• Replicated, highly fault-tolerant
• Commodity hardware
• Aimed for Map Reduce
• Pure java
• Large files, huge datasets
14
![Page 15: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/15.jpg)
Features
• Traditional hierarchical file organization
• Permissions
• No soft links
15
![Page 16: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/16.jpg)
Features
• Streaming data access
• Write-once-read-many access model for files
• Strictly one writer
16
![Page 17: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/17.jpg)
Hardware Failure
• Hardware failure is the norm rather than the exception
• Detection of faults and quick, automatic recovery
17
![Page 18: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/18.jpg)
Not good for
• Multiple data centers
• Low latency data access (<10 ms)
• Lots of small files
• Multiple writers
18
![Page 19: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/19.jpg)
Architecture
19
![Page 20: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/20.jpg)
Architecture
• One NameNode, many DataNodes
• File is a sequence of blocks (64 Mb)
• Meta data in memory of namenode
• Client interacts with datanodes directly
20
![Page 21: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/21.jpg)
Architecture
21
![Page 22: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/22.jpg)
Big blocks
• Let’s say we want to achieve 1% seek overhead
• Seek time ~10ms, transfer rate ~100 Mb/s
• So block size must be ~100Mb
22
![Page 23: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/23.jpg)
NameNode
23
![Page 24: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/24.jpg)
NameNode
• Manages the FS namespace
• Keeps in Memory all files and blocks
• Executes FS operations like opening, closing, and renaming
• Makes all decisions regarding replication of blocks
• SPOF
24
![Page 25: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/25.jpg)
NameNode
• NS stored in 2 files: namespace image, edit log
• Determines the mapping of blocks to DataNodes
• Receives a Heartbeat and a Blockreport from each of the DataNodes
25
![Page 26: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/26.jpg)
NameNode
26
![Page 27: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/27.jpg)
NameNode
• Store persistent state to several file system
• Secondary MasterNode
• HDFS High-Availability
• HDFS Federation
27
![Page 28: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/28.jpg)
28
![Page 29: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/29.jpg)
DataNode
29
![Page 30: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/30.jpg)
DataNode
• Usually 1 DataNode 1 machine
• Serves read and write client requests
• Performs block creation, deletion, and replication
30
![Page 31: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/31.jpg)
DataNode
• Each block represented as 2 files: data, metadata
• Metadata contains checksums, generation stamp
• Handshake while starting the cluster to verify ID of namespace and SW version
31
![Page 32: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/32.jpg)
DataNode
• NameNode never calls DataNodes
• Replicate block
• Remove local replica
• Re-register/shut down
• Send an immediate block report
32
![Page 33: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/33.jpg)
Data Flow
33
![Page 34: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/34.jpg)
Distance
• HDFS is rack aware
• Network is presented as tree
• d(n1, n2) = d(n1, A) + d (n2, A), A -- closest common ancestor
34
![Page 35: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/35.jpg)
Distance
35
![Page 36: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/36.jpg)
Read
• Get list of blocks locations from NameNode
• Iterate over blocks and read
• Verifies checksums
36
![Page 37: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/37.jpg)
Read
• On fail or when unable to connect DN:
• try another replica
• remember that
• tell NameNode
37
![Page 38: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/38.jpg)
Read
38
![Page 39: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/39.jpg)
Read
39
final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(conf);
InputStream in = null;
try {
in = fs.open(new Path("/user/test-hdfs/somefile"));
//do whatever with inpustream
} finally {
IOUtils.closeStream(in);
}
![Page 40: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/40.jpg)
Write
• Ask NameNode to create file
• NameNode performs some checks
• Client ask for list of DataNodes
• Forms a pipeline and ack queue
40
![Page 41: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/41.jpg)
Write
• Send packets to the pipeline
• In case of failure
• close pipeline
• remove bad DN, tell NN
• retry sending to good DNs
• block will be replicated asynchronously
41
![Page 42: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/42.jpg)
Write
42
![Page 43: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/43.jpg)
Write
43
final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = null;
try {
out = fs.create(new Path("/user/test-hdfs/newfile"));
//write to out
} finally {
IOUtils.closeStream(in);
}
![Page 44: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/44.jpg)
Block Placement
• Reliability/Bandwidth trade off
• By default: same node, 2 random nodes from another rack, other random nodes
• No Datanode contains more than one replica of any block
• No rack contains more than two replicas of the same block
44
![Page 45: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/45.jpg)
Tools
45
![Page 46: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/46.jpg)
Balancer
• Compare utilization of node with utilization of cluster
• Guarantees that the decision does not reduce either the number of replicas or the number of racks
• Minimizes the inter-rack data copying
46
![Page 47: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/47.jpg)
Block scanner
• Periodically scans replica, verifies checksums
• Notifies NameNode if checksum fails
• Replicate first, then delete
47
![Page 48: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/48.jpg)
Performance
48
![Page 49: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/49.jpg)
Performance
• Cluster ~3500 nodes
• Total bandwidth is linear to number of nodes
• DFSIO Read: 66 MB /s per node
• DFSIO Write: 40 MB /s per node
49
![Page 50: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/50.jpg)
Performance
• Open file for read: 126100
• Create file: 5600
• Rename file: 8300
• Delete file: 20700
• DataNode heartbeat: 300000
• Block reports (block/s): 639700
50
![Page 51: 2013 10 21_db_lecture_06_hdfs](https://reader033.vdocuments.net/reader033/viewer/2022052820/54c67c914a795993528b45a7/html5/thumbnails/51.jpg)
Sources
• The Hadoop Distributed File System (K. Shvacko)
• HDFS Architecture Guide
• Hadoop: The Definitive Guide (O’Reilly)
51