ghislain fourny big data for engineers fall 2019 · ghislain fourny big data for engineers fall...
TRANSCRIPT
Ghislain Fourny
Big Data for Engineers Fall 20194. Distributed file systems
Kheng Ho Toh / 123RF Stock Photo
2
So far...
We've
rehearsed
relational
databases
3
So far...
We've
looked into
scaling out
4
So far...
We've
seen
Object storage
5
So far...
We've
looked into
the
Key-Value Model
6
Poll
https://eduapp-app1.ethz.ch/
Go now to:
or install EduApp 3.x
There is
Big Dataand
Big Data
Anna Liebiedieva / 123RF Stock Photo
Vadym Kurgak / 123RF Stock Photo
8
Use cases
A huge amount of large files?
9
Use cases
vs.
A huge amount of large files?
A large amount of huge files?
10
Use cases
vs.
Billions of TB files
Millions of PB files
Object Storage
File Storage
11
Where does the data come from?
Raw Data
Sensors
Measurements
Events
Logs
Oleg Dudko / 123RF Stock Photo
12
Where does the data come from?
Raw Data Derived Data
Sensors
Measurements
Events
Logs
Aggregated data
Intermediate data
Oleg Dudko / 123RF Stock Photo
Anton Starikov / 123RF Stock Photo
13
Technologies and models
Key-Value Store File System
Object Storage Block Storage
Billions of
<TB files
Millions of
<PB filesvs.
14
Technologies and models
Key-Value Store File System
Object Storage Block Storage
Billions of
<TB files
Millions of
<PB filesvs.
15
Technologies and models
Key-Value Model File System
Object Storage Block Storage
Billions of
<TB files
Millions of
<PB filesvs.
16
Technologies and models
Key-Value Model File System
Object Storage Block Storage
Billions of
<TB files
Millions of
<PB filesvs.
17
Distributed file systems: inception
FS
17
18
GFS genesis
Characteristics
19
GFS genesis
Characteristics
Requirements
20
GFS genesis
Characteristics
File System Design
Requirements
21
Fault tolerance and robustness
Vitaly Korovin / 123RF Stock Photo
It might fail
Local disk
22
Fault tolerance and robustness
Vitaly Korovin / 123RF Stock Photo
It might fail
nodes will fail
Kheng Ho Toh / 123RF Stock Photo
Local disk
Cluster with 100s to10,000s of machines
22
23
Fault tolerance and robustness
Monitoring
Kheng Ho Toh / 123RF Stock Photo
24
Fault tolerance and robustness
Monitoring
Error detection
Kheng Ho Toh / 123RF Stock Photo
25
Fault tolerance and robustness
Monitoring
Error detection
Automatic Recovery
Kheng Ho Toh / 123RF Stock Photo
26
Fault tolerance and robustness
Fault tolerance
Monitoring
Error detection
Automatic Recovery
Kheng Ho Toh / 123RF Stock Photo
27
File read model
Random access Scan the file
vs.
28
File update model
Random access Upsert/append only
vs.
29
File update model
immutable
Append
suitable for
Sensors
Logs
Intermediate data
_____
_____
_____
30
Appends
Append only
100s of clients
in parallel
atomic
31
Performance requirements
Top priority:
Throughput
32
Performance requirements
? !
Top priority:
Throughput
Secondary:
Latency
33
The progress made (1956-2018): Logarithmic
Picture: Ash Waechter/123RF
10,000x
8x
Throughput LatencyCapacity
(per unit of volume)
200,000,000,000x
34
The progress made (1956-2018): Logarithmic
Picture: Ash Waechter/123RF
200,000,000,000x
Throughput Latency
Parallelize!
Capacity
(per unit of volume)
10,000x
8x
35
The progress made (1956-2018): Logarithmic
Picture: Ash Waechter/123RF
Throughput Latency
Batch processing!
200,000,000,000x
Capacity
(per unit of volume)
10,000x
8x
36
Hadoop
37
Hadoop
Initiated in
2006
38
Hadoop
Primarily:
• Distributed File System (HDFS)
• MapReduce
• Wide column store (HBase)
Covere
d in this
lectu
re
39
Hadoop
Inspired by Google's
• GFS (2003)
• MapReduce (2004)
• BigTable (2006)
Covere
d in this
lectu
re
40
Size timeline
Date Size reported by Yahoo
April 2006 188
May 2006 300
October 2006 600
April 2007 1,000
February 2008 10,000 (index generation)
March 2009 24,000 (17 clusters)
June 2011 42,000 (100+ PB)
November 2016 100,000? (600PB)
41
Distributed file systems: the model
42
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
File Systems (Logical Model)
Key-Value Storage
43
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
File Systems (Logical Model)
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
Key-Value Model File Hierarchy
vs.
44
Block Storage (Physical Storage)
111010010110101…
Object Storage
45
Block Storage (Physical Storage)
111010010110101…
1 2 3
4 5 6
7 8
Object Storage Block Storage
46
Terminology
HDFS: Block
GFS: Chunk
47
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
48
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
12 3 4
5
6
7
8
49
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
12 3 4
5
6
7
8
12
3
50
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
12 3 4
5
6
7
8
12
3
1
2 3 4
51
Why blocks?
52
Why blocks?
1. Files bigger than a disk
PBs!
53
Why blocks?
1. Files bigger than a disk
PBs!
2. Simpler level of abstraction
54
Single machine vs. distributed
55
The right block size
Simple file system
4 kB
56
Poll
https://eduapp-app1.ethz.ch/
Go now to:
or install EduApp 2.x
57
The right block size
Simple file system Distributed file system
4 kB
64 MB – 128 MB
58
The right block size
Relational Database Distributed file system
4 kB – 32 kB
64 MB – 128 MB
59
HDFS Architecture
60
How do we connect the many machines?
61
Peer-to-peer architecture
62
Master-slave architecture
Slave
Master
Slave Slave Slave Slave Slave
63
HDFS server architecture
64
HDFS server architecture
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
65
From the file perspectiveNamenode
File...
66
From the file perspective
File...
...divided into 128MB chunks...
Namenode
67
From the file perspective
File...
...divided into 128MB chunks...
... replicated for fault tolerance
Namenode
68
Concurrently accessed
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
69
Hadoop implementation
(Packaged code)
70
HDFS Architecture
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
71
HDFS Architecture: NameNode
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
72
NameNode: all system-wide activity
73
NameNode: all system-wide activity
Memory
1 File namespace
(+Access Control)
74
NameNode: all system-wide activity
Memory
/dir/file1
/dir/file2
/file3
File to block mapping
1 File namespace
(+Access Control)
2
75
NameNode: all system-wide activity
Memory
Block locations
/dir/file1
/dir/file2
/file3
File to block mapping
1 File namespace
(+Access Control)
2
3
76
HDFS Architecture
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
77
HDFS Architecture: DataNode
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
78
DataNode
79
DataNode
80
DataNode
Blocks are stored on the
local disk
81
DataNode
Proximity to hardware facilitates disk failure detection
82
Block IDs
64 bits
e.g., 7586700455251598184
83
Subblock granularity: Byte Range
84
Communication
Datanode
Namenode
Datanode
Client
85
Summary
Datanode
Namenode
Datanode
Client
Client Protocol
DataTransfer
Protocol
DataNode
Protocol
86
Communication
Datanode
Namenode
Datanode
Client
87
Client Protocol (RPC)
Client
Metadata operations
Namenode
88
Client Protocol (RPC)
Client
Metadata operations
DataNode location
Namenode
89
Client Protocol (RPC)
Client
Metadata operations
DataNode location
Block IDs Namenode
90
Client Protocol (RPC)
NamenodeClient
Metadata operations
DataNode location
Block IDs
Java API available
90
91
Communication
Datanode
Namenode
Datanode
Client
Control
92
DataNode Protocol (RPC)
Datanode
Datanode always
initiates connection!
Namenode
93
DataNode Protocol (RPC)
Datanode
Datanode always
initiates connection!
Registration
Namenode
94
DataNode Protocol (RPC)
Datanode
Heartbeat
Datanode always
initiates connection!
every 3s
custo
miz
able
Registration
Namenode
95
DataNode Protocol (RPC)
Datanode
HeartbeatBlock operations
Datanode always
initiates connection!
every 3s
custo
miz
able
Registration
Namenode
96
DataNode Protocol (RPC)
Datanode
Heartbeat
BlockReportBlock operations
Datanode always
initiates connection!
every 3s
every 6h
custo
miz
able
Registration
Namenode
97
DataNode Protocol (RPC)
Datanode
Heartbeat
BlockReportBlock operations
Datanode always
initiates connection!
every 3s
every 6h
custo
miz
able
Registration
BlockReceived
Namenode
98
DataNode Protocol (RPC)
Datanode
Namenode
Heartbeat
BlockReportBlock operations
Java API available
every 3s
every 6h
custo
miz
able
Registration
BlockReceived
99
DataNode Protocol
Datanode
Heartbeat
BlockReportBlock operations
Datanode always
initiates connection!
every 3s
every 6h
custo
miz
able
Registration
BlockReceived
Namenode
100
Communication
Datanode
Namenode
Datanode
Client
Control
Control
101
DataTransfer Protocol (Streaming)
DataNodeClient
Data blocks
DataNodeDataNode
101
102
DataTransfer Protocol (Streaming)
DataNodeClient
Data blocks
DataNodeDataNode
Replication
pipelining
(write only)
102
103
DataTransfer Protocol (Streaming)
DataNodeClient
Data blocks
DataNodeDataNode
Replication
pipelining
(write only)
103
104
Communication
Datanode
Namenode
Datanode
Client
Control
Control
ControlData
Control
105
Summary
Datanode
Namenode
Datanode
Client
Client Protocol
DataTransfer
Protocol
DataNode
Protocol
106
Metadata functionality
Create directory
Delete directory
Write file
Append to file
Read file
Delete file
107
Client reads a file
108
Client reads a file
Asks for file1
109
Client reads a file
Get block locations
Multiple DataNodes for each block,
sorted by distance
2
110
Client reads a file
Read
3Input
Stream
111
Client writes a file
112
Client writes a file
Create1
113
Client writes a file
DataNodes for first block
2
114
Client writes a file
Organizes pipeline3
115
Client writes a file
Sends data over4
116
Client writes a file
Ack5
117
Client writes a file
DataNodes for second block
2
118
Client writes a file
Organizes pipeline3
119
Client writes a file
Sends data over4
120
Client writes a file
Ack5
121
This is all done simultaneously under DFSOutputStream (streaming through)
Sends data over4
DataNodes for nth block
2
5
3 Pipeline
Ack
122
Client writes a file
Close/release lock
6
123
Client writes a file
Checking for
minimal
replication
(DataNode protocol)
7
124
Client writes a file
Ack
8
125
Client writes a file
replicates further asynchronously 9
126
Replicas
127
Replicas
Number of replicas
specified
per filedefault:3
128
Replica placement: what to consider?
Reliability
Read/Write Bandwidth
Block distribution
129
Replica placement: Reminder on topology
Cluster
Rack
Node
130
Replica placement: Distance
BA
D(A,B)=2
131
Replica placement: Distance
BA
D(A,B)=4
132
Replica placement
Replica 1: same node as client (or random), rack A
133
Replica placement
Replica 1: same node as client (or random), rack A
Replica 2: a node in a different rack B
134
Replica placement
Replica 1: same node as client (or random), rack A
Replica 2: a node in a different rack B
Replica 3: a node in same rack B
135
Replica placement
Replica 1: same node as client (or random), rack A
Replica 2: a node in a different rack B
Replica 3: a node in same rack B
Replica 4 and beyond: random, but if possible:
• at most one replica per node
• at most two replicas per rack
136
Replica placement
Client
137
Why replicas 2+3 on other rack?
Client
138
If replicas 1+2 were on same rack...
Block concentration on same rack (2/3)
139
Performance and availability
140
The NameNode is a single point of failure
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
141
The namenode is a single point of failure...
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
What if it fails?
142
NameNode: all system-wide activity
Memory
Block locations
/dir/file1
/dir/file2
/file3
File to block mapping
1 File namespace
(+Access Control)
2
3
143
1. You want to persist
Memory
1/dir/file1
2
3 not persisted
144
1. You want to persist
Namespace
file
Persistent Storage
Memory
1/dir/file1
2
3 not persisted
145
1. You want to persist
Namespace
file
Persistent Storage
Memory
1/dir/file1
/dir/file2
2
3 not persisted
Edit log
146
1. You want to persist
Namespace
file
Persistent Storage
Memory
1/dir/file1
/dir/file2
/file3
2
3 not persisted
Edit log
147
2. You want to backup
Namespace
fileEdit log
Persistent Storage
Shared driveBackup drives
Glacier
148
The namenode is a single point of failure...
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
What if it fails?
149
The namenode is a single point of failure...
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
We need to start
it up again!
150
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Edit log
151
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
Edit log
152
Edit log
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
153
Edit log
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
154
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
Block locations
Edit log
155
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
Block locations
Edit log
156
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
Block locations
Edit log
157
Starting a namenode...
... takes
30 minutes!
158
Starting a namenode...
Can we do
better?
159
3. Checkpoints with Secondary NameNode
Old namespace
fileEdit log
New namespace file
160
4. High Availability (HA): Backup NameNodes
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
Namenode
/dir/file1
/dir/file2
/file3
Maintains mappings and locations
in memory like the namenode.
161
5. High Availability (HA): Standby NameNodes
Active
Namenode
Standby
Namenode
Standby
Namenode
162
5. Federated DFS
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode /foo
/foo/file1
/foo/file2
Namenode /bar
/bar/file1
/bar/file2
163
Using HDFS
164
HDFS Shell
$ hadoop fs <args>
165
HDFS Shell
$ hadoop fs <args>
$ hdfs dfs <args>
166
HDFS Shell
$ hadoop fs <args>
$ hdfs dfs <args>local
filesystem
167
HDFS Shell: POSIX-like
$ hadoop fs –ls
$ hadoop fs –cat /dir/file
$ hadoop fs –rm /dir/file
$ hadoop fs –mkdir /dir
168
HDFS Shell: upload and download
$ hadoop fs –copyToLocal /user/hadoop/file
localfile
$ hadoop fs –copyFromLocal
localfile1 localfile2
/user/hadoop/hadoopdir
169
HDFS Shell: Configuration
core-site.xml
<properties>
<property>
<name>fs.defaultFS</name>
<value>hdfs://host:8020</value>
<description>NameNode hostname</description>
</property>
</properties>
170
HDFS Shell: Configuration
hdfs-site.xml
<properties>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Replication factor</description>
</property> <property>
<name>dfs.namenode.name.dir</name>
<value>/grid/hadoop/hdfs/nn</value>
<description>NameNode directory</description>
</property>
<property>
<name>dfs.datanode.name.dir</name>
<value>/grid/hadoop/hdfs/nn</value>
<description>DataNode directory</description>
</property>
</properties>
171
Populating HDFS: Apache Flume
Collects, aggregates, moves log data(into HDFS)
_____ _____
__ _____ ___
_____
172
Populating HDFS: Apache Sqoop
Imports from a relational database
173
GFS
174
GFS vs. HDFS: Terminology
NameNode
DataNode
Block
FS Image
Edit log
HDFS
Master
Chunkserver
Chunk
Checkpoint image
Operation log
GFS
175
HDFS vs. GFS: Block size
GFS/Apache HDFS
64 MB
128 MB
Cloudera HDFS
176
Pointers
Official documentation
http://hadoop.apache.org/docs/r3.2.1/
GFS Paper
On course website
Java API
https://hadoop.apache.org/docs/r3.2.1/api/ind
ex.html