ghislain fourny big data for engineers fall 2019 · ghislain fourny big data for engineers fall...

Ghislain Fourny

Big Data for Engineers Fall 20194. Distributed file systems

Kheng Ho Toh / 123RF Stock Photo

2

So far...

We've

rehearsed

relational

databases

3

So far...

We've

looked into

scaling out

4

So far...

We've

seen

Object storage

5

So far...

We've

looked into

the

Key-Value Model

6

Poll

https://eduapp-app1.ethz.ch/

Go now to:

or install EduApp 3.x

There is

Big Dataand

Big Data

Anna Liebiedieva / 123RF Stock Photo

Vadym Kurgak / 123RF Stock Photo

8

Use cases

A huge amount of large files?

9

Use cases

vs.

A huge amount of large files?

A large amount of huge files?

10

Use cases

vs.

Billions of TB files

Millions of PB files

Object Storage

File Storage

11

Where does the data come from?

Raw Data

Sensors

Measurements

Events

Logs

Oleg Dudko / 123RF Stock Photo

12

Where does the data come from?

Raw Data Derived Data

Sensors

Measurements

Events

Logs

Aggregated data

Intermediate data

Oleg Dudko / 123RF Stock Photo

Anton Starikov / 123RF Stock Photo

13

Technologies and models

Key-Value Store File System

Object Storage Block Storage

Billions of

<TB files

Millions of

<PB filesvs.

14


Key-Value Store File System


Billions of

<TB files

Millions of

<PB filesvs.

15


Key-Value Model File System


Billions of

<TB files

Millions of

<PB filesvs.

16


Key-Value Model File System


Billions of

<TB files

Millions of

<PB filesvs.

17

Distributed file systems: inception

FS

17

18

GFS genesis

Characteristics

19

GFS genesis

Characteristics

Requirements

20

GFS genesis

Characteristics

File System Design

Requirements

21

Fault tolerance and robustness

Vitaly Korovin / 123RF Stock Photo

It might fail

Local disk

22


Vitaly Korovin / 123RF Stock Photo

It might fail

nodes will fail


Local disk

Cluster with 100s to10,000s of machines

22

23


Monitoring


24


Monitoring

Error detection


25


Monitoring

Error detection

Automatic Recovery


26


Fault tolerance

Monitoring

Error detection

Automatic Recovery


27

File read model

Random access Scan the file

vs.

28

File update model

Random access Upsert/append only

vs.

29

File update model

immutable

Append

suitable for

Sensors

Logs

Intermediate data

_____

_____

_____

30

Appends

Append only

100s of clients

in parallel

atomic

31

Performance requirements

Top priority:

Throughput

32

Performance requirements

? !

Top priority:

Throughput

Secondary:

Latency

33

The progress made (1956-2018): Logarithmic

Picture: Ash Waechter/123RF

10,000x

8x

Throughput LatencyCapacity

(per unit of volume)

200,000,000,000x

34



200,000,000,000x

Throughput Latency

Parallelize!

Capacity


10,000x

8x

35



Throughput Latency

Batch processing!

200,000,000,000x

Capacity


10,000x

8x

36

Hadoop

37

Hadoop

Initiated in

2006

38

Hadoop

Primarily:

• Distributed File System (HDFS)

• MapReduce

• Wide column store (HBase)

Covere

d in this

lectu

re

39

Hadoop

Inspired by Google's

• GFS (2003)

• MapReduce (2004)

• BigTable (2006)

Covere

d in this

lectu

re

40

Size timeline

Date Size reported by Yahoo

April 2006 188

May 2006 300

October 2006 600

April 2007 1,000

February 2008 10,000 (index generation)

March 2009 24,000 (17 clusters)

June 2011 42,000 (100+ PB)

November 2016 100,000? (600PB)

41

Distributed file systems: the model

42

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

File Systems (Logical Model)

Key-Value Storage

43

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

File Systems (Logical Model)

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

Key-Value Model File Hierarchy

vs.

44

Block Storage (Physical Storage)

111010010110101…

Object Storage

45

Block Storage (Physical Storage)

111010010110101…

1 2 3

4 5 6

7 8


46

Terminology

HDFS: Block

GFS: Chunk

47

Files and blocks

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

48

Files and blocks

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

12 3 4

5

6

7

8

49

Files and blocks

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

12 3 4

5

6

7

8

12

3

50

Files and blocks

Lorem Ipsum

Dolor sit amet

Consectetur

Adipiscing

Elit. In

Imperdiet

Ipsum ante

12 3 4

5

6

7

8

12

3

1

2 3 4

51

Why blocks?

52

Why blocks?

1. Files bigger than a disk

PBs!

53

Why blocks?

1. Files bigger than a disk

PBs!

2. Simpler level of abstraction

54

Single machine vs. distributed

55

The right block size

Simple file system

4 kB

56

Poll

https://eduapp-app1.ethz.ch/

Go now to:

or install EduApp 2.x

57


Simple file system Distributed file system

4 kB

64 MB – 128 MB

58


Relational Database Distributed file system

4 kB – 32 kB

64 MB – 128 MB

59

HDFS Architecture

60

How do we connect the many machines?

61

Peer-to-peer architecture

62

Master-slave architecture

Slave

Master

Slave Slave Slave Slave Slave

63

HDFS server architecture

64

HDFS server architecture

Datanode Datanode Datanode Datanode Datanode Datanode

Namenode

65

From the file perspectiveNamenode

File...

66

From the file perspective

File...

...divided into 128MB chunks...

Namenode

67

From the file perspective

File...

...divided into 128MB chunks...

... replicated for fault tolerance

Namenode

68

Concurrently accessed


Namenode

69

Hadoop implementation

(Packaged code)

70

HDFS Architecture


Namenode

71

HDFS Architecture: NameNode


Namenode

72

NameNode: all system-wide activity

73


Memory

1 File namespace

(+Access Control)

74


Memory

/dir/file1

/dir/file2

/file3

File to block mapping

1 File namespace

(+Access Control)

2

75


Memory

Block locations

/dir/file1

/dir/file2

/file3


1 File namespace

(+Access Control)

2

3

76

HDFS Architecture


Namenode

77

HDFS Architecture: DataNode


Namenode

78

DataNode

79

DataNode

80

DataNode

Blocks are stored on the

local disk

81

DataNode

Proximity to hardware facilitates disk failure detection

82

Block IDs

64 bits

e.g., 7586700455251598184

83

Subblock granularity: Byte Range

84

Communication

Datanode

Namenode

Datanode

Client

85

Summary

Datanode

Namenode

Datanode

Client

Client Protocol

DataTransfer

Protocol

DataNode

Protocol

86

Communication

Datanode

Namenode

Datanode

Client

87

Client Protocol (RPC)

Client

Metadata operations

Namenode

88


Client

Metadata operations

DataNode location

Namenode

89


Client

Metadata operations

DataNode location

Block IDs Namenode

90


NamenodeClient

Metadata operations

DataNode location

Block IDs

Java API available

90

91

Communication

Datanode

Namenode

Datanode

Client

Control

92

DataNode Protocol (RPC)

Datanode

Datanode always

initiates connection!

Namenode

93


Datanode

Datanode always


Registration

Namenode

94


Datanode

Heartbeat

Datanode always


every 3s

custo

miz

able

Registration

Namenode

95


Datanode

HeartbeatBlock operations

Datanode always


every 3s

custo

miz

able

Registration

Namenode

96


Datanode

Heartbeat

BlockReportBlock operations

Datanode always


every 3s

every 6h

custo

miz

able

Registration

Namenode

97


Datanode

Heartbeat


Datanode always


every 3s

every 6h

custo

miz

able

Registration

BlockReceived

Namenode

98


Datanode

Namenode

Heartbeat


Java API available

every 3s

every 6h

custo

miz

able

Registration

BlockReceived

99

DataNode Protocol

Datanode

Heartbeat


Datanode always


every 3s

every 6h

custo

miz

able

Registration

BlockReceived

Namenode

100

Communication

Datanode

Namenode

Datanode

Client

Control

Control

101

DataTransfer Protocol (Streaming)

DataNodeClient

Data blocks

DataNodeDataNode

101

102


DataNodeClient

Data blocks

DataNodeDataNode

Replication

pipelining

(write only)

102

103


DataNodeClient

Data blocks

DataNodeDataNode

Replication

pipelining

(write only)

103

104

Communication

Datanode

Namenode

Datanode

Client

Control

Control

ControlData

Control

105

Summary

Datanode

Namenode

Datanode

Client

Client Protocol

DataTransfer

Protocol

DataNode

Protocol

106

Metadata functionality

Create directory

Delete directory

Write file

Append to file

Read file

Delete file

107

Client reads a file

108

Client reads a file

Asks for file1

109

Client reads a file

Get block locations

Multiple DataNodes for each block,

sorted by distance

2

110

Client reads a file

Read

3Input

Stream

111

Client writes a file

112


Create1

113


DataNodes for first block

2

114


Organizes pipeline3

115


Sends data over4

116


Ack5

117


DataNodes for second block

2

118


Organizes pipeline3

119


Sends data over4

120


Ack5

121

This is all done simultaneously under DFSOutputStream (streaming through)

Sends data over4

DataNodes for nth block

2

5

3 Pipeline

Ack

122


Close/release lock

6

123


Checking for

minimal

replication

(DataNode protocol)

7

124


Ack

8

125


replicates further asynchronously 9

126

Replicas

127

Replicas

Number of replicas

specified

per filedefault:3

128

Replica placement: what to consider?

Reliability

Read/Write Bandwidth

Block distribution

129

Replica placement: Reminder on topology

Cluster

Rack

Node

130

Replica placement: Distance

BA

D(A,B)=2

131

Replica placement: Distance

BA

D(A,B)=4

132

Replica placement

Replica 1: same node as client (or random), rack A

133

Replica placement


Replica 2: a node in a different rack B

134

Replica placement



Replica 3: a node in same rack B

135

Replica placement



Replica 3: a node in same rack B

Replica 4 and beyond: random, but if possible:

• at most one replica per node

• at most two replicas per rack

136

Replica placement

Client

137

Why replicas 2+3 on other rack?

Client

138

If replicas 1+2 were on same rack...

Block concentration on same rack (2/3)

139

Performance and availability

140

The NameNode is a single point of failure


Namenode

/dir/file1

/dir/file2

/file3

141

The namenode is a single point of failure...


Namenode

/dir/file1

/dir/file2

/file3

What if it fails?

142


Memory

Block locations

/dir/file1

/dir/file2

/file3


1 File namespace

(+Access Control)

2

3

143

1. You want to persist

Memory

1/dir/file1

2

3 not persisted

144


Namespace

file

Persistent Storage

Memory

1/dir/file1

2

3 not persisted

145


Namespace

file

Persistent Storage

Memory

1/dir/file1

/dir/file2

2

3 not persisted

Edit log

146


Namespace

file

Persistent Storage

Memory

1/dir/file1

/dir/file2

/file3

2

3 not persisted

Edit log

147

2. You want to backup

Namespace

fileEdit log

Persistent Storage

Shared driveBackup drives

Glacier

148



Namenode

/dir/file1

/dir/file2

/file3

What if it fails?

149



Namenode

/dir/file1

/dir/file2

/file3

We need to start

it up again!

150

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Edit log

151

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Filesystem

Edit log

152

Edit log

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Filesystem

153

Edit log

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Filesystem

/dir/file1

/dir/file2

/file3

154

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Filesystem

/dir/file1

/dir/file2

/file3

Block locations

Edit log

155

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Filesystem

/dir/file1

/dir/file2

/file3

Block locations

Edit log

156

Namenodes: Startup

Namespace

file

Persistent Storage

Memory

Filesystem

/dir/file1

/dir/file2

/file3

Block locations

Edit log

157

Starting a namenode...

... takes

30 minutes!

158

Starting a namenode...

Can we do

better?

159

3. Checkpoints with Secondary NameNode

Old namespace

fileEdit log

New namespace file

160

4. High Availability (HA): Backup NameNodes


Namenode

/dir/file1

/dir/file2

/file3

Namenode

/dir/file1

/dir/file2

/file3

Maintains mappings and locations

in memory like the namenode.

161

5. High Availability (HA): Standby NameNodes

Active

Namenode

Standby

Namenode

Standby

Namenode

162

5. Federated DFS


Namenode /foo

/foo/file1

/foo/file2

Namenode /bar

/bar/file1

/bar/file2

163

Using HDFS

164

HDFS Shell

$ hadoop fs <args>

165

HDFS Shell

$ hadoop fs <args>

$ hdfs dfs <args>

166

HDFS Shell

$ hadoop fs <args>

$ hdfs dfs <args>local

filesystem

167

HDFS Shell: POSIX-like

$ hadoop fs –ls

$ hadoop fs –cat /dir/file

$ hadoop fs –rm /dir/file

$ hadoop fs –mkdir /dir

168

HDFS Shell: upload and download

$ hadoop fs –copyToLocal /user/hadoop/file

localfile

$ hadoop fs –copyFromLocal

localfile1 localfile2

/user/hadoop/hadoopdir

169

HDFS Shell: Configuration

core-site.xml

<properties>

<property>

<name>fs.defaultFS</name>

<value>hdfs://host:8020</value>

<description>NameNode hostname</description>

</property>

</properties>

170

HDFS Shell: Configuration

hdfs-site.xml

<properties>

<property>

<name>dfs.replication</name>

<value>1</value>

<description>Replication factor</description>

</property> <property>

<name>dfs.namenode.name.dir</name>

<value>/grid/hadoop/hdfs/nn</value>

<description>NameNode directory</description>

</property>

<property>

<name>dfs.datanode.name.dir</name>

<value>/grid/hadoop/hdfs/nn</value>

<description>DataNode directory</description>

</property>

</properties>

171

Populating HDFS: Apache Flume

Collects, aggregates, moves log data(into HDFS)

_____ _____

__ _____ ___

_____

172

Populating HDFS: Apache Sqoop

Imports from a relational database

173

GFS

174

GFS vs. HDFS: Terminology

NameNode

DataNode

Block

FS Image

Edit log

HDFS

Master

Chunkserver

Chunk

Checkpoint image

Operation log

GFS

175

HDFS vs. GFS: Block size

GFS/Apache HDFS

64 MB

128 MB

Cloudera HDFS

176

Pointers

Official documentation

http://hadoop.apache.org/docs/r3.2.1/

GFS Paper

On course website

Java API

https://hadoop.apache.org/docs/r3.2.1/api/ind

ex.html

ghislain fourny big data for engineers fall 2019 · ghislain fourny big data for engineers fall...

Documents