Distributed File Systems – GFS, HDFS, Ceph


Page 1:

Distributed File Systems – GFS, HDFS, Ceph

Page 2:

Content

• What is a distributed file system?
• GFS – Google File System
• HDFS – Hadoop Distributed File System
• Ceph – RedHat Distributed File System

[Diagram: overlapping circles labelled GFS, HDFS, and Ceph]


Page 4:

What is a distributed file system?

The difference between a distributed file system and a distributed data store is that a distributed file system allows files to be accessed using the same interfaces and semantics as local files - e.g. mounting/unmounting, listing directories, reads/writes at byte boundaries, the system's native permission model. Distributed data stores, by contrast, require a different API or library and have different semantics (most often those of a database).
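As a minimal sketch of this distinction (the mount point, endpoint, and client class below are hypothetical illustrations, not a real library):

    # Distributed file system: the same interface and semantics as local
    # files. "/mnt/dfs" is a hypothetical mount point.
    def read_via_dfs(path: str = "/mnt/dfs/logs/app.log") -> bytes:
        with open(path, "rb") as f:
            f.seek(4096)        # byte-granular access, like any local file
            return f.read(512)

    # Distributed data store: a separate client API with database-like
    # semantics. This stub stands in for such a library.
    class KeyValueStoreClient:
        def __init__(self, endpoint: str):
            self.endpoint = endpoint
            self._data: dict = {}

        def put(self, key: str, value: bytes) -> None:
            self._data[key] = value

        def get(self, key: str) -> bytes:
            return self._data[key]

    store = KeyValueStoreClient("datastore.example:9000")
    store.put("logs/app.log", b"payload")
    print(store.get("logs/app.log"))    # whole-value access; no seek, no mount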

Page 5:

GFS – Google File System

• Centrally managed distributed file system for Google's internal use
• Developed from BigFiles by Larry and Sergey
• Not implemented in the kernel, but rather as a user-space library
• Intra-data-centre file system
• New version in the works …

Page 6:

GFS – Motivation

• Fault tolerance
  • Google uses commodity parts with a relatively high failure rate
  • Software quality cannot be guaranteed - SW errors
  • High degree of shared resources: power supplies, switches, etc.
• Client guarantees
  • High availability - scale to infinity
  • Consistency
  • Concurrent writes and reads from multiple clients
• Infrastructure management
  • High sustained bandwidth is more important than latency.

Page 7:

GFS – Dimensions

• Multiple GFS instances in a data centre
• Millions of files
  • 100 MB is the norm
  • Multi-GB files are common
  • Small files are not common
• Terabytes of data stored in aggregate
• Hundreds of storage nodes
• Concurrently accessed by hundreds of clients

Page 8:

GFS – Assumptions

• High HW and SW failure rate
• Modest number of large files
• Large streaming reads – 1 MB per operation
• Few and small random reads – KBs per operation
• Large sequential append writes
• Appending to the end of a chunk is far more likely than writes that overwrite existing data

Page 9:

GFS – Guarantees

• Consistent: all clients will see the same data, regardless of which replica they read from
• High availability
• Atomic writes

Page 10:

GFS – Entities

• Chunk
• Chunkserver
• Master
• Client

[Diagram: an application uses the GFS client, which talks to the GFS master and to chunkservers holding chunks]

Page 11:

GFS – Entities – Chunk

• A 64 MB Linux file – most files are larger than 100 MB
• Requires 64 bytes of metadata in the master
• Stored and replicated on multiple chunkservers

[Diagram: a file split into 64 MB chunks, with padding at the end of the last chunk]
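Because chunks are fixed-size, a client can compute which chunk holds any byte offset with simple arithmetic before contacting the master. A minimal sketch:

    CHUNK_SIZE = 64 * 2**20  # 64 MB

    def locate(byte_offset: int) -> tuple:
        """Map a file byte offset to (chunk index, offset within chunk)."""
        return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

    # Byte 150,000,000 of a file falls 15,782,272 bytes into chunk 2.
    print(locate(150_000_000))  # (2, 15782272)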

Page 12:

GFS – Entities – Chunkserver

• Stores chunks on local disk(s)
• Any machine in the data centre, heterogeneous
• Performs a checksum on each block
• Checksums and the chunk map are stored in memory
• Communicates with the master

[Diagram: a chunkserver holding many chunks]

Page 13:

GFS – Entities – Master

• File system metadata – stored in memory
  • Namespace
  • Access control information
  • Imposes read and write locks
  • File-to-chunk mapping
  • Chunk locations
• Lookup table mapping full pathnames to metadata, stored in memory (see the sketch below)
• No per-directory data structure or aliasing
• Handles file mutations
• Periodically scans the namespace for changes, loss of chunkservers, deleted files, etc. using heartbeat messages
• Ensures fault tolerance
• There can be only one, for now
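A sketch of the master's in-memory state as described above: a flat lookup table keyed by full pathname with per-path locks, and a chunk-location map refreshed from chunkserver heartbeats. All names here are illustrative, not taken from the GFS code:

    import threading
    from dataclasses import dataclass, field

    @dataclass
    class FileMeta:
        chunk_handles: list = field(default_factory=list)  # file-to-chunk mapping
        acl: str = "rw-------"                             # access control info

    class Master:
        def __init__(self):
            # Flat pathname -> metadata table; no per-directory structure.
            self.namespace: dict = {}
            # Per-path read/write locks imposed on mutations.
            self.locks: dict = {}
            # Chunk handle -> chunkserver addresses, rebuilt via heartbeats.
            self.chunk_locations: dict = {}

        def lookup(self, path: str) -> FileMeta:
            lock = self.locks.setdefault(path, threading.RLock())
            with lock:
                return self.namespace[path]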

Page 14:

GFS – Entities – Client

• Application, value adder, revenue generator
  • MapReduce, Google services
• Consumes and produces data, i.e. reads and writes to the file system
• Own checksums
• Has no ability to report errors in the file system
• Caches metadata from the master

[Diagram: an application linked against the GFS client library]

Page 15:

GFS – Architecture

[Diagram: multiple applications with GFS clients, one GFS master, and many chunkservers each holding chunks]

Page 16:

GFS – Architecture

[Same diagram as page 15, annotated with a "Blade" boundary]

Page 17:

GFS – Redundancy

[Diagram illustrating redundancy]

Page 18:

GFS – Read

1. Client asks the master for replicas (namespace lookup)
2. Master replies with the replicas (IP addresses and offsets)
3. Client reads from a random replica
4. If the read fails, the client retries from a different replica

[Diagram: the GFS client resolves chunk locations via the master, then reads from one of replicas A, B and C]
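A sketch of this read path in Python; the master and chunkserver objects and their methods are hypothetical stand-ins for the real RPCs:

    import random

    CHUNK_SIZE = 64 * 2**20

    def gfs_read(master, path: str, offset: int, length: int) -> bytes:
        chunk_index = offset // CHUNK_SIZE
        # Steps 1-2: resolve replica locations through the master.
        replicas = master.get_replicas(path, chunk_index)
        # Step 3: read from a random replica; step 4: retry on failure.
        for server in random.sample(replicas, len(replicas)):
            try:
                return server.read(offset % CHUNK_SIZE, length)
            except IOError:
                continue
        raise IOError("all replicas failed")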


Page 25:

GFS – Write (per chunk)

1. Client asks the master for the primary chunkserver (namespace lookup)
2. Master evaluates lease validity
3. Master replies with the identity of the primary (IP address); the client caches this
4. Client pushes data to the closest chunkserver
5. Client forwards the write request to the primary

[Diagram: the GFS client, the master, and a primary replica with secondary replicas A and B]
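A sketch of the per-chunk write path; as before, the master and chunkserver objects are illustrative stand-ins:

    def gfs_write(master, path: str, chunk_index: int, data: bytes) -> None:
        # Steps 1-3: resolve the primary holding a valid lease
        # (the client caches this answer).
        primary, secondaries = master.get_primary(path, chunk_index)
        # Step 4: push the data to the closest chunkserver; chunkservers
        # forward it so every replica ends up holding it.
        closest = min([primary, *secondaries], key=lambda s: s.distance)
        closest.push_data(data)
        # Step 5: send the write request to the primary, which orders the
        # mutation and forwards the request to the secondaries.
        primary.apply_write(path, chunk_index)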


Page 36:

GFS – Fault tolerance – Master

• Does not distinguish between normal and abnormal termination
• Accessed using its canonical name
• The log file is key
  • Replicated to multiple machines
  • Defines the order of concurrent operations
  • The log file is the only persistent record of metadata
  • Fast recovery by replaying the log from a checkpoint

Page 37:

GFS – Fault tolerance – Chunkserver failure

• Chunks are replicated across multiple chunkservers, not RAID
• Chunks are split into 64 KB blocks, each with a 32-bit checksum stored in memory (see the sketch below)
• Failures are not discovered until the master probes the chunkserver
• A new replica is created when a chunkserver stops responding
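A sketch of the per-block checksums; CRC32 is used here as a stand-in, since the slide does not name the checksum function:

    import zlib

    BLOCK_SIZE = 64 * 1024  # chunks are checksummed in 64 KB blocks

    def checksum_blocks(chunk: bytes) -> list:
        # One 32-bit checksum per block, kept in memory.
        return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk), BLOCK_SIZE)]

    def verify(chunk: bytes, expected: list) -> bool:
        return checksum_blocks(chunk) == expected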

Page 38:

GFS – Maintenance (I)

• Garbage collection
  • Performed during heartbeats
  • A file is renamed when deleted; it is up to the client to know the new name
  • The action can be undone
• Resources are reclaimed lazily, with a user-configurable interval
  • Google uses a 3-day grace period
• Orphan chunks are removed
  • Anything not known to the master is "garbage" (see the sketch below)
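A sketch of the lazy reclamation rules above (names are illustrative):

    import time

    GRACE_SECONDS = 3 * 24 * 3600  # Google's 3-day grace period

    def reclaim_eligible(deleted_at: float, now: float = None) -> bool:
        # Deletion only renames the file; space is reclaimed lazily once
        # the grace period has passed, so the action can still be undone.
        now = time.time() if now is None else now
        return now - deleted_at > GRACE_SECONDS

    def orphan_chunks(on_chunkserver: set, known_to_master: set) -> set:
        # Anything not known to the master is garbage.
        return on_chunkserver - known_to_master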

Page 39:

GFS – Maintenance (II)

• Replica placement
  • Simple default policy: replicas across multiple racks
  • Equalise disk utilisation across chunkservers
• Creation, re-replication, and rebalancing
  • Limit the number of recent creations on each chunkserver
  • Cloning is done to the closest suitable chunkserver
  • Periodic, graceful rebalancing

Page 40:

GFS – User-tuneable parameters

• Number of replicas
• Chunkserver placement policy
• Client and chunkserver co-locality policy
• Lease timeout
• Rebalancing periodicity
• Garbage collection periodicity
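Collected as a configuration sketch; the parameter names and values below are hypothetical, since GFS's real configuration interface is not public:

    gfs_tunables = {
        "num_replicas": 3,
        "chunkserver_placement_policy": "spread_across_racks",
        "client_chunkserver_colocality": "prefer_local",
        "lease_timeout_s": 60,
        "rebalancing_period_s": 3600,
        "gc_period_s": 24 * 3600,
    }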

Page 41:

HDFS – Hadoop Distributed File System

• Decoupled metadata and data store
• One metadata store
• Locality through the client and replica placement policies
• More signalling: storage capacity, status heartbeats, etc.
• Greater configurability: block sizes, fault tolerance
• Fault tolerance through multiple replicas
• Obviously designed for Hadoop workloads

Page 42:

HDFS – Entities

• Block
• DataNode
• NameNode
• Client

[Diagram: an application uses the HDFS client, which talks to the NameNode and to DataNodes holding blocks]

Terminology mapping:

GFS          HDFS
Chunk        Block
Chunkserver  DataNode
Master       NameNode
Client       Client
-            CheckpointNode
-            BackupNode

Page 43:

HDFS – Architecture

[Diagram: the HDFS client and application, the NameNode with its CheckpointNode and BackupNode, and DataNodes holding blocks]

Page 44:

HDFS – Fault tolerance – NameNode

• CheckpointNode
  • Keeps a persistent replica of the journal
  • Truncates the journal for faster recovery
• BackupNode
  • Same as the CheckpointNode
  • Can act as a read-only NameNode; requires a load balancer

Page 45:

HDFS – Write

1. Client asks the NameNode for replicas
2. NameNode evaluates lease validity
3. NameNode replies with the identities of the replicas (IP addresses)
4. Client pipelines data to the replicas, starting from the closest
5. Client sends "close file"
6. The last replica responds
7. The lease ends

[Diagram: the HDFS client, the NameNode, and replicas A, B and C in a pipeline]
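A sketch of the pipelined write; the NameNode and DataNode objects and their methods are illustrative stand-ins:

    def hdfs_write(namenode, path: str, data: bytes) -> None:
        # Steps 1-3: the NameNode checks the lease and returns replicas.
        datanodes = namenode.get_replicas(path)
        # Step 4: pipeline the data through the replicas, closest first.
        pipeline = sorted(datanodes, key=lambda d: d.distance)
        pipeline[0].write(data, forward_to=pipeline[1:])
        # Steps 5-7: close the file; the last replica acknowledges and
        # the NameNode ends the lease.
        namenode.close_file(path)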


Page 53:

HDFS – Block placement

• The client always tries to read from the DataNode with the least separation, based on IP addresses
• Does not take DataNode disk utilisation into account
• When creating a new file (sketched below):
  • The 1st replica is placed on the same node as the client
  • The 2nd and 3rd replicas are placed on different racks
  • No rack contains more than two replicas
  • No DataNode contains more than one replica
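A sketch of this default placement policy with its two constraints; the Node type and the candidate pool are illustrative:

    import random
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        rack: str
        host: str

    def valid(placement: list) -> bool:
        # No rack holds more than two replicas; no DataNode more than one.
        racks = [n.rack for n in placement]
        return (all(racks.count(r) <= 2 for r in set(racks))
                and len(set(placement)) == len(placement))

    def place(client: Node, nodes: list, n_replicas: int = 3) -> list:
        placement = [client]                      # 1st replica: client's node
        remote = [n for n in nodes if n.rack != client.rack]
        for cand in random.sample(remote, len(remote)):
            if len(placement) == n_replicas:
                break
            if valid(placement + [cand]):         # 2nd and 3rd: other racks
                placement.append(cand)
        return placement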

Page 54:

HDFS – Block placement

[Diagram sequence: four racks; replicas 1-4 are placed one by one according to the policy above, starting at the client's node]


Page 62:

Ceph

• Integrated into the Linux kernel, transparent to applications
• Maintained by RedHat
• Found in Fedora, Ubuntu, Debian, etc.
• More general than the other two
• Scalable through a higher degree of distribution
  • Multiple metadata nodes, partitioned namespace
  • Multiple data/storage nodes
• Pseudo-random block placement
• Trusts the client's capability implementation

Page 63:

Ceph – Entities

• Block
• Storage cluster
• Metadata server
• System monitor
• Client

[Diagram: a client running on a Linux OS talks to metadata servers and to the storage cluster holding blocks]

Terminology mapping:

GFS          HDFS            Ceph
Chunk        Block           Block
Chunkserver  DataNode        Storage cluster
Master       NameNode        Metadata server
Client       Client          Client
-            CheckpointNode  -
-            BackupNode      -
-            -               System monitor

Page 64:

Ceph – Architecture

[Diagram: the Ceph client on a Linux OS, multiple metadata servers, multiple storage clusters, and several system monitors]

Page 65:

Ceph – Metadata (CRUSH)

• Controlled Replication Under Scalable Hashing: placement is computed from a hash rather than looked up centrally
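Because placement is a pseudo-random function of the object's name, any client can compute where data lives with no central lookup. A toy stand-in using rendezvous (highest-random-weight) hashing; real CRUSH additionally weights devices and respects failure-domain rules:

    import hashlib

    def placement(object_id: str, nodes: list, n_replicas: int = 3) -> list:
        def score(node: str) -> int:
            digest = hashlib.sha256(f"{object_id}:{node}".encode()).digest()
            return int.from_bytes(digest[:8], "big")
        # Every client computes the same deterministic answer.
        return sorted(nodes, key=score, reverse=True)[:n_replicas]

    print(placement("chunk-42", ["osd1", "osd2", "osd3", "osd4", "osd5"]))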

Page 66:

Ceph – Namespace partitioning

[Diagram: the directory tree partitioned into subtrees served by different metadata servers]

Page 67:

Conclusion

• GFS
  • Bespoke, Google-centric
  • Stereotypically Google-simple
  • Not very adaptive
• HDFS
  • More control signalling
  • More tuning parameters
• Ceph
  • Most general
  • Higher degree of distribution
  • Most configurable, but also most adaptive