HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System

Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale,Steve Rago, Grzegorz Calkowski, Cezary Dubnicki, Aniruddha Bohra

Feb 26, 2010

FAST 2010 – HydraFS: a High Throughput Filesystem for CAS

HYDRAstor: De-duplicated Scalable Storage

Scale-out storage
With global de-duplication
Using content-defined chunking
Resilient to multiple failures
Easy to manage (self-healing, …)
High throughput for streaming access
Standard interfaces (NFS/CIFS, VTL, …)

Content-addressable Block Store (content-addressable API)
• Scalable
• Easy to manage
• Resilient
• High throughput
FAST’09: HYDRAstor: a Scalable Secondary Storage

Access Layer
• Standard protocols
• Chunking
• High throughput
FAST’10:
• HydraFS: a High Throughput Filesystem
• Bimodal CDC for Backup Streams

HYDRAstor Usage Example

[Figure sequence, slides 3 to 16: the File System writes block B1 to the Block Store and gets back the address CA1, then writes B2 and gets back CA2. Writing B3 returns CA1 again, because the store eliminates duplicates. Block B4 holds the addresses CA1 and CA2 and is stored under CA3. In the last slides CA3 is labelled Root1.]

Block Store (CAS) API
Variable-size blocks
Content-addressable
Address decided by the store
Duplicates eliminated by store
Configurable block resilience
Garbage collection
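The block store behaviour shown in this example (the store picks the address, duplicate contents get the same address) can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: the class `ToyBlockStore`, its method names, and the use of SHA-256 as the address function are hypothetical and are not the HYDRAstor API. Resilience and garbage collection are not modelled.

```python
import hashlib


class ToyBlockStore:
    """Toy in-memory content-addressable store (hypothetical, not the HYDRAstor API)."""

    def __init__(self):
        self._blocks = {}  # content address -> block bytes

    def write(self, data: bytes) -> str:
        # The store, not the caller, decides the address. Here it is a
        # SHA-256 digest of the contents (an assumption of this sketch).
        ca = hashlib.sha256(data).hexdigest()
        # Duplicate contents map to the same address and are stored once.
        self._blocks.setdefault(ca, data)
        return ca

    def read(self, ca: str) -> bytes:
        return self._blocks[ca]

    def block_count(self) -> int:
        return len(self._blocks)


store = ToyBlockStore()
ca1 = store.write(b"contents of B1")
ca2 = store.write(b"B2, a block of a different size")
ca_dup = store.write(b"contents of B1")  # B3 has the same contents as B1
```

After the three writes the store holds two blocks, and the third write returns the same address as the first.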

Outline

HYDRAstor content-addressable API

Challenges posed to the filesystem

Filesystem architecture

Techniques used to overcome the challenges

Conclusions and future work

Challenges

Content-addressable blocks

– A change in a block’s contents also changes the block’s address

• All metadata has to change, recursively up to the filesystem root

• Parent can only be written after the children writes are successful

Variable-sized chunking (splitting file data into blocks)

– Block boundaries change when content is changed

– Overwrites cause read-rechunk-rewrite

High-latency block store operations

– Why? Hashing, compression, erasure coding, fragment distribution …

– Exacerbates the above two challenges
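The first challenge above (a changed block gets a new address, so every ancestor must be rewritten, children before parents) can be illustrated with a two-level tree. This is a minimal sketch, assuming SHA-256 stands in for the address returned by the block store; `address` and `write_tree` are hypothetical helper names.

```python
import hashlib


def address(data: bytes) -> str:
    # Stand-in for the address the block store would return (assumed SHA-256).
    return hashlib.sha256(data).hexdigest()


def write_tree(leaves):
    """Write the children first, then the parent that embeds their addresses."""
    child_addresses = [address(leaf) for leaf in leaves]
    # The parent block can only be built once every child address is known.
    parent_block = "\n".join(child_addresses).encode()
    return child_addresses, address(parent_block)


old_children, old_root = write_tree([b"block A", b"block B"])
new_children, new_root = write_tree([b"block A", b"block B, modified"])
```

Only the second child was modified, yet the root address changes as well, while the untouched child keeps its address.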

Persistent Layout

[Figure: the filesystem superblock (root block) points to the inode map root. The inode map, a segmented array indexed through the inode map B-tree, holds the file and directory inodes. A file inode reaches the file contents through the inode B-tree; a directory inode reaches the directory contents through the directory B-tree.]

HydraFS Architecture

[Figure: the File Server receives user operations and exchanges control messages with the Commit Server. The Block Store holds data, the update log (TS=1; op1, op2, ... TS=2; … TS=3; …), filesystem metadata, and file system roots labelled TS=1 and TS=20.]

File Server

Write buffer (dirty data)
– Accumulates written data; flushed on sync
– Helps re-order NFS packets arriving out-of-order

Chunker
– Decides block boundaries (based on data content)

Metadata modification records (file, directory, inode map) (dirty metadata)
• File offset_range → CA
• Directory additions/removals
• Inode map de/allocations
– Dirty metadata annotated with time-stamp (for cleaning)
– Written out to log
– Large amount of dirty metadata! Requires efficient cleaning

Resource management issues

Block cache (clean data & metadata)
• CA → block data
– Clean data and metadata (not de-serialized)
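The chunker's content-based boundary decision can be sketched with a simple rolling hash. This is not the chunking algorithm used by HydraFS, which the slides do not specify; it is a generic content-defined chunker in which the `GEAR` table, the size limits, and the function name `chunk` are all assumptions made for illustration.

```python
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # per-byte random values (assumed table)
MASK64 = (1 << 64) - 1


def chunk(data: bytes, min_size=256, avg_bits=10, max_size=4096):
    """Split data where a rolling hash of the most recent bytes hits a pattern."""
    boundary_mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64  # older bytes shift out of the hash
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & boundary_mask) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


data = bytes(random.Random(1).randrange(256) for _ in range(32 * 1024))
original = chunk(data)
shifted = chunk(b"inserted prefix" + data)  # same content, shifted by 15 bytes
shared = set(original) & set(shifted)       # chunks a de-duplicating store could reuse
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, the two chunkings will typically realign after the inserted prefix, so `shared` is usually large. The same property is why an overwrite in the middle of a file can move block boundaries and force the read-rechunk-rewrite cycle listed under Challenges.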

Write Processing

Components: Write Buffer, Chunker, Metadata Modification Records, Block Cache, Block Store

[Figure sequence, slides 25 to 36:
– Writes for [0, 8 KB) and then [8 KB, 16 KB) accumulate in the write buffer
– The chunker cuts a data block of 12 KB; [12 KB, 16 KB) stays in the write buffer
– The 12 KB of data is stored as a data block in the block store, which returns CA1
– A metadata modification record is created: TS , [0, 12KB) → CA1
– A write for [16 KB, 24 KB) arrives; the chunker cuts a data block of 10 KB; [22 KB, 24 KB) stays in the write buffer
– The block store returns CA2 and a second record is created: TS , [12KB, 22KB) → CA2
– The records are written to the block store as a log block: TS ; [0, 12KB) → CA1 ; [12KB, 22KB) → CA2 ; …]
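The bookkeeping in this walk-through (each chunk written to the block store yields one offset-range record, and the records are batched into a log block) can be sketched as follows. The names `ModRecord`, `records_for_chunks`, and `log_block`, the time-stamp value 1, and the serialized text format are assumptions for illustration; the slides do not show the real record layout.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class ModRecord:
    timestamp: int
    start: int  # inclusive file offset in bytes
    end: int    # exclusive file offset in bytes
    ca: str     # content address returned by the block store


def records_for_chunks(timestamp, first_offset, chunk_sizes, addresses):
    """Turn consecutive chunks into offset-range -> CA records."""
    records, offset = [], first_offset
    for size, ca in zip(chunk_sizes, addresses):
        records.append(ModRecord(timestamp, offset, offset + size, ca))
        offset += size
    # The returned offset is where the data still in the write buffer begins.
    return records, offset


def log_block(records):
    """Serialize records in the 'TS ; range CA ; ...' style of the slide."""
    ts = records[0].timestamp
    parts = [f"[{r.start // KB}KB, {r.end // KB}KB) {r.ca}" for r in records]
    return f"{ts} ; " + " ; ".join(parts)


# The two chunks from the walk-through: 12 KB stored as CA1, 10 KB stored as CA2.
recs, buffered_from = records_for_chunks(1, 0, [12 * KB, 10 * KB], ["CA1", "CA2"])
```

With these inputs the second record covers [12KB, 22KB) and the write buffer still holds the data from 22 KB onward, matching the last slide of the sequence.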

Commit Server

[Figure sequence, slides 37 to 39: the filesystem superblock carries a time-stamp (TS) and leads through the inode map B-tree to the file inodes. Log records such as
TS ; inode=2, [24KB, 32KB)=CA1
TS ; inode=9, [24KB, 32KB)=CA2
are applied, and the last slide shows two superblock versions, each with its own TS.]

Commit server does not read data

Amortize updates over many log records

Recovery time == the time to re-apply log

Metadata Cleaning

Initial state
Block Cache: superblock 779, CA101, CA122
File Modification Records:
inode #1; root= CA101 ; min=791; max=791
  791, [8KB, 18KB) → CA1
  791, [18KB, 31KB) → CA2
inode #2; root= CA122 ; min=801; max=803
  801, [8KB, 18KB) → CA3
  803, [18KB, 29KB) → CA4
inode #3; root= CA303 ; min=805; max=806
  805, [0KB, 10KB) → CA1
  806, [10KB, 21KB) → CA4

update( TimeStamp=802)
– Read new, evict old superblock (779 is replaced by 802)
– Process dirty inodes one by one
– Case 1: 802 ≥ max (inode #1): evict entire inode
– Case 2: 802 ≥ min and 802 < max (inode #2): drop root CA; drop records with time_stamp ≤ 802
– Case 3: 802 < min (inode #3): skip record processing (all are newer); inode root remains unchanged

Resulting state
Block Cache: superblock 802, CA101
File Modification Records:
inode #2; root= ? ; min=801; max=803
  803, [18KB, 29KB) → CA4
inode #3; root= CA303 ; min=805; max=806
  805, [0KB, 10KB) → CA1
  806, [10KB, 21KB) → CA4

Locks only one inode at a time (no tree locking)

No I/O done with the lock held
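The three cleaning cases can be written down compactly. The sketch below replays the update( TimeStamp=802) example with the inode numbers, time-stamps, and addresses from the slides; the `DirtyInode` class and `clean` function are hypothetical names. Two simplifications are assumptions of this sketch: locking is not modelled, and min/max are recomputed from the remaining records instead of being stored.

```python
from dataclasses import dataclass, field


@dataclass
class DirtyInode:
    root: object                                  # CA of the inode root, None when unknown
    records: list = field(default_factory=list)   # (timestamp, start_kb, end_kb, ca)

    @property
    def min_ts(self):
        return min(r[0] for r in self.records)

    @property
    def max_ts(self):
        return max(r[0] for r in self.records)


def clean(dirty, committed_ts):
    """Apply the three cases for a newly committed time-stamp."""
    for number in list(dirty):                    # one inode at a time
        inode = dirty[number]
        if committed_ts >= inode.max_ts:          # case 1: everything is committed
            del dirty[number]
        elif committed_ts >= inode.min_ts:        # case 2: partly committed
            inode.root = None                     # the old root CA is stale
            inode.records = [r for r in inode.records if r[0] > committed_ts]
        # case 3: committed_ts < min, all records are newer, nothing to do


dirty = {
    1: DirtyInode("CA101", [(791, 8, 18, "CA1"), (791, 18, 31, "CA2")]),
    2: DirtyInode("CA122", [(801, 8, 18, "CA3"), (803, 18, 29, "CA4")]),
    3: DirtyInode("CA303", [(805, 0, 10, "CA1"), (806, 10, 21, "CA4")]),
}
clean(dirty, 802)
```

After the call, inode #1 is gone, inode #2 keeps only its record at time-stamp 803 and has no known root, and inode #3 is untouched.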

Read Processing

read( inode=2, off=0, len=8KB)

[Figure sequence, slides 48 to 51: the read starts from the state left by metadata cleaning, where inode #2 has root= ? and one remaining record (803, [18KB, 29KB) → CA4). During the read, the root of inode #2 is filled in as CA412, and block CA412 appears in the block cache next to superblock 802.]

Read Performance

Just pre-fetch?

Problems

High latency → high read-ahead

Poor cache locality for metadata

Solutions

Separate data and meta-data pre-fetch

Weighted-LRU Policy for Block Cache

Data and Metadata Pre-fetch

Problem: time to pre-fetch a data block varies with its position in the B-tree

Compare: B1 – B2 with B4 – B5

• B1: B15 – B11 – B9 – B1

• B2: B15 – B11 – B9 – B2

• B4: B15 – B11 – B10 – B4

• B5: B15 – B14 – B12 – B5

Solution

– Pre-fetch metadata more aggressively than data

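[Figure: B-tree with root B15, internal nodes B11 and B14 below it, then B9, B10, B12 and B13, and data blocks B1 to B8 as leaves. B1 – B2 and B4 – B5 are the same file offset distance apart, but B5 is reached through B14 and B12, which is a likely cache miss.]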

Weighted LRU Policy for Block Cache

Problem: different access pattern for data and metadata blocks

– Data blocks being read

• Clean pages, pinned until read completes

• Looked-up once, then unlikely to be needed again (for streaming workloads)

– Data blocks pre-fetched

• Clean pages, not pinned

• Should avoid evicting before they are read

– Metadata blocks

• Looked-up more than once, but with large duration between accesses

Solution: cache eviction policy that favors metadata blocks

– Insert: assign weight based on block type

– Lookup: reset to initial weight, and make MRU in that bucket

– Reclaim: evict blocks with zero weight; decrease everybody else’s weight by 1
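The insert, lookup, and reclaim rules above can be sketched as a small bucketed cache. This is an illustrative reading of the policy, not the HydraFS implementation: the `WeightedLRU` class and its fields are hypothetical, pinning of pages is not modelled, and the initial weights (3 for metadata, 1 for data) are the values shown on the Weighted LRU slides.

```python
from collections import OrderedDict

INITIAL_WEIGHT = {"metadata": 3, "data": 1}  # values shown on the Weighted LRU slides


class WeightedLRU:
    def __init__(self, max_weight=3):
        # buckets[w] holds the keys whose current weight is w, LRU first.
        self.buckets = [OrderedDict() for _ in range(max_weight + 1)]
        self.weight_of = {}  # key -> current weight
        self.kind_of = {}    # key -> "data" or "metadata"

    def _remove(self, key):
        weight = self.weight_of.pop(key, None)
        if weight is not None:
            del self.buckets[weight][key]
            del self.kind_of[key]

    def insert(self, key, kind):
        self._remove(key)                  # re-inserting replaces the old entry
        weight = INITIAL_WEIGHT[kind]
        self.kind_of[key] = kind
        self.weight_of[key] = weight
        self.buckets[weight][key] = True   # appended at the MRU end of the bucket

    def lookup(self, key):
        if key not in self.weight_of:
            return False
        self.insert(key, self.kind_of[key])  # reset weight, MRU in that bucket
        return True

    def reclaim(self):
        evicted = list(self.buckets[0])      # blocks whose weight reached zero
        for key in evicted:
            self._remove(key)
        # Every remaining block moves down one bucket, keeping its LRU order.
        for weight in range(1, len(self.buckets)):
            self.buckets[weight - 1] = self.buckets[weight]
            for key in self.buckets[weight - 1]:
                self.weight_of[key] = weight - 1
        self.buckets[-1] = OrderedDict()
        return evicted


cache = WeightedLRU()
cache.insert("B15", "metadata")
cache.insert("B1", "data")
first = cache.reclaim()   # nothing at weight 0 yet; B1 drops to 0, B15 to 2
second = cache.reclaim()  # B1 is evicted; B15 drops to 1
```

A data block survives one reclamation pass after its last use and a metadata block survives three, which is how the policy favors metadata that is looked up repeatedly but with long gaps between accesses.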

Weighted LRU

Different block weights
• initial metadata block weight: 3
• initial data block weight: 1

Reclamation
• evict 0-weight blocks
• reduce all weights by 1

[Figure sequence, slides 55 to 60: blocks of the B-tree from the pre-fetch slide (root B15) sit in cache buckets labelled with the weights 0, 1, 2 and 3. Blocks enter at their initial weight. At "Reclamation !" the blocks in bucket 0 are evicted and the remaining blocks move down by one weight.]

Effectiveness of Read Path Optimizations

Main techniques

– Pre-fetch metadata more aggressively than data

– Weighted-LRU to evict data more aggressively than metadata

Experiment

– Read a large file

             Accesses   Misses (Data)   Misses (Metadata)   Throughput (MB/s)
Base         486,966    1577            1011                134.3
Optimized    211,632    438             945                 183.2
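The relative effect of the optimizations can be read off the table with a little arithmetic. The snippet below only restates the table's numbers; the variable names are chosen for this illustration.

```python
base = {"accesses": 486_966, "data_misses": 1577, "metadata_misses": 1011, "mb_per_s": 134.3}
optimized = {"accesses": 211_632, "data_misses": 438, "metadata_misses": 945, "mb_per_s": 183.2}

# Ratio of optimized to base throughput.
speedup = optimized["mb_per_s"] / base["mb_per_s"]
# Fraction of data-block cache misses that the optimizations removed.
data_miss_reduction = 1 - optimized["data_misses"] / base["data_misses"]

print(f"throughput: {speedup:.2f}x, data misses: -{data_miss_reduction:.0%}")
# prints "throughput: 1.36x, data misses: -72%"
```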

Resource Management

Pre-allocated, managed (pages for data and blocks, inodes, metadata modification records, log blocks)
Fixed-size pools of fixed-size objects
– pages are 4 KB
– inodes are 8 KB
– log blocks are up to 128 KB
– etc.

Unmanaged heap (auxiliary objects)
The number of objects is bounded by that of some managed objects

Resource Management

Resource

Event actions

Reserved: 0 Allocated: 0 Total: 10

Admission condition: Requested + Reserved + Allocated ≤ Total

Resource Management

Resource

Event actions

Reserved: 0 Allocated: 0 Total: 10

Event 1 created

Event’s requirements are determined by event type

4 + 0 + 0 ≤ 10

Resource Management

Resource

Event actions

Reserved: 4 Allocated: 0 Total: 10

Event 1 created

Event 1 admitted

Resource Management

Resource

Event actions

Reserved: 4 Allocated: 4 Total: 10

Event 1 created

Event 1 admitted

Event 1 allocates 4

Resource Management

Resource

Event actions

Reserved: 0 Allocated: 4 Total: 10

Event 1 created

Event 1 admitted

Event 1 allocates 4

Event 1 completes

Event’s reservations are released; the event may leave behind allocated resources

Resource Management

Resource

Event actions

Reserved: 0 Allocated: 4 Total: 10

Event 1 created

Event 1 admitted

Event 1 allocates 4

Event 1 completes

Event 2 created

3 + 0 + 4 ≤ 10

Resource Management

Resource

Event actions

Reserved: 3 Allocated: 4 Total: 10

Event 1 created

Event 1 admitted

Event 1 allocates 4

Event 1 completes

Event 2 created

Event 2 admitted

Resource Management

Resource

Event actions

Reserved: 3 Allocated: 6 Total: 10

Event 1 created

Event 1 admitted

Event 1 allocates 4

Event 1 completes

Event 2 created

Event 2 admitted

Event 2 allocates 2

Resource Management

Resource

Event actions

Reserved: 3 Allocated: 6 Total: 10

Event 2 allocates 2

Event 3 created

3 + 3 + 6 > 10

Event is blocked

Resource Management

Resource

Event actions

Reserved: 3 Allocated: 6 Total: 10

Event 2 allocates 2

Event 3 created

Event 4 created

Blocked behind event 3

FIFO admission

Resource Management

Resource

Event actions

Reserved: 3 Allocated: 4 Total: 10

Event 2 allocates 2

Event 3 created

Event 4 created

Event 2 frees 2

On free, admission condition re-evaluated

3 + 3 + 4 ≤ 10

Resource Management

Resource

Event actions

Reserved: 6 Allocated: 4 Total: 10

Event 2 allocates 2

Event 3 created

Event 4 created

Event 2 frees 2

Event 3 admitted

3 + 6 + 4 > 10: event 4 remains blocked

Resource Management

Resource

Event actions

Reserved: 3 Allocated: 4 Total: 10

Event 2 allocates 2

Event 3 created

Event 4 created

Event 2 frees 2

Event 3 admitted

Event 2 completes

1 + 3 + 4 ≤ 10: event 4 admitted …

On un-reserve, admission condition re-evaluated
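The walk-through on the preceding slides can be replayed with a small model. This is a sketch under stated assumptions, not the HydraFS code: the `Resource` class and its method names are hypothetical, allocation is not checked against the event's reservation, and the allocated count is tracked separately from the reserved count because that is how the slide counters behave. Event 4's request size of 3 is taken from the "3 + 6 + 4 > 10" step.

```python
from collections import deque


class Resource:
    """Tracks reserved and allocated units of one resource type."""

    def __init__(self, total):
        self.total = total
        self.reserved = 0
        self.allocated = 0
        self.waiting = deque()  # FIFO queue of (event, requested) pairs
        self.holding = {}       # admitted event -> units reserved

    def _admit_waiting(self):
        # Admit strictly in FIFO order: stop at the first event that does not fit.
        admitted = []
        while self.waiting:
            event, requested = self.waiting[0]
            if requested + self.reserved + self.allocated > self.total:
                break
            self.waiting.popleft()
            self.reserved += requested
            self.holding[event] = requested
            admitted.append(event)
        return admitted

    def create(self, event, requested):
        """Queue a new event; returns the events admitted as a result."""
        self.waiting.append((event, requested))
        return self._admit_waiting()

    def allocate(self, units):
        self.allocated += units

    def free(self, units):
        """Freeing re-evaluates the admission condition."""
        self.allocated -= units
        return self._admit_waiting()

    def complete(self, event):
        """Completion releases the reservation and re-evaluates admission."""
        self.reserved -= self.holding.pop(event)
        return self._admit_waiting()


r = Resource(total=10)
r.create("event1", 4)              # 4 + 0 + 0 <= 10, admitted
r.allocate(4)
r.complete("event1")               # reservation released, allocation stays
r.create("event2", 3)              # 3 + 0 + 4 <= 10, admitted
r.allocate(2)
blocked3 = r.create("event3", 3)   # 3 + 3 + 6 > 10, blocked
blocked4 = r.create("event4", 3)   # queued behind event 3 (FIFO)
after_free = r.free(2)             # event 3 admitted, event 4 still blocked
after_done = r.complete("event2")  # event 4 admitted
print(after_free, after_done, r.reserved, r.allocated)
```

Stopping at the first waiting event that does not fit is what gives FIFO admission: a small later request cannot overtake a larger earlier one.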


Resource Management - Reclamation

Reclamation processing

– First, free pages from clean cached blocks

– If not sufficient, initiate flush of dirty inodes

– Flush is an internal event with pre-reserved resources

Reclamation initiated when

– An event is blocked

– A threshold is reached

Threshold limit depends on resource type

– Metadata modification records can only be cleaned through a metadata update, so reclamation starts earlier

– Others (pages, log blocks) can be cleaned more quickly, so reclamation starts later
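The trigger and ordering rules on this slide can be sketched as two small functions. The function names, the threshold fractions, and the representation of clean blocks as a list of page counts are all hypothetical; the slide only states that the threshold depends on the resource type and that clean pages are freed before dirty inodes are flushed.

```python
# Hypothetical per-resource thresholds (fraction of the pool in use).
THRESHOLDS = {
    "metadata_records": 0.5,  # slow to clean, so start earlier
    "pages": 0.9,             # quick to clean, so start later
    "log_blocks": 0.9,
}


def should_reclaim(used, total, threshold, event_blocked):
    """Start reclamation when an event is blocked or usage reaches the threshold."""
    return event_blocked or used >= threshold * total


def reclaim_pages(needed, clean_blocks, flush_dirty_inodes):
    """Free pages from clean cached blocks first; flush dirty inodes only if short."""
    freed = 0
    while clean_blocks and freed < needed:
        freed += clean_blocks.pop()  # each entry is the page count of one clean block
    if freed < needed:
        flush_dirty_inodes()
    return freed


flushes = []
enough = reclaim_pages(3, [2, 2], lambda: flushes.append("flush"))
short = reclaim_pages(5, [2], lambda: flushes.append("flush"))
```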


Resource Management

Limits the amount of memory used (avoid swapping)

Avoids handling allocation failures in the middle of event processing

Avoids event starvation through FIFO processing

Simple but effective (allows high utilization of resources)

[Figure: Memory reserved and allocated during sequential write. Y-axis: page memory (MB), 0 to 1200; x-axis: time; series: Total, Allocated, Reserved.]


Experiments

• Requests can be rejected (“system busy”)

• Clients notified to resume submission

• Submits requests until busy, resumes as soon as notified

• Maximum concurrency; No parent-child structures

• Upper limit of performance


Tool

Flow control

API

[Figure: Throughput (MB/s), 0 to 350, versus duplicate ratio (%), 0 to 80, for the Hydra block store and HydraFS; read and write.]


Conclusions and Future Work

Conclusions

Building a filesystem for a content-addressable storage system with content-defined chunking poses interesting challenges

A small number of techniques was sufficient to overcome them while keeping the system relatively simple and achieving high throughput

Future work

Distribute the filesystem

Use SSDs to improve performance for metadata-intensive workloads


Thank you!


We are hiring! http://www.nec-labs.com/jobs/