DHT2 - O Brother, Where Art Thou? Shyamsundar Ranganathan Developer

Upload: glusterorg

Post on 08-Jan-2017



Session aims to explore...
"The hypothetical treasure at the end of the journey"

- Why DHT2 "The plan..."
- DHT2 design "Known adventures along the way!"
- Challenges in DHT2 "The strange characters"
- Challenges because of DHT2 "Trouble escaping the chain gang!"
- Where are we with DHT2

Loosely inspired by the movie: https://en.wikipedia.org/wiki/O_Brother,_Where_Art_Thou%3F

Why DHT2

DHT pitfalls
- Directories on all subvolumes
- Layout per directory
- Rebalance IO path handling and nonoptimal data movement
- This impacts scale and correctness!

Correctness can be addressed in DHT
- Broader locking semantics for dentry operations
- Possibly single layout adoption
- But this increases complexity and could cost performance!

With DHT2 the goal is to fix all of the above, while retaining or improving performance

DHT2 Design: The file system objects

- View the file system as a collection of related objects
  - "Wait a second... isn't that what inodes and data pointers are?" Yes, but they are not distributed!
- Directory objects denote hierarchy, storing <name,inode#> tables
- File object maintains inode related metadata
- Actual file data is maintained in data object(s)
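A minimal sketch of the object model described on this slide. The class names, field names and inode labels (DirObject, FileObject, DataObject, entries, resolve, "dir1", "file1") are assumptions made for illustration and are not DHT2 identifiers. It only shows that a directory is a <name, inode#> table and that file metadata and file data are separate objects.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FileObject:
    """Inode-related metadata only; the bytes live in a separate data object."""
    inode: str
    size: int = 0


@dataclass
class DataObject:
    """Actual file contents, keyed by the owning file's inode number."""
    inode: str
    payload: bytes = b""


@dataclass
class DirObject:
    """A directory is a table of <name, inode#> entries."""
    inode: str
    entries: Dict[str, str] = field(default_factory=dict)


# Client view used in the deck's example: root/{Dir1/{Dir2, File2}, File1}
root = DirObject("root", {"Dir1": "dir1", "File1": "file1"})
dir1 = DirObject("dir1", {"Dir2": "dir2", "File2": "file2"})
dir2 = DirObject("dir2")
file1 = FileObject("file1", size=5)
file2 = FileObject("file2")

# Metadata objects and data objects are kept in separate stores.
objects = {o.inode: o for o in (root, dir1, dir2, file1, file2)}
data = {"file1": DataObject("file1", b"hello")}


def resolve(path: str) -> str:
    """Walk the <name, inode#> tables from the root; return the inode number."""
    inode = "root"
    for name in filter(None, path.split("/")):
        inode = objects[inode].entries[name]
    return inode
```

For example, resolve("/Dir1/File2") walks two directory tables and returns the inode label of File2; the file's bytes, if any, are then looked up separately in the data store.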

The file system objects (example)

Client View
. ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1

[Figure, shown in three builds: the client view above drawn as Dir Objects (root, Dir1, Dir2), File Objects (File1, File2) and Data Objects, split into an inodes/dinode column and a File data column.]

Build 1: The different objects, segregated by type
Build 2: Namespace hierarchy representation
Build 3: Data association

DHT2 Design: Distribution details

- Distribute inodes using GFID in the metadata ring
  - No hierarchy, a directory object lives only on one subvolume
- Use GFID as the data object# in the data ring
- Distribution is hence not name dependent, and we just use a single layout per ring
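A rough sketch of what "not name dependent" means in practice. The brick names, the modulo placement rule and the crc32 stand-in for a name hash are all assumptions made for illustration; the deck's own layout rule is the bucket table described on the "Distribution details (contd.)" slide. The point is only that placement is a function of the GFID, with one rule per ring, so a rename does not move an object.

```python
import zlib
from typing import List

# Hypothetical brick names: few metadata bricks, many data bricks.
METADATA_RING = ["meta-0", "meta-1"]
DATA_RING = ["data-0", "data-1", "data-2", "data-3"]


def place_by_name(name: str, ring: List[str]) -> str:
    """Name-dependent placement: renaming the entry can change the target."""
    return ring[zlib.crc32(name.encode()) % len(ring)]


def place_by_gfid(gfid: str, ring: List[str]) -> str:
    """GFID-dependent placement: one rule for every object on the ring."""
    return ring[int(gfid, 16) % len(ring)]


gfid = "BAC5"  # File2 in the deck's example
inode_home = place_by_gfid(gfid, METADATA_RING)  # where the file object lives
data_home = place_by_gfid(gfid, DATA_RING)       # where the data object lives
```

The same GFID is used on both rings, so the file object and its data object can be located independently without consulting the directory hierarchy.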

Distribution details (example)

[Figure: the example objects placed on two rings, a Metadata Ring (few bricks) holding the Dir Objects and File Objects, and a Data Ring (many bricks) holding the Data Objects. Caption: Switch names to GFID, add name to dinodes]

GFIDs used in the example
- root: 0001
- Dir1: BA11
- Dir2: 7525
- File1: 00EF
- File2: BAC5

Directory entries
- root (0001): <File1, 00EF> <Dir1, BA11>
- Dir1 (BA11): <File2, BAC5> <Dir2, 7525>

Data Objects: 00EF, BAC5

Client View
. ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1

DHT2 Design: Distribution details (contd.)

- Layout is based on bucket to subvolume assignment
  - Where, buckets >> subvolumes
  - Bucket ID is encoded into first n bytes of the GFID
  - Trivial GFID based operations
- Collocates file object with parent object
  - File object# statically inherits parent directory# bucket ID
  - Optimized readdirp and lookup operations (no hopping unless non-trivially renamed, or a link file)
  - IOW, optimized (pGFID, basename) based operations
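The bucket scheme above can be sketched as follows. This is an illustration under stated assumptions, not the DHT2 implementation: n is taken as 1 byte, buckets are assigned to subvolumes round-robin, the caller picks a new directory's bucket, and every name (BUCKET_BYTES, layout, bucket_of, new_dir_gfid, new_file_gfid, subvolume_of) is hypothetical.

```python
import uuid

BUCKET_BYTES = 1                    # the "n" in "first n bytes" (assumed value)
NUM_BUCKETS = 256 ** BUCKET_BYTES   # buckets >> subvolumes

SUBVOLUMES = ["meta-0", "meta-1", "meta-2"]  # hypothetical subvolume names
# A single layout for the ring: one bucket -> subvolume table.
layout = {b: SUBVOLUMES[b % len(SUBVOLUMES)] for b in range(NUM_BUCKETS)}


def bucket_of(gfid: uuid.UUID) -> int:
    """Decode the bucket ID from the leading bytes of the GFID."""
    return int.from_bytes(gfid.bytes[:BUCKET_BYTES], "big")


def new_dir_gfid(bucket: int) -> uuid.UUID:
    """Create a GFID whose leading bytes carry the chosen bucket ID."""
    raw = bucket.to_bytes(BUCKET_BYTES, "big") + uuid.uuid4().bytes[BUCKET_BYTES:]
    return uuid.UUID(bytes=raw)


def new_file_gfid(parent: uuid.UUID) -> uuid.UUID:
    """A file object statically inherits its parent directory's bucket ID."""
    return new_dir_gfid(bucket_of(parent))


def subvolume_of(gfid: uuid.UUID) -> str:
    """Trivial GFID based lookup: decode the bucket, consult the layout."""
    return layout[bucket_of(gfid)]


# Usage: a file created under a directory lands on the directory's subvolume.
dir1 = new_dir_gfid(0xBA)
file2 = new_file_gfid(dir1)
same_home = subvolume_of(file2) == subvolume_of(dir1)
```

Because the file shares its parent's bucket, a lookup by (pGFID, basename) or a readdirp on the parent can be served by one subvolume, which is the "no hopping" case on the slide.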

Distribution details (example)

[Figure, shown in four builds: the same example objects and GFIDs on the Metadata Ring (few bricks) and the Data Ring (many bricks), now with Bricks/Subvols and Buckets drawn in. Bucket labels shown: 00, 75, BA.]

Build 1: Add bricks/subvolumes
Build 2: Assign buckets to bricks
Build 3: Place directories based on bucket encoded in the GFID
Build 4: Colocate the files under a directory with the same bucket ID

With the example GFIDs, root (0001) and File1 (00EF) share bucket 00, Dir1 (BA11) and File2 (BAC5) share bucket BA, and Dir2 (7525) is in bucket 75.

Client View
. ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1

DHT2 Design: Rebalance

- Reassign buckets to/from newer/removed subvolumes
  - fix-layout is instantaneous
  - Files travel with directories (same bucket colocation)
- Expand the cluster, but perform no rebalance
  - aka just add-brick and let min-free-disk+link-to do its job
  - This is the tough one, use layout versions/histories to pull this off?
- Split DHT2 into client-server pieces
  - Handle IO traffic, locking during rebalance
  - Better consistency model for transactions
- Ability to have different expansion strategies for the 2 rings
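The first bullet, rebalance as bucket reassignment, can be sketched like this. The function name add_subvolume, the "fair share" policy and the subvolume names are assumptions made for illustration; the slides do not specify how buckets are chosen for reassignment. The sketch shows why the layout change itself is a table edit, and that only objects in the reassigned buckets would need to move.

```python
from collections import Counter
from typing import Dict, List


def add_subvolume(layout: Dict[int, str], new_subvol: str) -> List[int]:
    """Move just enough buckets to new_subvol to even out the layout.

    Returns the bucket IDs that changed owner; only objects in those
    buckets would need to migrate.
    """
    owners = Counter(layout.values())
    target = len(layout) // (len(owners) + 1)  # fair share for the newcomer
    moved: List[int] = []
    for bucket in sorted(layout):
        if len(moved) == target:
            break
        donor = layout[bucket]
        # Take only from subvolumes still holding more than the fair share.
        if owners[donor] > target:
            layout[bucket] = new_subvol
            owners[donor] -= 1
            moved.append(bucket)
    return moved


# 12 buckets spread over 2 subvolumes, then a third subvolume is added.
layout = {b: f"subvol-{b % 2}" for b in range(12)}
moved = add_subvolume(layout, "subvol-2")
per_subvol = Counter(layout.values())
```

Since files share their parent directory's bucket, reassigning a bucket moves a directory and its files together, which matches "files travel with directories".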

Challenges in DHT2

- Rename ELOOP checking requires hierarchy
  - Object backpointers
- Time and size information should be in sync between data and metadata objects
  - Dirty inode, tracked via open fd
- Orphan GFID cleanup
  - Enter transactions/journals!
- Directories as files/in a DB
  - Reduce local FS inode proliferation
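One possible reading of the rename bullet, sketched under assumptions: if every directory object stores a backpointer to its parent's GFID, a rename can detect a loop by walking those backpointers instead of relying on a path hierarchy. The parent_of table and would_create_loop are hypothetical names, and the GFIDs are the ones from the example slides.

```python
from typing import Dict, Optional

# parent_of[gfid] is the backpointer to the parent directory; root has none.
parent_of: Dict[str, Optional[str]] = {
    "0001": None,    # root
    "BA11": "0001",  # Dir1
    "7525": "BA11",  # Dir2
}


def would_create_loop(src_dir: str, new_parent: str) -> bool:
    """True if moving src_dir under new_parent makes src_dir its own ancestor."""
    cursor: Optional[str] = new_parent
    while cursor is not None:
        if cursor == src_dir:
            return True
        cursor = parent_of[cursor]  # follow the backpointer one level up
    return False


# Moving Dir1 under its own child Dir2 must be rejected.
dir1_into_dir2 = would_create_loop("BA11", "7525")
# Moving Dir2 directly under root is fine.
dir2_into_root = would_create_loop("7525", "0001")
```

In a distributed setting each step of the walk may touch a different subvolume, which is part of why the slide lists this as a challenge.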

Challenges because of DHT2

- IO path cannot depend on hierarchy (Ex: quota)
- Quick-read cannot fetch data in lookups
- Anon-fd based operations cannot track dirty inodes
- Others
  - Will changelog play well?
  - EC has to bother with only data?
  - Tier may need a rethink
  - Sharding may accrue cost of missing anon-fd and data/metadata split of shards
- Unknowns!

Where are we with DHT2

- Introduced DHT Version 2 in Barcelona summit, 2015
- Followed up with 2 discussions upstream on core concepts [1] [2]
- Followed up with a POC and some slides/documents to demonstrate the concepts [3]
- In a limbo since then, but not out of the picture yet!
- Targeting an experimental release with 4.0

Questions?

"The treasure you seek shall not be the treasure you find."

References

[1] DHT2 Design Discussion: https://goo.gl/tLpqJO
[2] DHT2 Design Discussion, Round 2: https://goo.gl/dCAO36
[3] POC trail…: http://www.gluster.org/pipermail/gluster-devel/2015-August/046369.html

Other threads of interest:

- http://www.gluster.org/pipermail/gluster-devel/2016-March/048874.html

- http://www.gluster.org/pipermail/gluster-devel/2015-November/047098.html

- http://www.gluster.org/pipermail/gluster-devel/2015-September/046630.html