Ozone - Object Store for Apache Hadoop


Page 1: Ozone- Object store for Apache Hadoop


Ozone – Object Store for Apache Hadoop

Anu Engineer [email protected]
Arpit Agarwal [email protected]

Page 2: Ozone- Object store for Apache Hadoop


Ozone – Why an Object Store

⬢ With workloads like IoT we are looking at scaling to trillions of objects.

⬢ Apache HDFS is designed for large objects – not for many small objects.
– Small files create memory pressure on the namenode.

⬢ Each small file creates a block in the datanode.
– Datanodes send all block information to the namenode in BlockReports.

⬢ Both of these create scalability issues on the Namenode.

⬢ Metadata in memory is the strength of the original GFS and HDFS design, but also its weakness in scaling the number of files and blocks.

⬢ An object store has simpler semantics than a file system and is easier to scale.

Apache Hadoop, Hadoop, Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.

Page 3: Ozone- Object store for Apache Hadoop


Ozone – Why an Object Store (continued)

⬢ Ozone attempts to scale to trillions of objects.
– This presentation is about how we will get there.

⬢ Ozone is built on a distributed metadata store.

⬢ Avoids any single server becoming a bottleneck.

⬢ More parallelism is possible in both data and metadata operations.

⬢ Built on well-tested components and understood protocols.
– RAFT for consensus.
• RAFT is a protocol for reaching consensus among a set of machines in an unreliable environment where machines and the network may fail.
– An off-the-shelf key-value store like LevelDB.
• LevelDB is an open-source standalone key-value store built by Google.

Page 4: Ozone- Object store for Apache Hadoop


Alternative solutions to NameNode scalability

⬢ HDFS federation aims to address namespace and block space scalability issues.
– Federation deployment and planning adds complexity.
– Requires changes to other components in the Hadoop stack.

⬢ HDFS-8286 - Partial Namespace In Memory.
– Proposal to keep the active working set of the namespace in memory.

⬢ HDFS-5477 - Block Management as a Service.
– Proposed solution for the block space scalability issue.

⬢ Ozone borrows many ideas from these and is a superset of these approaches.

Page 5: Ozone- Object store for Apache Hadoop


Presentation Outline

Ozone Introduction
Ozone Architectural Overview
Containers
Ozone - Bringing it all together
Bonus Slides - if we have time

Page 6: Ozone- Object store for Apache Hadoop


Ozone - Introduction

⬢ An Ozone URL
– http://hostname/myvolume/mybucket/mykey

⬢ An S3 URL
– http://hostname/mybucket/mykey

⬢ An Azure URL
– http://hostname/myaccount/mybucket/key
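
To make the naming scheme concrete, here is a minimal sketch of splitting an Ozone URL path into its volume, bucket and key parts; the helper class is illustrative, not part of any Ozone client library.

    // Minimal sketch (not an Ozone API): splitting the path of an Ozone URL
    // such as http://hostname/myvolume/mybucket/mykey into its three parts.
    import java.net.URI;

    public class OzonePathParser {
      public static void main(String[] args) {
        URI uri = URI.create("http://hostname/myvolume/mybucket/mykey");
        // Path is "/myvolume/mybucket/mykey"; limit=3 keeps any '/' inside the key intact.
        String[] parts = uri.getPath().substring(1).split("/", 3);
        String volume = parts[0];  // "myvolume" - admin-managed storage volume
        String bucket = parts[1];  // "mybucket" - holds keys and objects
        String key    = parts[2];  // "mykey"   - unique within the bucket
        System.out.printf("volume=%s bucket=%s key=%s%n", volume, bucket, key);
      }
    }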

Page 7: Ozone- Object store for Apache Hadoop


Ozone - Definitions

⬢ Storage Volume
– A notion similar to an account.
– Allows admin controls on usage of the object store, e.g. storage quota.
– Created and managed by admins only.

⬢ Bucket
– Consists of keys and objects.
– Similar to a bucket in S3 or a container in Azure.
– Has ACLs.

Page 8: Ozone- Object store for Apache Hadoop


Ozone - Definitions (continued)

⬢ Storage Key
– Unique within a bucket.

⬢ Object
– A value in a bucket.
– Each corresponds to a unique key within the bucket.

Page 9: Ozone- Object store for Apache Hadoop


Ozone - REST API

⬢ POST - Creates volumes and buckets.
– Only admins can create volumes.
– A bucket can be created by the owner of the volume.

⬢ PUT - Updates volumes and buckets, and creates keys.
– Only admins can change some volume settings.
– Buckets have ACLs.
– Creates keys.

Page 10: Ozone- Object store for Apache Hadoop


Ozone - REST API (continued)

⬢ GET - Lists volumes and buckets, and reads keys.
– List volumes.
– List buckets.
– Get keys.

⬢ DELETE - Deletes volumes, buckets and keys.
– Delete volumes.
– Delete buckets.
– Delete keys. (All four verbs are sketched below.)
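
As an illustration of the REST surface above, the following sketch issues the four verbs with java.net.HttpURLConnection. The host and the header-free requests are assumptions for illustration, based only on the URL scheme shown earlier, not on a documented Ozone endpoint.

    // Minimal sketch of the REST calls above; "hostname" and the paths are
    // placeholders following the earlier URL scheme, not a documented endpoint.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class OzoneRestSketch {
      static int call(String method, String path, byte[] body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://hostname" + path).openConnection();
        conn.setRequestMethod(method);
        if (body != null) {
          conn.setDoOutput(true);
          try (OutputStream out = conn.getOutputStream()) { out.write(body); }
        }
        return conn.getResponseCode();
      }

      public static void main(String[] args) throws Exception {
        call("POST", "/myvolume", null);                   // admin creates a volume
        call("POST", "/myvolume/mybucket", null);          // volume owner creates a bucket
        call("PUT",  "/myvolume/mybucket/mykey",           // create (or overwrite) a key
             "hello ozone".getBytes(StandardCharsets.UTF_8));
        call("GET",    "/myvolume/mybucket/mykey", null);  // read the key
        call("DELETE", "/myvolume/mybucket/mykey", null);  // remove the key
      }
    }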

Page 11: Ozone- Object store for Apache Hadoop


Ozone Components

⬢ Containers – Actual storage locations on datanodes.
– We acknowledge the term container is overloaded. No relation to YARN containers or LXC.
– For now, assume a container is a collection of blocks on a datanode.
– A containers deep dive follows.

⬢ Ozone Handler - REST frontend for the Ozone REST protocol, deployed on datanodes.

⬢ Storage Container Manager (SCM) - Manages the container life cycle.

⬢ Ozone Key Space Manager (KSM) - Maps Ozone entities to containers.

⬢ Container Client - Talks to the KSM to discover the location of a container and sends IO requests to the appropriate container.

Page 12: Ozone- Object store for Apache Hadoop


Ozone Overview

Page 13: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager

Page 14: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager

⬢ Key to container mapping service.

⬢ Keeps the key-range to container mapping in memory.
– Θ(number of containers): a 1 exabyte cluster = 200M containers x 5GB each.
– Memory usage scales with the number of containers, not the number of keys.

⬢ KSM does NOT know about all the keys in the system.

⬢ KSM state is replicated via RAFT, with NameNode-like snapshots.

Page 15: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager

⬢ KSM knows about Ozone volumes and buckets.

⬢ KSM keeps a map of volumes to containers and buckets to containers.

⬢ KSM performs a longest prefix match on a given string (see the sketch after this list).

⬢ Example: the user wants to look up a key - "/volume1/bucket1/key1".
– KSM authenticates the user, maps this key to a container and looks up the container location.
– The container client gets a token from the KSM and talks to the container on the datanode.
– The container client makes a getKey call to the datanode container with the full key path.
– The datanode validates the access token and serves the value.

⬢ The contents of a bucket may span multiple containers.
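
A minimal sketch of how a longest-prefix lookup like the one above can be implemented with a sorted map; the class, method and container names are illustrative assumptions, not KSM's actual data structures.

    // Minimal sketch of longest-prefix-match lookup over key prefixes, using a
    // sorted map. Illustrative only - not KSM's actual implementation.
    import java.util.Map;
    import java.util.TreeMap;

    public class PrefixLookupSketch {
      // Maps a key prefix (volume or bucket path) to a container ID.
      private final TreeMap<String, String> prefixToContainer = new TreeMap<>();

      void put(String prefix, String containerId) {
        prefixToContainer.put(prefix, containerId);
      }

      // Walk floor entries backwards until one is a prefix of the key, so the
      // longest matching prefix wins.
      String lookup(String key) {
        for (Map.Entry<String, String> e = prefixToContainer.floorEntry(key);
             e != null;
             e = prefixToContainer.lowerEntry(e.getKey())) {
          if (key.startsWith(e.getKey())) {
            return e.getValue();
          }
        }
        return null;
      }

      public static void main(String[] args) {
        PrefixLookupSketch ksm = new PrefixLookupSketch();
        ksm.put("/volume1/", "container-2");
        ksm.put("/volume1/bucket1/", "container-7");
        System.out.println(ksm.lookup("/volume1/bucket1/key1")); // container-7
      }
    }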

Page 16: Ozone- Object store for Apache Hadoop


KSM - Bucket spanning multiple containers

Page 17: Ozone- Object store for Apache Hadoop


Containers

Page 18: Ozone- Object store for Apache Hadoop


Container Framework

⬢ A shareable generic block service that can be used by distributed storage services.

⬢ Makes it easier to develop new storage services - BYO storage format and naming scheme.

⬢ Design goals:
– No single points of failure. All services are replicated.
– Avoid bottlenecks.
• Minimize central state.
• No verbose block reports.

⬢ Lessons learned from large-scale HDFS clusters.

⬢ Ideas from earlier proposals in the HDFS community.

Page 19: Ozone- Object store for Apache Hadoop


Container Framework Components

Page 20: Ozone- Object store for Apache Hadoop


Containers

Page 21: Ozone- Object store for Apache Hadoop


Containers

⬢ A container is the unit of replication.
– Size is bounded by how quickly it can be re-replicated after a loss.

⬢ Each container is an independent key-value store.
– No requirements on the structure or format of keys/values.
– Keys are unique only within a container.

⬢ E.g. a key-value pair could be one of:
– An Ozone key-value pair.
– An HDFS block ID and block contents.
• Or part of a block, when a block spans containers.

Page 22: Ozone- Object store for Apache Hadoop


Containers (continued)

⬢ Each container has metadata.
– Metadata consistency is maintained via the RAFT protocol.
– Metadata consists of keys and references to chunks.

⬢ Container metadata is stored in LevelDB.
– The exact choice of KV store is unimportant. LevelDB is already used by other Hadoop components.

⬢ A chunk is a piece of user data.
– Chunks are replicated via a data pipeline.
– Chunks can be of arbitrary sizes, e.g. a few KB to GBs.
– Each chunk reference is a (file, offset, length) triplet (sketched below).

⬢ Containers may garbage collect unreferenced chunks.

⬢ Each container independently decides how to map chunks to files.
– This allows reauthoring files for performance, compaction and overwrites.
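
Since the chunk reference triplet anchors much of what follows, here is a minimal sketch of it as a plain Java class; the type and field names are assumptions for illustration.

    // Minimal sketch of a chunk reference as described above: a
    // (file, offset, length) triplet. Names are illustrative assumptions.
    public final class ChunkReference {
      final String file;   // container-local file holding the chunk bytes
      final long offset;   // byte offset of the chunk within that file
      final long length;   // number of bytes in the chunk

      ChunkReference(String file, long offset, long length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
      }

      @Override
      public String toString() {
        return file + "@" + offset + "+" + length;
      }
    }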

Page 23: Ozone- Object store for Apache Hadoop


Containers (continued)

Page 24: Ozone- Object store for Apache Hadoop


Containers support simple client operations

⬢ Write chunks - streaming writes.

⬢ Put(key, List<ChunkReference>)
– The value is a list of chunk references.
– Putting a key makes previously written chunk data visible to readers.
– Put overwrites previous versions of the key.

⬢ Get(key)
– Returns a list of chunk references.

⬢ Read chunks - streaming reads.

⬢ Delete(key)

⬢ List keys. (All of these operations are collected in the sketch below.)
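
Collected as an interface, the operations above might look like the following sketch, reusing the illustrative ChunkReference type from the previous slide; the method names and signatures are assumptions, not the actual container protocol.

    // Minimal sketch of the container client operations listed above.
    // Illustrative assumptions only - not the real Ozone container protocol.
    import java.io.InputStream;
    import java.util.List;

    public interface ContainerClientSketch {
      // Streaming write of raw bytes; returns a reference to the stored chunk.
      ChunkReference writeChunk(InputStream data, long length);

      // Commits a key: the chunk list becomes visible atomically, replacing
      // any previous version of the key.
      void put(String key, List<ChunkReference> chunks);

      // Returns the chunk references for a key.
      List<ChunkReference> get(String key);

      // Streaming read of a stored chunk.
      InputStream readChunk(ChunkReference chunk);

      void delete(String key);

      List<String> listKeys(String prefix);
    }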

Page 25: Ozone- Object store for Apache Hadoop


Storage Container Manager

Page 26: Ozone- Object store for Apache Hadoop


Storage Container Manager

⬢ A fault-tolerant replicated service.

⬢ Replicates its own state using the RAFT protocol.

⬢ Provides a Container Location Service to clients.
– Given a container ID, returns a list of nodes with replicas.
– Maps a container ID to datanodes (and vice versa).

⬢ Provides cluster membership management.
– Maintains the list of live datanodes in the cluster.
– Handles heartbeats from datanodes. (Both services are sketched below.)
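
A minimal sketch of the two services described above as a Java interface; all names and signatures are illustrative assumptions, not the actual SCM protocol.

    // Minimal sketch of the SCM services above. Names and signatures are
    // illustrative assumptions, not the actual SCM protocol.
    import java.util.List;

    public interface StorageContainerManagerSketch {
      // Container Location Service: container ID -> datanodes holding replicas.
      List<String> getContainerLocations(String containerId);

      // Reverse mapping: datanode -> IDs of the container replicas it hosts.
      List<String> getContainersOnNode(String datanode);

      // Cluster membership: datanodes heartbeat in; the SCM tracks liveness.
      void handleHeartbeat(String datanode);

      List<String> getLiveDatanodes();
    }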

Page 27: Ozone- Object store for Apache Hadoop


Storage Container Manager (continued)

⬢ Provides replication services.
– Detects lost container replicas and initiates re-replication.
– Containers send a container report.
• Unlike HDFS block reports, which include details about each block, a container report is a summary.
• This is used by the KSM for placement of containers.

⬢ If a node suffers a disk failure or is lost, reconstruction is a local activity coordinated via RAFT running on the datanodes.

Page 28: Ozone- Object store for Apache Hadoop


Storage Container Manager

⬢ Maintains pre-created containers.

⬢ Collects container operation statistics.

⬢ Decides which datanodes form the replication set for a given container.
– The number of replication sets in a cluster is bounded.
– Borrows from the work done by Facebook and RAMCloud (Copysets, Cidon et al. 2013; sketched below).

⬢ Important - knows nothing about keys.
– Does NOT provide a naming service (mapping keys to containers).
– E.g. the KSM provides naming for Ozone.
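
For the bounded replication sets, here is a minimal sketch of copyset-style placement in the spirit of Cidon et al. 2013; the parameters and structure are illustrative assumptions, not SCM's actual implementation.

    // Minimal sketch of copyset-style placement (Cidon et al. 2013): bound the
    // number of distinct replication sets by pre-generating them from random
    // permutations, then place each container on one randomly chosen copyset.
    // Illustrative assumptions only - not SCM's actual implementation.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class CopysetSketch {
      public static void main(String[] args) {
        int replicationFactor = 3;  // R: replicas per container
        int scatterWidth = 4;       // S: how many other nodes share data with a node
        // P = ceil(S / (R - 1)) permutations, as in the Copysets paper.
        int permutations = (int) Math.ceil(
            (double) scatterWidth / (replicationFactor - 1));

        List<String> nodes = new ArrayList<>();
        for (int i = 0; i < 9; i++) nodes.add("dn" + i);

        Random rnd = new Random(42);
        List<List<String>> copysets = new ArrayList<>();
        for (int p = 0; p < permutations; p++) {
          List<String> shuffled = new ArrayList<>(nodes);
          Collections.shuffle(shuffled, rnd);
          // Chop the permutation into disjoint groups of size R.
          for (int i = 0; i + replicationFactor <= shuffled.size();
               i += replicationFactor) {
            copysets.add(shuffled.subList(i, i + replicationFactor));
          }
        }

        // Place a new container on one randomly chosen copyset.
        List<String> replicaSet = copysets.get(rnd.nextInt(copysets.size()));
        System.out.println("replication set: " + replicaSet);
      }
    }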

Page 29: Ozone- Object store for Apache Hadoop


Conceptual Representation of Ozone and Container State

Page 30: Ozone- Object store for Apache Hadoop


Ozone - Bringing it all together

Page 31: Ozone- Object store for Apache Hadoop


Bringing it all together - Ozone createVolume operations

[Sequence diagram. Participants: Client, Ozone Handler, Key Space Manager, Container Manager, and DataNodes hosting replicated containers (sending heartbeats & reports). Steps: 1: createVolume; 2: Lookup(volName, Operation); 3: getContainer; 4: putMetadata(VolumeName, Properties).]

Page 32: Ozone- Object store for Apache Hadoop


Tracing an Ozone PutKey

[Sequence diagram. Participants: Client, Ozone Handler, Key Space Manager, and DataNodes hosting replicated containers. Steps: 1: Ozone - putKey; 2: Lookup(keyName, Operation); 4: putData(File, offset, Length, data); 5: putMetadata(key, List<chunks>). A sketch of this flow follows.]
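
Stitched together in code, the traced flow might look like the following sketch, reusing the illustrative ContainerClientSketch and ChunkReference types from earlier slides; all names are assumptions, not the real client APIs.

    // Minimal sketch of the putKey flow traced above. All names are
    // illustrative assumptions, not the real Ozone client APIs.
    import java.io.ByteArrayInputStream;
    import java.util.List;

    public class PutKeySketch {
      // A stand-in for the KSM lookup in step 2.
      interface KeySpaceManagerSketch {
        ContainerClientSketch lookup(String key);
      }

      static void putKey(KeySpaceManagerSketch ksm, String fullKey, byte[] data) {
        // 2: the KSM maps the key to a container and returns its location.
        ContainerClientSketch container = ksm.lookup(fullKey);
        // 4: stream the data into one or more chunks on the datanode.
        ChunkReference chunk =
            container.writeChunk(new ByteArrayInputStream(data), data.length);
        // 5: commit the key; only now does the data become visible to readers.
        container.put(fullKey, List.of(chunk));
      }
    }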

Page 33: Ozone- Object store for Apache Hadoop


Tracing an Ozone createVolume

Ozone REST handler code:

    OzoneVolume vol = new VolumeBuilder(pipeLine)
        .setCreated(new Date())
        .setOwnerName("bilbo")
        .setClient(client)
        .setName("shire")
        .build();

REST request:

    POST /shire

Container wire and storage format:

    keyData = {ContainerKeyData}
      keyName = "shire"
      containerName = "OzoneContainer"
      metadata =
        0. "Created"   -> "1449533074362"
        1. "CreatedBy" -> "gandalf"
        2. "Key"       -> "VOLUME"
        3. "Owner"     -> "bilbo"

Page 34: Ozone- Object store for Apache Hadoop


Ozone Metadata operations

⬢ Any metadata write to a container is replicated via RAFT.

⬢ The machines forming the replication set for a container comprise a pipeline.

⬢ A createVolume call reduces to a putKey operation on the container.

⬢ putKey is consistent, atomic and durable.

⬢ All metadata operations are done via putKey, getKey and deleteKey.

⬢ Data is written to one or more chunks, and a key is updated to point to those chunks.

⬢ Updating the key makes the data visible in the namespace.

Page 35: Ozone- Object store for Apache Hadoop


Current State of Ozone

⬢ Standalone container framework.

⬢ Single-node Ozone using the container framework.

⬢ Full REST API -- command line tools and client libs are fully functional.

⬢ Active development in branch HDFS-7240.

⬢ Work in progress:
– SCM
– KSM
– Replication pipeline
– RAFT

Page 36: Ozone- Object store for Apache Hadoop


Acknowledgements

⬢ Ozone is being designed and developed by Jitendra Pandey, Chris Nauroth, Tsz Wo (Nicholas) Sze, Jing Zhao, Suresh Srinivas, Sanjay Radia, Anu Engineer and Arpit Agarwal.

⬢ The Apache community has been very helpful, and we were supported by comments and contributions from Kanaka Kumar Avvaru, Edward Bortnikov, Thomas Demoor, Nick Dimiduk, Chris Douglas, Jian Fang, Lars Francke, Gautam Hegde, Lars Hofhansl, Jakob Homan, Virajith Jalaparti, Charles Lamb, Steve Loughran, Haohui Mai, Colin Patrick McCabe, Aaron Myers, Owen O'Malley, Liam Slusser, Jeff Sogolov, Enis Soztutar, Andrew Wang, Fengdong Yu, Zhe Zhang & khanderao.

Page 37: Ozone- Object store for Apache Hadoop


Thank You

Page 38: Ozone- Object store for Apache Hadoop


Bonus Slides

Page 39: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager - Dynamic Container Partitioning

⬢ KSM deals with dynamic partitioning of containers.

⬢ Let us say a user starts by uploading all their photographs to a bucket in Ozone.

⬢ Since all the photographs are named IMG_* (thanks, Apple), we will soon overflow the 5GB capacity of the container.

⬢ At this point we need to split the container.

Page 40: Ozone- Object store for Apache Hadoop


Ozone KSM - Dynamic Container Partitioning

⬢ The container client attempts to write the Nth Ozone key, IMG_N, and gets a partition-required error.

⬢ The container client takes that error and returns the info to the KSM.

⬢ The error contains the proposed split -- that is, IMG_0-IMG_200 will stay in this container and IMG_201-IMG_400 will move to the next container.

⬢ Note: the KSM initiates container partitioning, but the mechanics of the split are handled by the container layer.

Page 41: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager - Dynamic Container Partitioning

⬢ One of the assumptions we have made about a container split is that the splits are on the same datanode as the original container.

⬢ This allows us to reduce a split operation to a copy of keys from one LevelDB to another LevelDB.

⬢ If we need to move actual file data from one datanode to another -- we do support container moves. However, they are slow.

⬢ A split, on the other hand, will complete in seconds in most cases.

⬢ The split point is chosen by the container: picking the 50th percentile position gives us a reasonable chance at an equal partition of the container.

⬢ The KSM does not know about the keys or the actual data sizes until much later.

⬢ So it always relies on the container to tell it where the split should be.

Page 42: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager - Dynamic Container Partitioning

⬢ A container split is completed in the KSM by updating the tree: the range-partition key we maintain gets updated to reflect that keys {IMG_0 - IMG_200} are on container C1, and keys {IMG_201 - IMG_Z} are on C2. (See the sketch below.)

⬢ The container updates the SCM when the split is done.

⬢ This information is learned and maintained by the KSM.
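
Continuing the illustrative sorted-map sketch from slide 15, a split reduces to replacing one range entry with two; the key names and map layout are assumptions for illustration.

    // Minimal sketch of applying a container split in the KSM's in-memory
    // range map. Illustrative assumptions - not KSM's actual code.
    import java.util.TreeMap;

    public class SplitSketch {
      public static void main(String[] args) {
        // Range map: first key of a range -> container holding that range.
        TreeMap<String, String> rangeToContainer = new TreeMap<>();
        rangeToContainer.put("/photos/IMG_0", "C1");   // initially one container

        // The container reports the split point (around the median key), and
        // the KSM updates the tree: IMG_0..IMG_200 stay on C1, the rest on C2.
        rangeToContainer.put("/photos/IMG_201", "C2");

        // Lookups now route by range: floorEntry picks the owning container.
        System.out.println(rangeToContainer.floorEntry("/photos/IMG_150").getValue()); // C1
        System.out.println(rangeToContainer.floorEntry("/photos/IMG_305").getValue()); // C2
      }
    }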

Page 43: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager - Soft Quotas

⬢ In the HDFS world, a quota is a hard limit; HDFS is conservative in quota management.

⬢ In the Ozone world, quotas are soft: users can and will be able to violate them, but the KSM/SCM will eventually learn about it and lock the volume out.

⬢ The key reason this is different is that the KSM/SCM is not involved in the allocation of chunks.

⬢ Each container has a partial -- that is, an isolated -- view of the data in a volume. Since volumes can span many containers, it is possible for users to allocate chunks that violate the volume quota, but eventually the KSM will learn of it and disable further writes to the volume. (Sketched below.)
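
A minimal sketch of the eventual enforcement described above, assuming the KSM can sum per-container usage reports for a volume; the numbers, names and 5GB quota are illustrative.

    // Minimal sketch of soft-quota enforcement: the KSM aggregates usage
    // reported by each container after the fact and disables writes once the
    // volume total exceeds its quota. Illustrative assumptions only.
    import java.util.HashMap;
    import java.util.Map;

    public class SoftQuotaSketch {
      static final long QUOTA_BYTES = 5L * 1024 * 1024 * 1024; // 5 GB, say

      public static void main(String[] args) {
        // Usage reported by each container holding part of volume "shire".
        Map<String, Long> containerUsage = new HashMap<>();
        containerUsage.put("C1", 3L * 1024 * 1024 * 1024);
        containerUsage.put("C2", 4L * 1024 * 1024 * 1024);

        // Each container only sees its own slice, so no single container can
        // enforce the quota; the KSM learns the total eventually.
        long total = containerUsage.values().stream()
            .mapToLong(Long::longValue).sum();
        boolean volumeWritable = total <= QUOTA_BYTES;
        System.out.println("volume writable: " + volumeWritable); // false: 7 GB > 5 GB
      }
    }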

Page 44: Ozone- Object store for Apache Hadoop


Ozone Key Space Manager - Missing namespace problem

⬢ One great thing about HDFS is the NameNode.
– Despite scalability issues, in most cases the NameNode does a wonderful job.

⬢ Ozone has a subtle problem if we lose all replicas of a container.

⬢ We will not only lose data -- just as if HDFS lost all 3 of its replicas -- but we will also lose the information about which keys have been lost.

⬢ To solve this, we propose an on-disk, eventually consistent log maintained by a separate service.
– It records information about the keys that exist in the cluster.

⬢ This Scribe service logs the state of the cluster.

Page 45: Ozone- Object store for Apache Hadoop


Ozone - Range reads

⬢ Ozone supports range reads, and might support range writes like S3's multi-part upload.
– Ozone achieves this by using the chunk mechanism.
– Chunks offer a stream-like interface.
– You can seek to any location and read as many bytes as you want.
– This is what Ozone uses to support range reads. (See the sketch below.)

⬢ A periodic scanner can reclaim unreferenced chunks.
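
A minimal sketch of mapping a range read onto a key's chunk list, reusing the illustrative ChunkReference type from slide 22; not the actual Ozone code.

    // Minimal sketch of a range read over a key's chunk list: walk cumulative
    // chunk lengths to find where the requested range starts.
    import java.util.List;

    public class RangeReadSketch {
      // Returns {chunkIndex, offsetInChunk} for the chunk containing byte
      // `rangeStart` of the object.
      static long[] locate(List<ChunkReference> chunks, long rangeStart) {
        long skipped = 0;
        for (int i = 0; i < chunks.size(); i++) {
          long len = chunks.get(i).length;
          if (rangeStart < skipped + len) {
            return new long[] { i, rangeStart - skipped };
          }
          skipped += len;
        }
        throw new IllegalArgumentException("range start past end of object");
      }

      public static void main(String[] args) {
        List<ChunkReference> chunks = List.of(
            new ChunkReference("f1", 0, 4096),
            new ChunkReference("f1", 4096, 4096));
        long[] pos = locate(chunks, 5000);
        System.out.println("chunk " + pos[0] + ", offset " + pos[1]); // chunk 1, offset 904
      }
    }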

Page 46: Ozone- Object store for Apache Hadoop


Scalability - What HDFS does well

⬢ The HDFS NameNode stores all namespace metadata in memory (as per GFS).
– Scales to large clusters (5K nodes) since all metadata is in memory.
• 60K-100K tasks can share the NameNode.
• Low latency.
– Large data, if files are large.

⬢ Proof points of large data and large clusters:
– Single organizations have over 600PB in HDFS.
– Single clusters with over 200PB using federation.
– Large clusters of over 4K multi-core nodes hitting a single NN.