ceph days 2014 paul evans slide deck

BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID Paul Evans principal architect daystrom technology group [email protected] san jose 2014 ceph days

Upload: daystromtech

Post on 24-Jun-2015


DESCRIPTION

Ceph Days held in October 2014 at Brocade headquarters in Silicon Valley.

TRANSCRIPT

Page 1: Ceph Days 2014 Paul Evans Slide Deck

BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID

Paul Evans principal architect

daystrom technology group [email protected]

san jose 2014

ceph days

Page 2: Ceph Days 2014 Paul Evans Slide Deck

Why build a data grid (or data lake)?

…because we have a data FLOOD in process

Page 3: Ceph Days 2014 Paul Evans Slide Deck

indeed, we love data…

we’re good at generating more and more, but…

( we never seem to throw any of it out )

too FAST

too many VARIANTS

too MUCH

Page 4: Ceph Days 2014 Paul Evans Slide Deck

IS THE ANSWER TO ALL OF THIS… “WE NEED LESS DATA!”

are you crazy? we live to store things!

we just need better tools… (and more storage)

Page 5: Ceph Days 2014 Paul Evans Slide Deck

DATA AUTOMATION STACK

Workflow Automation

Wildly-Scalable Storage

Data Lake / Data Grid

Page 6: Ceph Days 2014 Paul Evans Slide Deck

DATA LAKE

“a storage repository that holds a vast amount of raw data in its native format until it is needed”

Page 7: Ceph Days 2014 Paul Evans Slide Deck

DATA LAKE - ORIGINS

First use credited to James Dixon, CTO at Pentaho, circa 2010

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…”

“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Page 8: Ceph Days 2014 Paul Evans Slide Deck

DATA LAKE - EXPLAINED

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
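The flat layout described above can be sketched in a few lines. This is a toy, in-memory illustration only: the `FlatLake` class, its method names, and the sample tags are hypothetical and do not come from the deck or from any Ceph or Hadoop API.

```python
import uuid


class FlatLake:
    """Toy in-memory stand-in for a flat data lake (hypothetical, for illustration)."""

    def __init__(self):
        # One flat mapping, no folders: object id -> (raw bytes, metadata tags)
        self._objects = {}

    def put(self, raw, **tags):
        """Store raw data unchanged and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            oid
            for oid, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id):
        """Return the raw bytes exactly as they were stored."""
        return self._objects[object_id][0]


lake = FlatLake()
a = lake.put(b"2014-10-01,store-7,19.99", source="pos", region="west")
b = lake.put(b'{"user": 42, "page": "/home"}', source="weblog", region="west")
c = lake.put(b"2014-10-02,store-3,5.00", source="pos", region="east")

# A "business question" narrows the lake to a smaller set by tags alone.
west_pos = lake.query(source="pos", region="west")
```

The query touches only the tags, never the payloads, which is why the later "tag it or lose it" point matters: untagged data in a flat store is effectively unfindable.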

Page 9: Ceph Days 2014 Paul Evans Slide Deck

DATA LAKE - WHY ???

?

Page 10: Ceph Days 2014 Paul Evans Slide Deck

DATA LAKE CHARACTER

Unwashed Data: schema-on-read from RAW source

Flexible Processing: batch, interactive, online, search

MetaData Dependent: tag it or lose it

Common Access: hdfs-centric toolset
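As a rough illustration of schema-on-read, the sketch below keeps the records as raw text and applies a structure only when a reader asks for one. The sample records and both reader functions are made up for this example.

```python
import csv
import io

# Raw lines are stored untouched; no schema is imposed at write time.
RAW_RECORDS = [
    "2014-10-01,store-7,19.99",
    "2014-10-02,store-3,5.00",
]


def read_sales(raw_lines):
    """Apply a sales schema at read time: date, store, amount as float."""
    for row in csv.reader(io.StringIO("\n".join(raw_lines))):
        date, store, amount = row
        yield {"date": date, "store": store, "amount": float(amount)}


def read_store_ids(raw_lines):
    """A second, independent reading of the same raw lines: store ids only."""
    return sorted({line.split(",")[1] for line in raw_lines})


total = sum(record["amount"] for record in read_sales(RAW_RECORDS))
stores = read_store_ids(RAW_RECORDS)
```

Two consumers read the same raw data with different schemas, and neither had to be anticipated when the data was written.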

…in other words: this is not a glass-house Data Mart

Page 11: Ceph Days 2014 Paul Evans Slide Deck

A REFERENCE ‘LAKE’ ARCHITECTURE

OPERATIONS | SECURITY | DATA ACCESS | GOVERNANCE | INTEGRATION

DATA MANAGEMENT

Page 12: Ceph Days 2014 Paul Evans Slide Deck

A CEPHALOPOD IN THE LAKE?

If this is import…        Use this…
Hadoop-native             HDFS
Locality-aware            HDFS
Distributed Name Svc      Ceph
Native Erasure Coding     Ceph
20% Faster *              Ceph

* on Terasort benchmark over IB, Mar 2014

Page 13: Ceph Days 2014 Paul Evans Slide Deck

(LAKE) DREDGERS

technology group

Page 14: Ceph Days 2014 Paul Evans Slide Deck

DATA GRID

“the unifying layer to how content and data are stored, protected, located and accessed”

Page 15: Ceph Days 2014 Paul Evans Slide Deck

DATA GRID - ORIGINS

The need for data grids was first recognized by the scientific community concerning climate modeling, where exchanging PB-size data sets became commonplace. Recently, large-scale instruments such as the Large Hadron Collider (LHC) at CERN are driving grid innovation.

Page 16: Ceph Days 2014 Paul Evans Slide Deck

DATA GRID - EXPLAINED

Data Grids present consistent access controls, governance, and metadata extensions to diverse storage media using a common, global interface for access and transport.

Additionally, they offer a ‘micro-service’ architecture for the creation of standard tasks & policies, which are enforced by a distributed “grid control-plane.”
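One way to picture the micro-service idea is a set of small, single-purpose tasks chained into a policy that runs whenever a named event occurs. The sketch below is a single-process toy under that assumption: the event name, the task functions, and the `enforce` helper are all hypothetical, and a real grid would run this logic in a distributed control plane.

```python
import hashlib


def checksum(obj):
    """Micro-service: record a SHA-256 digest of the object's data."""
    obj["sha256"] = hashlib.sha256(obj["data"]).hexdigest()
    return obj


def tag_size_class(obj):
    """Micro-service: tag objects over 1 KiB as 'large', the rest as 'small'."""
    obj["size_class"] = "large" if len(obj["data"]) > 1024 else "small"
    return obj


# A policy is an ordered list of micro-services bound to an event name.
POLICIES = {"on_ingest": [checksum, tag_size_class]}


def enforce(event, obj, policies=POLICIES):
    """Run every task registered for the event, in order, on the object."""
    for task in policies.get(event, []):
        obj = task(obj)
    return obj


record = enforce("on_ingest", {"name": "run-001.dat", "data": b"x" * 2048})
```

Because each task is independent, a new policy is only a new list of existing tasks, which is the appeal of standardizing them.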

Page 17: Ceph Days 2014 Paul Evans Slide Deck

DATA GRID - WHY ???

Page 18: Ceph Days 2014 Paul Evans Slide Deck

DATA GRID - ATTRIBUTES

Data Virtualization: common presentation of all content

Universe-size Namespace: for files, objects & metadata

Automation of Data Operations: distributed, scalable

Policy Mgmt/Reporting: data valuation & action triggers
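A minimal sketch of data virtualization over a single namespace follows, assuming a simple longest-prefix mount table. The logical paths and backend names are invented for illustration and are not taken from any grid product.

```python
# Hypothetical mount table: logical path prefix -> storage backend name.
MOUNTS = {
    "/grid/hot": "hispeed-tier",
    "/grid/archive": "cold-storage",
    "/grid": "default-pool",
}


def resolve(logical_path, mounts=MOUNTS):
    """Return (backend, relative path) using the longest matching prefix."""
    for prefix in sorted(mounts, key=len, reverse=True):
        if logical_path == prefix or logical_path.startswith(prefix + "/"):
            return mounts[prefix], logical_path[len(prefix):].lstrip("/")
    raise KeyError(f"no backend mounted for {logical_path}")


hot = resolve("/grid/hot/run-001.dat")
misc = resolve("/grid/misc/a/b")
```

Callers see one path space; moving data between tiers then only means changing the mount table or the path, not the client.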

Page 19: Ceph Days 2014 Paul Evans Slide Deck

CEPH MEETS GRID

implemented:

[Diagram: DATA GRID unified namespace. Labels: HiSpeed Tier, Cold Storage, Archive, Remote Cloud, CephFS & RBD, Ceph libRADOS, Direct Link, LIBRADOS + Ceph, RBD]

Page 20: Ceph Days 2014 Paul Evans Slide Deck

GRID IRON ALL-STARS

technology group

(Dan Bedard: [email protected])

Page 21: Ceph Days 2014 Paul Evans Slide Deck

TIME 2 SUMMARIZE…

We are in the midst of a Data Explosion

We need robust, expandable, yet simple solutions to store data

We also need effective, de-centralized ways to care for the data

Page 22: Ceph Days 2014 Paul Evans Slide Deck

DATA AUTOMATION STACK

Workflow Automation

Wildly-Scalable Storage

Ceph+

the SMART approach

Data Lake / Data Grid

Page 23: Ceph Days 2014 Paul Evans Slide Deck

thank you!

Paul Evans principal architect

[email protected]

technology group

san jose ceph days