
Page 1:

Yuan Zhou, yuan.zhou@intel.com, Software Engineer

Jian Zhang, [email protected], Senior Software Engineer

04/2016

Page 2:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* Object Storage

• Sample Performance testing and results

• Client side cache project

• Summary


Page 3:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* Object Storage

• Sample Performance testing and results

• Client side cache project

• Summary


Page 4:

Big Data Analytics over Cloud

• You or someone at your company is using AWS, Azure, or Google Cloud Platform

• You're probably doing it for easy access to OS instances, but also for the modern application features, e.g. AWS EMR, RDS, or storage services

• Migrating to, or even using, OpenStack infrastructure for workloads means having those application features too, e.g. Sahara and Trove

• Writing applications is complex enough without having to manage supporting (non-value-add) infrastructure

• Things are also happening on the cloud computing side


Page 5:

Big Data Analytics over Cloud: OpenStack Sahara


• Repeatable cluster provisioning and management operations
• Data processing workflows (EDP)
• Cluster scaling (elasticity)
• Storage integration (Swift, Cinder, HCFS)
• Network and security group (firewall) integration
• Service anti-affinity (fault domains & efficiency)

Page 6:

Big data analytics on Object Storage

• Object storage provides RESTful/HTTP access, which makes it a good choice for archival storage

• It is essentially a key-value store and can be built on commodity hardware

• Current Hadoop solutions usually rely on HDFS

• Hadoop over Swift (SwiftFS) provides a much easier way

– It saves much of the effort of copying data between HDFS and local storage
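To make the SwiftFS idea concrete, the sketch below lists objects in a Swift container through Hadoop's standard FileSystem API. It assumes the hadoop-openstack SwiftFS plugin (HADOOP-8545) is on the classpath and uses placeholder container/provider names; credentials and the provider endpoint would normally come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SwiftFsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed SwiftFS wiring from HADOOP-8545 (hadoop-openstack);
        // "sahara" is a hypothetical provider name defined in core-site.xml.
        conf.set("fs.swift.impl",
                 "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem");

        // swift://<container>.<provider>/<object>; container and provider are placeholders.
        Path input = new Path("swift://logs.sahara/2016/04/events.csv");
        FileSystem fs = FileSystem.get(input.toUri(), conf);

        // List the objects a MapReduce job would consume as input splits.
        for (FileStatus status : fs.listStatus(input.getParent())) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
    }
}
```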


Page 7:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* Object Storage

• Sample Performance testing and results

• Client side cache project

• Summary


Page 8:

Deployment Consideration Matrix


Storage: Tenant vs. Admin provisioned; Disaggregated vs. Collocated; HDFS vs. other options

Compute: VM, Container, Bare-metal

Distro/Plugin: Vanilla, Spark, CDH, HDP, MapR, Storm

Data Processing API: Traditional, EDP (Sahara native), 3rd-party APIs

Page 9:

Storage Architecture

• Tenant provisioned (in VM)
– HDFS in the same VMs as the computing tasks vs. in different VMs
– Ephemeral disk vs. Cinder volume

• Admin provided
– Logically disaggregated from the computing tasks
– Physical collocation is a matter of deployment
– For network-remote storage, Neutron DVR is a very useful feature

• A disaggregated (and centralized) storage system has significant value
– No data silos, more business opportunities
– Could leverage the Manila service
– Allows creating advanced solutions (e.g. an in-memory overlay layer)
– More vendor-specific optimization opportunities


[Diagram: four deployment options (#1-#4) of computing tasks and HDFS across hosts and VMs, with backend choices Legacy NFS, GlusterFS, Ceph*, external HDFS, and Swift]

Scenario #1: computing and the data service collocate in the same VMs
Scenario #2: the data service resides in the host world
Scenario #3: the data service resides in a separate VM world
Scenario #4: the data service resides in the remote network

Page 10:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* Object Storage

• Sample Performance testing and results

• Client side cache project

• Summary


Page 11:

Big data analytics on Ceph Object Storage

• Ceph provides a unified storage solution, which saves the manpower of maintaining a separate setup just for storage.

• We built a reference solution running Hadoop over multiple Ceph* RGW instances with an SSD cache, similar to Hadoop over Swift.

• In this solution, the RGW instances act as the connectors between the compute and storage networks.

• We leverage Ceph* cache tiering to cache the data on each RGW server temporarily.

• The reasons for not using CephFS:
– CephFS is not as mature as RGW (as of Infernalis)
– Existing deployments in the data center may require an isolated network between storage and compute nodes
– Multi-tenancy support


Page 12:

Hadoop over Ceph RGW

RGW-FS: a new adaptor for RGW

RGW-Proxy: schedules jobs to RGW and maintains locality

RGW: co-located with the cache-tier OSDs, and configured specially for big data access

Working flow:

1. The scheduler asks the RGW service where a particular block is located (control path; see the sketch after this list)
– RGW-Proxy returns the closest active RGW instance(s)
2. The scheduler allocates a task on the server that is near to the data
3. The task accesses the data from the nearby RGW (data path)
4. RGW gets/puts data from the cache tier (CT), and the CT gets/puts data from the base tier (BT) if necessary (data path)
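The control path in step 1 is a small REST lookup from RGWFS to RGW-Proxy. The sketch below shows what such a request could look like; the proxy URL, query parameters, and response format are all assumptions for illustration, not the actual RGW-Proxy interface.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ControlPathSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical RGW-Proxy endpoint and query format; the real service
        // only needs to answer "which RGW instance is closest to this block?".
        URI locate = URI.create(
                "http://rgw-proxy:8080/locate?bucket=warehouse"
                + "&object=dataset/part-00000&offset=134217728");

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(locate).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Assumed response body: the closest active RGW instance(s) for that block,
        // e.g. "http://server-2:7480".
        System.out.println(response.body());
    }
}
```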


[Diagram: Rack 1 holds the Scheduler (RGWFS, FileSystem interface) and the new RGW-Proxy; Rack 2 holds Server 1 and Server 2, each running M/R over RGWFS plus a modified RGW gateway; the RGW service runs on a cache tier of SSD OSDs backed by a base tier of OSDs in Ceph* RADOS with a MON, on an isolated network; arrows 1-4 mark the working flow above]

Page 13:

Data Mapping

[Diagram: data mapping through HCFS (HDFS/RGWFS); on the HDFS path an M/R job talks to the NameNode and DataNodes, while on the RGWFS path it talks to RGW-Proxy and the RGW instances (each with a cache) backed by RADOS]

Page 14:

API Mapping

getFileBlockLocations (see the sketch after this table)
– RGWFS: calls RGW-Proxy (RESTful) to get the block locations (the closest RGW instance)
– HDFS: calls the NameNode to get the block locations

append
– RGWFS: not supported
– HDFS: supported

rename
– RGWFS: copy + delete
– HDFS: changes the metadata in the NameNode

OutputStream
– RGWFS: partition the file and upload it as multiple parts if the file size is greater than the limit; send the data to the endpoint of an RGW instance (configured in the authentication server); the RGW instance streams the data to the RADOS cluster
– HDFS: data is split into packets (typically 64 KB), each comprising chunks (512 bytes plus a checksum); packets are sent to the primary DataNode (obtained from the NameNode), which replicates the data to the other nodes

InputStream
– RGWFS: get the block locations from RGW-Proxy (the nearest RGW instance); get the data from that RGW instance
– HDFS: get the block locations from the NameNode (block locations are cached); get the data from the DataNode
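As a reference for the first row, here is how a client consumes block locality through the standard HCFS call. This is only a sketch: the rgw:// path is hypothetical and assumes the RGWFS adaptor is registered; with an hdfs:// path the same code would query the NameNode instead.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalitySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical path; assumes the RGWFS adaptor is registered for the rgw:// scheme.
        Path file = new Path("rgw://warehouse/dataset/part-00000");
        FileSystem fs = FileSystem.get(file.toUri(), conf);

        FileStatus status = fs.getFileStatus(file);
        // Behind this call, RGWFS asks RGW-Proxy for the closest RGW instance,
        // while HDFS would ask the NameNode instead.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d len=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```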


Page 15:

RGWFS – a new adaptor for HCFS

New filesystem URL: rgw://

1. Forked from HADOOP-8545 (SwiftFS)
2. Hadoop is able to talk to an RGW cluster with this plugin
3. A new 'block' concept was added, since Swift doesn't support blocks
– Thus the scheduler can use multiple tasks to access the same file
– Based on the location, RGWFS is able to read from the closest RGW through the range GET API
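A minimal sketch of how a client might use the new scheme; the configuration keys (fs.rgw.impl, fs.rgw.proxy.endpoint), the adaptor class name, and the path are assumptions that mirror how SwiftFS is wired up, not the plugin's actual identifiers.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RgwFsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical registration, mirroring the SwiftFS pattern; the property
        // name and class name are assumptions, not the actual plugin's identifiers.
        conf.set("fs.rgw.impl", "org.apache.hadoop.fs.rgw.RgwFileSystem");
        // Hypothetical endpoint of the RGW-Proxy used for block-location lookups.
        conf.set("fs.rgw.proxy.endpoint", "http://rgw-proxy:8080");

        Path file = new Path("rgw://warehouse/dataset/part-00000");
        FileSystem fs = FileSystem.get(file.toUri(), conf);

        // Reads go through the closest RGW instance via HTTP range GETs.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
    }
}
```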

[Diagram: the Scheduler's RGWFS (FileSystem interface) addresses rgw:// paths served by the RGW instances over RADOS, where file1 is stored]

Page 16:

RGW-Proxy – Gives out the closest RGW instance

1. Before a get/put, RGWFS tries to get the location of each block from RGW-Proxy
– A topology file of the cluster is generated beforehand
2. RGW-Proxy first gets the manifest from the head object (librados + getxattr)
3. Then, based on the CRUSH map, RGW-Proxy gets the location of each object block (ceph osd map)
4. RGW-Proxy derives the closest RGW from the data OSD info and the topology file (a simple lookup in the topology file)
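Step 4 is essentially a table lookup. The sketch below is an illustrative stand-in (hypothetical topology contents and endpoints), not the actual RGW-Proxy code: it maps the primary OSD of an object, as reported by `ceph osd map`, to its host and returns the RGW instance co-located on that host.

```java
import java.util.HashMap;
import java.util.Map;

public class RgwProxyLookupSketch {
    // Hypothetical topology loaded from the generated topology file:
    // OSD id -> host, and host -> co-located RGW endpoint.
    private final Map<Integer, String> osdToHost = new HashMap<>();
    private final Map<String, String> hostToRgw = new HashMap<>();

    RgwProxyLookupSketch() {
        osdToHost.put(0, "server-1");
        osdToHost.put(1, "server-2");
        hostToRgw.put("server-1", "http://server-1:7480");
        hostToRgw.put("server-2", "http://server-2:7480");
    }

    /**
     * Given the primary OSD of a RADOS object (obtained via `ceph osd map`),
     * return the closest RGW instance: the one on the same host as that OSD.
     */
    String closestRgw(int primaryOsd) {
        String host = osdToHost.get(primaryOsd);
        return host == null ? null : hostToRgw.get(host);
    }

    public static void main(String[] args) {
        RgwProxyLookupSketch proxy = new RgwProxyLookupSketch();
        System.out.println(proxy.closestRgw(1)); // -> http://server-2:7480
    }
}
```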

[Diagram: the Scheduler's RGWFS consults RGW-Proxy (new) before addressing rgw:// paths served by the RGW instances over RADOS, where file1 is stored]

Page 17:

RGW – Serves the data requests

1. Set up RGW on a cache tier so that SSDs can be used as the cache
– With a dedicated chunk size, e.g. 64 MB, since the data sets are usually large
– rgw_max_chunk_size, rgw_obj_stripe_size
2. Based on the account/container/object name, RGW can get/put the content
– Using range reads to get each chunk (see the sketch below)
3. We use write-through mode here as a starting point, to bypass the data consistency issue
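Because RGW exposes an S3-compatible API, the ranged chunk read can be illustrated with a plain S3 range GET. This is only a sketch with a placeholder endpoint, bucket, and key, and it assumes the 64 MB chunk size mentioned above; it is not the RGWFS read path itself.

```java
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RangeReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder RGW endpoint; RGW speaks the S3 protocol on its gateway port.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://server-1:7480", "default"))
                .enablePathStyleAccess()
                .build();

        long chunkSize = 64L * 1024 * 1024; // assumed 64 MB chunk, per the slide
        long chunkIndex = 2;                // read the third chunk of the object

        // HTTP range GET for exactly one chunk of the (placeholder) object.
        GetObjectRequest request = new GetObjectRequest("warehouse", "dataset/part-00000")
                .withRange(chunkIndex * chunkSize, (chunkIndex + 1) * chunkSize - 1);
        try (S3Object chunk = s3.getObject(request)) {
            System.out.println("Got " + chunk.getObjectMetadata().getContentLength() + " bytes");
        }
    }
}
```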

[Diagram: the Scheduler's RGWFS talks to RGW-Proxy (new) and the modified RGW gateways; the RGW service runs on a cache tier of OSDs backed by a base tier in Ceph* RADOS with a MON]

Page 18:

Sample Get (mapper)

• The scheduler asks RGW-Proxy for the location of a big data file
• RGW-Proxy looks for this file in a specified container (configured by the operator) with a HEAD request to RGW
• RGW then returns the RADOS object info from the manifest
• RGW-Proxy looks up these objects in RADOS using CRUSH
– It gets the target OSD in the cache tier for each object; with the OSD info, we know which RGW instance is the closest
– RGW-Proxy also monitors the RGW instances and handles failover by returning only active RGW instances
• The scheduler allocates a task on the server that is near to the data (the RGW instance)
• RGWFS issues a range read to get the block
– The Ceph* cache tier automatically handles consistency here

[Diagram: same architecture as page 12, with arrows 1-4 marking the get path from the Scheduler through RGW-Proxy and the closest RGW gateway into the cache tier]

Page 19:

Sample Put (reducer)

• The reducer writes its output to RGW (configured in the authentication server), uploading large outputs in parts (see the sketch below)
• RGW splits the object into multiple RADOS objects
• The Ceph* cache tier automatically handles consistency here
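On the write path the adaptor behaves like an S3 client: large outputs are partitioned, uploaded in parts, and RGW stripes them into RADOS objects. The sketch below illustrates that with a generic S3 multipart upload against RGW; the endpoint, bucket, key, and part contents are placeholders, and this is not the actual RGWFS OutputStream code.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

public class MultipartPutSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://server-1:7480", "default"))
                .enablePathStyleAccess()
                .build();

        String bucket = "warehouse", key = "output/part-r-00000"; // placeholders
        InitiateMultipartUploadResult init =
                s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));

        List<PartETag> etags = new ArrayList<>();
        byte[] part = new byte[5 * 1024 * 1024]; // dummy 5 MB part (S3 minimum part size)
        for (int partNumber = 1; partNumber <= 2; partNumber++) {
            UploadPartResult result = s3.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket).withKey(key)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNumber)
                    .withInputStream(new ByteArrayInputStream(part))
                    .withPartSize(part.length));
            etags.add(result.getPartETag());
        }

        // Completing the upload is when RGW stitches the parts into the final object.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, key, init.getUploadId(), etags));
    }
}
```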

[Diagram: same architecture as page 12, with arrows 1-4 marking the put path from the reducer's RGWFS through the RGW gateway into the cache tier]

Page 20:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* RGW

• Sample Performance testing and results

• Client side cache project

• Summary


Page 21:

Performance testing results

o RGWFS is better than SwiftFS, but worse than HDFS
o Big gap with HDFS

o Profiling data shows the biggest overhead comes from 'rename'
o Rename on object stores requires 'copy' + 'delete', which is quite heavy


[Diagram: disaggregated Ceph/Swift storage on dedicated hosts serving compute VMs vs. HDFS running inside VMs on Nova instance store backed by JBOD]

Dataset | Phase | RGWFS (sec) | SwiftFS (sec) | HDFS (sec)
4 GB    | run   | 134         | 269           | 81
8 GB    | run   | 248         | 369           | 142
16 GB   | run   | 403         | 1449          | 215
32 GB   | run   | 820         | Not tested    | 393
64 GB   | run   | 1876        | Not tested    | 737

Test setup: RGWFS vs. SwiftFS vs. HDFS, 4-node Hadoop cluster, 32 maps / 4 reduces, workload: sort

Page 22:

Rename in reduce

The output of the reduce function is written to a temporary location in HDFS. After the job completes, the output is automatically renamed from its temporary location to its final location.

Object storage cannot support rename natively, so "copy + delete" is used to implement rename:

HDFS rename -> change the metadata in the NameNode
Swift rename -> copy the new object and delete the old one
RGW rename -> copy the new head object and delete the old one

Is it possible to do the 'rename' in a faster way? (See the sketch below.)
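To make the cost concrete, this is what a 'rename' amounts to against an S3-compatible store such as RGW: a full server-side copy followed by a delete. The endpoint, bucket, and keys below are placeholders, and this uses the generic S3 copy-then-delete pattern rather than any RGW-specific fast path.

```java
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class ObjectRenameSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://server-1:7480", "default"))
                .enablePathStyleAccess()
                .build();

        String bucket = "warehouse"; // placeholder names
        String tmpKey = "_temporary/attempt_0001/part-r-00000";
        String finalKey = "output/part-r-00000";

        // "Rename" on an object store: server-side copy of the whole object...
        s3.copyObject(bucket, tmpKey, bucket, finalKey);
        // ...followed by deleting the original. On HDFS this is a pure
        // NameNode metadata update, which is why HDFS is much faster here.
        s3.deleteObject(bucket, tmpKey);
    }
}
```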


Page 23:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* RGW

• Sample Performance testing and results

• Client side cache project

• Summary


Page 24:

Background

• A strong demand for SSD caching in Ceph clusters

• Ceph SSD caching performance has gaps
– Journal, cache tiering, and Flashcache/bcache do not work well

• Hyper-converged architecture is also very popular


Page 25:

Existing cache solutions in OpenStack

Server-side caching:
• Flashcache
• Bcache
• EnhanceIO
• External XFS journal on SSD
• Ceph cache tiering

A local cache on the compute node is missing.

Page 26:

Client Side Cache Proposal Overview

Target:

• Build a caching layer for Ceph
– Local read cache and distributed write cache

• Independent cache layer
– Started with Ceph RBD

• Extensible framework
– Pluggable design, supporting third-party S/W
– General caching interfaces: a memcached-like API (see the sketch below)

• Advanced data services
– Deduplication/Compression
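To give a feel for the 'memcached-like API' mentioned above, a minimal caching interface might look like the sketch below; the type and method names are illustrative assumptions, not the project's actual API.

```java
import java.util.Optional;

/**
 * Illustrative memcached-like interface for the proposed client-side cache.
 * Names and semantics are assumptions for the sake of the example.
 */
public interface BlockCache {
    /** Look up a cached block; empty if the block is not resident. */
    Optional<byte[]> get(String volume, long objectOffset);

    /** Insert or overwrite a block (read-cache fill, or write-back staging). */
    void put(String volume, long objectOffset, byte[] data);

    /** Drop a block, e.g. on invalidation from the cluster. */
    void evict(String volume, long objectOffset);
}
```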


[Diagram: writes go to the write cache and reads to the read cache (with dedup); both persist through Layer 1 (in-memory) and Layer 2 (SSD)]

Page 27:

Client Side Cache Proposal Overview

• Example cache layer for Ceph RBD

• Writes go to a log-structured object store: a local SSD copy plus a remote SSD copy

• CAS (content-addressed) read cache: a memory layer plus an SSD layer

[Diagram: each compute node runs VMs over a local read cache and write cache (with metadata) on SSDs; the caches sit in front of an SSD caching tier and an HDD capacity tier of OSDs, with dedup/compression applied in both tiers]

Page 28:

Client-side cache: preliminary performance

We did some 4K r/w performance testing with early code:

• Small Ceph cluster using vstart
• RBD pool created on a ramdisk, with a single copy
• Client cache used a memcached backend, single copy
• All cache hits

The initial data showed a 1.6x improvement:

• Our code path is simpler than the RBD code path
• With the new code path, memcached looks to be the new bottleneck


[Chart: "Client Side Cache Performance"; throughput (IOPS) and latency (ms) for single-copy in-memory RBD vs. single-copy client cache with write-back]

Page 29:

Agenda

• Big Data Analytics over Cloud

• Deployment considerations

• Design details of BDA over Ceph* RGW

• Sample Performance testing and results

• Client side cache project

• Summary


Page 30:

Summary

Running big data analytics on the cloud will ease your work

However, each deployment option has its pros and cons

The performance penalty is big; optimization is needed
– Compared with a local HDFS setup, performance is 50% worse; the biggest gap is in the 'rename' step

Intel is optimizing Ceph* for the big data world


Page 31:


Q&A

Page 32:


Page 33:

Compute Engine

o Containers seem to be promising but still need better support

o Determining the appropriate cluster size is always a challenge for tenants
o e.g. a small flavor with more nodes vs. a large flavor with fewer nodes

VM
– Pros: best support in OpenStack; strong security
– Cons: slow to provision; relatively high runtime performance overhead

Container
– Pros: light-weight, fast provisioning; better runtime performance than VMs
– Cons: nova-docker readiness; Cinder volume support is not ready yet; weaker security than VMs; not the ideal way to use containers

Bare-Metal
– Pros: best performance and QoS; best security isolation
– Cons: Ironic readiness; worst efficiency (e.g. consolidation of workloads with different behaviors); worst flexibility (e.g. migration); worst elasticity due to slow provisioning