yuan zhou, [email protected], software engineer jian ... · compute distro/plugin data processing...
TRANSCRIPT
Yuan Zhou, [email protected], Software Engineer
Jian Zhang, [email protected], Senior Software Engineer
04/2016
Agenda
• Big Data Analytics over Cloud
• Deployment considerations
• Design details of BDA over Ceph* Object Storage
• Sample Performance testing and results
• Client side cache project
• Summary
Big Data Analytics over Cloud
• You, or someone at your company, is already using AWS, Azure, or Google Cloud Platform
• You are probably there for easy access to OS instances, but also for the higher-level application features, e.g. AWS EMR, RDS, or storage services
• Migrating to, or even just using, OpenStack infrastructure for these workloads means expecting the same application features, e.g. Sahara and Trove
• Writing applications is complex enough without having to manage supporting (non-value-add) infrastructure
• Similar things are happening on the cloud computing side
Big Data Analytics over Cloud: OpenStack Sahara
• Repeatable cluster provisioning and management operations
• Data processing workflows (EDP)
• Cluster scaling (elasticity)
• Storage integration (Swift, Cinder, HCFS)
• Network and security group (firewall) integration
• Service anti-affinity (fault domains & efficiency)
Big data analytics on Object Storage
• Object storage provides RESTful/HTTP access, which makes it a good choice for archival storage
• It behaves much like a key-value store and can be built on commodity hardware
• Current Hadoop solutions usually rely on HDFS
• Hadoop over Swift (SwiftFS) provides a much easier path (a configuration sketch follows)
• It saves much of the effort spent copying data between HDFS and local storage
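To make the SwiftFS path concrete, here is a minimal sketch of pointing a Hadoop job at Swift through the HCFS connector from HADOOP-8545 (hadoop-openstack). The service name, endpoint, and credentials are placeholders; property names follow the hadoop-openstack module, so check your distribution's documentation.

```java
// Minimal sketch, assuming the hadoop-openstack (SwiftFS) jar is on the
// classpath.  Service name "sahara", endpoint and credentials are
// placeholders for illustration only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SwiftFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.swift.impl",
        "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem");
    conf.set("fs.swift.service.sahara.auth.url",
        "http://keystone.example.com:5000/v2.0/tokens");
    conf.set("fs.swift.service.sahara.tenant", "demo");
    conf.set("fs.swift.service.sahara.username", "hadoop");
    conf.set("fs.swift.service.sahara.password", "secret");

    // The data set stays in the object store; no copy into HDFS is needed.
    Path input = new Path("swift://mybucket.sahara/dataset/part-00000");
    FileSystem fs = input.getFileSystem(conf);
    System.out.println("exists: " + fs.exists(input));
  }
}
```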
Agenda
• Big Data Analytics over Cloud
• Deployment considerations
• Design details of BDA over Ceph* Object Storage
• Sample Performance testing and results
• Client side cache project
• Summary
Deployment Consideration Matrix
• Distro/Plugin: Vanilla, CDH, HDP, MapR, Spark, Storm
• Compute: VM, Container, Bare-metal
• Storage: Tenant- vs. Admin-provisioned; Disaggregated vs. Collocated; HDFS vs. other options
• Data Processing API: Traditional, EDP (Sahara native), 3rd-party APIs
Storage Architecture
• Tenant provisioned (in VMs)
– HDFS in the same VMs as the computing tasks vs. in different VMs
– Ephemeral disk vs. Cinder volume
• Admin provisioned
– Logically disaggregated from the computing tasks
– Physical collocation is a matter of deployment
– For network-remote storage, Neutron DVR is a very useful feature
• A disaggregated (and centralized) storage system has significant value
– No data silos, more business opportunities
– Could leverage the Manila service
– Allows advanced solutions to be built on top (e.g. an in-memory overlay layer)
– More vendor-specific optimization opportunities
[Figure: four deployment scenarios (#1-#4), with back-end options including legacy NFS, GlusterFS, Ceph*, external HDFS, and Swift]
Scenario #1: computing and data services collocate in the same VMs
Scenario #2: the data service lives in the host world
Scenario #3: the data service lives in a separate VM world
Scenario #4: the data service lives in the remote network
Agenda
• Big Data Analytics over Cloud
• Deployment considerations
• Design details of BDA over Ceph* Object Storage
• Sample Performance testing and results
• Client side cache project
• Summary
Big data analytics on Ceph Object Storage
• Ceph provides a unified storage solution, which saves the manpower of maintaining a separate setup just for this storage.
• We built a reference solution running Hadoop over multiple Ceph* RGW instances with an SSD cache, similar to Hadoop over Swift.
• In this solution, the RGW instances act as the connectors between the compute and storage networks.
• We leverage Ceph* cache tiering to cache the data temporarily on each RGW server.
• The reasons for not using CephFS:
• CephFS is not as mature as RGW (as of Infernalis)
• Existing data center deployments may require an isolated network between storage and compute nodes
• Multi-tenancy support
Hadoop over Ceph RGW
RGWFS: a new adaptor for RGW
RGW-Proxy: schedules jobs to RGW and maintains locality
RGW: co-located with the cache-tiering OSDs and configured specially for big data access
Workflow (a control-path sketch follows the figure below):
1. The scheduler asks the RGW service where a particular block is located (control path)
– RGW-Proxy returns the closest active RGW instance(s)
2. The scheduler allocates a task on the server that is near the data
3. The task accesses the data from nearby (data path)
4. RGW gets/puts the data from the cache tier (CT), and the CT gets/puts data from the base tier (BT) if necessary (data path)
[Figure: Rack 1 runs the Ceph* RADOS cluster on an isolated network: MON, base-tier OSDs, and SSD cache-tier OSDs behind the RGW gateways; Rack 2 runs the compute servers, each with M/R tasks on RGWFS/FileSystemInterface plus a modified RGW, coordinated by the scheduler and the new RGW-Proxy through steps 1-4 above]
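A rough control-path sketch of step 1: RGW-Proxy is a new component and its API is not specified here, so the endpoint, query parameters, and response format below are hypothetical placeholders. The point is only that the scheduler asks the proxy where a block lives before placing the task.

```java
// Hypothetical control-path sketch only: RGW-Proxy's real API is not defined
// in this talk, so URL, parameters and response format are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class LocateBlock {
  public static void main(String[] args) throws Exception {
    // Step 1: ask the proxy which RGW instance is closest to this block.
    URL url = new URL("http://rgw-proxy.example.com:8080/locate"
        + "?container=mybucket&object=file1&offset=0&length=67108864");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      // Step 2: the scheduler places the task on (or near) the returned host.
      System.out.println("closest RGW: " + in.readLine());
    }
  }
}
```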
Data Mapping
[Figure: component mapping between the two HCFS back-ends: an M/R job runs on HCFS (HDFS or RGWFS); the NameNode role maps to RGW-Proxy, the DataNode role maps to the RGW instances with their caches, and RADOS holds the data]
API Mapping
getFileBlockLocations
– RGWFS: call RGW-Proxy (RESTful) to get the block locations, i.e. the closest RGW instance (see the sketch after the table)
– HDFS: call the NameNode to get the block locations
append
– RGWFS: not supported
– HDFS: supported
rename
– RGWFS: copy + delete
– HDFS: change the metadata in the NameNode
OutputStream
– RGWFS: 1. Partition the file and upload it as multiple parts if the file size exceeds the limit. 2. Send the data to the endpoint of an RGW instance (configured in the authentication server). 3. The RGW instance streams the data into the RADOS cluster.
– HDFS: 1. Data is split into packets (typically 64 KB), each composed of 512-byte chunks plus a checksum. 2. Packets are sent to the primary DataNode (obtained from the NameNode), which replicates the data to the other nodes.
InputStream
– RGWFS: 1. Get the block locations from RGW-Proxy (the nearest RGW instance). 2. Get the data from that RGW instance.
– HDFS: 1. Get the block locations from the NameNode (and cache them). 2. Get the data from the DataNodes.
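Both columns are reached through the same HCFS call. The sketch below shows the consumer side with the standard Hadoop FileSystem API; the rgw:// path assumes the RGWFS adaptor is registered (e.g. via fs.rgw.impl) and the URL is illustrative.

```java
// Sketch of the consumer side of getFileBlockLocations: the same HCFS call
// works whether the file is in HDFS or behind RGWFS.  Assumes the (new)
// RGWFS adaptor is on the classpath and registered for the rgw:// scheme.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("rgw://mybucket/file1");      // illustrative URL
    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      // For RGWFS these hosts are the closest RGW instances; for HDFS they
      // are the DataNodes holding the replicas.
      System.out.println(b.getOffset() + "+" + b.getLength()
          + " -> " + String.join(",", b.getHosts()));
    }
  }
}
```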
RGWFS – a new adaptor for HCFS
New filesystem URL scheme: rgw://
1. Forked from HADOOP-8545 (SwiftFS)
2. Hadoop is able to talk to an RGW cluster with this plugin
3. A new 'block' concept was added, since Swift doesn't support blocks
– Thus the scheduler can use multiple tasks to access the same file (see the sketch after the figure)
– Based on the location, RGWFS is able to read from the closest RGW through the range GET API
[Figure: the scheduler talks to RGWFS via the FileSystemInterface, which reaches file1 through rgw:// on the RGW instances backed by RADOS]
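A minimal sketch of that block concept: a fixed logical block size (64 MB here, purely illustrative) lets the scheduler cut one object into several tasks, each of which later fetches its slice with a range GET.

```java
// Sketch of the logical "block concept" layered on a block-less object
// store.  The 64 MB block size and the object size are illustrative.
public class LogicalBlocks {
  static final long BLOCK_SIZE = 64L * 1024 * 1024;

  public static void main(String[] args) {
    long objectSize = 300L * 1024 * 1024;              // size of file1 in RGW
    long blocks = (objectSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (long i = 0; i < blocks; i++) {
      long start = i * BLOCK_SIZE;
      long end = Math.min(start + BLOCK_SIZE, objectSize) - 1;
      // One map task per logical block, each reading its own byte range.
      System.out.printf("task %d -> Range: bytes=%d-%d%n", i, start, end);
    }
  }
}
```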
RGW-Proxy – Give out the closest RGW instance
Before a get/put, RGWFS asks RGW-Proxy for the location of each block:
1. A topology file of the cluster is generated beforehand
2. RGW-Proxy first gets the manifest from the head object (librados + getxattr)
3. Based on the CRUSH map, RGW-Proxy then gets the location of each object block (ceph osd map)
4. RGW-Proxy derives the closest RGW from the data OSD info and the topology file (a simple lookup); a sketch follows the figure
[Figure: the scheduler and RGWFS query the new RGW-Proxy before reading file1 from the RGW instances backed by RADOS]
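A very rough sketch of steps 3-4, using the stock ceph CLI instead of librados for brevity: it maps one tail RADOS object to its acting OSDs and then to a host through an operator-generated topology file. Pool name, object name, file path, and output parsing are illustrative, and the real RGW-Proxy would also decode the head object's manifest (step 2) to enumerate the tail objects.

```java
// Illustrative sketch only: resolves a RADOS object to its primary OSD with
// "ceph osd map", then to a host via a hypothetical topology file
// (osd.<id>=<hostname>).  Parsing and paths are simplified placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class OsdLookup {
  public static void main(String[] args) throws Exception {
    String pool = "default.rgw.buckets.data";          // illustrative pool
    String object = "mybucket_file1_part0001";         // a tail RADOS object

    Process p = new ProcessBuilder("ceph", "osd", "map", pool, object).start();
    String line;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream()))) {
      line = r.readLine();   // "... up ([2,5], p2) acting ([2,5], p2)"
    }
    String primaryOsd = line.replaceAll(".*acting \\(\\[(\\d+).*", "$1");

    // Topology file generated by the operator beforehand: osd id -> hostname.
    Properties topo = new Properties();
    topo.load(Files.newBufferedReader(
        Paths.get("/etc/rgw-proxy/topology.properties")));
    System.out.println("closest RGW host: " + topo.getProperty("osd." + primaryOsd));
  }
}
```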
RGW – serving the data requests
1. Set RGW up on a cache tier so that SSDs can be used as the cache
– With a dedicated chunk size, e.g. 64 MB, since the data is usually quite big
– Tuned via rgw_max_chunk_size and rgw_obj_stripe_size
2. Based on the account/container/object name, RGW can get/put the content
– Using range reads to fetch each chunk
3. We use write-through mode here as a starting point, to sidestep the data consistency issue
[Figure: the scheduler, RGWFS, and RGW-Proxy in front of the RGW service; the modified RGW gateways sit on the SSD cache tier, with the base-tier OSDs and MON in the Ceph* RADOS cluster behind them]
Sample Get (mapper)
The scheduler asks RGW-Proxy for the location of a big data file
RGW-Proxy looks for this file in a specified container (configured by the operator) with a HEAD request to RGW
RGW then returns the RADOS object info from the manifest
RGW-Proxy locates these objects in RADOS using CRUSH
It gets the target cache-tier OSD for each object; with the OSD info, we know which RGW instance is the closest
RGW-Proxy also monitors the RGW instances and handles failover by returning only active instances
The scheduler allocates a task on the server that is near the data (the RGW instance)
RGWFS issues a range read to get the block (a sketch follows the figure)
The Ceph* cache tier automatically handles consistency here
[Figure: same two-rack architecture as above, with steps 1-4 of the Get flow annotated]
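An illustrative sketch of the data-path read from the mapper's point of view: once the closest RGW instance is known, the task fetches only its logical block with an HTTP range GET. Host, port, bucket, and the unauthenticated request are placeholders; a real deployment would sign the request with S3 or Swift credentials.

```java
// Illustrative range GET against an RGW endpoint.  Host, bucket and the
// missing authentication are placeholders for a real signed request.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeRead {
  public static void main(String[] args) throws Exception {
    long start = 0, end = 64L * 1024 * 1024 - 1;        // first 64 MB block
    URL url = new URL("http://rgw-host-2:7480/mybucket/file1");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Range", "bytes=" + start + "-" + end);

    long bytes = 0;
    try (InputStream in = conn.getInputStream()) {
      byte[] buf = new byte[1 << 16];
      for (int n; (n = in.read(buf)) > 0; ) bytes += n;
    }
    // 206 Partial Content is expected when the range is honoured.
    System.out.println(conn.getResponseCode() + ", read " + bytes + " bytes");
  }
}
```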
Sample Put (reducer)
The reducer writes its output to RGW (the endpoint is configured in the authentication server)
RGW splits the object into multiple RADOS objects (a sketch follows the figure)
The Ceph* cache tier automatically handles consistency here
[Figure: same two-rack architecture as above, with the reducer's write path annotated]
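From the task's side the write path is just the HCFS output stream; behind it, the adaptor uploads the data to an RGW endpoint, which stripes it into RADOS objects and lands it in the cache tier. A minimal sketch, assuming the RGWFS adaptor is registered for rgw:// (path and content are illustrative):

```java
// Minimal sketch of the reducer-side write through the HCFS interface,
// assuming an RGWFS adaptor is registered for the rgw:// scheme.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class ReducerOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("rgw://mybucket/output/part-r-00000");
    FileSystem fs = out.getFileSystem(conf);
    try (FSDataOutputStream os = fs.create(out, true /* overwrite */)) {
      // The adaptor streams this to RGW, which stripes it into RADOS objects.
      os.write("key\tvalue\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```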
Agenda
• Big Data Analytics over Cloud
• Deployment considerations
• Design details of BDA over Ceph* RGW
• Sample Performance testing and results
• Client side cache project
• Summary
Performance testing results
o RGWFS is better than SwiftFS but worse than HDFS
o There is still a big gap versus HDFS
o Profiling data shows the biggest overhead comes from 'rename'
o Rename on object stores requires 'copy' + 'delete', which is quite heavy
[Figure: test setups compared: disaggregated Ceph/Swift on JBOD in the host, accessed from VMs, vs. HDFS collocated inside the VMs on the Nova instance store, also on JBOD]
Dataset Phase RGWFS(sec) SwiftFS(sec) HDFS(sec)
4GB run 134 269 81
8GB run 248 369 142
16GB run 403 1449 215
32GB run 820 Not Tested 393
64GB run 1876 Not Tested 737
RGWFS vs. SwiftFS vs. HDFS: 4-node Hadoop cluster, 32 maps / 4 reduces, workload: sort
Rename in reduce
The output of the reduce function is written to a temporary location in HDFS. After completion, the output is automatically renamed from its temporary location to its final location.
Object storage cannot support rename natively, so "copy and delete" are used to implement it (see the sketch below):
HDFS rename -> change the metadata in the NameNode
Swift rename -> copy the new object and delete the old one
RGW rename -> copy the new head object and delete the old head object
Is it possible to do the 'rename' in a faster way?
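A minimal sketch of that copy + delete pattern through the standard Hadoop API, the way object-store connectors typically have to implement FileSystem.rename(); paths are illustrative and the rgw:// scheme assumes the RGWFS adaptor is registered. HDFS, by contrast, only touches NameNode metadata.

```java
// Sketch of rename implemented as copy + delete on an object store
// (illustrative paths; assumes an HCFS adaptor for the rgw:// scheme).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ObjectStoreRename {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path tmp = new Path("rgw://mybucket/job/_temporary/part-r-00000");
    Path dst = new Path("rgw://mybucket/job/part-r-00000");
    FileSystem fs = tmp.getFileSystem(conf);

    // "rename" = full data copy to the new key ...
    FileUtil.copy(fs, tmp, fs, dst,
        false /* deleteSource */, true /* overwrite */, conf);
    // ... followed by deleting the old key.  The cost grows with the data
    // size, unlike HDFS's metadata-only rename.
    fs.delete(tmp, false);
  }
}
```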
Agenda
• Big Data Analytics over Cloud
• Deployment considerations
• Design details of BDA over Ceph* RGW
• Sample Performance testing and results
• Client side cache project
• Summary
Background
• There is strong demand for SSD caching in Ceph clusters
• Ceph's SSD caching performance has gaps
• Journal, cache tiering, and Flashcache/bcache do not work well
• Hyper-converged architectures are also very popular
Existing cache solutions in OpenStack
Server-side caching:
• Flashcache
• Bcache
• EnhanceIO
• External XFS journal on SSD
• Ceph Cache Tiering
A local cache on the compute node is missing.
Client Side Cache Proposal Overview
Target:
• Building a cache layer for Ceph
• Local read cache and distributed write cache
• Independent cache layer
• Started with Ceph RBD
• Extensible framework
• Pluggable design, supporting third-party software
• General caching interface: memcached-like API (a sketch follows the figure)
• Advanced data services
• Deduplication/Compression
[Figure: write caching and read caching (with dedup) layers: Layer 1 in-memory, Layer 2 SSD, on top of the persistence layer]
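A hypothetical sketch of that memcached-like caching interface: a small get/put/delete contract that pluggable back-ends (the in-memory layer, the SSD layer, or third-party software) could implement. The names are illustrative and not an existing Ceph API.

```java
// Hypothetical, illustrative cache contract; not an existing Ceph API.
public interface BlockCache {
  /** Returns the cached block, or null on a miss. */
  byte[] get(String key);

  /** Inserts or replaces a block; may spill from memory to the SSD layer. */
  void put(String key, byte[] value);

  /** Invalidates a block, e.g. after the backing image is overwritten. */
  void delete(String key);
}
```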
Client Side Cache Proposal Overview
• Example cache layer for Ceph RBD
• Writes go to a log-structured object store: a local SSD copy + a remote SSD copy
• Content-addressable (CAS) read cache: memory layer + SSD layer
[Figure: compute nodes host per-VM read caches and a distributed write cache (local SSD copy + remote SSD copy, with metadata); dedup/compression is applied before the SSD caching tier, which sits in front of the HDD capacity tier of OSDs]
Client side cache Preliminary performance
We did some 4K RW performance testing with early code:
• A small Ceph cluster set up with vstart
• RBD pool created on a ramdisk, with a single copy
• The client cache used a memcached backend, single copy
• All accesses were cache hits
The initial data showed a 1.6x improvement:
• Our code path is simpler than the RBD code path
• With the new code path, memcached looks like the new bottleneck
[Chart: Client Side Cache Performance: throughput (IOPS) and latency (ms) for a single-copy in-memory RBD pool vs. a single-copy client cache with write-back]
Agenda
• Big Data Analytics over Cloud
• Deployment considerations
• Design details of BDA over Ceph* RGW
• Sample Performance testing and results
• Client side cache project
• Summary
Summary
Running big data analytics on the cloud will ease your work
However, each deployment option has its pros and cons
The performance penalty is big; optimization is needed
Compared with a local HDFS setup, performance is about 50% worse; the biggest gap is in the 'rename' step
Intel is optimizing Ceph for the big data world
Q&A
Compute Engine
o Containers seem promising but still need better support
o Determining the appropriate cluster size is always a challenge for tenants
o e.g. a small flavor with more nodes or a large flavor with fewer nodes
VM
– Pros: best support in OpenStack; strong security
– Cons: slow to provision; relatively high runtime performance overhead
Container
– Pros: light-weight, fast provisioning; better runtime performance than VMs
– Cons: nova-docker readiness; Cinder volume support not ready yet; weaker security than VMs; not the ideal way to use containers
Bare-Metal
– Pros: best performance and QoS; best security isolation
– Cons: Ironic readiness; worst efficiency (e.g. consolidation of workloads with different behaviors); worst flexibility (e.g. migration); worst elasticity due to slow provisioning