Open Storage Research Infrastructure
HEPiX Spring 2019 - SDSC
Ben Meekhof
University of Michigan Advanced Research Computing
OSiRIS Technical Lead
OSiRIS Summary
OSiRIS is a pilot project funded by the NSF to evaluate a software-defined storage infrastructure for our primary Michigan research universities and beyond.
Our goal is to provide transparent, high-performance access to the same storage infrastructure from well-connected locations on any of our campuses.
⬝ Leveraging Ceph features such as CRUSH and cache tiers to place data
⬝ NFSv4 (Ganesha) with Krb5 auth and LDAP to map POSIX identity
⬝ Radosgw/S3 behind HAproxy, public and campus local endpoints
⬝ Identity establishment and provisioning of federated users (COmanage)
UM, driven by OSiRIS, recently joined Ceph Foundation: https://ceph.com/foundation
OSiRIS Summary - Structure
Single Ceph cluster (Mimic 13.2.x) spanning UM, WSU, MSU - 792 OSDs, 7 PiB
Network topology store (UNIS) and SDN rules (Flange) managed at IU
NVMe nodes at VAI used for Ceph cache tier (Ganesha NFS export to local cluster)
COmanage - Virtual Org Provisioning
When we create a COmanage COU (virtual org), the following happens (see the command sketch below):
⬝ Data pools created
⬝ RGW placement target defined to link to pool cou.Name.rgw
⬝ CephFS pool created and added to the filesystem
⬝ COU directory created and placed on the CephFS pool
⬝ Default perms/ownership set to the COU all-members group, write perms for the admins group (as a default, can be modified)
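Roughly, those steps map onto plain 'ceph' / 'radosgw-admin' calls. A minimal sketch in Python, not the actual COmanage plugin code - the COU name, pool names, PG counts, filesystem name, and mount path are all illustrative:

import os
import subprocess

def run(cmd):
    # Run an admin command and fail loudly (sketch only: no error handling or idempotency)
    subprocess.run(cmd, check=True)

cou = "Example"                  # hypothetical COU / virtual org name
fs_name = "cephfs"               # assumed filesystem name
cou_dir = f"/cephfs/cou/{cou}"   # assumed CephFS mount path on the admin host

# Data pools for the virtual org (names and PG count are illustrative)
run(["ceph", "osd", "pool", "create", f"cou.{cou}.rgw.buckets.data", "64"])
run(["ceph", "osd", "pool", "create", f"cou.{cou}.cephfs.data", "64"])

# RGW placement target so this COU's S3 data lands in its own pool
run(["radosgw-admin", "zonegroup", "placement", "add",
     "--rgw-zonegroup", "default", "--placement-id", f"cou.{cou}.rgw"])
run(["radosgw-admin", "zone", "placement", "add",
     "--rgw-zone", "default", "--placement-id", f"cou.{cou}.rgw",
     "--data-pool", f"cou.{cou}.rgw.buckets.data"])

# Add the CephFS pool to the filesystem, create the COU directory, pin it to the pool
run(["ceph", "fs", "add_data_pool", fs_name, f"cou.{cou}.cephfs.data"])
os.makedirs(cou_dir, exist_ok=True)
run(["setfattr", "-n", "ceph.dir.layout.pool", "-v", f"cou.{cou}.cephfs.data", cou_dir])
# Default group ownership/permissions on cou_dir would be set here (chown/chmod)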
OSiRIS Identity Onboarding
OSiRIS relies on other identity providers to verify users
⬝ InCommon and eduGAIN federations
The first step to using OSiRIS is to authenticate and enroll via COmanage
⬝ Users choose a COU (virtual org)
⬝ Designated virtual org admins can approve new enrollments; OSiRIS admins don’t need to be involved for every enrollment
Once enrolled, COmanage can feed information to provisioning plugins.
⬝ LDAP, Grouper are core plugins included with COmanage
⬝ We wrote a Ceph provisioner
⬝ Ceph provisioner runs ‘ceph’ and ‘radosgw-admin’ commands directly
⬝ No way in the RGW admin API to do placement settings
⬝ There used to be ceph-rest-api, now the ceph-mgr RESTful plugin, but I haven’t really looked into whether it offers what we would need (appears not)
COmanage - New User Provisioning
Grouper - VO Group Self Management
Virtual Orgs are provisioned from COmanage as Grouper stems
VO admins are given capabilities to create/manage groups under their stem
Groups become Unix group objects in LDAP, usable in filesystem permissions (example below)
Every COU (VO) has the CO_COU groups available for use by default; COmanage sets membership in these
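A minimal sketch of using such a group in filesystem permissions, assuming it is visible through NSS (e.g. via sssd); the group name and path are illustrative:

import grp
import os

# An LDAP-provisioned VO group resolves like any local Unix group, so it can be
# used in ordinary POSIX ownership/permissions on the CephFS or NFS mount.
gid = grp.getgrnam("cou.example.researchers").gr_gid
os.chown("/cephfs/cou/Example/shared", -1, gid)   # -1 keeps the current owner uid
os.chmod("/cephfs/cou/Example/shared", 0o2770)    # setgid bit so new files inherit the group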
COmanage Credential Management
COmanage Ceph Provisioner plugin provides user interface to retrieve/manage credentials
S3 metrics and throttling
We want to have a view of usage by virtual org, and also ensure fair usage of available RGW resources (bandwidth, ops)
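One way to get the per-user numbers behind such a view is the RGW usage log (assuming 'rgw enable usage log' is on). A hedged sketch, not the OSiRIS tooling - the user list and output handling are placeholders; a real collector would map users to virtual orgs and write to a time-series database:

import json
import subprocess

def rgw_usage(uid):
    # Pull the usage summary for one RGW user (op and byte counters)
    out = subprocess.run(
        ["radosgw-admin", "usage", "show", "--uid", uid, "--show-log-entries=false"],
        check=True, capture_output=True, text=True).stdout
    return json.loads(out)

for uid in ["example_user"]:   # placeholder: user list would come from LDAP/COmanage
    for entry in rgw_usage(uid).get("summary", []):
        total = entry["total"]
        print(uid, total["ops"], total["bytes_sent"], total["bytes_received"])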
Ceph-mgr influx plugin
Pool metrics and other cluster metrics are all gathered and fed to InfluxDB by a ceph-mgr plugin. Originally written by a student here at UM and contributed to Ceph (now greatly modified by others). http://docs.ceph.com/docs/mimic/mgr/influx/
(cpu metric from collectd)
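Per the linked docs, the module is enabled with 'ceph mgr module enable influx' and configured with 'ceph config set mgr mgr/influx/<option> <value>' (hostname, database, username, and so on).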
Virtual Org Storage Metrics
We isolate our virtual orgs by Ceph data pool so we can gather metrics on usage, throughput, iops at the storage level
POSIX ID Mapping
Mounting CephFS directly provides no provision for mapping user identity from external systems (posix numeric uid/gid/grouplist).
This functionality is available in NFSv4 with idmapd, and we are able to provide NFS mounts to our member campuses using their local KRB5 realms.
⬝ Setup can be an issue due to complexity and requirement of client principals
⬝ Used to provide mounts on campus cluster login/xfer nodes where users already obtain krb5 creds at login
⬝ Also used in MSAD krb5 environment at Van Andel site
⬝ Baked-in config: https://hub.docker.com/r/miosiris/nfs-ganesha-ceph
NFS ID Mapping Overview
Two components:
Map server gid/uid to client user/group
Map krb5 principal to posix uid/gid
Globus and gridmap
We also provide Globus access to CephFS and S3 storage
⬝ For now separate endpoints, a future Globus version will support multiple storage connectors
⬝ Ceph S3 connector is highly scalable - the more raw storage you have, the more it costs!
⬝ Ceph connector uses the radosgw admin API to look up user credentials and connect to the endpoint URL with them
Credentials: CILogon + globus-gridmap
⬝ We keep the CILogon DN in the LDAP voPerson CoPersonCertificateDN attribute
⬝ For now a script is used to feed the DN to LDAP but it should be self-service for users (in COmanage gateway)
A simple Python script generates the globus-gridmap file on our Globus servers (sketched below)
⬝ Anyone want to write a gridmap module to directly query the DN from LDAP?
https://groups.google.com/a/globus.org/forum/#!topic/admin-discuss/8D54FzJzS-o
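A hedged sketch of such a generator - not the OSiRIS script; the LDAP URL, base DN, and attribute names are assumptions (voPersonCertificateDN per the voPerson schema, uid as the local account name):

from ldap3 import Server, Connection

BASE = "ou=People,dc=example,dc=org"   # assumed people branch

# Anonymous bind for the sketch; production would authenticate and use paging
conn = Connection(Server("ldaps://ldap.example.org"), auto_bind=True)
conn.search(BASE, "(voPersonCertificateDN=*)",
            attributes=["uid", "voPersonCertificateDN"])

with open("/etc/grid-security/grid-mapfile", "w") as gridmap:
    for entry in conn.entries:
        for dn in entry.voPersonCertificateDN.values:
            # gridmap format: "<certificate DN>" <local account>
            gridmap.write(f'"{dn}" {entry.uid.value}\n')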
Fighting Network Inequality
We’ve had issues with network bandwidth inequality: UM and MSU sites have an 80G link to each other but only 10G to the WSU datacenter
⬝ Adding a new node, or losing enough disks, would completely swamp the 10G link and cause OSD flapping, mon/mds problems, and service disruptions
Lowering recovery tunings fixes the issue, at the expense of under-utilizing our faster links. Recovery sleep had the most effect; the others were not as clear:
osd_recovery_max_active: 1      # (default 3)
osd_backfill_scan_min: 8        # (default 64)
osd_backfill_scan_max: 64       # (default 512)
osd_recovery_sleep_hybrid: 0.1  # (default 0.025)
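On Mimic these can also be changed at runtime, e.g. 'ceph config set osd osd_recovery_sleep_hybrid 0.1' or 'ceph tell osd.* injectargs --osd_recovery_sleep_hybrid 0.1'.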
Speaking of networks... The OSiRIS Network Management Abstraction Layer is a key part of the project with several important focuses:
⬝ Capturing site topology and routing information from multiple sources: SNMP, LLDP, sflow, SDN controllers, and existing topology and looking glass services
⬝ Converge on common scheduled measurement architecture with existing perfSONAR mesh configurations
⬝ Correlate long-term performance measurements with passive metrics collected via other monitoring infrastructure (Check_mk, Collectd, Nagios, etc)
⬝ Be able to apply this information to SDN based architecture for traffic routing and traffic shaping / QOS (for example, prioritize client / cluster service traffic over recovery)
http://www.osris.org/nmal
NMAL work is led by the Indiana University CREST team
Van Andel - Ceph Cache Tiering
At Van Andel Institute (Grand Rapids) we deployed 3 NVMe-based nodes
⬝ Part of our cluster but for cache tier pools only
⬝ Allocated a tier on top of VAI CephFS storage and local Ganesha NFS server
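The tier itself is created with the standard commands - for example 'ceph osd tier add <basepool> <cachepool>', 'ceph osd tier cache-mode <cachepool> writeback', and 'ceph osd tier set-overlay <basepool> <cachepool>' - followed by hit-set and target-size tuning on the cache pool.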
http://www.osris.org/domains/vai.html
CRUSH and Primary OSD
Another thing we can do to optimize data reads for certain locations is adjust CRUSH to place the primary (1st) OSD at a site.
Since reads and writes all go to the primary OSD, clients at that site will see a boost to reads (writes are still constrained by latency to the other replicas)
Our ‘sites’ are all CRUSH buckets, so it’s as simple as creating a CRUSH rule that allocates replicas starting with the desired primary site:
rule replicated_primary_wsu {
    id 5
    step take wsu
    step chooseleaf firstn 1 type host
    step emit
    step take msu
    …
}
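A pool is then pointed at the rule with, for example, 'ceph osd pool set <poolname> crush_rule replicated_primary_wsu'.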
http://www.osris.org/article/2019/03/01/ceph-osd-site-affinity
Puppet
We manage everything with Puppet, deployment with Foreman
⬝ foreman-bootdisk for external deployments such as Van Andel
⬝ r10k git environments
⬝ Recently updated to Puppet 5 / Foreman 1.20
Define a site and role (sub-role for storage) from hostname, use these in hiera lookups
⬝ Example: um-stor-nvm01 becomes a Ceph ‘stor’ node using devices as defined in the ‘nvm’ nodetype to create OSDs
⬝ site, role, node, nodetype are hiera tree levels
⬝ At the site level define things like networks (frontend/backend/mgmt), CRUSH locations, etc
Ceph deployment and disk provisioning managed by a Puppet module
⬝ Storage nodes look up Ceph OSD devices in hiera based on hostname component
⬝ Our module was forked from openstack/puppet-ceph
⬝ Supports all the Ceph daemons, bluestore, multi-OSD devices
⬝ https://github.com/MI-OSiRIS/puppet-ceph
OSiRIS and HEP
OSiRIS is involved with multiple science domains including medical imaging, biology population studies, ocean tidal studies (NRL), and Physics / HEP
ATLAS Event Service - Storage service based on S3 allowing jobs on any resource to retrieve physics events in small chunks.
⬝ Recently: Testing Globus to move data to OSiRIS S3 (and others)
ATLAS dCache - Ceph can be used as a backend for dCache. With AGLT2, work is underway to provide dCache storage from OSiRIS.
⬝ dCache Book: https://www.dcache.org/manuals/Book-5.0/cookbook-pool.shtml#running-pool-with-ceph-backend
⬝ Some more (old?) background: https://www.dcache.org/manuals/workshop-2017-05-29-Umea/000-Final/tigran-dcache+ceph.pdf
IceCube Neutrino Detector - Interest in using OSiRIS via object store interface (S3 or directly, not sure). Early discussion.
More: http://www.osris.org/domains
Future
This is the 4th year of OSiRIS, with a potential 5th year TBD by upcoming NSF review
We’d like to get more data on the platform; we have a number of queued-up users and new engagements (IceCube, OSN, Van Andel, more)
See more utilization of S3 services as a more practical path to working in-place on data sets
⬝ Recently deployed Globus S3 connectors (good timing for ATLAS!)
We are almost ready to bring up 100 Gbit links between all sites. Phase-1 will just put our cluster ‘backend’ networks on private peerings between sites and significantly improve replication / recovery capabilities as well as remove this traffic from client network links (cluster_network, public_network in Ceph)
⬝ Links will land directly on OSiRIS TOR switches (Dell Z9100) instead of campus DMZ routes - more control for us, more potential to break things if we peer to campus
Questions?
The End
Reference / Supplemental
Internet2 COmanage: https://spaces.at.internet2.edu/display/COmanage/Home
Internet2 Grouper: https://www.internet2.edu/products-services/trust-identity/grouper/
OSiRIS CephProvisioner: https://github.com/MI-OSiRIS/comanage-registry/tree/ceph_provisioner/app/AvailablePlugin/CephProvisioner
OSiRIS Docker (Ganesha, NMAL containers): https://hub.docker.com/u/miosiris
OSiRIS Docs: https://www.osris.org/documentation
Ceph Influxdb plugin: http://docs.ceph.com/docs/mimic/mgr/influx/
Some Numbers (current hw)
CPU
– 56 hyper-thread cores (28 real cores) to 60 disks in one node
– can see 100% usage on those during startup, normal 20-30%
Memory
– 384 GB for 600 TB of disk (about 650 MiB per TB)
– 6.5 GB per 10 TB OSD
4 x NVMe OSD Bluestore DB (Intel DC P3700 1.6TB)
– 110 GB per 10 TB OSD (15 per device)
– 11 GB per TB of space
– about 1% of block device space (docs recommend 4%... not realistic for us)
VAI Cache Tier
– 3 nodes, each 1 x 11 TB Micron Pro 9100 NVMe
– 4 OSD per NVMe
– 2x AMD EPYC 7251 2GHz 8-Core, 128 GB
Older hardware has 4 x smaller 400GB or 800GB NVMe - spec’d pre-Bluestore
ATLAS Event Service
Supplement to ‘heavy’ ATLAS grid infrastructure
Jobs fetch events / store output via S3 URL
Short term compute jobs good for preemptible resources
Site Overview
Core Ceph cluster sites share identical config and similar numbers / types of OSD
Any site can be used for S3/RGW access (HAproxy uses RGW backends at each site)
Any site can be used via Globus endpoint for FS or S3
Users at each site can mount NFS export from Ganesha + Ceph FSAL. NFSv4 idmap umich_ldap scheme used to map POSIX identities.
Site Overview - hardware
Example hardware models and details shown in the diagram on the left.
This year’s purchases used R740 headnodes, 10TB SAS disks, and Intel P3700 PCIe NVMe devices
Cache Tier Benchmarks - RADOS (VAI)
http://www.osris.org/domains/vai.html
Cache Tier Benchmarks - NFS / Iozone
From SC18: http://www.osris.org/article/2018/11/15/ceph-cache-tiering-demo-at-sc18
NMAL - Topology discovery (viz)
Visualization can also display computed paths through the topology
NMAL SDN Deployment - Ryu controller
Ryu SDN framework (https://osrg.github.io/ryu/)
Simple to deploy and integrate with our Python-based tools
Ryu in a VM through OVS required some planning to separate control from data plane with a common physical LAG on all hosts
Services managed by Puppet
NMAL - Bringing topology and monitoring together
Regular perfSONAR testing (via MCA mesh) results are exposed via topology visualization
The goal is to create an OSiRIS monitoring dashboard to quickly analyze and troubleshoot performance issues
Near-term plan:
Integrate passive measurements
Make topology elements dynamically update to highlight current network and device state
Long-term goal:
Integrate analysis engine based on UNIS-RT to perform changepoint detection and reporting