Open Cloud Testbed: Developing a Testbed for the Research Community Exploring
Next-Generation Cloud Platforms
Michael Zink & David Irwin, UMass Amherst; Orran Krieger & Martin Herbordt, Boston University;
Miriam Leeser & Peter Desnoyers, Northeastern University
What is MGHPCC?
What is Mass Open Cloud?
• Vision Statement
“To create a self-sustaining at-scale public cloud based on the Open Cloud eXchange model… a marketplace for industry partners as well as a place for researchers and industry to innovate and expose innovation to users.”
• Project Overview and Goals
– At-scale efficient production cloud for broad set of applications
– Create and Deploy the OCX model
– Testbed for research, open source developers, companies
Motivation I
● Cloud computing plays an important role in supporting most software we use in our daily lives
● Cloud testbeds are critical for enabling research into new cloud technologies (see demand for CloudLab and Chameleon)
● Demand for cloud testbeds often higher than available resources
Motivation II
● CISE researchers want to study users who are not in CISE
● MOC offers
○ real users
○ access to real data sets
○ traces of real usage
○ the ability to expose services to end users (e.g., TTP)
○ access to production services at scale (e.g., NESE)
○ infrastructure and services provided by industry partners
Research "in" the MOC
Motivation III
● CloudLab supports
○ Large community (nationwide) of systems researchers
○ Tools to configure experimental slices (a combination of bare-metal nodes and networking resources)
○ Hard isolation from other users/experiments
○ Profiles to describe hardware and software to build a cloud
○ Designed specifically for reproducible research
○ Software stack is open source
○ Federation with other testbeds (e.g., GENI, FABRIC)
● Scientific infrastructure for cloud research
● Clusters at Utah, Wisconsin, Clemson, and MGHPCC, which together offer 15,000+ cores
○ Clusters differ in focus: storage and networking (using hardware from Cisco, Seagate, and HP), high-memory computing (Dell), and energy-efficient computing (HP).
● A testbed for research and experimentation into new cloud platforms
● Combine proven software technologies and reproducibility features with a real production cloud
● Enhanced with programmable hardware (FPGA) capabilities; bump-in-the-wire (BITW); ~30 nodes
Open Cloud Testbed
OCT Approach
FPGAs in OCT
Open Cloud Testbed Concept
What’s new?
● FPGAs as reconfigurable compute and Bump-in-the-Wire (BITW), fully accessible by users
● Make CloudLab dynamically scalable by adding and removing third-party resources
● Transfer new cloud mechanisms to production cloud (MOC)
● Access to storage, data sets, cloud telemetry
● Collaboration with industry (e.g., 180 nodes from Two Sigma)
● Usage of certain parts of CloudLab by users outside the research community
New Hardware
● Original plan: Add 10 new nodes to existing Mass CloudLab cluster
● Two Sigma donation of ~200 two-year-old servers to MOC made us change plan:
○ Mass CloudLab (19 additional R630 => 380 additional cores)
○ 3 additional racks of servers (R630, R620, and R720/R730):
■ ~1600 cores
■ Part of CloudLab
■ Part of MOC
■ Can be used by industry and foundations
Challenges
● FPGAs
● Out-of-band management
● Network isolation
● Community buy-in
FPGAs
● 15 in 2020 & 15 in 2022
● Queried potential users of FPGAs (~20 beta users)
○ Cloud and Operating System
○ Middleware
○ FPGA systems
○ FPGA tools
○ Provider applications
○ Tenant applications
FPGAs
● Will most likely start with two models:
○ Xilinx Alveo U280
○ Intel D5005
● Toolchain
● Implications for networking:
○ Rate limit switch
Out-of-band Management
● Whoever controls OBM controls the server
● OBM proxy will manage control
● OBM interfaces with ESI
ESI
ESI Approach: long term
● ESI controls access to servers and switches
● CloudLab will have drivers for OBM, switch control, and console for allocated servers
● Scripts for admins to transfer nodes to/from CloudLab
● Use Keylime for attesting nodes provided back to production:
○ MOC
○ NERC
○ HPC Clusters
● Eventually enable stateless CloudLab nodes - save and restore experiments
● Original plan was to do this; CloudLab not elastic until complete...
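The long-term flow above (admin scripts move nodes, and a node is attested before it goes back to production) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Node`, `return_to_production`, and the `attest` callback are hypothetical names, and the callback stands in for whatever check Keylime would perform; no real Keylime or ESI API is used here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    owner: str  # e.g. "CloudLab", "MOC", "NERC" (labels taken from the slide)


def return_to_production(node: Node, new_owner: str,
                         attest: Callable[[Node], bool]) -> bool:
    """Hand a node back to a production pool only if attestation passes.

    `attest` is a hypothetical stand-in for an attestation check.
    """
    if node.owner != "CloudLab":
        raise ValueError(f"{node.name} is not currently held by CloudLab")
    if not attest(node):
        # Attestation failed: the node stays with CloudLab for inspection.
        return False
    node.owner = new_owner
    return True
```

The point of the gate is that ownership changes only after the check succeeds, so a node that fails attestation never appears in a production inventory.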
ESI Approach: new short term
● Give CloudLab & ESI control software direct access to servers and switches
● Each will have all servers in its inventory; build a simple mechanism to allocate servers used by the other side to a NULL project
● Cons: insecure; credentials for server out-of-band management and switch management have to be shared
● Pros: straightforward; we will be able to have an elastic CloudLab by mid-year; can incrementally add drivers for OBM, console, network...
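One way to picture the short-term mechanism: both sides list every server, and a server in use by the other side is parked in a placeholder project so the local scheduler never hands it out. The sketch below is a minimal model of that idea; `Inventory`, `hand_over`, and the `"NULL"` project string are assumptions made for illustration, not the actual CloudLab or ESI data model.

```python
NULL_PROJECT = "NULL"  # placeholder project name (assumed)


class Inventory:
    """One side's view of all servers (CloudLab's or ESI's)."""

    def __init__(self, name, all_servers, mine):
        self.name = name
        # server -> project; None means free on this side,
        # NULL_PROJECT means the other side is using it.
        self.project = {s: (None if s in mine else NULL_PROJECT)
                        for s in all_servers}

    def available(self):
        """Servers this side may schedule right now."""
        return sorted(s for s, p in self.project.items() if p is None)


def hand_over(server, giver, taker):
    """Move a free server from the giver's pool to the taker's pool."""
    if giver.project[server] is not None:
        raise RuntimeError(f"{server} is not free in {giver.name}")
    # Park it on the giver side first so it is never schedulable on both sides.
    giver.project[server] = NULL_PROJECT
    taker.project[server] = None


servers = ["n1", "n2", "n3"]
cloudlab = Inventory("CloudLab", servers, mine={"n1"})
esi = Inventory("ESI", servers, mine={"n2", "n3"})
hand_over("n2", esi, cloudlab)  # grow CloudLab by one node
```

After the hand-over, `cloudlab.available()` contains `n1` and `n2`, while `esi.available()` contains only `n3`. Note that this model does nothing about the shared-credential problem listed under the cons.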
Network Isolation
AL2S
● Guaranteeing QoS in the network is hard
● Options:
○ Traffic shape at the server (tricky to enforce)
○ Rate limit ports at switches (what is the impact on TCP?)
● Overprovisioning helps but does not guarantee isolation
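Both options above are commonly built on a token bucket. The sketch below is a generic token-bucket policer, not the mechanism OCT uses; the class name, parameters, and the explicit `now` argument (used instead of a wall clock so the behavior is easy to follow) are assumptions for illustration.

```python
class TokenBucket:
    """Generic token-bucket policer: packets over the rate are rejected."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # refill rate in bytes per second
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the previous call, in seconds

    def allow(self, now, size_bytes):
        """Return True if a packet of size_bytes may pass at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, but never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


bucket = TokenBucket(rate_bps=8000, burst_bytes=1500)  # 1000 bytes per second
results = [bucket.allow(0.0, 1500),   # full bucket: passes
           bucket.allow(0.5, 1500),   # only 500 bytes refilled: rejected
           bucket.allow(2.5, 1500)]   # bucket full again: passes
```

A policer like this rejects excess packets, which a TCP sender sees as loss; a shaper would queue them instead. That difference is one reason the TCP question on the slide matters when limiting at the switch port.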
Community buy-in
● This is where you come into the picture!
● How can OCT support your systems research?
● We need your feedback!!!
● Good news:
○ CloudLab community can use it from day one
○ MOC community can use it from day one
Core Team
CNS-1925464