High-density Multi-tenant Bare-metal Cloud with Memory Expansion SoC and Power Management
Authors:
HotChips
Why Baremetal Cloud and What is X-Dragon?
Alibaba Cloud
• For multi-tenancy and cost efficiency
• For security and isolation
• For single-thread performance
• For interoperability and manageability
Problems
Problem 1: VM-based cloud has non-negligible virtualization overhead, isolation and security concerns, and limited single-thread performance, but good manageability.
Problem 2: Existing bare-metal cloud is designed for a single tenant, lacks manageability, and is costly.
A datacenter hosts VM-based cloud, single-tenant bare-metal cloud, and BM-Hive (multi-tenant bare-metal cloud).
VM-based Cloud
Legacy Baremetal Cloud
X-Dragon: Designed for a multi-tenant, secure, high-performance, and easily manageable cloud
X-Dragon High Level View in Cloud
KVM vs X-Dragon
• Same cloud infrastructure
• Same tools to manage
• Both multi-tenant
• More secure and selectable bare-metal performance
X-Dragon System Architecture
1. Compute Boards + Base Server
2. Hardware implementation of virtio devices
3. Custom backend: BM-Hypervisor
X-Dragon: IO-Bond and Backend
• Shadow ring buffer design: transfers data between the compute board and the backend base server
• BM-Hypervisor design: emulates virtio devices and connects into the existing cloud infrastructure
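The shadow ring buffer idea above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the X-Dragon implementation: the class names (`Ring`, `ShadowRingBridge`), the free-running index scheme, and the `sync` step standing in for a DMA transfer are all hypothetical. The slides only state that a shadow ring carries data between the compute board and the base server.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ring:
    """Fixed-size ring with a free-running producer index (hypothetical model)."""
    size: int
    slots: List[Optional[bytes]] = field(default_factory=list)
    head: int = 0  # next position the producer will write (never wraps)

    def __post_init__(self) -> None:
        self.slots = [None] * self.size


class ShadowRingBridge:
    """Mirrors entries from a guest-side ring into a shadow ring on the base server."""

    def __init__(self, size: int) -> None:
        self.guest = Ring(size)    # assumed to live in compute-board memory
        self.shadow = Ring(size)   # assumed to live in base-server memory
        self.backend_tail = 0      # next shadow entry the backend will consume

    def guest_post(self, payload: bytes) -> bool:
        """Guest driver posts a request; returns False if the ring is full."""
        if self.guest.head - self.backend_tail >= self.guest.size:
            return False
        self.guest.slots[self.guest.head % self.guest.size] = payload
        self.guest.head += 1
        return True

    def sync(self) -> int:
        """Copy entries posted since the last sync (stands in for a DMA transfer)."""
        copied = 0
        while self.shadow.head < self.guest.head:
            idx = self.shadow.head % self.shadow.size
            self.shadow.slots[idx] = self.guest.slots[idx]
            self.shadow.head += 1
            copied += 1
        return copied

    def backend_poll(self) -> List[bytes]:
        """Backend drains every entry currently visible in the shadow ring."""
        out = []
        while self.backend_tail < self.shadow.head:
            out.append(self.shadow.slots[self.backend_tail % self.shadow.size])
            self.backend_tail += 1
        return out
```

In this model the backend never touches the guest ring directly: it sees only what `sync` has mirrored, which is the property that lets an existing virtio backend run unchanged on the base server side.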
Evaluation: CPU/Mem/IO performance
• X-Dragon BM-Guest vs Native vs VM: BM-Guests perform slightly better than VMs
• Memory bandwidth: BM-Guests match Native; VMs reach 98% of BM-Guests under load
• Network PPS: Same PPS rate, but with more volatility
• Latency: Same at the application level; longer path than in DPDK kernel-bypass testing
• Storage: Substantially better than VM in latency and long-tail latency
Evaluation: Real business
• Nginx
• MariaDB
• Redis
X-Dragon BM guest performs substantially better than the virtualization-based cloud service for the popular applications used in the cloud
X-Dragon based Infrastructure Enhancement
• Cloud App Aware Power Management
• Memory Pool
Memory Pool
[Figure: memory pool architecture. Recoverable labels: compute nodes (CPU 0, CPU 1, CPU/FPGA) with DDR, DDR4, PMEM, MEM, xNIC, and BMC; PCIe links through retimers to CC switch and to non-CC switch; MemX with CC controller/bridge, Cache Line Manager (node), Cache Line Manager (rack), page managers, and a data bus to MEM (DDR4), PMEM, and NVMe (SSD); PCIe Cache Coherence; Rack/Local Pool Switch Fabric (Ether/Ether-CC switch, PCIe/PCIe-CC switch); Remote Pool; Ether Switch (AIOps); RMC.]
ROI Analysis
On Compute & Rack
[Figure: block diagram. Recoverable labels: PCIe; Cache Line converter; MEM Controller (DDR, PMEM); NVMe (SSD); Page Manager with Buffer/Queue, ARM, Prediction & Prefetch, ACCL, Lookup & order mgmt, Warm-up, and Local processing; Ethernet/PCIe; CC bridge; CC Controller; NVMe-oF; ReliableXfer; Verbs; xNIC. Legend: Alibaba, Vendor.]
Host: Instances, Hypervisor Memory Mgmt, OS Kernel Mem Mgmt
• Memory allocation
• Page fault handling
• Performance optimization
Workloads & Potential Benefits
Type                      | Test                                  | Potential Benefits
Traditional Compute       | Mid to high utilization               | Lower performance, higher density
Middleware                | E-Commerce                            | Lower performance, higher density
Micro Services            | E-Commerce                            | Lower performance, higher density
AI                        | Ali Native training & inference       | Unacceptable for training
Encryption & Compression  | Standard payload pre-/post-processing | Easier to scale out
Placement & Migration     | Large instances                       | Faster; saving network b/w
Checkpointing & Mirroring | Cloud based HPC                       | High performance checkpointing enabled
NFV                       | Host gateway                          | Depends; easier to provision
Database                  | In-memory DB                          | Cost down significantly
Graph                     | Large social apps                     | Cost down significantly; minor programming model change
Upgrade & Deployment      | Patching & initialization             | Faster upgrade & composing
Power Management Platform
[Figure: power management platform. Recoverable labels: Highly Available Management; PnP Agent on the Compute Node with Coordinator, IB PnP Sub-Agent, OOB PnP Sub-Agent, API, and Interface to PnP Master; CPU MSR/MMCFG; Bus master / proxy; BMC; managed components: Accelerators (GPU, FPGA), DDR, CPU, Chipsets, RoP, Storage, Network, Fans.]
Alibaba Power Agent
• In-Band Power Management
• Out-of-Band Power Management
Server Platform
• Fine granularity power and performance telemetry & control knobs
• In-Band and Out-of-Band Control Channels
Capping & Budgeting
• Rack/Node Power Capping
• App Driven Power Budgeting
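To make rack-level capping with app-driven budgeting concrete, the sketch below splits a rack power cap across nodes. It is an illustration only: the function name `budget_nodes`, the per-node floor, and the proportional-sharing rule are assumptions made here, not the policy described in the talk.

```python
from typing import Dict


def budget_nodes(rack_cap_w: float,
                 demand_w: Dict[str, float],
                 floor_w: Dict[str, float]) -> Dict[str, float]:
    """Split a rack power cap across nodes (hypothetical policy).

    Each node first receives its floor; the remaining headroom is shared in
    proportion to each node's demand above its floor. No node receives more
    than it asked for.
    """
    total_floor = sum(floor_w.values())
    if total_floor > rack_cap_w:
        raise ValueError("rack cap is below the sum of node floors")
    # Demand above the guaranteed floor, per node.
    extra = {n: max(demand_w[n] - floor_w[n], 0.0) for n in demand_w}
    total_extra = sum(extra.values())
    headroom = rack_cap_w - total_floor
    # If the cap covers all demand, nobody is throttled.
    scale = 1.0 if total_extra <= headroom else headroom / total_extra
    return {n: floor_w[n] + extra[n] * scale for n in demand_w}
```

For example, with a 1000 W cap, floors of 200 W each, and demands of 800 W and 400 W, the headroom of 600 W is shared 3:1, so the first node is throttled more in absolute terms while both keep their floor.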
Performance Awareness
Identify the IDC, server, and app control knobs with the least performance impact
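One way to picture choosing knobs with the least performance impact is a greedy ranking by performance cost per watt saved. The sketch below is a hypothetical illustration: the `Knob` fields, the example knob names, and the greedy rule are assumptions made here, and the numbers in the usage example are made up for demonstration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Knob:
    name: str           # hypothetical knob label
    scope: str          # "idc", "server", or "app"
    saving_w: float     # estimated power saved when the knob is applied
    perf_impact: float  # estimated performance loss, in percent


def pick_knobs(knobs: List[Knob], target_saving_w: float) -> Tuple[List[Knob], float]:
    """Greedily apply knobs with the lowest performance cost per watt saved."""
    ranked = sorted((k for k in knobs if k.saving_w > 0),
                    key=lambda k: k.perf_impact / k.saving_w)
    chosen: List[Knob] = []
    saved = 0.0
    for knob in ranked:
        if saved >= target_saving_w:
            break  # target reached; leave the costlier knobs untouched
        chosen.append(knob)
        saved += knob.saving_w
    return chosen, saved


# Usage with made-up estimates for illustration only.
example = [
    Knob("fan_curve", "idc", 50.0, 0.0),
    Knob("uncore_freq", "server", 80.0, 1.0),
    Knob("core_freq", "server", 120.0, 6.0),
    Knob("batch_defer", "app", 60.0, 1.2),
]
selected, total_saved = pick_knobs(example, target_saving_w=150.0)
```

A greedy ranking is simple to audit but not guaranteed to be optimal; it can overshoot the target, and a real controller would also need measured rather than estimated impact figures.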