stateless hypervisors at scale
TRANSCRIPT
![Page 1: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/1.jpg)
Stateless Hypervisors at Scale
Antony Messerli
@ntonym
![Page 2: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/2.jpg)
ABOUT MYSELF
• Almost 14 years with Rackspace
• Hardware Development for Rackspace
• Rackspace Cloud Servers
• Slicehost
• OpenStack Public Cloud
• R&D and Prototyping
• Twitter: @ntonym, GitHub: antonym, IRC (freenode): antonym
![Page 3: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/3.jpg)
• OpenStack Public Cloud in production since August 2012
• Six geographic regions around the globe
• Tens of thousands of hypervisors (over 340,000 cores, just over 1.2 petabytes of RAM)
• Over 10 different hardware platforms
• Primarily utilize the Citrix XenServer Hypervisor today
![Page 4: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/4.jpg)
TRADITIONAL HYPERVISORS
![Page 5: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/5.jpg)
Components of a Hypervisor in OpenStack
• Bare metal
• Operating System
• Configuration Management (Ansible, Chef, Puppet)
• Nova Compute
• Instance settings
• Instance virtual disks
![Page 6: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/6.jpg)
Hypervisor’s Mission
Needs to be:
• Stable
• Secure
• Able to provision and run instances reliably
• Consistent with other servers
![Page 7: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/7.jpg)
Problems With Hypervisors At Scale
• Operating System
‣ Multiple versions of XenServer
‣ Each version has variations in patches, kernels, or the Xen hypervisor
‣ More variations = More work
• Server Hardware
‣ Incorrect BIOS settings, firmware, or modules can cause different behaviors
• Operational Issues
‣ OpenStack or hypervisor bugs can leave things in undesirable states
![Page 8: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/8.jpg)
How We Solved Some Of Those Problems
• Factory style provisioning using iPXE and Ansible
• Consolidated hypervisor versions to reduce variations
• Attempt to correct inconsistencies on the hypervisors automatically
But… we’re still running a traditional operating system!
![Page 9: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/9.jpg)
Our Goals
• Rapidly deploy hypervisors.
• Take advantage of server reboots!
• Reproducible build
• Consistency within hardware platforms and operating systems
After all, these are Cattle, not Pets!
![Page 10: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/10.jpg)
THE CONCEPT: LIVE BOOTED HYPERVISORS
![Page 11: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/11.jpg)
What Is A Live OS?
• A bootable image that runs in a system’s memory
• Predictable and portable
• Typically used for installs or rescue, booted from CD or network
• Doesn’t make changes to existing configuration
What if we applied this same concept to run our hypervisor?
![Page 12: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/12.jpg)
“We’ll Do It Live!”
• Network booted stateless LiveOS
• Built from scratch using Ansible
• Operating System is separated from customer data
• Reboot for the latest build
![Page 13: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/13.jpg)
But Where Does The Persistent Data Go?
• A systemd unit file mounts the disk early in the boot process
• Create the symlinks from the LiveOS to the persistent store
• For example:
/dev/sda2 is mounted to /data
/var/lib/nova -> /data/var/lib/nova
• Can create a symlink for each directory you want to persist
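A minimal sketch of the symlink step, assuming a POSIX shell and that the persistent disk is already mounted. The function name `link_persistent` and its `ROOT` argument (which lets the sketch run against a scratch directory instead of `/`) are illustrative assumptions, not part of the original tooling.

```shell
#!/bin/sh
# link_persistent ROOT DATA_MOUNT DIR...
#   ROOT        root of the live filesystem ("/" on a real host)
#   DATA_MOUNT  mount point of the persistent disk, e.g. /data
#   DIR...      absolute paths to persist, e.g. /var/lib/nova
link_persistent() {
    root=${1%/}
    data=$2
    shift 2
    for dir in "$@"; do
        target="$data$dir"
        # Make sure the directory exists on the persistent disk (first boot)
        # and that the parent of the link location exists in the live root.
        mkdir -p "$root$target" "$(dirname "$root$dir")"
        # Drop whatever the image shipped at that path; the live root is
        # ephemeral, so nothing of value is lost. Then point it at the disk.
        rm -rf "$root$dir"
        ln -s "$target" "$root$dir"
    done
}
```

On a booted host this might be called as `link_persistent / /data /var/lib/nova`, once per boot, after the mount unit has run.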
![Page 14: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/14.jpg)
How Is This Possible?
• We leverage the dracut project.
• Dracut runs in the initramfs during boot time
• Main goal is to transition to the real root filesystem
• Has lots of functionality for network boot
• Set options from kernel command line
• More information @ https://dracut.wiki.kernel.org
Dracut Config Example:
![Page 15: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/15.jpg)
Why Use A LiveOS
• Everything boots from a single image.
• Can make changes without reboot, but should update image.
• Can update to a new release of the OS and roll back to the existing if needed.
• Portable and easy to test and develop on.
• Memory is cheap!
![Page 16: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/16.jpg)
THE IMAGE BUILD PROCESS
![Page 17: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/17.jpg)
Squashible
• Combination of SquashFS and Ansible.
• Ansible Playbooks automate the build process of creating the images
• Supports multiple OS versions
• Configuration management done during image build
• All changes to our build live within the repo, fully tracked and easily reproducible.
![Page 18: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/18.jpg)
The Initial Bootstrap
• Ansible uses Docker to create a minimal chroot
• Installs:
‣ Package manager (dnf, apt, zypper)
‣ Init system
• Copies the chroot to Jenkins
• Ansible destroys the Docker container
[Diagram: Docker container or systemd-nspawn, minimal OS in chroot, filesystem on Jenkins server]
![Page 19: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/19.jpg)
Preparing The chroot
• Ansible uses its chroot connection plugin to catch up the OS
• Version-tracking metadata is added to the image
• Package manager configurations are applied
• All packages are updated to the latest available versions from the distribution's mirrors
[Diagram: Live OS chroot with yum/apt configuration and versioning metadata]
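One way the version-tracking metadata could be stamped into the chroot is sketched below. The file path `/etc/liveos-build`, the key names, and the function name are all hypothetical. The deck only says that metadata is added, and elsewhere uses a `$GIT_COMMIT` variable to label builds.

```shell
#!/bin/sh
# write_build_metadata CHROOT GIT_COMMIT [BUILD_DATE]
# Writes a small key=value file into the chroot so a booted host can
# report which build it is running. BUILD_DATE defaults to now (UTC).
write_build_metadata() {
    chroot_dir=$1
    commit=$2
    build_date=${3:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
    mkdir -p "$chroot_dir/etc"
    # /etc/liveos-build is a made-up location for this sketch.
    cat > "$chroot_dir/etc/liveos-build" <<EOF
GIT_COMMIT=$commit
BUILD_DATE=$build_date
EOF
}
```

Keeping the file in plain `KEY=value` form means it can be read with `grep` or sourced by a shell script on the running host.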
![Page 20: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/20.jpg)
Common Configuration
Ansible applies configuration that should be included in all live images:
• Authentication
• Auditing
• Common Packages
• Logging configuration
• Security configurations
• SSH configurations
• Enable/disable services on boot
[Diagram: Live OS chroot with security configuration (auth, sshd, SELinux, AppArmor, auditd), logging configuration (journald, rsyslog), and service startup configuration (via systemd)]
![Page 21: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/21.jpg)
Apply The Personality
• Ansible takes the common live OS chroot and configures it based on the desired “personality”
• Each role has the packages to install, along with any special configurations required for the hypervisor to function
[Diagram: Common Live OS chroot becomes one of the following via additional Ansible Roles: Basic server, KVM hypervisor, Xen hypervisor, LXC hypervisor, XenServer hypervisor]
![Page 22: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/22.jpg)
Publishing The Build
• Kernel and initramfs are copied to the deployment server
• Root filesystem (entire chroot) is tarballed and copied to the deployment server
• mktorrent generates a torrent file for rootfs
• rtorrent seeds the initial torrent of the rootfs
[Diagram: Common Live OS chroot produces vmlinuz (kernel), initrd (ramdisk), root filesystem tarball of chroot, and root filesystem torrent file (mktorrent); rtorrent seeds rootfs; opentracker; Deployment Server (HTTP) holds vmlinuz, initramfs, rootfs.img, rootfs.img.torrent]
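The torrent-generation step could be scripted along these lines. This is a sketch only: the tracker URL is a placeholder, and the `run` dry-run wrapper and function names are assumptions added so the command can be inspected without `mktorrent` installed. The `-a` (announce URL) and `-o` (output file) options are standard mktorrent flags.

```shell
#!/bin/sh
# run CMD...: execute CMD, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# make_rootfs_torrent IMAGE TRACKER_URL
# Creates IMAGE.torrent next to IMAGE, announcing to TRACKER_URL
# (for example an opentracker instance on the deployment server).
make_rootfs_torrent() {
    image=$1
    tracker=$2
    run mktorrent -a "$tracker" -o "$image.torrent" "$image"
}
```

For example, `DRY_RUN=1 make_rootfs_torrent rootfs.img http://tracker.example:6969/announce` prints the command that would be executed. Seeding with rtorrent is left out because its setup depends on site-specific configuration.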
![Page 23: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/23.jpg)
THE BOOT PROCESS
![Page 24: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/24.jpg)
Ok, We Built An Image, Now What? Boot It!
• Boot from network with iPXE
• Boot from local disk with GRUB
• If network fails, can revert to local boot
• Lots of open source provisioning systems available
![Page 25: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/25.jpg)
Boot with iPXE
#!ipxe
:netboot
imgfree
set dracut_ip ip=${mgmt_ip_address}::${mgmt_gateway_ip}:${mgmt_netmask}:${hostname}:${mgmt_device}:none nameserver=${dns}
kernel ${vmlinuz_url} || goto netboot
module ${initrd_url} || goto netboot
imgargs vmlinuz root=live:${torrent_url} ${dracut_ip} rd.writable.fsimg ${console}
boot || goto netboot
![Page 26: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/26.jpg)
Boot via extlinux
LABEL latestbuild-$GIT_COMMIT
menu label latestbuild-$GIT_COMMIT
kernel $KERNEL root=live:/dev/sda1 rd.live.dir=/boot/builds/$GIT_COMMIT ${dracut_ip} rd.writable.fsimg booted_as=local
initrd $INITRD
• After booting from network, you can create a local disk cache of image.
• If network boot fails, you can still boot previously loaded image from disk.
• Could roll out images ahead of time and skip network boot.
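Rolling out an image ahead of time implies writing a boot entry per build. A rough sketch of generating such an entry follows. The function name and argument order are assumptions, and the output uses the conventional Syslinux `KERNEL` / `APPEND` / `INITRD` split rather than the exact layout on the slide.

```shell
#!/bin/sh
# render_extlinux_entry GIT_COMMIT KERNEL INITRD [EXTRA_ARGS]
# Prints an extlinux menu entry that boots the locally cached image
# for the given build. EXTRA_ARGS is appended to the kernel command
# line (for example the dracut ip= settings).
render_extlinux_entry() {
    commit=$1
    kernel=$2
    initrd=$3
    extra=${4:-}
    cat <<EOF
LABEL latestbuild-$commit
  MENU LABEL latestbuild-$commit
  KERNEL $kernel
  APPEND root=live:/dev/sda1 rd.live.dir=/boot/builds/$commit $extra rd.writable.fsimg booted_as=local
  INITRD $initrd
EOF
}
```

The output can be appended to the extlinux configuration on the local disk, so that a previously cached build remains bootable if the network is unavailable.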
![Page 27: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/27.jpg)
Boot via kexec
kexec -l vmlinuz --initrd=initrd.img \
  --command-line="root=live:http://$deployment_server/images/fedora-23-kvm/rootfs.img \
  ip=dhcp nameserver=8.8.8.8 rd.writable.fsimg rd.info rd.shell"
kexec -e
• Useful for testing it out from a running machine
• Also useful for reloading your OS to the latest build of the image
• Have to make sure your hardware drivers work well with kexec
![Page 28: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/28.jpg)
Our Primary Boot Method, Terraform
• Server makes DHCP request and retrieves iPXE kernel
• Identifies itself using LLDP
• Gets all attributes and plugs them into an iPXE template
• Our Utility LiveOS:
‣ Brings Firmware and BIOS settings to latest spec
‣ Storage and OBM
‣ Inventory
‣ Kexecs into Primary Image
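The "plug attributes into an iPXE template" step might look roughly like the sketch below. The function name, argument order, and the idea of emitting `set` lines ahead of a static template are assumptions. The template body is taken from the iPXE script shown on the earlier slide, and `vmlinuz_url`, `initrd_url`, `torrent_url`, and `console` are assumed to be defined elsewhere, as they are there.

```shell
#!/bin/sh
# render_ipxe HOSTNAME IP GATEWAY NETMASK DEVICE DNS
# Prints an iPXE script with per-host settings filled in.
render_ipxe() {
    # Per-host values: expanded by the shell into iPXE "set" commands.
    cat <<EOF
#!ipxe
set hostname $1
set mgmt_ip_address $2
set mgmt_gateway_ip $3
set mgmt_netmask $4
set mgmt_device $5
set dns $6
EOF
    # Quoted delimiter: every \${...} below is an iPXE variable and is
    # passed through untouched for iPXE to expand at boot time.
    cat <<'EOF'
:netboot
imgfree
set dracut_ip ip=${mgmt_ip_address}::${mgmt_gateway_ip}:${mgmt_netmask}:${hostname}:${mgmt_device}:none nameserver=${dns}
kernel ${vmlinuz_url} || goto netboot
module ${initrd_url} || goto netboot
imgargs vmlinuz root=live:${torrent_url} ${dracut_ip} rd.writable.fsimg ${console}
boot || goto netboot
EOF
}
```

Emitting `set` lines keeps the template itself identical for every host, so only the short header varies with the attributes looked up for the server.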
![Page 29: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/29.jpg)
Our Initial Scale Tests (x86_64)
• Heavily tested on 200+ x86 hosts running Fedora 23 based LiveOS
• Time to build and package live image from git commit: ~10 minutes
• Time to boot a server once POST completes: ~60 seconds
• Re-provision time for 200 servers from reboot to provisioning instances: ~15 minutes
![Page 30: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/30.jpg)
OpenPOWER “Barreleye” (ppc64le)
• Currently testing the OpenStack KVM stack with LiveOS builds using Fedora 23 on OpenPOWER Barreleye
• More information about Barreleye @ http://blog.rackspace.com/openpower-open-compute-barreleye/
![Page 31: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/31.jpg)
Future Ideas
• Embedded configuration management
‣ Image would run automation and retrieve its own configuration on boot
‣ Regenerates itself on every boot
• Stateless instances
‣ Boot from Config Drive
‣ Reset state or upgrade with reboot
![Page 32: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/32.jpg)
Give It A Try
Squashible - Cross-Platform Linux Live Image Builder
http://squashible.com
Sample iPXE Boot menus:
https://github.com/squashible/boot.squashible.com
![Page 33: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/33.jpg)
![Page 34: Stateless Hypervisors at Scale](https://reader034.vdocuments.net/reader034/viewer/2022050613/58ecc8011a28abef658b4599/html5/thumbnails/34.jpg)
Thank you!
Antony Messerli
@ntonym