Userfaultfd and Post-Copy Migration
Mike Rapoport
Outline
● Migration background
● Userfaultfd
● Post-copy migration
Migration: why?
● Spectacular
● Stateful application with no downtime
  ○ Hardware upgrades
  ○ Software upgrades requiring a reboot
● Load balancing
Migration: how?
● Very simple
  ○ Save state on source
  ○ Copy state to destination
  ○ Restore state on destination
● Memory is the heaviest part
  ○ Pre-copy vs post-copy
Migration flows
Pre-copy
● Track memory, copy inactive part
● Freeze on source
● Copy state and remaining memory
● Unfreeze on destination
Post-copy
● Freeze on source
● Copy state except memory
● Enable “remote swap”
● Unfreeze on destination
● Bring memory on demand
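[Figure: pre-copy timeline. Phases in order: prepare, memory copy 1 ... memory copy n, freeze, state copy, unfreeze. The application is running on the source until the freeze, stopped during the state copy, and running on the destination after the unfreeze.]
[Figure: post-copy timeline. Phases in order: prepare, freeze, state copy, unfreeze, rest of the memory. The application is running on the source until the freeze, stopped during the state copy, and running on the destination after the unfreeze, with remote page faults served while the rest of the memory is transferred.]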
Pre-copy vs post-copy
https://youtu.be/lo2JJ2KWrlA
Pre-Copy
+ Less vulnerable to node failures
+ High performance in “UP” state
- Longer downtime
- Might diverge
Post-Copy
- More vulnerable to node failures
- Slowdown after migration
+ Shorter downtime
+ Predictable downtime
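The “longer downtime” and “might diverge” entries for pre-copy can be illustrated with a toy model. This is a sketch under stated assumptions, not a measurement: every round copies the currently dirty pages at a fixed bandwidth while the workload keeps dirtying pages at a fixed rate. The function names, the 1024-page stop threshold and the 30-round cap are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Toy model (an assumption): each round copies the pages that are
 * currently dirty at `bw` pages/s while the workload dirties
 * `dirty_rate` pages/s.  Returns the number of rounds run and stores
 * the pages left for the final freeze-and-copy step.
 */
int precopy_rounds(double total_pages, double bw, double dirty_rate,
                   double stop_threshold, int max_rounds,
                   double *remaining_out)
{
    double remaining = total_pages;
    int rounds = 0;

    assert(bw > 0.0);

    while (remaining > stop_threshold && rounds < max_rounds) {
        double copy_time = remaining / bw;   /* seconds spent on this round */
        remaining = dirty_rate * copy_time;  /* pages dirtied in the meantime */
        rounds++;
    }
    *remaining_out = remaining;
    return rounds;
}

/* Print one scenario; the downtime estimate is the final copy time. */
void precopy_report(const char *name, double total, double bw,
                    double dirty_rate)
{
    double left;
    int rounds = precopy_rounds(total, bw, dirty_rate, 1024.0, 30, &left);

    printf("%s: %d rounds, %.0f pages left, ~%.3f s downtime\n",
           name, rounds, left, left / bw);
}
```

In this model the remaining set shrinks by the factor dirty_rate / bw per round, so it converges only when the workload dirties memory more slowly than the link can copy it. When dirty_rate is at least bw, the loop stops only at the round cap, which is the divergence case. Post-copy avoids the loop altogether, which is why its downtime is predictable.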
Userfaultfd highlights
● Delegation of page faults to userspace
● File descriptor with ioctls for control
● Poll and read to get page fault notifications
● mcopy_atomic to “map” the page
  ○ Can handle zero pages
Userfaultfd setup
● Initialize userfault file descriptor
  ○ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
● API handshake
  ○ uffdio_api.api = UFFD_API;
  ○ uffdio_api.features = 0;
  ○ ioctl(uffd, UFFDIO_API, &uffdio_api);
● Register range
  ○ uffdio_register.range.start = (unsigned long) start;
  ○ uffdio_register.range.len = nr_pages * page_size;
  ○ uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
  ○ ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
Page fault handling
● Wait for event
  ○ pollfd[0].fd = uffd;
  ○ pollfd[0].events = POLLIN;
  ○ poll(pollfd, 1, -1);
● Read the event
  ○ read(uffd, &msg, sizeof(msg));
  ○ if (msg.event != UFFD_EVENT_PAGEFAULT)
  ○     oops...
  ○ faulting_address = msg.arg.pagefault.address;
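The wait-and-read step could be wrapped as shown below. It is a sketch, not the handler used by QEMU or CRIU. The function name `uffd_wait_fault` and its return convention are hypothetical. Because the descriptor was opened with O_NONBLOCK, a read that races with another reader can return EAGAIN, which the sketch treats like a timeout.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to `timeout_ms` for one event on `uffd`.
 * Returns 1 and stores the faulting address for a page fault,
 * 0 on timeout (or a spurious wakeup), -1 on error or any other event.
 */
int uffd_wait_fault(int uffd, int timeout_ms, unsigned long long *address)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;
    ssize_t n;
    int ready;

    assert(address != NULL);

    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);
    if (ready <= 0)
        return ready;                 /* 0: timeout, -1: poll error */
    if (!(pfd.revents & POLLIN))
        return -1;                    /* POLLERR or POLLHUP */

    n = read(uffd, &msg, sizeof(msg));
    if (n < 0)
        return errno == EAGAIN ? 0 : -1;
    if ((size_t)n != sizeof(msg))
        return -1;                    /* short read, should not happen */
    if (msg.event != UFFD_EVENT_PAGEFAULT)
        return -1;                    /* the "oops" case */

    *address = msg.arg.pagefault.address;
    return 1;
}
```

The reported address is the raw faulting address, so a handler would normally round it down to a page boundary before using it as the destination of a copy.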
Page fault handling
● “Map” normal page
  ○ uffdio_copy.dst = faulting_address;
  ○ uffdio_copy.src = source_page_address;
  ○ uffdio_copy.len = page_size;
  ○ uffdio_copy.mode = 0;
  ○ uffdio_copy.copy = 0;
  ○ ioctl(uffd, UFFDIO_COPY, &uffdio_copy);
● “Map” zero page
  ○ uffdio_zeropage.range.start = faulting_address;
  ○ uffdio_zeropage.range.len = page_size;
  ○ uffdio_zeropage.mode = 0;
  ○ ioctl(uffd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
Under the hood
● syscall(__NR_userfaultfd)
  ○ Allocate userfault context
  ○ Create a file hooked to an anonymous inode
  ○ Wait for API handshake
● ioctl(UFFDIO_API)
  ○ Verify that userspace and kernel talk the same language
● ioctl(UFFDIO_REGISTER)
  ○ Find VMA covering desired range
  ○ Make sure the VMA can “user fault”
  ○ Add userfault context to the VMA
Under the hood
● Page fault
  ○ Faulting address covered by VMA with userfault context
  ○ Add “page fault” message to file poll queue
  ○ Wake up process polling the uffd
  ○ Return VM_FAULT_RETRY to mm core
● UFFDIO_COPY/UFFDIO_ZEROPAGE
  ○ Allocate a page
  ○ Create a page table entry for faulting address
  ○ Copy the page content from userspace, or
  ○ Map the zero page
VM post-copy migration
● Guest memory is a part of QEMU address space
● Combine pre- and post-copy
● Straightforward flow
  ○ Start a thread for user fault handling
  ○ Register guest memory areas with userfaultfd
  ○ Guest page fault causes UFFD_EVENT_PAGEFAULT
    ■ Request the page from source
    ■ Copy/zero guest memory upon response
  ○ Fetch non-faulting pages in the background
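The interplay between on-demand requests and background fetching can be sketched with a small bookkeeping structure. This is an illustration only, not how QEMU implements it. `NPAGES`, `struct fetch_state`, `next_page` and `mark_received` are hypothetical names, and the policy (a faulted page always wins over the background scan) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8   /* hypothetical size of the migrated area, in pages */

struct fetch_state {
    bool received[NPAGES];  /* pages already installed on the destination */
    size_t cursor;          /* next candidate for background fetching */
};

/*
 * Choose the next page to request from the source.  A page that a guest
 * fault is waiting for (`faulted` >= 0) wins over the background scan.
 * Returns NPAGES once every page has been received.
 */
size_t next_page(struct fetch_state *s, long faulted)
{
    if (faulted >= 0 && (size_t)faulted < NPAGES && !s->received[faulted])
        return (size_t)faulted;

    /* Skip pages that already arrived, e.g. because they faulted earlier. */
    while (s->cursor < NPAGES && s->received[s->cursor])
        s->cursor++;
    return s->cursor;
}

/* Record that `page` arrived and was installed (e.g. via UFFDIO_COPY). */
void mark_received(struct fetch_state *s, size_t page)
{
    assert(page < NPAGES);
    s->received[page] = true;
}
```

Tracking received pages matters because the same page may be wanted by both paths, and installing a page that is already present fails rather than overwriting guest memory.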
CRIU + post-copy migration
● Different address spaces
  ○ Restore controller
  ○ Restored processes
● Basic flow similar to VMs
  ○ Start a daemon for user fault handling
  ○ Register restored process areas with userfaultfd
    ■ Might be quite a few uffds
  ○ Handle page faults
  ○ Fetch non-faulting memory in the background
● BUT
Non-cooperative userfaultfd
● Page fault cannot block restorer
  ○ Use UFFDIO_WAKE ioctl
● Processes change mappings on the fly
  ○ fork()
  ○ madvise(..., MADV_DONTNEED)
  ○ mremap()
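A monitor that lives outside the faulting process has to follow these mapping changes. The sketch below shows one possible shape of that bookkeeping, together with a thin UFFDIO_WAKE wrapper. It assumes the event names found in later versions of `linux/userfaultfd.h` (UFFD_EVENT_FORK, UFFD_EVENT_REMAP, UFFD_EVENT_REMOVE), which were still work in progress when this talk was given. `struct lazy_area`, `lazy_area_apply` and `uffd_wake` are hypothetical names.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical bookkeeping for one lazily restored memory area. */
struct lazy_area {
    unsigned long long start;    /* current address in the restored task */
    unsigned long long len;
    unsigned long long dropped;  /* bytes that must no longer come from the source */
};

/* Wake threads blocked on faults in [start, start + len) without mapping a page. */
int uffd_wake(int uffd, unsigned long long start, unsigned long long len)
{
    struct uffdio_range range;

    memset(&range, 0, sizeof(range));
    range.start = start;
    range.len = len;
    return ioctl(uffd, UFFDIO_WAKE, &range);
}

/*
 * Apply one non-page-fault event to the bookkeeping.
 * Returns 1 if the event was a mapping change, 0 otherwise.
 */
int lazy_area_apply(struct lazy_area *a, const struct uffd_msg *msg,
                    int *child_uffd)
{
    unsigned long long lo, hi, end;

    assert(a != NULL && msg != NULL && child_uffd != NULL);
    end = a->start + a->len;

    switch (msg->event) {
    case UFFD_EVENT_FORK:
        /* The kernel hands over a new uffd that covers the child. */
        *child_uffd = (int)msg->arg.fork.ufd;
        return 1;
    case UFFD_EVENT_REMAP:
        /* mremap() moved the area: follow it to the new address. */
        if (msg->arg.remap.from == a->start) {
            a->start = msg->arg.remap.to;
            a->len = msg->arg.remap.len;
        }
        return 1;
    case UFFD_EVENT_REMOVE:
        /* madvise(MADV_DONTNEED): old contents of the overlap are stale. */
        lo = msg->arg.remove.start > a->start ? msg->arg.remove.start : a->start;
        hi = msg->arg.remove.end < end ? msg->arg.remove.end : end;
        if (hi > lo)
            a->dropped += hi - lo;
        return 1;
    default:
        return 0;
    }
}
```

A fuller tracker would split or shrink areas on removal instead of only counting bytes, and would keep one such record per registered range and per uffd.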
Future
● Kernel WIP
  ○ Write protected pages
  ○ fork, madvise, mremap events
  ○ hugetlbfs
  ○ tmpfs
● CRIU
  ○ Make it work? ;-)
References
● https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt
● http://wiki.qemu.org/Features/PostCopyLiveMigration
● https://criu.org/Userfaultfd