Scatter-Gather VM Migration
Scatter-Gather VM migration enables fast deprovisioning of the source host when the destination host is resource constrained. In this project, we introduce a new metric called eviction time, defined as the time to evict the entire state of a VM from the source host. Eviction time determines how quickly the source host can be taken offline or its freed resources repurposed for other VMs. In traditional approaches to live VM migration, such as pre-copy and post-copy, eviction time equals the total migration time, which is the time taken to transfer the VM’s entire state to the destination. Eviction time therefore increases if the destination host is slow to receive the incoming VM, for example because of insufficient memory or network bandwidth, and this ties up the source host. We present a new approach, called Scatter-Gather live migration, which reduces the eviction time when the destination host is resource constrained. The key idea is for the source host to push out the VM’s memory state as quickly as possible to one or more intermediate hosts in the network. Concurrently, the destination host retrieves the VM’s memory state from the intermediate hosts or middleboxes using a variant of post-copy VM migration.
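The effect on eviction time can be illustrated with a back-of-the-envelope model. This is only a sketch under simplifying assumptions, not the project's implementation: it ignores page dirtying and protocol overhead, treats bandwidth as constant, and the function names and all numbers are hypothetical.

```python
def eviction_time_direct(vm_mem_gb, src_bw_gbps, dst_bw_gbps):
    """Direct migration: the source is tied up until the destination
    has received everything, so the slower side is the bottleneck."""
    bottleneck = min(src_bw_gbps, dst_bw_gbps)
    return vm_mem_gb * 8 / bottleneck  # GB -> Gb, then divide by Gb/s


def eviction_time_scatter_gather(vm_mem_gb, src_bw_gbps, intermediary_bw_gbps):
    """Scatter phase: the source pushes memory to several intermediaries
    in parallel, so the destination's speed no longer matters for eviction."""
    scatter_bw = min(src_bw_gbps, sum(intermediary_bw_gbps))
    return vm_mem_gb * 8 / scatter_bw


if __name__ == "__main__":
    # Hypothetical case: 8 GB VM, 10 Gb/s source, destination limited to 1 Gb/s.
    print(eviction_time_direct(8, 10, 1))
    # Same VM scattered to three intermediaries that accept 4 Gb/s each.
    print(eviction_time_scatter_gather(8, 10, [4, 4, 4]))
```

In this model the destination's receive rate drops out of the eviction time entirely; it only affects how long the gather phase takes.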
Live Gang Migration
VM migration plays an important role in facilitating proactive maintenance and load balancing in datacenters, helping them deal with imminent failures or sudden load spikes. However, excessive network overhead when migrating VMs may violate the performance requirements of VMs dictated by Service Level Agreements (SLAs). This overhead increases greatly when multiple VMs are migrated together. The resulting migration traffic overloads the core network links and degrades the network-bound applications running across the datacenter. State-of-the-art live migration techniques that optimize the migration of a single VM are insufficient when hundreds of VMs need to be migrated simultaneously. We present a network-friendly approach that works both within a host and across the entire cluster to reduce the network overhead of simultaneous live migration of multiple VMs. Our cluster-wide deduplication technique eliminates the retransmission of duplicate VM pages, reducing the network traffic by 60% compared to the default technique of QEMU/KVM. In addition, at each source host we apply differential compression to co-located VMs in order to exploit content similarity across nearly-identical pages.
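The deduplication idea can be sketched with content hashing. This is a minimal illustration, assuming pages are identified by a SHA-256 digest and that a duplicate page is replaced by a short reference to a page already sent; the function names and message layout are hypothetical, and differential compression of nearly-identical pages is not shown.

```python
import hashlib

PAGE_SIZE = 4096    # assumed page size in bytes
DIGEST_SIZE = 32    # SHA-256 digest length in bytes


def plan_transfer(vm_pages):
    """vm_pages maps a VM name to its list of page contents (bytes).
    Returns one message per page: full data the first time a content
    is seen, and only a digest reference for every later duplicate."""
    seen = set()
    messages = []
    for vm, pages in vm_pages.items():
        for index, page in enumerate(pages):
            digest = hashlib.sha256(page).digest()
            if digest in seen:
                messages.append((vm, index, "ref", digest))
            else:
                seen.add(digest)
                messages.append((vm, index, "data", page))
    return messages


def bytes_on_wire(messages):
    """Payload bytes only; per-message headers are ignored."""
    return sum(PAGE_SIZE if kind == "data" else DIGEST_SIZE
               for _, _, kind, _ in messages)


if __name__ == "__main__":
    zero = bytes(PAGE_SIZE)
    lib = b"\x01" * PAGE_SIZE
    private = b"\x02" * PAGE_SIZE
    plan = plan_transfer({"vm1": [zero, lib], "vm2": [zero, private, lib]})
    print(bytes_on_wire(plan), "bytes instead of", 5 * PAGE_SIZE)
```

A real implementation would also have to handle pages that change after their digest was recorded, which this sketch leaves out.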
The contribution of the thesis is twofold. First, we propose new pre-paging strategies to minimize the number of network-bound page faults in post-copy. Second, we present a new VM migration technique, hybrid pre/post-copy live VM migration, which reduces the total migration time of pre-copy and the downtime of post-copy.
Here, we implement and compare different pre-paging strategies for post-copy VM migration. In post-copy live VM migration, memory transfer is deferred until after the VM’s CPU state has been transferred to the destination host and the VM has been resumed there. Pages that the VM faults on at the destination are demand-paged over the network, so the responsiveness of the VM depends on the number of network-bound page faults. The goal of pre-paging is to actively push pages to the destination before the VM running there faults on them. We take into account the page-fault pattern of the processes running inside the VM to make an informed choice of which pages should be pre-paged before others. This significantly reduces the number of network-bound page faults and improves the VM’s responsiveness.
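One way to picture a fault-driven pre-paging order is sketched below. It is only an illustration and not necessarily one of the strategies evaluated in the thesis: it assumes that pages near the most recent fault are the most likely to be touched next, so it pushes pages outward from that fault. The function name and parameters are hypothetical.

```python
def prepaging_order(num_pages, last_fault, already_sent=()):
    """Return the order in which to push pages to the destination:
    the faulted page first, then its neighbours at growing distance,
    skipping pages the destination already has."""
    skip = set(already_sent)
    order = []
    for distance in range(num_pages):
        if distance == 0:
            candidates = (last_fault,)
        else:
            candidates = (last_fault + distance, last_fault - distance)
        for page in candidates:
            if 0 <= page < num_pages and page not in skip:
                order.append(page)
    return order


if __name__ == "__main__":
    # 6-page VM, fault on page 2, page 3 already transferred.
    print(prepaging_order(6, 2, already_sent={3}))
```

Each new fault would restart the ordering from the new fault location, so the push stream keeps following the VM's current working set.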
Hybrid pre/post-copy migration improves upon pre-copy and post-copy by reducing total migration time and downtime for write-intensive VM workloads. With read-mostly VM workloads, pre-copy VM migration yields low total migration time and downtime. However, because of its iterative nature, pre-copy incurs high total migration time even for slightly write-intensive applications. Post-copy migration, on the other hand, reduces the total migration time relative to pre-copy, but at the cost of increased VM downtime. Hybrid live VM migration provides the best of both worlds: it reduces the total migration time, especially for write-intensive VM workloads, while delivering shorter downtime and less application degradation than post-copy by reducing the number of network-bound page faults. Our evaluation shows that hybrid migration is capable of migrating a VM with just 250 ms of downtime, two times shorter than post-copy migration.
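The interplay of the two phases can be sketched with a toy model. It assumes the hybrid scheme performs a single bounded pre-copy round and then switches to post-copy for the pages dirtied during that round; the text above does not spell out the mechanism, so treat this phase structure, the function name, and the parameters as assumptions.

```python
def hybrid_migrate(num_pages, dirtied_during_precopy, access_trace):
    """Toy model of a hybrid migration.
    Returns (total_pages_sent, network_bound_faults)."""
    # Phase 1 (assumed): one pre-copy round sends every page once
    # while the VM keeps running at the source.
    pages_sent = num_pages
    # Pages written during that round are stale at the destination.
    stale = set(dirtied_during_precopy)

    # Phase 2: CPU state moves and the VM resumes at the destination.
    # Only accesses to stale pages fault over the network.
    faults = 0
    for page in access_trace:
        if page in stale:
            faults += 1
            pages_sent += 1
            stale.discard(page)

    # Stale pages that were never touched are pushed in the background.
    pages_sent += len(stale)
    return pages_sent, faults


if __name__ == "__main__":
    # 10-page VM, pages 1-3 dirtied during pre-copy, then accesses 1, 5, 1, 3.
    print(hybrid_migrate(10, {1, 2, 3}, [1, 5, 1, 3]))
```

In this model the number of network-bound faults is bounded by the number of pages dirtied during the pre-copy round rather than by the VM's full working set, and no page is sent more than twice.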
Intern, Red Hat Inc.
Implemented and evaluated a prototype of a dedicated VM migration thread in QEMU, which improved the responsiveness of VMs during migration. The final version is now part of mainline QEMU.
Senior Development Engineer, Calsoft Inc., India