Introduction

Memory Overcommit is a feature of the Nutanix AHV hypervisor that allows you to pack more VMs into a host than its physical memory capacity would normally accommodate. It’s also a feature familiar to many customers of competing hypervisors who are looking to move to AHV. Providing more vCPUs to guests than you have physical cores is commonplace, and with overcommit you can do something similar with memory - e.g., potentially fitting 12 VMs of 10GB each into a host with just 100GB of physical memory. But nothing comes for free, and the use of Memory Overcommit needs to balance the advantages against the potential implications.

Memory Overcommit is useful in test and development environments, but is not recommended in production environments due to the potential performance reduction and other implications.

This blog post expands on both the Memory Overcommit documentation in the AHV Administration Guide, and the Book Of AHV Architecture in the Nutanix Cloud Bible, exploring some of the internal details of how Memory Overcommit works and describing some of the implications. 

How does memory overcommit work on AHV?

There are two key ways in which AHV can squeeze more VMs into a host than would normally be permitted without memory overcommit - Ballooning and Swapping.

Ballooning is a co-operative process with the guest operating system: a component running in the guest (provided by the VirtIO drivers) can reserve memory that is not in use within the guest and return it to the host for use by other VMs. This method works best where guests are allocated more memory than they typically consume.

The following diagram shows an example where three 20GB VMs can fit into a host with 40GB of physical memory by reserving unused memory from the guests and allowing other VMs to make use of that physical memory.


         Available to guest   Ballooned   Swapped
VM1      10GB                 10GB        0GB
VM2      15GB                 5GB         0GB
VM3      15GB                 5GB         0GB
Totals   40GB                 20GB        0GB
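
To make the arithmetic concrete, here is a minimal sketch in Python of the example above. The numbers are taken straight from the table; this is illustrative only, not how AHV accounts for memory internally:

```python
# The three 20GB VMs from the example table, with their ballooned amounts.
vms = {
    "VM1": {"configured_gb": 20, "ballooned_gb": 10},
    "VM2": {"configured_gb": 20, "ballooned_gb": 5},
    "VM3": {"configured_gb": 20, "ballooned_gb": 5},
}

host_physical_gb = 40
configured = sum(v["configured_gb"] for v in vms.values())                    # 60GB
resident = sum(v["configured_gb"] - v["ballooned_gb"] for v in vms.values())  # 40GB

print(f"Configured: {configured}GB, physically backed: {resident}GB")
assert resident <= host_physical_gb  # ballooning is what makes the VMs fit
```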


As well as reserving memory for removal from the guest, balloon drivers also report the guest’s memory usage statistics to the host - including the amount of available memory in the VM and the in-guest swap usage. AHV also considers the host-level swap usage for the VM. These metrics are combined to produce an estimate of how much memory the guest is actively using, and an indication of whether the VM needs additional memory. This allows the host to make informed resizing decisions: it can remove unused memory from one VM and give it to another VM that needs more memory to run its workload, ballooning down the VMs that are not reporting a need for additional memory to make memory available to the VMs that do. The AHV host can see that a VM needs additional memory, but not necessarily how much - so this process of reallocating physical memory between guests constantly re-evaluates and responds to the dynamic needs of the guests.
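
As a rough illustration of that feedback loop, the sketch below estimates each guest’s active memory from the statistics described above and balloons down the guests that don’t need more memory. All field and function names here are invented for illustration; the real heuristics inside AHV are more involved:

```python
def estimate_active_gb(vm):
    # Active memory = configured memory minus what the guest reports as
    # available, plus anything already pushed out to guest or host swap.
    return (vm["configured_gb"] - vm["available_gb"]
            + vm["guest_swap_gb"] + vm["host_swap_gb"])

def needs_memory(vm):
    # Little available memory or growing swap use indicates memory pressure.
    return vm["available_gb"] < 1 or vm["guest_swap_gb"] > 0

def rebalance(vms, safety_margin_gb=1):
    freed_gb = 0
    for vm in vms:
        if not needs_memory(vm):
            # Balloon down idle guests, keeping a small margin of free memory.
            reclaim = max(vm["available_gb"] - safety_margin_gb, 0)
            vm["balloon_target_gb"] = vm.get("balloon_target_gb", 0) + reclaim
            freed_gb += reclaim
    return freed_gb  # physical memory now available for the VMs that need it
```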

If guests are actively using their memory, the balloon driver can’t reserve memory to return to the host, so to fit more VMs into the host AHV will use a host swap mechanism. The same applies if memory has been reserved by the balloon driver and the guest then starts to need it: inside the guest, an out-of-memory handler may be triggered when the guest attempts to use memory that the balloon driver has reserved. In this scenario the balloon shrinks immediately so guest operations aren’t interrupted - which can in turn trigger the use of host swap. Note that ballooning is the preferred mechanism for fitting more VMs into a host on AHV as it is more performant; if the system can’t recover enough memory through ballooning, AHV may have to use host swap instead. The performance impact of host swap can be substantial, which is why ballooning is the preferred method, and why every guest OS running inside a memory overcommit VM is strongly recommended to have a working balloon driver.
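
Conceptually, the order of preference looks something like the sketch below (a simplification with hypothetical names, not AHV’s actual code path):

```python
def balloon_reclaim(vm, want_gb):
    # The guest can only give back memory that it is not actively using.
    freed = min(want_gb, vm["idle_gb"])
    vm["idle_gb"] -= freed
    return freed

def host_swap_out(vm, want_gb):
    # Host swap can always free memory, at a large performance cost.
    vm["swapped_gb"] = vm.get("swapped_gb", 0) + want_gb
    return want_gb

def reclaim(vm, needed_gb):
    # Ballooning is preferred; host swap is the fallback of last resort.
    freed = balloon_reclaim(vm, needed_gb)
    if freed < needed_gb:
        freed += host_swap_out(vm, needed_gb - freed)
    return freed
```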

Note that, just like an in-guest swap disk, the amount of each VM’s memory that is swapped out to host disk will change dynamically - VMs that are actively using their memory are more likely to have that memory resident in physical memory rather than swapped out to disk.

The following diagram shows an example where AHV is using host swap as a fallback when the amount of memory in use within the VMs exceeds the total available physical memory. If there is unused memory inside a guest, it can still be reserved by the balloon driver to reduce the amount of host swap required.


         Available to guest   Ballooned   Swapped
VM1      18GB                 2GB         8GB
VM2      20GB                 0GB         5GB
VM3      18GB                 2GB         3GB
Totals   56GB                 4GB         16GB
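
Cross-checking the numbers: each VM is still configured with 20GB, and whatever is neither ballooned nor swapped must be resident in physical memory:

```python
# Numbers from the host-swap example table above.
vms = [
    {"available_gb": 18, "ballooned_gb": 2, "swapped_gb": 8},  # VM1
    {"available_gb": 20, "ballooned_gb": 0, "swapped_gb": 5},  # VM2
    {"available_gb": 18, "ballooned_gb": 2, "swapped_gb": 3},  # VM3
]
resident = sum(v["available_gb"] - v["swapped_gb"] for v in vms)
print(resident)  # 40 -- exactly the host's 40GB of physical memory
```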

It’s worth highlighting the recommendation from the Memory Overcommit documentation to configure guest swap inside the VM - this remains important even when host swap is available.

Applications can make different types of memory allocations, which give the operating system hints about how to use guest-level swap - including marking memory that the operating system should never put into its own swap space. Further, the guest can decide what to put into guest-level swap based on its knowledge of how its processes are interacting. The host has no visibility of these nuances, so it operates a much simpler mechanism: AHV swaps out memory using a Least Recently Used (LRU) algorithm. Conversely, the guest isn’t aware of which memory has been swapped out at the host level, so it will see unexpected latency spikes on memory accesses while that memory is swapped back in.
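
For illustration, here is what an LRU policy looks like in miniature. This is a generic textbook sketch, not AHV’s actual swap implementation:

```python
from collections import OrderedDict

class LRUPages:
    """Generic LRU cache of memory pages; evicted pages go to swap."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page id -> data, least recent first

    def touch(self, page_id, data=None):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # now the most recently used
            return
        if len(self.pages) >= self.capacity:
            victim, _ = self.pages.popitem(last=False)  # least recently used
            print(f"swapping out page {victim}")        # write to disk here
        self.pages[page_id] = data
```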

The guest operating system is likely to be able to operate its swap disk more efficiently than the host can with overcommit - so instead of allocating 20GB memory to the VM and assuming that host swap will be used, consider allocating 10GB and creating a 10GB swap disk within the guest.

ADS and Memory Overcommit

If we consider a host with 40GB of memory and three overcommit-enabled 10GB VMs already running on it, how does the system start a new 15GB VM? AHV uses two methods to start VMs, which internally Nutanix calls Initial Placement and ADS. Initial Placement is the simple case: if there are enough resources, and no special policies in play, the VM can be placed immediately on a host. If, on the other hand, AHV can’t find a host with enough resources available, it will make use of ADS - the Acropolis Dynamic Scheduler - to see whether it can make enough space to start the VM.
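
A sketch of that two-step decision is shown below. The data structures and the pool-shrinking step are simplified stand-ins for illustration, not the Acropolis scheduler’s real interface:

```python
def place_vm(vm_gb, hosts):
    # Initial Placement: the simple case - some host has the memory free.
    for h in hosts:
        if h["free_gb"] >= vm_gb:
            return h["name"]
    # Otherwise, ADS checks whether shrinking a host's overcommit pool
    # (ballooning/swapping its overcommit-enabled VMs) can make space.
    for h in hosts:
        reclaimable = h["pool_gb"] - h["pool_min_gb"]
        if h["free_gb"] + reclaimable >= vm_gb:
            h["pool_gb"] -= vm_gb - h["free_gb"]  # shrink the pool
            h["free_gb"] = 0
            return h["name"]
    return None  # the VM cannot be started anywhere

hosts = [{"name": "host-1", "free_gb": 10, "pool_gb": 30, "pool_min_gb": 20}]
print(place_vm(15, hosts))  # host-1, after shrinking its pool to 25GB
```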

When placing overcommitted VMs on a host, AHV uses the concept of a “pool” of memory. This allows overcommit-enabled and overcommit-disabled VMs to be placed on the same host, alongside host reservations for high availability. The pool is the amount of memory reserved for all of the running overcommitted VMs on an individual AHV host. While the pool is never larger than the sum of the configured memory of these VMs, it can be smaller when the system is under memory pressure. In the 40GB host with three overcommit-enabled 10GB VMs, the pool size would be 30GB and all VMs would be running with their memory fully backed by physical memory. If a new 15GB VM with overcommit disabled then needed to be started on that host, the overcommit pool would shrink to 25GB and the overcommit-enabled VMs would be adapted (either ballooned or swapped) to make enough space on the host to run the new 15GB VM. In reality there is also a small buffer applied to the pool so each AHV host can respond dynamically to an individual guest’s demand for additional memory without always having to free that memory first; this means the actual memory used by overcommit-enabled VMs is slightly less than the size of the overcommit pool.
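
Walking through the pool arithmetic from that example (simplified - the real buffer sizing is internal to AHV):

```python
host_gb = 40
overcommit_vms_gb = [10, 10, 10]
pool_gb = sum(overcommit_vms_gb)  # 30GB: every VM fully backed by RAM

# A new 15GB overcommit-disabled VM needs its full allocation reserved,
# so the pool shrinks and the overcommit-enabled VMs adapt to fit it.
fixed_vm_gb = 15
pool_gb = min(pool_gb, host_gb - fixed_vm_gb)
print(pool_gb)  # 25GB shared between the three 10GB (30GB configured) VMs
```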

Some guest operating systems will touch every byte of memory on boot (e.g. to wipe it). In boot storm scenarios, where many guests are booting at the same time, overall guest boot times would be impacted very significantly if those guests were not using physical memory. To avoid such bottlenecks, VM start-up is a particular use case where ADS will ensure that even overcommitted VMs can fully fit into physical memory. This is achieved by ADS reducing the overcommit pool size to free up enough physical memory for the new VM to start with 100% of its allocated memory. During the boot process for this new overcommit-enabled VM, the pool size is automatically increased by the amount of memory allocated to that VM.

For example, in the 40GB host with three overcommit-enabled 10GB VMs, starting a new 15GB overcommit-enabled VM would shrink the pool to 25GB (adjusting the three 10GB VMs as needed); when the new 15GB VM starts, the pool grows to 40GB, and AHV then dynamically controls balloon and host swap usage within the expanded pool. Over time, AHV balances the memory needs of the new 15GB guest against the other overcommit-enabled guests on the host within the pool.
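
The same example as a step-by-step sketch of the pool resizing during boot:

```python
host_gb = 40
pool_gb = 30    # three 10GB overcommit-enabled VMs, fully backed
new_vm_gb = 15  # new overcommit-enabled VM to start

# Step 1: ADS shrinks the pool so the new VM can boot fully backed by RAM.
pool_gb = host_gb - new_vm_gb  # 25GB
# Step 2: as the VM boots it joins the pool, which grows by its size.
pool_gb += new_vm_gb           # 40GB, now shared dynamically by all four VMs
print(pool_gb)
```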

Note that ADS has a strong focus on minimising any potential performance impact on VMs, across storage, CPU usage, memory overcommit and other factors. ADS will therefore only shrink the pool on a host to place a new VM there if there is an absolute need. It prefers to place VMs on hosts with enough memory capacity to avoid using memory overcommit at all, and if memory overcommit is in use in a cluster, ADS will attempt to move VMs around when there is capacity, to reduce that usage and minimise potential performance impacts on VMs. There are also cases where ADS may not be able to shrink the pool to make space to start new VMs, if it observes that VMs on the host are already experiencing significant performance impacts as a result of memory overcommit - for example, as indicated by high utilisation of host or guest swap.

Memory overcommit ratio

As discussed in the AHV Administration Guide, each VM is guaranteed to have at least 25% of its allocated memory backed by the host’s physical RAM - i.e., the most that AHV will reserve using ballooning and the host’s swap disk is 75% of the allocated memory. This ensures that frequently used parts of the guest, such as its kernel, can remain in physical memory at all times.

Note that while the theoretical maximum with each VM squeezed to 25% physical memory is 4x overcommit, each new VM must start with 100% of its memory, so the actual total overcommit achievable on a host may be lower. For example, on a 40GB host with four 10GB VMs, those VMs can shrink to 25% of their allocated size (10GB in total), leaving space to start 3 more 10GB VMs. With 10GB VMs on a 40GB host, you can reach a 3.25x overcommit ratio by starting new VMs over time, before there is no longer enough space to start another:

Number of 10GB   Sum of VMs’   Overcommit   Minimum squeezed   Memory available   Number of 10GB
VMs running      memory        multiplier   size (25%)         for new VMs        VMs startable
4                40GB          1x           10GB               30GB               3
7                70GB          1.75x        17.5GB             22.5GB             2
9                90GB          2.25x        22.5GB             17.5GB             1
10               100GB         2.5x         25GB               15GB               1
11               110GB         2.75x        27.5GB             12.5GB             1
12               120GB         3x           30GB               10GB               1
13               130GB         3.25x        32.5GB             7.5GB              0
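
The table can be reproduced with a short loop: at each step, every running VM can be squeezed to 25% of its configured size, and new VMs start fully backed:

```python
host_gb, vm_gb, min_fraction = 40, 10, 0.25

running = 4  # first wave boots fully backed: 4 x 10GB fills the 40GB host
while True:
    squeezed = running * vm_gb * min_fraction  # minimum resident memory
    available = host_gb - squeezed             # room for new, fully backed VMs
    startable = int(available // vm_gb)
    print(f"{running} VMs, {running * vm_gb}GB, {running * vm_gb / host_gb:.2f}x: "
          f"min {squeezed}GB, {available}GB free, {startable} startable")
    if startable == 0:
        break
    running += startable
```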

Note that the above calculations all assume that high availability is not enabled for the cluster.

The purpose of high availability is to recover guests expeditiously in the event of a failure. As noted above, some guests access 100% of their memory at boot time, but if we always reserved 100% of a guest’s memory for high availability, it would effectively mean that high availability and memory overcommit could not be used in parallel. While we don’t recommend memory overcommit for production scenarios, and high availability is mostly used in production, there are still options to use the two together.

To ensure that memory can be reserved for VMs when guaranteed high availability and memory overcommit are used in conjunction with each other, we assume that every overcommit-enabled VM has 75% of its allocated memory backed by physical memory, ensuring that it can boot and avoid the worst of the performance impacts.

Obviously, just enabling high availability, which reserves memory to restart VMs after a failure, will reduce the total memory available for use by VMs. In conjunction with backing 75% of assigned memory with physical memory for overcommit-enabled VMs, this means that when using HA and overcommit together you can achieve roughly a 1.33x overcommit of the host’s memory, compared with up to 4x overcommit when not using HA.
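
The 1.33x figure follows directly from the 75% reservation:

```python
# With HA, each overcommit-enabled VM keeps 75% of its configured memory
# backed by physical RAM, so a host can back at most 1/0.75 of its
# physical memory in configured VM memory.
reserved_fraction = 0.75
print(f"{1 / reserved_fraction:.2f}x")  # 1.33x, versus a theoretical 4x without HA
```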

Upgrades and Live Migrations

Upgrades in AHV typically start by putting a host into maintenance mode. To complete the upgrade, AHV needs to restart the host, so all VMs must be migrated from this host to other hosts in the cluster to avoid VM downtime.

When guaranteed HA is enabled, AHV has already identified a second host where each VM can reside, so a live migration is triggered for each VM to that host. If guaranteed HA is not enabled, AHV has to identify a new host for each VM. Rather than looking for a host with enough spare memory for 100% of the VM’s allocated memory, AHV already knows how much physical memory the VM is using on the original host - so it will identify a host with at least that amount of memory spare for the incoming migration.

Migrations on AHV work by iteratively copying memory from the source host to the destination host (see the Book of AHV Architecture). This has implications for overcommit-enabled VMs, because any memory in host swap must be swapped back in from disk before the live migration process can copy it to the destination host. Live migrating an overcommit-enabled VM will therefore take more time than live migrating an overcommit-disabled VM. It can also mean that in times of heavy system load, a live migration of an overcommit-enabled VM adds additional load to the system, potentially impacting performance further. As such, when calculating a migration plan to resolve hotspots, ADS treats overcommit-enabled VMs as having a higher cost of migration than equivalent VMs which do not have overcommit enabled.
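
A sketch of why that cost weighting exists is below; the cost model is invented for illustration, and ADS’s real heuristics are internal to AHV:

```python
def migration_cost_gb(vm):
    # Host-swapped memory must be paged back in from disk before it can be
    # copied to the destination, so it adds disproportionate work and load.
    swap_penalty = 2.0 if vm["swapped_gb"] > 0 else 1.0
    return (vm["resident_gb"] + vm["swapped_gb"]) * swap_penalty

print(migration_cost_gb({"resident_gb": 10, "swapped_gb": 0}))  # 10.0
print(migration_cost_gb({"resident_gb": 10, "swapped_gb": 5}))  # 30.0
```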

Summary

Memory Overcommit is one of the many useful features in AHV’s toolbox for making the best use of system resources. Nutanix’s implementation of Memory Overcommit is integrated with other features, such as ADS, to ensure that the impact on VM performance is minimised. Whether memory overcommit is suitable for use in your environment will depend on many different factors, such as the level of overcommit desired and the activity levels within the VMs.

Follow Nutanix’s recommendations when considering Memory Overcommit to help minimise the impact to any guests. 

©2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).