QEMU Virtual Machine Optimization

2018.07.15 | Yuki Rea

There are many ways of fine tuning QEMU virtual machine performance. This monolithic article contains a variety of methods to optimize and improve VM performance for different types of applications. While this article is specific to Debian 9 and QEMU, it can easily be adapted to work for other Linux distributions and applications.


Terminology

  • Bare-Metal
    • A physical computer system and the base level operating system running on it.

  • Host
    • A bare-metal machine that virtual machines are run on.

  • Thread
    • A core or SMP/Hyperthreaded core of a CPU.

  • CPU thread
    • A thread on the host system CPU.

  • vCPU thread
    • A thread on a virtual machine's virtualized CPU.

CPU Isolation

Isolating a VM's vCPU threads from the host system ensures that processes on the host system can not effect the performance of isolated VMs. This can be set at boot time with the static kernel parameter isolcpus or dynamically during runtime using cpuset or cset shield (an interface for cpuset).


Isolating QEMU Threads From Host Processes Using "isolcpus"

Using isolcpus to shield CPU threads from the Linux process scheduler is the method to use if having the absolute lowest latency possible for a virtual machine or other latency sensitive application is required. While this method may result in the lowest latency possible, it is also the least flexible. This kernel parameter is set at boot time and will prevent the Linux process scheduler from running tasks on all isolated threads. It is necessary to manually bind each application's PID/s to isolated CPU threads and also requires a reboot to return the isolated CPU threads back to the Linux process scheduler.

Edit /etc/default/grub adding the isolcpus=[list] kernel parameter to the GRUB_CMDLINE_LINUX_DEFAULT variable. [list] is a list or range of host CPU threads to be isolated from the Linux process scheduler. Separate with , to select individual threads or separate with - to select a range of threads.

/etc/default/grub
 1 # If you change this file, run 'update-grub' afterwards to update
 2 # /boot/grub/grub.cfg.
 3 # For full documentation of the options in this file, see:
 4 #   info -f grub -n 'Simple configuration'
 5 
 6 GRUB_DEFAULT=0
 7 GRUB_TIMEOUT=0
 8 GRUB_RECORDFAIL_TIMEOUT=$GRUB_TIMEOUT
 9 GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
10 GRUB_CMDLINE_LINUX=""
11 
12 # specify individual threads
13 GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on isolcpus=8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31"
14 # or specify a range of threads
15 GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on isolcpus=8-15,24-31"

Save and close the file then run update-grub to update grub.

Reboot and run a system monitor like htop or gnome-system-monitor to ensure that the correct CPU threads have been isolated and no processes are running on them. Use the stress -c`nproc` command to stress the host CPU. There should be no load on the isolated CPU threads.



Each QEMU vCPU thread needs to be manually bound to an isolated CPU thread, use taskset or the QEMU CPU Affinity patch to do this. Both methods are explained later in this article.


Dynamically Isolating QEMU Threads From Host Processes Using "cset shield"

cset shield is much more flexible than isolcpus and makes it possible to create/remove and modify isolated CPU thread groups called shields on-the-fly with out rebooting the host or stopping the VM. The performance is nearly identical to isolcpus but has a slight latency penalty because some low-level kernel processes are unable to be moved out of the shield.

First create a shield, do this by running the following command/s.

1 # specify individual threads
2 cset shield -c 8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31
3 # or specify a range of threads
4 cset shield -c 8-15,24-31
5 # move all movable kernel threads off of the shield
6 cset shield --kthread on

QEMU can be started either by using cset shield or by adding the PID/s to the shield after the fact. Use the following command start QEMU with cset shield.

cset shield --exec qemu-system-x86_64 -- -$OPTS

Alternatively, add already running PID/s to the shield like so.

cset shield --shield --threads --pid $(pidof qemu-system-x86_64)

It may be desirable to disable the shield to allow the host to share CPU resources with the VM again at certain times. This will/should not cause the VM to pause or terminate. Do this by using the following command.

cset shield --reset

Dynamically Isolating QEMU Threads From Host Processes Using "cpuset / cgroups"

cpuset is a subsystem of cgroups which makes it possible to create/remove and modify isolated CPU thread groups on-the-fly with out rebooting the host system or stopping the VM. This is the underlying kernel driver that cset uses and therefore both methods perform the same. However, interacting with the kernel driver directly allows greater flexibility. By using cpuset, assigning separate cpusets for different types of QEMU threads allows for even further isolation. This example sets QEMU vCPU threads to match up 1/1 with the second CPU node in the host system, then sets the I/O threads to the last 2 threads of the first CPU node so heavy I/O access has minimal effect on the VM's CPU performance.

First create the following cpusets using the following commands like so.

 1 # create a cpuset called "system" for all other processes running on the host system
 2 mkdir /sys/fs/cgroup/cpuset/system
 3 # set cpuset.mems to the memory nodes which processes in the "system" cpuset should have access to
 4 echo 0-1 > /sys/fs/cgroup/cpuset/system/cpuset.mems
 5 # specify which cores are available to the "system" cpuset (first 7 cores on the first numa node and their SMP threads)
 6 echo 0-6,16-22 > /sys/fs/cgroup/cpuset/system/cpuset.cpus
 7 
 8 # create a cpuset called "qemu-virt" for the QEMU parent process and all of it's worker processes
 9 mkdir /sys/fs/cgroup/cpuset/qemu-virt
10 # set cpuset.mems to the memory nodes which processes in the "qemu-virt" cpuset should have access to
11 echo 0-1 > /sys/fs/cgroup/cpuset/qemu-virt/cpuset.mems
12 # specify which cores are available to the "qemu-virt" cpuset (last core on the first numa node and it's SMP thread)
13 echo 7,23 > /sys/fs/cgroup/cpuset/qemu-virt/cpuset.cpus
14 
15 # create a cpuset called "qemu-vcpu" for QEMU vCPU processes
16 mkdir /sys/fs/cgroup/cpuset/qemu-vcpu
17 # set cpuset.mems to the memory nodes which processes in the "qemu-vcpu" cpuset should have access to
18 echo 0-1 > /sys/fs/cgroup/cpuset/qemu-vcpu/cpuset.mems
19 # specify which cores are available to the "qemu-vcpu" cpuset (the entire second node)
20 echo 8-15,24-31 > /sys/fs/cgroup/cpuset/qemu-vcpu/cpuset.cpus

Next move all processes from the root cpuset into the system cpuset with the following commands.

1 # copy all PIDs in the "root" cpuset to "system" cpuset
2 cat /sys/fs/cgroup/cpuset/tasks | xargs -n1 -i echo {} > /sys/fs/cgroup/cpuset/system/tasks
3 # clear all PIDs from the "root" cpuset
4 echo > /sys/fs/cgroup/cpuset/tasks

Now move the QEMU parent PID into the qemu-virt cpuset.

 1 # NOTE:
 2 # "$NAME" is a variable we set to differentiate VMs from each other on the host system. It should be identical to the "-name" qemu argument
 3 # for the "for" loop to work and to differentiate QEMU vCPU threads from other threads, it is necessary to add the '-name $NAME,debug-threads=on' QEMU argument
 4 
 5 # move QEMU parent PID into the "qemu-virt" cpuset
 6 echo $(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | awk -F',' '{print $2}' | awk '{print $1}') > /sys/fs/cgroup/cpuset/qemu-virt/tasks ;
 7 # create a counter and set it to 0
 8 HOST_THREAD=0
 9 # for every QEMU vCPU PID
10 for PID in $(pstree -pa $(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | awk -F',' '{print $2}' | awk '{print $1}') | grep CPU |  pstree -pa $(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | cut -d',' -f2 | cut -d' ' -f1) | grep CPU | sort | awk -F',' '{print $2}')
11     do let HOST_THREAD+=1
12     # add QEMU vCPU PIDs to "qemu-vcpu" cpuset
13     echo $PID > /sys/fs/cgroup/cpuset/qemu-vcpu/tasks
14 done

Removing the cpusets and returning the system back to normal is just like before but in reverse.

 1 # copy all PIDs in the "system" cpuset to "root" cpuset
 2 cat /sys/fs/cgroup/cpuset/system/tasks | xargs -n1 -i echo {} > /sys/fs/cgroup/cpuset/tasks
 3 # remove PIDs from the "system" cpuset
 4 echo > /sys/fs/cgroup/cpuset/tasks
 5 # remove the "system" cpuset
 6 rmdir /sys/fs/cgroup/cpuset/system
 7 
 8 # copy all PIDs in the "qemu-virt" cpuset to "root" cpuset
 9 cat /sys/fs/cgroup/cpuset/qemu-virt/tasks | xargs -n1 -i echo {} > /sys/fs/cgroup/cpuset/tasks
10 # remove PIDs from the "qemu-virt" cpuset
11 echo > /sys/fs/cgroup/cpuset/tasks
12 # remove the "qemu-virt" cpuset
13 rmdir /sys/fs/cgroup/cpuset/qemu-virt
14 
15 # copy all PIDs in the "qemu-vcpu" cpuset to "root" cpuset
16 cat /sys/fs/cgroup/cpuset/qemu-vcpu/tasks | xargs -n1 -i echo {} > /sys/fs/cgroup/cpuset/tasks
17 # remove PIDs from the "qemu-vcpu" cpuset
18 echo > /sys/fs/cgroup/cpuset/tasks
19 # remove the "qemu-vcpu" cpuset
20 rmdir /sys/fs/cgroup/cpuset/qemu-vcpu

CPU Affinity

It is important to have system architecture in mind when creating high performance, low latency virtual machines. Optimizing which CPU threads a VM utilizes in relation to the host system's architecture can result in significant improvements. Ensuring that VMs are allocated CPU threads that are on the same node or in the same compute module can dramatically decrease latency spikes and improve overall performance.


A perfect example of when optimizing a VM for a specific CPU architecture results in significant performance gains is the AMD FX line of CPUs. They feature a unique architecture which also happens to be one of their major downfalls. The AMD FX 8xxx/9xxx series processors consist of 4 compute modules, each with 2 integer cores that have their own level one(L1) cache. Although each integer core has its own L1 cache, they have to share the fetch and decode stages of each cycle, a single L2 cache module, and a single floating point unit(FPU). These shared resources mean that one integer core may have to wait for the other in certain circumstances. By setting a VM's CPU affinity to a staggered arrangement, each vCPU will be in a different compute module on the host CPU. This means each vCPU has access to its own L2 cache and FPU on the host CPU giving the VM access to twice the L2 cache and FPUs as it would otherwise have when using a sequential CPU affinity.


Below is the test results from a 4 core VM on a host system with a 5 GHz AMD FX 8350 CPU running Debian 9.1. The guest is a Windows 10 VM with networking disabled so that Windows update can not impact the test results. You can see that using cores (0,2,4,6) instead of (0,1,2,3) nets a 13.5% improvement.


Using "taskset" to Set CPU Affinity

This method requires no modifications to QEMU itself which can be good if you want to get QEMU updates from your distribution's software repository as opposed to patching and re-compiling QEMU from source every time you decide to update it. This is a simple shell script which uses the debug-threads QEMU argument and taskset to find vCPU threads and pin them to an affinity variable set elsewhere in the script.

 1 #!/bin/bash
 2 
 3 # clear options
 4 OPTS=""
 5 
 6 # set vm name
 7 NAME="PARASITE"
 8 
 9 # host affinity list
10 THREAD_LIST="8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31"
11 
12 # qemu options
13 OPTS="$OPTS -name $NAME,debug-threads=on"
14 OPTS="$OPTS -enable-kvm"
15 OPTS="$OPTS -cpu host"
16 OPTS="$OPTS -smp 16,cores=8,sockets=1,threads=2"
17 OPTS="$OPTS -m 32G"
18 OPTS="$OPTS -drive if=virtio,format=raw,aio=threads,file=/vms/disk-images/windows-10.img"
19 
20 function run-vm {
21 # specify which host threads to run QEMU parent and worker processes on
22 taskset -c 0-7,16-32 qemu-system-x86_64 $OPTS
23 }
24 
25 function set-affinity {
26 # sleep for 20 seconds while QEMU VM boots and vCPU threads are created
27 sleep 20 &&
28 HOST_THREAD=0
29 # for each vCPU thread PID
30 for PID in $(pstree -pa $(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | awk -F',' '{print $2}' | awk '{print $1}') | grep CPU |  pstree -pa $(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | cut -d',' -f2 | cut -d' ' -f1) | grep CPU | sort | awk -F',' '{print $2}')
31 do
32     let HOST_THREAD+=1
33     # set each vCPU thread PID to next host CPU thread in THREAD_LIST
34     echo "taskset -pc $(echo $THREAD_LIST | cut -d',' -f$HOST_THREAD) $PID" | bash
35 done
36 }
37 
38 set-affinity &
39 run-vm

Memory Tuning

Memory, or RAM, can have a large impact on VM performance, especially with certain system architectures that are memory speed and latency sensitive. Preallocating memory for your VM and increasing the memory page size will help reduce memory access latencies and increase overall VM CPU performance.


Preallocated Memory

Preallocating memory is one of the easiest ways to improve VM performance. This setting does exactly what it says, it dedicates and area of memory that is the entire size of the guest VM's memory so that new blocks of memory do not have to be allocated by the host as the guest requests them. This does however leave all memory preallocated to the VM unusable by the host until the guest is halted.

Preallocated memory can be enabled with one QEMU argument. -mem-prealloc


Hugepages

A page is a chunk of system memory (usually 2 or 4KB in size) that a CPU allocates and indexes so that it can be accessed again when needed. Since the operating system and CPU need to keep this index of where each page is located in physical memory, having many small pages increases the total amount of entries in the index and therefore increases the time to look up each entry. Hugepages are used to increase the size and decrease the total number of pages which lowers the time it takes to look up a page in memory.

There are a few steps to enable and use hugepages for your QEMU VM. Start by determining the page sizes that the host CPU supports. Do this by running the following script:

1 if [ "$(cat /proc/cpuinfo | grep -oh pse | uniq)" = "pse" ]
2 then echo "2048K = OK"
3 else echo "2048K = NO"
4 fi
5 if [ "$(cat /proc/cpuinfo | grep -oh pdpe1gb | uniq)" = "pdpe1gb" ]
6 then echo "1G = OK"
7 else echo "1G = NO"
8 fi

It is recommended to use the largest supported hugepage size for the best performance. Edit /etc/default/grub and add the hugepagesz=[size] and default_hugepagesz=[size] kernel parameters. These parameters set the size of each hugepage.

 1 # If you change this file, run 'update-grub' afterwards to update
 2 # /boot/grub/grub.cfg.
 3 # For full documentation of the options in this file, see:
 4 #   info -f grub -n 'Simple configuration'
 5 
 6 GRUB_DEFAULT=0
 7 GRUB_TIMEOUT=0
 8 GRUB_RECORDFAIL_TIMEOUT=$GRUB_TIMEOUT
 9 GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
10 GRUB_CMDLINE_LINUX=""
11 GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G default_hugepagesz=1G"

Save and close the file then run update-grub to update grub. A reboot will be necessary for changes apply.

Hugepages can be allocated dynamically during runtime by echoing a number of hugepages to a file. To allocate hugepages to all nodes, echo the number of hugepages to allocate to /sys/kernel/mm/hugepages/hugepages-[size]/nr_hugepages. To allocate hugepages to a specific node, echo the number of hugepages to allocate to /sys/devices/system/node/node[number]/hugepages/hugepages-[size]/nr_hugepages instead.

1 # allocate hugepages to the entire system's memory pool
2 echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
3 
4 # allocate hugepages to a specific node
5 echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Remove hugepages by echoing 0 instead to the same file.

Use the -mem-path /dev/hugepages argument when starting QEMU to enable the use of hugepages.


Linux Process Scheduler Tuning

There are a few different ways of tuning the Linux process scheduler to reduce latency spikes and increase overall CPU performance. Most of these require using a real-time kernel or compiling a custom kernel to enable these features.


SCHED_FIFO Real-Time Process Priority

The SCHED_FIFO (first in, first out) policy is a real-time policy and has a priority range of 1-99, 99 being the highest. Processes using this policy will run indefinitely or until they are finished and can only be interrupted by process with the same or higher priority.

Set QEMU to the SCHED_FIFO policy with chrt

1 # NOTE:
2 # "$NAME" is a variable we set to differentiate VMs from each other on the host system. It should be identical to the "-name" qemu argument
3 # it is necessary to wait until QEMU has finished booting an OS before changing to a real-time process priority or it will halt virtual disk access
4 
5 # get parent PID of QEMU VM
6 PARENT_PID=$(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | cut -d','  -f2 | cut -d' ' -f1)
7 # set all threads of parent PID to SCHED_FIFO 99 priority
8 pstree -pa $PARENT_PID | cut -d','  -f2 | cut -d' ' -f1 | xargs -L1 echo "chrt -f -p 99" | bash

sched_rt_runtime_us

sched_rt_runtime_us is a kernel parameter that governs how much CPU time in microseconds is allowed to be consumed by real-time tasks. The default value of 950000 allows only 95% CPU time for real-time tasks, increasing this value or disabling it completely can improve latency and will increase the amount of CPU time available to real-time tasks. However, increasing this value or disabling it can also have negative effects on the system. If real-time tasks are not isolated to their own dedicated CPU threads, they will be allowed to use all available CPU time and prevent other, sometimes critical, processes from running and may lock up the host system indefinitely. Only change this parameter if you understand the risks and can configure your system properly to work around them.

Set this parameter during runtime using the sysctl command.

1 # set to -1 to disable 
2 sysctl kernel.sched_rt_runtime_us=-1
3 # or increase the value
4 sysctl kernel.sched_rt_runtime_us=980000

Linux Kernel Tuning

Configuring and compiling a custom Linux kernel may be necessary for low latency virtualization if using the real-time kernel from your distribution's software repository is not good enough for your application.


CONFIG_PREEMPT_RT Real-Time Kernel Patch

The CONFIG_PREEMPT_RT Linux kernel patch allows most of the kernel to be preempted by processes on the system which have a real-time process priority. This means that real-time processes are executed first and will not have to wait for most processes running on the system.

Try downloading the latest real-time kernel and it's source code from your distribution's software repository. If your distribution does not provide a real-time kernel or a newer version is desired, patching a vanilla kernel will be required.

Start by downloading a copy of the Linux kernel source code and a matching real-time kernel patch, in this case linux-4.16.18.

cd into the directory containing the Linux kernel source code and real-time patch archives and extract them.

Downloads/4.16.18/
1 gunzip linux-4.16.18.tar.gz
2 tar xvf linux-4.16.18.tar
3 
4 gunzip patch-4.16.18-rt12.patch.gz

Move the patch into the Linux kernel source directory and then cd into it.

Downloads/4.16.18/
1 mv patch-4.16.18-rt12.patch linux-4.16.18/
2 cd linux-4.16.18

Apply the patch with the patch command.

Downloads/4.16.18/linux-4.16.18/
1 patch -p1 < patch-4.16.18-rt12.patch

Copy the configuration for your existing kernel into the kernel source directory.

Downloads/4.16.18/linux-4.16.18/
cp /boot/config-`uname -r` .config

Use the menu configuration utility to enable the Fully Preemptible Kernel (RT) Preemption Model then save and exit. If the system is unstable with the Fully Preemptible Kernel option, try each of the Preemptible Kernel options.

Downloads/4.16.18/linux-4.16.18/
make menuconfig

These options can be found in the following menu tree of the kernel configuration:
Processor type and features > Preemption Model





Use the next three commands to compile and build a deb package.

Downloads/4.16.18/linux-4.16.18/
1 make -j`nproc`
2 make modules -j`nproc`
3 make -j`nproc` bindeb-pkg

Finally install the packages using dpkg, they will be located up one directory.

Downloads/4.16.18/linux-4.16.18/
1 dpkg -i \
2 ../linux-headers-4.16.18-rt12_4.16.18-rt12-1_amd64.deb \
3 ../linux-image-4.16.18-rt12_4.16.18-rt12-1_amd64.deb \
4 ../linux-libc-dev_4.16.18-rt12-1_amd64.deb

Timer Frequency

Increasing the interrupt timer frequency to 1000Hz can improve the overall latency and response of the system with the downsides of slightly decreasing throughput and increased power consumption. Doing this also requires changing the kernel configuration and re-compiling the kernel.

This option can be found in the following menu tree of the kernel configuration:
Processor type and features > Timer Frequency


References

hugetlbpage.txt
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt


The Kernel Development Community. The kernel’s command-line parameters.
https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html


Robert M. Love. (2014). taskset - set or retrieve a process's CPU affinity.
https://manpages.debian.org/testing/util-linux/taskset.1.en.html


Robert M. Love & Karel Zak. (2016). chrt - manipulate the real-time attributes of a process.
https://manpages.debian.org/testing/util-linux/chrt.1.en.html


SUSE. (2018). Tuning the Task Scheduler.
https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.taskscheduler.html


Alex Tsariounov. (2011). cset-shield - cpuset supercommand which implements cpu shielding.
https://manpages.debian.org/jessie/cpuset/cset-shield.1.en.html