virtualization (VMs)
Tags: computers
KVM (kernel virtual machine)
- Linux-based virtualization; the KVM kernel module and a userspace VMM (e.g., QEMU) work together
- https://www.redhat.com/en/blog/all-you-need-know-about-kvm-userspace
- https://www.redhat.com/en/blog/journey-vhost-users-realm
QEMU
qcow2/3
- Copy-on-write disk image formats for QEMU
Hot Plug Memory
- https://github.com/qemu/qemu/blob/db596ae19040574e41d086e78469014191d7d7fc/docs/memory-hotplug.txt
Replacements
Xen
- Why is it important to provide timers?
- Because OSes aren’t used to being preempted! We want to provide the guest OS real timers so it can do time-dependent work
- e.g., timing estimates, such as TCP’s RTT measurements
- Hypervisors are meant to be tiny and have direct machine access
- The hypervisor’s control node (dom0) actually runs as a VM itself
- IO rings, similar to io_uring
- Changes made to make virtualization easier
- The guest OS kernel runs at ring 1, which works nicely with x86 privilege rings (the hypervisor keeps ring 0)
- When the guest starts to run and gets scheduled, it registers a set of handler addresses with the hypervisor
- e.g., when there’s a syscall, jump to this address in the guest OS
- Guest OSes see both real time and virtual time
- real time for things like TCP congestion control
- virtual time, which doesn’t advance while the guest is descheduled
- Hypercalls are how guests call into the hypervisor, analogous to syscalls into an OS
- Paging
- Virtualizing the MMU is very difficult, since the hardware MMU walks the page tables directly
- Xen allows the guest to maintain its own page tables, but Xen checks all modifications to them
- A software-managed TLB, or a TLB with address-space identifiers, would make this easier, but x86 has neither
- Everyone manages their own page tables
- This minimizes Xen’s involvement
- So Xen lives in the otherwise-unused reserved region at the top of every address space
- When a guest OS creates a page table, it builds the table data structure out of memory within its own address space
- Once it registers the page table with Xen, those pages become read-only
- Xen receives the updates via hypercalls and can batch them (see the sketch after this list)
- The guest kernel still runs at ring 1
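A minimal sketch of that batching idea in C. struct mmu_update, MMU_NORMAL_PT_UPDATE, and DOMID_SELF come from Xen's public headers; the queue, the batch size, and the extern hypercall declaration are illustrative stand-ins for what a paravirt guest actually gets via the hypercall page:

```c
#include <stdint.h>

/* From Xen's public headers: each request names a PTE (by machine
 * address) and the value to write; the low bits of ptr pick a command. */
struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE, low 2 bits = command */
    uint64_t val;   /* new PTE contents */
};
#define MMU_NORMAL_PT_UPDATE 0
#define DOMID_SELF 0x7FF0

/* Real guests reach this through the hypercall page; declared extern
 * here only to keep the sketch self-contained. Xen validates every
 * request, since the registered page tables are read-only to us. */
extern long HYPERVISOR_mmu_update(struct mmu_update *req, unsigned count,
                                  unsigned *done, unsigned foreigndom);

#define BATCH 32                      /* illustrative batch size */
static struct mmu_update queue[BATCH];
static unsigned pending;

/* Queue one PTE write; flush the whole batch in a single hypercall. */
static void queue_pte_update(uint64_t pte_maddr, uint64_t new_pte)
{
    queue[pending].ptr = pte_maddr | MMU_NORMAL_PT_UPDATE;
    queue[pending].val = new_pte;
    if (++pending == BATCH) {
        unsigned done;
        HYPERVISOR_mmu_update(queue, pending, &done, DOMID_SELF);
        pending = 0;
    }
}
```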
Other Execution Environments
KVM
- What really is KVM anyways?
- ioctl (device file) interface (VMM <-> KVM), covering:
- vm/vcpu creation
- VM exits and re-entry
- memory manager
- interrupt/IO handling flow
- timers!
- KVM first appeared in Linux 2.6.20 (released 2007)
- Any process can create a guest; a guest is memory plus vCPUs, with in-kernel emulated devices, userspace-emulated devices, etc.
- CPU architectures
- Every supported arch has a CPU guest mode
- x86: VMX root/non-root modes (Intel VT-x) or SVM (AMD-V)
- ARM: VHE vs. nVHE
- RISC-V: hypervisor extension (VS/VU modes)
- Generally the pattern is: run the guest until you hit a trap, then exit to the host
- a trap being a privileged instruction, IO access, interrupt, or page fault
- Generally you work down a hierarchy of fds: /dev/kvm, then the VM fd, then the vCPU fd
- Lifecycle
open("/dev/kvm")to get the list of avaiabilities, such as checking the API version and extnesions, then useKVM_CREATE_VMto get the fd for thE VM. Then we mmap host memory and doKVM_SET_USER_MEMORY_REGION- The vpcu fd mmaps the
kvm_runpage, which setsKVM_SET_REGS (RIP, RSP),KVM_SET_SREGS(CR* and EFER) thenKVM_SET_CPUID2, and finally we can spawn host threads - Normal guest -> trap -> handle trap -> re-enter loop
- Teardown is done by signaling to the vCPU threads, then closing the various FDs
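A minimal sketch of that lifecycle against the real /dev/kvm ioctl API, assuming an x86-64 host, a single 64 KB memslot, and a tiny hand-assembled real-mode guest; error handling and KVM_SET_CPUID2 are elided:

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    ioctl(kvm, KVM_GET_API_VERSION, 0);            /* expect 12 */

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);         /* the VM fd */

    /* One memslot: 64 KB of anonymous host memory at GPA 0. */
    void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    /* Trivial real-mode guest: write 'k' to port 0x3f8, then halt. */
    const unsigned char code[] = {
        0xba, 0xf8, 0x03,   /* mov dx, 0x3f8 */
        0xb0, 'k',          /* mov al, 'k'   */
        0xee,               /* out dx, al    */
        0xf4,               /* hlt           */
    };
    memcpy(mem, code, sizeof(code));

    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0,
        .memory_size = 0x10000,
        .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);      /* the vCPU fd */

    /* The vCPU fd mmaps the shared kvm_run page. */
    int sz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Flat real mode starting at GPA 0. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);

    struct kvm_regs regs;
    memset(&regs, 0, sizeof(regs));
    regs.rip = 0;
    regs.rflags = 0x2;                             /* bit 1 must be set */
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* The run loop; a real VMM does this on one host thread per vCPU. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        if (run->exit_reason == KVM_EXIT_HLT)
            break;               /* other reasons: see dispatch below */
    }
    return 0;
}
```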
- Exits
- IO exits are common (the userspace dispatch is sketched after this list)
- MMIO exits, such as accesses to memory-mapped devices
- An in-kernel fast path handles some exits without going up to userspace
- And virtio doorbell writes can take that fast path (via ioeventfd, covered below)
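The dispatch side of that loop might look like the following sketch; the exit_reason values and kvm_run fields are the real ones from <linux/kvm.h>, while the port-0x3f8 handling just matches the toy guest above:

```c
#include <linux/kvm.h>
#include <stdio.h>

/* Called after KVM_RUN returns: decode the exit, emulate, re-enter. */
void handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        /* Port IO: the data lives inside the kvm_run page itself. */
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
            fwrite((char *)run + run->io.data_offset,
                   run->io.size, run->io.count, stdout);
        break;
    case KVM_EXIT_MMIO:
        /* Access to a GPA with no memslot: run->mmio.phys_addr,
         * .data, .len, .is_write describe it; emulate the device. */
        break;
    case KVM_EXIT_HLT:
        /* Guest halted: park the vCPU until an interrupt is pending. */
        break;
    default:
        fprintf(stderr, "unhandled exit reason %u\n", run->exit_reason);
    }
}
```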
- Memory Slots!
- We need to map the guest physical address space into the host virtual address space (VMM)
- So we typically have 4 slots:
- Slot 3: PCI MMIO device memory -> mmap region 3, with userspace_addr[3]
- Slot 2: High RAM -> mmap region 2
- Slot 1: Low RAM -> mmap region 1
- Slot 0: Real mode: first 640K
GPA (x86, 4 GB guest) HVA (QEMU process VA)
0x140000000
─ ┌──────────────────┐
│ above-4G RAM │─── slot 2 ─────────────► 0x7fc000000000
│ (1G remapped │ base_gfn: 0x100000 ┌──────────┐
│ above PCI hole)│ npages: 0x40000 │anon mmap │
─ └──────────────────┘ usr_addr: 0x7fc000000000 └──────────┘
0x100000000
╔══════════════════╗
║ PCI / MMIO hole ║ no memslot — KVM_EXIT_MMIO
║ 0xFEE00000 LAPIC║ dispatched to QEMU device models
║ 0xFEC00000 IOAPIC║
║ 0xC0000000+ BARs║
╚══════════════════╝
0xC0000000
─ ┌──────────────────┐
│ low extended RAM│─── slot 1 ─────────────► 0x7f8000100000
│ 1 MB → ~3 GB │ base_gfn: 0x100 ┌──────────┐
│ │ npages: 0xBFF00 │anon mmap │
─ └──────────────────┘ usr_addr: 0x7f8000100000 └──────────┘
0x00100000
╔══════════════════╗
║ ISA hole ║ no memslot — KVM_EXIT_MMIO / KVM_EXIT_IO
║ 0xA0000 VGA RAM ║ dispatched to QEMU ROM and device handlers
║ 0xC0000 opt ROM ║
║ 0xF0000 BIOS ║
╚══════════════════╝
0x000A0000
─ ┌──────────────────┐
│ conventional RAM│─── slot 0 ─────────────► 0x7f8000000000
│ 0 → 640 KB │ base_gfn: 0x0 ┌──────────┐
│ │ npages: 0xA0 │anon mmap │
─ └──────────────────┘ usr_addr: 0x7f8000000000 └──────────┘
0x00000000
──────────────────────────────────────────────────────────────
hva = slot->userspace_addr + ((gfn - base_gfn) << PAGE_SHIFT)
gfn = gpa >> PAGE_SHIFT (PAGE_SHIFT = 12 for 4K pages)
lookup → kvm->memslots[as_id]
lru_slot checked first (atomic_t, avoids search on hot path)
then binary search descending on base_gfn
as_id → 0 = normal, 1 = SMM shadow memory (x86 only)
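In C, the slot lookup plus the hva formula above comes out roughly as follows; the fixed four-entry slots array is a simplified stand-in for kvm->memslots, while the struct is the real one from <linux/kvm.h>:

```c
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4K pages */

struct kvm_userspace_memory_region slots[4];   /* slots 0..3 as above */

static void *gpa_to_hva(uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;
    for (size_t i = 0; i < 4; i++) {
        uint64_t base_gfn = slots[i].guest_phys_addr >> PAGE_SHIFT;
        uint64_t npages   = slots[i].memory_size >> PAGE_SHIFT;
        if (gfn >= base_gfn && gfn < base_gfn + npages)
            return (void *)(slots[i].userspace_addr +
                            ((gfn - base_gfn) << PAGE_SHIFT) +
                            (gpa & ((1ULL << PAGE_SHIFT) - 1)));
    }
    return NULL;   /* no memslot: this access becomes KVM_EXIT_MMIO */
}
```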
- Which means we then arrive at three address layers: the Guest Virtual Address (GVA), the Guest Physical Address (GPA), and the Host Physical Address (HPA)
- So the hardware has to walk on both levels when a TLB miss happens (GVA -> [guest PT walk] -> GPA -> [EPT walk] -> HPA), as sketched after this list
- AMD uses NPT (Nested Page Tables) instead of Intel's EPT (Extended Page Tables)
- ARM uses Stage-2 page tables
- RISC-V uses G-stage page tables
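A conceptual sketch of why that two-dimensional walk is expensive: each of the four guest page-table levels lives at a GPA, so every step needs its own second-level translation. ept_translate() and read_guest_pte() are hypothetical helpers standing in for the hardware EPT walker and a host memory load:

```c
#include <stdint.h>

extern uint64_t ept_translate(uint64_t gpa);   /* GPA -> HPA, itself a 4-level walk */
extern uint64_t read_guest_pte(uint64_t hpa);  /* load one PTE from host RAM */

/* GVA -> HPA for a 4-level x86-64 guest (present/permission bits and
 * large pages ignored for brevity). */
uint64_t translate_gva(uint64_t guest_cr3, uint64_t gva)
{
    uint64_t table_gpa = guest_cr3;
    for (int level = 3; level >= 0; level--) {   /* PML4, PDPT, PD, PT */
        uint64_t idx = (gva >> (12 + 9 * level)) & 0x1ff;
        uint64_t pte = read_guest_pte(ept_translate(table_gpa) + idx * 8);
        table_gpa = pte & 0x000ffffffffff000ULL; /* GPA of next level/page */
    }
    /* One final second-level walk for the data page itself. */
    return ept_translate(table_gpa) + (gva & 0xfff);
}
```

With four guest levels each needing a second-level walk, plus the final data translation, a cold miss can cost on the order of 24 memory references, which is why huge pages and TLB reach matter so much under virtualization.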
- Timers!
- Timers are tricky because the guest OS needs precise timers, but we can deschedule the vCPU. So we need to emulate timers without breaking guest time.
- We have hardware-supported timers, which fire directly into the VM. This allows for no VM exits on ARM64/RISC-V
- But sometimes we have to fall back to software timers, where the guest writes a comparator -> VM exit -> KVM programs an hrtimer
- the hrtimer fires -> VM exit -> KVM injects the virtual interrupt and re-enters
- This is TWO exits per cycle (see the sketch after this list)
- The guest can also block: the guest halts, the vCPU stops, and the HW timer can't fire
- So KVM then starts a backup hrtimer to wake it
- There’s always some form of offset timekeeping so that the guest sees a consistent virtual time
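A userspace sketch of that software path, assuming an in-kernel irqchip (KVM_CREATE_IRQCHIP) so KVM_IRQ_LINE can inject, a timerfd made with timerfd_create(CLOCK_MONOTONIC, 0), and an illustrative TIMER_GSI line number:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/timerfd.h>
#include <unistd.h>

#define TIMER_GSI 5   /* illustrative guest interrupt line */

/* Exit #1: the guest wrote its comparator; arm a one-shot host timer. */
void arm_guest_timer(int tfd, uint64_t ns_from_now)
{
    struct itimerspec its = {
        .it_value = { .tv_sec  = ns_from_now / 1000000000ULL,
                      .tv_nsec = ns_from_now % 1000000000ULL },
    };
    timerfd_settime(tfd, 0, &its, NULL);   /* relative, one-shot */
}

/* Exit #2: the host timer fired; inject the virtual interrupt. */
void timer_expired(int vm_fd, int tfd)
{
    uint64_t expirations;
    read(tfd, &expirations, sizeof(expirations));   /* drain timerfd */

    struct kvm_irq_level irq = { .irq = TIMER_GSI, .level = 1 };
    ioctl(vm_fd, KVM_IRQ_LINE, &irq);               /* raise the line... */
    irq.level = 0;
    ioctl(vm_fd, KVM_IRQ_LINE, &irq);               /* ...then lower it */
}
```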
- IO
- When a guest accesses a device register, it reads/writes an unmapped GPA, which causes a VM exit out to userspace, triggering QEMU's device emulation
- In certain situations we have kernel shortcuts, such as ioeventfd and irqfd.
- The key is to avoid hitting QEMU, which is very slow
- Interrupts
- With a userspace IRQCHIP, interrupts are emulated in the VMM (QEMU), which costs a VM exit per injection
- KVM acceleration handles the irqchip in the kernel instead. This avoids exits to userspace.
- Sometimes we can signal directly to HW from the guest
- ioeventfd and irqfd act as doorbells and interrupt injections (sketched below)
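A sketch of wiring both shortcuts with the real KVM_IOEVENTFD/KVM_IRQFD ioctls; DOORBELL_GPA and QUEUE_GSI are illustrative, and this again assumes an in-kernel irqchip:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define DOORBELL_GPA 0xc0001000ULL   /* illustrative MMIO doorbell address */
#define QUEUE_GSI    6               /* illustrative interrupt line */

void wire_fast_paths(int vm_fd)
{
    /* Guest MMIO write to the doorbell -> kernel signals kick_fd and
     * the IO thread wakes; no KVM_EXIT_MMIO ever reaches QEMU. */
    int kick_fd = eventfd(0, EFD_CLOEXEC);
    struct kvm_ioeventfd io = {
        .addr = DOORBELL_GPA,
        .len  = 4,
        .fd   = kick_fd,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &io);

    /* Backend thread writes call_fd -> kernel injects QUEUE_GSI into
     * the guest, again without bouncing through userspace. */
    int call_fd = eventfd(0, EFD_CLOEXEC);
    struct kvm_irqfd irq = {
        .fd  = call_fd,
        .gsi = QUEUE_GSI,
    };
    ioctl(vm_fd, KVM_IRQFD, &irq);

    /* e.g. the virtio backend: uint64_t one = 1; write(call_fd, &one, 8); */
}
```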
VMMs
- VMMs create and manage VMs with the help of the hypervisor; examples include kvmtool, qemu-kvm, crosvm, and cloud-hypervisor