عجفت الغور

virtualization (VMs)

Tags: computers

KVM (kernel virtual machine)

QEMU

qcow2/3

  • Virtual filesystems for QEMU

Hot Plug Memory

Replacements

Xen

  • Why is it important to provide timers?
    • Because OSes aren’t used to being preempted! We want to provide the OS real timers so it can make real time-based decisions
    • Timing estimates, for example TCP RTT estimation
  • Hypervisors are meant to be tiny and have direct machine access
    • The hypervisor control node (dom0) actually runs as a VM
  • IO rings, similar to io_uring
  • Changes made to make virtualization easier
    • The guest OS kernel runs at ring 1 (Xen keeps ring 0), which works nicely with x86 privilege rings
    • When the guest starts to run and gets scheduled, it communicates with the hypervisor to register a set of handler addresses
      • When there’s a syscall, jump to this entry point in the guest OS
    • GuestOSs have realtime and virtual time
      • real time for things like TCP congestion/RTT estimation
      • virtual time for when you get descheduled
    • Hypercalls are how the guest OS requests services from the hypervisor, analogous to syscalls into a kernel
  • Paging
    • Virtualizing the MMU is very difficult, since the MMU hardware walks the page tables directly
      • A software-managed TLB, or a TLB with address-space identifiers, would make this easier, but x86 has neither
    • Xen instead lets each guest maintain its own page tables, but validates every modification to them
      • When a guest OS creates a page table, it builds the data structure itself, out of memory in its own address space
      • When it registers the page table with Xen, those pages become read-only to the guest
      • Xen receives the updates (via hypercall) and can batch them
    • This minimizes Xen’s involvement; everyone manages their own page tables
    • Xen itself lives in a reserved (otherwise unused) part of every address space
    • The guest kernel runs at ring 1
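The validated, batched page-table scheme above can be sketched as a toy model (a hypothetical Python simulation, not real Xen code): the guest registers its page table, after which it can no longer write it directly; updates are queued and flushed through one batched "hypercall", and the hypervisor rejects any mapping to a frame the guest doesn’t own.

```python
# Toy model of Xen-style validated page-table updates (hypothetical names).
class Hypervisor:
    def __init__(self, guest_owned_frames):
        self.guest_owned = set(guest_owned_frames)  # frames this guest may map
        self.page_table = {}                        # registered, "read-only" PT

    def register_page_table(self, table):
        self.page_table = dict(table)               # guest loses direct write access

    def mmu_update(self, updates):
        """Apply a batch of (vaddr, frame) updates, validating each one."""
        applied = 0
        for vaddr, frame in updates:
            if frame not in self.guest_owned:       # e.g. another guest's memory
                continue                            # reject this entry
            self.page_table[vaddr] = frame
            applied += 1
        return applied


class Guest:
    """Queues PT updates and flushes them in one batch to amortize trap cost."""
    def __init__(self, hypervisor):
        self.hv = hypervisor
        self.pending = []

    def queue_update(self, vaddr, frame):
        self.pending.append((vaddr, frame))

    def flush(self):                                # one "hypercall", many updates
        applied = self.hv.mmu_update(self.pending)
        self.pending = []
        return applied
```

Batching is the point: without it, every PTE write would be its own trap into the hypervisor.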

Other Execution Environments

KVM

  • What really is KVM anyways?
    • ioctl (device file) interface (VMM <> KVM)
    • vm/vcpu creation
    • VM exits and re-entry
    • memory manager
    • interrupt/iohandling flow
    • timers!
  • KVM first started in Linux 2.6.20 (2007 edition)
  • Any process can create a guest; a guest is memory and vCPUs, with in-kernel emulated devices, userspace emulated devices, etc
  • CPU arches
    • Every supported arch has a CPU guest mode
    • x86: VMX root/non-root (Intel VT-x) or SVM (AMD-V)
    • ARM: VHE vs nVHE
    • RISC-V: hypervisor extension (VS/VU modes)
    • Generally the pattern is: run the guest until you hit a trap, then exit to the host
      • a trap being a privileged instruction, IO access, interrupt, or page fault
  • Generally you work down three levels of fds: /dev/kvm, then the VM fd, then the vCPU fd, each with its own set of ioctls
  • Lifecycle
    1. open("/dev/kvm") to query availability, such as checking the API version and extensions, then use KVM_CREATE_VM to get the fd for the VM. Then we mmap host memory and register it with KVM_SET_USER_MEMORY_REGION
    2. The vCPU fd mmaps the kvm_run page; we set KVM_SET_REGS (RIP, RSP), KVM_SET_SREGS (CR* and EFER), then KVM_SET_CPUID2, and finally we can spawn host threads that call KVM_RUN
    3. Normal loop: run guest -> trap -> handle exit -> re-enter
    4. Teardown is done by signaling the vCPU threads, then closing the various fds
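Step 1 can be probed from any language that can issue ioctls. A minimal hedged sketch in Python: the ioctl numbers are the real `_IO(0xAE, ...)` encodings from the KVM uAPI headers, but the helper names are mine.

```python
import fcntl
import os

KVM_GET_API_VERSION = 0xAE00  # _IO(0xAE, 0x00)
KVM_CREATE_VM       = 0xAE01  # _IO(0xAE, 0x01)

def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version (12 on any modern kernel), or None if
    KVM is not available on this machine."""
    if not os.path.exists(path):
        return None
    fd = os.open(path, os.O_RDWR)
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION)  # ioctl's return value
    finally:
        os.close(fd)

def create_vm(path="/dev/kvm"):
    """Return a fresh VM fd, or None if KVM is unavailable. That VM fd is
    then the target for KVM_SET_USER_MEMORY_REGION, KVM_CREATE_VCPU, etc."""
    if kvm_api_version(path) is None:
        return None
    fd = os.open(path, os.O_RDWR)
    try:
        return fcntl.ioctl(fd, KVM_CREATE_VM)  # the ioctl returns a new fd
    finally:
        os.close(fd)
```

The later steps need packed C structs (kvm_userspace_memory_region, kvm_regs, the mmapped kvm_run page), which is why real VMMs are written in C or Rust rather than scripted like this.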
  • Exits
    • IO exits are common
    • MMIO exits, such as accesses to memory-mapped devices
    • Some exits are handled on an in-kernel fast path without ever reaching userspace
    • And virtio doorbell writes can be routed through ioeventfd
  • Memory Slots!
    • We need to map the guest physical address space into the host virtual address space (the VMM’s)
    • So we typically have 4 slots:
      • Slot 3: PCI MMIO device memory -> mmap region 3 (userspace_addr[3])
      • Slot 2: High RAM -> mmap region 2
      • Slot 1: Low RAM -> mmap region 1
      • Slot 0: Real mode: first 640 KB
   GPA (x86, 4 GB guest)                        HVA (QEMU process VA)

   0x140000000
─  ┌──────────────────┐
   │  above-4G RAM    │─── slot 2 ─────────────►  0x7fc000000000
   │  (1G remapped    │    base_gfn: 0x100000       ┌──────────┐
   │   above PCI hole)│    npages:   0x40000         │anon mmap │
─  └──────────────────┘    usr_addr: 0x7fc000000000 └──────────┘
   0x100000000

   ╔══════════════════╗
   ║  PCI / MMIO hole ║  no memslot — KVM_EXIT_MMIO
   ║  0xFEE00000 LAPIC║  dispatched to QEMU device models
   ║  0xFEC00000 IOAPIC
   ║  0xC0000000+ BARs║
   ╚══════════════════╝
   0xC0000000

─  ┌──────────────────┐
   │  low extended RAM│─── slot 1 ─────────────►  0x7f8000100000
   │  1 MB → ~3 GB    │    base_gfn: 0x100          ┌──────────┐
   │                  │    npages:   0xBFF00         │anon mmap │
─  └──────────────────┘    usr_addr: 0x7f8000100000 └──────────┘
   0x00100000

   ╔══════════════════╗
   ║  ISA hole        ║  no memslot — KVM_EXIT_MMIO / KVM_EXIT_IO
   ║  0xA0000 VGA RAM ║  dispatched to QEMU ROM and device handlers
   ║  0xC0000 opt ROM ║
   ║  0xF0000 BIOS    ║
   ╚══════════════════╝
   0x000A0000

─  ┌──────────────────┐
   │  conventional RAM│─── slot 0 ─────────────►  0x7f8000000000
   │  0 → 640 KB      │    base_gfn: 0x0            ┌──────────┐
   │                  │    npages:   0xA0            │anon mmap │
─  └──────────────────┘    usr_addr: 0x7f8000000000 └──────────┘
   0x00000000

   ──────────────────────────────────────────────────────────────
   hva    = slot->userspace_addr + ((gfn - base_gfn) << PAGE_SHIFT)
   gfn    = gpa >> PAGE_SHIFT        (PAGE_SHIFT = 12 for 4K pages)
   lookup → kvm->memslots[as_id]
            lru_slot checked first (atomic_t, avoids search on hot path)
            then binary search descending on base_gfn
   as_id  → 0 = normal, 1 = SMM shadow memory (x86 only)
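The translation in the figure can be reproduced directly. The slot values are copied from the diagram above; the linear scan stands in for KVM's descending binary search.

```python
PAGE_SHIFT = 12

# The three RAM memslots from the figure: (base_gfn, npages, userspace_addr),
# kept sorted descending by base_gfn, as KVM keeps them.
SLOTS = [
    (0x100000, 0x40000, 0x7fc000000000),  # slot 2: above-4G RAM
    (0x100,    0xBFF00, 0x7f8000100000),  # slot 1: low extended RAM
    (0x0,      0xA0,    0x7f8000000000),  # slot 0: conventional RAM
]

def gpa_to_hva(gpa):
    """hva = userspace_addr + ((gfn - base_gfn) << PAGE_SHIFT) + page offset.
    Returns None for holes, which surface as KVM_EXIT_MMIO instead."""
    gfn = gpa >> PAGE_SHIFT
    for base_gfn, npages, userspace_addr in SLOTS:
        if base_gfn <= gfn < base_gfn + npages:
            return userspace_addr + ((gfn - base_gfn) << PAGE_SHIFT) + (gpa & 0xFFF)
    return None  # PCI/ISA hole: no memslot backs this GPA
```

Note that a GPA in the LAPIC/IOAPIC range falls through to `None`: exactly the "no memslot, dispatch to device models" path in the figure.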
  • Which means we then arrive at three address layers: the Guest Virtual Address (GVA), the Guest Physical Address (GPA), and the Host Physical Address (HPA)
    • So the hardware has to walk on both levels when a TLB miss happens (GVA -> [guest PT walk] -> GPA -> [EPT walk] -> HPA)
      • AMD uses NPT (Nested Page Tables instead of Extended Page Tables)
      • ARM uses Stage-2 page tables
      • RISC-V uses G-stage page tables
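The cost of that two-dimensional walk is worth working out: every guest PTE read is itself a guest-physical access that needs its own second-stage walk. For 4-level guest tables over a 4-level EPT, the textbook worst case is (4+1)×(4+1)−1 = 24 memory references, versus 4 for native paging. A small sketch of the arithmetic:

```python
def nested_walk_refs(guest_levels=4, stage2_levels=4):
    """Worst-case memory references to resolve a TLB miss under nested
    paging: each guest PTE read, plus the final data access, is a GPA
    that needs a full stage-2 walk of its own."""
    gpas_to_walk = guest_levels + 1  # guest PTE reads + the final data GPA
    return gpas_to_walk * stage2_levels + guest_levels  # + the PTE reads themselves
```

With `stage2_levels=0` it degenerates to the native case (just the guest's own PTE reads), which is a handy sanity check.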
  • Timers!
    • Timers are tricky because the guest OS needs precise timers, but we can deschedule the vCPU. So we need to emulate timers without breaking guest time.
    • We have hardware-supported timers, which the hardware fires directly into the VM. This allows for no VM exits on ARM64/RISC-V
    • But sometimes we have to fall back to software timers, where the guest writes a comparator -> VM exit -> KVM programs an hrtimer
      • hrtimer fires -> KVM kicks the vCPU and injects a virtual interrupt -> another VM exit
      • This is TWO exits per timer cycle
      • We can also block: the guest halts, the vCPU thread stops, and the HW timer can’t fire
        • So KVM then starts a backup hrtimer to wake it
      • There’s always some form of offset timekeeping so that the guest sees a consistent virtual time
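A toy model of that offset timekeeping (a hypothetical sketch: real KVM tracks a per-vCPU TSC offset, and whether descheduled time is hidden like this or exposed as steal time is a policy choice):

```python
class VirtualClock:
    """Toy offset model: guest_time = host_time + offset. On resume the
    offset is adjusted so the guest clock does not jump across a pause."""
    def __init__(self):
        self.offset = 0
        self.paused_at = None

    def guest_time(self, host_time):
        return host_time + self.offset

    def pause(self, host_time):
        self.paused_at = host_time      # vCPU descheduled / guest halted

    def resume(self, host_time):
        self.offset -= host_time - self.paused_at  # hide the gap from the guest
        self.paused_at = None
```

Hiding the gap keeps guest timers and timestamps self-consistent, at the price of the guest's clock drifting from wall-clock time while it is descheduled.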
  • IO
    • When a guest accesses a device register, it reads/writes unmapped GPAs, which causes a VM exit out to userspace, triggering QEMU’s device emulation
    • In certain situations we have kernel shortcuts, such as ioeventfd and irqfd.
    • The key is to avoid hitting QEMU, which is very slow
  • Interrupts
    • The IRQCHIP can live in userspace, emulated in the VMM (QEMU), which costs a VM exit per interrupt
    • Or KVM accelerates it in the kernel (in-kernel irqchip), which avoids exits to userspace
    • Sometimes we can signal directly to HW from the guest
  • ioeventfd and irqfd act as doorbells and interrupt injections

VMMs

  • VMMs create and manage VMs with the help of the hypervisor, such as kvmtool, qemu-kvm, crosvm, cloud-hypervisor

Links to this note