virtualization (VMs)
Tags: computers
KVM (kernel virtual machine)
- Linux-based virtualization; the KVM kernel module and a userspace VMM (e.g., QEMU) work together
- https://www.redhat.com/en/blog/all-you-need-know-about-kvm-userspace
- https://www.redhat.com/en/blog/journey-vhost-users-realm
QEMU
qcow2/3
- Copy-on-write disk image formats for QEMU
Hot Plug Memory
- https://github.com/qemu/qemu/blob/db596ae19040574e41d086e78469014191d7d7fc/docs/memory-hotplug.txt
Replacements
Xen
- Why is it important to provide timers?
- Because OSes aren’t used to being preempted! We want to provide the guest OS real timers so it can do time-dependent work
- e.g., timing estimates, such as TCP’s RTT measurements
- Hypervisors are meant to be tiny and have direct machine access
- The hypervisor’s control node (dom0) actually runs as a VM itself
- IO rings, similar to io_uring
- Changes made to make virtualization easier
- The guest OS kernel runs at ring 1, which works nicely with x86 privilege rings (the hypervisor keeps ring 0)
- When the guest starts to run and gets scheduled, it registers a set of handler addresses with the hypervisor
- e.g., when there’s a syscall, jump to this address in the guest OS
- Guest OSes see both real time and virtual time
- real time for things like TCP congestion control
- virtual time, which doesn’t advance while the guest is descheduled
- Hypercalls are how guests call into the hypervisor, analogous to syscalls into an OS
- Paging
- Virtualizing the MMU is very difficult, since the hardware MMU walks the page tables directly
- Xen allows the guest to maintain its own page tables, but Xen checks all modifications to them
- A software-managed TLB, or a TLB with address-space identifiers, would make this easier, but x86 has neither
- Everyone manages their own page tables
- This minimizes Xen’s involvement
- So Xen lives in the otherwise-unused reserved region at the top of every address space
- When a guest OS creates a page table, it builds the table data structure out of memory within its own address space
- Once it registers the page table with Xen, those pages become read-only
- Xen receives the updates via hypercalls and can batch them (see the sketch after this list)
- The guest kernel still runs at ring 1
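A minimal sketch of that batching idea in C. struct mmu_update, MMU_NORMAL_PT_UPDATE, and DOMID_SELF come from Xen's public headers; the queue, the batch size, and the extern hypercall declaration are illustrative stand-ins for what a paravirt guest actually gets via the hypercall page:

```c
#include <stdint.h>

/* From Xen's public headers: each request names a PTE (by machine
 * address) and the value to write; the low bits of ptr pick a command. */
struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE, low 2 bits = command */
    uint64_t val;   /* new PTE contents */
};
#define MMU_NORMAL_PT_UPDATE 0
#define DOMID_SELF 0x7FF0

/* Real guests reach this through the hypercall page; declared extern
 * here only to keep the sketch self-contained. Xen validates every
 * request, since the registered page tables are read-only to us. */
extern long HYPERVISOR_mmu_update(struct mmu_update *req, unsigned count,
                                  unsigned *done, unsigned foreigndom);

#define BATCH 32                      /* illustrative batch size */
static struct mmu_update queue[BATCH];
static unsigned pending;

/* Queue one PTE write; flush the whole batch in a single hypercall. */
static void queue_pte_update(uint64_t pte_maddr, uint64_t new_pte)
{
    queue[pending].ptr = pte_maddr | MMU_NORMAL_PT_UPDATE;
    queue[pending].val = new_pte;
    if (++pending == BATCH) {
        unsigned done;
        HYPERVISOR_mmu_update(queue, pending, &done, DOMID_SELF);
        pending = 0;
    }
}
```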
Other Execution Environments
KVM
- What really is KVM anyways?
- ioctl (device file) interface (VMM <-> KVM), covering:
- vm/vcpu creation
- VM exits and re-entry
- memory manager
- interrupt/IO handling flow
- timers!
- KVM first appeared in Linux 2.6.20 (released 2007)
- Any process can create a guest; a guest is memory plus vCPUs, with in-kernel emulated devices, userspace-emulated devices, etc.
- CPU architectures
- Every supported arch has a CPU guest mode
- x86: VMX root/non-root modes (Intel VT-x) or SVM (AMD-V)
- ARM: VHE vs. nVHE
- RISC-V: hypervisor extension (VS/VU modes)
- Generally the pattern is: run the guest until you hit a trap, then exit to the host
- a trap being a privileged instruction, IO access, interrupt, or page fault
- Generally you work down a hierarchy of fds: /dev/kvm, then the VM fd, then the vCPU fd
- Lifecycle
open("/dev/kvm")to get the list of avaiabilities, such as checking the API version and extnesions, then useKVM_CREATE_VMto get the fd for thE VM. Then we mmap host memory and doKVM_SET_USER_MEMORY_REGION- The vpcu fd mmaps the
kvm_runpage, which setsKVM_SET_REGS (RIP, RSP),KVM_SET_SREGS(CR* and EFER) thenKVM_SET_CPUID2, and finally we can spawn host threads - Normal guest -> trap -> handle trap -> re-enter loop
- Teardown is done by signaling to the vCPU threads, then closing the various FDs
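A minimal sketch of that lifecycle against the real /dev/kvm ioctl API, assuming an x86-64 host, a single 64 KB memslot, and a tiny hand-assembled real-mode guest; error handling and KVM_SET_CPUID2 are elided:

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    ioctl(kvm, KVM_GET_API_VERSION, 0);            /* expect 12 */

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);         /* the VM fd */

    /* One memslot: 64 KB of anonymous host memory at GPA 0. */
    void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    /* Trivial real-mode guest: write 'k' to port 0x3f8, then halt. */
    const unsigned char code[] = {
        0xba, 0xf8, 0x03,   /* mov dx, 0x3f8 */
        0xb0, 'k',          /* mov al, 'k'   */
        0xee,               /* out dx, al    */
        0xf4,               /* hlt           */
    };
    memcpy(mem, code, sizeof(code));

    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0,
        .memory_size = 0x10000,
        .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);      /* the vCPU fd */

    /* The vCPU fd mmaps the shared kvm_run page. */
    int sz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Flat real mode starting at GPA 0. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);

    struct kvm_regs regs;
    memset(&regs, 0, sizeof(regs));
    regs.rip = 0;
    regs.rflags = 0x2;                             /* bit 1 must be set */
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* The run loop; a real VMM does this on one host thread per vCPU. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        if (run->exit_reason == KVM_EXIT_HLT)
            break;               /* other reasons: see dispatch below */
    }
    return 0;
}
```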
- Exits
- IO exits are common (the userspace dispatch is sketched after this list)
- MMIO exits, such as accesses to memory-mapped devices
- An in-kernel fast path handles some exits without going up to userspace
- And virtio doorbell writes can take that fast path (via ioeventfd, covered below)
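The dispatch side of that loop might look like the following sketch; the exit_reason values and kvm_run fields are the real ones from <linux/kvm.h>, while the port-0x3f8 handling just matches the toy guest above:

```c
#include <linux/kvm.h>
#include <stdio.h>

/* Called after KVM_RUN returns: decode the exit, emulate, re-enter. */
void handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        /* Port IO: the data lives inside the kvm_run page itself. */
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
            fwrite((char *)run + run->io.data_offset,
                   run->io.size, run->io.count, stdout);
        break;
    case KVM_EXIT_MMIO:
        /* Access to a GPA with no memslot: run->mmio.phys_addr,
         * .data, .len, .is_write describe it; emulate the device. */
        break;
    case KVM_EXIT_HLT:
        /* Guest halted: park the vCPU until an interrupt is pending. */
        break;
    default:
        fprintf(stderr, "unhandled exit reason %u\n", run->exit_reason);
    }
}
```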
- Memory Slots!
- We need to map the guest physical address space into the host virtual address space (VMM)
- So we typically have 4 slots:
- Slot 3: PCI MMIO device memory -> mmap region 3, with userspace_addr[3]
- Slot 2: High RAM -> mmap region 2
- Slot 1: Low RAM -> mmap region 1
- Slot 0: Real mode: first 640K
GPA (x86, 4 GB guest) HVA (QEMU process VA)
0x140000000
─ ┌──────────────────┐
│ above-4G RAM │─── slot 2 ─────────────► 0x7fc000000000
│ (1G remapped │ base_gfn: 0x100000 ┌──────────┐
│ above PCI hole)│ npages: 0x40000 │anon mmap │
─ └──────────────────┘ usr_addr: 0x7fc000000000 └──────────┘
0x100000000
╔══════════════════╗
║ PCI / MMIO hole ║ no memslot — KVM_EXIT_MMIO
║ 0xFEE00000 LAPIC║ dispatched to QEMU device models
║ 0xFEC00000 IOAPIC║
║ 0xC0000000+ BARs║
╚══════════════════╝
0xC0000000
─ ┌──────────────────┐
│ low extended RAM│─── slot 1 ─────────────► 0x7f8000100000
│ 1 MB → ~3 GB │ base_gfn: 0x100 ┌──────────┐
│ │ npages: 0xBFF00 │anon mmap │
─ └──────────────────┘ usr_addr: 0x7f8000100000 └──────────┘
0x00100000
╔══════════════════╗
║ ISA hole ║ no memslot — KVM_EXIT_MMIO / KVM_EXIT_IO
║ 0xA0000 VGA RAM ║ dispatched to QEMU ROM and device handlers
║ 0xC0000 opt ROM ║
║ 0xF0000 BIOS ║
╚══════════════════╝
0x000A0000
─ ┌──────────────────┐
│ conventional RAM│─── slot 0 ─────────────► 0x7f8000000000
│ 0 → 640 KB │ base_gfn: 0x0 ┌──────────┐
│ │ npages: 0xA0 │anon mmap │
─ └──────────────────┘ usr_addr: 0x7f8000000000 └──────────┘
0x00000000
──────────────────────────────────────────────────────────────
hva = slot->userspace_addr + ((gfn - base_gfn) << PAGE_SHIFT)
gfn = gpa >> PAGE_SHIFT (PAGE_SHIFT = 12 for 4K pages)
lookup → kvm->memslots[as_id]
lru_slot checked first (atomic_t, avoids search on hot path)
then binary search descending on base_gfn
as_id → 0 = normal, 1 = SMM shadow memory (x86 only)
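In C, the slot lookup plus the hva formula above comes out roughly as follows; the fixed four-entry slots array is a simplified stand-in for kvm->memslots, while the struct is the real one from <linux/kvm.h>:

```c
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4K pages */

struct kvm_userspace_memory_region slots[4];   /* slots 0..3 as above */

static void *gpa_to_hva(uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;
    for (size_t i = 0; i < 4; i++) {
        uint64_t base_gfn = slots[i].guest_phys_addr >> PAGE_SHIFT;
        uint64_t npages   = slots[i].memory_size >> PAGE_SHIFT;
        if (gfn >= base_gfn && gfn < base_gfn + npages)
            return (void *)(slots[i].userspace_addr +
                            ((gfn - base_gfn) << PAGE_SHIFT) +
                            (gpa & ((1ULL << PAGE_SHIFT) - 1)));
    }
    return NULL;   /* no memslot: this access becomes KVM_EXIT_MMIO */
}
```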
- Which means we then arrive at three address layers: the Guest Virtual Address (GVA), the Guest Physical Address (GPA), and the Host Physical Address (HPA)
- So the hardware has to walk on both levels when a TLB miss happens (GVA -> [guest PT walk] -> GPA -> [EPT walk] -> HPA), as sketched after this list
- AMD uses NPT (Nested Page Tables) instead of Intel's EPT (Extended Page Tables)
- ARM uses Stage-2 page tables
- RISC-V uses G-stage page tables
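A conceptual sketch of why that two-dimensional walk is expensive: each of the four guest page-table levels lives at a GPA, so every step needs its own second-level translation. ept_translate() and read_guest_pte() are hypothetical helpers standing in for the hardware EPT walker and a host memory load:

```c
#include <stdint.h>

extern uint64_t ept_translate(uint64_t gpa);   /* GPA -> HPA, itself a 4-level walk */
extern uint64_t read_guest_pte(uint64_t hpa);  /* load one PTE from host RAM */

/* GVA -> HPA for a 4-level x86-64 guest (present/permission bits and
 * large pages ignored for brevity). */
uint64_t translate_gva(uint64_t guest_cr3, uint64_t gva)
{
    uint64_t table_gpa = guest_cr3;
    for (int level = 3; level >= 0; level--) {   /* PML4, PDPT, PD, PT */
        uint64_t idx = (gva >> (12 + 9 * level)) & 0x1ff;
        uint64_t pte = read_guest_pte(ept_translate(table_gpa) + idx * 8);
        table_gpa = pte & 0x000ffffffffff000ULL; /* GPA of next level/page */
    }
    /* One final second-level walk for the data page itself. */
    return ept_translate(table_gpa) + (gva & 0xfff);
}
```

With four guest levels each needing a second-level walk, plus the final data translation, a cold miss can cost on the order of 24 memory references, which is why huge pages and TLB reach matter so much under virtualization.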
- Timers!
- Timers are tricky because the guest OS needs precise timers, but we can deschedule the vCPU. So we need to emulate timers without breaking guest time.
- We have hardware-supported timers, which fire directly into the VM. This allows for no VM exits on ARM64/RISC-V
- But sometimes we have to fall back to software timers, where the guest writes a comparator -> VM exit -> KVM programs an hrtimer
- the hrtimer fires -> VM exit -> KVM injects the virtual interrupt and re-enters
- This is TWO exits per cycle (see the sketch after this list)
- The guest can also block: the guest halts, the vCPU stops, and the HW timer can't fire
- So KVM then starts a backup hrtimer to wake it
- There’s always some form of offset timekeeping so that the guest sees a consistent virtual time
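A userspace sketch of that software path, assuming an in-kernel irqchip (KVM_CREATE_IRQCHIP) so KVM_IRQ_LINE can inject, a timerfd made with timerfd_create(CLOCK_MONOTONIC, 0), and an illustrative TIMER_GSI line number:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/timerfd.h>
#include <unistd.h>

#define TIMER_GSI 5   /* illustrative guest interrupt line */

/* Exit #1: the guest wrote its comparator; arm a one-shot host timer. */
void arm_guest_timer(int tfd, uint64_t ns_from_now)
{
    struct itimerspec its = {
        .it_value = { .tv_sec  = ns_from_now / 1000000000ULL,
                      .tv_nsec = ns_from_now % 1000000000ULL },
    };
    timerfd_settime(tfd, 0, &its, NULL);   /* relative, one-shot */
}

/* Exit #2: the host timer fired; inject the virtual interrupt. */
void timer_expired(int vm_fd, int tfd)
{
    uint64_t expirations;
    read(tfd, &expirations, sizeof(expirations));   /* drain timerfd */

    struct kvm_irq_level irq = { .irq = TIMER_GSI, .level = 1 };
    ioctl(vm_fd, KVM_IRQ_LINE, &irq);               /* raise the line... */
    irq.level = 0;
    ioctl(vm_fd, KVM_IRQ_LINE, &irq);               /* ...then lower it */
}
```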
- IO
- When a guest accesses a device register, it reads/writes an unmapped GPA, which causes a VM exit out to userspace, triggering QEMU's device emulation
- In certain situations we have kernel shortcuts, such as ioeventfd and irqfd.
- The key is to avoid hitting QEMU, which is very slow
- Interrupts
- With a userspace IRQCHIP, interrupts are emulated in the VMM (QEMU), which costs a VM exit per injection
- KVM acceleration handles the irqchip in the kernel instead. This avoids exits to userspace.
- Sometimes we can signal directly to HW from the guest
- ioeventfd and irqfd act as doorbells and interrupt injections (sketched below)
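A sketch of wiring both shortcuts with the real KVM_IOEVENTFD/KVM_IRQFD ioctls; DOORBELL_GPA and QUEUE_GSI are illustrative, and this again assumes an in-kernel irqchip:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define DOORBELL_GPA 0xc0001000ULL   /* illustrative MMIO doorbell address */
#define QUEUE_GSI    6               /* illustrative interrupt line */

void wire_fast_paths(int vm_fd)
{
    /* Guest MMIO write to the doorbell -> kernel signals kick_fd and
     * the IO thread wakes; no KVM_EXIT_MMIO ever reaches QEMU. */
    int kick_fd = eventfd(0, EFD_CLOEXEC);
    struct kvm_ioeventfd io = {
        .addr = DOORBELL_GPA,
        .len  = 4,
        .fd   = kick_fd,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &io);

    /* Backend thread writes call_fd -> kernel injects QUEUE_GSI into
     * the guest, again without bouncing through userspace. */
    int call_fd = eventfd(0, EFD_CLOEXEC);
    struct kvm_irqfd irq = {
        .fd  = call_fd,
        .gsi = QUEUE_GSI,
    };
    ioctl(vm_fd, KVM_IRQFD, &irq);

    /* e.g. the virtio backend: uint64_t one = 1; write(call_fd, &one, 8); */
}
```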
VMMs
- VMMs create and manage VMs with the help of the hypervisor; examples include kvmtool, qemu-kvm, crosvm, and cloud-hypervisor