Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0
docs/cpuid-implementation.md
# CPUID Implementation for Volt VMM

**Date**: 2025-03-08
**Status**: ✅ **IMPLEMENTED AND WORKING**

## Summary

Implemented CPUID filtering and boot MSR configuration so that Linux kernels boot successfully in Volt VMM. The previous triple-fault crash was caused by missing CPUID configuration: the SYSCALL feature (CPUID 0x80000001, EDX bit 11) was not advertised to the guest, so the kernel's WRMSR to EFER, which enables SYSCALL, raised a #GP fault.

## Root Cause Analysis

### The Crash
```
vCPU 0 SHUTDOWN (triple fault?) at RIP=0xffffffff81000084
RAX=0x501 RCX=0xc0000080 (EFER MSR)
CR3=0x1d08000 (kernel's early_top_pgt)
EFER=0x500 (LME|LMA, but NOT SCE)
```

The kernel was trying to write `0x501` (LME | LMA | SCE) to the EFER MSR at 0xC0000080. Setting the SCE (SYSCALL Enable) bit requires CPUID to advertise SYSCALL support; without a proper CPUID configuration, KVM injects #GP on the WRMSR. Because the VMM sets the IDT limit to 0 for a clean boot, the #GP cannot be delivered, which escalates to a double fault and then a triple fault.
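
The two EFER values in the crash dump differ only in bit 0. A minimal sketch of decoding them is shown here; the bit positions are the architectural ones, while the helper name `efer_flags` is hypothetical and not part of the Volt code base.

```rust
// Architectural bit positions in the EFER MSR (0xC000_0080).
const EFER_SCE: u64 = 1 << 0; // SYSCALL Enable
const EFER_LME: u64 = 1 << 8; // Long Mode Enable
const EFER_LMA: u64 = 1 << 10; // Long Mode Active
const EFER_NXE: u64 = 1 << 11; // No-Execute Enable

/// Hypothetical helper: names of the EFER flags set in `value`,
/// listed from the lowest bit to the highest.
fn efer_flags(value: u64) -> Vec<&'static str> {
    let table = [
        (EFER_SCE, "SCE"),
        (EFER_LME, "LME"),
        (EFER_LMA, "LMA"),
        (EFER_NXE, "NXE"),
    ];
    table
        .iter()
        .filter(|(bit, _)| value & bit != 0)
        .map(|(_, name)| *name)
        .collect()
}

/// Hypothetical helper: bits that a WRMSR would newly set.
fn newly_set_bits(current: u64, requested: u64) -> u64 {
    requested & !current
}
```

Calling `newly_set_bits(0x500, 0x501)` isolates the SCE bit, which is the only change the faulting WRMSR was attempting.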

### Why No CPUID Was a Problem
Without `KVM_SET_CPUID2`, the vCPU presents a bare/default CPUID to the guest. This may not include:
- **SYSCALL** (0x80000001 EDX bit 11): required for `wrmsr EFER.SCE`
- **NX/XD** (0x80000001 EDX bit 20): required for NX page table entries
- **Long Mode** (0x80000001 EDX bit 29): required for 64-bit mode
- **Hypervisor** (0x1 ECX bit 31): tells the kernel it is running in a VM, enabling paravirt optimizations
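
The four bits listed above can be checked mechanically. The sketch below assumes the raw register values for leaves 0x1 and 0x80000001 are already available; the `CpuidRegs` type and the `missing_boot_features` function are hypothetical names used only for illustration.

```rust
/// Register values returned by one CPUID leaf.
/// Hypothetical type for illustration only.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct CpuidRegs {
    eax: u32,
    ebx: u32,
    ecx: u32,
    edx: u32,
}

const LEAF_1_ECX_HYPERVISOR: u32 = 1 << 31;
const EXT_1_EDX_SYSCALL: u32 = 1 << 11;
const EXT_1_EDX_NX: u32 = 1 << 20;
const EXT_1_EDX_LM: u32 = 1 << 29;

/// Returns the names of the boot-relevant features that are NOT advertised,
/// given leaf 0x1 and leaf 0x8000_0001 as seen by the guest.
fn missing_boot_features(leaf_1: CpuidRegs, leaf_ext_1: CpuidRegs) -> Vec<&'static str> {
    let mut missing = Vec::new();
    if leaf_ext_1.edx & EXT_1_EDX_SYSCALL == 0 {
        missing.push("SYSCALL");
    }
    if leaf_ext_1.edx & EXT_1_EDX_NX == 0 {
        missing.push("NX");
    }
    if leaf_ext_1.edx & EXT_1_EDX_LM == 0 {
        missing.push("LM");
    }
    if leaf_1.ecx & LEAF_1_ECX_HYPERVISOR == 0 {
        missing.push("HYPERVISOR");
    }
    missing
}
```

With all-zero register values, which is what an unconfigured leaf would look like under this sketch's assumption, every feature is reported as missing.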

## Implementation

### New Files
- **`vmm/src/kvm/cpuid.rs`**: Complete CPUID filtering module

### Modified Files
- **`vmm/src/kvm/mod.rs`**: Added `cpuid` module and exports
- **`vmm/src/kvm/vm.rs`**: Integrated CPUID into VM/vCPU creation flow
- **`vmm/src/kvm/vcpu.rs`**: Added boot MSR configuration

### CPUID Filtering Details

The implementation follows Firecracker's approach:

1. **Get host-supported CPUID** via `KVM_GET_SUPPORTED_CPUID`
2. **Filter/modify entries** per leaf:

| Leaf | Action | Rationale |
|------|--------|-----------|
| 0x0 | Pass through vendor | Changing vendor breaks CPU-specific kernel paths |
| 0x1 | Strip VMX/SMX/DTES64/MONITOR/DS_CPL, set HYPERVISOR bit | Security + paravirt |
| 0x4 | Adjust core topology | Match vCPU count |
| 0x6 | Clear all | Don't expose power management |
| 0x7 | **Strip TSX (HLE/RTM)**, strip MPX, RDT | Security, deprecated features |
| 0xA | Clear all | Disable PMU in guest |
| 0xB | Set APIC IDs per vCPU | Topology |
| 0x40000000 | Set KVM hypervisor signature | Enables KVM paravirt |
| 0x80000001 | **Ensure SYSCALL, NX, LM bits** | **Critical fix** |
| 0x80000007 | Only keep Invariant TSC | Clean power management |

3. **Apply to each vCPU** via `KVM_SET_CPUID2` before register setup
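
The per-leaf filtering in step 2 can be sketched as a single `match` over the leaf number. This is a simplified illustration, not the contents of `cpuid.rs`: the `CpuidEntry` struct is a hypothetical stand-in loosely modeled on KVM's `kvm_cpuid_entry2`, the topology leaves (0x4, 0xB) and the hypervisor signature leaf (0x40000000) are omitted, and unconditionally setting SYSCALL/NX/LM is an assumption that holds only when the host itself supports them.

```rust
/// Hypothetical stand-in for one CPUID entry (function = leaf, index = subleaf).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct CpuidEntry {
    function: u32,
    index: u32,
    eax: u32,
    ebx: u32,
    ecx: u32,
    edx: u32,
}

// Leaf 0x1, ECX
const ECX_DTES64: u32 = 1 << 2;
const ECX_MONITOR: u32 = 1 << 3;
const ECX_DS_CPL: u32 = 1 << 4;
const ECX_VMX: u32 = 1 << 5;
const ECX_SMX: u32 = 1 << 6;
const ECX_HYPERVISOR: u32 = 1 << 31;
// Leaf 0x7 subleaf 0, EBX
const EBX_HLE: u32 = 1 << 4;
const EBX_RTM: u32 = 1 << 11;
const EBX_RDT_M: u32 = 1 << 12;
const EBX_MPX: u32 = 1 << 14;
const EBX_RDT_A: u32 = 1 << 15;
// Leaf 0x8000_0001, EDX
const EDX_SYSCALL: u32 = 1 << 11;
const EDX_NX: u32 = 1 << 20;
const EDX_LM: u32 = 1 << 29;
// Leaf 0x8000_0007, EDX
const EDX_INVARIANT_TSC: u32 = 1 << 8;

/// Apply the per-leaf policy from the table to a single entry.
fn filter_entry(e: &mut CpuidEntry) {
    match (e.function, e.index) {
        (0x1, _) => {
            // Hide virtualization/debug features, announce the hypervisor.
            e.ecx &= !(ECX_DTES64 | ECX_MONITOR | ECX_DS_CPL | ECX_VMX | ECX_SMX);
            e.ecx |= ECX_HYPERVISOR;
        }
        (0x6, _) | (0xA, _) => {
            // Power management and PMU: clear every register.
            e.eax = 0;
            e.ebx = 0;
            e.ecx = 0;
            e.edx = 0;
        }
        (0x7, 0) => {
            // Strip TSX (HLE/RTM), MPX, and RDT monitoring/allocation.
            e.ebx &= !(EBX_HLE | EBX_RTM | EBX_MPX | EBX_RDT_M | EBX_RDT_A);
        }
        // Assumption: the host supports these bits, so forcing them is safe.
        (0x8000_0001, _) => e.edx |= EDX_SYSCALL | EDX_NX | EDX_LM,
        // Keep only the Invariant TSC bit.
        (0x8000_0007, _) => e.edx &= EDX_INVARIANT_TSC,
        _ => {}
    }
}

/// Filter every entry in place; all other leaves pass through unchanged.
fn filter_cpuid(entries: &mut [CpuidEntry]) {
    entries.iter_mut().for_each(filter_entry);
}
```

In a real VMM the filtered list would then be handed to `KVM_SET_CPUID2` for each vCPU, as described in step 3.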

### Boot MSR Configuration

Added `setup_boot_msrs()` to vcpu.rs, matching Firecracker's `create_boot_msr_entries()`:

| MSR | Value | Purpose |
|-----|-------|---------|
| IA32_SYSENTER_CS/ESP/EIP | 0 | 32-bit syscall ABI (zeroed) |
| STAR, LSTAR, CSTAR, SYSCALL_MASK | 0 | 64-bit syscall ABI (kernel fills later) |
| KERNEL_GS_BASE | 0 | Per-CPU data (kernel fills later) |
| IA32_TSC | 0 | Time Stamp Counter |
| IA32_MISC_ENABLE | FAST_STRING (bit 0) | Enable fast string operations |
| MTRRdefType | (1<<11) \| 6 | MTRR enabled, default write-back |
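
The table above maps directly onto a list of (index, value) pairs. The following is a rough sketch of building that list, not the actual body of `setup_boot_msrs()`: the `MsrEntry` struct and the `boot_msr_entries` function are hypothetical, while the MSR indices are the architectural ones.

```rust
/// Hypothetical (index, value) pair; a real VMM would use KVM's MSR entry type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct MsrEntry {
    index: u32,
    data: u64,
}

// Architectural MSR indices.
const MSR_IA32_TSC: u32 = 0x0000_0010;
const MSR_IA32_SYSENTER_CS: u32 = 0x0000_0174;
const MSR_IA32_SYSENTER_ESP: u32 = 0x0000_0175;
const MSR_IA32_SYSENTER_EIP: u32 = 0x0000_0176;
const MSR_IA32_MISC_ENABLE: u32 = 0x0000_01A0;
const MSR_MTRR_DEF_TYPE: u32 = 0x0000_02FF;
const MSR_STAR: u32 = 0xC000_0081;
const MSR_LSTAR: u32 = 0xC000_0082;
const MSR_CSTAR: u32 = 0xC000_0083;
const MSR_SYSCALL_MASK: u32 = 0xC000_0084;
const MSR_KERNEL_GS_BASE: u32 = 0xC000_0102;

const MISC_ENABLE_FAST_STRING: u64 = 1 << 0;
const MTRR_ENABLE: u64 = 1 << 11;
const MTRR_TYPE_WRITE_BACK: u64 = 6;

/// Build the boot-time MSR list described in the table.
fn boot_msr_entries() -> Vec<MsrEntry> {
    // MSRs that start at zero; the guest kernel fills them in later.
    let zeroed = [
        MSR_IA32_SYSENTER_CS,
        MSR_IA32_SYSENTER_ESP,
        MSR_IA32_SYSENTER_EIP,
        MSR_STAR,
        MSR_LSTAR,
        MSR_CSTAR,
        MSR_SYSCALL_MASK,
        MSR_KERNEL_GS_BASE,
        MSR_IA32_TSC,
    ];
    let mut entries: Vec<MsrEntry> = zeroed
        .iter()
        .map(|&index| MsrEntry { index, data: 0 })
        .collect();
    entries.push(MsrEntry {
        index: MSR_IA32_MISC_ENABLE,
        data: MISC_ENABLE_FAST_STRING,
    });
    entries.push(MsrEntry {
        index: MSR_MTRR_DEF_TYPE,
        data: MTRR_ENABLE | MTRR_TYPE_WRITE_BACK,
    });
    entries
}
```

Keeping the zeroed MSRs in one array makes it easy to compare the list against the table when either one changes.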

## Test Results

### Linux 4.14.174 (vmlinux-firecracker-official.bin)
```
✅ Full boot to init (VFS panic expected: no rootfs provided)
- Kernel version detected
- KVM hypervisor detected
- kvm-clock configured
- NX protection active
- CPU mitigations (Spectre V1/V2, SSBD, TSX) detected
- All subsystems initialized (network, SCSI, serial, etc.)
- Boot time: ~1.4 seconds to init
```

### Minimal Hello Kernel (minimal-hello.elf)
```
✅ Still works: "Hello from minimal kernel!" + "OK"
```

## Architecture Notes

### Why vmlinux ELF Works Now

The previous analysis (kernel-pagetable-analysis.md) identified that the kernel's `__startup_64()` builds its own page tables and switches CR3, abandoning the VMM's tables. This was thought to be the root cause.

**It turns out that's not the issue.** The kernel's early page tables are sufficient for the kernel's own needs. The actual problem was:

1. Kernel enters `startup_64` at physical 0x1000000
2. `__startup_64()` builds page tables in kernel BSS (`early_top_pgt` at physical 0x1d08000)
3. CR3 switches to kernel's tables
4. Kernel tries `wrmsr EFER, 0x501` to enable SYSCALL
5. **Without CPUID advertising SYSCALL support → #GP → triple fault**

With CPUID properly configured:
5. WRMSR succeeds (CPUID advertises SYSCALL)
6. Kernel continues initialization
7. Kernel sets up its own IDT/GDT for exception handling
8. Early page fault handler manages any unmapped pages lazily

### Key Insight
The vmlinux direct boot works because:
- The kernel's `__startup_64` only needs kernel text mapped (which it creates)
- `boot_params` at 0x20000 is accessed early, via `%rsi` and the identity mapping, before the CR3 switch
- The kernel's early exception handler can resolve any subsequent page faults
- **The crash was purely a CPUID/feature issue, not a page table issue**

## References

- [Firecracker CPUID source](https://github.com/firecracker-microvm/firecracker/tree/main/src/vmm/src/cpu_config/x86_64/cpuid)
- [Firecracker boot MSRs](https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/arch/x86_64/msr.rs)
- [Linux kernel CPUID usage](https://elixir.bootlin.com/linux/v4.14/source/arch/x86/kernel/head_64.S)
- [Intel SDM Vol 2A: CPUID](https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-manual.html)