Volt VMM (Neutron Stardust): source-available under AGPSL v5.0

KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
Karl Clinger
2026-03-21 01:04:35 -05:00
commit 40ed108dd5
143 changed files with 50300 additions and 0 deletions

@@ -0,0 +1,321 @@
# Linux Kernel Page Table Analysis: Why vmlinux Direct Boot Fails
**Date**: 2025-03-07
**Status**: 🔴 **ROOT CAUSE IDENTIFIED**
**Issue**: CR2=0x0 fault after kernel switches to its own page tables
## Executive Summary
The crash occurs because Linux's `__startup_64()` function **builds its own page tables** that only map the kernel text region, **abandoning the VMM-provided page tables**. After the CR3 switch, low memory (including address 0 and boot_params at 0x20000) is no longer mapped.
| Stage | Page Tables Used | Low Memory Mapped? |
|-------|-----------------|-------------------|
| VMM Setup | Volt's @ 0x1000 | ✅ Yes (identity mapped 0-4GB) |
| kernel startup_64 entry | Volt's @ 0x1000 | ✅ Yes |
| After __startup_64 + CR3 switch | Kernel's early_top_pgt | ❌ **NO** |
---
## 1. Root Cause Analysis
### The Problem Flow
```
1. Volt creates page tables at 0x1000
- Identity maps 0-4GB (including address 0)
- Maps kernel high-half (0xffffffff80000000+)
2. Volt enters kernel at startup_64
- Kernel uses Volt's tables initially
- Sets up GS_BASE, calls startup_64_setup_env()
3. Kernel calls __startup_64()
- Builds NEW page tables in early_top_pgt (kernel .data)
- Creates identity mapping for KERNEL TEXT ONLY
- Does NOT map low memory (0-16MB except kernel)
4. CR3 switches to early_top_pgt
- Volt's page tables ABANDONED
- Low memory NO LONGER MAPPED
5. 💥 Any access to low memory causes #PF with CR2=address
```
### The Kernel's Page Table Setup (head64.c)
```c
unsigned long __head __startup_64(unsigned long physaddr, struct boot_params *bp)
{
// ... setup code ...
// ONLY maps kernel text region:
for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
int idx = i + (physaddr >> PMD_SHIFT);
pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
}
// Low memory (0x0 - 0x1000000) is NOT mapped!
}
```
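To make the loop's effect concrete, the following is a minimal sketch that reproduces only its index arithmetic. The function names and the 32 MiB image size used in the example are illustrative assumptions, not values taken from the kernel or from Volt, and the sketch ignores images that cross a 1 GiB boundary.

```rust
/// Each PMD entry maps one 2 MiB page.
const PMD_SHIFT: u32 = 21;
const PMD_SIZE: u64 = 1 << PMD_SHIFT;
const PTRS_PER_PMD: u64 = 512;

/// PMD slots the loop would fill for an image loaded at `physaddr`
/// that spans `image_size` bytes (the `_end - _text` term).
pub fn kernel_pmd_slots(physaddr: u64, image_size: u64) -> Vec<usize> {
    // DIV_ROUND_UP(image_size, PMD_SIZE)
    let count = (image_size + PMD_SIZE - 1) / PMD_SIZE;
    (0..count)
        .map(|i| ((i + (physaddr >> PMD_SHIFT)) % PTRS_PER_PMD) as usize)
        .collect()
}

/// Whether `addr` falls inside one of those 2 MiB slots.
/// Simplification: only the 1 GiB region containing `physaddr` is considered.
pub fn identity_covers(physaddr: u64, image_size: u64, addr: u64) -> bool {
    let same_gib = (addr >> 30) == (physaddr >> 30);
    let slot = ((addr >> PMD_SHIFT) % PTRS_PER_PMD) as usize;
    same_gib && kernel_pmd_slots(physaddr, image_size).contains(&slot)
}
```

With a load address of 0x1000000 and a hypothetical 32 MiB image, the filled slots are 8 through 23. Slot 0, which would cover both address 0 and 0x20000, is never written.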
### What Gets Mapped in Kernel's Page Tables
| Memory Region | Mapped? | Purpose |
|---------------|---------|---------|
| 0x0 - 0xFFFFF (0-1MB) | ❌ No | Boot structures |
| 0x100000 - 0xFFFFFF (1-16MB) | ❌ No | Below kernel |
| 0x1000000 - kernel_end | ✅ Yes | Kernel text/data |
| 0xffffffff80000000+ | ✅ Yes | Kernel virtual |
| 0xffff888000000000+ (__PAGE_OFFSET) | ❌ No* | Direct physical map |
*The __PAGE_OFFSET mapping is created lazily via early page fault handler
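The table can also be expressed as a small classifier, which may be handy when triaging a CR2 value from a crash dump. This is a sketch only: the `Region` enum, the function names, and the 64 TiB direct-map size (the usual 4-level paging layout) are assumptions for illustration and are not part of Volt.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Region {
    BootStructures, // 0 - 1 MiB
    BelowKernel,    // 1 MiB - 16 MiB
    KernelIdentity, // 16 MiB - kernel_end (physical)
    KernelVirtual,  // 0xffffffff80000000 and above
    DirectMap,      // __PAGE_OFFSET region
    Unlisted,       // anything the table does not mention
}

const KERNEL_PHYS_START: u64 = 0x100_0000;
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;
const PAGE_OFFSET: u64 = 0xffff_8880_0000_0000;
const DIRECT_MAP_SIZE: u64 = 64 << 40; // assumption: 64 TiB with 4-level paging

/// Classify an address according to the table above.
pub fn classify(addr: u64, kernel_phys_end: u64) -> Region {
    if addr >= START_KERNEL_MAP {
        Region::KernelVirtual
    } else if addr >= PAGE_OFFSET && addr < PAGE_OFFSET + DIRECT_MAP_SIZE {
        Region::DirectMap
    } else if addr < 0x10_0000 {
        Region::BootStructures
    } else if addr < KERNEL_PHYS_START {
        Region::BelowKernel
    } else if addr < kernel_phys_end {
        Region::KernelIdentity
    } else {
        Region::Unlisted
    }
}

/// Whether the region is mapped immediately after the CR3 switch.
pub fn mapped_after_cr3_switch(region: Region) -> bool {
    matches!(region, Region::KernelIdentity | Region::KernelVirtual)
}
```

For example, `classify(0x20000, kernel_end)` yields `BootStructures`, which `mapped_after_cr3_switch` reports as unmapped.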
---
## 2. Why bzImage Works
The compressed kernel (bzImage) includes a **decompressor** at `arch/x86/boot/compressed/head_64.S` that:
1. **Creates full identity mapping** for ALL memory (0-4GB):
```asm
/* Build Level 2 - maps 4GB with 2MB pages */
movl $0x00000183, %eax /* Present + RW + PS (2MB page) + Global */
movl $2048, %ecx /* 2048 entries × 2MB = 4GB */
```
2. **Decompresses kernel** to 0x1000000
3. **Jumps to decompressed kernel** with decompressor's tables still in CR3
4. When startup_64 builds new tables, the **decompressor's mappings are inherited**
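The Level 2 fill in step 1 can be sketched in Rust as follows. The constant and function names are illustrative assumptions; only the flag value 0x183 (Present, RW, PS, and Global) and the count of 2048 entries come from the assembly above.

```rust
const PAGE_PRESENT: u64 = 1 << 0;
const PAGE_RW: u64 = 1 << 1;
const PAGE_PS: u64 = 1 << 7; // 2 MiB page
const PAGE_GLOBAL: u64 = 1 << 8;
const HUGE_PAGE_SIZE: u64 = 2 * 1024 * 1024;
const ENTRY_COUNT: u64 = 2048; // 2048 x 2 MiB = 4 GiB

/// Build the 2048 page-directory entries of a 0-4 GiB identity map.
pub fn identity_pd_entries() -> Vec<u64> {
    let flags = PAGE_PRESENT | PAGE_RW | PAGE_PS | PAGE_GLOBAL; // 0x183
    (0..ENTRY_COUNT)
        .map(|i| (i * HUGE_PAGE_SIZE) | flags)
        .collect()
}

/// Whether a physical address is covered by a present entry.
pub fn covers(entries: &[u64], addr: u64) -> bool {
    let idx = (addr / HUGE_PAGE_SIZE) as usize;
    entries.get(idx).map_or(false, |e| e & PAGE_PRESENT != 0)
}
```

Unlike the kernel-image-only mapping built by `__startup_64()`, entry 0 here covers address 0 and boot_params at 0x20000.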
### bzImage vs vmlinux Boot Comparison
| Aspect | bzImage | vmlinux |
|--------|---------|---------|
| Decompressor | ✅ Yes (sets up 4GB identity map) | ❌ No |
| Initial page tables | Decompressor's (full coverage) | VMM's (then abandoned) |
| Low memory after startup | ✅ Mapped | ❌ **NOT mapped** |
| Boot_params accessible | ✅ Yes | ❌ **NO** |
---
## 3. Technical Details
### Entry Point Analysis
For vmlinux ELF:
- `e_entry` = virtual address (e.g., 0xffffffff81000000)
- Corresponds to `startup_64` symbol in head_64.S
Volt correctly:
1. Loads kernel to physical 0x1000000
2. Maps virtual 0xffffffff81000000 → physical 0x1000000
3. Enters at e_entry (virtual address)
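As an illustration of steps 2 and 3, the sketch below reads `e_entry` from a little-endian ELF64 header and converts a kernel high-half virtual address to its physical address. Both function names are hypothetical and this is not Volt's loader code; the translation assumes the kernel is loaded without a relocation offset.

```rust
/// Base of the kernel's high-half mapping (__START_KERNEL_map).
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;

/// Read `e_entry` from a little-endian ELF64 header.
/// Returns None if the buffer is too short or is not ELF64 LE.
pub fn read_e_entry(elf: &[u8]) -> Option<u64> {
    if elf.len() < 32 || !elf.starts_with(b"\x7fELF") {
        return None;
    }
    // e_ident[4] = ELFCLASS64 (2), e_ident[5] = ELFDATA2LSB (1)
    if elf[4] != 2 || elf[5] != 1 {
        return None;
    }
    // e_entry follows e_ident (16), e_type (2), e_machine (2), e_version (4).
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&elf[24..32]);
    Some(u64::from_le_bytes(raw))
}

/// Translate a kernel high-half virtual address to a physical address.
/// Returns None for addresses below the high-half base.
pub fn kernel_virt_to_phys(vaddr: u64) -> Option<u64> {
    vaddr.checked_sub(START_KERNEL_MAP)
}
```

For the example entry point above, `kernel_virt_to_phys(0xffffffff81000000)` gives `Some(0x1000000)`, matching the physical load address in step 1.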
### The CR3 Switch (head_64.S)
```asm
/* Call __startup_64 which returns SME mask */
leaq _text(%rip), %rdi
movq %r15, %rsi
call __startup_64
/* Form CR3 value with early_top_pgt */
addq $(early_top_pgt - __START_KERNEL_map), %rax
/* Switch to kernel's page tables - VMM's tables abandoned! */
movq %rax, %cr3
```
### Kernel's early_top_pgt Layout
```
early_top_pgt (in kernel .data):
[0-272] = 0 (unmapped - includes identity region)
[273-510] = 0 (unmapped - includes __PAGE_OFFSET region, which starts at index 273)
[511] = level3_kernel_pgt | flags (kernel mapping)
```
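Apart from the kernel-image identity entries that `__startup_64()` adds at runtime (see section 1), only PGD[511] is populated. It covers the top 512GB of the address space; the kernel mapping within it starts at 0xffffffff80000000.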
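The slot numbers in this layout follow from how x86-64 splits a virtual address with 4-level paging. A minimal sketch of that index arithmetic is below; the function names mirror the kernel's macros but are plain Rust helpers written for illustration.

```rust
/// Top-level (PGD/PML4) index: bits 47:39 of the virtual address.
pub fn pgd_index(vaddr: u64) -> usize {
    ((vaddr >> 39) & 0x1ff) as usize
}

/// PUD (PDPT) index: bits 38:30.
pub fn pud_index(vaddr: u64) -> usize {
    ((vaddr >> 30) & 0x1ff) as usize
}

/// PMD (page directory) index: bits 29:21.
pub fn pmd_index(vaddr: u64) -> usize {
    ((vaddr >> 21) & 0x1ff) as usize
}
```

Address 0 resolves through PGD slot 0, `__PAGE_OFFSET` (0xffff888000000000) through slot 273, and the kernel mapping at 0xffffffff80000000 through slot 511.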
---
## 4. The Crash Sequence
1. **VMM**: Sets CR3=0x1000 (Volt's tables), RIP=0xffffffff81000000
2. **Kernel startup_64**:
- Sets up GS_BASE (wrmsr) ✅
- Calls startup_64_setup_env() (loads GDT, IDT) ✅
- Calls __startup_64() - builds new tables ✅
3. **CR3 Switch**: CR3 = early_top_pgt address
4. **Crash**: Something accesses low memory
- Could be stack canary check via %gs
- Could be boot_params access
- Could be early exception handler
**Crash location**: RIP=0xffffffff81000084, CR2=0x0
---
## 5. Solutions
### ✅ Recommended: Use bzImage Instead of vmlinux
The compressed kernel format handles all early setup correctly:
```rust
// In loader.rs - detect bzImage and use appropriate entry
pub fn load(...) -> Result<KernelLoadResult> {
match kernel_type {
KernelType::BzImage => Self::load_bzimage(&kernel_data, ...),
KernelType::Elf64 => {
// Warning: vmlinux direct boot has page table issues
// Consider using bzImage instead
Self::load_elf64(&kernel_data, ...)
}
}
}
```
**Why bzImage works:**
- Includes decompressor stub
- Decompressor sets up proper 4GB identity mapping
- Kernel inherits good mappings
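A loader has to tell the two formats apart before it can pick a load path. The sketch below shows one way to do that from the image's magic bytes. `detect_kernel_type` is a hypothetical standalone helper, not Volt's actual implementation. It checks the Linux boot protocol header (boot flag 0xAA55 at offset 0x1FE and `HdrS` at 0x202), and does not inspect `loadflags`, so it cannot distinguish a bzImage from an old zImage.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum KernelType {
    BzImage,
    Elf64,
}

/// Guess the kernel image format from its magic bytes.
pub fn detect_kernel_type(data: &[u8]) -> Option<KernelType> {
    // ELF magic plus ELFCLASS64 in e_ident[4].
    if data.len() >= 5 && data.starts_with(b"\x7fELF") && data[4] == 2 {
        return Some(KernelType::Elf64);
    }
    // Linux boot protocol: 0xAA55 (little-endian) at 0x1FE, "HdrS" at 0x202.
    let boot_flag = data.get(0x1fe..0x200)?;
    let header_magic = data.get(0x202..0x206)?;
    if boot_flag == &[0x55u8, 0xaa][..] && header_magic == &b"HdrS"[..] {
        Some(KernelType::BzImage)
    } else {
        None
    }
}
```

Returning `None` for unrecognized input lets the caller report an unsupported format instead of guessing.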
### ⚠️ Alternative: Pre-initialize Kernel's Page Tables
If vmlinux support is required, the VMM could pre-populate the kernel's `early_dynamic_pgts`:
```rust
// Find early_dynamic_pgts symbol in vmlinux ELF
// Pre-populate with identity mapping entries
// Set next_early_pgt to indicate tables are ready
```
**Risks:**
- Kernel version dependent
- Symbol locations change
- Fragile and hard to maintain
### ⚠️ Alternative: Use Different Entry Point
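The PVH entry point (present only if the kernel was built with PVH support) has different expectations: it is entered in 32-bit protected mode with paging disabled, so the kernel does not depend on VMM-provided page tables: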
```rust
// Look for the XEN_ELFNOTE_PHYS32_ENTRY note (name "Xen", type 18) in the ELF notes
// Use the PVH entry point it advertises; entry is with paging off, so there are no VMM tables to lose
```
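Locating the PVH entry means walking the ELF note records and finding the one named `Xen` with type `XEN_ELFNOTE_PHYS32_ENTRY` (18). The sketch below does this over the raw bytes of a note segment. It assumes little-endian notes with 4-byte alignment, and `find_pvh_entry` is a hypothetical helper rather than Volt code. Extracting the PT_NOTE segment from the ELF file is left out.

```rust
const XEN_ELFNOTE_PHYS32_ENTRY: u32 = 18;

fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Round up to the next multiple of 4 (note fields are 4-byte aligned here).
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Scan raw note bytes for the PVH entry point. Returns None if the
/// note is absent or the data is truncated.
pub fn find_pvh_entry(notes: &[u8]) -> Option<u64> {
    let mut off = 0usize;
    while off < notes.len() {
        // Note header: namesz, descsz, type (u32 each).
        let namesz = read_u32(notes, off)? as usize;
        let descsz = read_u32(notes, off + 4)? as usize;
        let ntype = read_u32(notes, off + 8)?;
        let name_off = off + 12;
        let desc_off = name_off.checked_add(align4(namesz))?;
        let name = notes.get(name_off..name_off.checked_add(namesz)?)?;
        let desc = notes.get(desc_off..desc_off.checked_add(descsz)?)?;
        if name == &b"Xen\0"[..] && ntype == XEN_ELFNOTE_PHYS32_ENTRY {
            // The descriptor holds the entry address as 4 or 8 bytes.
            return match desc.len() {
                4 => Some(u32::from_le_bytes([desc[0], desc[1], desc[2], desc[3]]) as u64),
                8 => {
                    let mut raw = [0u8; 8];
                    raw.copy_from_slice(desc);
                    Some(u64::from_le_bytes(raw))
                }
                _ => None,
            };
        }
        off = desc_off.checked_add(align4(descsz))?;
    }
    None
}
```

A kernel without this note offers no PVH entry, in which case the loader would fall back to the regular 64-bit entry or reject the image.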
---
## 6. Verification Checklist
- [x] Root cause identified: Kernel's __startup_64 builds minimal page tables
- [x] Why bzImage works: Decompressor provides full identity mapping
- [x] CR3 switch behavior confirmed from kernel source
- [x] Low memory unmapped after switch confirmed
- [ ] Test with bzImage format
- [ ] Document bzImage requirement in Volt
---
## 7. Implementation Recommendation
### Short-term Fix
Update Volt to **recommend bzImage format** by warning whenever a vmlinux ELF is loaded directly:
```rust
// In loader.rs
fn load_elf64(...) -> Result<...> {
tracing::warn!(
"Loading vmlinux ELF directly may fail due to kernel page table setup. \
Consider using bzImage format for reliable boot."
);
// ... existing code ...
}
```
### Long-term Solution
1. **Default to bzImage** for production use
2. **Document the limitation** in user-facing docs
3. **Investigate PVH entry** for vmlinux if truly needed
---
## 8. Files Referenced
### Linux Kernel Source (v6.6)
- `arch/x86/kernel/head_64.S` - Entry point, CR3 switch
- `arch/x86/kernel/head64.c` - `__startup_64()` page table setup
- `arch/x86/boot/compressed/head_64.S` - Decompressor with full identity mapping
### Volt Source
- `vmm/src/boot/loader.rs` - Kernel loading (ELF/bzImage)
- `vmm/src/boot/pagetable.rs` - VMM page table setup
- `vmm/src/boot/mod.rs` - Boot orchestration
---
## 9. Code Changes Made
### Warning Added to loader.rs
```rust
/// Load ELF64 kernel (vmlinux)
///
/// # Warning: vmlinux Direct Boot Limitations
///
/// Loading vmlinux ELF directly has a fundamental limitation...
fn load_elf64<M: GuestMemory>(...) -> Result<KernelLoadResult> {
tracing::warn!(
"Loading vmlinux ELF directly. This may fail due to kernel page table setup..."
);
// ... rest of function
}
```
---
## 10. Future Work
### If vmlinux Support is Essential
To properly support vmlinux direct boot, one of these approaches would be needed:
1. **Pre-initialize kernel's early_top_pgt**
- Parse vmlinux ELF to find `early_top_pgt` and `early_dynamic_pgts` symbols
- Pre-populate with full identity mapping
- Set `next_early_pgt` to indicate tables are ready
2. **Use PVH Entry Point**
- Check for the `XEN_ELFNOTE_PHYS32_ENTRY` note (name `Xen`, type 18) in the ELF notes
- Use PVH entry which may have different page table expectations
3. **Patch Kernel Entry**
- Skip the CR3 switch in startup_64
- Highly invasive and version-specific
### Recommended Approach for Production
Always use **bzImage** for Volt:
- Fast extraction (<10ms)
- Handles all edge cases correctly
- Standard boot format supported by QEMU and Cloud Hypervisor
---
## 11. Summary
**The core issue**: Linux kernel's startup_64 assumes the bootloader (decompressor) has set up page tables that remain valid. When vmlinux is loaded directly, the VMM's page tables are **replaced, not augmented**.
**The fix**: Use bzImage format, which includes the decompressor that properly handles page table setup for the kernel's expectations.
**Changes made**:
- Added warning to `load_elf64()` in loader.rs
- Created this analysis document