volt-vmm/docs/kernel-pagetable-analysis.md
Karl Clinger 40ed108dd5 Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
2026-03-21 01:04:35 -05:00

# Linux Kernel Page Table Analysis: Why vmlinux Direct Boot Fails
**Date**: 2025-03-07
**Status**: 🔴 **ROOT CAUSE IDENTIFIED**
**Issue**: CR2=0x0 fault after kernel switches to its own page tables
## Executive Summary
The crash occurs because Linux's `__startup_64()` function **builds its own page tables** that map only the kernel image, **abandoning the VMM-provided page tables**. After the CR3 switch, low memory (including address 0 and `boot_params` at 0x20000) is no longer mapped.
| Stage | Page Tables Used | Low Memory Mapped? |
|-------|-----------------|-------------------|
| VMM Setup | Volt's @ 0x1000 | ✅ Yes (identity mapped 0-4GB) |
| kernel startup_64 entry | Volt's @ 0x1000 | ✅ Yes |
| After __startup_64 + CR3 switch | Kernel's early_top_pgt | ❌ **NO** |
---
## 1. Root Cause Analysis
### The Problem Flow
```
1. Volt creates page tables at 0x1000
   - Identity maps 0-4GB (including address 0)
   - Maps kernel high-half (0xffffffff80000000+)
2. Volt enters kernel at startup_64
   - Kernel uses Volt's tables initially
   - Sets up GS_BASE, calls startup_64_setup_env()
3. Kernel calls __startup_64()
   - Builds NEW page tables in early_top_pgt (kernel data section)
   - Creates identity mapping for the KERNEL IMAGE ONLY
   - Does NOT map low memory below the kernel (0-16MB)
4. CR3 switches to early_top_pgt
   - Volt's page tables ABANDONED
   - Low memory NO LONGER MAPPED
5. 💥 Any access to low memory causes #PF with CR2 = faulting address
```
### The Kernel's Page Table Setup (head64.c)
```c
unsigned long __head __startup_64(unsigned long physaddr, struct boot_params *bp)
{
    // ... setup code ...

    // ONLY maps kernel text region:
    for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
        int idx = i + (physaddr >> PMD_SHIFT);
        pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
    }

    // Low memory (0x0 - 0x1000000) is NOT mapped!
}
```
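
To make the effect of that loop concrete, the following is a minimal Rust model of it, not kernel code. It assumes 2 MiB PMD pages (`PMD_SHIFT = 21`) and 512 entries per PMD page; the function names and the 32 MiB image size in the usage note are illustrative assumptions.

```rust
/// Assumed x86_64 values: 2 MiB PMD pages, 512 entries per PMD page.
const PMD_SHIFT: u32 = 21;
const PMD_SIZE: u64 = 1 << PMD_SHIFT;
const PTRS_PER_PMD: u64 = 512;

/// PMD slots the loop writes for an image of `image_size` bytes
/// (`_end - _text`) loaded at `physaddr`. Hypothetical helper.
pub fn kernel_pmd_slots(physaddr: u64, image_size: u64) -> Vec<u64> {
    // DIV_ROUND_UP(image_size, PMD_SIZE)
    let count = (image_size + PMD_SIZE - 1) / PMD_SIZE;
    (0..count)
        .map(|i| (i + (physaddr >> PMD_SHIFT)) % PTRS_PER_PMD)
        .collect()
}

/// True if `addr` lies in the physical range the loop identity-maps,
/// assuming `physaddr` is 2 MiB aligned.
pub fn identity_mapped(addr: u64, physaddr: u64, image_size: u64) -> bool {
    let count = (image_size + PMD_SIZE - 1) / PMD_SIZE;
    addr >= physaddr && addr < physaddr + count * PMD_SIZE
}
```

With a kernel loaded at 0x1000000 and an assumed 32 MiB image, `kernel_pmd_slots` yields slots 8 through 23, and `identity_mapped` is false for both address 0 and 0x20000.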
### What Gets Mapped in Kernel's Page Tables
| Memory Region | Mapped? | Purpose |
|---------------|---------|---------|
| 0x0 - 0xFFFFF (0-1MB) | ❌ No | Boot structures |
| 0x100000 - 0xFFFFFF (1-16MB) | ❌ No | Below kernel |
| 0x1000000 - kernel_end | ✅ Yes | Kernel text/data |
| 0xffffffff80000000+ | ✅ Yes | Kernel virtual |
| 0xffff888000000000+ (__PAGE_OFFSET) | ❌ No* | Direct physical map |
\* The `__PAGE_OFFSET` mapping is created lazily by the early page fault handler.
---
## 2. Why bzImage Works
The compressed kernel (bzImage) includes a **decompressor** at `arch/x86/boot/compressed/head_64.S` that:
1. **Creates full identity mapping** for ALL memory (0-4GB):
```asm
/* Build Level 2 - maps 4GB with 2MB pages */
movl $0x00000183, %eax /* Present + RW + PS (2MB page) + Global */
movl $2048, %ecx /* 2048 entries × 2MB = 4GB */
```
2. **Decompresses kernel** to 0x1000000
3. **Jumps to decompressed kernel** with decompressor's tables still in CR3
4. When startup_64 builds new tables, the **decompressor's mappings are inherited**
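
The decompressor's level-2 table from step 1 can be modelled in a few lines of Rust. This is a sketch of the arithmetic only (2048 entries, 2 MiB each, flags 0x183), not the decompressor's actual code; the constant and function names are assumptions chosen for illustration.

```rust
// x86_64 page table entry flag bits that make up 0x183.
const PAGE_PRESENT: u64 = 1 << 0;
const PAGE_RW: u64 = 1 << 1;
const PAGE_PSE: u64 = 1 << 7; // 2 MiB page at the PMD level
const PAGE_GLOBAL: u64 = 1 << 8;
const PMD_SIZE: u64 = 2 * 1024 * 1024;

/// Builds 2048 identity-mapping PMD entries covering 0-4GB.
/// Hypothetical helper mirroring the assembly loop above.
pub fn identity_pmd_entries() -> Vec<u64> {
    let flags = PAGE_PRESENT | PAGE_RW | PAGE_PSE | PAGE_GLOBAL; // 0x183
    (0..2048u64).map(|i| (i * PMD_SIZE) | flags).collect()
}
```

Entry 0 maps physical 0x0, so address 0 and `boot_params` at 0x20000 both fall inside the first 2 MiB page of this mapping.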
### bzImage vs vmlinux Boot Comparison
| Aspect | bzImage | vmlinux |
|--------|---------|---------|
| Decompressor | ✅ Yes (sets up 4GB identity map) | ❌ No |
| Initial page tables | Decompressor's (full coverage) | VMM's (then abandoned) |
| Low memory after startup | ✅ Mapped | ❌ **NOT mapped** |
| Boot_params accessible | ✅ Yes | ❌ **NO** |
---
## 3. Technical Details
### Entry Point Analysis
For vmlinux ELF:
- `e_entry` = virtual address (e.g., 0xffffffff81000000)
- Corresponds to `startup_64` symbol in head_64.S
Volt correctly:
1. Loads kernel to physical 0x1000000
2. Maps virtual 0xffffffff81000000 → physical 0x1000000
3. Enters at e_entry (virtual address)
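
As a rough illustration of the virtual-to-physical relationship above, the sketch below reads `e_entry` from a little-endian ELF64 header and subtracts the kernel's high-half base. The value used for `__START_KERNEL_map` (0xffffffff80000000) matches the mapping base quoted in this document; the function names are hypothetical and are not Volt's loader API.

```rust
/// Base of the kernel's high-half mapping (`__START_KERNEL_map`).
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;

/// Reads `e_entry` from a little-endian ELF64 header.
/// `e_entry` is the 8-byte field at offset 24.
pub fn elf64_entry(header: &[u8]) -> Option<u64> {
    // Require a full 64-byte ELF64 header: magic, ELFCLASS64, ELFDATA2LSB.
    if header.len() < 64 || !header.starts_with(b"\x7fELF") {
        return None;
    }
    if header[4] != 2 || header[5] != 1 {
        return None;
    }
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&header[24..32]);
    Some(u64::from_le_bytes(bytes))
}

/// Translates a kernel high-half virtual address to the physical
/// address it maps to, or None if the address is below the base.
pub fn kernel_virt_to_phys(virt: u64) -> Option<u64> {
    virt.checked_sub(START_KERNEL_MAP)
}
```

For the example entry point, `kernel_virt_to_phys(0xffffffff81000000)` gives 0x1000000, the physical load address listed above.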
### The CR3 Switch (head_64.S)
```asm
/* Call __startup_64 which returns SME mask */
leaq _text(%rip), %rdi
movq %r15, %rsi
call __startup_64
/* Form CR3 value with early_top_pgt */
addq $(early_top_pgt - __START_KERNEL_map), %rax
/* Switch to kernel's page tables - VMM's tables abandoned! */
movq %rax, %cr3
```
### Kernel's early_top_pgt Layout
```
early_top_pgt (in kernel .data):
  [0-272] = 0 (unmapped - includes identity region)
  [273-510] = 0 (unmapped - [273] is the __PAGE_OFFSET region)
[511] = level3_kernel_pgt | flags (kernel mapping)
```
Only PGD[511] is populated. It covers the top 512 GiB of the address space, which contains the kernel mapping at 0xffffffff80000000-0xffffffffffffffff.
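
The top-level index for any address can be checked with a short calculation. The sketch below assumes 4-level paging (`PGDIR_SHIFT = 39`, 512 entries) and models `early_top_pgt` as an array with only entry 511 marked present; the entry value is a placeholder, not the real `level3_kernel_pgt` address.

```rust
/// 4-level paging: bits 47:39 of the virtual address select the PGD entry.
const PGDIR_SHIFT: u32 = 39;

pub fn pgd_index(vaddr: u64) -> usize {
    ((vaddr >> PGDIR_SHIFT) & 0x1ff) as usize
}

/// Simplified model of early_top_pgt: only entry 511 is present.
/// The value 0x1 is a placeholder carrying just the Present bit.
pub fn early_top_pgt_model() -> [u64; 512] {
    let mut pgd = [0u64; 512];
    pgd[511] = 0x1;
    pgd
}

/// True if the top-level entry covering `vaddr` has the Present bit set.
pub fn top_level_present(pgd: &[u64; 512], vaddr: u64) -> bool {
    pgd[pgd_index(vaddr)] & 0x1 != 0
}
```

Under this model, address 0 and 0x20000 both select entry 0, `__PAGE_OFFSET` (0xffff888000000000) selects entry 273, and the kernel text at 0xffffffff81000000 selects entry 511, the only one that resolves.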
---
## 4. The Crash Sequence
1. **VMM**: Sets CR3=0x1000 (Volt's tables), RIP=0xffffffff81000000
2. **Kernel startup_64**:
   - Sets up GS_BASE (wrmsr) ✅
   - Calls startup_64_setup_env() (loads GDT, IDT) ✅
   - Calls __startup_64() - builds new tables ✅
3. **CR3 Switch**: CR3 = early_top_pgt address
4. **Crash**: Something accesses low memory
   - Could be stack canary check via %gs
   - Could be boot_params access
   - Could be early exception handler
**Crash location**: RIP=0xffffffff81000084, CR2=0x0
---
## 5. Solutions
### ✅ Recommended: Use bzImage Instead of vmlinux
The compressed kernel format handles all early setup correctly:
```rust
// In loader.rs - detect bzImage and use appropriate entry
pub fn load(...) -> Result<KernelLoadResult> {
    match kernel_type {
        KernelType::BzImage => Self::load_bzimage(&kernel_data, ...),
        KernelType::Elf64 => {
            // Warning: vmlinux direct boot has page table issues
            // Consider using bzImage instead
            Self::load_elf64(&kernel_data, ...)
        }
    }
}
```
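
One way the format detection could work is sketched below, under stated assumptions. It relies on two well-known signatures: the ELF magic with `ELFCLASS64`, and the Linux x86 boot protocol header (boot flag 0xAA55 at offset 0x1FE, `HdrS` at offset 0x202). The `KernelType` enum here is a stand-in and `detect_kernel_type` is a hypothetical helper, not Volt's actual code.

```rust
/// Stand-in for the loader's kernel type enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelType {
    BzImage,
    Elf64,
}

/// Guesses the kernel image format from its leading bytes.
/// Returns None when neither signature matches.
pub fn detect_kernel_type(data: &[u8]) -> Option<KernelType> {
    // ELF magic followed by EI_CLASS == ELFCLASS64 (2).
    if data.len() >= 5 && data.starts_with(b"\x7fELF") && data[4] == 2 {
        return Some(KernelType::Elf64);
    }
    // Linux boot protocol: 0xAA55 (little-endian) at 0x1FE, "HdrS" at 0x202.
    // This only recognises the setup header; it does not inspect loadflags.
    if data.len() >= 0x206
        && data[0x1fe..0x200] == [0x55, 0xaa]
        && data[0x202..0x206] == *b"HdrS"
    {
        return Some(KernelType::BzImage);
    }
    None
}
```

Returning `Option` keeps the "unknown format" case explicit, so the caller can reject the image instead of falling through to the ELF path.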
**Why bzImage works:**
- Includes decompressor stub
- Decompressor sets up proper 4GB identity mapping
- Kernel inherits good mappings
### ⚠️ Alternative: Pre-initialize Kernel's Page Tables
If vmlinux support is required, the VMM could pre-populate the kernel's `early_dynamic_pgts`:
```rust
// Find early_dynamic_pgts symbol in vmlinux ELF
// Pre-populate with identity mapping entries
// Set next_early_pgt to indicate tables are ready
```
**Risks:**
- Kernel version dependent
- Symbol locations change
- Fragile and hard to maintain
### ⚠️ Alternative: Use Different Entry Point
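PVH entry (if the kernel is built with it) starts in 32-bit protected mode with paging disabled, so the kernel builds all of its page tables itself and does not depend on VMM-provided tables: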
```rust
// Look for the Xen ELF note XEN_ELFNOTE_PHYS32_ENTRY (type 18) in the ELF
// Enter at that address in 32-bit protected mode with paging disabled
```
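
A minimal sketch of the note lookup follows. It assumes the caller has already extracted the raw bytes of an ELF note segment or section, and that the PVH entry is advertised as a note with name `Xen` and type `XEN_ELFNOTE_PHYS32_ENTRY` (18). `find_pvh_entry` and its helpers are hypothetical names.

```rust
/// Xen ELF note type that carries the 32-bit PVH entry point.
const XEN_ELFNOTE_PHYS32_ENTRY: u32 = 18;

/// ELF note name and descriptor fields are padded to 4 bytes.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Walks a buffer of little-endian ELF notes and returns the PVH entry
/// address if a matching Xen note is present.
pub fn find_pvh_entry(notes: &[u8]) -> Option<u64> {
    let mut off = 0usize;
    while off + 12 <= notes.len() {
        let namesz = read_u32(notes, off)? as usize;
        let descsz = read_u32(notes, off + 4)? as usize;
        let ntype = read_u32(notes, off + 8)?;
        let name_off = off + 12;
        let desc_off = name_off.checked_add(align4(namesz))?;
        let next = desc_off.checked_add(align4(descsz))?;
        let name = notes.get(name_off..name_off + namesz)?;
        let desc = notes.get(desc_off..desc_off + descsz)?;
        if name == b"Xen\0" && ntype == XEN_ELFNOTE_PHYS32_ENTRY {
            // The descriptor holds the entry address as 4 or 8 bytes.
            return match descsz {
                4 => Some(u32::from_le_bytes([desc[0], desc[1], desc[2], desc[3]]) as u64),
                8 => {
                    let mut b = [0u8; 8];
                    b.copy_from_slice(desc);
                    Some(u64::from_le_bytes(b))
                }
                _ => None,
            };
        }
        off = next;
    }
    None
}
```

A `None` result means the kernel does not advertise a PVH entry in the given notes, and the loader would need to fall back to another boot path.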
---
## 6. Verification Checklist
- [x] Root cause identified: Kernel's __startup_64 builds minimal page tables
- [x] Why bzImage works: Decompressor provides full identity mapping
- [x] CR3 switch behavior confirmed from kernel source
- [x] Low memory unmapped after switch confirmed
- [ ] Test with bzImage format
- [ ] Document bzImage requirement in Volt
---
## 7. Implementation Recommendation
### Short-term Fix
Update Volt to **warn when a vmlinux ELF is loaded** and recommend bzImage format:
```rust
// In loader.rs
fn load_elf64(...) -> Result<...> {
    tracing::warn!(
        "Loading vmlinux ELF directly may fail due to kernel page table setup. \
         Consider using bzImage format for reliable boot."
    );
    // ... existing code ...
}
```
### Long-term Solution
1. **Default to bzImage** for production use
2. **Document the limitation** in user-facing docs
3. **Investigate PVH entry** for vmlinux if truly needed
---
## 8. Files Referenced
### Linux Kernel Source (v6.6)
- `arch/x86/kernel/head_64.S` - Entry point, CR3 switch
- `arch/x86/kernel/head64.c` - `__startup_64()` page table setup
- `arch/x86/boot/compressed/head_64.S` - Decompressor with full identity mapping
### Volt Source
- `vmm/src/boot/loader.rs` - Kernel loading (ELF/bzImage)
- `vmm/src/boot/pagetable.rs` - VMM page table setup
- `vmm/src/boot/mod.rs` - Boot orchestration
---
## 9. Code Changes Made
### Warning Added to loader.rs
```rust
/// Load ELF64 kernel (vmlinux)
///
/// # Warning: vmlinux Direct Boot Limitations
///
/// Loading vmlinux ELF directly has a fundamental limitation...
fn load_elf64<M: GuestMemory>(...) -> Result<KernelLoadResult> {
    tracing::warn!(
        "Loading vmlinux ELF directly. This may fail due to kernel page table setup..."
    );
    // ... rest of function
}
```
---
## 10. Future Work
### If vmlinux Support is Essential
To properly support vmlinux direct boot, one of these approaches would be needed:
1. **Pre-initialize kernel's early_top_pgt**
   - Parse vmlinux ELF to find `early_top_pgt` and `early_dynamic_pgts` symbols
   - Pre-populate with full identity mapping
   - Set `next_early_pgt` to indicate tables are ready
2. **Use PVH Entry Point**
   - Check for the Xen ELF note `XEN_ELFNOTE_PHYS32_ENTRY` in the ELF
   - Use PVH entry, which starts with paging disabled and lets the kernel build its own page tables
3. **Patch Kernel Entry**
   - Skip the CR3 switch in startup_64
   - Highly invasive and version-specific
### Recommended Approach for Production
Always use **bzImage** for Volt:
- Fast extraction (<10ms)
- Handles all edge cases correctly
- Widely supported format for direct kernel boot (for example, QEMU's `-kernel` option)
---
## 11. Summary
**The core issue**: Linux kernel's startup_64 assumes the bootloader (decompressor) has set up page tables that remain valid. When vmlinux is loaded directly, the VMM's page tables are **replaced, not augmented**.
**The fix**: Use bzImage format, which includes the decompressor that properly handles page table setup for the kernel's expectations.
**Changes made**:
- Added warning to `load_elf64()` in loader.rs
- Created this analysis document