Linux Kernel Page Table Analysis: Why vmlinux Direct Boot Fails
Date: 2025-03-07
Status: 🔴 ROOT CAUSE IDENTIFIED
Issue: CR2=0x0 fault after kernel switches to its own page tables
Executive Summary
The crash occurs because Linux's __startup_64() function builds its own page tables that only map the kernel text region, abandoning the VMM-provided page tables. After the CR3 switch, low memory (including address 0 and boot_params at 0x20000) is no longer mapped.
| Stage | Page Tables Used | Low Memory Mapped? |
|---|---|---|
| VMM Setup | Volt's @ 0x1000 | ✅ Yes (identity mapped 0-4GB) |
| kernel startup_64 entry | Volt's @ 0x1000 | ✅ Yes |
| After __startup_64 + CR3 switch | Kernel's early_top_pgt | ❌ NO |
1. Root Cause Analysis
The Problem Flow
1. Volt creates page tables at 0x1000
   - Identity maps 0-4GB (including address 0)
   - Maps kernel high-half (0xffffffff80000000+)
2. Volt enters kernel at startup_64
   - Kernel uses Volt's tables initially
   - Sets up GS_BASE, calls startup_64_setup_env()
3. Kernel calls __startup_64()
   - Builds NEW page tables in early_top_pgt (kernel BSS)
   - Creates identity mapping for KERNEL TEXT ONLY
   - Does NOT map low memory (0-16MB except kernel)
4. CR3 switches to early_top_pgt
   - Volt's page tables ABANDONED
   - Low memory NO LONGER MAPPED
5. 💥 Any access to low memory causes #PF with CR2=address
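Step 1 of this flow, the 0-4GB identity map, can be modelled in a few lines. This is a simplified sketch, not Volt's actual pagetable.rs code: it models only the page-directory level with 2MB pages, and the names `identity_pd_entries` and `translate` are hypothetical.

```rust
/// x86-64 page-table entry flag bits used by this sketch.
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const HUGE_PAGE: u64 = 1 << 7; // PS bit: the PD entry maps a 2 MiB page

const ENTRIES_PER_TABLE: usize = 512;
const PAGE_2M: u64 = 2 * 1024 * 1024;

/// Hypothetical helper: page-directory entries that identity-map the first
/// `gib` GiB with 2 MiB pages (512 entries per GiB).
pub fn identity_pd_entries(gib: usize) -> Vec<u64> {
    (0..gib * ENTRIES_PER_TABLE)
        .map(|i| (i as u64 * PAGE_2M) | PRESENT | WRITABLE | HUGE_PAGE)
        .collect()
}

/// Hypothetical helper: translate `addr` through the entry list, or return
/// `None` when no present entry covers it (a real access would fault).
pub fn translate(entries: &[u64], addr: u64) -> Option<u64> {
    let idx = (addr / PAGE_2M) as usize;
    let entry = *entries.get(idx)?;
    if entry & PRESENT == 0 {
        return None;
    }
    let base = entry & !(PAGE_2M - 1); // strip the flag bits
    Some(base + addr % PAGE_2M)
}
```

With these entries installed, address 0 and boot_params at 0x20000 both translate to themselves. Once a table without those entries is loaded into CR3, the same lookup returns `None`, which is the fault described in step 5.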
The Kernel's Page Table Setup (head64.c)
unsigned long __head __startup_64(unsigned long physaddr, struct boot_params *bp)
{
    // ... setup code ...
    // ONLY maps kernel text region:
    for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
        int idx = i + (physaddr >> PMD_SHIFT);
        pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
    }
    // Low memory (0x0 - 0x1000000) is NOT mapped!
}
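To see which 2MB slots the loop above fills, the index arithmetic can be replayed outside the kernel. The sketch below is a Rust model of that loop only. The function names are hypothetical, the image size is whatever `_end - _text` happens to be for a given build, and the coverage check assumes the kernel is loaded below 1GB so that a single PMD table is involved.

```rust
const PMD_SHIFT: u32 = 21;
const PMD_SIZE: u64 = 1 << PMD_SHIFT; // 2 MiB
const PTRS_PER_PMD: u64 = 512;

/// Hypothetical model of the loop in __startup_64(): returns the PMD slots
/// that receive an identity-mapping entry for a kernel image of `image_size`
/// bytes loaded at `physaddr`.
pub fn early_identity_pmd_slots(physaddr: u64, image_size: u64) -> Vec<u64> {
    let count = (image_size + PMD_SIZE - 1) / PMD_SIZE; // DIV_ROUND_UP
    (0..count)
        .map(|i| (i + (physaddr >> PMD_SHIFT)) % PTRS_PER_PMD)
        .collect()
}

/// Returns true when `addr` falls in one of the slots filled above.
/// Assumption: only the first 1 GiB (one PMD table) is considered.
pub fn is_covered(physaddr: u64, image_size: u64, addr: u64) -> bool {
    addr < PTRS_PER_PMD * PMD_SIZE
        && early_identity_pmd_slots(physaddr, image_size).contains(&(addr >> PMD_SHIFT))
}
```

For a kernel loaded at 0x1000000, the first filled slot is 8 (0x1000000 >> 21), so slot 0, which would hold address 0 and boot_params, is never written.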
What Gets Mapped in Kernel's Page Tables
| Memory Region | Mapped? | Purpose |
|---|---|---|
| 0x0 - 0xFFFFF (0-1MB) | ❌ No | Boot structures |
| 0x100000 - 0xFFFFFF (1-16MB) | ❌ No | Below kernel |
| 0x1000000 - kernel_end | ✅ Yes | Kernel text/data |
| 0xffffffff80000000+ | ✅ Yes | Kernel virtual |
| 0xffff888000000000+ (__PAGE_OFFSET) | ❌ No* | Direct physical map |
*The __PAGE_OFFSET mapping is created lazily by the early page fault handler.
2. Why bzImage Works
The compressed kernel (bzImage) includes a decompressor at arch/x86/boot/compressed/head_64.S that:
- Creates a full identity mapping for the first 4GB of memory (0-4GB):
  /* Build Level 2 - maps 4GB with 2MB pages */
  movl $0x00000183, %eax /* Present + RW + PS (2MB page) + Global */
  movl $2048, %ecx /* 2048 entries × 2MB = 4GB */
- Decompresses kernel to 0x1000000
- Jumps to decompressed kernel with decompressor's tables still in CR3
- When startup_64 builds new tables, the decompressor's mappings are inherited
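The two constants in the decompressor snippet can be checked mechanically. The sketch below decodes the flag bits of a page-directory entry and computes how much memory a run of 2MB entries covers. The type and function names are hypothetical, and the bit positions are the standard x86-64 ones.

```rust
/// Flag bits of an x86-64 page-directory entry relevant to this document.
#[derive(Debug, PartialEq, Eq)]
pub struct EntryFlags {
    pub present: bool,   // bit 0
    pub writable: bool,  // bit 1
    pub huge_page: bool, // bit 7 (PS): entry maps a 2 MiB page
    pub global: bool,    // bit 8
}

/// Hypothetical helper: decode the flag bits of a raw entry value.
pub fn decode_flags(entry: u64) -> EntryFlags {
    EntryFlags {
        present: entry & (1 << 0) != 0,
        writable: entry & (1 << 1) != 0,
        huge_page: entry & (1 << 7) != 0,
        global: entry & (1 << 8) != 0,
    }
}

/// Bytes covered by `entries` consecutive 2 MiB page-directory entries.
pub fn coverage_bytes(entries: u64) -> u64 {
    entries * (2 << 20)
}
```

Decoding 0x183 sets all four flags, and 2048 entries cover exactly 4GB.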
bzImage vs vmlinux Boot Comparison
| Aspect | bzImage | vmlinux |
|---|---|---|
| Decompressor | ✅ Yes (sets up 4GB identity map) | ❌ No |
| Initial page tables | Decompressor's (full coverage) | VMM's (then abandoned) |
| Low memory after startup | ✅ Mapped | ❌ NOT mapped |
| Boot_params accessible | ✅ Yes | ❌ NO |
3. Technical Details
Entry Point Analysis
For vmlinux ELF:
- e_entry = virtual address (e.g., 0xffffffff81000000)
- Corresponds to startup_64 symbol in head_64.S
Volt correctly:
- Loads kernel to physical 0x1000000
- Maps virtual 0xffffffff81000000 → physical 0x1000000
- Enters at e_entry (virtual address)
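The virtual-to-physical relationship used here is a fixed offset. A minimal sketch, assuming the kernel is loaded at its link address (no relocation or KASLR) and using the hypothetical name `kernel_virt_to_phys`:

```rust
/// Base of the kernel's high-half mapping (__START_KERNEL_map).
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;

/// Hypothetical helper: physical address backing a kernel-text virtual
/// address, or `None` if the address lies below the kernel mapping.
/// Assumption: the image sits at its link address, with no relocation.
pub fn kernel_virt_to_phys(virt: u64) -> Option<u64> {
    virt.checked_sub(START_KERNEL_MAP)
}
```

Applied to the e_entry value above, this yields 0x1000000, the physical load address Volt uses.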
The CR3 Switch (head_64.S)
/* Call __startup_64 which returns SME mask */
leaq _text(%rip), %rdi
movq %r15, %rsi
call __startup_64
/* Form CR3 value with early_top_pgt */
addq $(early_top_pgt - __START_KERNEL_map), %rax
/* Switch to kernel's page tables - VMM's tables abandoned! */
movq %rax, %cr3
Kernel's early_top_pgt Layout
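early_top_pgt (in kernel .data):
[0-272] = 0 (unmapped - includes identity region)
[273-510] = 0 (unmapped - includes __PAGE_OFFSET region, which starts at PGD[273])
[511] = level3_kernel_pgt | flags (kernel mapping)
Statically, only PGD[511] is populated, mapping 0xffffffff80000000-0xffffffffffffffff. At run time, __startup_64() adds the kernel-image identity mapping described in section 1; no other low-memory entry is installed.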
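The slot numbers in this layout follow from the 4-level paging split, where bits 47:39 of a virtual address select the top-level entry. A small sketch, with the hypothetical name `pgd_index` mirroring the kernel macro of the same name:

```rust
/// Top-level (PGD) index of a virtual address under 4-level paging:
/// bits 47:39, one of 512 slots, each spanning 512 GiB.
pub fn pgd_index(vaddr: u64) -> usize {
    ((vaddr >> 39) & 0x1ff) as usize
}
```

The kernel mapping base 0xffffffff80000000 lands in slot 511, while address 0 and the kernel's physical load address both land in slot 0. The same helper can be used to check which slot the __PAGE_OFFSET base falls into.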
4. The Crash Sequence
- VMM: Sets CR3=0x1000 (Volt's tables), RIP=0xffffffff81000000
- Kernel startup_64:
  - Sets up GS_BASE (wrmsr) ✅
  - Calls startup_64_setup_env() (loads GDT, IDT) ✅
  - Calls __startup_64() - builds new tables ✅
- CR3 Switch: CR3 = early_top_pgt address
- Crash: Something accesses low memory
  - Could be stack canary check via %gs
  - Could be boot_params access
  - Could be early exception handler
Crash location: RIP=0xffffffff81000084, CR2=0x0
5. Solutions
✅ Recommended: Use bzImage Instead of vmlinux
The compressed kernel format handles all early setup correctly:
// In loader.rs - detect bzImage and use appropriate entry
pub fn load(...) -> Result<KernelLoadResult> {
    match kernel_type {
        KernelType::BzImage => Self::load_bzimage(&kernel_data, ...),
        KernelType::Elf64 => {
            // Warning: vmlinux direct boot has page table issues
            // Consider using bzImage instead
            Self::load_elf64(&kernel_data, ...)
        }
    }
}
Why bzImage works:
- Includes decompressor stub
- Decompressor sets up proper 4GB identity mapping
- Kernel inherits good mappings
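Choosing between the two loader paths requires telling the formats apart. One way to do this, sketched below, is to check the magic numbers: an ELF64 file starts with 0x7f "ELF" and has class byte 2, while a bzImage carries the boot flag 0xAA55 at offset 0x1FE and the "HdrS" signature at offset 0x202, as defined by the Linux x86 boot protocol. The `KernelType` enum here is a local stand-in for the one in loader.rs, and `detect_kernel_type` is a hypothetical name. How Volt actually detects the format is not shown in this document.

```rust
/// Local stand-in for the loader's kernel type enum (assumption).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum KernelType {
    BzImage,
    Elf64,
}

/// Hypothetical helper: classify a kernel image by its magic numbers.
pub fn detect_kernel_type(data: &[u8]) -> Option<KernelType> {
    // ELF: magic 0x7f 'E' 'L' 'F', then EI_CLASS == 2 (ELFCLASS64).
    if data.starts_with(b"\x7fELF") && data.get(4) == Some(&2) {
        return Some(KernelType::Elf64);
    }
    // bzImage: boot_flag 0xAA55 (little-endian) at 0x1FE, "HdrS" at 0x202.
    let boot_flag_ok = data.get(0x1fe..0x200) == Some(&[0x55u8, 0xaa][..]);
    let header_ok = data.get(0x202..0x206) == Some(&b"HdrS"[..]);
    if boot_flag_ok && header_ok {
        return Some(KernelType::BzImage);
    }
    None
}
```

Returning `None` for anything else lets the caller reject unknown images early instead of attempting a boot that fails later.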
⚠️ Alternative: Pre-initialize Kernel's Page Tables
If vmlinux support is required, the VMM could pre-populate the kernel's early_dynamic_pgts:
// Find early_dynamic_pgts symbol in vmlinux ELF
// Pre-populate with identity mapping entries
// Set next_early_pgt to indicate tables are ready
Risks:
- Kernel version dependent
- Symbol locations change
- Fragile and hard to maintain
⚠️ Alternative: Use Different Entry Point
PVH entry (if kernel supports it) might have different expectations:
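// Look for the Xen ELF note XEN_ELFNOTE_PHYS32_ENTRY in the ELF notes
// Use the PVH entry point, which starts in 32-bit protected mode with paging
// disabled, so the kernel builds its page tables without relying on the VMM's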
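A PVH-capable kernel advertises its 32-bit entry point in an ELF note with the name "Xen" and type XEN_ELFNOTE_PHYS32_ENTRY (18). The sketch below scans the raw bytes of a note segment for that note. It assumes little-endian encoding and 4-byte note alignment, takes the note bytes as already extracted from the ELF file, and uses the hypothetical name `find_pvh_entry`. It is not taken from Volt's loader.

```rust
use std::convert::TryInto;

/// Note type carrying the PVH 32-bit entry point (Xen ELF note ABI).
const XEN_ELFNOTE_PHYS32_ENTRY: u32 = 18;

/// Read a little-endian u32 at `off`, or `None` if out of bounds.
fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let bytes = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

/// Round up to the 4-byte alignment assumed for note fields.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Hypothetical helper: scan the bytes of an ELF note segment and return
/// the PVH entry address if a matching Xen note is present.
pub fn find_pvh_entry(notes: &[u8]) -> Option<u64> {
    let mut off = 0usize;
    while off + 12 <= notes.len() {
        // Note header: namesz, descsz, type (each 4 bytes).
        let namesz = read_u32(notes, off)? as usize;
        let descsz = read_u32(notes, off + 4)? as usize;
        let ntype = read_u32(notes, off + 8)?;
        let name_off = off + 12;
        let desc_off = name_off.checked_add(align4(namesz))?;
        let next = desc_off.checked_add(align4(descsz))?;
        let name = notes.get(name_off..name_off + namesz)?;
        let desc = notes.get(desc_off..desc_off + descsz)?;
        if name == b"Xen\0" && ntype == XEN_ELFNOTE_PHYS32_ENTRY {
            // The descriptor holds the entry address as 4 or 8 bytes.
            return match descsz {
                4 => Some(u32::from_le_bytes(desc.try_into().ok()?) as u64),
                8 => Some(u64::from_le_bytes(desc.try_into().ok()?)),
                _ => None,
            };
        }
        off = next;
    }
    None
}
```

A `None` result means the image offers no PVH entry and this alternative does not apply to that kernel build.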
6. Verification Checklist
- Root cause identified: Kernel's __startup_64 builds minimal page tables
- Why bzImage works: Decompressor provides full identity mapping
- CR3 switch behavior confirmed from kernel source
- Low memory unmapped after switch confirmed
- Test with bzImage format
- Document bzImage requirement in Volt
7. Implementation Recommendation
Short-term Fix
Update Volt to warn whenever a vmlinux ELF is loaded directly, steering users to bzImage:
// In loader.rs
fn load_elf64(...) -> Result<...> {
    tracing::warn!(
        "Loading vmlinux ELF directly may fail due to kernel page table setup. \
         Consider using bzImage format for reliable boot."
    );
    // ... existing code ...
}
Long-term Solution
- Default to bzImage for production use
- Document the limitation in user-facing docs
- Investigate PVH entry for vmlinux if truly needed
8. Files Referenced
Linux Kernel Source (v6.6)
- arch/x86/kernel/head_64.S - Entry point, CR3 switch
- arch/x86/kernel/head64.c - __startup_64() page table setup
- arch/x86/boot/compressed/head_64.S - Decompressor with full identity mapping
Volt Source
- vmm/src/boot/loader.rs - Kernel loading (ELF/bzImage)
- vmm/src/boot/pagetable.rs - VMM page table setup
- vmm/src/boot/mod.rs - Boot orchestration
9. Code Changes Made
Warning Added to loader.rs
/// Load ELF64 kernel (vmlinux)
///
/// # Warning: vmlinux Direct Boot Limitations
///
/// Loading vmlinux ELF directly has a fundamental limitation...
fn load_elf64<M: GuestMemory>(...) -> Result<KernelLoadResult> {
    tracing::warn!(
        "Loading vmlinux ELF directly. This may fail due to kernel page table setup..."
    );
    // ... rest of function
}
10. Future Work
If vmlinux Support is Essential
To properly support vmlinux direct boot, one of these approaches would be needed:
- Pre-initialize kernel's early_top_pgt
  - Parse vmlinux ELF to find early_top_pgt and early_dynamic_pgts symbols
  - Pre-populate with full identity mapping
  - Set next_early_pgt to indicate tables are ready
- Use PVH Entry Point
  - Check for the XEN_ELFNOTE_PHYS32_ENTRY note in the ELF notes
  - Use PVH entry which may have different page table expectations
- Patch Kernel Entry
  - Skip the CR3 switch in startup_64
  - Highly invasive and version-specific
Recommended Approach for Production
Always use bzImage for Volt:
- Fast extraction (<10ms)
- Handles all edge cases correctly
- Widely supported boot format (accepted by QEMU and Cloud Hypervisor, for example)
11. Summary
The core issue: Linux kernel's startup_64 assumes the bootloader (decompressor) has set up page tables that remain valid. When vmlinux is loaded directly, the VMM's page tables are replaced, not augmented.
The fix: Use bzImage format, which includes the decompressor that properly handles page table setup for the kernel's expectations.
Changes made:
- Added warning to load_elf64() in loader.rs
- Created this analysis document