volt-vmm/docs/kernel-pagetable-analysis.md
Karl Clinger 40ed108dd5 Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
2026-03-21 01:04:35 -05:00


Linux Kernel Page Table Analysis: Why vmlinux Direct Boot Fails

Date: 2025-03-07
Status: 🔴 ROOT CAUSE IDENTIFIED
Issue: CR2=0x0 fault after kernel switches to its own page tables

Executive Summary

The crash occurs because Linux's __startup_64() function builds its own page tables that only map the kernel text region, abandoning the VMM-provided page tables. After the CR3 switch, low memory (including address 0 and boot_params at 0x20000) is no longer mapped.

| Stage | Page Tables Used | Low Memory Mapped? |
| --- | --- | --- |
| VMM setup | Volt's @ 0x1000 | Yes (identity mapped 0-4GB) |
| Kernel startup_64 entry | Volt's @ 0x1000 | Yes |
| After __startup_64 + CR3 switch | Kernel's early_top_pgt | NO |

1. Root Cause Analysis

The Problem Flow

1. Volt creates page tables at 0x1000
   - Identity maps 0-4GB (including address 0)
   - Maps kernel high-half (0xffffffff80000000+)
   
2. Volt enters kernel at startup_64
   - Kernel uses Volt's tables initially
   - Sets up GS_BASE, calls startup_64_setup_env()
   
3. Kernel calls __startup_64()
   - Builds NEW page tables in early_top_pgt (kernel .data)
   - Creates identity mapping for the KERNEL IMAGE ONLY
   - Does NOT map low memory (0-16MB, everything below the kernel load address)
   
4. CR3 switches to early_top_pgt
   - Volt's page tables ABANDONED
   - Low memory NO LONGER MAPPED
   
5. 💥 Any access to low memory causes #PF with CR2=address

The Kernel's Page Table Setup (head64.c)

unsigned long __head __startup_64(unsigned long physaddr, struct boot_params *bp)
{
    // ... setup code ...
    
    // ONLY maps kernel text region:
    for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
        int idx = i + (physaddr >> PMD_SHIFT);
        pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
    }
    
    // Low memory (0x0 - 0x1000000) is NOT mapped!
}
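
The loop above can be modelled outside the kernel to see which PMD slots it fills. The following is a minimal sketch, not kernel code: `populated_pmd_slots` is a hypothetical helper, and the 32 MiB image size used in the usage note is an assumed value chosen only for illustration.

```rust
/// One PMD-level mapping on x86_64 covers 2 MiB.
const PMD_SHIFT: u32 = 21;
const PMD_SIZE: u64 = 1 << PMD_SHIFT;
const PTRS_PER_PMD: u64 = 512;

/// Returns the PMD slot indices the quoted loop would populate for an image
/// loaded at `physaddr` that spans `image_size` bytes (`_end - _text`).
/// Hypothetical helper for illustration only.
pub fn populated_pmd_slots(physaddr: u64, image_size: u64) -> Vec<u64> {
    // Equivalent of DIV_ROUND_UP(_end - _text, PMD_SIZE).
    let count = (image_size + PMD_SIZE - 1) / PMD_SIZE;
    (0..count)
        .map(|i| (i + (physaddr >> PMD_SHIFT)) % PTRS_PER_PMD)
        .collect()
}
```

For a kernel loaded at 0x1000000 with an assumed 32 MiB image, the populated slots start at index 8 (0x1000000 >> 21). Slots 0 through 7, which would cover physical addresses 0 to 16MB, are never written.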

What Gets Mapped in Kernel's Page Tables

| Memory Region | Mapped? | Purpose |
| --- | --- | --- |
| 0x0 - 0xFFFFF (0-1MB) | No | Boot structures |
| 0x100000 - 0xFFFFFF (1-16MB) | No | Below kernel |
| 0x1000000 - kernel_end | Yes | Kernel text/data |
| 0xffffffff80000000+ | Yes | Kernel virtual |
| 0xffff888000000000+ (__PAGE_OFFSET) | No* | Direct physical map |

*The __PAGE_OFFSET mapping is created lazily by the early page fault handler.


2. Why bzImage Works

The compressed kernel (bzImage) includes a decompressor at arch/x86/boot/compressed/head_64.S that:

  1. Creates a full identity mapping of the first 4GB of memory:
/* Build Level 2 - maps 4GB with 2MB pages */
movl	$0x00000183, %eax  /* Present + RW + PS (2MB page) + Global */
movl	$2048, %ecx        /* 2048 entries × 2MB = 4GB */

  2. Decompresses kernel to 0x1000000

  3. Jumps to decompressed kernel with decompressor's tables still in CR3

  4. When startup_64 builds new tables, the decompressor's mappings are inherited
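
The level-2 table the decompressor builds can be reproduced in a few lines. This is a sketch of the arithmetic only. `build_identity_pmds` and the flag constant names are hypothetical, and the flag value 0x183 is decoded here as Present, RW, PS and Global.

```rust
// x86_64 page-table entry flag bits (architectural bit positions).
const PAGE_PRESENT: u64 = 1 << 0;
const PAGE_RW: u64 = 1 << 1;
const PAGE_PS: u64 = 1 << 7; // 2 MiB page at the PMD level
const PAGE_GLOBAL: u64 = 1 << 8;
const PMD_SIZE: u64 = 2 * 1024 * 1024;

/// Builds `entries` identity-mapping PMD entries, each covering 2 MiB.
/// Hypothetical helper mirroring the decompressor's "Build Level 2" loop.
pub fn build_identity_pmds(entries: u64) -> Vec<u64> {
    let flags = PAGE_PRESENT | PAGE_RW | PAGE_PS | PAGE_GLOBAL; // 0x183
    (0..entries).map(|i| (i * PMD_SIZE) | flags).collect()
}
```

Calling `build_identity_pmds(2048)` yields entries whose physical bases run from 0x0 up to 0xFFE00000, which together cover exactly 4GB.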

bzImage vs vmlinux Boot Comparison

| Aspect | bzImage | vmlinux |
| --- | --- | --- |
| Decompressor | Yes (sets up 4GB identity map) | No |
| Initial page tables | Decompressor's (full coverage) | VMM's (then abandoned) |
| Low memory after startup | Mapped | NOT mapped |
| boot_params accessible | Yes | NO |

3. Technical Details

Entry Point Analysis

For vmlinux ELF:

  • e_entry = virtual address (e.g., 0xffffffff81000000)
  • Corresponds to startup_64 symbol in head_64.S

Volt correctly:

  1. Loads kernel to physical 0x1000000
  2. Maps virtual 0xffffffff81000000 → physical 0x1000000
  3. Enters at e_entry (virtual address)
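
The virtual-to-physical relationship in step 2 is a fixed offset from __START_KERNEL_map (0xffffffff80000000). A minimal sketch of that arithmetic, assuming a non-relocated kernel; `kernel_virt_to_phys` is a hypothetical helper name, not a Volt or kernel function:

```rust
/// Base of the kernel's high-half text mapping on x86_64.
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;

/// Converts a kernel-image virtual address to the physical address it is
/// linked against. Returns `None` for addresses below the kernel mapping.
/// Assumes no relocation (load offset of zero).
pub fn kernel_virt_to_phys(virt: u64) -> Option<u64> {
    virt.checked_sub(START_KERNEL_MAP)
}
```

With the example entry point, `kernel_virt_to_phys(0xffffffff81000000)` gives 0x1000000, matching the load address in step 1. The same subtraction appears in the kernel's own CR3 computation (`early_top_pgt - __START_KERNEL_map`).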

The CR3 Switch (head_64.S)

/* Call __startup_64 which returns SME mask */
leaq    _text(%rip), %rdi
movq    %r15, %rsi
call    __startup_64

/* Form CR3 value with early_top_pgt */
addq    $(early_top_pgt - __START_KERNEL_map), %rax

/* Switch to kernel's page tables - VMM's tables abandoned! */
movq    %rax, %cr3

Kernel's early_top_pgt Layout

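early_top_pgt (in kernel .data):
  [0-272]   = 0 (unmapped - includes identity region)
  [273-510] = 0 (unmapped - __PAGE_OFFSET 0xffff888000000000 starts at index 273)
  [511]     = level3_kernel_pgt | flags  (kernel mapping)

PGD[511] is the only entry that carries a full mapping; it covers the kernel at 0xffffffff80000000-0xffffffffffffffff. The identity mapping that __startup_64() adds for the switchover covers only the kernel image, so address 0 and boot_params stay unmapped.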
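
To check which top-level slot a given address falls into, the virtual address can be split into its table indices. The sketch below assumes 4-level paging (9 bits per level, 4 KiB pages); `PtIndices` and `pt_indices` are hypothetical names used only for illustration.

```rust
/// Page-table indices of a virtual address under 4-level paging.
#[derive(Debug, PartialEq, Eq)]
pub struct PtIndices {
    pub pgd: usize,
    pub pud: usize,
    pub pmd: usize,
    pub pte: usize,
}

/// Splits `virt` into its four 9-bit table indices.
pub fn pt_indices(virt: u64) -> PtIndices {
    let idx = |shift: u32| ((virt >> shift) & 0x1ff) as usize;
    PtIndices {
        pgd: idx(39), // bits 47:39
        pud: idx(30), // bits 38:30
        pmd: idx(21), // bits 29:21
        pte: idx(12), // bits 20:12
    }
}
```

Address 0 and boot_params at 0x20000 both land in PGD slot 0, the kernel mapping at 0xffffffff80000000 lands in slot 511, and __PAGE_OFFSET (0xffff888000000000) lands in slot 273.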


4. The Crash Sequence

  1. VMM: Sets CR3=0x1000 (Volt's tables), RIP=0xffffffff81000000

  2. Kernel startup_64:

    • Sets up GS_BASE (wrmsr)
    • Calls startup_64_setup_env() (loads GDT, IDT)
    • Calls __startup_64() - builds new tables
  3. CR3 Switch: CR3 = early_top_pgt address

  4. Crash: Something accesses low memory

    • Could be stack canary check via %gs
    • Could be boot_params access
    • Could be early exception handler

Crash location: RIP=0xffffffff81000084, CR2=0x0
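
The crash sequence can be reproduced with a toy model that tracks only which 2MB regions are mapped. This is a simplified sketch, not a real page-table walker: `ToyAddressSpace`, `kernel_early_tables` and `vmm_tables` are hypothetical names, and the image size passed in is an assumed value.

```rust
use std::collections::BTreeSet;

const PMD_SHIFT: u32 = 21; // 2 MiB granularity

#[derive(Debug, PartialEq, Eq)]
pub enum Access {
    Mapped,
    PageFault { cr2: u64 },
}

/// Toy address space: a set of mapped 2 MiB virtual page numbers.
#[derive(Default)]
pub struct ToyAddressSpace {
    mapped: BTreeSet<u64>,
}

impl ToyAddressSpace {
    /// Marks every 2 MiB page overlapping `[start, end)` as mapped.
    pub fn map_range(&mut self, start: u64, end: u64) {
        if end <= start {
            return;
        }
        for page in (start >> PMD_SHIFT)..=((end - 1) >> PMD_SHIFT) {
            self.mapped.insert(page);
        }
    }

    /// Reports a fault with CR2 set to the faulting address if unmapped.
    pub fn access(&self, virt: u64) -> Access {
        if self.mapped.contains(&(virt >> PMD_SHIFT)) {
            Access::Mapped
        } else {
            Access::PageFault { cr2: virt }
        }
    }
}

/// Mappings described for the kernel's early tables: the kernel image only,
/// once identity-mapped and once in the high half.
pub fn kernel_early_tables(image_size: u64) -> ToyAddressSpace {
    let mut t = ToyAddressSpace::default();
    t.map_range(0x100_0000, 0x100_0000 + image_size);
    t.map_range(0xffff_ffff_8100_0000, 0xffff_ffff_8100_0000 + image_size);
    t
}

/// Mappings described for Volt's tables: 0-4GB identity plus the high half.
pub fn vmm_tables() -> ToyAddressSpace {
    let mut t = ToyAddressSpace::default();
    t.map_range(0, 4u64 << 30);
    t.map_range(0xffff_ffff_8000_0000, 0xffff_ffff_c000_0000);
    t
}
```

Under `vmm_tables()` an access to address 0 succeeds. Under `kernel_early_tables(..)` the same access returns `PageFault { cr2: 0 }`, while the faulting RIP itself (0xffffffff81000084) is still mapped, which matches the observed crash state.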


5. Solutions

Recommended: Use bzImage

The compressed kernel format handles all early setup correctly:

// In loader.rs - detect bzImage and use appropriate entry
pub fn load(...) -> Result<KernelLoadResult> {
    match kernel_type {
        KernelType::BzImage => Self::load_bzimage(&kernel_data, ...),
        KernelType::Elf64 => {
            // Warning: vmlinux direct boot has page table issues
            // Consider using bzImage instead
            Self::load_elf64(&kernel_data, ...)
        }
    }
}

Why bzImage works:

  • Includes decompressor stub
  • Decompressor sets up proper 4GB identity mapping
  • Kernel inherits good mappings
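
Distinguishing the two formats only needs a few magic-number checks. The sketch below is an assumption about how such detection could look, not Volt's actual loader code. It relies on the ELF identification bytes and on the Linux x86 boot protocol header fields (boot flag 0xAA55 at offset 0x1FE, "HdrS" at offset 0x202). The `KernelType` enum here is a stand-in that reuses the variant names from the snippet above.

```rust
/// Stand-in for the loader's kernel type enum (hypothetical definition).
#[derive(Debug, PartialEq, Eq)]
pub enum KernelType {
    BzImage,
    Elf64,
}

/// Guesses the kernel image format from its header bytes.
/// Note: the boot-protocol magic also matches legacy zImage files; this
/// sketch does not inspect `loadflags` to tell them apart.
pub fn detect_kernel_type(image: &[u8]) -> Option<KernelType> {
    // ELF: 0x7f 'E' 'L' 'F', followed by EI_CLASS == 2 (ELFCLASS64).
    if image.len() >= 5 && image.starts_with(&[0x7f, b'E', b'L', b'F']) && image[4] == 2 {
        return Some(KernelType::Elf64);
    }
    // Linux boot protocol: 0xAA55 (little-endian) at 0x1FE, "HdrS" at 0x202.
    if image.len() >= 0x206
        && image[0x1fe..0x200] == [0x55, 0xaa]
        && image[0x202..0x206] == *b"HdrS"
    {
        return Some(KernelType::BzImage);
    }
    None
}
```

Returning `Option` keeps unknown formats explicit, so the caller can reject them instead of falling through to the ELF path.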

⚠️ Alternative: Pre-initialize Kernel's Page Tables

If vmlinux support is required, the VMM could pre-populate the kernel's early_dynamic_pgts:

// Find early_dynamic_pgts symbol in vmlinux ELF
// Pre-populate with identity mapping entries
// Set next_early_pgt to indicate tables are ready

Risks:

  • Kernel version dependent
  • Symbol locations change
  • Fragile and hard to maintain

⚠️ Alternative: Use Different Entry Point

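The PVH entry point (if the kernel was built with CONFIG_PVH) has different expectations: it is entered in 32-bit protected mode with paging disabled, so the kernel builds all of its page tables itself:

// Look for the Xen ELF note XEN_ELFNOTE_PHYS32_ENTRY (type 18) in the vmlinux ELF
// Enter the kernel at the 32-bit physical address stored in that note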
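
PVH-capable kernels advertise their entry point through an ELF note with owner name "Xen" and type 18 (XEN_ELFNOTE_PHYS32_ENTRY). The following is a minimal sketch of scanning the raw bytes of a note segment for it. `find_pvh_entry` is a hypothetical helper, it assumes little-endian notes with 4-byte alignment, and locating the PT_NOTE segment inside the ELF file is left out.

```rust
const XEN_ELFNOTE_PHYS32_ENTRY: u32 = 18;

fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Scans the contents of an ELF note segment and returns the PVH entry
/// address if a matching "Xen" note is present.
pub fn find_pvh_entry(notes: &[u8]) -> Option<u64> {
    let mut off = 0usize;
    while off + 12 <= notes.len() {
        // Note header: namesz, descsz, type (each a 32-bit word).
        let namesz = read_u32(notes, off)? as usize;
        let descsz = read_u32(notes, off + 4)? as usize;
        let ntype = read_u32(notes, off + 8)?;
        let name_start = off + 12;
        let desc_start = name_start.checked_add(align4(namesz))?;
        let name = notes.get(name_start..name_start.checked_add(namesz)?)?;
        let desc = notes.get(desc_start..desc_start.checked_add(descsz)?)?;
        if name == b"Xen\0" && ntype == XEN_ELFNOTE_PHYS32_ENTRY {
            // The descriptor holds the entry address as 4 or 8 bytes.
            if desc.len() != 4 && desc.len() != 8 {
                return None;
            }
            let mut raw = [0u8; 8];
            raw[..desc.len()].copy_from_slice(desc);
            return Some(u64::from_le_bytes(raw));
        }
        off = desc_start.checked_add(align4(descsz))?;
    }
    None
}
```

A `None` result means the kernel was not built with PVH support, in which case the loader would have to fall back to another boot path.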

6. Verification Checklist

  • Root cause identified: Kernel's __startup_64 builds minimal page tables
  • Why bzImage works: Decompressor provides full identity mapping
  • CR3 switch behavior confirmed from kernel source
  • Low memory unmapped after switch confirmed
  • Test with bzImage format
  • Document bzImage requirement in Volt

7. Implementation Recommendation

Short-term Fix

Update Volt to warn whenever a vmlinux ELF is loaded directly and to recommend bzImage:

// In loader.rs
fn load_elf64(...) -> Result<...> {
    tracing::warn!(
        "Loading vmlinux ELF directly may fail due to kernel page table setup. \
         Consider using bzImage format for reliable boot."
    );
    // ... existing code ...
}

Long-term Solution

  1. Default to bzImage for production use
  2. Document the limitation in user-facing docs
  3. Investigate PVH entry for vmlinux if truly needed

8. Files Referenced

Linux Kernel Source (v6.6)

  • arch/x86/kernel/head_64.S - Entry point, CR3 switch
  • arch/x86/kernel/head64.c - __startup_64() page table setup
  • arch/x86/boot/compressed/head_64.S - Decompressor with full identity mapping

Volt Source

  • vmm/src/boot/loader.rs - Kernel loading (ELF/bzImage)
  • vmm/src/boot/pagetable.rs - VMM page table setup
  • vmm/src/boot/mod.rs - Boot orchestration

9. Code Changes Made

Warning Added to loader.rs

/// Load ELF64 kernel (vmlinux)
///
/// # Warning: vmlinux Direct Boot Limitations
///
/// Loading vmlinux ELF directly has a fundamental limitation...
fn load_elf64<M: GuestMemory>(...) -> Result<KernelLoadResult> {
    tracing::warn!(
        "Loading vmlinux ELF directly. This may fail due to kernel page table setup..."
    );
    // ... rest of function
}

10. Future Work

If vmlinux Support is Essential

To properly support vmlinux direct boot, one of these approaches would be needed:

  1. Pre-initialize kernel's early_top_pgt

    • Parse vmlinux ELF to find early_top_pgt and early_dynamic_pgts symbols
    • Pre-populate with full identity mapping
    • Set next_early_pgt to indicate tables are ready
  2. Use PVH Entry Point

    • Check for the Xen ELF note XEN_ELFNOTE_PHYS32_ENTRY (type 18) in the ELF
    • Use the PVH entry, which starts with paging disabled and lets the kernel build its own page tables
  3. Patch Kernel Entry

    • Skip the CR3 switch in startup_64
    • Highly invasive and version-specific

Always use bzImage for Volt:

  • Fast extraction (<10ms)
  • Handles all edge cases correctly
  • Widely supported boot format (for example, QEMU's -kernel option and Cloud Hypervisor both accept bzImage)

11. Summary

The core issue: Linux kernel's startup_64 assumes the bootloader (decompressor) has set up page tables that remain valid. When vmlinux is loaded directly, the VMM's page tables are replaced, not augmented.

The fix: Use bzImage format, which includes the decompressor that properly handles page table setup for the kernel's expectations.

Changes made:

  • Added warning to load_elf64() in loader.rs
  • Created this analysis document