# Linux Kernel Page Table Analysis: Why vmlinux Direct Boot Fails **Date**: 2025-03-07 **Status**: 🔴 **ROOT CAUSE IDENTIFIED** **Issue**: CR2=0x0 fault after kernel switches to its own page tables ## Executive Summary The crash occurs because Linux's `__startup_64()` function **builds its own page tables** that only map the kernel text region, **abandoning the VMM-provided page tables**. After the CR3 switch, low memory (including address 0 and boot_params at 0x20000) is no longer mapped. | Stage | Page Tables Used | Low Memory Mapped? | |-------|-----------------|-------------------| | VMM Setup | Volt's @ 0x1000 | ✅ Yes (identity mapped 0-4GB) | | kernel startup_64 entry | Volt's @ 0x1000 | ✅ Yes | | After __startup_64 + CR3 switch | Kernel's early_top_pgt | ❌ **NO** | --- ## 1. Root Cause Analysis ### The Problem Flow ``` 1. Volt creates page tables at 0x1000 - Identity maps 0-4GB (including address 0) - Maps kernel high-half (0xffffffff80000000+) 2. Volt enters kernel at startup_64 - Kernel uses Volt's tables initially - Sets up GS_BASE, calls startup_64_setup_env() 3. Kernel calls __startup_64() - Builds NEW page tables in early_top_pgt (kernel BSS) - Creates identity mapping for KERNEL TEXT ONLY - Does NOT map low memory (0-16MB except kernel) 4. CR3 switches to early_top_pgt - Volt's page tables ABANDONED - Low memory NO LONGER MAPPED 5. 💥 Any access to low memory causes #PF with CR2=address ``` ### The Kernel's Page Table Setup (head64.c) ```c unsigned long __head __startup_64(unsigned long physaddr, struct boot_params *bp) { // ... setup code ... // ONLY maps kernel text region: for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) { int idx = i + (physaddr >> PMD_SHIFT); pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE; } // Low memory (0x0 - 0x1000000) is NOT mapped! } ``` ### What Gets Mapped in Kernel's Page Tables | Memory Region | Mapped? | Purpose | |---------------|---------|---------| | 0x0 - 0xFFFFF (0-1MB) | ❌ No | Boot structures | | 0x100000 - 0xFFFFFF (1-16MB) | ❌ No | Below kernel | | 0x1000000 - kernel_end | ✅ Yes | Kernel text/data | | 0xffffffff80000000+ | ✅ Yes | Kernel virtual | | 0xffff888000000000+ (__PAGE_OFFSET) | ❌ No* | Direct physical map | *The __PAGE_OFFSET mapping is created lazily via early page fault handler --- ## 2. Why bzImage Works The compressed kernel (bzImage) includes a **decompressor** at `arch/x86/boot/compressed/head_64.S` that: 1. **Creates full identity mapping** for ALL memory (0-4GB): ```asm /* Build Level 2 - maps 4GB with 2MB pages */ movl $0x00000183, %eax /* Present + RW + PS (2MB page) */ movl $2048, %ecx /* 2048 entries × 2MB = 4GB */ ``` 2. **Decompresses kernel** to 0x1000000 3. **Jumps to decompressed kernel** with decompressor's tables still in CR3 4. When startup_64 builds new tables, the **decompressor's mappings are inherited** ### bzImage vs vmlinux Boot Comparison | Aspect | bzImage | vmlinux | |--------|---------|---------| | Decompressor | ✅ Yes (sets up 4GB identity map) | ❌ No | | Initial page tables | Decompressor's (full coverage) | VMM's (then abandoned) | | Low memory after startup | ✅ Mapped | ❌ **NOT mapped** | | Boot_params accessible | ✅ Yes | ❌ **NO** | --- ## 3. Technical Details ### Entry Point Analysis For vmlinux ELF: - `e_entry` = virtual address (e.g., 0xffffffff81000000) - Corresponds to `startup_64` symbol in head_64.S Volt correctly: 1. Loads kernel to physical 0x1000000 2. Maps virtual 0xffffffff81000000 → physical 0x1000000 3. Enters at e_entry (virtual address) ### The CR3 Switch (head_64.S) ```asm /* Call __startup_64 which returns SME mask */ leaq _text(%rip), %rdi movq %r15, %rsi call __startup_64 /* Form CR3 value with early_top_pgt */ addq $(early_top_pgt - __START_KERNEL_map), %rax /* Switch to kernel's page tables - VMM's tables abandoned! */ movq %rax, %cr3 ``` ### Kernel's early_top_pgt Layout ``` early_top_pgt (in kernel .data): [0-273] = 0 (unmapped - includes identity region) [274-510] = 0 (unmapped - includes __PAGE_OFFSET region) [511] = level3_kernel_pgt | flags (kernel mapping) ``` Only PGD[511] is populated, mapping 0xffffffff80000000-0xffffffffffffffff. --- ## 4. The Crash Sequence 1. **VMM**: Sets CR3=0x1000 (Volt's tables), RIP=0xffffffff81000000 2. **Kernel startup_64**: - Sets up GS_BASE (wrmsr) ✅ - Calls startup_64_setup_env() (loads GDT, IDT) ✅ - Calls __startup_64() - builds new tables ✅ 3. **CR3 Switch**: CR3 = early_top_pgt address 4. **Crash**: Something accesses low memory - Could be stack canary check via %gs - Could be boot_params access - Could be early exception handler **Crash location**: RIP=0xffffffff81000084, CR2=0x0 --- ## 5. Solutions ### ✅ Recommended: Use bzImage Instead of vmlinux The compressed kernel format handles all early setup correctly: ```rust // In loader.rs - detect bzImage and use appropriate entry pub fn load(...) -> Result { match kernel_type { KernelType::BzImage => Self::load_bzimage(&kernel_data, ...), KernelType::Elf64 => { // Warning: vmlinux direct boot has page table issues // Consider using bzImage instead Self::load_elf64(&kernel_data, ...) } } } ``` **Why bzImage works:** - Includes decompressor stub - Decompressor sets up proper 4GB identity mapping - Kernel inherits good mappings ### ⚠️ Alternative: Pre-initialize Kernel's Page Tables If vmlinux support is required, the VMM could pre-populate the kernel's `early_dynamic_pgts`: ```rust // Find early_dynamic_pgts symbol in vmlinux ELF // Pre-populate with identity mapping entries // Set next_early_pgt to indicate tables are ready ``` **Risks:** - Kernel version dependent - Symbol locations change - Fragile and hard to maintain ### ⚠️ Alternative: Use Different Entry Point PVH entry (if kernel supports it) might have different expectations: ```rust // Look for .note.xen.pvh section in ELF // Use PVH entry point which may preserve VMM tables ``` --- ## 6. Verification Checklist - [x] Root cause identified: Kernel's __startup_64 builds minimal page tables - [x] Why bzImage works: Decompressor provides full identity mapping - [x] CR3 switch behavior confirmed from kernel source - [x] Low memory unmapped after switch confirmed - [ ] Test with bzImage format - [ ] Document bzImage requirement in Volt --- ## 7. Implementation Recommendation ### Short-term Fix Update Volt to **require bzImage format**: ```rust // In loader.rs fn load_elf64(...) -> Result<...> { tracing::warn!( "Loading vmlinux ELF directly may fail due to kernel page table setup. \ Consider using bzImage format for reliable boot." ); // ... existing code ... } ``` ### Long-term Solution 1. **Default to bzImage** for production use 2. **Document the limitation** in user-facing docs 3. **Investigate PVH entry** for vmlinux if truly needed --- ## 8. Files Referenced ### Linux Kernel Source (v6.6) - `arch/x86/kernel/head_64.S` - Entry point, CR3 switch - `arch/x86/kernel/head64.c` - `__startup_64()` page table setup - `arch/x86/boot/compressed/head_64.S` - Decompressor with full identity mapping ### Volt Source - `vmm/src/boot/loader.rs` - Kernel loading (ELF/bzImage) - `vmm/src/boot/pagetable.rs` - VMM page table setup - `vmm/src/boot/mod.rs` - Boot orchestration --- ## 9. Code Changes Made ### Warning Added to loader.rs ```rust /// Load ELF64 kernel (vmlinux) /// /// # Warning: vmlinux Direct Boot Limitations /// /// Loading vmlinux ELF directly has a fundamental limitation... fn load_elf64(...) -> Result { tracing::warn!( "Loading vmlinux ELF directly. This may fail due to kernel page table setup..." ); // ... rest of function } ``` --- ## 10. Future Work ### If vmlinux Support is Essential To properly support vmlinux direct boot, one of these approaches would be needed: 1. **Pre-initialize kernel's early_top_pgt** - Parse vmlinux ELF to find `early_top_pgt` and `early_dynamic_pgts` symbols - Pre-populate with full identity mapping - Set `next_early_pgt` to indicate tables are ready 2. **Use PVH Entry Point** - Check for `.note.xen.pvhabi` section in ELF - Use PVH entry which may have different page table expectations 3. **Patch Kernel Entry** - Skip the CR3 switch in startup_64 - Highly invasive and version-specific ### Recommended Approach for Production Always use **bzImage** for Volt: - Fast extraction (<10ms) - Handles all edge cases correctly - Standard approach used by QEMU, Firecracker, Cloud Hypervisor --- ## 11. Summary **The core issue**: Linux kernel's startup_64 assumes the bootloader (decompressor) has set up page tables that remain valid. When vmlinux is loaded directly, the VMM's page tables are **replaced, not augmented**. **The fix**: Use bzImage format, which includes the decompressor that properly handles page table setup for the kernel's expectations. **Changes made**: - Added warning to `load_elf64()` in loader.rs - Created this analysis document