Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:

- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
docs/kernel-pagetable-analysis.md
# Linux Kernel Page Table Analysis: Why vmlinux Direct Boot Fails

**Date**: 2025-03-07
**Status**: 🔴 **ROOT CAUSE IDENTIFIED**
**Issue**: CR2=0x0 fault after kernel switches to its own page tables

## Executive Summary

The crash occurs because Linux's `__startup_64()` function **builds its own page tables** that only map the kernel text region, **abandoning the VMM-provided page tables**. After the CR3 switch, low memory (including address 0 and boot_params at 0x20000) is no longer mapped.

| Stage | Page Tables Used | Low Memory Mapped? |
|-------|-----------------|-------------------|
| VMM Setup | Volt's @ 0x1000 | ✅ Yes (identity mapped 0-4GB) |
| Kernel `startup_64` entry | Volt's @ 0x1000 | ✅ Yes |
| After `__startup_64` + CR3 switch | Kernel's `early_top_pgt` | ❌ **NO** |


---

## 1. Root Cause Analysis

### The Problem Flow

```
1. Volt creates page tables at 0x1000
   - Identity maps 0-4GB (including address 0)
   - Maps kernel high-half (0xffffffff80000000+)

2. Volt enters kernel at startup_64
   - Kernel uses Volt's tables initially
   - Sets up GS_BASE, calls startup_64_setup_env()

3. Kernel calls __startup_64()
   - Builds NEW page tables in early_top_pgt (kernel data)
   - Creates identity mapping for KERNEL TEXT ONLY
   - Does NOT map low memory below the kernel (0-16MB)

4. CR3 switches to early_top_pgt
   - Volt's page tables ABANDONED
   - Low memory NO LONGER MAPPED

5. 💥 Any access to low memory causes #PF with CR2 = faulting address
```

### The Kernel's Page Table Setup (head64.c)

```c
unsigned long __head __startup_64(unsigned long physaddr, struct boot_params *bp)
{
    // ... setup code ...

    // ONLY maps kernel text region:
    for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
        int idx = i + (physaddr >> PMD_SHIFT);
        pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
    }

    // Low memory (0x0 - 0x1000000) is NOT mapped!
}
```

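To make the coverage of that loop concrete, the following is a minimal Rust sketch that recomputes which 2 MiB PMD slots the loop fills. The helper names (`identity_pmd_slots`, `is_identity_mapped`) are hypothetical and exist only for illustration; the load address 0x1000000 comes from this document, while the 40 MiB image size used in the usage note is an assumed example value.

```rust
/// Size of one PMD-level (2 MiB) mapping on x86-64.
const PMD_SHIFT: u32 = 21;
const PMD_SIZE: u64 = 1 << PMD_SHIFT;
/// Number of entries in one page-table page.
const PTRS_PER_PMD: u64 = 512;

/// PMD slots filled by the loop for an image of `image_size` bytes loaded
/// at `physaddr`. Hypothetical helper, not kernel or Volt code.
fn identity_pmd_slots(physaddr: u64, image_size: u64) -> Vec<u64> {
    // Equivalent of DIV_ROUND_UP(_end - _text, PMD_SIZE).
    let count = (image_size + PMD_SIZE - 1) / PMD_SIZE;
    (0..count)
        .map(|i| (i + (physaddr >> PMD_SHIFT)) % PTRS_PER_PMD)
        .collect()
}

/// True if the 2 MiB page containing `addr` is one of the filled slots.
/// Only meaningful for addresses in the same 1 GiB region as the kernel.
fn is_identity_mapped(addr: u64, physaddr: u64, image_size: u64) -> bool {
    let slot = (addr >> PMD_SHIFT) % PTRS_PER_PMD;
    identity_pmd_slots(physaddr, image_size).contains(&slot)
}
```

With the kernel at 0x1000000 and an assumed 40 MiB image, the filled slots are 8 through 27, so `is_identity_mapped(0x20000, 0x100_0000, 40 << 20)` is false: the 2 MiB page holding boot_params sits in slot 0, which the loop never touches.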
### What Gets Mapped in Kernel's Page Tables

| Memory Region | Mapped? | Purpose |
|---------------|---------|---------|
| 0x0 - 0xFFFFF (0-1MB) | ❌ No | Boot structures |
| 0x100000 - 0xFFFFFF (1-16MB) | ❌ No | Below kernel |
| 0x1000000 - kernel_end | ✅ Yes | Kernel text/data |
| 0xffffffff80000000+ | ✅ Yes | Kernel virtual |
| 0xffff888000000000+ (`__PAGE_OFFSET`) | ❌ No\* | Direct physical map |

\* The `__PAGE_OFFSET` mapping is created lazily by the early page fault handler.


---

## 2. Why bzImage Works

The compressed kernel (bzImage) includes a **decompressor** at `arch/x86/boot/compressed/head_64.S` that:

1. **Creates a full identity mapping** for ALL memory (0-4GB):
   ```asm
   /* Build Level 2 - maps 4GB with 2MB pages */
   movl $0x00000183, %eax  /* Present + RW + PS (2MB page) + Global */
   movl $2048, %ecx        /* 2048 entries × 2MB = 4GB */
   ```

2. **Decompresses kernel** to 0x1000000

3. **Jumps to decompressed kernel** with decompressor's tables still in CR3

4. When `startup_64` builds new tables, the **decompressor's mappings are inherited**

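The arithmetic behind the decompressor's identity map in step 1 can be sketched in Rust as follows. This is an illustrative model, not the decompressor's code: `build_identity_pmds` and `covered_bytes` are hypothetical names, and only the flag value 0x183 and the entry count 2048 are taken from the snippet above.

```rust
/// Flags from the decompressor snippet: Present | RW | PS | Global.
const PMD_FLAGS: u64 = 0x183;
/// Each level-2 entry with PS set maps one 2 MiB page.
const PMD_SIZE: u64 = 2 * 1024 * 1024;
/// Number of level-2 entries the decompressor writes.
const ENTRY_COUNT: u64 = 2048;

/// Build the level-2 entries of a 4 GiB identity map: entry `i` points at
/// physical address `i * 2 MiB`. Hypothetical helper for illustration.
fn build_identity_pmds() -> Vec<u64> {
    (0..ENTRY_COUNT).map(|i| (i * PMD_SIZE) | PMD_FLAGS).collect()
}

/// Total number of bytes covered by a set of 2 MiB entries.
fn covered_bytes(entries: &[u64]) -> u64 {
    entries.len() as u64 * PMD_SIZE
}
```

Entry 0 is 0x183, which maps physical 0x0-0x1FFFFF and therefore includes both address 0 and boot_params at 0x20000; 2048 such entries cover exactly 4 GiB.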
### bzImage vs vmlinux Boot Comparison

| Aspect | bzImage | vmlinux |
|--------|---------|---------|
| Decompressor | ✅ Yes (sets up 4GB identity map) | ❌ No |
| Initial page tables | Decompressor's (full coverage) | VMM's (then abandoned) |
| Low memory after startup | ✅ Mapped | ❌ **NOT mapped** |
| Boot_params accessible | ✅ Yes | ❌ **NO** |


---

## 3. Technical Details

### Entry Point Analysis

For vmlinux ELF:
- `e_entry` = virtual address (e.g., 0xffffffff81000000)
- Corresponds to `startup_64` symbol in head_64.S

Volt correctly:
1. Loads kernel to physical 0x1000000
2. Maps virtual 0xffffffff81000000 → physical 0x1000000
3. Enters at `e_entry` (virtual address)

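The virtual-to-physical relationship in step 2 is a fixed offset from the kernel's high-half base. A minimal sketch, assuming the kernel is not relocated (no KASLR) and using hypothetical helper names:

```rust
/// Base of the kernel's high-half mapping on x86-64 (__START_KERNEL_map).
const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;

/// Translate a kernel-image virtual address to its physical address.
/// Returns None for addresses below the high-half base.
/// Hypothetical helper; assumes an unrelocated kernel.
fn kernel_virt_to_phys(vaddr: u64) -> Option<u64> {
    vaddr.checked_sub(START_KERNEL_MAP)
}

/// Translate a physical address inside the kernel image to its
/// high-half virtual address. Returns None on overflow.
fn kernel_phys_to_virt(paddr: u64) -> Option<u64> {
    START_KERNEL_MAP.checked_add(paddr)
}
```

For example, `kernel_virt_to_phys(0xffff_ffff_8100_0000)` yields `Some(0x100_0000)`, matching the load address above, and the faulting RIP 0xffffffff81000084 reported later in this document corresponds to physical 0x1000084, which is 0x84 bytes into the loaded image.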
### The CR3 Switch (head_64.S)

```asm
/* Call __startup_64 which returns SME mask */
leaq    _text(%rip), %rdi
movq    %r15, %rsi
call    __startup_64

/* Form CR3 value with early_top_pgt */
addq    $(early_top_pgt - __START_KERNEL_map), %rax

/* Switch to kernel's page tables - VMM's tables abandoned! */
movq    %rax, %cr3
```

### Kernel's early_top_pgt Layout

```
early_top_pgt (in kernel .data):
  [0-272]   = 0                          (unmapped - includes identity region)
  [273-510] = 0                          (unmapped - includes __PAGE_OFFSET region)
  [511]     = level3_kernel_pgt | flags  (kernel mapping)
```

In this static layout only PGD[511] is populated. That entry covers the top 512 GiB of the address space; within it, `level3_kernel_pgt` maps the kernel at 0xffffffff80000000-0xffffffffffffffff. `__PAGE_OFFSET` (0xffff888000000000) falls in PGD[273], and the identity entries for the kernel image are added at runtime by `__startup_64()`.

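The slot numbers in that layout follow from bits 47:39 of the virtual address. The sketch below, which assumes 4-level paging and uses hypothetical helper names, computes the top-level index and checks only the top-level present bit against the static layout; it does not model the lower levels of the walk.

```rust
/// With 4-level paging, bits 47:39 of a virtual address select the PGD slot.
const PGDIR_SHIFT: u32 = 39;
const PTRS_PER_PGD: u64 = 512;

/// Top-level page-table index for `vaddr` (4-level paging assumed).
fn pgd_index(vaddr: u64) -> usize {
    ((vaddr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)) as usize
}

/// Model of the static early_top_pgt: only slot 511 is non-zero.
/// The flag value here is just the present bit; the real flags differ.
fn static_early_top_pgt(level3_kernel_pgt_phys: u64) -> [u64; 512] {
    let mut pgd = [0u64; 512];
    pgd[511] = level3_kernel_pgt_phys | 0x1;
    pgd
}

/// True if the top-level entry for `vaddr` has its present bit set.
/// A full walk would also have to check the lower levels.
fn top_level_present(pgd: &[u64; 512], vaddr: u64) -> bool {
    (pgd[pgd_index(vaddr)] & 0x1) != 0
}
```

Under these assumptions, addresses 0x0 and 0x20000 both select slot 0, `__PAGE_OFFSET` selects slot 273, and kernel text at 0xffffffff81000000 selects slot 511, the only slot that is present in the static table.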

---

## 4. The Crash Sequence

1. **VMM**: Sets CR3=0x1000 (Volt's tables), RIP=0xffffffff81000000

2. **Kernel startup_64**:
   - Sets up GS_BASE (wrmsr) ✅
   - Calls `startup_64_setup_env()` (loads GDT, IDT) ✅
   - Calls `__startup_64()` - builds new tables ✅

3. **CR3 Switch**: CR3 = `early_top_pgt` address

4. **Crash**: Something accesses low memory
   - Could be stack canary check via %gs
   - Could be boot_params access
   - Could be early exception handler

**Crash location**: RIP=0xffffffff81000084, CR2=0x0


---

## 5. Solutions

### ✅ Recommended: Use bzImage Instead of vmlinux

The compressed kernel format handles all early setup correctly:

```rust
// In loader.rs - detect bzImage and use appropriate entry
pub fn load(...) -> Result<KernelLoadResult> {
    match kernel_type {
        KernelType::BzImage => Self::load_bzimage(&kernel_data, ...),
        KernelType::Elf64 => {
            // Warning: vmlinux direct boot has page table issues
            // Consider using bzImage instead
            Self::load_elf64(&kernel_data, ...)
        }
    }
}
```

**Why bzImage works:**
- Includes decompressor stub
- Decompressor sets up proper 4GB identity mapping
- Kernel inherits good mappings

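How the loader might tell the two formats apart is not shown in this document. The following is a hedged sketch of one way to do it, based on the ELF identification bytes and the Linux x86 boot protocol header (boot flag 0xAA55 at offset 0x1fe, `HdrS` signature at offset 0x202). The `detect_kernel_type` function and the `Unknown` variant are assumptions for illustration and may not match Volt's actual `KernelType`.

```rust
/// Kernel image formats distinguished by this sketch. The `Unknown`
/// variant is an assumption; Volt's real enum may differ.
#[derive(Debug, PartialEq, Eq)]
enum KernelType {
    BzImage,
    Elf64,
    Unknown,
}

/// Classify a kernel image by its magic bytes (hypothetical helper).
fn detect_kernel_type(data: &[u8]) -> KernelType {
    // ELF: magic 0x7f 'E' 'L' 'F', then EI_CLASS (byte 4) == 2 for 64-bit.
    if data.starts_with(&[0x7f, b'E', b'L', b'F']) && data.get(4) == Some(&2) {
        return KernelType::Elf64;
    }
    // Linux x86 boot protocol: boot_flag 0xAA55 (little-endian) at 0x1fe
    // and the "HdrS" signature at 0x202. This does not distinguish an old
    // zImage from a bzImage; both carry the same header.
    let boot_flag_ok = data.get(0x1fe..0x200) == Some(&[0x55, 0xaa][..]);
    let signature_ok = data.get(0x202..0x206) == Some(&b"HdrS"[..]);
    if boot_flag_ok && signature_ok {
        return KernelType::BzImage;
    }
    KernelType::Unknown
}
```

Using `get` with ranges keeps the check panic-free on truncated inputs, so an empty or very short file simply classifies as `Unknown`.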
### ⚠️ Alternative: Pre-initialize Kernel's Page Tables

If vmlinux support is required, the VMM could pre-populate the kernel's `early_dynamic_pgts`:

```rust
// Find early_dynamic_pgts symbol in vmlinux ELF
// Pre-populate with identity mapping entries
// Set next_early_pgt to indicate tables are ready
```

**Risks:**
- Kernel version dependent
- Symbol locations change
- Fragile and hard to maintain

### ⚠️ Alternative: Use Different Entry Point

PVH entry (if kernel supports it) might have different expectations:

```rust
// Look for the Xen XEN_ELFNOTE_PHYS32_ENTRY ELF note (PVH entry point)
// Use the PVH entry point, which may have different page table expectations
```

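As a starting point for that investigation, here is a minimal sketch of scanning ELF note data for the PVH entry note. It assumes the raw bytes of a `PT_NOTE` segment have already been extracted, that notes use 4-byte alignment, and that the entry is advertised under the name `Xen` with type 18 (`XEN_ELFNOTE_PHYS32_ENTRY`). The functions `find_pvh_entry` and `encode_note` are hypothetical; `encode_note` exists only to build sample data.

```rust
use std::convert::TryInto;

/// Note type that advertises the PVH 32-bit entry point.
const XEN_ELFNOTE_PHYS32_ENTRY: u32 = 18;

/// Round `n` up to the next multiple of 4 (note field alignment).
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Read a little-endian u32 at `off`, or None if out of bounds.
fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let bytes = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

/// Scan raw PT_NOTE bytes for the PVH entry note and return its address.
/// Accepts a 4-byte or 8-byte descriptor. Hypothetical helper.
fn find_pvh_entry(notes: &[u8]) -> Option<u64> {
    let mut off = 0usize;
    while off + 12 <= notes.len() {
        // Note header: namesz, descsz, type (three little-endian u32s).
        let namesz = read_u32(notes, off)? as usize;
        let descsz = read_u32(notes, off + 4)? as usize;
        let ntype = read_u32(notes, off + 8)?;
        let name_off = off + 12;
        let desc_off = name_off.checked_add(align4(namesz))?;
        let name = notes.get(name_off..name_off.checked_add(namesz)?)?;
        let desc = notes.get(desc_off..desc_off.checked_add(descsz)?)?;
        if name == b"Xen\0" && ntype == XEN_ELFNOTE_PHYS32_ENTRY {
            return match desc.len() {
                4 => Some(u32::from_le_bytes(desc.try_into().ok()?) as u64),
                8 => Some(u64::from_le_bytes(desc.try_into().ok()?)),
                _ => None,
            };
        }
        off = desc_off.checked_add(align4(descsz))?;
    }
    None
}

/// Encode one note with 4-byte padding; used only to build sample input.
fn encode_note(name: &[u8], ntype: u32, desc: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(name.len() as u32).to_le_bytes());
    out.extend_from_slice(&(desc.len() as u32).to_le_bytes());
    out.extend_from_slice(&ntype.to_le_bytes());
    out.extend_from_slice(name);
    out.resize(align4(out.len()), 0);
    out.extend_from_slice(desc);
    out.resize(align4(out.len()), 0);
    out
}
```

A `None` result means the image carries no PVH note, in which case this alternative is not available for that kernel build.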

---

## 6. Verification Checklist

- [x] Root cause identified: Kernel's `__startup_64` builds minimal page tables
- [x] Why bzImage works: Decompressor provides full identity mapping
- [x] CR3 switch behavior confirmed from kernel source
- [x] Low memory unmapped after switch confirmed
- [ ] Test with bzImage format
- [ ] Document bzImage requirement in Volt


---

## 7. Implementation Recommendation

### Short-term Fix

Update Volt to **warn when a vmlinux ELF is loaded** and point users to bzImage:

```rust
// In loader.rs
fn load_elf64(...) -> Result<...> {
    tracing::warn!(
        "Loading vmlinux ELF directly may fail due to kernel page table setup. \
         Consider using bzImage format for reliable boot."
    );
    // ... existing code ...
}
```

### Long-term Solution

1. **Default to bzImage** for production use
2. **Document the limitation** in user-facing docs
3. **Investigate PVH entry** for vmlinux if truly needed


---

## 8. Files Referenced

### Linux Kernel Source (v6.6)
- `arch/x86/kernel/head_64.S` - Entry point, CR3 switch
- `arch/x86/kernel/head64.c` - `__startup_64()` page table setup
- `arch/x86/boot/compressed/head_64.S` - Decompressor with full identity mapping

### Volt Source
- `vmm/src/boot/loader.rs` - Kernel loading (ELF/bzImage)
- `vmm/src/boot/pagetable.rs` - VMM page table setup
- `vmm/src/boot/mod.rs` - Boot orchestration


---

## 9. Code Changes Made

### Warning Added to loader.rs

```rust
/// Load ELF64 kernel (vmlinux)
///
/// # Warning: vmlinux Direct Boot Limitations
///
/// Loading vmlinux ELF directly has a fundamental limitation...
fn load_elf64<M: GuestMemory>(...) -> Result<KernelLoadResult> {
    tracing::warn!(
        "Loading vmlinux ELF directly. This may fail due to kernel page table setup..."
    );
    // ... rest of function
}
```


---

## 10. Future Work

### If vmlinux Support is Essential

To properly support vmlinux direct boot, one of these approaches would be needed:

1. **Pre-initialize kernel's early_top_pgt**
   - Parse vmlinux ELF to find `early_top_pgt` and `early_dynamic_pgts` symbols
   - Pre-populate with full identity mapping
   - Set `next_early_pgt` to indicate tables are ready

2. **Use PVH Entry Point**
   - Check for the Xen `XEN_ELFNOTE_PHYS32_ENTRY` note in the ELF
   - Use PVH entry which may have different page table expectations

3. **Patch Kernel Entry**
   - Skip the CR3 switch in `startup_64`
   - Highly invasive and version-specific

### Recommended Approach for Production

Always use **bzImage** for Volt:
- Fast extraction (<10ms)
- Handles all edge cases correctly
- Accepted by QEMU and Cloud Hypervisor (Firecracker, by contrast, boots an uncompressed ELF kernel on x86_64)


---

## 11. Summary

**The core issue**: Linux kernel's `startup_64` assumes the bootloader (decompressor) has set up page tables that remain valid. When vmlinux is loaded directly, the VMM's page tables are **replaced, not augmented**.

**The fix**: Use bzImage format, which includes the decompressor that properly handles page table setup for the kernel's expectations.

**Changes made**:
- Added warning to `load_elf64()` in loader.rs
- Created this analysis document