volt-vmm/docs/stardust-white-paper.md
Karl Clinger 40ed108dd5 Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
2026-03-21 01:04:35 -05:00


Stardust: Sub-Millisecond VM Restore

A Technical White Paper on Next-Generation MicroVM Technology

ArmoredGate, Inc.
Version 1.0 | June 2025


Executive Summary

The serverless computing revolution promised infinite scale and zero operational overhead. It delivered on both—except for one persistent problem: cold starts. When a function hasn't run recently, spinning up a new execution environment takes hundreds of milliseconds, sometimes seconds. For latency-sensitive applications, this is unacceptable.

Stardust changes the equation.

Stardust is ArmoredGate's high-performance microVM manager (VMM), built from the ground up in Rust to achieve what was previously considered impossible: sub-millisecond virtual machine restoration. By combining demand-paged memory with pre-warmed VM pools and content-addressed storage, Stardust delivers:

  • 0.551ms snapshot restore with in-memory CAS and VM pooling—185x faster than Firecracker
  • 1.04ms disk-based snapshot restore with VM pooling—98x faster than Firecracker
  • 1.92x faster cold boot times
  • 33% lower memory footprint per VM

These aren't incremental improvements. They represent a fundamental shift in what's possible with virtualization-based isolation. For the first time, serverless platforms can offer true scale-to-zero economics without sacrificing user experience. Functions can sleep until needed, then wake in under a millisecond—faster than most network round trips.

At approximately 24,000 lines of Rust compiled into a 3.9 MB binary, Stardust embodies its namesake: the dense remnant of a collapsed star, packing extraordinary capability into a minimal footprint.


Introduction

Why MicroVMs Matter

Modern cloud infrastructure faces a fundamental tension between isolation and efficiency. Traditional virtual machines provide strong security boundaries but consume significant resources and take seconds to boot. Containers offer lightweight execution but share a kernel with the host, creating a larger attack surface.

MicroVMs occupy the sweet spot: purpose-built virtual machines that boot in milliseconds while maintaining hardware-level isolation. Each workload runs in its own kernel, with its own virtual devices, completely separated from other tenants. There's no shared kernel to exploit, no container escape to attempt.

For multi-tenant platforms—serverless functions, edge computing, secure enclaves—this combination of speed and isolation is essential. The question has always been: how fast can we make it?

The Cold Start Problem

Serverless architectures introduced a powerful abstraction: write code, deploy it, pay only when it runs. But this model creates an operational challenge known as the "cold start" problem.

When a function hasn't been invoked recently, the platform must provision a fresh execution environment. This involves:

  1. Creating a new virtual machine or container
  2. Loading the operating system and runtime
  3. Initializing the application code
  4. Processing the request

For traditional VMs, this takes seconds. For containers, hundreds of milliseconds. For microVMs, tens to hundreds of milliseconds. Each of these timescales creates user-visible latency that degrades experience.

The industry's response has been to keep execution environments "warm"—running idle instances that can immediately handle requests. But warm pools come with costs:

  • Memory overhead: Idle VMs consume RAM that could serve active workloads
  • Economic waste: Paying for compute that isn't doing useful work
  • Scaling complexity: Predicting demand to size pools appropriately

The dream of true scale-to-zero—where resources are released when not needed and restored instantly when required—has remained elusive. Until now.

Current State of the Art

AWS Firecracker, released in 2018, established the modern microVM paradigm. It demonstrated that purpose-built VMMs could achieve boot times under 150ms while maintaining strong isolation. Firecracker powers AWS Lambda and Fargate, proving the model at scale.

But Firecracker's snapshot restore—the operation that matters for scale-to-zero—still takes approximately 100ms. While impressive compared to traditional VMs, this latency remains visible to users and limits architectural options.

Stardust builds on Firecracker's conceptual foundation while taking a fundamentally different approach to restoration. The result is a two-order-of-magnitude improvement in restore time.


Architecture

Stardust VMM Overview

Stardust is a Type-2 hypervisor built on Linux KVM, implemented in approximately 24,000 lines of Rust. The entire VMM compiles to a 3.9 MB statically-linked binary with no runtime dependencies beyond a modern Linux kernel.

The architecture prioritizes:

  • Minimal attack surface: Fewer lines of code, fewer potential vulnerabilities
  • Memory efficiency: Careful resource management for high-density deployments
  • Restore speed: Every design decision optimizes for snapshot restoration latency
  • Production readiness: Full virtio device support, SMP, and networking

Like a neutron star—where gravitational collapse creates extraordinary density—Stardust packs comprehensive VMM functionality into a minimal footprint.

KVM Integration

Stardust leverages the Linux Kernel Virtual Machine (KVM) for hardware-assisted virtualization. KVM provides:

  • Intel VT-x / AMD-V hardware virtualization
  • Extended Page Tables (EPT) for efficient memory virtualization
  • VMCS shadowing for nested virtualization scenarios
  • Direct device assignment capabilities

Stardust manages VM lifecycle through the /dev/kvm interface, handling:

  • VM creation and destruction via KVM_CREATE_VM
  • vCPU allocation and configuration via KVM_CREATE_VCPU
  • Memory region registration via KVM_SET_USER_MEMORY_REGION
  • Interrupt injection and device emulation

The SMP implementation supports multiple virtual CPUs (four or more) using Intel MPS v1.4 Multi-Processor tables, enabling multi-threaded guest workloads without the complexity of ACPI MADT, support for which is planned for a future release.

Device Model

Stardust implements virtio paravirtualized devices for optimal guest performance:

virtio-blk: Block device access for root filesystems and data volumes. Supports read-only and read-write configurations with copy-on-write overlay support for snapshot scenarios.

virtio-net: Network connectivity via multiple backend options:

  • TAP devices for simple host bridging
  • Linux bridge integration for multi-VM networking
  • macvtap for direct physical NIC access

The device model uses eventfd-based notification for efficient VM-to-host communication, minimizing exit overhead.
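The eventfd mechanism can be demonstrated outside KVM. The sketch below is plain Rust against the Linux eventfd(2) syscall, with both "sides" running in one process purely to show the notify/drain round trip that ioeventfd wiring builds on; the function name is illustrative, not a Stardust API.

```rust
use std::ffi::c_void;

extern "C" {
    fn eventfd(initval: u32, flags: i32) -> i32;
    fn write(fd: i32, buf: *const c_void, count: usize) -> isize;
    fn read(fd: i32, buf: *mut c_void, count: usize) -> isize;
    fn close(fd: i32) -> i32;
}

// One notify/ack round trip over an eventfd. In a KVM setup the guest
// side would be a queue-notify MMIO/PIO write wired up as an ioeventfd;
// here both sides run in the same process to show the mechanism.
fn notify_round_trip() -> u64 {
    let fd = unsafe { eventfd(0, 0) };
    assert!(fd >= 0);

    // "Guest" signals: add 1 to the eventfd counter.
    let one: u64 = 1;
    let n = unsafe { write(fd, &one as *const u64 as *const c_void, 8) };
    assert_eq!(n, 8);

    // "Host" device thread wakes and drains the counter in one read.
    let mut val: u64 = 0;
    let n = unsafe { read(fd, &mut val as *mut u64 as *mut c_void, 8) };
    assert_eq!(n, 8);

    unsafe { close(fd) };
    val
}

fn main() {
    assert_eq!(notify_round_trip(), 1);
}
```

Because signaling is a single 8-byte write on a kernel counter, the host thread wakes without a full VM exit round trip through device emulation.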

Memory Management: The mmap Revolution

The key to Stardust's restore performance is demand-paged memory restoration using mmap() with MAP_PRIVATE semantics.

Traditional snapshot restore loads the entire VM memory image before resuming execution:

1. Open snapshot file
2. Read entire memory image into RAM (blocking)
3. Configure VM memory regions
4. Resume VM execution

For a 512 MB VM, step 2 alone can take 50-100ms even with fast NVMe storage.

Stardust's approach eliminates the upfront load:

1. Open snapshot file
2. mmap() file with MAP_PRIVATE (near-instant)
3. Configure VM memory regions to point to mmap'd region
4. Resume VM execution
5. Pages fault in on-demand as accessed

The mmap() call returns immediately—there's no data copy. The kernel's page fault handler loads pages from the backing file only when the guest actually touches them. Pages that are never accessed are never loaded.

This lazy fault-in behavior provides several advantages:

  • Instant resume: VM execution begins immediately after mmap()
  • Working set optimization: Only active pages consume physical memory
  • Natural prioritization: Hot paths load first because they're accessed first
  • Reduced I/O: Cold data stays on disk

The MAP_PRIVATE flag ensures copy-on-write semantics: the guest can modify its memory without affecting the underlying snapshot file, and multiple VMs can share the same snapshot as a backing store.
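The demand-paged, copy-on-write behavior described above can be sketched in a few lines of Rust against raw mmap(2), assuming Linux. The file path, constants, and function name here are illustrative, not Stardust code:

```rust
use std::ffi::c_void;
use std::fs::{self, File};
use std::io::Write;
use std::os::unix::io::AsRawFd;

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: i32, flags: i32,
            fd: i32, offset: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> i32;
}

const PROT_READ: i32 = 0x1;
const PROT_WRITE: i32 = 0x2;
const MAP_PRIVATE: i32 = 0x02;

// Returns (first byte the "guest" reads, the byte after its write,
// first byte of the snapshot file afterwards).
fn demand_paged_restore() -> (u8, u8, u8) {
    let path = "/tmp/stardust-demo.snap";
    // Stand-in for a saved guest-memory image.
    File::create(path).unwrap().write_all(&[0xAA; 4096]).unwrap();

    let file = File::open(path).unwrap();
    // Near-instant: no data is copied here. Pages fault in from the
    // file only when touched, and MAP_PRIVATE gives copy-on-write.
    let mem = unsafe {
        mmap(std::ptr::null_mut(), 4096, PROT_READ | PROT_WRITE,
             MAP_PRIVATE, file.as_raw_fd(), 0) as *mut u8
    };
    assert!(mem as isize != -1);

    let before = unsafe { *mem };   // first touch faults the page in
    unsafe { *mem = 0x55 };         // guest write hits the private copy
    let after = unsafe { *mem };
    unsafe { munmap(mem as *mut c_void, 4096) };

    let on_disk = fs::read(path).unwrap()[0]; // snapshot file unmodified
    (before, after, on_disk)
}

fn main() {
    assert_eq!(demand_paged_restore(), (0xAA, 0x55, 0xAA));
}
```

The final assertion is the sharing property: the guest sees and modifies its own copy, while the backing snapshot stays pristine for the next restore.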

Security Model

Stardust implements defense-in-depth through multiple isolation mechanisms:

Seccomp-BPF Filtering

A strict seccomp filter limits the VMM to exactly 78 syscalls—the minimum required for operation. Any attempt to invoke other syscalls results in immediate process termination. This dramatically reduces the kernel attack surface available to a compromised VMM.

The allowlist includes only:

  • Memory management: mmap, munmap, mprotect, brk
  • File operations: open, read, write, close, ioctl (for KVM)
  • Process control: exit, exit_group
  • Networking: socket, bind, listen, accept (for management API)
  • Synchronization: futex, eventfd

Landlock Filesystem Sandboxing

Stardust uses Landlock LSM to restrict filesystem access at the kernel level. The VMM can only access:

  • Its configuration file
  • Specified VM images and snapshots
  • Required device nodes (/dev/kvm, /dev/net/tun)
  • Its own working directory

Attempts to access other filesystem locations fail with EACCES, even if the process has traditional Unix permissions.

Capability Dropping

On startup, Stardust drops all Linux capabilities except those strictly required:

  • CAP_NET_ADMIN (for TAP device management)
  • CAP_SYS_ADMIN (for KVM and namespace operations, when needed)

The combination of seccomp, Landlock, and capability dropping creates multiple independent barriers. An attacker would need to defeat all three mechanisms to escape the VMM sandbox.


The VM Pool Innovation

Understanding the Bottleneck

Profiling revealed an unexpected truth: the single most expensive operation in VM restoration isn't loading memory or configuring devices. It's creating the VM itself.

The KVM_CREATE_VM ioctl takes approximately 24ms on typical server hardware. This single syscall:

  • Allocates kernel structures for the VM
  • Creates an anonymous inode in the KVM file descriptor space
  • Initializes hardware-specific state (VMCS/VMCB)
  • Sets up interrupt routing structures

24ms might seem small, but against a single-millisecond restore target it is 2,400% of the budget.

Memory mapping is near-instant. vCPU creation is fast. Register restoration is microseconds. But KVM_CREATE_VM dominates the critical path.

Pre-Warmed Pool Architecture

Stardust's solution is elegant: don't create VMs when you need them. Create them in advance.

The agent-level VM pool maintains a set of pre-created, unconfigured VMs ready for immediate use:

┌─────────────────────────────────────────────┐
│                  Agent                       │
│                                             │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │ Warm VM │ │ Warm VM │ │ Warm VM │  ...  │
│  │ (empty) │ │ (empty) │ │ (empty) │       │
│  └─────────┘ └─────────┘ └─────────┘       │
│                                             │
│  ┌─────────────────────────────────────┐   │
│  │         Restore Request             │   │
│  │                                     │   │
│  │  1. Claim VM from pool (<0.1ms)     │   │
│  │  2. mmap snapshot memory (<0.1ms)   │   │
│  │  3. Restore registers (<0.1ms)      │   │
│  │  4. Configure devices (<0.5ms)      │   │
│  │  5. Resume execution               │   │
│  │                                     │   │
│  │  Total: ~1ms                        │   │
│  └─────────────────────────────────────┘   │
│                                             │
│  Background: Replenish pool asynchronously │
└─────────────────────────────────────────────┘

When a restore request arrives:

  1. Claim a pre-created VM from the pool (atomic operation, <100μs)
  2. Configure memory regions using mmap (near-instant)
  3. Set vCPU registers from snapshot (microseconds)
  4. Attach virtio devices (sub-millisecond)
  5. Resume execution

Background threads replenish the pool, absorbing the 24ms creation cost outside the critical path.
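The claim/replenish flow above can be sketched with std threading primitives; a simulated delay stands in for the KVM_CREATE_VM cost, and the struct names and id scheme are hypothetical, not Stardust's actual implementation:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for a pre-created, unconfigured VM.
struct WarmVm { id: u64 }

// Stand-in for the ~24ms KVM_CREATE_VM cost (shortened here).
fn create_vm(id: u64) -> WarmVm {
    thread::sleep(Duration::from_millis(5));
    WarmVm { id }
}

struct VmPool {
    queue: Mutex<(VecDeque<WarmVm>, u64)>, // (warm VMs, next id)
    ready: Condvar,
}

impl VmPool {
    fn with_capacity(n: u64) -> Arc<Self> {
        let mut q = VecDeque::new();
        for id in 0..n {
            q.push_back(create_vm(id)); // paid once, up front
        }
        Arc::new(VmPool { queue: Mutex::new((q, n)), ready: Condvar::new() })
    }
}

// Claiming is a queue pop: the creation cost never sits on the restore
// path. A background thread replenishes the pool after each claim.
fn claim(pool: &Arc<VmPool>) -> WarmVm {
    let mut guard = pool.queue.lock().unwrap();
    loop {
        if let Some(vm) = guard.0.pop_front() {
            let next = guard.1;
            guard.1 += 1;
            drop(guard);
            let bg = Arc::clone(pool);
            thread::spawn(move || {
                let fresh = create_vm(next); // off the critical path
                bg.queue.lock().unwrap().0.push_back(fresh);
                bg.ready.notify_one();
            });
            return vm;
        }
        guard = pool.ready.wait(guard).unwrap(); // pool empty: wait
    }
}

fn demo() -> Vec<u64> {
    let pool = VmPool::with_capacity(2);
    (0..3).map(|_| claim(&pool).id).collect()
}

fn main() {
    let ids = demo();
    // Two claims are instant pops; the third is served by replenishment.
    assert_eq!(ids.len(), 3);
}
```

The third claim exceeds the initial pool size, so it blocks briefly on the condition variable until a background creation completes, which is exactly the burst-overflow behavior a real pool must handle.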

Scale-to-Zero Compatibility

The pool design explicitly supports scale-to-zero semantics. Here's the key insight: the pool runs at the agent level, not the workload level.

A serverless platform might run hundreds of different functions, but they all share the same pool of warm VMs. When a function scales to zero:

  1. Its VM is destroyed (releasing memory)
  2. Its snapshot remains on disk
  3. The shared warm pool remains ready

When the function needs to wake:

  1. Claim a VM from the shared pool
  2. Restore from the function's snapshot
  3. Execute

The warm pool cost is amortized across all workloads. Individual functions can scale to zero with true resource release, yet restore in ~1ms thanks to the shared infrastructure.

This is the architectural breakthrough: decouple VM creation from VM identity. VMs become fungible resources, shaped into specific workloads at restore time.

Performance Impact

The numbers tell the story:

Configuration                      Restore Time   vs. Firecracker
Firecracker snapshot restore       102ms          baseline
Stardust disk restore (no pool)    31ms           3.3x faster
Stardust disk restore + VM pool    1.04ms         98x faster

By eliminating the KVM_CREATE_VM bottleneck, Stardust achieves two orders of magnitude improvement over Firecracker's snapshot restore.


In-Memory CAS Restore

Stellarium Content-Addressed Storage

Stellarium is ArmoredGate's content-addressed storage layer, designed for efficient snapshot storage and retrieval.

Content-addressed storage uses cryptographic hashes as keys:

snapshot_data → SHA-256(data) → "a3f2c8..."
storage.put("a3f2c8...", snapshot_data)
retrieved = storage.get("a3f2c8...")

This approach provides natural deduplication: identical data produces identical hashes, so it's stored only once.

Stellarium chunks data into 2MB blocks before hashing. For VM snapshots, this enables:

  • Cross-VM deduplication: Identical kernel pages, libraries, and static data share storage
  • Incremental snapshots: Only changed chunks need storage
  • Efficient distribution: Common chunks can be cached closer to compute
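The put/get flow above can be sketched as follows. This is a toy model: std's DefaultHasher stands in for SHA-256 (it is not cryptographic), and a tiny chunk size replaces Stellarium's 2MB blocks so the deduplication effect is visible in a few bytes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Illustrative chunk size; Stellarium uses 2MB chunks.
const CHUNK: usize = 4;

// Stand-in for SHA-256 content addressing (std's hasher is not
// cryptographic; it only illustrates the keying scheme).
fn chunk_key(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

#[derive(Default)]
struct Cas {
    blobs: HashMap<u64, Vec<u8>>,
}

impl Cas {
    // Returns the chunk keys; identical chunks are stored only once.
    fn put(&mut self, data: &[u8]) -> Vec<u64> {
        data.chunks(CHUNK)
            .map(|c| {
                let key = chunk_key(c);
                self.blobs.entry(key).or_insert_with(|| c.to_vec());
                key
            })
            .collect()
    }

    // Reassembles a snapshot from its chunk-key manifest.
    fn get(&self, keys: &[u64]) -> Vec<u8> {
        keys.iter().flat_map(|k| self.blobs[k].iter().copied()).collect()
    }
}

fn main() {
    let mut cas = Cas::default();
    // Two "snapshots" sharing a prefix (think: identical kernel pages).
    let a = cas.put(b"KERNKERNdataAAAA");
    let b = cas.put(b"KERNKERNdataBBBB");
    assert_eq!(cas.get(&a), b"KERNKERNdataAAAA".to_vec());
    // Eight chunk references, but only four unique blobs stored.
    assert_eq!(a.len() + b.len(), 8);
    assert_eq!(cas.blobs.len(), 4);
}
```

The shared "KERN" and "data" chunks are stored once and referenced by both manifests, which is the same property that lets many VMs share kernel and library pages in the blob cache.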

Zero-Copy Memory Registration

When restoring from on-disk snapshots, the mmap demand-paging approach achieves ~31ms restore (without pooling) or ~1ms (with pooling). But there's still filesystem overhead: the kernel must map the file, maintain page cache entries, and handle faults.

Stellarium's in-memory path eliminates even this overhead.

The CAS blob cache maintains decompressed snapshot chunks in memory. When restoring:

  1. Look up required chunks by hash (hash table lookup, microseconds)
  2. Chunks are already in memory (no I/O)
  3. Register memory regions directly with KVM
  4. Resume execution

There's no mmap, no page faults, no filesystem involvement. The snapshot data is already in exactly the format KVM needs.

From Milliseconds to Microseconds

Configuration                   Restore Time   vs. Firecracker
Stardust in-memory (no pool)    24.5ms         4.2x faster
Stardust in-memory + VM pool    0.551ms        185x faster

At 0.551ms—551 microseconds—VM restoration is on the order of a single SSD read (hundreds of microseconds) and faster than:

  • A cross-datacenter network round trip (1-10ms)
  • A DNS lookup (10-100ms)

The VM is running before the network packet announcing its need could cross the datacenter.

Architecture Diagram

┌──────────────────────────────────────────────────────────────┐
│                     Stellarium CAS Layer                      │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │                    Blob Cache (RAM)                      │ │
│  │                                                          │ │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐       │ │
│  │  │ Chunk A │ │ Chunk B │ │ Chunk C │ │ Chunk D │  ...  │ │
│  │  │ (2MB)   │ │ (2MB)   │ │ (2MB)   │ │ (2MB)   │       │ │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘       │ │
│  │     ▲ shared    ▲ unique     ▲ shared    ▲ unique      │ │
│  └─────────────────────────────────────────────────────────┘ │
│                            │                                  │
│                    Zero-copy reference                        │
│                            │                                  │
│                            ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │                    Stardust VMM                          │ │
│  │                                                          │ │
│  │  KVM_SET_USER_MEMORY_REGION → points to cached chunks   │ │
│  │                                                          │ │
│  │  VM resume: 0.551ms                                      │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Shared chunks (kernel, common libraries) are deduplicated across all VMs. Each workload's unique data occupies only its differential footprint.


Benchmark Methodology & Results

Test Environment

All benchmarks were conducted on consistent, production-representative hardware:

  • CPU: Intel Xeon Silver 4210R (10 cores, 20 threads, 2.4 GHz base)
  • Memory: 376 GB DDR4 ECC
  • Storage: NVMe SSD (Samsung PM983, 3.5 GB/s sequential read)
  • OS: Debian with Linux 6.1 kernel
  • Comparison target: Firecracker v1.6.0 (latest stable release at time of testing)

Methodology

To ensure reliable measurements:

  1. Page cache clearing: echo 3 > /proc/sys/vm/drop_caches before each cold test
  2. Run count: 15 iterations per configuration
  3. Statistics: Mean with outlier removal (>2σ excluded)
  4. Warm-up: 3 discarded warm-up runs before measurement
  5. Isolation: Single VM per test, no competing workloads
  6. Snapshot size: 512 MB guest memory image
  7. Guest configuration: Minimal Linux, single vCPU
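The outlier-removal step (item 3 above) amounts to discarding samples more than two standard deviations from the sample mean before averaging. A minimal sketch, since the exact statistics pipeline is not published:

```rust
// Mean with >2σ outliers excluded, as in the methodology (a sketch;
// uses the population standard deviation of the raw samples).
fn trimmed_mean(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let sd = (samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n)
        .sqrt();
    let kept: Vec<f64> = samples
        .iter()
        .copied()
        .filter(|x| (x - mean).abs() <= 2.0 * sd)
        .collect();
    kept.iter().sum::<f64>() / kept.len() as f64
}

fn main() {
    // Nine fast runs and one slow outlier: the outlier is dropped.
    let mut runs = vec![1.0; 9];
    runs.push(100.0);
    assert!((trimmed_mean(&runs) - 1.0).abs() < 1e-9);
}
```

This kind of trimming matters for restore benchmarks because a single page-cache miss or scheduler hiccup can inflate one run by orders of magnitude relative to a sub-millisecond mean.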

Cold Boot Results

Metric               Stardust   Firecracker v1.6.0   Improvement
VM create (avg)      55.49ms    107.03ms             1.92x faster
Full boot to shell   1.256s

Stardust creates VMs nearly twice as fast as Firecracker in the cold path. While both use KVM, Stardust's leaner initialization reduces overhead.

Snapshot Restore Results

This is the headline data:

Restore Path                       Time       vs. Firecracker
Firecracker snapshot restore       102ms      baseline
Stardust disk restore (no pool)    31ms       3.3x faster
Stardust disk restore + VM pool    1.04ms     98x faster
Stardust in-memory (no pool)       24.5ms     4.2x faster
Stardust in-memory + VM pool       0.551ms    185x faster

Each optimization layer provides multiplicative improvement:

  • Demand-paged mmap: ~3x over eager loading
  • VM pool: ~30x over creating per-restore
  • In-memory CAS: ~2x over disk mmap
  • Combined: 185x faster than Firecracker

Memory Footprint

Metric       Stardust   Firecracker   Improvement
RSS per VM   24 MB      36 MB         33% reduction

Lower memory footprint enables higher VM density, directly improving infrastructure economics.

Chart Specifications

For graphic design implementation:

Chart 1: Snapshot Restore Time (logarithmic scale)

  • Y-axis: Restore time (ms), log scale
  • X-axis: Five configurations
  • Highlight: Firecracker bar in gray, Stardust in-memory+pool in brand color
  • Annotation: "185x faster" callout

Chart 2: Cold Boot Comparison

  • Side-by-side bars: Stardust vs Firecracker
  • Values labeled directly on bars
  • Annotation: "1.92x faster" callout

Chart 3: Memory Footprint

  • Simple two-bar comparison
  • Annotation: "33% reduction"

Use Cases

Serverless Functions: True Scale-to-Zero

The original motivation for Stardust: enabling serverless platforms to achieve genuine scale-to-zero without cold start penalties.

Before Stardust:

  • Keep warm pools to avoid cold starts → pay for idle compute
  • Accept cold starts for rarely-used functions → poor user experience
  • Complex prediction systems to balance the trade-off → operational overhead

With Stardust:

  • Scale to zero immediately when functions are idle
  • Restore in 0.5ms when requests arrive
  • No prediction, no waste, no perceptible latency

For serverless providers, this translates directly to margin improvement. For users, it means consistent sub-millisecond function startup regardless of prior activity.

Edge Computing

Edge locations have limited resources. Running warm pools at hundreds of edge sites is economically prohibitive.

Stardust enables a different model:

  • Deploy function snapshots to edge locations (efficient with CAS deduplication)
  • Run no VMs until needed
  • Restore on-demand in <1ms
  • Release immediately after execution

Edge computing becomes truly pay-per-use, with response times dominated by network latency rather than compute initialization.

Database Cloning

Development and testing workflows often require fresh database instances. Traditional approaches:

  • Full database copies: minutes to hours
  • Container snapshots: seconds
  • LVM snapshots: complex, storage-coupled

Stardust snapshots capture entire database VMs in their running state. Cloning becomes:

  1. Reference the snapshot (instant)
  2. Restore to new VM (0.5ms)
  3. Copy-on-write handles divergent data

Developers can spin up isolated database environments in under a millisecond, enabling workflows that were previously impractical.

CI/CD Environments

Continuous integration pipelines spend significant time provisioning build environments. With Stardust:

  • Snapshot the configured build environment once
  • Restore fresh instances for each build (0.5ms)
  • Perfect isolation between builds
  • No container image layer caching complexity

Build environment provisioning becomes negligible in the CI/CD timeline.


Conclusion & Future Work

Summary of Achievements

Stardust represents a fundamental advance in microVM technology:

  • 185x faster snapshot restore than Firecracker (0.551ms vs 102ms)
  • Sub-millisecond VM restoration from memory with VM pooling
  • 33% lower memory footprint per VM (24MB vs 36MB)
  • Production-ready security with seccomp-BPF, Landlock, and capability dropping
  • Minimal footprint: ~24,000 lines of Rust, 3.9 MB binary

The key architectural insight—decoupling VM creation from VM identity through pre-warmed pools, combined with demand-paged memory and content-addressed storage—enables true scale-to-zero with imperceptible restore latency.

Like its astronomical namesake, Stardust achieves extraordinary density: comprehensive VMM capability compressed into a minimal form factor, with performance that seems to defy conventional limits.

Future Development Roadmap

Stardust development continues with several planned enhancements:

ACPI MADT Tables
Current SMP support uses legacy Intel MPS v1.4 tables. ACPI MADT (Multiple APIC Description Table) will provide modern interrupt routing, better guest OS compatibility, and enable advanced features like CPU hotplug.

Dirty-Page Incremental Snapshots
Currently, snapshots capture full VM memory state. Future versions will track dirty pages between snapshots, enabling:

  • Faster snapshot creation (only changed pages)
  • Reduced storage requirements
  • More frequent snapshot points

CPU Hotplug
Dynamic addition and removal of vCPUs without VM restart. This enables workloads to scale compute resources in response to demand without incurring even sub-millisecond restore latency.

NUMA Awareness
For larger VMs spanning NUMA nodes, explicit NUMA topology and memory placement will optimize memory access latency in multi-socket systems.


About ArmoredGate

ArmoredGate builds infrastructure software for the next generation of cloud computing. Our products include Stardust (microVM management), Stellarium (content-addressed storage), and Voltainer (container orchestration). We believe security and performance are complementary, not competing concerns.

For more information, contact: engineering@armoredgate.com


© 2025 ArmoredGate, Inc. All rights reserved.

Stardust, Stellarium, and Voltainer are trademarks of ArmoredGate, Inc. Linux is a registered trademark of Linus Torvalds. Intel and Xeon are trademarks of Intel Corporation. All other trademarks are property of their respective owners.