Stellarium: Unified Storage Architecture for Volt
"Every byte has a home. Every home is shared. Nothing is stored twice."
1. Vision Statement
Stellarium is a revolutionary storage architecture that treats storage not as isolated volumes, but as a unified content-addressed stellar cloud where every unique byte exists exactly once, and every VM draws from the same constellation of data.
What Makes This Revolutionary
Traditional VM storage operates on a fundamental lie: that each VM has its own dedicated disk. This creates:
- Massive redundancy — 1000 Debian VMs = 1000 copies of libc
- Slow boots — Each VM reads its own copy of boot files
- Wasted IOPS — Page cache misses everywhere
- Memory bloat — Same data cached N times
Stellarium inverts this model. Instead of VMs owning storage, storage serves VMs through a unified content mesh. The result:
| Metric | Traditional | Stellarium | Improvement |
|---|---|---|---|
| Storage per 1000 Debian VMs | 10 TB | 12 GB + deltas | 833x |
| Cold boot time | 2-5s | <50ms | 40-100x |
| Memory efficiency | 1 GB/VM | ~50 MB shared core | 20x |
| IOPS for identical reads | N | 1 | Nx |
2. Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ STELLARIUM LAYERS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Volt │ │ Volt │ │ Volt │ VM Layer │
│ │ microVM │ │ microVM │ │ microVM │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────┴────────────────┴────────────────┴──────┐ │
│ │ STELLARIUM VirtIO Driver │ Driver │
│ │ (Memory-Mapped CAS Interface) │ Layer │
│ └──────────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴────────────────────────┐ │
│ │ NOVA-STORE │ Store │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Layer │
│ │ │ TinyVol │ │ShareVol │ │ DeltaVol│ │ │
│ │ │ Manager │ │ Manager │ │ Manager │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ │ └───────────┴───────────┘ │ │
│ │ │ │ │
│ │ ┌────────────────┴────────────────┐ │ │
│ │ │ PHOTON (Content Router) │ │ │
│ │ │ Hot→Memory Warm→NVMe Cold→S3 │ │ │
│ │ └────────────────┬────────────────┘ │ │
│ └───────────────────┼──────────────────────────┘ │
│ │ │
│ ┌───────────────────┴──────────────────────────┐ │
│ │ NEBULA (CAS Core) │ Foundation │
│ │ │ Layer │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │ │
│ │ │ Chunk │ │ Block │ │ Distributed │ │ │
│ │ │ Packer │ │ Dedup │ │ Hash Index │ │ │
│ │ └─────────┘ └─────────┘ └─────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ COSMIC MESH (Distributed CAS) │ │ │
│ │ │ Local NVMe ←→ Cluster ←→ Object Store │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Core Components
NEBULA: Content-Addressable Storage Core
The foundation layer. Every piece of data is:
- Chunked using content-defined chunking (CDC) with FastCDC algorithm
- Hashed with BLAKE3 (256-bit, hardware-accelerated)
- Deduplicated at write time via hash lookup
- Stored once regardless of how many VMs reference it
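A minimal sketch of the insert-or-reference write path, using an in-memory map and std's `DefaultHasher` as a stand-in for BLAKE3 (the `Nebula` type and its methods here are illustrative, not the real API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3; the real core would use the blake3 crate.
fn content_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Insert-or-reference CAS: each unique chunk is stored once;
/// duplicates only bump a reference count.
struct Nebula {
    chunks: HashMap<u64, (Vec<u8>, u32)>, // hash -> (data, refcount)
}

impl Nebula {
    fn new() -> Self {
        Nebula { chunks: HashMap::new() }
    }

    /// Returns the chunk's content address; stores the data only if new.
    fn insert(&mut self, data: &[u8]) -> u64 {
        let h = content_hash(data);
        self.chunks
            .entry(h)
            .and_modify(|(_, refs)| *refs += 1)
            .or_insert_with(|| (data.to_vec(), 1));
        h
    }

    fn unique_bytes(&self) -> usize {
        self.chunks.values().map(|(d, _)| d.len()).sum()
    }
}
```

Writing the same content twice costs one stored copy and one refcount bump, which is the whole dedup model in miniature.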
PHOTON: Intelligent Content Router
Manages data placement across the storage hierarchy:
- L1 (Hot): Memory-mapped, instant access, boot-critical data
- L2 (Warm): NVMe, sub-millisecond, working set
- L3 (Cool): SSD, single-digit ms, recent data
- L4 (Cold): Object storage (S3/R2), archival
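A sketch of how such a placement decision might look; the `place` function and its thresholds are illustrative only, not Stellarium's actual tuning:

```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Hot,  // L1: memory-mapped
    Warm, // L2: NVMe working set
    Cool, // L3: SSD, recent data
    Cold, // L4: object storage
}

/// Illustrative policy: boot-critical chunks pin to memory; everything
/// else demotes with age since last access. Thresholds are made up.
fn place(boot_critical: bool, secs_since_access: u64) -> Tier {
    if boot_critical {
        return Tier::Hot;
    }
    match secs_since_access {
        0..=60 => Tier::Warm,
        61..=3_600 => Tier::Cool,
        _ => Tier::Cold,
    }
}
```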
NOVA-STORE: Volume Abstraction Layer
Presents traditional block/file interfaces to VMs while backed by CAS:
- TinyVol: Ultra-lightweight volumes with minimal metadata
- ShareVol: Copy-on-write shared volumes
- DeltaVol: Delta-encoded writable layers
3. Key Innovations
3.1 Stellar Deduplication
Innovation: Inline deduplication with zero write amplification.
Traditional dedup:
Write → Buffer → Hash → Lookup → Decide → Store
(copy) (wait) (maybe copy again)
Stellar dedup:
Write → Hash-while-streaming → CAS Insert (atomic)
(no buffer needed) (single write or reference)
Implementation:
struct StellarChunk {
    hash: Blake3Hash, // 32 bytes
    size: u16,        // 2 bytes (u16 caps chunks just under 64 KiB; 0 can encode a full 64 KiB chunk)
    refs: AtomicU32,  // 4 bytes - reference count
    tier: AtomicU8,   // 1 byte - storage tier
    flags: u8,        // 1 byte - compression, encryption
    // Total: 40 bytes of metadata per chunk
}
// Hash table: 40 bytes × 1B chunks = 40 GB of index for ~40 TB of unique
// data (at a ~40 KiB average chunk size). Fits in memory on modern servers.
3.2 TinyVol: Minimal Volume Overhead
Innovation: Volumes as tiny manifest files, not pre-allocated space.
Traditional qcow2: Header (512B) + L1 Table + L2 Tables + Refcount...
Minimum overhead: ~512KB even for empty volume
TinyVol: Just a manifest pointing to chunks
Overhead: 64 bytes base + 48 bytes per modified chunk
Empty 10GB volume: 64 bytes
1GB modified: 64B + (1GB/64KB × 48B) = ~768KB
Structure:
struct TinyVol {
    magic: [u8; 8],         // "TINYVOL\0"
    version: u32,
    flags: u32,
    base_image: Blake3Hash, // Optional parent
    size_bytes: u64,
    chunk_map: BTreeMap<ChunkIndex, ChunkRef>,
}

struct ChunkRef {
    hash: Blake3Hash,       // 32 bytes
    offset_in_vol: [u8; 6], // 6 bytes (48-bit offset; Rust has no native u48)
    len: u16,               // 2 bytes
    flags: u64,             // 8 bytes (CoW, compressed, etc.)
}
// Total: 48 bytes per modified chunk
3.3 ShareVol: Zero-Copy Shared Volumes
Innovation: Multiple VMs share read paths, with instant copy-on-write.
Traditional Shared Storage:
VM1 reads /lib/libc.so → Disk read → VM1 memory
VM2 reads /lib/libc.so → Disk read → VM2 memory
(Same data read twice, stored twice in RAM)
ShareVol:
VM1 reads /lib/libc.so → Shared mapping (already in memory)
VM2 reads /lib/libc.so → Same shared mapping
(Single read, single memory location, N consumers)
Memory-Mapped CAS:
// Shared content is memory-mapped once
struct SharedMapping {
    hash: Blake3Hash,
    mmap_addr: *const u8,
    mmap_len: usize,
    vm_refs: AtomicU32,     // How many VMs reference this
    last_access: AtomicU64, // For eviction
}
// VMs get read-only mappings to shared content.
// Write attempts trigger CoW into a TinyVol delta layer.
3.4 Cosmic Packing: Small File Optimization
Innovation: Pack small files into larger chunks without losing addressability.
Problem: Millions of small files (< 4KB) waste space at chunk boundaries.
Solution: Cosmic Packs — aggregated storage with inline index:
┌─────────────────────────────────────────────────┐
│ COSMIC PACK (64KB) │
├─────────────────────────────────────────────────┤
│ Header (64B) │
│ - magic, version, entry_count │
├─────────────────────────────────────────────────┤
│ Index (variable, ~100B per entry) │
│ - [hash, offset, len, flags] × N │
├─────────────────────────────────────────────────┤
│ Data (remaining space) │
│ - Packed file contents │
└─────────────────────────────────────────────────┘
Benefit: a thousand 100-byte files no longer need a thousand individually addressed chunks; they pack into a few 64KB Cosmic Packs, and every file remains individually addressable through the inline index.
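A minimal sketch of pack-and-lookup, assuming a `u64` key standing in for the BLAKE3 hash and an in-memory index (`CosmicPack` here is illustrative, not the on-disk format above):

```rust
/// Small payloads appended into one buffer, each individually
/// addressable via an (key, offset, len) index entry.
struct CosmicPack {
    index: Vec<(u64, usize, usize)>, // (key, offset, len)
    data: Vec<u8>,
}

impl CosmicPack {
    fn new() -> Self {
        CosmicPack { index: Vec::new(), data: Vec::new() }
    }

    /// Append a small payload; it stays individually addressable.
    fn add(&mut self, key: u64, payload: &[u8]) {
        self.index.push((key, self.data.len(), payload.len()));
        self.data.extend_from_slice(payload);
    }

    /// Look a payload back up by key via the inline index.
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.index
            .iter()
            .find(|(k, _, _)| *k == key)
            .map(|&(_, off, len)| &self.data[off..off + len])
    }
}
```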
3.5 Stellar Boot: Sub-50ms VM Start
Innovation: Boot data is pre-staged in memory before VM starts.
Boot Sequence Comparison:
Traditional:
t=0ms VMM starts
t=5ms BIOS loads
t=50ms Kernel requested
t=100ms Kernel loaded from disk
t=200ms initrd loaded
t=500ms Root FS mounted
t=2000ms Boot complete
Stellar Boot:
t=-50ms Boot manifest analyzed (during scheduling)
t=-25ms Hot chunks pre-faulted to memory
t=0ms VMM starts with memory-mapped boot data
t=5ms Kernel executes (already in memory)
t=15ms initrd processed (already in memory)
t=40ms Root FS ready (ShareVol, pre-mapped)
t=50ms Boot complete
Boot Manifest:
struct BootManifest {
    kernel: Blake3Hash,
    initrd: Option<Blake3Hash>,
    root_vol: TinyVolRef,
    // Predicted hot chunks for the first 100 ms
    prefetch_set: Vec<Blake3Hash>,
    // Memory layout hints
    kernel_load_addr: u64,
    initrd_load_addr: Option<u64>,
}
3.6 CDN-Native Distribution: Voltainer Integration
Innovation: Images distributed via CDN, layers indexed directly in NEBULA.
Traditional (Registry-based):
Registry API → Pull manifest → Pull layers → Extract → Overlay FS
(Complex protocol, copies data, registry infrastructure required)
Stellarium + CDN:
HTTPS GET manifest → HTTPS GET missing chunks → Mount
(Simple HTTP, zero extraction, CDN handles global distribution)
CDN-Native Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ CDN-NATIVE DISTRIBUTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ cdn.armoredgate.com/ │
│ ├── manifests/ │
│ │ └── {blake3-hash}.json ← Image/layer manifests │
│ └── blobs/ │
│ └── {blake3-hash} ← Raw content chunks │
│ │
│ Benefits: │
│ ✓ No registry daemon to run │
│ ✓ No registry protocol complexity │
│ ✓ Global edge caching built-in │
│ ✓ Simple HTTPS GET (curl-debuggable) │
│ ✓ Content-addressed = perfect cache keys │
│ ✓ Dedup at CDN level (same hash = same edge cache) │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation:
struct CdnDistribution {
    base_url: String, // "https://cdn.armoredgate.com"
}

impl CdnDistribution {
    async fn fetch_manifest(&self, hash: &Blake3Hash) -> Result<ImageManifest> {
        let url = format!("{}/manifests/{}.json", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        Ok(resp.json().await?)
    }

    async fn fetch_chunk(&self, hash: &Blake3Hash) -> Result<Vec<u8>> {
        let url = format!("{}/blobs/{}", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        // Verify the content hash matches (integrity check); a mismatch
        // is an error, not a panic
        let data = resp.bytes().await?;
        if blake3::hash(&data) != *hash {
            return Err(Error::ChunkIntegrityMismatch);
        }
        Ok(data.to_vec())
    }

    async fn fetch_missing(&self, needed: &[Blake3Hash], local: &Nebula) -> Result<()> {
        // Only fetch chunks we don't have locally
        let missing: Vec<_> = needed.iter()
            .filter(|h| !local.exists(h))
            .collect();
        // Parallel fetch from the CDN, failing fast on the first error
        futures::future::try_join_all(
            missing.iter().map(|h| self.fetch_and_store(h, local))
        ).await?;
        Ok(())
    }
}
struct VoltainerImage {
    manifest_hash: Blake3Hash,
    layers: Vec<LayerRef>,
}

struct LayerRef {
    hash: Blake3Hash,          // Content hash (CDN path)
    stellar_manifest: TinyVol, // Direct mapping to Stellar chunks
}

// Voltainer pull = simple CDN fetch
async fn voltainer_pull(image: &str, cdn: &CdnDistribution, nebula: &Nebula) -> Result<VoltainerImage> {
    // 1. Resolve the image name to a manifest hash (local index or CDN lookup)
    let manifest_hash = resolve_image_hash(image).await?;
    // 2. Fetch the manifest from the CDN
    let manifest = cdn.fetch_manifest(&manifest_hash).await?;
    // 3. Fetch only missing chunks (dedup-aware)
    let needed_chunks = manifest.all_chunk_hashes();
    cdn.fetch_missing(&needed_chunks, nebula).await?;
    // 4. The image is ready: no extraction, the layers ARE the storage
    Ok(VoltainerImage::from_manifest(manifest))
}
Voltainer Integration:
// Voltainer (systemd-nspawn based) uses Stellarium directly
impl VoltainerRuntime {
    async fn create_container(&self, image: &VoltainerImage) -> Result<Container> {
        // Layers are already in NEBULA; just create an overlay view
        let rootfs = self.stellarium.create_overlay_view(&image.layers)?;
        // systemd-nspawn mounts the Stellarium-backed rootfs
        let container = systemd_nspawn::Container::new()
            .directory(&rootfs)
            .private_network(true)
            .boot(false)
            .spawn()?;
        Ok(container)
    }
}
3.7 Memory-Storage Convergence
Innovation: Memory and storage share the same backing, eliminating double-buffering.
Traditional:
Storage: [Block Device] → [Page Cache] → [VM Memory]
(data copied twice)
Stellarium:
Unified: [CAS Memory Map] ←──────────→ [VM Memory View]
(single location, two views)
DAX-Style Direct Access:
// The VM sees storage as a memory-mapped region
struct StellarBlockDevice {
    volumes: Vec<TinyVol>,
    photon: Photon, // content router handle
}

impl StellarBlockDevice {
    fn handle_read(&self, offset: u64, len: usize) -> &[u8] {
        let chunk = self.volumes[0].chunk_at(offset);
        let mapping = self.photon.get_or_map(chunk.hash);
        &mapping[chunk.local_offset..][..len]
    }

    // Writes go to the delta layer
    fn handle_write(&mut self, offset: u64, data: &[u8]) {
        self.volumes[0].write_delta(offset, data);
    }
}
4. Density Targets
Storage Efficiency
| Scenario | Traditional | Stellarium | Target |
|---|---|---|---|
| 1000 Ubuntu 22.04 VMs | 2.5 TB | 2.8 GB shared + 10 MB/VM avg delta | 99.6% reduction |
| 10000 Python app VMs (same base) | 25 TB | 2.8 GB + 5 MB/VM | 99.8% reduction |
| Mixed workload (100 unique bases) | 2.5 TB | 50 GB shared + 20 MB/VM avg | 94% reduction |
Memory Efficiency
| Component | Traditional | Stellarium | Target |
|---|---|---|---|
| Kernel (per VM) | 8-15 MB | Shared (~0 marginal) | 99%+ reduction |
| libc (per VM) | 2 MB | Shared | 99%+ reduction |
| Page cache duplication | High | Zero | 100% reduction |
| Effective RAM per VM | 512 MB - 1 GB | 50-100 MB unique | 5-10x improvement |
Performance
| Metric | Traditional | Stellarium Target |
|---|---|---|
| Cold boot (minimal VM) | 500ms - 2s | < 50ms |
| Warm boot (pre-cached) | 100-500ms | < 20ms |
| Clone time (full copy) | 10-60s | < 1ms (CoW instant) |
| Dedup ratio (homogeneous) | N/A | 50:1 to 1000:1 |
| IOPS (deduplicated reads) | N | 1 |
Density Goals
| Scenario | Traditional (64GB RAM host) | Stellarium Target |
|---|---|---|
| Minimal VMs (32MB each) | ~1000 | 5000-10000 |
| Small VMs (128MB each) | ~400 | 2000-4000 |
| Medium VMs (512MB each) | ~100 | 500-1000 |
| Storage per 10K VMs | 10-50 TB | 10-50 GB |
5. Integration with Volt VMM
Boot Path Integration
// Volt VMM integration
impl VoltVmm {
    fn boot_with_stellarium(&mut self, manifest: BootManifest) -> Result<()> {
        // 1. Pre-fault boot chunks into L1 (memory)
        let prefetch_handle = self.stellarium.prefetch(&manifest.prefetch_set);
        // 2. Set up the memory-mapped kernel
        let kernel_mapping = self.stellarium.map_readonly(&manifest.kernel);
        self.load_kernel_direct(kernel_mapping);
        // 3. Set up the memory-mapped initrd (if present)
        if let Some(initrd) = &manifest.initrd {
            let initrd_mapping = self.stellarium.map_readonly(initrd);
            self.load_initrd_direct(initrd_mapping);
        }
        // 4. Configure the VirtIO-Stellar device
        self.add_stellar_blk(manifest.root_vol)?;
        // 5. Ensure the prefetch is complete
        prefetch_handle.wait();
        // 6. Boot
        self.start()
    }
}
VirtIO-Stellar Driver
Custom VirtIO block device that speaks Stellarium natively:
struct VirtioStellarConfig {
    // Standard virtio-blk compatible
    capacity: u64,
    size_max: u32,
    seg_max: u32,
    // Stellarium extensions
    stellar_features: u64, // STELLAR_F_SHAREVOL, STELLAR_F_DEDUP, etc.
    vol_hash: Blake3Hash,  // Volume identity
    shared_regions: u32,   // Number of pre-shared regions
}

// Request types (extends standard virtio-blk)
enum StellarRequest {
    Read { sector: u64, len: u32 },
    Write { sector: u64, data: Vec<u8> },
    // Stellarium extensions
    MapShared { hash: Blake3Hash }, // Map a shared chunk directly
    QueryDedup { sector: u64 },     // Check if a sector is deduplicated
    Prefetch { sectors: Vec<u64> }, // Hint upcoming reads
}
Snapshot and Restore
// Instant snapshots via TinyVol CoW
fn snapshot_vm(vm: &VoltVm) -> VmSnapshot {
    VmSnapshot {
        // Memory as Stellar chunks
        memory_chunks: stellarium.chunk_memory(vm.memory_region()),
        // The volume is already CoW: just reference it
        root_vol: vm.root_vol.clone_manifest(),
        // CPU state is tiny
        cpu_state: vm.save_cpu_state(),
    }
}

// Restore from a snapshot
fn restore_vm(snapshot: &VmSnapshot) -> VoltVm {
    let mut vm = VoltVm::new();
    // Memory is mapped directly from Stellar chunks
    vm.map_memory_from_stellar(&snapshot.memory_chunks);
    // The volume manifest is loaded (no data copy)
    vm.attach_vol(snapshot.root_vol.clone());
    // Restore CPU state
    vm.restore_cpu_state(&snapshot.cpu_state);
    vm
}
Live Migration with Dedup
// Only transfer unique chunks during migration
async fn migrate_vm(vm: &VoltVm, target: &NodeAddr) -> Result<()> {
    // 1. Get the set of chunks the VM references
    let vm_chunks = vm.collect_chunk_refs();
    // 2. Query the target for chunks it already has
    let target_has = target.query_chunks(&vm_chunks).await?;
    // 3. Transfer only the missing chunks
    let missing = vm_chunks.difference(&target_has);
    target.receive_chunks(&missing).await?;
    // 4. Transfer the tiny metadata
    target.receive_manifest(&vm.root_vol).await?;
    target.receive_memory_manifest(&vm.memory_chunks).await?;
    // 5. Final state sync and switchover
    vm.pause();
    target.receive_final_state(vm.cpu_state()).await?;
    target.resume().await?;
    Ok(())
}
6. Implementation Priorities
Phase 1: Foundation (Month 1-2)
Goal: Core CAS and basic volume support
- NEBULA Core
  - BLAKE3 hashing with SIMD acceleration
  - In-memory hash table (robin hood hashing)
  - Basic chunk storage (local NVMe)
  - Reference counting
- TinyVol v1
  - Manifest format
  - Read-only volume mounting
  - Basic CoW writes
- VirtIO-Stellar Driver
  - Basic block interface
  - Integration with Volt
Deliverable: Boot a VM from Stellarium storage
Phase 2: Deduplication (Month 2-3)
Goal: Inline dedup with zero performance regression
- Inline Deduplication
  - Write path with hash-first
  - Atomic insert-or-reference
  - Dedup metrics/reporting
- Content-Defined Chunking
  - FastCDC implementation
  - Tuned for VM workloads
- Base Image Sharing
  - ShareVol implementation
  - Multiple VMs sharing a base
Deliverable: 10:1+ dedup ratio for homogeneous VMs
Phase 3: Performance (Month 3-4)
Goal: Sub-50ms boot, memory convergence
- PHOTON Tiering
  - Hot/warm/cold classification
  - Automatic promotion/demotion
  - Memory-mapped hot tier
- Boot Optimization
  - Boot manifest analysis
  - Prefetch implementation
  - Zero-copy kernel loading
- Memory-Storage Convergence
  - DAX-style direct access
  - Shared page elimination
Deliverable: <50ms cold boot, memory sharing active
Phase 4: Density (Month 4-5)
Goal: 10000+ VMs per host achievable
- Small File Packing
  - Cosmic Pack implementation
  - Inline file storage
- Aggressive Sharing
  - Cross-VM page dedup
  - Kernel/library sharing
- Memory Pressure Handling
  - Intelligent eviction
  - Graceful degradation
Deliverable: 5000+ density on 64GB host
Phase 5: Distribution (Month 5-6)
Goal: Multi-node Stellarium cluster
- Cosmic Mesh
  - Distributed hash index
  - Cross-node chunk routing
  - Consistent hashing for placement
- Migration Optimization
  - Chunk pre-staging
  - Delta transfers
- Object Storage Backend
  - S3/R2 cold tier
  - Async writeback
Deliverable: Seamless multi-node storage
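The "consistent hashing for placement" item could be realized with rendezvous (highest-random-weight) hashing, which needs no ring state; this sketch uses std's `DefaultHasher` as a stand-in hash and invented node names:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rendezvous hashing: score every node against the chunk hash and pick
/// the highest score. Adding or removing a node only remaps the chunks
/// that node itself would own, which keeps rebalancing traffic minimal.
fn place_chunk(chunk_hash: u64, nodes: &[&str]) -> String {
    nodes
        .iter()
        .max_by_key(|node| {
            let mut h = DefaultHasher::new();
            (chunk_hash, *node).hash(&mut h);
            h.finish()
        })
        .expect("at least one node")
        .to_string()
}
```

Because placement is a pure function of (chunk hash, node set), every node computes the same answer independently, with no coordination.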
Phase 6: Voltainer + CDN Native (Month 6-7)
Goal: Voltainer containers as first-class citizens, CDN-native distribution
- CDN Distribution Layer
  - Manifest/chunk fetch from the ArmoredGate CDN
  - Parallel chunk retrieval
  - Edge cache warming strategies
- Voltainer Integration
  - Direct Stellarium mount for systemd-nspawn
  - Shared layers between Voltainer containers and Volt VMs
  - Unified storage for both runtimes
- Layer Mapping
  - Direct layer registration in NEBULA
  - No extraction needed
  - Content-addressed = perfect CDN cache keys
Deliverable: Voltainer containers boot in <100ms, unified with VM storage
7. Name: Stellarium
Why Stellarium?
Continuing the cosmic theme of Stardust (cluster) and Volt (VMM):
- Stellar = Star-like, exceptional, relating to stars
- -arium = A place for (like aquarium, planetarium)
- Stellarium = "A place for stars" — where all your VMs' data lives
Component Names (Cosmic Theme)
| Component | Name | Meaning |
|---|---|---|
| CAS Core | NEBULA | Birthplace of stars, cloud of shared matter |
| Content Router | PHOTON | Light-speed data movement |
| Chunk Packer | Cosmic Pack | Aggregating cosmic dust |
| Volume Manager | Nova-Store | Connects to Volt |
| Distributed Mesh | Cosmic Mesh | Interconnected universe |
| Boot Optimizer | Stellar Boot | Star-like speed |
| Small File Pack | Cosmic Dust | Tiny particles aggregated |
Taglines
- "Every byte a star. Every star shared."
- "The storage that makes density possible."
- "Where VMs find their data, instantly."
8. Summary
Stellarium transforms storage from a per-VM liability into a shared asset. By treating all data as content-addressed chunks in a unified namespace:
- Deduplication becomes free — No extra work, it's the storage model
- Sharing becomes default — VMs reference, not copy
- Boot becomes instant — Data is pre-positioned
- Density becomes extreme — 10-100x more VMs per host
- Migration becomes trivial — Only ship unique data
Combined with Volt's minimal VMM overhead, Stellarium enables the original ArmoredContainers vision: VM isolation at container density, with VM security guarantees.
The Stellarium Promise
On a 64GB host with 2TB NVMe:
- 10,000+ microVMs running simultaneously
- 50GB total storage for 10,000 Debian-based workloads
- <50ms boot time for any VM
- Instant cloning and snapshots
- Seamless live migration
This isn't incremental improvement. This is a new storage paradigm for the microVM era.
Stellarium: The stellar storage for stellar density.