Stellarium: Unified Storage Architecture for Volt
"Every byte has a home. Every home is shared. Nothing is stored twice."
1. Vision Statement
Stellarium is a revolutionary storage architecture that treats storage not as isolated volumes, but as a unified content-addressed stellar cloud where every unique byte exists exactly once, and every VM draws from the same constellation of data.
What Makes This Revolutionary
Traditional VM storage operates on a fundamental lie: that each VM has its own dedicated disk. This creates:
- Massive redundancy — 1000 Debian VMs = 1000 copies of libc
- Slow boots — Each VM reads its own copy of boot files
- Wasted IOPS — Page cache misses everywhere
- Memory bloat — Same data cached N times
Stellarium inverts this model. Instead of VMs owning storage, storage serves VMs through a unified content mesh. The result:
| Metric | Traditional | Stellarium | Improvement |
|---|---|---|---|
| Storage per 1000 Debian VMs | 10 TB | 12 GB + deltas | 833x |
| Cold boot time | 2-5s | <50ms | 40-100x |
| Memory efficiency | 1 GB/VM | ~50 MB shared core | 20x |
| IOPS for identical reads | N | 1 | Nx |
2. Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ STELLARIUM LAYERS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Volt │ │ Volt │ │ Volt │ VM Layer │
│ │ microVM │ │ microVM │ │ microVM │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────┴────────────────┴────────────────┴──────┐ │
│ │ STELLARIUM VirtIO Driver │ Driver │
│ │ (Memory-Mapped CAS Interface) │ Layer │
│ └──────────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴────────────────────────┐ │
│ │ NOVA-STORE │ Store │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Layer │
│ │ │ TinyVol │ │ShareVol │ │ DeltaVol│ │ │
│ │ │ Manager │ │ Manager │ │ Manager │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ │ └───────────┴───────────┘ │ │
│ │ │ │ │
│ │ ┌────────────────┴────────────────┐ │ │
│ │ │ PHOTON (Content Router) │ │ │
│ │ │ Hot→Memory Warm→NVMe Cold→S3 │ │ │
│ │ └────────────────┬────────────────┘ │ │
│ └───────────────────┼──────────────────────────┘ │
│ │ │
│ ┌───────────────────┴──────────────────────────┐ │
│ │ NEBULA (CAS Core) │ Foundation │
│ │ │ Layer │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │ │
│ │ │ Chunk │ │ Block │ │ Distributed │ │ │
│ │ │ Packer │ │ Dedup │ │ Hash Index │ │ │
│ │ └─────────┘ └─────────┘ └─────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ COSMIC MESH (Distributed CAS) │ │ │
│ │ │ Local NVMe ←→ Cluster ←→ Object Store │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Core Components
NEBULA: Content-Addressable Storage Core
The foundation layer. Every piece of data is:
- Chunked using content-defined chunking (CDC) with FastCDC algorithm
- Hashed with BLAKE3 (256-bit, hardware-accelerated)
- Deduplicated at write time via hash lookup
- Stored once regardless of how many VMs reference it
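A minimal sketch of the insert-or-reference write path, using an in-memory map and std's `DefaultHasher` as a stand-in for BLAKE3 (the `Nebula` type and its methods here are illustrative, not the real API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3; the real core would use the blake3 crate.
fn content_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Insert-or-reference CAS: each unique chunk is stored once;
/// duplicates only bump a reference count.
struct Nebula {
    chunks: HashMap<u64, (Vec<u8>, u32)>, // hash -> (data, refcount)
}

impl Nebula {
    fn new() -> Self {
        Nebula { chunks: HashMap::new() }
    }

    /// Returns the chunk's content address; stores the data only if new.
    fn insert(&mut self, data: &[u8]) -> u64 {
        let h = content_hash(data);
        self.chunks
            .entry(h)
            .and_modify(|(_, refs)| *refs += 1)
            .or_insert_with(|| (data.to_vec(), 1));
        h
    }

    fn unique_bytes(&self) -> usize {
        self.chunks.values().map(|(d, _)| d.len()).sum()
    }
}
```

Writing the same content twice costs one stored copy and one refcount bump, which is the whole dedup model in miniature.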
PHOTON: Intelligent Content Router
Manages data placement across the storage hierarchy:
- L1 (Hot): Memory-mapped, instant access, boot-critical data
- L2 (Warm): NVMe, sub-millisecond, working set
- L3 (Cool): SSD, single-digit ms, recent data
- L4 (Cold): Object storage (S3/R2), archival
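A sketch of how such a placement decision might look; the `place` function and its thresholds are illustrative only, not Stellarium's actual tuning:

```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Hot,  // L1: memory-mapped
    Warm, // L2: NVMe working set
    Cool, // L3: SSD, recent data
    Cold, // L4: object storage
}

/// Illustrative policy: boot-critical chunks pin to memory; everything
/// else demotes with age since last access. Thresholds are made up.
fn place(boot_critical: bool, secs_since_access: u64) -> Tier {
    if boot_critical {
        return Tier::Hot;
    }
    match secs_since_access {
        0..=60 => Tier::Warm,
        61..=3_600 => Tier::Cool,
        _ => Tier::Cold,
    }
}
```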
NOVA-STORE: Volume Abstraction Layer
Presents traditional block/file interfaces to VMs while backed by CAS:
- TinyVol: Ultra-lightweight volumes with minimal metadata
- ShareVol: Copy-on-write shared volumes
- DeltaVol: Delta-encoded writable layers
3. Key Innovations
3.1 Stellar Deduplication
Innovation: Inline deduplication with zero write amplification.
Traditional dedup:
Write → Buffer → Hash → Lookup → Decide → Store
(copy) (wait) (maybe copy again)
Stellar dedup:
Write → Hash-while-streaming → CAS Insert (atomic)
(no buffer needed) (single write or reference)
Implementation:
struct StellarChunk {
    hash: Blake3Hash, // 32 bytes
    size: u16,        // 2 bytes (u16 caps chunks just under 64 KiB; 0 can encode a full 64 KiB chunk)
    refs: AtomicU32,  // 4 bytes - reference count
    tier: AtomicU8,   // 1 byte - storage tier
    flags: u8,        // 1 byte - compression, encryption
    // Total: 40 bytes of metadata per chunk
}
// Hash table: 40 bytes × 1B chunks = 40 GB of index for ~40 TB of unique
// data (at a ~40 KiB average chunk size). Fits in memory on modern servers.
3.2 TinyVol: Minimal Volume Overhead
Innovation: Volumes as tiny manifest files, not pre-allocated space.
Traditional qcow2: Header (512B) + L1 Table + L2 Tables + Refcount...
Minimum overhead: ~512KB even for empty volume
TinyVol: Just a manifest pointing to chunks
Overhead: 64 bytes base + 48 bytes per modified chunk
Empty 10GB volume: 64 bytes
1GB modified: 64B + (1GB/64KB × 48B) = ~768KB
Structure:
struct TinyVol {
    magic: [u8; 8],         // "TINYVOL\0"
    version: u32,
    flags: u32,
    base_image: Blake3Hash, // Optional parent
    size_bytes: u64,
    chunk_map: BTreeMap<ChunkIndex, ChunkRef>,
}

struct ChunkRef {
    hash: Blake3Hash,       // 32 bytes
    offset_in_vol: [u8; 6], // 6 bytes (48-bit offset; Rust has no native u48)
    len: u16,               // 2 bytes
    flags: u64,             // 8 bytes (CoW, compressed, etc.)
}
// Total: 48 bytes per modified chunk
3.3 ShareVol: Zero-Copy Shared Volumes
Innovation: Multiple VMs share read paths, with instant copy-on-write.
Traditional Shared Storage:
VM1 reads /lib/libc.so → Disk read → VM1 memory
VM2 reads /lib/libc.so → Disk read → VM2 memory
(Same data read twice, stored twice in RAM)
ShareVol:
VM1 reads /lib/libc.so → Shared mapping (already in memory)
VM2 reads /lib/libc.so → Same shared mapping
(Single read, single memory location, N consumers)
Memory-Mapped CAS:
// Shared content is memory-mapped once
struct SharedMapping {
    hash: Blake3Hash,
    mmap_addr: *const u8,
    mmap_len: usize,
    vm_refs: AtomicU32,     // How many VMs reference this
    last_access: AtomicU64, // For eviction
}
// VMs get read-only mappings to shared content.
// Write attempts trigger CoW into a TinyVol delta layer.
3.4 Cosmic Packing: Small File Optimization
Innovation: Pack small files into larger chunks without losing addressability.
Problem: Millions of small files (< 4KB) waste space at chunk boundaries.
Solution: Cosmic Packs — aggregated storage with inline index:
┌─────────────────────────────────────────────────┐
│ COSMIC PACK (64KB) │
├─────────────────────────────────────────────────┤
│ Header (64B) │
│ - magic, version, entry_count │
├─────────────────────────────────────────────────┤
│ Index (variable, ~100B per entry) │
│ - [hash, offset, len, flags] × N │
├─────────────────────────────────────────────────┤
│ Data (remaining space) │
│ - Packed file contents │
└─────────────────────────────────────────────────┘
Benefit: a thousand 100-byte files no longer need a thousand individually addressed chunks; they pack into a few 64KB Cosmic Packs, and every file remains individually addressable through the inline index.
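A minimal sketch of pack-and-lookup, assuming a `u64` key standing in for the BLAKE3 hash and an in-memory index (`CosmicPack` here is illustrative, not the on-disk format above):

```rust
/// Small payloads appended into one buffer, each individually
/// addressable via an (key, offset, len) index entry.
struct CosmicPack {
    index: Vec<(u64, usize, usize)>, // (key, offset, len)
    data: Vec<u8>,
}

impl CosmicPack {
    fn new() -> Self {
        CosmicPack { index: Vec::new(), data: Vec::new() }
    }

    /// Append a small payload; it stays individually addressable.
    fn add(&mut self, key: u64, payload: &[u8]) {
        self.index.push((key, self.data.len(), payload.len()));
        self.data.extend_from_slice(payload);
    }

    /// Look a payload back up by key via the inline index.
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.index
            .iter()
            .find(|(k, _, _)| *k == key)
            .map(|&(_, off, len)| &self.data[off..off + len])
    }
}
```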
3.5 Stellar Boot: Sub-50ms VM Start
Innovation: Boot data is pre-staged in memory before VM starts.
Boot Sequence Comparison:
Traditional:
t=0ms VMM starts
t=5ms BIOS loads
t=50ms Kernel requested
t=100ms Kernel loaded from disk
t=200ms initrd loaded
t=500ms Root FS mounted
t=2000ms Boot complete
Stellar Boot:
t=-50ms Boot manifest analyzed (during scheduling)
t=-25ms Hot chunks pre-faulted to memory
t=0ms VMM starts with memory-mapped boot data
t=5ms Kernel executes (already in memory)
t=15ms initrd processed (already in memory)
t=40ms Root FS ready (ShareVol, pre-mapped)
t=50ms Boot complete
Boot Manifest:
struct BootManifest {
    kernel: Blake3Hash,
    initrd: Option<Blake3Hash>,
    root_vol: TinyVolRef,
    // Predicted hot chunks for the first 100 ms
    prefetch_set: Vec<Blake3Hash>,
    // Memory layout hints
    kernel_load_addr: u64,
    initrd_load_addr: Option<u64>,
}
3.6 CDN-Native Distribution: Voltainer Integration
Innovation: Images distributed via CDN, layers indexed directly in NEBULA.
Traditional (Registry-based):
Registry API → Pull manifest → Pull layers → Extract → Overlay FS
(Complex protocol, copies data, registry infrastructure required)
Stellarium + CDN:
HTTPS GET manifest → HTTPS GET missing chunks → Mount
(Simple HTTP, zero extraction, CDN handles global distribution)
CDN-Native Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ CDN-NATIVE DISTRIBUTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ cdn.armoredgate.com/ │
│ ├── manifests/ │
│ │ └── {blake3-hash}.json ← Image/layer manifests │
│ └── blobs/ │
│ └── {blake3-hash} ← Raw content chunks │
│ │
│ Benefits: │
│ ✓ No registry daemon to run │
│ ✓ No registry protocol complexity │
│ ✓ Global edge caching built-in │
│ ✓ Simple HTTPS GET (curl-debuggable) │
│ ✓ Content-addressed = perfect cache keys │
│ ✓ Dedup at CDN level (same hash = same edge cache) │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation:
struct CdnDistribution {
    base_url: String, // "https://cdn.armoredgate.com"
}

impl CdnDistribution {
    async fn fetch_manifest(&self, hash: &Blake3Hash) -> Result<ImageManifest> {
        let url = format!("{}/manifests/{}.json", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        Ok(resp.json().await?)
    }

    async fn fetch_chunk(&self, hash: &Blake3Hash) -> Result<Vec<u8>> {
        let url = format!("{}/blobs/{}", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        // Verify the content hash matches (integrity check); a mismatch
        // is an error, not a panic
        let data = resp.bytes().await?;
        if blake3::hash(&data) != *hash {
            return Err(Error::ChunkIntegrityMismatch);
        }
        Ok(data.to_vec())
    }

    async fn fetch_missing(&self, needed: &[Blake3Hash], local: &Nebula) -> Result<()> {
        // Only fetch chunks we don't have locally
        let missing: Vec<_> = needed.iter()
            .filter(|h| !local.exists(h))
            .collect();
        // Parallel fetch from the CDN, failing fast on the first error
        futures::future::try_join_all(
            missing.iter().map(|h| self.fetch_and_store(h, local))
        ).await?;
        Ok(())
    }
}
struct VoltainerImage {
    manifest_hash: Blake3Hash,
    layers: Vec<LayerRef>,
}

struct LayerRef {
    hash: Blake3Hash,          // Content hash (CDN path)
    stellar_manifest: TinyVol, // Direct mapping to Stellar chunks
}

// Voltainer pull = simple CDN fetch
async fn voltainer_pull(image: &str, cdn: &CdnDistribution, nebula: &Nebula) -> Result<VoltainerImage> {
    // 1. Resolve the image name to a manifest hash (local index or CDN lookup)
    let manifest_hash = resolve_image_hash(image).await?;
    // 2. Fetch the manifest from the CDN
    let manifest = cdn.fetch_manifest(&manifest_hash).await?;
    // 3. Fetch only missing chunks (dedup-aware)
    let needed_chunks = manifest.all_chunk_hashes();
    cdn.fetch_missing(&needed_chunks, nebula).await?;
    // 4. The image is ready: no extraction, the layers ARE the storage
    Ok(VoltainerImage::from_manifest(manifest))
}
Voltainer Integration:
// Voltainer (systemd-nspawn based) uses Stellarium directly
impl VoltainerRuntime {
    async fn create_container(&self, image: &VoltainerImage) -> Result<Container> {
        // Layers are already in NEBULA; just create an overlay view
        let rootfs = self.stellarium.create_overlay_view(&image.layers)?;
        // systemd-nspawn mounts the Stellarium-backed rootfs
        let container = systemd_nspawn::Container::new()
            .directory(&rootfs)
            .private_network(true)
            .boot(false)
            .spawn()?;
        Ok(container)
    }
}
3.7 Memory-Storage Convergence
Innovation: Memory and storage share the same backing, eliminating double-buffering.
Traditional:
Storage: [Block Device] → [Page Cache] → [VM Memory]
(data copied twice)
Stellarium:
Unified: [CAS Memory Map] ←──────────→ [VM Memory View]
(single location, two views)
DAX-Style Direct Access:
// The VM sees storage as a memory-mapped region
struct StellarBlockDevice {
    volumes: Vec<TinyVol>,
    photon: Photon, // content router handle
}

impl StellarBlockDevice {
    fn handle_read(&self, offset: u64, len: usize) -> &[u8] {
        let chunk = self.volumes[0].chunk_at(offset);
        let mapping = self.photon.get_or_map(chunk.hash);
        &mapping[chunk.local_offset..][..len]
    }

    // Writes go to the delta layer
    fn handle_write(&mut self, offset: u64, data: &[u8]) {
        self.volumes[0].write_delta(offset, data);
    }
}
4. Density Targets
Storage Efficiency
| Scenario | Traditional | Stellarium | Target |
|---|---|---|---|
| 1000 Ubuntu 22.04 VMs | 2.5 TB | 2.8 GB shared + 10 MB/VM avg delta | 99.6% reduction |
| 10000 Python app VMs (same base) | 25 TB | 2.8 GB + 5 MB/VM | 99.8% reduction |
| Mixed workload (100 unique bases) | 2.5 TB | 50 GB shared + 20 MB/VM avg | 94% reduction |
Memory Efficiency
| Component | Traditional | Stellarium | Target |
|---|---|---|---|
| Kernel (per VM) | 8-15 MB | Shared (~0 marginal) | 99%+ reduction |
| libc (per VM) | 2 MB | Shared | 99%+ reduction |
| Page cache duplication | High | Zero | 100% reduction |
| Effective RAM per VM | 512 MB - 1 GB | 50-100 MB unique | 5-10x improvement |
Performance
| Metric | Traditional | Stellarium Target |
|---|---|---|
| Cold boot (minimal VM) | 500ms - 2s | < 50ms |
| Warm boot (pre-cached) | 100-500ms | < 20ms |
| Clone time (full copy) | 10-60s | < 1ms (CoW instant) |
| Dedup ratio (homogeneous) | N/A | 50:1 to 1000:1 |
| IOPS (deduplicated reads) | N | 1 |
Density Goals
| Scenario | Traditional (64GB RAM host) | Stellarium Target |
|---|---|---|
| Minimal VMs (32MB each) | ~1000 | 5000-10000 |
| Small VMs (128MB each) | ~400 | 2000-4000 |
| Medium VMs (512MB each) | ~100 | 500-1000 |
| Storage per 10K VMs | 10-50 TB | 10-50 GB |
5. Integration with Volt VMM
Boot Path Integration
// Volt VMM integration
impl VoltVmm {
    fn boot_with_stellarium(&mut self, manifest: BootManifest) -> Result<()> {
        // 1. Pre-fault boot chunks into L1 (memory)
        let prefetch_handle = self.stellarium.prefetch(&manifest.prefetch_set);
        // 2. Set up the memory-mapped kernel
        let kernel_mapping = self.stellarium.map_readonly(&manifest.kernel);
        self.load_kernel_direct(kernel_mapping);
        // 3. Set up the memory-mapped initrd (if present)
        if let Some(initrd) = &manifest.initrd {
            let initrd_mapping = self.stellarium.map_readonly(initrd);
            self.load_initrd_direct(initrd_mapping);
        }
        // 4. Configure the VirtIO-Stellar device
        self.add_stellar_blk(manifest.root_vol)?;
        // 5. Ensure the prefetch is complete
        prefetch_handle.wait();
        // 6. Boot
        self.start()
    }
}
VirtIO-Stellar Driver
Custom VirtIO block device that speaks Stellarium natively:
struct VirtioStellarConfig {
    // Standard virtio-blk compatible
    capacity: u64,
    size_max: u32,
    seg_max: u32,
    // Stellarium extensions
    stellar_features: u64, // STELLAR_F_SHAREVOL, STELLAR_F_DEDUP, etc.
    vol_hash: Blake3Hash,  // Volume identity
    shared_regions: u32,   // Number of pre-shared regions
}

// Request types (extends standard virtio-blk)
enum StellarRequest {
    Read { sector: u64, len: u32 },
    Write { sector: u64, data: Vec<u8> },
    // Stellarium extensions
    MapShared { hash: Blake3Hash }, // Map a shared chunk directly
    QueryDedup { sector: u64 },     // Check if a sector is deduplicated
    Prefetch { sectors: Vec<u64> }, // Hint upcoming reads
}
Snapshot and Restore
// Instant snapshots via TinyVol CoW
fn snapshot_vm(vm: &VoltVm) -> VmSnapshot {
    VmSnapshot {
        // Memory as Stellar chunks
        memory_chunks: stellarium.chunk_memory(vm.memory_region()),
        // The volume is already CoW: just reference it
        root_vol: vm.root_vol.clone_manifest(),
        // CPU state is tiny
        cpu_state: vm.save_cpu_state(),
    }
}

// Restore from a snapshot
fn restore_vm(snapshot: &VmSnapshot) -> VoltVm {
    let mut vm = VoltVm::new();
    // Memory is mapped directly from Stellar chunks
    vm.map_memory_from_stellar(&snapshot.memory_chunks);
    // The volume manifest is loaded (no data copy)
    vm.attach_vol(snapshot.root_vol.clone());
    // Restore CPU state
    vm.restore_cpu_state(&snapshot.cpu_state);
    vm
}
Live Migration with Dedup
// Only transfer unique chunks during migration
async fn migrate_vm(vm: &VoltVm, target: &NodeAddr) -> Result<()> {
    // 1. Get the set of chunks the VM references
    let vm_chunks = vm.collect_chunk_refs();
    // 2. Query the target for chunks it already has
    let target_has = target.query_chunks(&vm_chunks).await?;
    // 3. Transfer only the missing chunks
    let missing = vm_chunks.difference(&target_has);
    target.receive_chunks(&missing).await?;
    // 4. Transfer the tiny metadata
    target.receive_manifest(&vm.root_vol).await?;
    target.receive_memory_manifest(&vm.memory_chunks).await?;
    // 5. Final state sync and switchover
    vm.pause();
    target.receive_final_state(vm.cpu_state()).await?;
    target.resume().await?;
    Ok(())
}
6. Implementation Priorities
Phase 1: Foundation (Month 1-2)
Goal: Core CAS and basic volume support
- NEBULA Core
  - BLAKE3 hashing with SIMD acceleration
  - In-memory hash table (robin hood hashing)
  - Basic chunk storage (local NVMe)
  - Reference counting
- TinyVol v1
  - Manifest format
  - Read-only volume mounting
  - Basic CoW writes
- VirtIO-Stellar Driver
  - Basic block interface
  - Integration with Volt
Deliverable: Boot a VM from Stellarium storage
Phase 2: Deduplication (Month 2-3)
Goal: Inline dedup with zero performance regression
- Inline Deduplication
  - Write path with hash-first
  - Atomic insert-or-reference
  - Dedup metrics/reporting
- Content-Defined Chunking
  - FastCDC implementation
  - Tuned for VM workloads
- Base Image Sharing
  - ShareVol implementation
  - Multiple VMs sharing a base
Deliverable: 10:1+ dedup ratio for homogeneous VMs
Phase 3: Performance (Month 3-4)
Goal: Sub-50ms boot, memory convergence
- PHOTON Tiering
  - Hot/warm/cold classification
  - Automatic promotion/demotion
  - Memory-mapped hot tier
- Boot Optimization
  - Boot manifest analysis
  - Prefetch implementation
  - Zero-copy kernel loading
- Memory-Storage Convergence
  - DAX-style direct access
  - Shared page elimination
Deliverable: <50ms cold boot, memory sharing active
Phase 4: Density (Month 4-5)
Goal: 10000+ VMs per host achievable
- Small File Packing
  - Cosmic Pack implementation
  - Inline file storage
- Aggressive Sharing
  - Cross-VM page dedup
  - Kernel/library sharing
- Memory Pressure Handling
  - Intelligent eviction
  - Graceful degradation
Deliverable: 5000+ density on 64GB host
Phase 5: Distribution (Month 5-6)
Goal: Multi-node Stellarium cluster
- Cosmic Mesh
  - Distributed hash index
  - Cross-node chunk routing
  - Consistent hashing for placement
- Migration Optimization
  - Chunk pre-staging
  - Delta transfers
- Object Storage Backend
  - S3/R2 cold tier
  - Async writeback
Deliverable: Seamless multi-node storage
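The "consistent hashing for placement" item could be realized with rendezvous (highest-random-weight) hashing, which needs no ring state; this sketch uses std's `DefaultHasher` as a stand-in hash and invented node names:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rendezvous hashing: score every node against the chunk hash and pick
/// the highest score. Adding or removing a node only remaps the chunks
/// that node itself would own, which keeps rebalancing traffic minimal.
fn place_chunk(chunk_hash: u64, nodes: &[&str]) -> String {
    nodes
        .iter()
        .max_by_key(|node| {
            let mut h = DefaultHasher::new();
            (chunk_hash, *node).hash(&mut h);
            h.finish()
        })
        .expect("at least one node")
        .to_string()
}
```

Because placement is a pure function of (chunk hash, node set), every node computes the same answer independently, with no coordination.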
Phase 6: Voltainer + CDN Native (Month 6-7)
Goal: Voltainer containers as first-class citizens, CDN-native distribution
- CDN Distribution Layer
  - Manifest/chunk fetch from the ArmoredGate CDN
  - Parallel chunk retrieval
  - Edge cache warming strategies
- Voltainer Integration
  - Direct Stellarium mount for systemd-nspawn
  - Shared layers between Voltainer containers and Volt VMs
  - Unified storage for both runtimes
- Layer Mapping
  - Direct layer registration in NEBULA
  - No extraction needed
  - Content-addressed = perfect CDN cache keys
Deliverable: Voltainer containers boot in <100ms, unified with VM storage
7. Name: Stellarium
Why Stellarium?
Continuing the cosmic theme of Stardust (cluster) and Volt (VMM):
- Stellar = Star-like, exceptional, relating to stars
- -arium = A place for (like aquarium, planetarium)
- Stellarium = "A place for stars" — where all your VMs' data lives
Component Names (Cosmic Theme)
| Component | Name | Meaning |
|---|---|---|
| CAS Core | NEBULA | Birthplace of stars, cloud of shared matter |
| Content Router | PHOTON | Light-speed data movement |
| Chunk Packer | Cosmic Pack | Aggregating cosmic dust |
| Volume Manager | Nova-Store | Connects to Volt |
| Distributed Mesh | Cosmic Mesh | Interconnected universe |
| Boot Optimizer | Stellar Boot | Star-like speed |
| Small File Pack | Cosmic Dust | Tiny particles aggregated |
Taglines
- "Every byte a star. Every star shared."
- "The storage that makes density possible."
- "Where VMs find their data, instantly."
8. Summary
Stellarium transforms storage from a per-VM liability into a shared asset. By treating all data as content-addressed chunks in a unified namespace:
- Deduplication becomes free — No extra work, it's the storage model
- Sharing becomes default — VMs reference, not copy
- Boot becomes instant — Data is pre-positioned
- Density becomes extreme — 10-100x more VMs per host
- Migration becomes trivial — Only ship unique data
Combined with Volt's minimal VMM overhead, Stellarium enables the original ArmoredContainers vision: VM isolation at container density, with VM security guarantees.
The Stellarium Promise
On a 64GB host with 2TB NVMe:
- 10,000+ microVMs running simultaneously
- 50GB total storage for 10,000 Debian-based workloads
- <50ms boot time for any VM
- Instant cloning and snapshots
- Seamless live migration
This isn't incremental improvement. This is a new storage paradigm for the microVM era.
Stellarium: The stellar storage for stellar density.