# Stellarium: Unified Storage Architecture for Volt

> *"Every byte has a home. Every home is shared. Nothing is stored twice."*

## 1. Vision Statement

**Stellarium** is a revolutionary storage architecture that treats storage not as isolated volumes, but as a **unified content-addressed stellar cloud** where every unique byte exists exactly once, and every VM draws from the same constellation of data.

### What Makes This Revolutionary

Traditional VM storage operates on a fundamental lie: that each VM has its own dedicated disk. This creates:

- **Massive redundancy** — 1000 Debian VMs = 1000 copies of libc
- **Slow boots** — each VM reads its own copy of boot files
- **Wasted IOPS** — page cache misses everywhere
- **Memory bloat** — the same data cached N times

**Stellarium inverts this model.** Instead of VMs owning storage, **storage serves VMs through a unified content mesh**. The result:

| Metric | Traditional | Stellarium | Improvement |
|--------|-------------|------------|-------------|
| Storage per 1000 Debian VMs | 10 TB | 12 GB + deltas | **833x** |
| Cold boot time | 2-5 s | <50 ms | **40-100x** |
| Memory efficiency | 1 GB/VM | ~50 MB shared core | **20x** |
| IOPS for identical reads | N | 1 | **Nx** |

---

## 2. Architecture Overview

```
STELLARIUM LAYERS
═════════════════

VM Layer          Volt microVM    Volt microVM    Volt microVM
                        │               │               │
Driver Layer      STELLARIUM VirtIO Driver (memory-mapped CAS interface)
                        │
Store Layer       NOVA-STORE
                    TinyVol Manager · ShareVol Manager · DeltaVol Manager
                        │
                  PHOTON (Content Router)
                    Hot → Memory    Warm → NVMe    Cold → S3
                        │
Foundation Layer  NEBULA (CAS Core)
                    Chunk Packer · Block Dedup · Distributed Hash Index
                        │
                  COSMIC MESH (Distributed CAS)
                    Local NVMe ←→ Cluster ←→ Object Store
```

### Core Components

#### NEBULA: Content-Addressable Storage Core

The foundation layer.
Every piece of data is:

- **Chunked** using content-defined chunking (CDC) with the FastCDC algorithm
- **Hashed** with BLAKE3 (256-bit, hardware-accelerated)
- **Deduplicated** at write time via hash lookup
- **Stored once** regardless of how many VMs reference it

#### PHOTON: Intelligent Content Router

Manages data placement across the storage hierarchy:

- **L1 (Hot)**: memory-mapped, instant access, boot-critical data
- **L2 (Warm)**: NVMe, sub-millisecond, working set
- **L3 (Cool)**: SSD, single-digit ms, recent data
- **L4 (Cold)**: object storage (S3/R2), archival

#### NOVA-STORE: Volume Abstraction Layer

Presents traditional block/file interfaces to VMs while backed by CAS:

- **TinyVol**: ultra-lightweight volumes with minimal metadata
- **ShareVol**: copy-on-write shared volumes
- **DeltaVol**: delta-encoded writable layers

---

## 3. Key Innovations

### 3.1 Stellar Deduplication

**Innovation**: Inline deduplication with zero write amplification.

Traditional dedup:

```
Write → Buffer → Hash → Lookup → Decide → Store
        (copy)  (wait)                    (maybe copy again)
```

Stellar dedup:

```
Write → Hash-while-streaming → CAS insert (atomic)
        (no buffer needed)     (single write or reference)
```

**Implementation**:

```rust
struct StellarChunk {
    hash: Blake3Hash,  // 32 bytes
    size: u16,         // 2 bytes — stores length − 1, so chunks up to 64 KiB
    refs: AtomicU32,   // 4 bytes — reference count
    tier: AtomicU8,    // 1 byte  — storage tier
    flags: u8,         // 1 byte  — compression, encryption
    // Total: 40 bytes of metadata per chunk
}

// Hash table: 40 bytes × 1B chunks = 40 GB index for ~40 TB of unique data.
// Fits in memory on modern servers.
```

### 3.2 TinyVol: Minimal Volume Overhead

**Innovation**: Volumes as tiny manifest files, not pre-allocated space.

```
Traditional qcow2:
  Header (512 B) + L1 table + L2 tables + refcount...
  Minimum overhead: ~512 KB even for an empty volume

TinyVol:
  Just a manifest pointing to chunks
  Overhead: 64 bytes base + 48 bytes per modified chunk
  Empty 10 GB volume:  64 bytes
  1 GB modified:       64 B + (1 GB / 64 KB × 48 B) ≈ 768 KB
```

**Structure**:

```rust
struct TinyVol {
    magic: [u8; 8],         // "TINYVOL\0"
    version: u32,
    flags: u32,
    base_image: Blake3Hash, // optional parent
    size_bytes: u64,
    chunk_map: BTreeMap<u64, ChunkRef>, // volume offset → chunk
}

struct ChunkRef {
    hash: Blake3Hash,       // 32 bytes
    offset_in_vol: [u8; 6], // 6 bytes — 48-bit offset
    len: u16,               // 2 bytes
    flags: u64,             // 8 bytes (CoW, compressed, etc.)
}                           // 48 bytes total
```

### 3.3 ShareVol: Zero-Copy Shared Volumes

**Innovation**: Multiple VMs share read paths, with instant copy-on-write.

```
Traditional shared storage:
  VM1 reads /lib/libc.so → disk read → VM1 memory
  VM2 reads /lib/libc.so → disk read → VM2 memory
  (same data read twice, stored twice in RAM)

ShareVol:
  VM1 reads /lib/libc.so → shared mapping (already in memory)
  VM2 reads /lib/libc.so → same shared mapping
  (single read, single memory location, N consumers)
```

**Memory-Mapped CAS**:

```rust
// Shared content is memory-mapped once
struct SharedMapping {
    hash: Blake3Hash,
    mmap_addr: *const u8,
    mmap_len: usize,
    vm_refs: AtomicU32,     // how many VMs reference this
    last_access: AtomicU64, // for eviction
}

// VMs get read-only mappings to shared content.
// Write attempts trigger CoW into a TinyVol delta layer.
```

### 3.4 Cosmic Packing: Small File Optimization

**Innovation**: Pack small files into larger chunks without losing addressability.

Problem: millions of small files (< 4 KB) waste space at chunk boundaries.
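The scale of that waste is easy to estimate before looking at the solution. A minimal sketch (the 4 KiB slot size and the file counts are illustrative assumptions, not fixed by the design):

```rust
// Space cost of storing many tiny files one-per-slot in fixed 4 KiB
// chunk slots, versus packing them back to back (illustrative numbers).
fn slack_bytes(file_count: u64, file_size: u64, slot_size: u64) -> u64 {
    let padded = file_count * slot_size; // one slot per file
    let packed = file_count * file_size; // contiguous packing
    padded - packed
}

fn main() {
    // 1,000 files of 100 bytes each, each occupying a 4 KiB slot:
    let wasted = slack_bytes(1_000, 100, 4_096);
    println!("wasted: {} bytes (~{} KiB)", wasted, wasted / 1_024);
    // Almost 98% of the occupied space is padding.
}
```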
Solution: **Cosmic Packs** — aggregated storage with an inline index:

```
┌──────────────────────────────────────────┐
│           COSMIC PACK (64 KB)            │
├──────────────────────────────────────────┤
│ Header (64 B)                            │
│   magic, version, entry_count            │
├──────────────────────────────────────────┤
│ Index (variable, ~100 B per entry)       │
│   [hash, offset, len, flags] × N         │
├──────────────────────────────────────────┤
│ Data (remaining space)                   │
│   packed file contents                   │
└──────────────────────────────────────────┘
```

**Benefit**: 1000 × 100-byte files is only ~100 KB of raw data, yet addressed individually each file carries its own chunk metadata and index entry. Packed into Cosmic Packs, they collapse into a handful of 64 KB chunks with full per-file addressability retained.

### 3.5 Stellar Boot: Sub-50ms VM Start

**Innovation**: Boot data is pre-staged in memory before the VM starts.

```
Boot sequence comparison:

Traditional:
  t=0ms     VMM starts
  t=5ms     BIOS loads
  t=50ms    Kernel requested
  t=100ms   Kernel loaded from disk
  t=200ms   initrd loaded
  t=500ms   Root FS mounted
  t=2000ms  Boot complete

Stellar Boot:
  t=-50ms   Boot manifest analyzed (during scheduling)
  t=-25ms   Hot chunks pre-faulted to memory
  t=0ms     VMM starts with memory-mapped boot data
  t=5ms     Kernel executes (already in memory)
  t=15ms    initrd processed (already in memory)
  t=40ms    Root FS ready (ShareVol, pre-mapped)
  t=50ms    Boot complete
```

**Boot Manifest**:

```rust
struct BootManifest {
    kernel: Blake3Hash,
    initrd: Option<Blake3Hash>,
    root_vol: TinyVolRef,
    // Predicted hot chunks for the first 100 ms
    prefetch_set: Vec<Blake3Hash>,
    // Memory layout hints
    kernel_load_addr: u64,
    initrd_load_addr: Option<u64>,
}
```

### 3.6 CDN-Native Distribution: Voltainer Integration

**Innovation**: Images distributed via CDN, layers indexed directly in NEBULA.
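At its core the scheme is a deterministic hash→URL mapping, so no registry protocol is needed. A minimal sketch (hex encoding of the digest is an assumption; `cdn.armoredgate.com` is the example host used in this section):

```rust
// Build the CDN paths for a content hash (hex-encoded BLAKE3 digest).
// The layout mirrors the manifests/ and blobs/ tree shown in this section.
fn manifest_url(base: &str, hash_hex: &str) -> String {
    format!("{}/manifests/{}.json", base, hash_hex)
}

fn blob_url(base: &str, hash_hex: &str) -> String {
    format!("{}/blobs/{}", base, hash_hex)
}

fn main() {
    let base = "https://cdn.armoredgate.com";
    let digest = "9f2c..."; // truncated digest, for illustration only
    println!("{}", manifest_url(base, digest));
    println!("{}", blob_url(base, digest));
}
```

Because the URL is a pure function of the content, every HTTP cache along the path can serve the blob — this is what makes a content hash a "perfect cache key".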
```
Traditional (registry-based):
  Registry API → pull manifest → pull layers → extract → overlay FS
  (complex protocol, copies data, registry infrastructure required)

Stellarium + CDN:
  HTTPS GET manifest → HTTPS GET missing chunks → mount
  (simple HTTP, zero extraction, CDN handles global distribution)
```

**CDN-Native Architecture**:

```
CDN-NATIVE DISTRIBUTION

cdn.armoredgate.com/
├── manifests/
│   └── {blake3-hash}.json   ← image/layer manifests
└── blobs/
    └── {blake3-hash}        ← raw content chunks

Benefits:
  ✓ No registry daemon to run
  ✓ No registry protocol complexity
  ✓ Global edge caching built-in
  ✓ Simple HTTPS GET (curl-debuggable)
  ✓ Content-addressed = perfect cache keys
  ✓ Dedup at CDN level (same hash = same edge cache)
```

**Implementation**:

```rust
struct CdnDistribution {
    base_url: String, // e.g. "https://cdn.armoredgate.com"
}

impl CdnDistribution {
    async fn fetch_manifest(&self, hash: &Blake3Hash) -> Result<Manifest> {
        let url = format!("{}/manifests/{}.json", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        Ok(resp.json().await?)
    }

    async fn fetch_chunk(&self, hash: &Blake3Hash) -> Result<Vec<u8>> {
        let url = format!("{}/blobs/{}", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        // Verify the content hash matches (integrity check)
        let data = resp.bytes().await?;
        assert_eq!(blake3::hash(&data), *hash);
        Ok(data.to_vec())
    }

    async fn fetch_missing(&self, needed: &[Blake3Hash], local: &Nebula) -> Result<()> {
        // Only fetch chunks we don't have locally
        let missing: Vec<_> = needed.iter()
            .filter(|h| !local.exists(h))
            .collect();
        // Parallel fetch from the CDN (fetch_and_store = fetch_chunk + NEBULA insert)
        futures::future::join_all(
            missing.iter().map(|h| self.fetch_and_store(h, local))
        ).await;
        Ok(())
    }
}

struct VoltainerImage {
    manifest_hash: Blake3Hash,
    layers: Vec<LayerRef>,
}

struct LayerRef {
    hash: Blake3Hash,          // content hash (CDN path)
    stellar_manifest: TinyVol, // direct mapping to Stellar chunks
}

// Voltainer pull = simple CDN fetch
async fn voltainer_pull(image: &str, cdn: &CdnDistribution, nebula: &Nebula) -> Result<VoltainerImage> {
    // 1. Resolve the image name to a manifest hash (local index or CDN lookup)
    let manifest_hash = resolve_image_hash(image).await?;
    // 2. Fetch the manifest from the CDN
    let manifest = cdn.fetch_manifest(&manifest_hash).await?;
    // 3. Fetch only the missing chunks (dedup-aware)
    let needed_chunks = manifest.all_chunk_hashes();
    cdn.fetch_missing(&needed_chunks, nebula).await?;
    // 4.
    // The image is ready — no extraction; the layers ARE the storage
    Ok(VoltainerImage::from_manifest(manifest))
}
```

**Voltainer Integration**:

```rust
// Voltainer (systemd-nspawn based) uses Stellarium directly
impl VoltainerRuntime {
    async fn create_container(&self, image: &VoltainerImage) -> Result<Container> {
        // Layers are already in NEBULA — just create an overlay view
        let rootfs = self.stellarium.create_overlay_view(&image.layers)?;

        // systemd-nspawn mounts the Stellarium-backed rootfs
        let container = systemd_nspawn::Container::new()
            .directory(&rootfs)
            .private_network(true)
            .boot(false)
            .spawn()?;
        Ok(container)
    }
}
```

### 3.7 Memory-Storage Convergence

**Innovation**: Memory and storage share the same backing, eliminating double-buffering.

```
Traditional:
  Storage: [Block Device] → [Page Cache] → [VM Memory]
  (data copied twice)

Stellarium:
  Unified: [CAS Memory Map] ←──────────→ [VM Memory View]
  (single location, two views)
```

**DAX-Style Direct Access**:

```rust
// The VM sees storage as a memory-mapped region
struct StellarBlockDevice {
    volumes: Vec<TinyVol>,
}

impl StellarBlockDevice {
    fn handle_read(&self, offset: u64, len: u32) -> &[u8] {
        let chunk = self.volumes[0].chunk_at(offset);
        let mapping = photon.get_or_map(chunk.hash);
        &mapping[chunk.local_offset..][..len as usize]
    }

    // Writes go to the delta layer
    fn handle_write(&mut self, offset: u64, data: &[u8]) {
        self.volumes[0].write_delta(offset, data);
    }
}
```

---

## 4. Density Targets

### Storage Efficiency

| Scenario | Traditional | Stellarium | Target |
|----------|-------------|------------|--------|
| 1000 Ubuntu 22.04 VMs | 2.5 TB | 2.8 GB shared + 10 MB/VM avg delta | **99.6% reduction** |
| 10,000 Python app VMs (same base) | 25 TB | 2.8 GB + 5 MB/VM | **99.8% reduction** |
| Mixed workload (100 unique bases) | 2.5 TB | 50 GB shared + 20 MB/VM avg | **94% reduction** |

### Memory Efficiency

| Component | Traditional | Stellarium | Target |
|-----------|-------------|------------|--------|
| Kernel (per VM) | 8-15 MB | Shared (~0 marginal) | **99%+ reduction** |
| libc (per VM) | 2 MB | Shared | **99%+ reduction** |
| Page cache duplication | High | Zero | **100% reduction** |
| Effective RAM per VM | 512 MB - 1 GB | 50-100 MB unique | **5-10x improvement** |

### Performance

| Metric | Traditional | Stellarium Target |
|--------|-------------|-------------------|
| Cold boot (minimal VM) | 500 ms - 2 s | < 50 ms |
| Warm boot (pre-cached) | 100-500 ms | < 20 ms |
| Clone time (full copy) | 10-60 s | < 1 ms (CoW, instant) |
| Dedup ratio (homogeneous) | N/A | 50:1 to 1000:1 |
| IOPS (deduplicated reads) | N | 1 |

### Density Goals

| Scenario | Traditional (64 GB RAM host) | Stellarium Target |
|----------|------------------------------|-------------------|
| Minimal VMs (32 MB each) | ~1000 | 5000-10,000 |
| Small VMs (128 MB each) | ~400 | 2000-4000 |
| Medium VMs (512 MB each) | ~100 | 500-1000 |
| Storage per 10K VMs | 10-50 TB | 10-50 GB |

---

## 5. Integration with Volt VMM

### Boot Path Integration

```rust
// Volt VMM integration
impl VoltVmm {
    fn boot_with_stellarium(&mut self, manifest: BootManifest) -> Result<()> {
        // 1. Pre-fault boot chunks to L1 (memory)
        let prefetch_handle = stellarium.prefetch(&manifest.prefetch_set);

        // 2. Set up the memory-mapped kernel
        let kernel_mapping = stellarium.map_readonly(&manifest.kernel);
        self.load_kernel_direct(kernel_mapping);

        // 3.
        // Set up the memory-mapped initrd (if present)
        if let Some(initrd) = &manifest.initrd {
            let initrd_mapping = stellarium.map_readonly(initrd);
            self.load_initrd_direct(initrd_mapping);
        }

        // 4. Configure the VirtIO-Stellar device
        self.add_stellar_blk(manifest.root_vol)?;

        // 5. Ensure the prefetch is complete
        prefetch_handle.wait();

        // 6. Boot
        self.start()
    }
}
```

### VirtIO-Stellar Driver

A custom VirtIO block device that speaks Stellarium natively:

```rust
struct VirtioStellarConfig {
    // Standard virtio-blk compatible
    capacity: u64,
    size_max: u32,
    seg_max: u32,
    // Stellarium extensions
    stellar_features: u64, // STELLAR_F_SHAREVOL, STELLAR_F_DEDUP, etc.
    vol_hash: Blake3Hash,  // volume identity
    shared_regions: u32,   // number of pre-shared regions
}

// Request types (extends standard virtio-blk)
enum StellarRequest {
    Read { sector: u64, len: u32 },
    Write { sector: u64, data: Vec<u8> },
    // Stellarium extensions
    MapShared { hash: Blake3Hash }, // map a shared chunk directly
    QueryDedup { sector: u64 },     // check whether a sector is deduplicated
    Prefetch { sectors: Vec<u64> }, // hint upcoming reads
}
```

### Snapshot and Restore

```rust
// Instant snapshots via TinyVol CoW
fn snapshot_vm(vm: &VoltVm) -> VmSnapshot {
    VmSnapshot {
        // Memory as Stellar chunks
        memory_chunks: stellarium.chunk_memory(vm.memory_region()),
        // The volume is already CoW — just reference it
        root_vol: vm.root_vol.clone_manifest(),
        // CPU state is tiny
        cpu_state: vm.save_cpu_state(),
    }
}

// Restore from a snapshot
fn restore_vm(snapshot: &VmSnapshot) -> VoltVm {
    let mut vm = VoltVm::new();
    // Memory is mapped directly from Stellar chunks
    vm.map_memory_from_stellar(&snapshot.memory_chunks);
    // The volume manifest is loaded (no data copy)
    vm.attach_vol(snapshot.root_vol.clone());
    // Restore CPU state
    vm.restore_cpu_state(&snapshot.cpu_state);
    vm
}
```

### Live Migration with Dedup

```rust
// Only transfer unique chunks during migration
async fn migrate_vm(vm: &VoltVm, target: &NodeAddr) -> Result<()> {
    // 1.
    // Get the list of chunks the VM references
    let vm_chunks = vm.collect_chunk_refs();
    // 2. Query the target for chunks it already has
    let target_has = target.query_chunks(&vm_chunks).await?;
    // 3. Transfer only the missing chunks
    let missing = vm_chunks.difference(&target_has);
    target.receive_chunks(&missing).await?;
    // 4. Transfer the tiny metadata
    target.receive_manifest(&vm.root_vol).await?;
    target.receive_memory_manifest(&vm.memory_chunks).await?;
    // 5. Final state sync and switchover
    vm.pause();
    target.receive_final_state(vm.cpu_state()).await?;
    target.resume().await?;
    Ok(())
}
```

---

## 6. Implementation Priorities

### Phase 1: Foundation (Months 1-2)

**Goal**: Core CAS and basic volume support

1. **NEBULA Core**
   - BLAKE3 hashing with SIMD acceleration
   - In-memory hash table (Robin Hood hashing)
   - Basic chunk storage (local NVMe)
   - Reference counting
2. **TinyVol v1**
   - Manifest format
   - Read-only volume mounting
   - Basic CoW writes
3. **VirtIO-Stellar Driver**
   - Basic block interface
   - Integration with Volt

**Deliverable**: Boot a VM from Stellarium storage

### Phase 2: Deduplication (Months 2-3)

**Goal**: Inline dedup with zero performance regression

1. **Inline Deduplication**
   - Write path with hash-first
   - Atomic insert-or-reference
   - Dedup metrics/reporting
2. **Content-Defined Chunking**
   - FastCDC implementation
   - Tuned for VM workloads
3. **Base Image Sharing**
   - ShareVol implementation
   - Multiple VMs sharing a base

**Deliverable**: 10:1+ dedup ratio for homogeneous VMs

### Phase 3: Performance (Months 3-4)

**Goal**: Sub-50ms boot, memory convergence

1. **PHOTON Tiering**
   - Hot/warm/cold classification
   - Automatic promotion/demotion
   - Memory-mapped hot tier
2. **Boot Optimization**
   - Boot manifest analysis
   - Prefetch implementation
   - Zero-copy kernel loading
3. **Memory-Storage Convergence**
   - DAX-style direct access
   - Shared page elimination

**Deliverable**: <50ms cold boot, memory sharing active

### Phase 4: Density (Months 4-5)

**Goal**: 10,000+ VMs per host achievable

1. **Small File Packing**
   - Cosmic Pack implementation
   - Inline file storage
2. **Aggressive Sharing**
   - Cross-VM page dedup
   - Kernel/library sharing
3. **Memory Pressure Handling**
   - Intelligent eviction
   - Graceful degradation

**Deliverable**: 5000+ VM density on a 64 GB host

### Phase 5: Distribution (Months 5-6)

**Goal**: Multi-node Stellarium cluster

1. **Cosmic Mesh**
   - Distributed hash index
   - Cross-node chunk routing
   - Consistent hashing for placement
2. **Migration Optimization**
   - Chunk pre-staging
   - Delta transfers
3. **Object Storage Backend**
   - S3/R2 cold tier
   - Async writeback

**Deliverable**: Seamless multi-node storage

### Phase 6: Voltainer + CDN Native (Months 6-7)

**Goal**: Voltainer containers as first-class citizens, CDN-native distribution

1. **CDN Distribution Layer**
   - Manifest/chunk fetch from the ArmoredGate CDN
   - Parallel chunk retrieval
   - Edge cache warming strategies
2. **Voltainer Integration**
   - Direct Stellarium mount for systemd-nspawn
   - Shared layers between Voltainer containers and Volt VMs
   - Unified storage for both runtimes
3. **Layer Mapping**
   - Direct layer registration in NEBULA
   - No extraction needed
   - Content-addressed = perfect CDN cache keys

**Deliverable**: Voltainer containers boot in <100ms, unified with VM storage

---

## 7. Name: **Stellarium**

### Why Stellarium?
Continuing the cosmic theme of **Stardust** (cluster) and **Volt** (VMM):

- **Stellar** = star-like, exceptional, relating to stars
- **-arium** = a place for (like aquarium, planetarium)
- **Stellarium** = "a place for stars" — where all your VMs' data lives

### Component Names (Cosmic Theme)

| Component | Name | Meaning |
|-----------|------|---------|
| CAS Core | **NEBULA** | Birthplace of stars, cloud of shared matter |
| Content Router | **PHOTON** | Light-speed data movement |
| Chunk Packer | **Cosmic Pack** | Aggregating cosmic dust |
| Volume Manager | **NOVA-STORE** | Connects to Volt |
| Distributed Mesh | **Cosmic Mesh** | Interconnected universe |
| Boot Optimizer | **Stellar Boot** | Star-like speed |
| Small File Pack | **Cosmic Dust** | Tiny particles aggregated |

### Taglines

- *"Every byte a star. Every star shared."*
- *"The storage that makes density possible."*
- *"Where VMs find their data, instantly."*

---

## 8. Summary

**Stellarium** transforms storage from a per-VM liability into a shared asset. By treating all data as content-addressed chunks in a unified namespace:

1. **Deduplication becomes free** — no extra work, it's the storage model
2. **Sharing becomes the default** — VMs reference, not copy
3. **Boot becomes instant** — data is pre-positioned
4. **Density becomes extreme** — 10-100x more VMs per host
5. **Migration becomes trivial** — only ship unique data

Combined with Volt's minimal VMM overhead, Stellarium enables the original ArmoredContainers vision: **VM isolation at container density, with VM security guarantees**.

### The Stellarium Promise

> On a 64 GB host with 2 TB of NVMe:
> - **10,000+ microVMs** running simultaneously
> - **50 GB total storage** for 10,000 Debian-based workloads
> - **<50ms** boot time for any VM
> - **Instant** cloning and snapshots
> - **Seamless** live migration

This isn't an incremental improvement. This is a **new storage paradigm** for the microVM era.
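The 50 GB figure in the promise follows from the Section 4 targets. A quick sanity check (the 2.8 GB shared base and 5 MB average per-VM delta come from the density tables; `total_storage_gb` is illustrative arithmetic, not a real API):

```rust
// Total storage for N VMs sharing one base image:
// shared base + per-VM delta (figures from the Section 4 targets).
fn total_storage_gb(base_gb: f64, delta_mb_per_vm: f64, vms: u64) -> f64 {
    base_gb + (delta_mb_per_vm * vms as f64) / 1024.0
}

fn main() {
    // 10,000 VMs on a 2.8 GB shared base with a ~5 MB average delta each:
    let total = total_storage_gb(2.8, 5.0, 10_000);
    println!("{:.1} GB", total); // ≈ 51.6 GB — the promise's "50 GB total storage"
}
```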
---

*Stellarium: The stellar storage for stellar density.*