KVM-based microVMM for the Volt platform: - Sub-second VM boot times - Minimal memory footprint - Landlock LSM + seccomp security - Virtio device support - Custom kernel management Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0
# Stellarium: Unified Storage Architecture for Volt

> *"Every byte has a home. Every home is shared. Nothing is stored twice."*

## 1. Vision Statement

**Stellarium** is a revolutionary storage architecture that treats storage not as isolated volumes, but as a **unified content-addressed stellar cloud** where every unique byte exists exactly once, and every VM draws from the same constellation of data.

### What Makes This Revolutionary

Traditional VM storage operates on a fundamental lie: that each VM has its own dedicated disk. This creates:

- **Massive redundancy** — 1000 Debian VMs = 1000 copies of libc
- **Slow boots** — Each VM reads its own copy of boot files
- **Wasted IOPS** — Page cache misses everywhere
- **Memory bloat** — Same data cached N times

**Stellarium inverts this model.** Instead of VMs owning storage, **storage serves VMs through a unified content mesh**. The result:

| Metric | Traditional | Stellarium | Improvement |
|--------|-------------|------------|-------------|
| Storage per 1000 Debian VMs | 10 TB | 12 GB + deltas | **833x** |
| Cold boot time | 2-5s | <50ms | **40-100x** |
| Memory efficiency | 1 GB/VM | ~50 MB shared core | **20x** |
| IOPS for identical reads | N | 1 | **Nx** |

---

## 2. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                          STELLARIUM LAYERS                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐               │
│   │    Volt     │   │    Volt     │   │    Volt     │   VM Layer    │
│   │   microVM   │   │   microVM   │   │   microVM   │               │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘               │
│          │                 │                 │                      │
│   ┌──────┴─────────────────┴─────────────────┴──────┐               │
│   │            STELLARIUM VirtIO Driver             │   Driver      │
│   │         (Memory-Mapped CAS Interface)           │   Layer       │
│   └────────────────────────┬────────────────────────┘               │
│                            │                                        │
│   ┌────────────────────────┴────────────────────────┐               │
│   │                   NOVA-STORE                    │   Store       │
│   │  ┌─────────┐  ┌─────────┐  ┌─────────┐          │   Layer       │
│   │  │ TinyVol │  │ShareVol │  │ DeltaVol│          │               │
│   │  │ Manager │  │ Manager │  │ Manager │          │               │
│   │  └────┬────┘  └────┬────┘  └────┬────┘          │               │
│   │       └────────────┴────────────┘               │               │
│   │                    │                            │               │
│   │   ┌────────────────┴────────────────┐           │               │
│   │   │     PHOTON (Content Router)     │           │               │
│   │   │  Hot→Memory  Warm→NVMe  Cold→S3 │           │               │
│   │   └────────────────┬────────────────┘           │               │
│   └────────────────────┼────────────────────────────┘               │
│                        │                                            │
│   ┌────────────────────┴────────────────────────────┐               │
│   │                NEBULA (CAS Core)                │   Foundation  │
│   │                                                 │   Layer       │
│   │  ┌─────────┐ ┌─────────┐ ┌─────────────────┐    │               │
│   │  │  Chunk  │ │  Block  │ │   Distributed   │    │               │
│   │  │ Packer  │ │  Dedup  │ │   Hash Index    │    │               │
│   │  └─────────┘ └─────────┘ └─────────────────┘    │               │
│   │                                                 │               │
│   │  ┌─────────────────────────────────────────┐    │               │
│   │  │      COSMIC MESH (Distributed CAS)      │    │               │
│   │  │  Local NVMe ←→ Cluster ←→ Object Store  │    │               │
│   │  └─────────────────────────────────────────┘    │               │
│   └─────────────────────────────────────────────────┘               │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Core Components

#### NEBULA: Content-Addressable Storage Core

The foundation layer. Every piece of data is:

- **Chunked** using content-defined chunking (CDC) with the FastCDC algorithm
- **Hashed** with BLAKE3 (256-bit, hardware-accelerated)
- **Deduplicated** at write time via hash lookup
- **Stored once** regardless of how many VMs reference it

#### PHOTON: Intelligent Content Router

Manages data placement across the storage hierarchy:

- **L1 (Hot)**: Memory-mapped, instant access, boot-critical data
- **L2 (Warm)**: NVMe, sub-millisecond, working set
- **L3 (Cool)**: SSD, single-digit ms, recent data
- **L4 (Cold)**: Object storage (S3/R2), archival

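The promote/demote policy such a router needs can be sketched in a few lines. This is a toy, dependency-free model, not the real PHOTON: the `Photon` struct, the windowed hit counter, and the single-step tier moves are all illustrative assumptions.

```rust
use std::collections::HashMap;

/// Storage tiers, hottest first, following the L1-L4 hierarchy above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier { Hot, Warm, Cool, Cold }

/// Per-chunk placement state tracked by the router.
struct Placement { tier: Tier, hits_in_window: u32 }

/// Toy content router: chunks that are read often climb a tier,
/// chunks that went quiet during the last window sink a tier.
struct Photon {
    placements: HashMap<[u8; 32], Placement>,
    promote_threshold: u32,
}

impl Photon {
    fn new(promote_threshold: u32) -> Self {
        Photon { placements: HashMap::new(), promote_threshold }
    }

    /// Record a read; crossing the threshold moves the chunk one tier up.
    fn on_read(&mut self, hash: [u8; 32]) -> Tier {
        let p = self.placements
            .entry(hash)
            .or_insert(Placement { tier: Tier::Cold, hits_in_window: 0 });
        p.hits_in_window += 1;
        if p.hits_in_window >= self.promote_threshold {
            p.tier = match p.tier {
                Tier::Cold => Tier::Cool,
                Tier::Cool => Tier::Warm,
                _ => Tier::Hot,
            };
            p.hits_in_window = 0;
        }
        p.tier
    }

    /// End-of-window sweep: anything with no hits moves one tier down.
    fn demote_idle(&mut self) {
        for p in self.placements.values_mut() {
            if p.hits_in_window == 0 {
                p.tier = match p.tier {
                    Tier::Hot => Tier::Warm,
                    Tier::Warm => Tier::Cool,
                    _ => Tier::Cold,
                };
            }
            p.hits_in_window = 0;
        }
    }
}

fn main() {
    let mut router = Photon::new(2);
    let chunk = [0u8; 32];
    for _ in 0..8 { router.on_read(chunk); }  // repeated reads climb to Hot
    assert_eq!(router.placements[&chunk].tier, Tier::Hot);
    router.demote_idle();                     // one quiet window sinks a tier
    assert_eq!(router.placements[&chunk].tier, Tier::Warm);
    println!("tiering ok");
}
```

The real router would additionally weight boot-critical chunks and consult memory pressure, but the promotion/demotion skeleton is the same.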
#### NOVA-STORE: Volume Abstraction Layer

Presents traditional block/file interfaces to VMs while backed by CAS:

- **TinyVol**: Ultra-lightweight volumes with minimal metadata
- **ShareVol**: Copy-on-write shared volumes
- **DeltaVol**: Delta-encoded writable layers

---

## 3. Key Innovations

### 3.1 Stellar Deduplication

**Innovation**: Inline deduplication with zero write amplification.

Traditional dedup:
```
Write → Buffer → Hash → Lookup → Decide → Store
        (copy)          (wait)            (maybe copy again)
```

Stellar dedup:
```
Write → Hash-while-streaming → CAS Insert (atomic)
        (no buffer needed)     (single write or reference)
```

**Implementation**:
```rust
struct StellarChunk {
    hash: Blake3Hash, // 32 bytes
    size: u16,        // 2 bytes (stores length - 1, so 64 KB chunks fit)
    refs: AtomicU32,  // 4 bytes - reference count
    tier: AtomicU8,   // 1 byte - storage tier
    flags: u8,        // 1 byte - compression, encryption
    // Total: 40 bytes of metadata per chunk
}

// Hash table: 40 bytes × 1B chunks = 40GB index for ~40TB unique data
// Fits in memory on modern servers
```

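The "CAS Insert (atomic)" step above reduces to a hash-keyed map: the first writer stores the bytes, every later identical write only bumps a reference count. A minimal, dependency-free sketch, using `DefaultHasher` as a stand-in for BLAKE3 (the design's actual hash; only the digest width differs) and a plain `HashMap` in place of the distributed index:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for BLAKE3 so the sketch stays dependency-free.
fn digest(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Chunk record: payload stored once, plus a reference count.
struct StoredChunk { data: Vec<u8>, refs: u32 }

#[derive(Default)]
struct Nebula { chunks: HashMap<u64, StoredChunk> }

impl Nebula {
    /// Insert-or-reference: returns (key, true) for a fresh store,
    /// (key, false) for a dedup hit that only bumped the refcount.
    fn insert(&mut self, data: &[u8]) -> (u64, bool) {
        let key = digest(data);
        match self.chunks.get_mut(&key) {
            Some(c) => { c.refs += 1; (key, false) } // dedup hit: reference
            None => {
                self.chunks.insert(key, StoredChunk { data: data.to_vec(), refs: 1 });
                (key, true)                           // unique: single write
            }
        }
    }

    fn unique_bytes(&self) -> usize {
        self.chunks.values().map(|c| c.data.len()).sum()
    }
}

fn main() {
    let mut store = Nebula::default();
    // 1000 "VMs" writing the same 4 KiB chunk: one copy survives.
    for _ in 0..1000 { store.insert(&[0xABu8; 4096]); }
    assert_eq!(store.unique_bytes(), 4096);
    assert_eq!(store.chunks.values().next().unwrap().refs, 1000);
    println!("dedup ok");
}
```

The production path would hash while streaming and use an atomic compare-and-insert on the shared index, but the decision structure is exactly this map lookup.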
### 3.2 TinyVol: Minimal Volume Overhead

**Innovation**: Volumes as tiny manifest files, not pre-allocated space.

```
Traditional qcow2: Header (512B) + L1 Table + L2 Tables + Refcount...
                   Minimum overhead: ~512KB even for an empty volume

TinyVol: Just a manifest pointing to chunks
         Overhead: 64 bytes base + 48 bytes per modified chunk
         Empty 10GB volume: 64 bytes
         1GB modified: 64B + (1GB/64KB × 48B) = ~768KB
```

**Structure**:
```rust
struct TinyVol {
    magic: [u8; 8],         // "TINYVOL\0"
    version: u32,
    flags: u32,
    base_image: Blake3Hash, // Optional parent
    size_bytes: u64,
    chunk_map: BTreeMap<ChunkIndex, ChunkRef>,
}

struct ChunkRef {
    hash: Blake3Hash,       // 32 bytes
    offset_in_vol: [u8; 6], // 6 bytes (48-bit offset; Rust has no u48)
    len: u16,               // 2 bytes
    flags: u64,             // 8 bytes (CoW, compressed, etc.)
}
```

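The sparse `chunk_map` is what makes the overhead arithmetic above work: only modified chunks occupy an entry, and unmapped offsets fall through to the parent image. A small model of the lookup and the 64 B + 48 B/chunk accounting (the `Resolved` enum and method names are illustrative, not the real API):

```rust
use std::collections::BTreeMap;

const CHUNK_SIZE: u64 = 64 * 1024; // 64 KiB, matching the overhead math above

/// Where a read resolves to.
#[derive(Debug, PartialEq)]
enum Resolved {
    Delta([u8; 32]), // hash of a privately written chunk
    Base,            // fall through to the (optional) parent image
}

/// Sparse manifest: untouched regions cost nothing beyond the header.
struct TinyVol { chunk_map: BTreeMap<u64, [u8; 32]> }

impl TinyVol {
    fn write_chunk(&mut self, index: u64, hash: [u8; 32]) {
        self.chunk_map.insert(index, hash);
    }

    /// Resolve a byte offset to its backing chunk.
    fn resolve(&self, offset: u64) -> Resolved {
        match self.chunk_map.get(&(offset / CHUNK_SIZE)) {
            Some(h) => Resolved::Delta(*h),
            None => Resolved::Base,
        }
    }

    /// 64-byte header + 48 bytes per modified chunk, the figures
    /// quoted in the overhead comparison above.
    fn overhead_bytes(&self) -> u64 { 64 + 48 * self.chunk_map.len() as u64 }
}

fn main() {
    let mut vol = TinyVol { chunk_map: BTreeMap::new() };
    assert_eq!(vol.overhead_bytes(), 64);              // empty 10 GB volume: 64 bytes
    // Modify 1 GiB worth of 64 KiB chunks:
    for i in 0..(1u64 << 30) / CHUNK_SIZE { vol.write_chunk(i, [7u8; 32]); }
    assert_eq!(vol.overhead_bytes(), 64 + 16384 * 48); // ≈ 768 KiB of metadata
    assert_eq!(vol.resolve(0), Resolved::Delta([7u8; 32]));
    assert_eq!(vol.resolve(2u64 << 30), Resolved::Base); // untouched region
    println!("tinyvol ok");
}
```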
### 3.3 ShareVol: Zero-Copy Shared Volumes

**Innovation**: Multiple VMs share read paths, with instant copy-on-write.

```
Traditional Shared Storage:
  VM1 reads /lib/libc.so → Disk read → VM1 memory
  VM2 reads /lib/libc.so → Disk read → VM2 memory
  (Same data read twice, stored twice in RAM)

ShareVol:
  VM1 reads /lib/libc.so → Shared mapping (already in memory)
  VM2 reads /lib/libc.so → Same shared mapping
  (Single read, single memory location, N consumers)
```

**Memory-Mapped CAS**:
```rust
// Shared content is memory-mapped once
struct SharedMapping {
    hash: Blake3Hash,
    mmap_addr: *const u8,
    mmap_len: usize,
    vm_refs: AtomicU32,     // How many VMs reference this
    last_access: AtomicU64, // For eviction
}

// VMs get read-only mappings to shared content
// Write attempts trigger CoW into a TinyVol delta layer
```

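The CoW semantics can be modeled without any mmap machinery by letting `Arc` play the role of the shared mapping: reads alias one allocation, and the first write copies the chunk into a private delta layer while the shared original stays pristine. The `ShareVolView` type and its methods are illustrative assumptions.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Shared read-only chunk content, "mapped" once for all consumers.
type Shared = Arc<Vec<u8>>;

/// One VM's view: shared base chunks plus a private delta layer.
struct ShareVolView {
    base: HashMap<u64, Shared>,   // chunk index → shared content
    delta: HashMap<u64, Vec<u8>>, // chunk index → private copy
}

impl ShareVolView {
    /// Reads prefer the private delta, else the shared mapping.
    fn read(&self, index: u64) -> Option<&[u8]> {
        if let Some(d) = self.delta.get(&index) { return Some(d); }
        self.base.get(&index).map(|s| s.as_slice())
    }

    /// First write to a shared chunk copies it into the delta layer
    /// (copy-on-write); the shared original is never mutated.
    fn write(&mut self, index: u64, offset: usize, bytes: &[u8]) {
        if !self.delta.contains_key(&index) {
            let seed = self.base.get(&index).map(|s| s.to_vec()).unwrap_or_default();
            self.delta.insert(index, seed);
        }
        let copy = self.delta.get_mut(&index).unwrap();
        if copy.len() < offset + bytes.len() { copy.resize(offset + bytes.len(), 0); }
        copy[offset..offset + bytes.len()].copy_from_slice(bytes);
    }
}

fn main() {
    // Two VMs share the same libc chunk: one allocation, two references.
    let libc = Arc::new(b"\x7fELF...libc...".to_vec());
    let mut vm1 = ShareVolView { base: HashMap::from([(0, libc.clone())]), delta: HashMap::new() };
    let vm2 = ShareVolView { base: HashMap::from([(0, libc.clone())]), delta: HashMap::new() };
    assert_eq!(Arc::strong_count(&libc), 3); // libc + vm1 + vm2

    // vm1 patches its copy; vm2 still sees the pristine shared bytes.
    vm1.write(0, 0, b"PATCH");
    assert!(vm1.read(0).unwrap().starts_with(b"PATCH"));
    assert!(vm2.read(0).unwrap().starts_with(b"\x7fELF"));
    println!("cow ok");
}
```

In the real system the "copy" lands in the VM's TinyVol delta layer and gets re-chunked into the CAS on flush, but the read-prefers-delta ordering is the same.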
### 3.4 Cosmic Packing: Small File Optimization

**Innovation**: Pack small files into larger chunks without losing addressability.

Problem: Millions of small files (< 4KB) waste space at chunk boundaries.

Solution: **Cosmic Packs** — aggregated storage with an inline index:

```
┌─────────────────────────────────────────────────┐
│                COSMIC PACK (64KB)               │
├─────────────────────────────────────────────────┤
│  Header (64B)                                   │
│    - magic, version, entry_count                │
├─────────────────────────────────────────────────┤
│  Index (variable, ~100B per entry)              │
│    - [hash, offset, len, flags] × N             │
├─────────────────────────────────────────────────┤
│  Data (remaining space)                         │
│    - Packed file contents                       │
└─────────────────────────────────────────────────┘
```

**Benefit**: 1000 × 100-byte files are only ~100KB of raw data, but addressing each one as its own CAS object costs roughly another ~100B of index metadata apiece. Packed, they occupy two 64KB Cosmic Pack chunks with full addressability retained.

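The pack layout above can be exercised with a small builder. This sketch trims the index row to 16 bytes (a `u64` hash stand-in instead of the full 32-byte BLAKE3 hash plus flags, so ~100 B in the real format) and leaves serialization out; the names and encoding are illustrative.

```rust
/// Builds one 64 KiB Cosmic Pack: header, inline index, packed data.
const PACK_SIZE: usize = 64 * 1024;
const HEADER_SIZE: usize = 64;

/// Simplified index row (~100 B in the full design).
struct PackEntry { hash: u64, offset: u32, len: u32 }

struct CosmicPack {
    index: Vec<PackEntry>,
    data: Vec<u8>,
}

impl CosmicPack {
    fn new() -> Self { CosmicPack { index: Vec::new(), data: Vec::new() } }

    /// Append a small file if header + index + data still fit in 64 KiB;
    /// on `false` the caller seals this pack and starts a new one.
    fn try_add(&mut self, hash: u64, content: &[u8]) -> bool {
        let index_cost = (self.index.len() + 1) * 16;
        if HEADER_SIZE + index_cost + self.data.len() + content.len() > PACK_SIZE {
            return false;
        }
        self.index.push(PackEntry {
            hash,
            offset: self.data.len() as u32,
            len: content.len() as u32,
        });
        self.data.extend_from_slice(content);
        true
    }

    /// Individual addressability: look a file up by hash inside the pack.
    fn get(&self, hash: u64) -> Option<&[u8]> {
        let e = self.index.iter().find(|e| e.hash == hash)?;
        Some(&self.data[e.offset as usize..(e.offset + e.len) as usize])
    }
}

fn main() {
    let mut pack = CosmicPack::new();
    let mut stored = 0u64;
    // Pack 100-byte files until the 64 KiB budget is exhausted.
    while pack.try_add(stored, &[stored as u8; 100]) { stored += 1; }
    assert!(stored > 500); // hundreds of small files per pack
    assert_eq!(pack.get(42), Some(&[42u8; 100][..]));
    println!("packed {} files into one 64 KiB chunk", stored);
}
```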
### 3.5 Stellar Boot: Sub-50ms VM Start

**Innovation**: Boot data is pre-staged in memory before the VM starts.

```
Boot Sequence Comparison:

Traditional:
  t=0ms     VMM starts
  t=5ms     BIOS loads
  t=50ms    Kernel requested
  t=100ms   Kernel loaded from disk
  t=200ms   initrd loaded
  t=500ms   Root FS mounted
  t=2000ms  Boot complete

Stellar Boot:
  t=-50ms   Boot manifest analyzed (during scheduling)
  t=-25ms   Hot chunks pre-faulted to memory
  t=0ms     VMM starts with memory-mapped boot data
  t=5ms     Kernel executes (already in memory)
  t=15ms    initrd processed (already in memory)
  t=40ms    Root FS ready (ShareVol, pre-mapped)
  t=50ms    Boot complete
```

**Boot Manifest**:
```rust
struct BootManifest {
    kernel: Blake3Hash,
    initrd: Option<Blake3Hash>,
    root_vol: TinyVolRef,

    // Predicted hot chunks for the first 100ms
    prefetch_set: Vec<Blake3Hash>,

    // Memory layout hints
    kernel_load_addr: u64,
    initrd_load_addr: Option<u64>,
}
```

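The pre-fault step at t=-25ms can be sketched as a background task that warms a hot-tier cache from the `prefetch_set` and hands back a waitable handle, so the VMM keeps configuring vCPUs in parallel. Everything here is a stand-in: `HotTier` models PHOTON's L1, `load_chunk` models a read from NVMe or the cluster, and hashes are shortened to `u64`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// In-memory chunk cache standing in for PHOTON's hot tier.
type HotTier = Arc<Mutex<HashMap<u64, Vec<u8>>>>;

fn load_chunk(hash: u64) -> Vec<u8> {
    vec![hash as u8; 64] // placeholder payload for the sketch
}

/// Kick off the prefetch in the background; the returned handle is the
/// "ensure prefetch complete" barrier taken just before the VM starts.
fn prefetch(hot: HotTier, prefetch_set: Vec<u64>) -> JoinHandle<usize> {
    thread::spawn(move || {
        let mut loaded = 0;
        for hash in prefetch_set {
            let mut tier = hot.lock().unwrap();
            // Skip chunks already resident (e.g. shared with other VMs).
            tier.entry(hash).or_insert_with(|| { loaded += 1; load_chunk(hash) });
        }
        loaded
    })
}

fn main() {
    let hot: HotTier = Arc::new(Mutex::new(HashMap::new()));
    hot.lock().unwrap().insert(1, load_chunk(1)); // already shared/resident

    let handle = prefetch(hot.clone(), vec![1, 2, 3]);
    // ... the VMM would set up memory and vCPUs here ...
    let newly_loaded = handle.join().unwrap();    // prefetch barrier
    assert_eq!(newly_loaded, 2);                  // chunk 1 was skipped
    assert!(hot.lock().unwrap().contains_key(&3));
    println!("prefetch ok");
}
```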
### 3.6 CDN-Native Distribution: Voltainer Integration

**Innovation**: Images distributed via CDN, layers indexed directly in NEBULA.

```
Traditional (Registry-based):
  Registry API → Pull manifest → Pull layers → Extract → Overlay FS
  (Complex protocol, copies data, registry infrastructure required)

Stellarium + CDN:
  HTTPS GET manifest → HTTPS GET missing chunks → Mount
  (Simple HTTP, zero extraction, CDN handles global distribution)
```

**CDN-Native Architecture**:
```
┌─────────────────────────────────────────────────────────────────┐
│                     CDN-NATIVE DISTRIBUTION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  cdn.armoredgate.com/                                           │
│  ├── manifests/                                                 │
│  │   └── {blake3-hash}.json   ← Image/layer manifests           │
│  └── blobs/                                                     │
│      └── {blake3-hash}        ← Raw content chunks              │
│                                                                 │
│  Benefits:                                                      │
│  ✓ No registry daemon to run                                    │
│  ✓ No registry protocol complexity                              │
│  ✓ Global edge caching built-in                                 │
│  ✓ Simple HTTPS GET (curl-debuggable)                           │
│  ✓ Content-addressed = perfect cache keys                       │
│  ✓ Dedup at CDN level (same hash = same edge cache)             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Implementation**:

```rust
struct CdnDistribution {
    base_url: String, // "https://cdn.armoredgate.com"
}

impl CdnDistribution {
    async fn fetch_manifest(&self, hash: &Blake3Hash) -> Result<ImageManifest> {
        let url = format!("{}/manifests/{}.json", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        Ok(resp.json().await?)
    }

    async fn fetch_chunk(&self, hash: &Blake3Hash) -> Result<Vec<u8>> {
        let url = format!("{}/blobs/{}", self.base_url, hash);
        let resp = reqwest::get(&url).await?;

        // Verify the content hash matches (integrity check)
        let data = resp.bytes().await?;
        assert_eq!(blake3::hash(&data), *hash);

        Ok(data.to_vec())
    }

    async fn fetch_missing(&self, needed: &[Blake3Hash], local: &Nebula) -> Result<()> {
        // Only fetch chunks we don't have locally
        let missing: Vec<_> = needed.iter()
            .filter(|h| !local.exists(h))
            .collect();

        // Parallel fetch from the CDN; surface the first error
        for result in futures::future::join_all(
            missing.iter().map(|h| self.fetch_and_store(h, local))
        ).await {
            result?;
        }

        Ok(())
    }
}

struct VoltainerImage {
    manifest_hash: Blake3Hash,
    layers: Vec<LayerRef>,
}

struct LayerRef {
    hash: Blake3Hash,          // Content hash (CDN path)
    stellar_manifest: TinyVol, // Direct mapping to Stellar chunks
}

// Voltainer pull = simple CDN fetch
async fn voltainer_pull(image: &str, cdn: &CdnDistribution, nebula: &Nebula) -> Result<VoltainerImage> {
    // 1. Resolve image name to manifest hash (local index or CDN lookup)
    let manifest_hash = resolve_image_hash(image).await?;

    // 2. Fetch manifest from CDN
    let manifest = cdn.fetch_manifest(&manifest_hash).await?;

    // 3. Fetch only missing chunks (dedup-aware)
    let needed_chunks = manifest.all_chunk_hashes();
    cdn.fetch_missing(&needed_chunks, nebula).await?;

    // 4. Image is ready - no extraction, layers ARE the storage
    Ok(VoltainerImage::from_manifest(manifest))
}
```

**Voltainer Integration**:

```rust
// Voltainer (systemd-nspawn based) uses Stellarium directly
impl VoltainerRuntime {
    async fn create_container(&self, image: &VoltainerImage) -> Result<Container> {
        // Layers are already in NEBULA, just create an overlay view
        let rootfs = self.stellarium.create_overlay_view(&image.layers)?;

        // systemd-nspawn mounts the Stellarium-backed rootfs
        let container = systemd_nspawn::Container::new()
            .directory(&rootfs)
            .private_network(true)
            .boot(false)
            .spawn()?;

        Ok(container)
    }
}
```

### 3.7 Memory-Storage Convergence

**Innovation**: Memory and storage share the same backing, eliminating double-buffering.

```
Traditional:
  Storage: [Block Device] → [Page Cache] → [VM Memory]
           (data copied twice)

Stellarium:
  Unified: [CAS Memory Map] ←──────────→ [VM Memory View]
           (single location, two views)
```

**DAX-Style Direct Access**:
```rust
// The VM sees storage as a memory-mapped region
struct StellarBlockDevice {
    volumes: Vec<TinyVol>,
    photon: Photon, // content router providing the mappings
}

impl StellarBlockDevice {
    fn handle_read(&self, offset: u64, len: usize) -> &[u8] {
        let chunk = self.volumes[0].chunk_at(offset);
        let mapping = self.photon.get_or_map(chunk.hash);
        &mapping[chunk.local_offset..][..len]
    }

    // Writes go to the delta layer
    fn handle_write(&mut self, offset: u64, data: &[u8]) {
        self.volumes[0].write_delta(offset, data);
    }
}
```

---

## 4. Density Targets

### Storage Efficiency

| Scenario | Traditional | Stellarium | Target |
|----------|-------------|------------|--------|
| 1000 Ubuntu 22.04 VMs | 2.5 TB | 2.8 GB shared + 10 MB/VM avg delta | **99.5% reduction** |
| 10000 Python app VMs (same base) | 25 TB | 2.8 GB + 5 MB/VM | **99.8% reduction** |
| Mixed workload (100 unique bases) | 2.5 TB | 50 GB shared + 20 MB/VM avg | **97% reduction** |

### Memory Efficiency

| Component | Traditional | Stellarium | Target |
|-----------|-------------|------------|--------|
| Kernel (per VM) | 8-15 MB | Shared (~0 marginal) | **99%+ reduction** |
| libc (per VM) | 2 MB | Shared | **99%+ reduction** |
| Page cache duplication | High | Zero | **100% reduction** |
| Effective RAM per VM | 512 MB - 1 GB | 50-100 MB unique | **5-10x improvement** |

### Performance

| Metric | Traditional | Stellarium Target |
|--------|-------------|-------------------|
| Cold boot (minimal VM) | 500ms - 2s | < 50ms |
| Warm boot (pre-cached) | 100-500ms | < 20ms |
| Clone time (full copy) | 10-60s | < 1ms (CoW instant) |
| Dedup ratio (homogeneous) | N/A | 50:1 to 1000:1 |
| IOPS (deduplicated reads) | N | 1 |

### Density Goals

| Scenario | Traditional (64GB RAM host) | Stellarium Target |
|----------|------------------------------|-------------------|
| Minimal VMs (32MB each) | ~1000 | 5000-10000 |
| Small VMs (128MB each) | ~400 | 2000-4000 |
| Medium VMs (512MB each) | ~100 | 500-1000 |
| Storage per 10K VMs | 10-50 TB | 10-50 GB |

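The storage-efficiency rows follow from back-of-envelope arithmetic: one shared base plus per-VM deltas against N full copies. A sketch that makes the calculation explicit for the first row (figures from the table; the helper function is illustrative):

```rust
/// Percent storage saved by one shared base image plus per-VM deltas,
/// compared with a full copy per VM.
fn reduction_percent(vms: u64, per_vm_gb: f64, shared_gb: f64, delta_mb: f64) -> f64 {
    let traditional = vms as f64 * per_vm_gb;                    // GB
    let stellarium = shared_gb + vms as f64 * delta_mb / 1024.0; // GB
    (1.0 - stellarium / traditional) * 100.0
}

fn main() {
    // 1000 Ubuntu VMs: 2.8 GB shared + ~9.8 GB of deltas vs 2500 GB.
    let r = reduction_percent(1000, 2.5, 2.8, 10.0);
    assert!(r > 99.0 && r < 100.0); // ≈ 99.5%
    println!("reduction: {:.1}%", r);
}
```

The same formula reproduces the other rows: the dominant term is always the per-VM delta, which is why keeping deltas small matters more than shrinking the shared base.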
---

## 5. Integration with Volt VMM

### Boot Path Integration

```rust
// Volt VMM integration
impl VoltVmm {
    fn boot_with_stellarium(&mut self, manifest: BootManifest) -> Result<()> {
        // 1. Pre-fault boot chunks to L1 (memory)
        let prefetch_handle = self.stellarium.prefetch(&manifest.prefetch_set);

        // 2. Set up the memory-mapped kernel
        let kernel_mapping = self.stellarium.map_readonly(&manifest.kernel);
        self.load_kernel_direct(kernel_mapping);

        // 3. Set up the memory-mapped initrd (if present)
        if let Some(initrd) = &manifest.initrd {
            let initrd_mapping = self.stellarium.map_readonly(initrd);
            self.load_initrd_direct(initrd_mapping);
        }

        // 4. Configure the VirtIO-Stellar device
        self.add_stellar_blk(manifest.root_vol)?;

        // 5. Ensure the prefetch is complete
        prefetch_handle.wait();

        // 6. Boot
        self.start()
    }
}
```

### VirtIO-Stellar Driver

Custom VirtIO block device that speaks Stellarium natively:

```rust
struct VirtioStellarConfig {
    // Standard virtio-blk compatible
    capacity: u64,
    size_max: u32,
    seg_max: u32,

    // Stellarium extensions
    stellar_features: u64, // STELLAR_F_SHAREVOL, STELLAR_F_DEDUP, etc.
    vol_hash: Blake3Hash,  // Volume identity
    shared_regions: u32,   // Number of pre-shared regions
}

// Request types (extends standard virtio-blk)
enum StellarRequest {
    Read { sector: u64, len: u32 },
    Write { sector: u64, data: Vec<u8> },

    // Stellarium extensions
    MapShared { hash: Blake3Hash },  // Map a shared chunk directly
    QueryDedup { sector: u64 },      // Check if a sector is deduplicated
    Prefetch { sectors: Vec<u64> },  // Hint upcoming reads
}
```

### Snapshot and Restore

```rust
// Instant snapshots via TinyVol CoW
fn snapshot_vm(vm: &VoltVm, stellarium: &Stellarium) -> VmSnapshot {
    VmSnapshot {
        // Memory as Stellar chunks
        memory_chunks: stellarium.chunk_memory(vm.memory_region()),

        // The volume is already CoW - just reference it
        root_vol: vm.root_vol.clone_manifest(),

        // CPU state is tiny
        cpu_state: vm.save_cpu_state(),
    }
}

// Restore from a snapshot
fn restore_vm(snapshot: &VmSnapshot) -> VoltVm {
    let mut vm = VoltVm::new();

    // Memory is mapped directly from Stellar chunks
    vm.map_memory_from_stellar(&snapshot.memory_chunks);

    // The volume manifest is loaded (no data copy)
    vm.attach_vol(snapshot.root_vol.clone());

    // Restore CPU state
    vm.restore_cpu_state(&snapshot.cpu_state);

    vm
}
```

### Live Migration with Dedup

```rust
// Only transfer unique chunks during migration
async fn migrate_vm(vm: &VoltVm, target: &NodeAddr) -> Result<()> {
    // 1. Get the set of chunks the VM references
    let vm_chunks = vm.collect_chunk_refs();

    // 2. Query the target for chunks it already has
    let target_has = target.query_chunks(&vm_chunks).await?;

    // 3. Transfer only the missing chunks
    let missing: Vec<_> = vm_chunks.difference(&target_has).cloned().collect();
    target.receive_chunks(&missing).await?;

    // 4. Transfer the tiny metadata
    target.receive_manifest(&vm.root_vol).await?;
    target.receive_memory_manifest(&vm.memory_chunks).await?;

    // 5. Final state sync and switchover
    vm.pause();
    target.receive_final_state(vm.cpu_state()).await?;
    target.resume().await?;

    Ok(())
}
```

---

## 6. Implementation Priorities

### Phase 1: Foundation (Months 1-2)
**Goal**: Core CAS and basic volume support

1. **NEBULA Core**
   - BLAKE3 hashing with SIMD acceleration
   - In-memory hash table (Robin Hood hashing)
   - Basic chunk storage (local NVMe)
   - Reference counting

2. **TinyVol v1**
   - Manifest format
   - Read-only volume mounting
   - Basic CoW writes

3. **VirtIO-Stellar Driver**
   - Basic block interface
   - Integration with Volt

**Deliverable**: Boot a VM from Stellarium storage

### Phase 2: Deduplication (Months 2-3)
**Goal**: Inline dedup with zero performance regression

1. **Inline Deduplication**
   - Hash-first write path
   - Atomic insert-or-reference
   - Dedup metrics/reporting

2. **Content-Defined Chunking**
   - FastCDC implementation
   - Tuned for VM workloads

3. **Base Image Sharing**
   - ShareVol implementation
   - Multiple VMs sharing a base

**Deliverable**: 10:1+ dedup ratio for homogeneous VMs

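The content-defined chunking planned for Phase 2 can be prototyped compactly. The sketch below is a simplified gear-hash chunker in the spirit of FastCDC, not the full algorithm (real FastCDC adds normalized chunking and a precomputed random gear table; the multiplicative `gear` function and the size constants here are illustrative stand-ins):

```rust
const MIN_CHUNK: usize = 64;
const MAX_CHUNK: usize = 1024;
const MASK: u64 = (1 << 6) - 1; // cut when the low 6 bits are zero

/// Deterministic pseudo-random per-byte value (stand-in gear table).
fn gear(byte: u8) -> u64 {
    (byte as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31)
}

/// Split `data` into content-defined chunks: a rolling hash over the
/// bytes sets a cut point wherever its low bits hit zero, so chunk
/// boundaries depend on content, not on absolute offsets.
fn chunk(data: &[u8]) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear(b));
        let len = i - start + 1;
        if (len >= MIN_CHUNK && hash & MASK == 0) || len >= MAX_CHUNK {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() { chunks.push(&data[start..]); }
    chunks
}

fn main() {
    let data: Vec<u8> = (0..20_000u32).map(|i| (i * 31 % 251) as u8).collect();
    let chunks = chunk(&data);
    // Chunks reassemble losslessly and respect the size bound.
    assert_eq!(chunks.iter().map(|c| c.len()).sum::<usize>(), data.len());
    assert!(chunks.iter().all(|c| c.len() <= MAX_CHUNK));
    println!("{} chunks, avg {} bytes", chunks.len(), data.len() / chunks.len());
}
```

The content-defined property is what makes dedup robust against insertions: shifting data by a byte moves only the boundaries near the edit, so chunks after it keep their hashes and dedup against the originals.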
### Phase 3: Performance (Months 3-4)
**Goal**: Sub-50ms boot, memory convergence

1. **PHOTON Tiering**
   - Hot/warm/cold classification
   - Automatic promotion/demotion
   - Memory-mapped hot tier

2. **Boot Optimization**
   - Boot manifest analysis
   - Prefetch implementation
   - Zero-copy kernel loading

3. **Memory-Storage Convergence**
   - DAX-style direct access
   - Shared page elimination

**Deliverable**: <50ms cold boot, memory sharing active

### Phase 4: Density (Months 4-5)
**Goal**: 10000+ VMs per host achievable

1. **Small File Packing**
   - Cosmic Pack implementation
   - Inline file storage

2. **Aggressive Sharing**
   - Cross-VM page dedup
   - Kernel/library sharing

3. **Memory Pressure Handling**
   - Intelligent eviction
   - Graceful degradation

**Deliverable**: 5000+ VM density on a 64GB host

### Phase 5: Distribution (Months 5-6)
**Goal**: Multi-node Stellarium cluster

1. **Cosmic Mesh**
   - Distributed hash index
   - Cross-node chunk routing
   - Consistent hashing for placement

2. **Migration Optimization**
   - Chunk pre-staging
   - Delta transfers

3. **Object Storage Backend**
   - S3/R2 cold tier
   - Async writeback

**Deliverable**: Seamless multi-node storage

### Phase 6: Voltainer + CDN Native (Months 6-7)
**Goal**: Voltainer containers as first-class citizens, CDN-native distribution

1. **CDN Distribution Layer**
   - Manifest/chunk fetch from the ArmoredGate CDN
   - Parallel chunk retrieval
   - Edge cache warming strategies

2. **Voltainer Integration**
   - Direct Stellarium mount for systemd-nspawn
   - Shared layers between Voltainer containers and Volt VMs
   - Unified storage for both runtimes

3. **Layer Mapping**
   - Direct layer registration in NEBULA
   - No extraction needed
   - Content-addressed = perfect CDN cache keys

**Deliverable**: Voltainer containers boot in <100ms, unified with VM storage

---

## 7. Name: **Stellarium**

### Why Stellarium?

Continuing the cosmic theme of **Stardust** (cluster) and **Volt** (VMM):

- **Stellar** = star-like, exceptional, relating to stars
- **-arium** = a place for (as in aquarium, planetarium)
- **Stellarium** = "a place for stars" — where all your VMs' data lives

### Component Names (Cosmic Theme)

| Component | Name | Meaning |
|-----------|------|---------|
| CAS Core | **NEBULA** | Birthplace of stars, cloud of shared matter |
| Content Router | **PHOTON** | Light-speed data movement |
| Chunk Packer | **Cosmic Pack** | Aggregating cosmic dust |
| Volume Manager | **Nova-Store** | Connects to Volt |
| Distributed Mesh | **Cosmic Mesh** | Interconnected universe |
| Boot Optimizer | **Stellar Boot** | Star-like speed |
| Small File Pack | **Cosmic Dust** | Tiny particles aggregated |

### Taglines

- *"Every byte a star. Every star shared."*
- *"The storage that makes density possible."*
- *"Where VMs find their data, instantly."*

---

## 8. Summary

**Stellarium** transforms storage from a per-VM liability into a shared asset. By treating all data as content-addressed chunks in a unified namespace:

1. **Deduplication becomes free** — no extra work; it is the storage model
2. **Sharing becomes the default** — VMs reference, not copy
3. **Boot becomes instant** — data is pre-positioned
4. **Density becomes extreme** — 10-100x more VMs per host
5. **Migration becomes trivial** — only ship unique data

Combined with Volt's minimal VMM overhead, Stellarium enables the original ArmoredContainers vision: **VM isolation at container density, with VM security guarantees**.

### The Stellarium Promise

> On a 64GB host with 2TB NVMe:
> - **10,000+ microVMs** running simultaneously
> - **50GB total storage** for 10,000 Debian-based workloads
> - **<50ms** boot time for any VM
> - **Instant** cloning and snapshots
> - **Seamless** live migration

This isn't an incremental improvement. This is a **new storage paradigm** for the microVM era.

---

*Stellarium: The stellar storage for stellar density.*