Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:

- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0
designs/storage-architecture.md

# Stellarium: Unified Storage Architecture for Volt

> *"Every byte has a home. Every home is shared. Nothing is stored twice."*

## 1. Vision Statement

**Stellarium** is a revolutionary storage architecture that treats storage not as isolated volumes, but as a **unified content-addressed stellar cloud** where every unique byte exists exactly once, and every VM draws from the same constellation of data.

### What Makes This Revolutionary

Traditional VM storage operates on a fundamental lie: that each VM has its own dedicated disk. This creates:

- **Massive redundancy** — 1000 Debian VMs = 1000 copies of libc
- **Slow boots** — Each VM reads its own copy of boot files
- **Wasted IOPS** — Page cache misses everywhere
- **Memory bloat** — Same data cached N times

**Stellarium inverts this model.** Instead of VMs owning storage, **storage serves VMs through a unified content mesh**. The result:

| Metric | Traditional | Stellarium | Improvement |
|--------|-------------|------------|-------------|
| Storage per 1000 Debian VMs | 10 TB | 12 GB + deltas | **833x** |
| Cold boot time | 2-5s | <50ms | **40-100x** |
| Memory efficiency | 1 GB/VM | ~50 MB shared core | **20x** |
| IOPS for identical reads | N | 1 | **Nx** |

---

## 2. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                         STELLARIUM LAYERS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐              │
│  │    Volt     │    │    Volt     │    │    Volt     │  VM Layer    │
│  │   microVM   │    │   microVM   │    │   microVM   │              │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘              │
│         │                  │                  │                     │
│  ┌──────┴──────────────────┴──────────────────┴──────┐              │
│  │            STELLARIUM VirtIO Driver               │  Driver      │
│  │          (Memory-Mapped CAS Interface)            │  Layer       │
│  └─────────────────────────┬─────────────────────────┘              │
│                            │                                        │
│  ┌─────────────────────────┴─────────────────────────┐              │
│  │                    NOVA-STORE                     │  Store       │
│  │  ┌─────────┐    ┌─────────┐    ┌─────────┐        │  Layer       │
│  │  │ TinyVol │    │ShareVol │    │ DeltaVol│        │              │
│  │  │ Manager │    │ Manager │    │ Manager │        │              │
│  │  └────┬────┘    └────┬────┘    └────┬────┘        │              │
│  │       └──────────────┴──────────────┘             │              │
│  │                      │                            │              │
│  │     ┌────────────────┴────────────────┐           │              │
│  │     │     PHOTON (Content Router)     │           │              │
│  │     │  Hot→Memory  Warm→NVMe  Cold→S3 │           │              │
│  │     └────────────────┬────────────────┘           │              │
│  └────────────────────┬─┴──────────────────────────┬─┘              │
│                       │                            │                │
│  ┌────────────────────┴──────────────────────────┐ │                │
│  │              NEBULA (CAS Core)                │ │  Foundation    │
│  │                                               │ │  Layer         │
│  │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐   │ │                │
│  │ │  Chunk  │ │  Block  │ │   Distributed   │   │ │                │
│  │ │  Packer │ │  Dedup  │ │   Hash Index    │   │ │                │
│  │ └─────────┘ └─────────┘ └─────────────────┘   │ │                │
│  │                                               │ │                │
│  │ ┌─────────────────────────────────────────┐   │ │                │
│  │ │      COSMIC MESH (Distributed CAS)      │   │ │                │
│  │ │ Local NVMe ←→ Cluster ←→ Object Store   │   │ │                │
│  │ └─────────────────────────────────────────┘   │ │                │
│  └───────────────────────────────────────────────┘ │                │
│                                                    │                │
└────────────────────────────────────────────────────┴────────────────┘
```

### Core Components

#### NEBULA: Content-Addressable Storage Core
The foundation layer. Every piece of data is:
- **Chunked** using content-defined chunking (CDC) with the FastCDC algorithm
- **Hashed** with BLAKE3 (256-bit, hardware-accelerated)
- **Deduplicated** at write time via hash lookup
- **Stored once** regardless of how many VMs reference it

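
The chunking step above can be sketched in miniature. The following is a simplified content-defined chunker, not the real FastCDC (no gear table, no normalized chunking) and with an ad-hoc rolling hash; it shows only the core property the design relies on: boundaries are chosen by content, not offsets, so identical data yields identical chunks wherever it appears.

```rust
/// Simplified CDC sketch. A rolling hash is updated per byte; a boundary is
/// declared when the low bits match a mask (subject to min/max chunk sizes).
/// Returns (offset, len) pairs covering the input.
fn cdc_chunks(data: &[u8], min: usize, max: usize, mask: u64) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new();
    let mut start = 0usize;
    let mut hash: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        // Gear-style update: shift, then mix in the next byte.
        hash = (hash << 1).wrapping_add(0x9E37_79B9_7F4A_7C15u64.wrapping_mul(b as u64 + 1));
        let len = i + 1 - start;
        if (len >= min && (hash & mask) == 0) || len >= max {
            chunks.push((start, len));
            start = i + 1;
            hash = 0; // Reset so each chunk's boundary depends only on its own bytes.
        }
    }
    if start < data.len() {
        chunks.push((start, data.len() - start)); // Trailing partial chunk.
    }
    chunks
}
```

In the real design the bytes of each chunk would then be hashed with BLAKE3 to form the CAS key.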
#### PHOTON: Intelligent Content Router
Manages data placement across the storage hierarchy:
- **L1 (Hot)**: Memory-mapped, instant access, boot-critical data
- **L2 (Warm)**: NVMe, sub-millisecond, working set
- **L3 (Cool)**: SSD, single-digit ms, recent data
- **L4 (Cold)**: Object storage (S3/R2), archival

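
A minimal sketch of a PHOTON placement decision over these four tiers. The inputs (access rate, idle time, boot-critical flag) and thresholds are illustrative assumptions, not the router's real heuristics:

```rust
/// Storage tiers from the hierarchy above: L1 memory, L2 NVMe, L3 SSD, L4 object store.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier { Hot, Warm, Cool, Cold }

/// Toy placement policy: boot-critical data is pinned hot; otherwise chunks
/// are promoted toward memory as access frequency rises and demoted as they idle.
fn place(accesses_per_min: u32, idle_secs: u64, boot_critical: bool) -> Tier {
    if boot_critical || accesses_per_min >= 100 {
        Tier::Hot
    } else if accesses_per_min >= 10 {
        Tier::Warm
    } else if idle_secs < 3600 {
        Tier::Cool
    } else {
        Tier::Cold
    }
}
```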
#### NOVA-STORE: Volume Abstraction Layer
Presents traditional block/file interfaces to VMs while backed by CAS:
- **TinyVol**: Ultra-lightweight volumes with minimal metadata
- **ShareVol**: Copy-on-write shared volumes
- **DeltaVol**: Delta-encoded writable layers

---

## 3. Key Innovations

### 3.1 Stellar Deduplication

**Innovation**: Inline deduplication with zero write amplification.

Traditional dedup:
```
Write → Buffer → Hash → Lookup → Decide → Store
        (copy)  (wait)           (maybe copy again)
```

Stellar dedup:
```
Write → Hash-while-streaming → CAS Insert (atomic)
        (no buffer needed)     (single write or reference)
```

**Implementation**:
```rust
struct StellarChunk {
    hash: Blake3Hash,   // 32 bytes
    size: u16,          // 2 bytes (stores size - 1, so 64KB chunks fit)
    refs: AtomicU32,    // 4 bytes - reference count
    tier: AtomicU8,     // 1 byte - storage tier
    flags: u8,          // 1 byte - compression, encryption
    // Total: 40 bytes metadata per chunk
}

// Hash table: 40 bytes × 1B chunks = 40GB index for ~40TB unique data
// Fits in memory on modern servers
```

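
The "single write or reference" step can be sketched with an in-memory map standing in for the NEBULA index. The `[u8; 32]` key stands in for a BLAKE3 hash, and the in-process `HashMap` for the real persistent, concurrent index:

```rust
use std::collections::HashMap;

/// Toy NEBULA index: hash -> (chunk data, refcount).
struct Nebula {
    index: HashMap<[u8; 32], (Vec<u8>, u32)>,
}

impl Nebula {
    fn new() -> Self {
        Nebula { index: HashMap::new() }
    }

    /// Atomic insert-or-reference: a hit bumps the refcount instead of
    /// writing the data again. Returns true if the chunk was newly stored,
    /// false if it was deduplicated.
    fn insert_or_reference(&mut self, hash: [u8; 32], data: &[u8]) -> bool {
        match self.index.get_mut(&hash) {
            Some(entry) => {
                entry.1 += 1; // Reference only; no second copy of the bytes.
                false
            }
            None => {
                self.index.insert(hash, (data.to_vec(), 1)); // Single write.
                true
            }
        }
    }
}
```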
### 3.2 TinyVol: Minimal Volume Overhead

**Innovation**: Volumes as tiny manifest files, not pre-allocated space.

```
Traditional qcow2:  Header (512B) + L1 Table + L2 Tables + Refcount...
                    Minimum overhead: ~512KB even for empty volume

TinyVol:            Just a manifest pointing to chunks
                    Overhead: 64 bytes base + 48 bytes per modified chunk
                    Empty 10GB volume: 64 bytes
                    1GB modified: 64B + (1GB/64KB × 48B) = ~768KB
```

**Structure**:
```rust
struct TinyVol {
    magic: [u8; 8],             // "TINYVOL\0"
    version: u32,
    flags: u32,
    base_image: Blake3Hash,     // Optional parent
    size_bytes: u64,
    chunk_map: BTreeMap<ChunkIndex, ChunkRef>,
}

struct ChunkRef {
    hash: Blake3Hash,           // 32 bytes
    offset_in_vol: [u8; 6],     // 6 bytes (packed 48-bit offset)
    len: u16,                   // 2 bytes
    flags: u64,                 // 8 bytes (CoW, compressed, etc.)
}
```

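
A sketch of how a manifest like this resolves a volume offset on the read path. The `Chunk` type is a simplified stand-in for `ChunkRef`, and keying the map by each chunk's starting offset (so a read is one `BTreeMap` range query) is an assumption about the manifest layout:

```rust
use std::collections::BTreeMap;

struct Chunk {
    hash: [u8; 32],
    len: u64,
}

/// Resolve a volume byte offset to (chunk, intra-chunk offset).
/// Offsets past the end of the covering chunk fall into a hole
/// (never-written space), which a CoW volume reads as zeros.
fn chunk_for_offset(map: &BTreeMap<u64, Chunk>, offset: u64) -> Option<(&Chunk, u64)> {
    // Last chunk starting at or before `offset`.
    let (&start, chunk) = map.range(..=offset).next_back()?;
    let local = offset - start;
    if local < chunk.len {
        Some((chunk, local))
    } else {
        None
    }
}
```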
### 3.3 ShareVol: Zero-Copy Shared Volumes

**Innovation**: Multiple VMs share read paths, with instant copy-on-write.

```
Traditional Shared Storage:
VM1 reads /lib/libc.so → Disk read → VM1 memory
VM2 reads /lib/libc.so → Disk read → VM2 memory
(Same data read twice, stored twice in RAM)

ShareVol:
VM1 reads /lib/libc.so → Shared mapping (already in memory)
VM2 reads /lib/libc.so → Same shared mapping
(Single read, single memory location, N consumers)
```

**Memory-Mapped CAS**:
```rust
// Shared content is memory-mapped once
struct SharedMapping {
    hash: Blake3Hash,
    mmap_addr: *const u8,
    mmap_len: usize,
    vm_refs: AtomicU32,     // How many VMs reference this
    last_access: AtomicU64, // For eviction
}

// VMs get read-only mappings to shared content
// Write attempts trigger CoW into TinyVol delta layer
```

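
The CoW rule in the comment above can be illustrated with `Rc::make_mut`, which clones a buffer only when it is shared. The real ShareVol path operates on guest page mappings and redirects the copy into a TinyVol delta layer, so this is only the semantics in miniature:

```rust
use std::rc::Rc;

/// One VM's view of a shared chunk: readers share a single Rc'd buffer.
#[derive(Clone)]
struct ChunkView {
    data: Rc<Vec<u8>>,
}

impl ChunkView {
    /// First write to a shared buffer clones it into a private copy
    /// (copy-on-write); other views keep seeing the original bytes.
    fn write(&mut self, offset: usize, bytes: &[u8]) {
        let buf = Rc::make_mut(&mut self.data); // Clones only if shared.
        buf[offset..offset + bytes.len()].copy_from_slice(bytes);
    }
}
```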
### 3.4 Cosmic Packing: Small File Optimization

**Innovation**: Pack small files into larger chunks without losing addressability.

Problem: Millions of small files (< 4KB) waste space at chunk boundaries.

Solution: **Cosmic Packs** — aggregated storage with inline index:

```
┌─────────────────────────────────────────────────┐
│               COSMIC PACK (64KB)                │
├─────────────────────────────────────────────────┤
│ Header (64B)                                    │
│  - magic, version, entry_count                  │
├─────────────────────────────────────────────────┤
│ Index (variable, ~100B per entry)               │
│  - [hash, offset, len, flags] × N               │
├─────────────────────────────────────────────────┤
│ Data (remaining space)                          │
│  - Packed file contents                         │
└─────────────────────────────────────────────────┘
```

**Benefit**: 1000 × 100-byte files is only 100KB of raw data, but stored as individual chunks each file pays its own index entry and alignment padding. Packed into Cosmic Packs, they collapse into a handful of 64KB chunks while every file remains individually addressable by hash.

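
A sketch of pack assembly under the layout above. The fixed ~100-byte index reservation per entry and the flag-free entries are simplifications of the diagrammed format:

```rust
const PACK_SIZE: usize = 64 * 1024;

/// Toy Cosmic Pack: small entries appended to one buffer with an inline
/// index of (hash, offset, len), so each file stays individually
/// addressable inside a single CAS chunk.
struct CosmicPack {
    index: Vec<([u8; 32], u32, u32)>,
    data: Vec<u8>,
}

impl CosmicPack {
    fn new() -> Self {
        CosmicPack { index: Vec::new(), data: Vec::new() }
    }

    /// Returns false when the entry (plus ~100B of index and the 64B
    /// header) would no longer fit in the 64KB pack.
    fn add(&mut self, hash: [u8; 32], file: &[u8]) -> bool {
        let projected = self.data.len() + file.len() + (self.index.len() + 1) * 100 + 64;
        if projected > PACK_SIZE {
            return false;
        }
        self.index.push((hash, self.data.len() as u32, file.len() as u32));
        self.data.extend_from_slice(file);
        true
    }

    /// Address one packed file by its content hash.
    fn get(&self, hash: &[u8; 32]) -> Option<&[u8]> {
        let &(_, off, len) = self.index.iter().find(|(h, _, _)| h == hash)?;
        Some(&self.data[off as usize..(off + len) as usize])
    }
}
```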
### 3.5 Stellar Boot: Sub-50ms VM Start

**Innovation**: Boot data is pre-staged in memory before the VM starts.

```
Boot Sequence Comparison:

Traditional:
t=0ms     VMM starts
t=5ms     BIOS loads
t=50ms    Kernel requested
t=100ms   Kernel loaded from disk
t=200ms   initrd loaded
t=500ms   Root FS mounted
t=2000ms  Boot complete

Stellar Boot:
t=-50ms   Boot manifest analyzed (during scheduling)
t=-25ms   Hot chunks pre-faulted to memory
t=0ms     VMM starts with memory-mapped boot data
t=5ms     Kernel executes (already in memory)
t=15ms    initrd processed (already in memory)
t=40ms    Root FS ready (ShareVol, pre-mapped)
t=50ms    Boot complete
```

**Boot Manifest**:
```rust
struct BootManifest {
    kernel: Blake3Hash,
    initrd: Option<Blake3Hash>,
    root_vol: TinyVolRef,

    // Predicted hot chunks for first 100ms
    prefetch_set: Vec<Blake3Hash>,

    // Memory layout hints
    kernel_load_addr: u64,
    initrd_load_addr: Option<u64>,
}
```

### 3.6 CDN-Native Distribution: Voltainer Integration

**Innovation**: Images distributed via CDN, layers indexed directly in NEBULA.

```
Traditional (Registry-based):
Registry API → Pull manifest → Pull layers → Extract → Overlay FS
(Complex protocol, copies data, registry infrastructure required)

Stellarium + CDN:
HTTPS GET manifest → HTTPS GET missing chunks → Mount
(Simple HTTP, zero extraction, CDN handles global distribution)
```

**CDN-Native Architecture**:
```
┌─────────────────────────────────────────────────────────────────┐
│                    CDN-NATIVE DISTRIBUTION                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  cdn.armoredgate.com/                                           │
│  ├── manifests/                                                 │
│  │   └── {blake3-hash}.json    ← Image/layer manifests          │
│  └── blobs/                                                     │
│      └── {blake3-hash}         ← Raw content chunks             │
│                                                                 │
│  Benefits:                                                      │
│  ✓ No registry daemon to run                                    │
│  ✓ No registry protocol complexity                              │
│  ✓ Global edge caching built-in                                 │
│  ✓ Simple HTTPS GET (curl-debuggable)                           │
│  ✓ Content-addressed = perfect cache keys                       │
│  ✓ Dedup at CDN level (same hash = same edge cache)             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Implementation**:
```rust
struct CdnDistribution {
    base_url: String, // "https://cdn.armoredgate.com"
}

impl CdnDistribution {
    async fn fetch_manifest(&self, hash: &Blake3Hash) -> Result<ImageManifest> {
        let url = format!("{}/manifests/{}.json", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        Ok(resp.json().await?)
    }

    async fn fetch_chunk(&self, hash: &Blake3Hash) -> Result<Vec<u8>> {
        let url = format!("{}/blobs/{}", self.base_url, hash);
        let resp = reqwest::get(&url).await?;

        // Verify content hash matches (integrity check)
        let data = resp.bytes().await?;
        assert_eq!(blake3::hash(&data), *hash);

        Ok(data.to_vec())
    }

    async fn fetch_missing(&self, needed: &[Blake3Hash], local: &Nebula) -> Result<()> {
        // Only fetch chunks we don't have locally
        let missing: Vec<_> = needed.iter()
            .filter(|h| !local.exists(h))
            .collect();

        // Parallel fetch from CDN; fail the pull if any chunk fails
        futures::future::try_join_all(
            missing.iter().map(|h| self.fetch_and_store(h, local))
        ).await?;

        Ok(())
    }
}

struct VoltainerImage {
    manifest_hash: Blake3Hash,
    layers: Vec<LayerRef>,
}

struct LayerRef {
    hash: Blake3Hash,           // Content hash (CDN path)
    stellar_manifest: TinyVol,  // Direct mapping to Stellar chunks
}

// Voltainer pull = simple CDN fetch
async fn voltainer_pull(image: &str, cdn: &CdnDistribution, nebula: &Nebula) -> Result<VoltainerImage> {
    // 1. Resolve image name to manifest hash (local index or CDN lookup)
    let manifest_hash = resolve_image_hash(image).await?;

    // 2. Fetch manifest from CDN
    let manifest = cdn.fetch_manifest(&manifest_hash).await?;

    // 3. Fetch only missing chunks (dedup-aware)
    let needed_chunks = manifest.all_chunk_hashes();
    cdn.fetch_missing(&needed_chunks, nebula).await?;

    // 4. Image is ready - no extraction, layers ARE the storage
    Ok(VoltainerImage::from_manifest(manifest))
}
```

**Voltainer Integration**:
```rust
// Voltainer (systemd-nspawn based) uses Stellarium directly
impl VoltainerRuntime {
    async fn create_container(&self, image: &VoltainerImage) -> Result<Container> {
        // Layers are already in NEBULA, just create overlay view
        let rootfs = self.stellarium.create_overlay_view(&image.layers)?;

        // systemd-nspawn mounts the Stellarium-backed rootfs
        let container = systemd_nspawn::Container::new()
            .directory(&rootfs)
            .private_network(true)
            .boot(false)
            .spawn()?;

        Ok(container)
    }
}
```

### 3.7 Memory-Storage Convergence

**Innovation**: Memory and storage share the same backing, eliminating double-buffering.

```
Traditional:
Storage: [Block Device] → [Page Cache] → [VM Memory]
         (data copied twice)

Stellarium:
Unified: [CAS Memory Map] ←──────────→ [VM Memory View]
         (single location, two views)
```

**DAX-Style Direct Access**:
```rust
// VM sees storage as memory-mapped region
struct StellarBlockDevice {
    volumes: Vec<TinyVol>,
}

impl StellarBlockDevice {
    fn handle_read(&self, offset: u64, len: u32) -> &[u8] {
        let chunk = self.volumes[0].chunk_at(offset);
        let mapping = photon.get_or_map(chunk.hash);
        &mapping[chunk.local_offset..][..len as usize]
    }

    // Writes go to delta layer
    fn handle_write(&mut self, offset: u64, data: &[u8]) {
        self.volumes[0].write_delta(offset, data);
    }
}
```

---

## 4. Density Targets

### Storage Efficiency

| Scenario | Traditional | Stellarium | Target |
|----------|-------------|------------|--------|
| 1000 Ubuntu 22.04 VMs | 2.5 TB | 2.8 GB shared + 10 MB/VM avg delta | **99.6% reduction** |
| 10000 Python app VMs (same base) | 25 TB | 2.8 GB + 5 MB/VM | **99.8% reduction** |
| Mixed workload (100 unique bases) | 2.5 TB | 50 GB shared + 20 MB/VM avg | **94% reduction** |

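
The rows above are plain arithmetic (one shared base plus per-VM deltas); a small helper makes the claimed reduction checkable. The figures are the document's planning targets, not measurements:

```rust
/// Total Stellarium footprint: one shared base plus per-VM deltas.
fn stellarium_bytes(shared: u64, vms: u64, delta_per_vm: u64) -> u64 {
    shared + vms * delta_per_vm
}

/// Percentage saved versus the traditional per-VM-disk footprint.
fn reduction_pct(traditional: u64, stellarium: u64) -> f64 {
    100.0 * (1.0 - stellarium as f64 / traditional as f64)
}
```

For the first row: 2.8 GB shared + 1000 × 10 MB deltas is roughly 12.8 GB against 2.5 TB traditional, which rounds to the ~99.5% reduction claimed.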
### Memory Efficiency

| Component | Traditional | Stellarium | Target |
|-----------|-------------|------------|--------|
| Kernel (per VM) | 8-15 MB | Shared (~0 marginal) | **99%+ reduction** |
| libc (per VM) | 2 MB | Shared | **99%+ reduction** |
| Page cache duplication | High | Zero | **100% reduction** |
| Effective RAM per VM | 512 MB - 1 GB | 50-100 MB unique | **5-10x improvement** |

### Performance

| Metric | Traditional | Stellarium Target |
|--------|-------------|-------------------|
| Cold boot (minimal VM) | 500ms - 2s | < 50ms |
| Warm boot (pre-cached) | 100-500ms | < 20ms |
| Clone time (full copy) | 10-60s | < 1ms (CoW instant) |
| Dedup ratio (homogeneous) | N/A | 50:1 to 1000:1 |
| IOPS (deduplicated reads) | N | 1 |

### Density Goals

| Scenario | Traditional (64GB RAM host) | Stellarium Target |
|----------|------------------------------|-------------------|
| Minimal VMs (32MB each) | ~1000 | 5000-10000 |
| Small VMs (128MB each) | ~400 | 2000-4000 |
| Medium VMs (512MB each) | ~100 | 500-1000 |
| Storage per 10K VMs | 10-50 TB | 10-50 GB |

---

## 5. Integration with Volt VMM

### Boot Path Integration

```rust
// Volt VMM integration
impl VoltVmm {
    fn boot_with_stellarium(&mut self, manifest: BootManifest) -> Result<()> {
        // 1. Pre-fault boot chunks to L1 (memory)
        let prefetch_handle = stellarium.prefetch(&manifest.prefetch_set);

        // 2. Set up memory-mapped kernel
        let kernel_mapping = stellarium.map_readonly(&manifest.kernel);
        self.load_kernel_direct(kernel_mapping);

        // 3. Set up memory-mapped initrd (if present)
        if let Some(initrd) = &manifest.initrd {
            let initrd_mapping = stellarium.map_readonly(initrd);
            self.load_initrd_direct(initrd_mapping);
        }

        // 4. Configure VirtIO-Stellar device
        self.add_stellar_blk(manifest.root_vol)?;

        // 5. Ensure prefetch complete
        prefetch_handle.wait();

        // 6. Boot
        self.start()
    }
}
```

### VirtIO-Stellar Driver

Custom VirtIO block device that speaks Stellarium natively:

```rust
struct VirtioStellarConfig {
    // Standard virtio-blk compatible
    capacity: u64,
    size_max: u32,
    seg_max: u32,

    // Stellarium extensions
    stellar_features: u64,  // STELLAR_F_SHAREVOL, STELLAR_F_DEDUP, etc.
    vol_hash: Blake3Hash,   // Volume identity
    shared_regions: u32,    // Number of pre-shared regions
}

// Request types (extends standard virtio-blk)
enum StellarRequest {
    Read { sector: u64, len: u32 },
    Write { sector: u64, data: Vec<u8> },

    // Stellarium extensions
    MapShared { hash: Blake3Hash },     // Map shared chunk directly
    QueryDedup { sector: u64 },         // Check if sector is deduplicated
    Prefetch { sectors: Vec<u64> },     // Hint upcoming reads
}
```

### Snapshot and Restore

```rust
// Instant snapshots via TinyVol CoW
fn snapshot_vm(vm: &VoltVm) -> VmSnapshot {
    VmSnapshot {
        // Memory as Stellar chunks
        memory_chunks: stellarium.chunk_memory(vm.memory_region()),

        // Volume is already CoW - just reference
        root_vol: vm.root_vol.clone_manifest(),

        // CPU state is tiny
        cpu_state: vm.save_cpu_state(),
    }
}

// Restore from snapshot
fn restore_vm(snapshot: &VmSnapshot) -> VoltVm {
    let mut vm = VoltVm::new();

    // Memory is mapped directly from Stellar chunks
    vm.map_memory_from_stellar(&snapshot.memory_chunks);

    // Volume manifest is loaded (no data copy)
    vm.attach_vol(snapshot.root_vol.clone());

    // Restore CPU state
    vm.restore_cpu_state(&snapshot.cpu_state);

    vm
}
```

### Live Migration with Dedup

```rust
// Only transfer unique chunks during migration
async fn migrate_vm(vm: &VoltVm, target: &NodeAddr) -> Result<()> {
    // 1. Get list of chunks VM references
    let vm_chunks = vm.collect_chunk_refs();

    // 2. Query target for chunks it already has
    let target_has = target.query_chunks(&vm_chunks).await?;

    // 3. Transfer only missing chunks
    let missing = vm_chunks.difference(&target_has);
    target.receive_chunks(&missing).await?;

    // 4. Transfer tiny metadata
    target.receive_manifest(&vm.root_vol).await?;
    target.receive_memory_manifest(&vm.memory_chunks).await?;

    // 5. Final state sync and switchover
    vm.pause();
    target.receive_final_state(vm.cpu_state()).await?;
    target.resume().await?;

    Ok(())
}
```

---

## 6. Implementation Priorities

### Phase 1: Foundation (Month 1-2)
**Goal**: Core CAS and basic volume support

1. **NEBULA Core**
   - BLAKE3 hashing with SIMD acceleration
   - In-memory hash table (Robin Hood hashing)
   - Basic chunk storage (local NVMe)
   - Reference counting

2. **TinyVol v1**
   - Manifest format
   - Read-only volume mounting
   - Basic CoW writes

3. **VirtIO-Stellar Driver**
   - Basic block interface
   - Integration with Volt

**Deliverable**: Boot a VM from Stellarium storage

### Phase 2: Deduplication (Month 2-3)
**Goal**: Inline dedup with zero performance regression

1. **Inline Deduplication**
   - Write path with hash-first
   - Atomic insert-or-reference
   - Dedup metrics/reporting

2. **Content-Defined Chunking**
   - FastCDC implementation
   - Tuned for VM workloads

3. **Base Image Sharing**
   - ShareVol implementation
   - Multiple VMs sharing base

**Deliverable**: 10:1+ dedup ratio for homogeneous VMs

### Phase 3: Performance (Month 3-4)
**Goal**: Sub-50ms boot, memory convergence

1. **PHOTON Tiering**
   - Hot/warm/cold classification
   - Automatic promotion/demotion
   - Memory-mapped hot tier

2. **Boot Optimization**
   - Boot manifest analysis
   - Prefetch implementation
   - Zero-copy kernel loading

3. **Memory-Storage Convergence**
   - DAX-style direct access
   - Shared page elimination

**Deliverable**: <50ms cold boot, memory sharing active

### Phase 4: Density (Month 4-5)
**Goal**: 10000+ VMs per host achievable

1. **Small File Packing**
   - Cosmic Pack implementation
   - Inline file storage

2. **Aggressive Sharing**
   - Cross-VM page dedup
   - Kernel/library sharing

3. **Memory Pressure Handling**
   - Intelligent eviction
   - Graceful degradation

**Deliverable**: 5000+ VM density on a 64GB host

### Phase 5: Distribution (Month 5-6)
**Goal**: Multi-node Stellarium cluster

1. **Cosmic Mesh**
   - Distributed hash index
   - Cross-node chunk routing
   - Consistent hashing for placement

2. **Migration Optimization**
   - Chunk pre-staging
   - Delta transfers

3. **Object Storage Backend**
   - S3/R2 cold tier
   - Async writeback

**Deliverable**: Seamless multi-node storage

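
The "consistent hashing for placement" item in Phase 5 can be sketched as a hash ring with virtual nodes; the FNV-style mix function and vnode count here are illustrative assumptions, not the Cosmic Mesh design:

```rust
use std::collections::BTreeMap;

/// Toy consistent-hash ring: each node owns many points on a u64 ring
/// (virtual nodes for balance); a chunk routes to the first node clockwise
/// from its hash, so adding a node moves only a small fraction of chunks.
struct Ring {
    points: BTreeMap<u64, String>,
}

/// FNV-1a-style mix, standing in for hashing a node id or chunk hash.
fn mix(s: &str, salt: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64 ^ salt.wrapping_mul(0x100_0000_01b3);
    for b in s.bytes() {
        h = (h ^ b as u64).wrapping_mul(0x100_0000_01b3);
    }
    h
}

impl Ring {
    fn new(nodes: &[&str], vnodes: u64) -> Self {
        let mut points = BTreeMap::new();
        for &n in nodes {
            for v in 0..vnodes {
                points.insert(mix(n, v), n.to_string());
            }
        }
        Ring { points }
    }

    /// First ring point at or after the chunk's hash, wrapping around.
    /// Panics on an empty ring (assumed non-empty in this sketch).
    fn node_for(&self, chunk_id: &str) -> &str {
        let h = mix(chunk_id, 0);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next())
            .map(|(_, n)| n.as_str())
            .unwrap()
    }
}
```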
### Phase 6: Voltainer + CDN Native (Month 6-7)
**Goal**: Voltainer containers as first-class citizens, CDN-native distribution

1. **CDN Distribution Layer**
   - Manifest/chunk fetch from ArmoredGate CDN
   - Parallel chunk retrieval
   - Edge cache warming strategies

2. **Voltainer Integration**
   - Direct Stellarium mount for systemd-nspawn
   - Shared layers between Voltainer containers and Volt VMs
   - Unified storage for both runtimes

3. **Layer Mapping**
   - Direct layer registration in NEBULA
   - No extraction needed
   - Content-addressed = perfect CDN cache keys

**Deliverable**: Voltainer containers boot in <100ms, unified with VM storage

---

## 7. Name: **Stellarium**

### Why Stellarium?

Continuing the cosmic theme of **Stardust** (cluster) and **Volt** (VMM):

- **Stellar** = Star-like, exceptional, relating to stars
- **-arium** = A place for (like aquarium, planetarium)
- **Stellarium** = "A place for stars" — where all your VM's data lives

### Component Names (Cosmic Theme)

| Component | Name | Meaning |
|-----------|------|---------|
| CAS Core | **NEBULA** | Birthplace of stars, cloud of shared matter |
| Content Router | **PHOTON** | Light-speed data movement |
| Chunk Packer | **Cosmic Pack** | Aggregating cosmic dust |
| Volume Manager | **Nova-Store** | Connects to Volt |
| Distributed Mesh | **Cosmic Mesh** | Interconnected universe |
| Boot Optimizer | **Stellar Boot** | Star-like speed |
| Small File Pack | **Cosmic Dust** | Tiny particles aggregated |

### Taglines

- *"Every byte a star. Every star shared."*
- *"The storage that makes density possible."*
- *"Where VMs find their data, instantly."*

---

## 8. Summary

**Stellarium** transforms storage from a per-VM liability into a shared asset. By treating all data as content-addressed chunks in a unified namespace:

1. **Deduplication becomes free** — No extra work, it's the storage model
2. **Sharing becomes default** — VMs reference, not copy
3. **Boot becomes instant** — Data is pre-positioned
4. **Density becomes extreme** — 10-100x more VMs per host
5. **Migration becomes trivial** — Only ship unique data

Combined with Volt's minimal VMM overhead, Stellarium enables the original ArmoredContainers vision: **VM isolation at container density, with VM security guarantees**.

### The Stellarium Promise

> On a 64GB host with 2TB NVMe:
> - **10,000+ microVMs** running simultaneously
> - **50GB total storage** for 10,000 Debian-based workloads
> - **<50ms** boot time for any VM
> - **Instant** cloning and snapshots
> - **Seamless** live migration

This isn't incremental improvement. This is a **new storage paradigm** for the microVM era.

---

*Stellarium: The stellar storage for stellar density.*