Volt VMM (Neutron Stardust): source-available under AGPSL v5.0

KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
Commit 40ed108dd5 by Karl Clinger, 2026-03-21 01:04:35 -05:00
143 changed files with 50300 additions and 0 deletions

# systemd-networkd Enhanced virtio-net
## Overview
This design enhances Volt's virtio-net implementation by integrating with systemd-networkd for declarative, lifecycle-managed network configuration. Instead of Volt creating and configuring TAP devices imperatively, networkd manages them declaratively.
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ systemd-networkd │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ volt-vmm-br0 │ │ vm-{uuid}.netdev │ │ vm-{uuid}.network│ │
│ │ (.netdev bridge) │ │ (TAP definition) │ │ (bridge attach) │ │
│ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │ │
│ └─────────────────────┼─────────────────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ br0 │ ◄── Unified bridge │
│ │ (bridge) │ (VMs + Voltainer) │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ tap0 │ │ veth0 │ │ tap1 │ │
│ │ (VM-1) │ │ (cont.) │ │ (VM-2) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
└─────────────┼────────────────┼────────────────┼─────────────────────┘
              │                │                │
              ▼                ▼                ▼
         ┌─────────┐      ┌─────────┐      ┌─────────┐
         │  Volt   │      │Voltainer│      │  Volt   │
         │  VM-1   │      │Container│      │  VM-2   │
         └─────────┘      └─────────┘      └─────────┘
```
## Benefits
1. **Declarative Configuration**: Network topology defined in unit files, version-controllable
2. **Automatic Cleanup**: systemd removes TAP devices when VM exits
3. **Lifecycle Integration**: TAP created before VM starts, destroyed after
4. **Unified Networking**: VMs and Voltainer containers share the same bridge infrastructure
5. **vhost-net Acceleration**: Kernel-level packet processing bypasses userspace
6. **Predictable Naming**: TAP names derived from VM UUID
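Benefit 6 depends on a stable UUID-to-interface-name mapping; Linux caps interface names at 15 bytes (IFNAMSIZ minus the terminating NUL), so only a short UUID prefix can be used. A minimal sketch (the `tap_name` helper and the 8-character prefix are illustrative assumptions, not Volt's actual scheme):
```rust
/// Derive a predictable TAP interface name from a VM UUID.
/// Linux interface names are limited to 15 bytes, so "tap-" (4 bytes)
/// plus an 8-character hex prefix of the UUID stays well within bounds.
fn tap_name(vm_uuid: &str) -> String {
    let short: String = vm_uuid
        .chars()
        .filter(|c| c.is_ascii_hexdigit()) // drop hyphens
        .take(8)
        .collect();
    format!("tap-{}", short.to_lowercase())
}

fn main() {
    let name = tap_name("f47ac10b-58cc-4372-a567-0e02b2c3d479");
    assert!(name.len() <= 15);
    println!("{}", name); // tap-f47ac10b
}
```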
## Components
### 1. Bridge Infrastructure (One-time Setup)
```ini
# /etc/systemd/network/10-volt-vmm-br0.netdev
[NetDev]
Name=br0
Kind=bridge
MACAddress=52:54:00:00:00:01
[Bridge]
STP=false
ForwardDelaySec=0
```
```ini
# /etc/systemd/network/10-volt-vmm-br0.network
[Match]
Name=br0
[Network]
Address=10.42.0.1/24
IPForward=yes
IPMasquerade=both
ConfigureWithoutCarrier=yes
```
### 2. Per-VM TAP Template
Volt generates these dynamically:
```ini
# /run/systemd/network/50-vm-{uuid}.netdev
[NetDev]
Name=tap-{short_uuid}
Kind=tap
MACAddress=none
[Tap]
User=root
Group=root
VNetHeader=true
MultiQueue=true
PacketInfo=false
```
```ini
# /run/systemd/network/50-vm-{uuid}.network
[Match]
Name=tap-{short_uuid}
[Network]
Bridge=br0
ConfigureWithoutCarrier=yes
```
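The generation step is straightforward templating over the unit files above; `render_netdev`/`render_network` are illustrative helper names, not the shipped code, and in Volt the output would be written to `/run/systemd/network/` before `networkctl reload`:
```rust
/// Render the per-VM .netdev unit described above as a string.
fn render_netdev(short_uuid: &str) -> String {
    format!(
        "[NetDev]\nName=tap-{id}\nKind=tap\nMACAddress=none\n\n\
         [Tap]\nUser=root\nGroup=root\nVNetHeader=true\nMultiQueue=true\nPacketInfo=false\n",
        id = short_uuid
    )
}

/// Render the matching .network unit that attaches the TAP to the bridge.
fn render_network(short_uuid: &str, bridge: &str) -> String {
    format!(
        "[Match]\nName=tap-{id}\n\n[Network]\nBridge={br}\nConfigureWithoutCarrier=yes\n",
        id = short_uuid,
        br = bridge
    )
}

fn main() {
    let netdev = render_netdev("f47ac10b");
    let network = render_network("f47ac10b", "br0");
    assert!(netdev.contains("Kind=tap"));
    assert!(network.contains("Bridge=br0"));
}
```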
### 3. vhost-net Acceleration
vhost-net offloads packet processing to the kernel:
```
┌─────────────────────────────────────────────────┐
│ Guest VM │
│ ┌─────────────────────────────────────────┐ │
│ │ virtio-net driver │ │
│ └─────────────────┬───────────────────────┘ │
└───────────────────┬┼────────────────────────────┘
││
┌──────────┘│
│ │ KVM Exit (rare)
▼ ▼
┌────────────────────────────────────────────────┐
│ vhost-net (kernel) │
│ │
│ - Processes virtqueue directly in kernel │
│ - Zero-copy between TAP and guest memory │
│ - Avoids userspace context switches │
│ - ~30-50% throughput improvement │
└────────────────────┬───────────────────────────┘
                     │
              ┌──────┴──────┐
              │ TAP device  │
              └─────────────┘
```
**Without vhost-net:**
```
Guest → KVM exit → QEMU/Volt userspace → syscall → TAP → kernel → network
```
**With vhost-net:**
```
Guest → vhost-net (kernel) → TAP → network
```
## Integration with Voltainer
Both Volt VMs and Voltainer containers connect to the same bridge:
### Voltainer Network Zone
```yaml
# /etc/voltainer/network/zone-default.yaml
kind: NetworkZone
name: default
bridge: br0
subnet: 10.42.0.0/24
gateway: 10.42.0.1
dhcp:
enabled: true
range: 10.42.0.100-10.42.0.254
```
### Volt VM Allocation
VMs get static IPs from a reserved range (10.42.0.2-10.42.0.99):
```yaml
network:
- zone: default
mac: "52:54:00:ab:cd:ef"
ipv4: "10.42.0.10/24"
```
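Allocation from the reserved range can be a simple first-free scan; `alloc_vm_ip` and the in-memory `used` set are illustrative assumptions, not the actual allocator:
```rust
use std::collections::HashSet;

/// Pick the first free host address in the VM-reserved range
/// 10.42.0.2 - 10.42.0.99 (.1 is the bridge gateway, .100+ is the
/// Voltainer DHCP range).
fn alloc_vm_ip(used: &HashSet<u8>) -> Option<String> {
    (2u8..=99)
        .find(|h| !used.contains(h))
        .map(|h| format!("10.42.0.{}/24", h))
}

fn main() {
    let mut used = HashSet::new();
    used.insert(2);
    used.insert(3);
    // .2 and .3 are taken, so the next VM gets .4
    assert_eq!(alloc_vm_ip(&used).as_deref(), Some("10.42.0.4/24"));
}
```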
## File Locations
| File Type | Location | Persistence |
|-----------|----------|-------------|
| Bridge .netdev/.network | `/etc/systemd/network/` | Permanent |
| VM TAP .netdev/.network | `/run/systemd/network/` | Runtime only |
| Voltainer zone config | `/etc/voltainer/network/` | Permanent |
| vhost-net module | Kernel built-in | N/A |
## Lifecycle
### VM Start
1. Volt generates `.netdev` and `.network` in `/run/systemd/network/`
2. `networkctl reload` triggers networkd to create TAP
3. Wait for TAP interface to appear (`networkctl status tap-XXX`)
4. Open TAP fd with O_RDWR
5. Enable vhost-net via `/dev/vhost-net` ioctl
6. Boot VM with virtio-net using the TAP fd
### VM Stop
1. Close vhost-net and TAP file descriptors
2. Delete `.netdev` and `.network` from `/run/systemd/network/`
3. `networkctl reload` triggers cleanup
4. TAP interface automatically removed
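The wait in start step 3 can be sketched with std APIs; polling sysfs for the interface is a lightweight stand-in for parsing `networkctl status`, and the helper names are assumptions:
```rust
use std::path::{Path, PathBuf};

/// Sysfs entry that appears once the kernel has created the interface.
fn tap_sysfs_path(ifname: &str) -> PathBuf {
    Path::new("/sys/class/net").join(ifname)
}

/// Poll until the TAP device exists or the timeout expires.
fn wait_for_tap(ifname: &str, timeout: std::time::Duration) -> bool {
    let deadline = std::time::Instant::now() + timeout;
    while std::time::Instant::now() < deadline {
        if tap_sysfs_path(ifname).exists() {
            return true;
        }
        std::thread::sleep(std::time::Duration::from_millis(10));
    }
    false
}

fn main() {
    assert_eq!(
        tap_sysfs_path("tap-f47ac10b"),
        PathBuf::from("/sys/class/net/tap-f47ac10b")
    );
    // wait_for_tap("tap-f47ac10b", ...) would run right after
    // `networkctl reload`; not exercised here since no TAP exists.
}
```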
## vhost-net Setup Sequence
```c
// (Headers needed; error handling omitted -- every ioctl below can fail.)
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

// 1. Open the vhost-net device
int vhost_fd = open("/dev/vhost-net", O_RDWR);
// 2. Set owner (binds this vhost instance to the calling process's memory)
ioctl(vhost_fd, VHOST_SET_OWNER);
// 3. Set memory region table so the kernel can translate guest addresses
struct vhost_memory *mem = ...; // Guest memory regions
ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);
// 4. Set vring info (repeat steps 4-6 with .index = 1 for the TX queue)
struct vhost_vring_state state = { .index = 0, .num = queue_size };
ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
struct vhost_vring_addr addr = {
    .index = 0,
    .desc_user_addr = desc_addr,
    .used_user_addr = used_addr,
    .avail_user_addr = avail_addr,
};
ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
// 5. Set kick/call eventfds (guest->host notification, host->guest interrupt)
struct vhost_vring_file kick = { .index = 0, .fd = kick_eventfd };
ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);
struct vhost_vring_file call = { .index = 0, .fd = call_eventfd };
ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);
// 6. Associate with the TAP backend (starts kernel-side processing)
struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
```
## Performance Comparison
| Metric | userspace virtio-net | vhost-net |
|--------|---------------------|-----------|
| Throughput (1500 MTU) | ~5 Gbps | ~8 Gbps |
| Throughput (Jumbo 9000) | ~8 Gbps | ~15 Gbps |
| Latency (ping) | ~200 µs | ~80 µs |
| CPU usage | Higher | 30-50% lower |
| Context switches | Many | Minimal |
## Configuration Examples
### Minimal VM with Networking
```json
{
"vcpus": 2,
"memory_mib": 512,
"kernel": "vmlinux",
"network": [{
"id": "eth0",
"mode": "networkd",
"bridge": "br0",
"mac": "52:54:00:12:34:56",
"vhost": true
}]
}
```
### Multi-NIC VM
```json
{
"network": [
{
"id": "mgmt",
"bridge": "br-mgmt",
"vhost": true
},
{
"id": "data",
"bridge": "br-data",
"mtu": 9000,
"vhost": true,
"multiqueue": 4
}
]
}
```
## Error Handling
| Error | Cause | Recovery |
|-------|-------|----------|
| TAP creation timeout | networkd slow/unresponsive | Retry with backoff, fall back to direct creation |
| vhost-net open fails | Module not loaded | Fall back to userspace virtio-net |
| Bridge not found | Infrastructure not set up | Create bridge or fail with clear error |
| MAC conflict | Duplicate MAC on bridge | Auto-regenerate MAC |
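The MAC-conflict recovery can regenerate a locally administered address deterministically from the VM UUID plus an attempt counter; the `52:54:00` prefix matches the examples above, while `regen_mac` and the hash-based derivation are illustrative assumptions:
```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Regenerate a MAC in the 52:54:00 (QEMU/KVM locally administered)
/// prefix from the VM UUID plus an attempt counter, so a retry after
/// a conflict yields a different but still deterministic address.
fn regen_mac(vm_uuid: &str, attempt: u32) -> String {
    let mut h = DefaultHasher::new();
    vm_uuid.hash(&mut h);
    attempt.hash(&mut h);
    let v = h.finish();
    format!(
        "52:54:00:{:02x}:{:02x}:{:02x}",
        (v >> 16) as u8,
        (v >> 8) as u8,
        v as u8
    )
}

fn main() {
    let mac = regen_mac("f47ac10b-58cc-4372-a567-0e02b2c3d479", 0);
    assert!(mac.starts_with("52:54:00:"));
    assert_eq!(mac.len(), 17);
}
```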
## Future Enhancements
1. **SR-IOV Passthrough**: Direct VF assignment for bare-metal performance
2. **DPDK Backend**: Alternative to TAP for ultra-low-latency
3. **virtio-vhost-user**: Offload to separate process for isolation
4. **Network Namespace Integration**: Per-VM network namespaces for isolation

---
# Stellarium: Unified Storage Architecture for Volt
> *"Every byte has a home. Every home is shared. Nothing is stored twice."*
## 1. Vision Statement
**Stellarium** is a revolutionary storage architecture that treats storage not as isolated volumes, but as a **unified content-addressed stellar cloud** where every unique byte exists exactly once, and every VM draws from the same constellation of data.
### What Makes This Revolutionary
Traditional VM storage operates on a fundamental lie: that each VM has its own dedicated disk. This creates:
- **Massive redundancy** — 1000 Debian VMs = 1000 copies of libc
- **Slow boots** — Each VM reads its own copy of boot files
- **Wasted IOPS** — Page cache misses everywhere
- **Memory bloat** — Same data cached N times
**Stellarium inverts this model.** Instead of VMs owning storage, **storage serves VMs through a unified content mesh**. The result:
| Metric | Traditional | Stellarium | Improvement |
|--------|-------------|------------|-------------|
| Storage per 1000 Debian VMs | 10 TB | 12 GB + deltas | **833x** |
| Cold boot time | 2-5s | <50ms | **40-100x** |
| Memory efficiency | 1 GB/VM | ~50 MB shared core | **20x** |
| IOPS for identical reads | N | 1 | **Nx** |
---
## 2. Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│ STELLARIUM LAYERS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Volt │ │ Volt │ │ Volt │ VM Layer │
│ │ microVM │ │ microVM │ │ microVM │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────┴────────────────┴────────────────┴──────┐ │
│ │ STELLARIUM VirtIO Driver │ Driver │
│ │ (Memory-Mapped CAS Interface) │ Layer │
│ └──────────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴────────────────────────┐ │
│ │ NOVA-STORE │ Store │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Layer │
│ │ │ TinyVol │ │ShareVol │ │ DeltaVol│ │ │
│ │ │ Manager │ │ Manager │ │ Manager │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ │ └───────────┴───────────┘ │ │
│ │ │ │ │
│ │ ┌────────────────┴────────────────┐ │ │
│ │ │ PHOTON (Content Router) │ │ │
│ │ │ Hot→Memory Warm→NVMe Cold→S3 │ │ │
│ │ └────────────────┬────────────────┘ │ │
│ └───────────────────┼──────────────────────────┘ │
│ │ │
│ ┌───────────────────┴──────────────────────────┐ │
│ │ NEBULA (CAS Core) │ Foundation │
│ │ │ Layer │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │ │
│ │ │ Chunk │ │ Block │ │ Distributed │ │ │
│ │ │ Packer │ │ Dedup │ │ Hash Index │ │ │
│ │ └─────────┘ └─────────┘ └─────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ COSMIC MESH (Distributed CAS) │ │ │
│ │ │ Local NVMe ←→ Cluster ←→ Object Store │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### Core Components
#### NEBULA: Content-Addressable Storage Core
The foundation layer. Every piece of data is:
- **Chunked** using content-defined chunking (CDC) with FastCDC algorithm
- **Hashed** with BLAKE3 (256-bit, hardware-accelerated)
- **Deduplicated** at write time via hash lookup
- **Stored once** regardless of how many VMs reference it
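The insert-or-reference write path can be sketched with a map keyed by content hash. The real design keys on BLAKE3; a std hasher stands in here to keep the sketch dependency-free, and `Nebula`/`put` are illustrative names:
```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: each unique chunk is stored once;
/// a duplicate write only bumps the reference count.
struct Nebula {
    chunks: HashMap<u64, (Vec<u8>, u32)>, // hash -> (data, refcount)
}

impl Nebula {
    fn new() -> Self {
        Nebula { chunks: HashMap::new() }
    }

    /// Insert-or-reference: returns the content hash either way.
    fn put(&mut self, data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        let key = h.finish();
        match self.chunks.entry(key) {
            Entry::Occupied(mut e) => e.get_mut().1 += 1, // dedup hit
            Entry::Vacant(e) => {
                e.insert((data.to_vec(), 1)); // first and only copy
            }
        }
        key
    }

    fn refcount(&self, key: u64) -> u32 {
        self.chunks.get(&key).map_or(0, |c| c.1)
    }
}

fn main() {
    let mut store = Nebula::new();
    let k1 = store.put(b"libc chunk");
    let k2 = store.put(b"libc chunk"); // identical write: no new storage
    assert_eq!(k1, k2);
    assert_eq!(store.refcount(k1), 2);
}
```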
#### PHOTON: Intelligent Content Router
Manages data placement across the storage hierarchy:
- **L1 (Hot)**: Memory-mapped, instant access, boot-critical data
- **L2 (Warm)**: NVMe, sub-millisecond, working set
- **L3 (Cool)**: SSD, single-digit ms, recent data
- **L4 (Cold)**: Object storage (S3/R2), archival
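Placement can start as a simple heat score over access frequency and recency; the thresholds and `Tier`/`classify` names below are illustrative assumptions, and a real policy would also weigh boot-criticality hints:
```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Hot,  // L1: memory-mapped
    Warm, // L2: NVMe
    Cool, // L3: SSD
    Cold, // L4: object storage
}

/// Classify a chunk by how often and how recently it was accessed.
fn classify(accesses_per_hour: u32, secs_since_access: u64) -> Tier {
    match (accesses_per_hour, secs_since_access) {
        (a, s) if a >= 100 && s < 60 => Tier::Hot,
        (a, _) if a >= 10 => Tier::Warm,
        (_, s) if s < 86_400 => Tier::Cool, // touched within a day
        _ => Tier::Cold,
    }
}

fn main() {
    assert_eq!(classify(500, 5), Tier::Hot);
    assert_eq!(classify(20, 3_600), Tier::Warm);
    assert_eq!(classify(1, 600), Tier::Cool);
    assert_eq!(classify(0, 604_800), Tier::Cold);
}
```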
#### NOVA-STORE: Volume Abstraction Layer
Presents traditional block/file interfaces to VMs while backed by CAS:
- **TinyVol**: Ultra-lightweight volumes with minimal metadata
- **ShareVol**: Copy-on-write shared volumes
- **DeltaVol**: Delta-encoded writable layers
---
## 3. Key Innovations
### 3.1 Stellar Deduplication
**Innovation**: Inline deduplication with zero write amplification.
Traditional dedup:
```
Write → Buffer → Hash → Lookup → Decide → Store
(copy) (wait) (maybe copy again)
```
Stellar dedup:
```
Write → Hash-while-streaming → CAS Insert (atomic)
(no buffer needed) (single write or reference)
```
**Implementation**:
```rust
struct StellarChunk {
hash: Blake3Hash, // 32 bytes
    size: u16,              // 2 bytes (stores length - 1, so 64 KiB chunks fit)
refs: AtomicU32, // 4 bytes - reference count
tier: AtomicU8, // 1 byte - storage tier
flags: u8, // 1 byte - compression, encryption
// Total: 40 bytes metadata per chunk
}
// Hash table: 40 bytes × 1B chunks = 40GB of index; at a ~40KB average
// chunk size that covers ~40TB of unique data. Fits in memory on modern servers.
```
### 3.2 TinyVol: Minimal Volume Overhead
**Innovation**: Volumes as tiny manifest files, not pre-allocated space.
```
Traditional qcow2: Header (512B) + L1 Table + L2 Tables + Refcount...
Minimum overhead: ~512KB even for empty volume
TinyVol: Just a manifest pointing to chunks
Overhead: 64 bytes base + 48 bytes per modified chunk
Empty 10GB volume: 64 bytes
1GB modified: 64B + (1GB/64KB × 48B) = ~768KB
```
**Structure**:
```rust
struct TinyVol {
magic: [u8; 8], // "TINYVOL\0"
version: u32,
flags: u32,
base_image: Blake3Hash, // Optional parent
size_bytes: u64,
chunk_map: BTreeMap<ChunkIndex, ChunkRef>,
}
struct ChunkRef {
hash: Blake3Hash, // 32 bytes
    offset_in_vol: [u8; 6], // 6 bytes (48-bit offset; Rust has no u48)
len: u16, // 2 bytes
flags: u64, // 8 bytes (CoW, compressed, etc.)
}
```
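The overhead figures quoted above follow directly from these sizes (64-byte header, 48-byte ChunkRef, 64 KiB chunks); a quick check:
```rust
/// Manifest overhead for a TinyVol: a fixed 64-byte header plus one
/// 48-byte ChunkRef per modified 64 KiB chunk (sizes from the struct
/// definitions above).
fn tinyvol_overhead(modified_bytes: u64) -> u64 {
    const HEADER: u64 = 64;
    const CHUNK: u64 = 64 * 1024;
    const REF: u64 = 48;
    HEADER + ((modified_bytes + CHUNK - 1) / CHUNK) * REF // ceil-divide
}

fn main() {
    // Empty 10 GB volume: just the header.
    assert_eq!(tinyvol_overhead(0), 64);
    // 1 GiB modified: 64 + 16384 * 48 = 786_496 bytes, i.e. ~768 KiB.
    assert_eq!(tinyvol_overhead(1 << 30), 786_496);
}
```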
### 3.3 ShareVol: Zero-Copy Shared Volumes
**Innovation**: Multiple VMs share read paths, with instant copy-on-write.
```
Traditional Shared Storage:
VM1 reads /lib/libc.so → Disk read → VM1 memory
VM2 reads /lib/libc.so → Disk read → VM2 memory
(Same data read twice, stored twice in RAM)
ShareVol:
VM1 reads /lib/libc.so → Shared mapping (already in memory)
VM2 reads /lib/libc.so → Same shared mapping
(Single read, single memory location, N consumers)
```
**Memory-Mapped CAS**:
```rust
// Shared content is memory-mapped once
struct SharedMapping {
hash: Blake3Hash,
mmap_addr: *const u8,
mmap_len: usize,
vm_refs: AtomicU32, // How many VMs reference this
last_access: AtomicU64, // For eviction
}
// VMs get read-only mappings to shared content
// Write attempts trigger CoW into TinyVol delta layer
```
### 3.4 Cosmic Packing: Small File Optimization
**Innovation**: Pack small files into larger chunks without losing addressability.
Problem: Millions of small files (< 4KB) waste space at chunk boundaries.
Solution: **Cosmic Packs** — aggregated storage with inline index:
```
┌─────────────────────────────────────────────────┐
│ COSMIC PACK (64KB) │
├─────────────────────────────────────────────────┤
│ Header (64B) │
│ - magic, version, entry_count │
├─────────────────────────────────────────────────┤
│ Index (variable, ~100B per entry) │
│ - [hash, offset, len, flags] × N │
├─────────────────────────────────────────────────┤
│ Data (remaining space) │
│ - Packed file contents │
└─────────────────────────────────────────────────┘
```
**Benefit**: 1000 × 100-byte files is only ~100 KB of raw data, but addressing each file as its own chunk would cost more in index metadata than in content. Packed, the same files occupy a couple of 64 KB Cosmic Packs, and every file remains individually addressable by hash.
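A pack can be built as one byte buffer with an inline index mapping hash to (offset, len). `CosmicPack` and its std-hash keys are illustrative; the design above uses BLAKE3, a 64-byte binary header, and a 64 KiB size cap:
```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy Cosmic Pack: small files appended to one buffer, individually
/// addressable through an inline index.
struct CosmicPack {
    data: Vec<u8>,
    index: HashMap<u64, (usize, usize)>, // hash -> (offset, len)
}

impl CosmicPack {
    fn new() -> Self {
        CosmicPack { data: Vec::new(), index: HashMap::new() }
    }

    /// Append a small file and record where it landed.
    fn add(&mut self, file: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        file.hash(&mut h);
        let key = h.finish();
        let offset = self.data.len();
        self.data.extend_from_slice(file);
        self.index.insert(key, (offset, file.len()));
        key
    }

    /// Addressability is retained: any packed file is fetched by hash.
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.index.get(&key).map(|&(off, len)| &self.data[off..off + len])
    }
}

fn main() {
    let mut pack = CosmicPack::new();
    let k = pack.add(b"tiny config file");
    pack.add(b"another small file");
    assert_eq!(pack.get(k), Some(&b"tiny config file"[..]));
}
```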
### 3.5 Stellar Boot: Sub-50ms VM Start
**Innovation**: Boot data is pre-staged in memory before VM starts.
```
Boot Sequence Comparison:
Traditional:
t=0ms VMM starts
t=5ms BIOS loads
t=50ms Kernel requested
t=100ms Kernel loaded from disk
t=200ms initrd loaded
t=500ms Root FS mounted
t=2000ms Boot complete
Stellar Boot:
t=-50ms Boot manifest analyzed (during scheduling)
t=-25ms Hot chunks pre-faulted to memory
t=0ms VMM starts with memory-mapped boot data
t=5ms Kernel executes (already in memory)
t=15ms initrd processed (already in memory)
t=40ms Root FS ready (ShareVol, pre-mapped)
t=50ms Boot complete
```
**Boot Manifest**:
```rust
struct BootManifest {
kernel: Blake3Hash,
initrd: Option<Blake3Hash>,
root_vol: TinyVolRef,
// Predicted hot chunks for first 100ms
prefetch_set: Vec<Blake3Hash>,
// Memory layout hints
kernel_load_addr: u64,
initrd_load_addr: Option<u64>,
}
```
### 3.6 CDN-Native Distribution: Voltainer Integration
**Innovation**: Images distributed via CDN, layers indexed directly in NEBULA.
```
Traditional (Registry-based):
Registry API → Pull manifest → Pull layers → Extract → Overlay FS
(Complex protocol, copies data, registry infrastructure required)
Stellarium + CDN:
HTTPS GET manifest → HTTPS GET missing chunks → Mount
(Simple HTTP, zero extraction, CDN handles global distribution)
```
**CDN-Native Architecture**:
```
┌─────────────────────────────────────────────────────────────────┐
│ CDN-NATIVE DISTRIBUTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ cdn.armoredgate.com/ │
│ ├── manifests/ │
│ │ └── {blake3-hash}.json ← Image/layer manifests │
│ └── blobs/ │
│ └── {blake3-hash} ← Raw content chunks │
│ │
│ Benefits: │
│ ✓ No registry daemon to run │
│ ✓ No registry protocol complexity │
│ ✓ Global edge caching built-in │
│ ✓ Simple HTTPS GET (curl-debuggable) │
│ ✓ Content-addressed = perfect cache keys │
│ ✓ Dedup at CDN level (same hash = same edge cache) │
│ │
└─────────────────────────────────────────────────────────────────┘
```
**Implementation**:
```rust
struct CdnDistribution {
    base_url: String, // "https://cdn.armoredgate.com"
}

impl CdnDistribution {
    async fn fetch_manifest(&self, hash: &Blake3Hash) -> Result<ImageManifest> {
        let url = format!("{}/manifests/{}.json", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        Ok(resp.json().await?)
    }
    async fn fetch_chunk(&self, hash: &Blake3Hash) -> Result<Vec<u8>> {
        let url = format!("{}/blobs/{}", self.base_url, hash);
        let resp = reqwest::get(&url).await?;
        // Verify content hash matches (integrity check)
        let data = resp.bytes().await?;
        assert_eq!(blake3::hash(&data), *hash);
        Ok(data.to_vec())
    }
    async fn fetch_missing(&self, needed: &[Blake3Hash], local: &Nebula) -> Result<()> {
        // Only fetch chunks we don't have locally
        let missing: Vec<_> = needed.iter()
            .filter(|h| !local.exists(h))
            .collect();
        // Parallel fetch from CDN
        futures::future::join_all(
            missing.iter().map(|h| self.fetch_and_store(h, local))
        ).await;
        Ok(())
    }
}
struct VoltainerImage {
manifest_hash: Blake3Hash,
layers: Vec<LayerRef>,
}
struct LayerRef {
hash: Blake3Hash, // Content hash (CDN path)
stellar_manifest: TinyVol, // Direct mapping to Stellar chunks
}
// Voltainer pull = simple CDN fetch
async fn voltainer_pull(image: &str, cdn: &CdnDistribution, nebula: &Nebula) -> Result<VoltainerImage> {
// 1. Resolve image name to manifest hash (local index or CDN lookup)
let manifest_hash = resolve_image_hash(image).await?;
// 2. Fetch manifest from CDN
let manifest = cdn.fetch_manifest(&manifest_hash).await?;
// 3. Fetch only missing chunks (dedup-aware)
let needed_chunks = manifest.all_chunk_hashes();
cdn.fetch_missing(&needed_chunks, nebula).await?;
// 4. Image is ready - no extraction, layers ARE the storage
Ok(VoltainerImage::from_manifest(manifest))
}
```
**Voltainer Integration**:
```rust
// Voltainer (systemd-nspawn based) uses Stellarium directly
impl VoltainerRuntime {
async fn create_container(&self, image: &VoltainerImage) -> Result<Container> {
// Layers are already in NEBULA, just create overlay view
let rootfs = self.stellarium.create_overlay_view(&image.layers)?;
// systemd-nspawn mounts the Stellarium-backed rootfs
let container = systemd_nspawn::Container::new()
.directory(&rootfs)
.private_network(true)
.boot(false)
.spawn()?;
Ok(container)
}
}
```
### 3.7 Memory-Storage Convergence
**Innovation**: Memory and storage share the same backing, eliminating double-buffering.
```
Traditional:
Storage: [Block Device] → [Page Cache] → [VM Memory]
(data copied twice)
Stellarium:
Unified: [CAS Memory Map] ←──────────→ [VM Memory View]
(single location, two views)
```
**DAX-Style Direct Access**:
```rust
// VM sees storage as memory-mapped region
struct StellarBlockDevice {
    volumes: Vec<TinyVol>,
}

impl StellarBlockDevice {
    fn handle_read(&self, offset: u64, len: u32) -> &[u8] {
        let chunk = self.volumes[0].chunk_at(offset);
        let mapping = photon.get_or_map(chunk.hash);
        &mapping[chunk.local_offset..][..len as usize]
    }
    // Writes go to delta layer
    fn handle_write(&mut self, offset: u64, data: &[u8]) {
        self.volumes[0].write_delta(offset, data);
    }
}
```
---
## 4. Density Targets
### Storage Efficiency
| Scenario | Traditional | Stellarium | Target |
|----------|-------------|------------|--------|
| 1000 Ubuntu 22.04 VMs | 2.5 TB | 2.8 GB shared + 10 MB/VM avg delta | **99.6% reduction** |
| 10000 Python app VMs (same base) | 25 TB | 2.8 GB + 5 MB/VM | **99.8% reduction** |
| Mixed workload (100 unique bases) | 2.5 TB | 50 GB shared + 20 MB/VM avg | **94% reduction** |
### Memory Efficiency
| Component | Traditional | Stellarium | Target |
|-----------|-------------|------------|--------|
| Kernel (per VM) | 8-15 MB | Shared (~0 marginal) | **99%+ reduction** |
| libc (per VM) | 2 MB | Shared | **99%+ reduction** |
| Page cache duplication | High | Zero | **100% reduction** |
| Effective RAM per VM | 512 MB - 1 GB | 50-100 MB unique | **5-10x improvement** |
### Performance
| Metric | Traditional | Stellarium Target |
|--------|-------------|-------------------|
| Cold boot (minimal VM) | 500ms - 2s | < 50ms |
| Warm boot (pre-cached) | 100-500ms | < 20ms |
| Clone time (full copy) | 10-60s | < 1ms (CoW instant) |
| Dedup ratio (homogeneous) | N/A | 50:1 to 1000:1 |
| IOPS (deduplicated reads) | N | 1 |
### Density Goals
| Scenario | Traditional (64GB RAM host) | Stellarium Target |
|----------|------------------------------|-------------------|
| Minimal VMs (32MB each) | ~1000 | 5000-10000 |
| Small VMs (128MB each) | ~400 | 2000-4000 |
| Medium VMs (512MB each) | ~100 | 500-1000 |
| Storage per 10K VMs | 10-50 TB | 10-50 GB |
---
## 5. Integration with Volt VMM
### Boot Path Integration
```rust
// Volt VMM integration
impl VoltVmm {
fn boot_with_stellarium(&mut self, manifest: BootManifest) -> Result<()> {
// 1. Pre-fault boot chunks to L1 (memory)
let prefetch_handle = stellarium.prefetch(&manifest.prefetch_set);
// 2. Set up memory-mapped kernel
let kernel_mapping = stellarium.map_readonly(&manifest.kernel);
self.load_kernel_direct(kernel_mapping);
// 3. Set up memory-mapped initrd (if present)
if let Some(initrd) = &manifest.initrd {
let initrd_mapping = stellarium.map_readonly(initrd);
self.load_initrd_direct(initrd_mapping);
}
// 4. Configure VirtIO-Stellar device
self.add_stellar_blk(manifest.root_vol)?;
// 5. Ensure prefetch complete
prefetch_handle.wait();
// 6. Boot
self.start()
}
}
```
### VirtIO-Stellar Driver
Custom VirtIO block device that speaks Stellarium natively:
```rust
struct VirtioStellarConfig {
// Standard virtio-blk compatible
capacity: u64,
size_max: u32,
seg_max: u32,
// Stellarium extensions
stellar_features: u64, // STELLAR_F_SHAREVOL, STELLAR_F_DEDUP, etc.
vol_hash: Blake3Hash, // Volume identity
shared_regions: u32, // Number of pre-shared regions
}
// Request types (extends standard virtio-blk)
enum StellarRequest {
Read { sector: u64, len: u32 },
Write { sector: u64, data: Vec<u8> },
// Stellarium extensions
MapShared { hash: Blake3Hash }, // Map shared chunk directly
QueryDedup { sector: u64 }, // Check if sector is deduplicated
Prefetch { sectors: Vec<u64> }, // Hint upcoming reads
}
```
### Snapshot and Restore
```rust
// Instant snapshots via TinyVol CoW
fn snapshot_vm(vm: &VoltVm) -> VmSnapshot {
VmSnapshot {
// Memory as Stellar chunks
memory_chunks: stellarium.chunk_memory(vm.memory_region()),
// Volume is already CoW - just reference
root_vol: vm.root_vol.clone_manifest(),
// CPU state is tiny
cpu_state: vm.save_cpu_state(),
}
}
// Restore from snapshot
fn restore_vm(snapshot: &VmSnapshot) -> VoltVm {
let mut vm = VoltVm::new();
// Memory is mapped directly from Stellar chunks
vm.map_memory_from_stellar(&snapshot.memory_chunks);
// Volume manifest is loaded (no data copy)
vm.attach_vol(snapshot.root_vol.clone());
// Restore CPU state
vm.restore_cpu_state(&snapshot.cpu_state);
vm
}
```
### Live Migration with Dedup
```rust
// Only transfer unique chunks during migration
async fn migrate_vm(vm: &VoltVm, target: &NodeAddr) -> Result<()> {
// 1. Get list of chunks VM references
let vm_chunks = vm.collect_chunk_refs();
// 2. Query target for chunks it already has
let target_has = target.query_chunks(&vm_chunks).await?;
// 3. Transfer only missing chunks
let missing = vm_chunks.difference(&target_has);
target.receive_chunks(&missing).await?;
// 4. Transfer tiny metadata
target.receive_manifest(&vm.root_vol).await?;
target.receive_memory_manifest(&vm.memory_chunks).await?;
// 5. Final state sync and switchover
vm.pause();
target.receive_final_state(vm.cpu_state()).await?;
target.resume().await?;
Ok(())
}
```
---
## 6. Implementation Priorities
### Phase 1: Foundation (Month 1-2)
**Goal**: Core CAS and basic volume support
1. **NEBULA Core**
- BLAKE3 hashing with SIMD acceleration
- In-memory hash table (robin hood hashing)
- Basic chunk storage (local NVMe)
- Reference counting
2. **TinyVol v1**
- Manifest format
- Read-only volume mounting
- Basic CoW writes
3. **VirtIO-Stellar Driver**
- Basic block interface
- Integration with Volt
**Deliverable**: Boot a VM from Stellarium storage
### Phase 2: Deduplication (Month 2-3)
**Goal**: Inline dedup with zero performance regression
1. **Inline Deduplication**
- Write path with hash-first
- Atomic insert-or-reference
- Dedup metrics/reporting
2. **Content-Defined Chunking**
- FastCDC implementation
- Tuned for VM workloads
3. **Base Image Sharing**
- ShareVol implementation
- Multiple VMs sharing base
**Deliverable**: 10:1+ dedup ratio for homogeneous VMs
### Phase 3: Performance (Month 3-4)
**Goal**: Sub-50ms boot, memory convergence
1. **PHOTON Tiering**
- Hot/warm/cold classification
- Automatic promotion/demotion
- Memory-mapped hot tier
2. **Boot Optimization**
- Boot manifest analysis
- Prefetch implementation
- Zero-copy kernel loading
3. **Memory-Storage Convergence**
- DAX-style direct access
- Shared page elimination
**Deliverable**: <50ms cold boot, memory sharing active
### Phase 4: Density (Month 4-5)
**Goal**: 10000+ VMs per host achievable
1. **Small File Packing**
- Cosmic Pack implementation
- Inline file storage
2. **Aggressive Sharing**
- Cross-VM page dedup
- Kernel/library sharing
3. **Memory Pressure Handling**
- Intelligent eviction
- Graceful degradation
**Deliverable**: 5000+ density on 64GB host
### Phase 5: Distribution (Month 5-6)
**Goal**: Multi-node Stellarium cluster
1. **Cosmic Mesh**
- Distributed hash index
- Cross-node chunk routing
- Consistent hashing for placement
2. **Migration Optimization**
- Chunk pre-staging
- Delta transfers
3. **Object Storage Backend**
- S3/R2 cold tier
- Async writeback
**Deliverable**: Seamless multi-node storage
### Phase 6: Voltainer + CDN Native (Month 6-7)
**Goal**: Voltainer containers as first-class citizens, CDN-native distribution
1. **CDN Distribution Layer**
- Manifest/chunk fetch from ArmoredGate CDN
- Parallel chunk retrieval
- Edge cache warming strategies
2. **Voltainer Integration**
- Direct Stellarium mount for systemd-nspawn
- Shared layers between Voltainer containers and Volt VMs
- Unified storage for both runtimes
3. **Layer Mapping**
- Direct layer registration in NEBULA
- No extraction needed
- Content-addressed = perfect CDN cache keys
**Deliverable**: Voltainer containers boot in <100ms, unified with VM storage
---
## 7. Name: **Stellarium**
### Why Stellarium?
Continuing the cosmic theme of **Stardust** (cluster) and **Volt** (VMM):
- **Stellar** = Star-like, exceptional, relating to stars
- **-arium** = A place for (like aquarium, planetarium)
- **Stellarium** = "A place for stars" — where all your VMs' data lives
### Component Names (Cosmic Theme)
| Component | Name | Meaning |
|-----------|------|---------|
| CAS Core | **NEBULA** | Birthplace of stars, cloud of shared matter |
| Content Router | **PHOTON** | Light-speed data movement |
| Chunk Packer | **Cosmic Pack** | Aggregating cosmic dust |
| Volume Manager | **Nova-Store** | Connects to Volt |
| Distributed Mesh | **Cosmic Mesh** | Interconnected universe |
| Boot Optimizer | **Stellar Boot** | Star-like speed |
| Small File Pack | **Cosmic Dust** | Tiny particles aggregated |
### Taglines
- *"Every byte a star. Every star shared."*
- *"The storage that makes density possible."*
- *"Where VMs find their data, instantly."*
---
## 8. Summary
**Stellarium** transforms storage from a per-VM liability into a shared asset. By treating all data as content-addressed chunks in a unified namespace:
1. **Deduplication becomes free** — No extra work, it's the storage model
2. **Sharing becomes default** — VMs reference, not copy
3. **Boot becomes instant** — Data is pre-positioned
4. **Density becomes extreme** — 10-100x more VMs per host
5. **Migration becomes trivial** — Only ship unique data
Combined with Volt's minimal VMM overhead, Stellarium enables the original ArmoredContainers vision: **VM isolation at container density, with VM security guarantees**.
### The Stellarium Promise
> On a 64GB host with 2TB NVMe:
> - **10,000+ microVMs** running simultaneously
> - **50GB total storage** for 10,000 Debian-based workloads
> - **<50ms** boot time for any VM
> - **Instant** cloning and snapshots
> - **Seamless** live migration
This isn't incremental improvement. This is a **new storage paradigm** for the microVM era.
---
*Stellarium: The stellar storage for stellar density.*