Volt VMM (Neutron Stardust): source-available under AGPSL v5.0

KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
Karl Clinger
2026-03-21 01:04:35 -05:00
commit 40ed108dd5
143 changed files with 50300 additions and 0 deletions

@@ -0,0 +1,154 @@
# Seccomp-BPF Implementation Notes
## Overview
Volt now includes seccomp-BPF system call filtering as a critical security layer. After all VMM initialization is complete (KVM VM created, memory allocated, kernel loaded, devices initialized, API socket bound), a strict syscall allowlist is applied. Any syscall not on the allowlist immediately kills the process with `SECCOMP_RET_KILL_PROCESS`.
## Architecture
### Security Layer Stack
```
┌─────────────────────────────────────────────────────────┐
│ Layer 5: Seccomp-BPF (always unless --no-seccomp) │
│ 72 syscalls allowed, all others → KILL │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Landlock (optional, kernel 5.13+) │
│ Filesystem path restrictions │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Capability dropping (always) │
│ Drop all ambient capabilities │
├─────────────────────────────────────────────────────────┤
│ Layer 2: PR_SET_NO_NEW_PRIVS (always) │
│ Prevent privilege escalation │
├─────────────────────────────────────────────────────────┤
│ Layer 1: KVM isolation (inherent) │
│ Hardware virtualization boundary │
└─────────────────────────────────────────────────────────┘
```
### Application Timing
The seccomp filter is applied in `main.rs` at a specific point in the startup sequence:
```
1. Parse CLI / validate config
2. Initialize KVM system handle
3. Create VM (IRQ chip, PIT)
4. Set up guest memory regions
5. Load kernel (PVH boot protocol)
6. Initialize devices (serial, virtio)
7. Create vCPUs
8. Set up signal handlers
9. Spawn API server task
10. ** Apply Landlock **
11. ** Drop capabilities **
12. ** Apply seccomp filter ** ← HERE
13. Start vCPU run loop
14. Wait for shutdown
```
This ordering is critical:
- Before seccomp: All privileged operations (opening /dev/kvm, mmap'ing guest memory, loading kernel files, binding sockets) are complete.
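- After seccomp: Only the 72 syscalls needed for steady-state operation are allowed.
- We use `apply_filter_all_threads` (TSYNC) so the filter is also installed on threads that already exist at this point, such as the API server's runtime threads. Threads spawned afterwards, including vCPU threads started later, inherit it automatically.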
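The hardening part of this sequence (steps 10 to 12) can be sketched as a small planning function. This is a minimal illustration, not the code in `main.rs`: `HardeningStep` and `hardening_plan` are hypothetical names, and the two boolean inputs are assumed stand-ins for a Landlock availability check and the `--no-seccomp` flag.

```rust
/// Hardening steps applied after initialization, in the order listed above.
/// Hypothetical names for illustration; not taken from the Volt sources.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HardeningStep {
    Landlock,
    DropCapabilities,
    Seccomp,
}

/// Returns the steps to run, in order. Landlock is optional (it needs
/// kernel support), capability dropping always runs, and seccomp runs
/// last unless it was explicitly disabled.
fn hardening_plan(landlock_supported: bool, no_seccomp: bool) -> Vec<HardeningStep> {
    let mut plan = Vec::new();
    if landlock_supported {
        plan.push(HardeningStep::Landlock);
    }
    plan.push(HardeningStep::DropCapabilities);
    if !no_seccomp {
        plan.push(HardeningStep::Seccomp);
    }
    plan
}
```

Seccomp has to come last in such a plan because the `landlock_*` syscalls used by the earlier step are not part of the allowlist below.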
## Syscall Allowlist (72 syscalls)
### File I/O (10)
`read`, `write`, `openat`, `close`, `fstat`, `lseek`, `pread64`, `pwrite64`, `readv`, `writev`
### Memory Management (6)
`mmap`, `mprotect`, `munmap`, `brk`, `madvise`, `mremap`
### KVM / Device Control (1)
`ioctl` — The core VMM syscall. KVM_RUN, KVM_SET_REGS, KVM_CREATE_VCPU, and all other KVM operations go through ioctl. We allow all ioctls rather than filtering by ioctl number because:
- The KVM fd-based security model already scopes access
- Filtering by ioctl number would be fragile across kernel versions
- The BPF program size would explode
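For context on what filtering by ioctl number would involve, the sketch below reconstructs a few KVM request numbers from the generic Linux ioctl encoding. The constant names follow the kernel headers, while the `ioc` helper is a hypothetical name, and the 144-byte size of `struct kvm_regs` is an x86_64 assumption. Because the argument size is baked into the request number, an allowlist of numbers is tied to the target architecture's struct layouts.

```rust
// Generic Linux ioctl layout (used on x86_64):
// bits 0-7 command number, 8-15 type, 16-29 argument size, 30-31 direction.
const IOC_NONE: u32 = 0;
const IOC_WRITE: u32 = 1;
const IOC_READ: u32 = 2;

/// Packs the four fields into a request number (hypothetical helper,
/// equivalent in spirit to the kernel's _IO/_IOR/_IOW macros).
const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

/// ioctl "type" byte reserved for KVM.
const KVMIO: u32 = 0xAE;
/// Assumption: x86_64 `struct kvm_regs` holds 18 u64 registers.
const KVM_REGS_SIZE: u32 = 18 * 8;

const KVM_CREATE_VCPU: u32 = ioc(IOC_NONE, KVMIO, 0x41, 0);
const KVM_RUN: u32 = ioc(IOC_NONE, KVMIO, 0x80, 0);
const KVM_GET_REGS: u32 = ioc(IOC_READ, KVMIO, 0x81, KVM_REGS_SIZE);
const KVM_SET_REGS: u32 = ioc(IOC_WRITE, KVMIO, 0x82, KVM_REGS_SIZE);
```

A number-based filter would need one comparison on the second syscall argument for each such constant the VMM issues.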
### Threading (7)
`clone`, `clone3`, `futex`, `set_robust_list`, `sched_yield`, `sched_getaffinity`, `rseq`
### Signals (4)
`rt_sigaction`, `rt_sigprocmask`, `rt_sigreturn`, `sigaltstack`
### Networking (18)
`accept4`, `bind`, `listen`, `socket`, `connect`, `recvfrom`, `sendto`, `recvmsg`, `sendmsg`, `shutdown`, `getsockname`, `getpeername`, `setsockopt`, `getsockopt`, `epoll_create1`, `epoll_ctl`, `epoll_wait`, `ppoll`
### Process Lifecycle (8)
`exit`, `exit_group`, `getpid`, `gettid`, `prctl`, `arch_prctl`, `prlimit64`, `tgkill`
### Timers (3)
`clock_gettime`, `nanosleep`, `clock_nanosleep`
### Miscellaneous (15)
`getrandom`, `eventfd2`, `timerfd_create`, `timerfd_settime`, `pipe2`, `dup`, `dup2`, `fcntl`, `statx`, `newfstatat`, `access`, `readlinkat`, `getcwd`, `unlink`, `unlinkat`
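A table like this is easy to get out of sync with its headline count, so a small self-check can help. The sketch below copies the names listed above into a static table and counts them. `ALLOWLIST`, `total_syscalls`, `group_count`, and `duplicates` are hypothetical names, and the real filter in `seccomp.rs` presumably works with syscall numbers rather than strings.

```rust
use std::collections::HashSet;

/// Allowlist groups as (category, syscall names), copied from the lists above.
const ALLOWLIST: &[(&str, &[&str])] = &[
    ("File I/O", &["read", "write", "openat", "close", "fstat", "lseek",
                   "pread64", "pwrite64", "readv", "writev"]),
    ("Memory Management", &["mmap", "mprotect", "munmap", "brk", "madvise", "mremap"]),
    ("KVM / Device Control", &["ioctl"]),
    ("Threading", &["clone", "clone3", "futex", "set_robust_list", "sched_yield",
                    "sched_getaffinity", "rseq"]),
    ("Signals", &["rt_sigaction", "rt_sigprocmask", "rt_sigreturn", "sigaltstack"]),
    ("Networking", &["accept4", "bind", "listen", "socket", "connect", "recvfrom",
                     "sendto", "recvmsg", "sendmsg", "shutdown", "getsockname",
                     "getpeername", "setsockopt", "getsockopt", "epoll_create1",
                     "epoll_ctl", "epoll_wait", "ppoll"]),
    ("Process Lifecycle", &["exit", "exit_group", "getpid", "gettid", "prctl",
                            "arch_prctl", "prlimit64", "tgkill"]),
    ("Timers", &["clock_gettime", "nanosleep", "clock_nanosleep"]),
    ("Miscellaneous", &["getrandom", "eventfd2", "timerfd_create", "timerfd_settime",
                        "pipe2", "dup", "dup2", "fcntl", "statx", "newfstatat",
                        "access", "readlinkat", "getcwd", "unlink", "unlinkat"]),
];

/// Total number of names across all groups.
fn total_syscalls() -> usize {
    ALLOWLIST.iter().map(|(_, names)| names.len()).sum()
}

/// Number of names in one category, or None if the category is unknown.
fn group_count(category: &str) -> Option<usize> {
    ALLOWLIST
        .iter()
        .find(|(name, _)| *name == category)
        .map(|(_, names)| names.len())
}

/// Names that appear more than once anywhere in the table.
fn duplicates() -> Vec<&'static str> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for (_, names) in ALLOWLIST {
        for name in *names {
            if !seen.insert(*name) {
                dups.push(*name);
            }
        }
    }
    dups
}
```

Counting the names as listed gives 72 in total with no duplicates, with 8 entries under Process Lifecycle and 15 under Miscellaneous.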
## Crate Choice
We use **`seccompiler` v0.5** from the rust-vmm project — the same crate Firecracker uses. Benefits:
- Battle-tested in production (millions of Firecracker microVMs)
- Pure Rust BPF compiler (no C dependencies)
- Supports argument-level filtering (we don't use it for ioctl, but could add later)
- `apply_filter_all_threads` for TSYNC support
## CLI Flag
`--no-seccomp` disables the filter entirely. This is for debugging only and emits a WARN-level log:
```
WARN volt-vmm::security::seccomp: Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.
```
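The default-on behaviour can be sketched as follows. This is an assumption-laden illustration rather than Volt's actual argument handling: `seccomp_enabled` and `startup_warning` are hypothetical helpers that scan raw argument strings, and only the flag name and the message text come from this document.

```rust
/// Message text taken from the log line shown above.
const SECCOMP_DISABLED_WARNING: &str =
    "Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.";

/// Returns true when the filter should be installed, i.e. when
/// `--no-seccomp` does not appear among the arguments.
fn seccomp_enabled(args: &[&str]) -> bool {
    !args.iter().any(|arg| *arg == "--no-seccomp")
}

/// Returns the warning to log when seccomp is disabled, or None when
/// the filter will be applied.
fn startup_warning(args: &[&str]) -> Option<&'static str> {
    if seccomp_enabled(args) {
        None
    } else {
        Some(SECCOMP_DISABLED_WARNING)
    }
}
```

Treating the flag as an opt-out keeps the secure configuration as the one that needs no extra arguments.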
## Testing
### Minimal kernel (bare metal ELF)
```bash
timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
# Output: "Hello from minimal kernel!" — seccomp active, VM runs normally
```
### Linux kernel (vmlinux 4.14)
```bash
timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
# Output: Full Linux boot up to VFS mount panic (expected without rootfs)
# Seccomp did NOT kill the process — all needed syscalls are allowed
```
### With seccomp disabled
```bash
timeout 5 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M --no-seccomp
# WARN logged, VM runs normally
```
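To check "seccomp did not kill the process" programmatically, a test harness can inspect how the VMM exited: a process terminated by `SECCOMP_RET_KILL_PROCESS` ends as if killed by SIGSYS. The sketch below assumes the VMM is spawned directly by the harness (not through `timeout`) and that SIGSYS is signal 31, as on Linux x86_64. `killed_by_seccomp` is a hypothetical helper name.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Assumption: SIGSYS is signal 31 (true on Linux x86_64; some
/// architectures use a different number).
const SIGSYS: i32 = 31;

/// Returns true if the child was terminated by SIGSYS, which is how a
/// seccomp KILL_PROCESS violation shows up to the parent process.
fn killed_by_seccomp(status: ExitStatus) -> bool {
    status.signal() == Some(SIGSYS)
}
```

A harness would pass the `ExitStatus` returned by `Child::wait` to this helper and fail the test when it returns true.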
## Comparison with Firecracker
| Feature | Firecracker | Volt |
|---------|-------------|-----------|
| Crate | seccompiler 0.4 | seccompiler 0.5 |
| Syscalls allowed | ~50 | ~72 |
| ioctl filtering | By KVM ioctl number | Allow all (fd-scoped) |
| Default action | KILL_PROCESS | KILL_PROCESS |
| Per-thread filters | Yes (API vs vCPU) | Single filter (TSYNC) |
| Disable flag | No (always on) | `--no-seccomp` for debug |
Volt allows slightly more syscalls because:
1. We include tokio runtime syscalls (epoll, clone3, rseq)
2. We include networking syscalls for the API socket
3. We include filesystem cleanup syscalls (unlink/unlinkat for socket cleanup)
## Future Improvements
1. **Per-thread filters**: Different allowlists for API thread vs vCPU threads (Firecracker does this)
2. **ioctl argument filtering**: Filter to only KVM_* ioctl numbers (adds ~20 BPF rules but tightens security)
3. **Audit mode**: Use `SECCOMP_RET_LOG` instead of `SECCOMP_RET_KILL_PROCESS` for development
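4. **Metrics**: Count seccomp violations via a SIGSYS handler. This requires `SECCOMP_RET_TRAP` for the counted syscalls, because `SECCOMP_RET_KILL_PROCESS` terminates the process without running any handler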
5. **Remove `--no-seccomp`**: Once the allowlist is proven stable in production
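The audit-mode idea (item 3) amounts to choosing a different default action for syscalls that miss the allowlist. A minimal sketch, assuming a hypothetical `FilterMode` switch that does not exist in Volt today; the two action values are the ones defined in the Linux UAPI header `<linux/seccomp.h>`.

```rust
// Return-action values from <linux/seccomp.h>.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;

/// Hypothetical switch between production and development behaviour.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FilterMode {
    /// Kill the whole process on any syscall outside the allowlist.
    Enforce,
    /// Let the syscall through, but have the kernel log it.
    Audit,
}

/// Default (mismatch) action the filter would return in each mode.
const fn default_action(mode: FilterMode) -> u32 {
    match mode {
        FilterMode::Enforce => SECCOMP_RET_KILL_PROCESS,
        FilterMode::Audit => SECCOMP_RET_LOG,
    }
}

/// True when a violation in this mode ends the process.
const fn is_fatal(mode: FilterMode) -> bool {
    default_action(mode) == SECCOMP_RET_KILL_PROCESS
}
```

`SECCOMP_RET_LOG` needs Linux 4.14 or newer, and an audit build should never be shipped because it allows every syscall.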
## Files
- `vmm/src/security/seccomp.rs` — Filter definition, build, and apply logic
- `vmm/src/security/mod.rs` — Module exports (also includes capabilities + landlock)
- `vmm/src/main.rs` — Integration point (after init, before vCPU run) + `--no-seccomp` flag
- `vmm/Cargo.toml` — `seccompiler = "0.5"` dependency