Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
154
docs/seccomp-implementation.md
Normal file
@@ -0,0 +1,154 @@
# Seccomp-BPF Implementation Notes

## Overview

Volt now includes seccomp-BPF system call filtering as a critical security layer. After all VMM initialization is complete (KVM VM created, memory allocated, kernel loaded, devices initialized, API socket bound), a strict syscall allowlist is applied. Any syscall not on the allowlist immediately kills the process with `SECCOMP_RET_KILL_PROCESS`.
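
The allow-or-kill decision described above can be modeled in a few lines. This is a minimal sketch, not Volt's implementation: the real filter is a BPF program that matches syscall numbers, while the hypothetical `decide` helper and `Action` enum below match names for readability.

```rust
use std::collections::HashSet;

/// The two outcomes described above (names are illustrative).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Action {
    /// The syscall is on the allowlist and proceeds normally.
    Allow,
    /// Stands in for `SECCOMP_RET_KILL_PROCESS`.
    KillProcess,
}

/// Hypothetical helper: anything not explicitly allowed is fatal.
pub fn decide(allowlist: &HashSet<&str>, syscall: &str) -> Action {
    if allowlist.contains(syscall) {
        Action::Allow
    } else {
        Action::KillProcess
    }
}
```

With an allowlist containing `ioctl`, `decide(&allowlist, "ioctl")` yields `Action::Allow`, while a syscall that is absent from the list, such as `execve`, yields `Action::KillProcess`.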

## Architecture

### Security Layer Stack

```
┌─────────────────────────────────────────────────────────┐
│ Layer 5: Seccomp-BPF (always unless --no-seccomp) │
│ 72 syscalls allowed, all others → KILL │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Landlock (optional, kernel 5.13+) │
│ Filesystem path restrictions │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Capability dropping (always) │
│ Drop all ambient capabilities │
├─────────────────────────────────────────────────────────┤
│ Layer 2: PR_SET_NO_NEW_PRIVS (always) │
│ Prevent privilege escalation │
├─────────────────────────────────────────────────────────┤
│ Layer 1: KVM isolation (inherent) │
│ Hardware virtualization boundary │
└─────────────────────────────────────────────────────────┘
```

### Application Timing

The seccomp filter is applied in `main.rs` at a specific point in the startup sequence:

```
1. Parse CLI / validate config
2. Initialize KVM system handle
3. Create VM (IRQ chip, PIT)
4. Set up guest memory regions
5. Load kernel (PVH boot protocol)
6. Initialize devices (serial, virtio)
7. Create vCPUs
8. Set up signal handlers
9. Spawn API server task
10. ** Apply Landlock **
11. ** Drop capabilities **
12. ** Apply seccomp filter ** ← HERE
13. Start vCPU run loop
14. Wait for shutdown
```

This ordering is critical:
- Before seccomp: All privileged operations (opening /dev/kvm, mmap'ing guest memory, loading kernel files, binding sockets) are complete.
- After seccomp: Only the 72 syscalls needed for steady-state operation are allowed.
- We use `apply_filter_all_threads` (TSYNC) so the filter is installed on every thread that already exists, not just the calling thread. Threads spawned afterwards, including vCPU threads, inherit it automatically.
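
To make the ordering constraint concrete, the startup steps listed above can be modeled as an ordered enum. This is an illustrative sketch only: `Stage` and `runs_unfiltered` are hypothetical names, not items from `main.rs`, and the model assumes `--no-seccomp` is not set.

```rust
/// Startup stages in the order listed above.
/// Deriving `PartialOrd` compares variants by declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Stage {
    ParseConfig,
    InitKvm,
    CreateVm,
    SetupMemory,
    LoadKernel,
    InitDevices,
    CreateVcpus,
    SignalHandlers,
    SpawnApiServer,
    ApplyLandlock,
    DropCapabilities,
    ApplySeccomp,
    RunVcpus,
    WaitForShutdown,
}

/// True if the stage runs strictly before the filter is installed,
/// so it may still use syscalls outside the allowlist.
pub fn runs_unfiltered(stage: Stage) -> bool {
    stage < Stage::ApplySeccomp
}
```

Under this model, `runs_unfiltered(Stage::LoadKernel)` is true and `runs_unfiltered(Stage::RunVcpus)` is false, which is the property the ordering is meant to guarantee: everything privileged happens before the filter, and the run loop happens after it.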

## Syscall Allowlist (72 syscalls)

### File I/O (10)
`read`, `write`, `openat`, `close`, `fstat`, `lseek`, `pread64`, `pwrite64`, `readv`, `writev`

### Memory Management (6)
`mmap`, `mprotect`, `munmap`, `brk`, `madvise`, `mremap`

### KVM / Device Control (1)
`ioctl`: the core VMM syscall. KVM_RUN, KVM_SET_REGS, KVM_CREATE_VCPU, and all other KVM operations go through ioctl. We allow all ioctls rather than filtering by ioctl number because:
- The KVM fd-based security model already scopes access
- Filtering by ioctl number would be fragile across kernel versions
- The BPF program size would explode
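
For a sense of what filtering by ioctl number would involve, the sketch below decodes the type byte of a request number and checks it against `KVMIO` (`0xAE`), the type shared by KVM requests in the Linux UAPI headers. The helper names are hypothetical and Volt does not currently do this; an actual filter would express the check as a BPF condition on the second `ioctl` argument rather than in Rust.

```rust
/// `KVMIO`, the ioctl type byte used by KVM requests in the Linux UAPI headers.
pub const KVMIO: u64 = 0xAE;

/// Extract the type byte (bits 8..16) of an ioctl request number,
/// mirroring the kernel's `_IOC_TYPE` macro.
pub fn ioc_type(request: u64) -> u64 {
    (request >> 8) & 0xFF
}

/// Hypothetical predicate a stricter filter could encode:
/// does this request number belong to the KVM ioctl family?
pub fn is_kvm_ioctl(request: u64) -> bool {
    ioc_type(request) == KVMIO
}
```

For example, `KVM_RUN` is `0xAE80`, so `is_kvm_ioctl(0xAE80)` is true, while the terminal request `TCGETS` (`0x5401`) is rejected. Matching on the type byte alone is coarser than a per-request allowlist, but it would avoid one rule per ioctl.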

### Threading (7)
`clone`, `clone3`, `futex`, `set_robust_list`, `sched_yield`, `sched_getaffinity`, `rseq`

### Signals (4)
`rt_sigaction`, `rt_sigprocmask`, `rt_sigreturn`, `sigaltstack`

### Networking (18)
`accept4`, `bind`, `listen`, `socket`, `connect`, `recvfrom`, `sendto`, `recvmsg`, `sendmsg`, `shutdown`, `getsockname`, `getpeername`, `setsockopt`, `getsockopt`, `epoll_create1`, `epoll_ctl`, `epoll_wait`, `ppoll`

### Process Lifecycle (8)
`exit`, `exit_group`, `getpid`, `gettid`, `prctl`, `arch_prctl`, `prlimit64`, `tgkill`

### Timers (3)
`clock_gettime`, `nanosleep`, `clock_nanosleep`

### Miscellaneous (15)
`getrandom`, `eventfd2`, `timerfd_create`, `timerfd_settime`, `pipe2`, `dup`, `dup2`, `fcntl`, `statx`, `newfstatat`, `access`, `readlinkat`, `getcwd`, `unlink`, `unlinkat`
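
The category lists above can be kept honest with a small table-driven check. The sketch below copies the names from this section into a constant and provides hypothetical helpers (`total`, `duplicates`) that a unit test could use to confirm the list has 72 unique entries. The real filter in `seccomp.rs` is keyed by syscall number, so treat this as a documentation aid rather than the actual definition.

```rust
use std::collections::HashSet;

/// Allowlist grouped by category, with names copied from this document.
pub const ALLOWLIST: &[(&str, &[&str])] = &[
    ("File I/O", &["read", "write", "openat", "close", "fstat", "lseek",
                   "pread64", "pwrite64", "readv", "writev"]),
    ("Memory Management", &["mmap", "mprotect", "munmap", "brk", "madvise", "mremap"]),
    ("KVM / Device Control", &["ioctl"]),
    ("Threading", &["clone", "clone3", "futex", "set_robust_list", "sched_yield",
                    "sched_getaffinity", "rseq"]),
    ("Signals", &["rt_sigaction", "rt_sigprocmask", "rt_sigreturn", "sigaltstack"]),
    ("Networking", &["accept4", "bind", "listen", "socket", "connect", "recvfrom",
                     "sendto", "recvmsg", "sendmsg", "shutdown", "getsockname",
                     "getpeername", "setsockopt", "getsockopt", "epoll_create1",
                     "epoll_ctl", "epoll_wait", "ppoll"]),
    ("Process Lifecycle", &["exit", "exit_group", "getpid", "gettid", "prctl",
                            "arch_prctl", "prlimit64", "tgkill"]),
    ("Timers", &["clock_gettime", "nanosleep", "clock_nanosleep"]),
    ("Miscellaneous", &["getrandom", "eventfd2", "timerfd_create", "timerfd_settime",
                        "pipe2", "dup", "dup2", "fcntl", "statx", "newfstatat",
                        "access", "readlinkat", "getcwd", "unlink", "unlinkat"]),
];

/// Total number of entries across all categories.
pub fn total() -> usize {
    ALLOWLIST.iter().map(|(_, names)| names.len()).sum()
}

/// Names that appear more than once anywhere in the table.
pub fn duplicates() -> Vec<&'static str> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for (_, names) in ALLOWLIST {
        for name in *names {
            // `insert` returns false when the name was already present.
            if !seen.insert(*name) {
                dups.push(*name);
            }
        }
    }
    dups
}
```

A test asserting `total() == 72` and `duplicates().is_empty()` would catch a category being edited without the headline count being updated.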

## Crate Choice

We use **`seccompiler` v0.5** from the rust-vmm project, the same crate Firecracker uses. Benefits:
- Battle-tested in production (millions of Firecracker microVMs)
- Pure Rust BPF compiler (no C dependencies)
- Supports argument-level filtering (we don't use it for ioctl, but could add later)
- `apply_filter_all_threads` for TSYNC support

## CLI Flag

`--no-seccomp` disables the filter entirely. This is for debugging only and emits a WARN-level log:

```
WARN volt-vmm::security::seccomp: Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.
```
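
The flag's effect can be sketched as follows. How Volt actually parses its command line is not shown here, so `seccomp_enabled` and `startup_warning` are hypothetical helpers; only the flag name and the warning text are taken from this document.

```rust
/// Hypothetical helper: the filter stays on unless `--no-seccomp` is present.
pub fn seccomp_enabled(args: &[&str]) -> bool {
    !args.iter().any(|a| *a == "--no-seccomp")
}

/// Hypothetical helper: the message to log at WARN level when the filter is off.
pub fn startup_warning(enabled: bool) -> Option<&'static str> {
    if enabled {
        None
    } else {
        Some("Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.")
    }
}
```

The default is the secure one: an argument list without the flag leaves filtering enabled and produces no warning.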

## Testing

### Minimal kernel (bare-metal ELF)
```bash
timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
# Output: "Hello from minimal kernel!" (seccomp active, VM runs normally)
```

### Linux kernel (vmlinux 4.14)
```bash
timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
# Output: Full Linux boot up to VFS mount panic (expected without rootfs)
# Seccomp did NOT kill the process: all needed syscalls are allowed
```

### With seccomp disabled
```bash
timeout 5 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M --no-seccomp
# WARN logged, VM runs normally
```

## Comparison with Firecracker

| Feature | Firecracker | Volt |
|---------|-------------|------|
| Crate | seccompiler 0.4 | seccompiler 0.5 |
| Syscalls allowed | ~50 | 72 |
| ioctl filtering | By KVM ioctl number | Allow all (fd-scoped) |
| Default action | KILL_PROCESS | KILL_PROCESS |
| Per-thread filters | Yes (API vs vCPU) | Single filter (TSYNC) |
| Disable flag | No (always on) | `--no-seccomp` for debug |

Volt allows more syscalls than Firecracker because:
1. We include tokio runtime syscalls (epoll, clone3, rseq)
2. We include networking syscalls for the API socket
3. We include filesystem cleanup syscalls (unlink/unlinkat for socket cleanup)

## Future Improvements

1. **Per-thread filters**: Different allowlists for API thread vs vCPU threads (Firecracker does this)
2. **ioctl argument filtering**: Filter to only KVM_* ioctl numbers (adds ~20 BPF rules but tightens security)
3. **Audit mode**: Use `SECCOMP_RET_LOG` instead of `SECCOMP_RET_KILL_PROCESS` for development
4. **Metrics**: Count seccomp violations via a SIGSYS handler. This requires `SECCOMP_RET_TRAP` for the counted syscalls, because `SECCOMP_RET_KILL_PROCESS` terminates the process without running any handler
5. **Remove `--no-seccomp`**: Once the allowlist is proven stable in production

## Files

- `vmm/src/security/seccomp.rs`: Filter definition, build, and apply logic
- `vmm/src/security/mod.rs`: Module exports (also includes capabilities + landlock)
- `vmm/src/main.rs`: Integration point (after init, before vCPU run) + `--no-seccomp` flag
- `vmm/Cargo.toml`: `seccompiler = "0.5"` dependency