KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0
Seccomp-BPF Implementation Notes
Overview
Volt now includes seccomp-BPF system call filtering as a critical security layer. After all VMM initialization is complete (KVM VM created, memory allocated, kernel loaded, devices initialized, API socket bound), a strict syscall allowlist is applied. Any syscall not on the allowlist immediately kills the process with SECCOMP_RET_KILL_PROCESS.
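To make the allowlist-or-kill behaviour concrete, here is a minimal sketch of the kind of classic BPF program such a filter compiles down to, plus a tiny evaluator so it can be exercised without installing anything. It is a hand-written illustration, not the program seccompiler actually emits for Volt: the names SockFilter, build_allowlist, and evaluate are hypothetical, and the evaluator only understands the three opcodes used here. The constants are the Linux UAPI values for x86_64.

```rust
/// One classic-BPF instruction, laid out like the kernel's `struct sock_filter`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct SockFilter {
    pub code: u16,
    pub jt: u8,
    pub jf: u8,
    pub k: u32,
}

// Opcodes and return values from the Linux UAPI headers.
const LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const RET_K: u16 = 0x06; // BPF_RET | BPF_K
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
pub const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;

// Field offsets inside `struct seccomp_data`.
const OFF_NR: u32 = 0;
const OFF_ARCH: u32 = 4;

fn ins(code: u16, jt: u8, jf: u8, k: u32) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

/// Build "check arch, then allow listed syscall numbers, else kill".
pub fn build_allowlist(allowed: &[u32]) -> Vec<SockFilter> {
    assert!(allowed.len() <= 255, "classic BPF jump offsets are 8-bit");
    let mut prog = vec![
        ins(LD_W_ABS, 0, 0, OFF_ARCH),
        ins(JEQ_K, 1, 0, AUDIT_ARCH_X86_64), // skip the kill on a match
        ins(RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        ins(LD_W_ABS, 0, 0, OFF_NR),
    ];
    let n = allowed.len();
    for (i, &nr) in allowed.iter().enumerate() {
        // On a match, jump over the remaining comparisons and the final kill.
        prog.push(ins(JEQ_K, (n - i) as u8, 0, nr));
    }
    prog.push(ins(RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS));
    prog.push(ins(RET_K, 0, 0, SECCOMP_RET_ALLOW));
    prog
}

/// Evaluate the program for one syscall (supports only the opcodes above).
pub fn evaluate(prog: &[SockFilter], arch: u32, nr: u32) -> u32 {
    let mut acc = 0u32;
    let mut pc = 0usize;
    loop {
        let SockFilter { code, jt, jf, k } = prog[pc];
        pc += 1;
        match code {
            LD_W_ABS => acc = if k == OFF_ARCH { arch } else { nr },
            JEQ_K => pc += usize::from(if acc == k { jt } else { jf }),
            RET_K => return k,
            other => panic!("unsupported opcode {:#x}", other),
        }
    }
}
```

With an allowlist of read (0), write (1), and ioctl (16), evaluating syscall 16 yields SECCOMP_RET_ALLOW, while execve (59) or any call from a non-x86_64 ABI falls through to SECCOMP_RET_KILL_PROCESS.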
Architecture
Security Layer Stack
┌─────────────────────────────────────────────────────────┐
│ Layer 5: Seccomp-BPF (always unless --no-seccomp) │
│ 72 syscalls allowed, all others → KILL │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Landlock (optional, kernel 5.13+) │
│ Filesystem path restrictions │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Capability dropping (always) │
│ Drop all ambient capabilities │
├─────────────────────────────────────────────────────────┤
│ Layer 2: PR_SET_NO_NEW_PRIVS (always) │
│ Prevent privilege escalation │
├─────────────────────────────────────────────────────────┤
│ Layer 1: KVM isolation (inherent) │
│ Hardware virtualization boundary │
└─────────────────────────────────────────────────────────┘
Application Timing
The seccomp filter is applied in main.rs at a specific point in the startup sequence:
1. Parse CLI / validate config
2. Initialize KVM system handle
3. Create VM (IRQ chip, PIT)
4. Set up guest memory regions
5. Load kernel (PVH boot protocol)
6. Initialize devices (serial, virtio)
7. Create vCPUs
8. Set up signal handlers
9. Spawn API server task
10. ** Apply Landlock **
11. ** Drop capabilities **
12. ** Apply seccomp filter ** ← HERE
13. Start vCPU run loop
14. Wait for shutdown
This ordering is critical:
- Before seccomp: All privileged operations (opening /dev/kvm, mmap'ing guest memory, loading kernel files, binding sockets) are complete.
- After seccomp: Only the ~72 syscalls needed for steady-state operation are allowed.
- We use apply_filter_all_threads (TSYNC) so the filter covers every thread that already exists when it is installed; vCPU threads spawned later inherit it automatically from their parent.
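The ordering constraint can be stated as a small invariant check. The sketch below models the fourteen steps as an enum and verifies that everything except the run loop and shutdown wait happens before the filter is applied. The Step enum, STARTUP_ORDER, and check_order are hypothetical names for illustration and do not mirror the actual code in main.rs. It also assumes that the Landlock and capability steps must come first because their syscalls are not on the allowlist.

```rust
/// Startup steps, in the order listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Step {
    ParseCli,
    InitKvm,
    CreateVm,
    SetupGuestMemory,
    LoadKernel,
    InitDevices,
    CreateVcpus,
    SetupSignalHandlers,
    SpawnApiServer,
    ApplyLandlock,
    DropCapabilities,
    ApplySeccomp,
    RunVcpus,
    WaitForShutdown,
}

pub const STARTUP_ORDER: [Step; 14] = [
    Step::ParseCli,
    Step::InitKvm,
    Step::CreateVm,
    Step::SetupGuestMemory,
    Step::LoadKernel,
    Step::InitDevices,
    Step::CreateVcpus,
    Step::SetupSignalHandlers,
    Step::SpawnApiServer,
    Step::ApplyLandlock,
    Step::DropCapabilities,
    Step::ApplySeccomp,
    Step::RunVcpus,
    Step::WaitForShutdown,
];

impl Step {
    /// Assumption for this sketch: every step except the filter itself,
    /// the run loop, and the shutdown wait may need syscalls outside the
    /// steady-state allowlist, so it has to finish before the filter.
    pub fn must_precede_seccomp(self) -> bool {
        !matches!(
            self,
            Step::ApplySeccomp | Step::RunVcpus | Step::WaitForShutdown
        )
    }
}

/// Check that a proposed step order respects the seccomp boundary.
pub fn check_order(steps: &[Step]) -> Result<(), String> {
    let seccomp_at = steps
        .iter()
        .position(|s| *s == Step::ApplySeccomp)
        .ok_or("seccomp filter is never applied")?;
    for (i, step) in steps.iter().enumerate() {
        if step.must_precede_seccomp() && i > seccomp_at {
            return Err(format!("{:?} runs after the seccomp filter", step));
        }
        if *step == Step::RunVcpus && i < seccomp_at {
            return Err("vCPU run loop starts before the seccomp filter".to_string());
        }
    }
    Ok(())
}
```

check_order accepts STARTUP_ORDER as written and rejects any permutation that moves a privileged step, such as LoadKernel, past ApplySeccomp.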
Syscall Allowlist (72 syscalls)
File I/O (10)
read, write, openat, close, fstat, lseek, pread64, pwrite64, readv, writev
Memory Management (6)
mmap, mprotect, munmap, brk, madvise, mremap
KVM / Device Control (1)
ioctl — The core VMM syscall. KVM_RUN, KVM_SET_REGS, KVM_CREATE_VCPU, and all other KVM operations go through ioctl. We allow all ioctls rather than filtering by ioctl number because:
- The KVM fd-based security model already scopes access
- Filtering by ioctl number would be fragile across kernel versions
- The BPF program size would explode
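For context on what filtering by ioctl number would involve, the sketch below reproduces the standard Linux request-number encoding (direction, size, type, number) for a few KVM ioctls on x86_64. The helper names ioc and is_kvm_ioctl are hypothetical. The type-byte test is deliberately coarse and is an assumption of this sketch, since a real filter would more likely enumerate individual request numbers.

```rust
/// ioctl "type" byte used by linux/kvm.h.
pub const KVMIO: u64 = 0xAE;

// Direction bits as defined for x86_64 (asm-generic/ioctl.h).
const IOC_NONE: u64 = 0;
const IOC_WRITE: u64 = 1;
const IOC_READ: u64 = 2;

/// Equivalent of the kernel's _IOC(dir, type, nr, size) macro.
const fn ioc(dir: u64, ty: u64, nr: u64, size: u64) -> u64 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

/// sizeof(struct kvm_regs) on x86_64: 16 GPRs + rip + rflags, 8 bytes each.
const KVM_REGS_SIZE: u64 = 18 * 8;

pub const KVM_CREATE_VCPU: u64 = ioc(IOC_NONE, KVMIO, 0x41, 0); // _IO
pub const KVM_RUN: u64 = ioc(IOC_NONE, KVMIO, 0x80, 0); // _IO
pub const KVM_GET_REGS: u64 = ioc(IOC_READ, KVMIO, 0x81, KVM_REGS_SIZE); // _IOR
pub const KVM_SET_REGS: u64 = ioc(IOC_WRITE, KVMIO, 0x82, KVM_REGS_SIZE); // _IOW

/// Coarse check: does the request carry the KVM type byte?
pub fn is_kvm_ioctl(request: u64) -> bool {
    (request >> 8) & 0xff == KVMIO
}
```

Because the struct size is baked into the request number, the values differ between architectures, which is one reason a per-number allowlist needs more maintenance than the fd-scoped approach described above.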
Threading (7)
clone, clone3, futex, set_robust_list, sched_yield, sched_getaffinity, rseq
Signals (4)
rt_sigaction, rt_sigprocmask, rt_sigreturn, sigaltstack
Networking (18)
accept4, bind, listen, socket, connect, recvfrom, sendto, recvmsg, sendmsg, shutdown, getsockname, getpeername, setsockopt, getsockopt, epoll_create1, epoll_ctl, epoll_wait, ppoll
Process Lifecycle (8)
exit, exit_group, getpid, gettid, prctl, arch_prctl, prlimit64, tgkill
Timers (3)
clock_gettime, nanosleep, clock_nanosleep
Miscellaneous (15)
getrandom, eventfd2, timerfd_create, timerfd_settime, pipe2, dup, dup2, fcntl, statx, newfstatat, access, readlinkat, getcwd, unlink, unlinkat
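A quick way to keep the documented categories honest is to hold them as data and derive the total. The sketch below transcribes the lists above into a constant and provides helpers to sum them, look up a name, and detect duplicates. ALLOWLIST, total, is_allowed, and duplicate_names are hypothetical names for this example. The real filter in seccomp.rs works with syscall numbers rather than strings.

```rust
use std::collections::BTreeSet;

/// The allowlist as documented above, grouped by category.
pub const ALLOWLIST: &[(&str, &[&str])] = &[
    ("File I/O", &[
        "read", "write", "openat", "close", "fstat", "lseek",
        "pread64", "pwrite64", "readv", "writev",
    ]),
    ("Memory Management", &[
        "mmap", "mprotect", "munmap", "brk", "madvise", "mremap",
    ]),
    ("KVM / Device Control", &["ioctl"]),
    ("Threading", &[
        "clone", "clone3", "futex", "set_robust_list", "sched_yield",
        "sched_getaffinity", "rseq",
    ]),
    ("Signals", &[
        "rt_sigaction", "rt_sigprocmask", "rt_sigreturn", "sigaltstack",
    ]),
    ("Networking", &[
        "accept4", "bind", "listen", "socket", "connect", "recvfrom",
        "sendto", "recvmsg", "sendmsg", "shutdown", "getsockname",
        "getpeername", "setsockopt", "getsockopt", "epoll_create1",
        "epoll_ctl", "epoll_wait", "ppoll",
    ]),
    ("Process Lifecycle", &[
        "exit", "exit_group", "getpid", "gettid", "prctl", "arch_prctl",
        "prlimit64", "tgkill",
    ]),
    ("Timers", &["clock_gettime", "nanosleep", "clock_nanosleep"]),
    ("Miscellaneous", &[
        "getrandom", "eventfd2", "timerfd_create", "timerfd_settime",
        "pipe2", "dup", "dup2", "fcntl", "statx", "newfstatat", "access",
        "readlinkat", "getcwd", "unlink", "unlinkat",
    ]),
];

/// Total number of allowlisted syscalls across all categories.
pub fn total() -> usize {
    ALLOWLIST.iter().map(|(_, names)| names.len()).sum()
}

/// True if the named syscall appears in any category.
pub fn is_allowed(name: &str) -> bool {
    ALLOWLIST.iter().any(|(_, names)| names.contains(&name))
}

/// Names that appear more than once (should be empty).
pub fn duplicate_names() -> Vec<&'static str> {
    let mut seen = BTreeSet::new();
    ALLOWLIST
        .iter()
        .flat_map(|(_, names)| names.iter().copied())
        .filter(|name| !seen.insert(*name))
        .collect()
}
```

A unit test asserting that total() equals 72 and that duplicate_names() is empty would catch drift between the category lists and the headline figure.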
Crate Choice
We use seccompiler v0.5 from the rust-vmm project — the same crate Firecracker uses. Benefits:
- Battle-tested in production (millions of Firecracker microVMs)
- Pure Rust BPF compiler (no C dependencies)
- Supports argument-level filtering (we don't use it for ioctl, but could add later)
- Provides apply_filter_all_threads for TSYNC support
CLI Flag
--no-seccomp disables the filter entirely. This is for debugging only and emits a WARN-level log:
WARN volt-vmm::security::seccomp: Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.
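As a rough illustration of the flag's effect, the sketch below decides whether the filter should be applied from a list of arguments and returns the warning text quoted above when it is disabled. The function names seccomp_enabled and startup_warning are hypothetical. How Volt actually parses its CLI is not described here, so treat this as an assumption-laden stand-in.

```rust
/// Warning text emitted when the filter is disabled (quoted from the log above).
pub const DISABLED_WARNING: &str =
    "Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.";

/// The filter is on unless `--no-seccomp` appears among the arguments.
pub fn seccomp_enabled<I, S>(args: I) -> bool
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    !args.into_iter().any(|arg| arg.as_ref() == "--no-seccomp")
}

/// Message to log at WARN level, if any, for the chosen setting.
pub fn startup_warning(enabled: bool) -> Option<&'static str> {
    if enabled {
        None
    } else {
        Some(DISABLED_WARNING)
    }
}
```

Keeping the default as enabled means a typo in the flag name leaves the filter on instead of silently turning it off.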
Testing
Minimal kernel (bare metal ELF)
timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
# Output: "Hello from minimal kernel!" — seccomp active, VM runs normally
Linux kernel (vmlinux 4.14)
timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
# Output: Full Linux boot up to VFS mount panic (expected without rootfs)
# Seccomp did NOT kill the process — all needed syscalls are allowed
With seccomp disabled
timeout 5 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M --no-seccomp
# WARN logged, VM runs normally
Comparison with Firecracker
| Feature | Firecracker | Volt |
|---|---|---|
| Crate | seccompiler 0.4 | seccompiler 0.5 |
| Syscalls allowed | ~50 | ~72 |
| ioctl filtering | By KVM ioctl number | Allow all (fd-scoped) |
| Default action | KILL_PROCESS | KILL_PROCESS |
| Per-thread filters | Yes (API vs vCPU) | Single filter (TSYNC) |
| Disable flag | No (always on) | --no-seccomp for debug |
Volt allows slightly more syscalls because:
- We include tokio runtime syscalls (epoll, clone3, rseq)
- We include networking syscalls for the API socket
- We include filesystem cleanup syscalls (unlink/unlinkat for socket cleanup)
Future Improvements
- Per-thread filters: Different allowlists for API thread vs vCPU threads (Firecracker does this)
- ioctl argument filtering: Filter to only KVM_* ioctl numbers (adds ~20 BPF rules but tightens security)
- Audit mode: Use SECCOMP_RET_LOG instead of SECCOMP_RET_KILL_PROCESS for development
- Metrics: Count seccomp violations via SIGSYS handler before kill
- Remove --no-seccomp: Once the allowlist is proven stable in production
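The audit and metrics ideas both come down to choosing a different default action for syscalls that miss the allowlist. The sketch below maps a hypothetical FilterMode to the corresponding Linux return value. One caveat, going by seccomp(2): SECCOMP_RET_KILL_PROCESS cannot be intercepted by a SIGSYS handler, so counting violations in a handler would likely need SECCOMP_RET_TRAP as the default action. The enum, its variants, and parse_mode are assumptions of this sketch and do not exist in Volt today.

```rust
// Return values from linux/seccomp.h.
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_TRAP: u32 = 0x0003_0000;
pub const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;

/// Hypothetical operating modes for the filter.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FilterMode {
    /// Production: kill the whole process on any violation.
    Enforce,
    /// Development: log the violation and let the syscall proceed.
    Audit,
    /// Deliver a catchable SIGSYS so a handler can record the violation.
    Trap,
}

/// Default action for syscalls that are not on the allowlist.
pub fn mismatch_action(mode: FilterMode) -> u32 {
    match mode {
        FilterMode::Enforce => SECCOMP_RET_KILL_PROCESS,
        FilterMode::Audit => SECCOMP_RET_LOG,
        FilterMode::Trap => SECCOMP_RET_TRAP,
    }
}

/// Parse a mode name, for example from a hypothetical config value.
pub fn parse_mode(value: &str) -> Option<FilterMode> {
    match value {
        "enforce" => Some(FilterMode::Enforce),
        "audit" => Some(FilterMode::Audit),
        "trap" => Some(FilterMode::Trap),
        _ => None,
    }
}
```

Unknown mode names return None instead of falling back to a weaker mode, so a misconfiguration cannot quietly downgrade enforcement.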
Files
- vmm/src/security/seccomp.rs: Filter definition, build, and apply logic
- vmm/src/security/mod.rs: Module exports (also includes capabilities + landlock)
- vmm/src/main.rs: Integration point (after init, before vCPU run) + --no-seccomp flag
- vmm/Cargo.toml: seccompiler = "0.5" dependency