# Seccomp-BPF Implementation Notes

## Overview

Volt now includes seccomp-BPF system call filtering as a critical security layer. After all VMM initialization is complete (KVM VM created, memory allocated, kernel loaded, devices initialized, API socket bound), a strict syscall allowlist is applied. Any syscall not on the allowlist immediately kills the process with `SECCOMP_RET_KILL_PROCESS`.

## Architecture

### Security Layer Stack

```
┌─────────────────────────────────────────────────────────┐
│  Layer 5: Seccomp-BPF (always unless --no-seccomp)      │
│           72 syscalls allowed, all others → KILL        │
├─────────────────────────────────────────────────────────┤
│  Layer 4: Landlock (optional, kernel 5.13+)             │
│           Filesystem path restrictions                  │
├─────────────────────────────────────────────────────────┤
│  Layer 3: Capability dropping (always)                  │
│           Drop all ambient capabilities                 │
├─────────────────────────────────────────────────────────┤
│  Layer 2: PR_SET_NO_NEW_PRIVS (always)                  │
│           Prevent privilege escalation                  │
├─────────────────────────────────────────────────────────┤
│  Layer 1: KVM isolation (inherent)                      │
│           Hardware virtualization boundary              │
└─────────────────────────────────────────────────────────┘
```

### Application Timing

The seccomp filter is applied in `main.rs` at a specific point in the startup sequence:

```
 1. Parse CLI / validate config
 2. Initialize KVM system handle
 3. Create VM (IRQ chip, PIT)
 4. Set up guest memory regions
 5. Load kernel (PVH boot protocol)
 6. Initialize devices (serial, virtio)
 7. Create vCPUs
 8. Set up signal handlers
 9. Spawn API server task
10. ** Apply Landlock **
11. ** Drop capabilities **
12. ** Apply seccomp filter **  ← HERE
13. Start vCPU run loop
14. Wait for shutdown
```

This ordering is critical:

- Before seccomp: All privileged operations (opening `/dev/kvm`, mmap'ing guest memory, loading kernel files, binding sockets) are complete.
- After seccomp: Only the 72 syscalls needed for steady-state operation are allowed.
- We use `apply_filter_all_threads` (TSYNC) so that every thread already running when the filter is installed is covered. Threads spawned afterwards, including vCPU threads, inherit the filter from the thread that creates them.

## Syscall Allowlist (72 syscalls)

### File I/O (10)

`read`, `write`, `openat`, `close`, `fstat`, `lseek`, `pread64`, `pwrite64`, `readv`, `writev`

### Memory Management (6)

`mmap`, `mprotect`, `munmap`, `brk`, `madvise`, `mremap`

### KVM / Device Control (1)

`ioctl`: the core VMM syscall. KVM_RUN, KVM_SET_REGS, KVM_CREATE_VCPU, and all other KVM operations go through ioctl. We allow all ioctls rather than filtering by ioctl number because:

- The KVM fd-based security model already scopes access
- Filtering by ioctl number would be fragile across kernel versions
- The BPF program size would grow significantly

### Threading (7)

`clone`, `clone3`, `futex`, `set_robust_list`, `sched_yield`, `sched_getaffinity`, `rseq`

### Signals (4)

`rt_sigaction`, `rt_sigprocmask`, `rt_sigreturn`, `sigaltstack`

### Networking (18)

`accept4`, `bind`, `listen`, `socket`, `connect`, `recvfrom`, `sendto`, `recvmsg`, `sendmsg`, `shutdown`, `getsockname`, `getpeername`, `setsockopt`, `getsockopt`, `epoll_create1`, `epoll_ctl`, `epoll_wait`, `ppoll`

### Process Lifecycle (8)

`exit`, `exit_group`, `getpid`, `gettid`, `prctl`, `arch_prctl`, `prlimit64`, `tgkill`

### Timers (3)

`clock_gettime`, `nanosleep`, `clock_nanosleep`

### Miscellaneous (15)

`getrandom`, `eventfd2`, `timerfd_create`, `timerfd_settime`, `pipe2`, `dup`, `dup2`, `fcntl`, `statx`, `newfstatat`, `access`, `readlinkat`, `getcwd`, `unlink`, `unlinkat`

## Crate Choice

We use **`seccompiler` v0.5** from the rust-vmm project, the same crate Firecracker uses. Benefits:

- Battle-tested in production (millions of Firecracker microVMs)
- Pure Rust BPF compiler (no C dependencies)
- Supports argument-level filtering (we don't use it for ioctl, but could add it later)
- `apply_filter_all_threads` for TSYNC support

## CLI Flag

`--no-seccomp` disables the filter entirely.
This is for debugging only and emits a WARN-level log:

```
WARN volt-vmm::security::seccomp: Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.
```

## Testing

### Minimal kernel (bare-metal ELF)

```bash
timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
# Output: "Hello from minimal kernel!" (seccomp active, VM runs normally)
```

### Linux kernel (vmlinux 4.14)

```bash
timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
# Output: Full Linux boot up to VFS mount panic (expected without rootfs)
# Seccomp did NOT kill the process: all needed syscalls are allowed
```

### With seccomp disabled

```bash
timeout 5 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M --no-seccomp
# WARN logged, VM runs normally
```

## Comparison with Firecracker

| Feature | Firecracker | Volt |
|---------|-------------|------|
| Crate | seccompiler 0.4 | seccompiler 0.5 |
| Syscalls allowed | ~50 | 72 |
| ioctl filtering | By KVM ioctl number | Allow all (fd-scoped) |
| Default action | KILL_PROCESS | KILL_PROCESS |
| Per-thread filters | Yes (API vs vCPU) | Single filter (TSYNC) |
| Disable flag | No (always on) | `--no-seccomp` for debug |

Volt allows more syscalls because:

1. We include tokio runtime syscalls (epoll, clone3, rseq)
2. We include networking syscalls for the API socket
3. We include filesystem cleanup syscalls (unlink/unlinkat for socket cleanup)

## Future Improvements

1. **Per-thread filters**: Different allowlists for API thread vs vCPU threads (Firecracker does this)
2. **ioctl argument filtering**: Filter to only KVM_* ioctl numbers (adds ~20 BPF rules but tightens security)
3. **Audit mode**: Use `SECCOMP_RET_LOG` instead of `SECCOMP_RET_KILL_PROCESS` for development
4. **Metrics**: Count seccomp violations via a SIGSYS handler before the kill. This requires a catchable action such as `SECCOMP_RET_TRAP`, because `SECCOMP_RET_KILL_PROCESS` terminates the process without running any handler.
5. **Remove `--no-seccomp`**: Once the allowlist is proven stable in production

## Files

- `vmm/src/security/seccomp.rs`: Filter definition, build, and apply logic
- `vmm/src/security/mod.rs`: Module exports (also includes capabilities + landlock)
- `vmm/src/main.rs`: Integration point (after init, before vCPU run) + `--no-seccomp` flag
- `vmm/Cargo.toml`: `seccompiler = "0.5"` dependency
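
## Sketch: KVM ioctl Request Numbers

The ioctl argument filtering item under Future Improvements depends on knowing the numeric request values that a filter would compare against. The following is a minimal sketch, not code from Volt: it assumes an x86_64 host, reproduces the kernel's generic `_IOC` bit layout, and derives the three requests named in the allowlist section. All identifiers (`ioc`, `ALLOWED_KVM_IOCTLS`, `ioctl_allowed`) are hypothetical, and a real allowlist would need many more entries than these three.

```rust
// Sketch only: Linux `_IOC` request-number encoding (generic layout, as used on x86_64).
// Identifiers here are illustrative and are not taken from Volt's source tree.

const IOC_NRBITS: u64 = 8;
const IOC_TYPEBITS: u64 = 8;
const IOC_SIZEBITS: u64 = 14;

const IOC_NRSHIFT: u64 = 0;
const IOC_TYPESHIFT: u64 = IOC_NRSHIFT + IOC_NRBITS; // 8
const IOC_SIZESHIFT: u64 = IOC_TYPESHIFT + IOC_TYPEBITS; // 16
const IOC_DIRSHIFT: u64 = IOC_SIZESHIFT + IOC_SIZEBITS; // 30

// Direction values: no data transfer, or userspace writes data to the kernel.
const IOC_NONE: u64 = 0;
const IOC_WRITE: u64 = 1;

/// KVM's ioctl "type" byte (`KVMIO` in the kernel's KVM UAPI header).
const KVMIO: u64 = 0xAE;

/// Size of `struct kvm_regs` on x86_64: 16 general-purpose registers
/// plus `rip` and `rflags`, 8 bytes each.
const KVM_REGS_SIZE: u64 = 18 * 8;

/// Packs direction, type, command number, and argument size into one request value.
const fn ioc(dir: u64, ty: u64, nr: u64, size: u64) -> u64 {
    (dir << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT) | (ty << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT)
}

pub const KVM_CREATE_VCPU: u64 = ioc(IOC_NONE, KVMIO, 0x41, 0);
pub const KVM_RUN: u64 = ioc(IOC_NONE, KVMIO, 0x80, 0);
pub const KVM_SET_REGS: u64 = ioc(IOC_WRITE, KVMIO, 0x82, KVM_REGS_SIZE);

/// Hypothetical, deliberately incomplete list of permitted requests.
pub const ALLOWED_KVM_IOCTLS: [u64; 3] = [KVM_CREATE_VCPU, KVM_RUN, KVM_SET_REGS];

/// Userspace mirror of the check a BPF rule would perform on the request argument.
pub fn ioctl_allowed(request: u64) -> bool {
    ALLOWED_KVM_IOCTLS.contains(&request)
}
```

In an actual filter, each value would become a condition on the second argument of `ioctl` (the request number), expressed through the crate's argument-level filtering. That wiring is omitted here. Because some request values embed a struct size, they are architecture-specific, which is one reason the allowlist section calls number-based filtering fragile.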