volt-vmm/docs/seccomp-implementation.md
Karl Clinger 40ed108dd5 Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
2026-03-21 01:04:35 -05:00

Seccomp-BPF Implementation Notes

Overview

Volt now includes seccomp-BPF system call filtering as a critical security layer. After all VMM initialization is complete (KVM VM created, memory allocated, kernel loaded, devices initialized, API socket bound), a strict syscall allowlist is applied. Any syscall not on the allowlist immediately kills the process with SECCOMP_RET_KILL_PROCESS.
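To make the allowlist semantics concrete, here is a minimal sketch of the shape of classic-BPF program that a seccomp allowlist typically compiles to, plus a toy user-space evaluator so the logic can be exercised without installing a filter. This is not Volt's actual filter or the output of seccompiler. `SockFilter`, `build_allowlist`, and `evaluate` are illustrative names, and the numeric constants are the standard Linux values for x86_64.

```rust
// Standard Linux constants (x86_64).
const BPF_LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const BPF_JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const BPF_RET_K: u16 = 0x06; // BPF_RET | BPF_K
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const OFF_NR: u32 = 0; // offset of `nr` in struct seccomp_data
const OFF_ARCH: u32 = 4; // offset of `arch` in struct seccomp_data

/// Same field layout as the kernel's `struct sock_filter`.
#[derive(Clone, Copy, Debug)]
struct SockFilter {
    code: u16,
    jt: u8,
    jf: u8,
    k: u32,
}

fn ins(code: u16, jt: u8, jf: u8, k: u32) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

/// Builds: check arch, load syscall nr, one JEQ per allowed nr, KILL, ALLOW.
fn build_allowlist(allowed: &[u32]) -> Vec<SockFilter> {
    assert!(allowed.len() < 256, "jump offsets are 8-bit in this sketch");
    let mut prog = vec![
        ins(BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        ins(BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64), // skip the KILL on match
        ins(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        ins(BPF_LD_W_ABS, 0, 0, OFF_NR),
    ];
    let n = allowed.len();
    for (i, &nr) in allowed.iter().enumerate() {
        // On match, jump over the remaining comparisons and the KILL.
        prog.push(ins(BPF_JEQ_K, (n - i) as u8, 0, nr));
    }
    prog.push(ins(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS));
    prog.push(ins(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW));
    prog
}

/// Toy interpreter for the three opcodes used above.
fn evaluate(prog: &[SockFilter], nr: u32, arch: u32) -> u32 {
    let mut acc = 0u32;
    let mut pc = 0usize;
    loop {
        let i = prog[pc];
        pc += 1;
        match i.code {
            BPF_LD_W_ABS => acc = if i.k == OFF_ARCH { arch } else { nr },
            BPF_JEQ_K => pc += if acc == i.k { i.jt as usize } else { i.jf as usize },
            BPF_RET_K => return i.k,
            _ => return SECCOMP_RET_KILL_PROCESS,
        }
    }
}
```

With an allowlist of `read` (0), `write` (1), and `ioctl` (16), evaluating syscall 59 (`execve`) falls through every comparison and returns the kill action, which is the behavior described above.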

Architecture

Security Layer Stack

┌─────────────────────────────────────────────────────────┐
│  Layer 5: Seccomp-BPF (always unless --no-seccomp)      │
│           72 syscalls allowed, all others → KILL         │
├─────────────────────────────────────────────────────────┤
│  Layer 4: Landlock (optional, kernel 5.13+)             │
│           Filesystem path restrictions                   │
├─────────────────────────────────────────────────────────┤
│  Layer 3: Capability dropping (always)                  │
│           Drop all ambient capabilities                  │
├─────────────────────────────────────────────────────────┤
│  Layer 2: PR_SET_NO_NEW_PRIVS (always)                  │
│           Prevent privilege escalation                    │
├─────────────────────────────────────────────────────────┤
│  Layer 1: KVM isolation (inherent)                      │
│           Hardware virtualization boundary                │
└─────────────────────────────────────────────────────────┘

Application Timing

The seccomp filter is applied in main.rs at a specific point in the startup sequence:

1. Parse CLI / validate config
2. Initialize KVM system handle
3. Create VM (IRQ chip, PIT)
4. Set up guest memory regions
5. Load kernel (PVH boot protocol)
6. Initialize devices (serial, virtio)
7. Create vCPUs
8. Set up signal handlers
9. Spawn API server task
10. ** Apply Landlock **
11. ** Drop capabilities **
12. ** Apply seccomp filter ** ← HERE
13. Start vCPU run loop
14. Wait for shutdown

This ordering is critical:

  • Before seccomp: All privileged operations (opening /dev/kvm, mmap'ing guest memory, loading kernel files, binding sockets) are complete.
  • After seccomp: Only the ~72 syscalls needed for steady-state operation are allowed.
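  • We use apply_filter_all_threads (TSYNC) so the filter is installed on every thread that already exists when it is applied, not only the calling thread. Threads spawned afterwards inherit the filter automatically.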
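The ordering constraint can be expressed as a small check. The sketch below is an assumption about how one might encode it and is not code from main.rs: the `Step` enum mirrors the fourteen documented steps, and `check_order` (a hypothetical helper) rejects any sequence in which a setup step runs after the filter or the run loop starts before it.

```rust
/// The documented startup steps, declared in their documented order so that
/// the derived `Ord` reflects that order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Step {
    ParseConfig,
    InitKvm,
    CreateVm,
    SetupMemory,
    LoadKernel,
    InitDevices,
    CreateVcpus,
    SignalHandlers,
    SpawnApi,
    ApplyLandlock,
    DropCapabilities,
    ApplySeccomp,
    RunVcpus,
    WaitShutdown,
}

const DOCUMENTED_ORDER: [Step; 14] = [
    Step::ParseConfig,
    Step::InitKvm,
    Step::CreateVm,
    Step::SetupMemory,
    Step::LoadKernel,
    Step::InitDevices,
    Step::CreateVcpus,
    Step::SignalHandlers,
    Step::SpawnApi,
    Step::ApplyLandlock,
    Step::DropCapabilities,
    Step::ApplySeccomp,
    Step::RunVcpus,
    Step::WaitShutdown,
];

/// Ok if every step documented before the filter runs before it, and every
/// step documented after the filter runs after it.
fn check_order(steps: &[Step]) -> Result<(), String> {
    let seccomp_at = steps
        .iter()
        .position(|s| *s == Step::ApplySeccomp)
        .ok_or_else(|| "seccomp filter is never applied".to_string())?;
    for (i, step) in steps.iter().enumerate() {
        if *step < Step::ApplySeccomp && i > seccomp_at {
            return Err(format!("{:?} runs after the filter is applied", step));
        }
        if *step > Step::ApplySeccomp && i < seccomp_at {
            return Err(format!("{:?} runs before the filter is applied", step));
        }
    }
    Ok(())
}
```

A check like this could run as a unit test, so that moving a step such as kernel loading past the filter fails in CI instead of killing the process at runtime.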

Syscall Allowlist (72 syscalls)

File I/O (10)

read, write, openat, close, fstat, lseek, pread64, pwrite64, readv, writev

Memory Management (6)

mmap, mprotect, munmap, brk, madvise, mremap

KVM / Device Control (1)

ioctl — The core VMM syscall. KVM_RUN, KVM_SET_REGS, KVM_CREATE_VCPU, and all other KVM operations go through ioctl. We allow all ioctls rather than filtering by ioctl number because:

  • The KVM fd-based security model already scopes access
  • Filtering by ioctl number would be fragile across kernel versions
  • The BPF program would grow by one rule per permitted ioctl (roughly 20 extra rules for the KVM set)
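For context on what filtering by ioctl number would involve, the sketch below reproduces the standard Linux `_IOC` request encoding on x86_64 and derives three of the KVM requests named above. Volt does not do this today. The helper names (`ioc`, `ioc_type`, `is_kvm_ioctl`) are illustrative, and the 144-byte size for `struct kvm_regs` is the x86_64 value.

```rust
// Linux _IOC encoding (x86_64): dir[31:30] size[29:16] type[15:8] nr[7:0].
const IOC_NONE: u32 = 0;
const IOC_WRITE: u32 = 1;
const KVMIO: u32 = 0xAE; // ioctl "type" byte used by all KVM requests

const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

const KVM_CREATE_VCPU: u32 = ioc(IOC_NONE, KVMIO, 0x41, 0);
const KVM_RUN: u32 = ioc(IOC_NONE, KVMIO, 0x80, 0);
// _IOW request: the argument struct size (144 bytes for kvm_regs on x86_64)
// is part of the request number itself.
const KVM_SET_REGS: u32 = ioc(IOC_WRITE, KVMIO, 0x82, 144);

/// Extracts the "type" byte from an ioctl request number.
fn ioc_type(request: u32) -> u32 {
    (request >> 8) & 0xff
}

/// Coarse check: does this request belong to the KVM ioctl family?
fn is_kvm_ioctl(request: u32) -> bool {
    ioc_type(request) == KVMIO
}
```

Because `_IOW`/`_IOR` requests embed the argument size, an exact-match rule is tied to one struct layout and architecture, which may be part of the fragility mentioned above. In a seccomp filter the request number is the second syscall argument (`args[1]`), so any such rule would be an argument-level comparison on that value.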

Threading (7)

clone, clone3, futex, set_robust_list, sched_yield, sched_getaffinity, rseq

Signals (4)

rt_sigaction, rt_sigprocmask, rt_sigreturn, sigaltstack

Networking (18)

accept4, bind, listen, socket, connect, recvfrom, sendto, recvmsg, sendmsg, shutdown, getsockname, getpeername, setsockopt, getsockopt, epoll_create1, epoll_ctl, epoll_wait, ppoll

Process Lifecycle (8)

exit, exit_group, getpid, gettid, prctl, arch_prctl, prlimit64, tgkill

Timers (3)

clock_gettime, nanosleep, clock_nanosleep

Miscellaneous (15)

getrandom, eventfd2, timerfd_create, timerfd_settime, pipe2, dup, dup2, fcntl, statx, newfstatat, access, readlinkat, getcwd, unlink, unlinkat
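The allowlist above can be kept as a single table so that the total and the absence of duplicates are checked mechanically. The following is a sketch that uses the names exactly as listed in this document. The table layout and the helper functions are illustrative and are not taken from seccomp.rs.

```rust
/// Category name paired with the syscalls listed under it in this document.
const ALLOWLIST: &[(&str, &[&str])] = &[
    ("File I/O", &["read", "write", "openat", "close", "fstat", "lseek",
                   "pread64", "pwrite64", "readv", "writev"]),
    ("Memory Management", &["mmap", "mprotect", "munmap", "brk", "madvise", "mremap"]),
    ("KVM / Device Control", &["ioctl"]),
    ("Threading", &["clone", "clone3", "futex", "set_robust_list", "sched_yield",
                    "sched_getaffinity", "rseq"]),
    ("Signals", &["rt_sigaction", "rt_sigprocmask", "rt_sigreturn", "sigaltstack"]),
    ("Networking", &["accept4", "bind", "listen", "socket", "connect", "recvfrom",
                     "sendto", "recvmsg", "sendmsg", "shutdown", "getsockname",
                     "getpeername", "setsockopt", "getsockopt", "epoll_create1",
                     "epoll_ctl", "epoll_wait", "ppoll"]),
    ("Process Lifecycle", &["exit", "exit_group", "getpid", "gettid", "prctl",
                            "arch_prctl", "prlimit64", "tgkill"]),
    ("Timers", &["clock_gettime", "nanosleep", "clock_nanosleep"]),
    ("Miscellaneous", &["getrandom", "eventfd2", "timerfd_create", "timerfd_settime",
                        "pipe2", "dup", "dup2", "fcntl", "statx", "newfstatat",
                        "access", "readlinkat", "getcwd", "unlink", "unlinkat"]),
];

/// Total number of syscalls across all categories.
fn total_syscalls() -> usize {
    ALLOWLIST.iter().map(|(_, names)| names.len()).sum()
}

/// Names that appear more than once anywhere in the table.
fn duplicates() -> Vec<&'static str> {
    let mut all: Vec<&'static str> =
        ALLOWLIST.iter().flat_map(|(_, names)| names.iter().copied()).collect();
    all.sort_unstable();
    let mut dups: Vec<&'static str> =
        all.windows(2).filter(|w| w[0] == w[1]).map(|w| w[0]).collect();
    dups.dedup();
    dups
}

/// True if `name` is on the allowlist.
fn is_allowed(name: &str) -> bool {
    ALLOWLIST.iter().any(|(_, names)| names.contains(&name))
}
```

Summing the table gives the 72 entries quoted throughout this document, and a test asserting that number would flag any drift between the documented list and the code.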

Crate Choice

We use seccompiler v0.5 from the rust-vmm project — the same crate Firecracker uses. Benefits:

  • Battle-tested in production (millions of Firecracker microVMs)
  • Pure Rust BPF compiler (no C dependencies)
  • Supports argument-level filtering (we don't use it for ioctl, but could add later)
  • apply_filter_all_threads for TSYNC support

CLI Flag

--no-seccomp disables the filter entirely. This is for debugging only and emits a WARN-level log:

WARN volt-vmm::security::seccomp: Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.
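As a rough illustration of the flag's effect, the sketch below decides whether to apply the filter from a raw argument list and returns the warning text quoted above when it is disabled. The real binary presumably uses its own CLI parser and logging macros; `seccomp_enabled` and `startup_warning` are hypothetical helpers written against the standard library only.

```rust
/// Returns true unless the documented `--no-seccomp` flag is present.
fn seccomp_enabled<S: AsRef<str>>(args: &[S]) -> bool {
    !args.iter().any(|a| a.as_ref() == "--no-seccomp")
}

/// The message to log at WARN level when the filter is disabled.
fn startup_warning(enabled: bool) -> Option<&'static str> {
    if enabled {
        None
    } else {
        Some("Seccomp filtering is DISABLED (--no-seccomp flag). This is insecure for production use.")
    }
}
```

Keeping the default as "enabled" means a missing or misspelled flag never silently turns the filter off.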

Testing

Minimal kernel (bare metal ELF)

timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
# Output: "Hello from minimal kernel!" — seccomp active, VM runs normally

Linux kernel (vmlinux 4.14)

timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
# Output: Full Linux boot up to VFS mount panic (expected without rootfs)
# Seccomp did NOT kill the process — all needed syscalls are allowed

With seccomp disabled

timeout 5 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M --no-seccomp
# WARN logged, VM runs normally
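When automating these checks, it helps to distinguish a seccomp kill from an ordinary exit. A process terminated by SECCOMP_RET_KILL_PROCESS is reported as killed by SIGSYS (signal 31 on x86_64 Linux), so the wait status can be classified as sketched below. This assumes the VMM is spawned directly by the test harness and not through `timeout`. `Outcome` and `classify` are illustrative names.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

const SIGSYS: i32 = 31; // x86_64 Linux

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// Process exited on its own with this code.
    Exited(i32),
    /// Process was terminated by SIGSYS, the signature of a seccomp kill.
    SeccompKill,
    /// Process was terminated by some other signal.
    OtherSignal(i32),
}

fn classify(status: ExitStatus) -> Outcome {
    match (status.signal(), status.code()) {
        (Some(SIGSYS), _) => Outcome::SeccompKill,
        (Some(sig), _) => Outcome::OtherSignal(sig),
        (None, Some(code)) => Outcome::Exited(code),
        // Neither signalled nor exited normally; not expected for a finished child.
        (None, None) => Outcome::Exited(-1),
    }
}
```

A harness could pass the result of `std::process::Command::status()` for the commands above to `classify` and fail the test on `Outcome::SeccompKill`.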

Comparison with Firecracker

| Feature | Firecracker | Volt |
|---|---|---|
| Crate | seccompiler 0.4 | seccompiler 0.5 |
| Syscalls allowed | ~50 | ~72 |
| ioctl filtering | By KVM ioctl number | Allow all (fd-scoped) |
| Default action | KILL_PROCESS | KILL_PROCESS |
| Per-thread filters | Yes (API vs vCPU) | Single filter (TSYNC) |
| Disable flag | No (always on) | --no-seccomp for debug |

Volt allows roughly 20 more syscalls than Firecracker because:

  1. We include tokio runtime syscalls (epoll, clone3, rseq)
  2. We include networking syscalls for the API socket
  3. We include filesystem cleanup syscalls (unlink/unlinkat for socket cleanup)

Future Improvements

  1. Per-thread filters: Different allowlists for API thread vs vCPU threads (Firecracker does this)
  2. ioctl argument filtering: Filter to only KVM_* ioctl numbers (adds ~20 BPF rules but tightens security)
  3. Audit mode: Use SECCOMP_RET_LOG instead of SECCOMP_RET_KILL_PROCESS for development
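  4. Metrics: Count seccomp violations via a SIGSYS handler. This requires SECCOMP_RET_TRAP for the counted syscalls, because SECCOMP_RET_KILL_PROCESS terminates the process without running a handler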
  5. Remove --no-seccomp: Once the allowlist is proven stable in production
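The audit-mode and metrics ideas both come down to which action the filter returns for a syscall that is not on the allowlist. The sketch below shows one possible way to model that choice. `FilterMode` and its helpers are hypothetical and not part of Volt. The numeric values are the standard Linux `SECCOMP_RET_*` constants.

```rust
// Standard Linux seccomp return actions.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_TRAP: u32 = 0x0003_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FilterMode {
    /// Production: kill the whole process on a violation.
    Enforce,
    /// Development: log the violation and let the syscall proceed.
    Audit,
    /// Deliver a catchable SIGSYS so a handler can count violations.
    Trap,
}

/// Action returned by the filter for a syscall that is not on the allowlist.
fn mismatch_action(mode: FilterMode) -> u32 {
    match mode {
        FilterMode::Enforce => SECCOMP_RET_KILL_PROCESS,
        FilterMode::Audit => SECCOMP_RET_LOG,
        FilterMode::Trap => SECCOMP_RET_TRAP,
    }
}

/// Only the trap action lets a user-space SIGSYS handler observe the violation.
fn handler_can_observe(mode: FilterMode) -> bool {
    matches!(mode, FilterMode::Trap)
}

/// Only the log action lets the offending syscall actually execute.
fn syscall_proceeds(mode: FilterMode) -> bool {
    matches!(mode, FilterMode::Audit)
}
```

Audit mode weakens the sandbox because the offending syscall still executes, so a design along these lines would likely restrict it to development builds.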

Files

  • vmm/src/security/seccomp.rs — Filter definition, build, and apply logic
  • vmm/src/security/mod.rs — Module exports (also includes capabilities + landlock)
  • vmm/src/main.rs — Integration point (after init, before vCPU run) + --no-seccomp flag
  • vmm/Cargo.toml — seccompiler = "0.5" dependency