Volt VMM (Neutron Stardust): source-available under AGPSL v5.0

KVM-based microVMM for the Volt platform:

- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0

# Phase 3: Seccomp Allowlist Audit & Fix

## Status: ✅ COMPLETE

## Summary

The seccomp-bpf allowlist and Landlock configuration were audited for correctness.
**The VM already booted successfully with security features enabled**, because the Phase 2
allowlist included every syscall the boot path needs. Two additional syscalls (`fallocate`,
`ftruncate`) were added for production robustness.

## Findings

### Seccomp Filter

The Phase 2 seccomp allowlist (76 syscalls) already included all syscalls needed
for virtio-blk I/O processing:

| Syscall | Purpose | Status at Phase 2 |
|---------|---------|-------------------|
| `pread64` | Positional read for block I/O | ✅ Already present |
| `pwrite64` | Positional write for block I/O | ✅ Already present |
| `lseek` | File seeking for FileBackend | ✅ Already present |
| `fdatasync` | Data sync for flush operations | ✅ Already present |
| `fstat` | File metadata for disk size | ✅ Already present |
| `fsync` | Full sync for flush operations | ✅ Already present |
| `readv`/`writev` | Scatter-gather I/O | ✅ Already present |
| `madvise` | Memory advisory for guest mem | ✅ Already present |
| `mremap` | Memory remapping | ✅ Already present |
| `eventfd2` | Event notification for virtio | ✅ Already present |
| `timerfd_create` | Timer fd creation | ✅ Already present |
| `timerfd_settime` | Timer configuration | ✅ Already present |
| `ppoll` | Polling for events | ✅ Already present |
| `epoll_ctl` | Epoll event management | ✅ Already present |
| `epoll_wait` | Epoll event waiting | ✅ Already present |
| `epoll_create1` | Epoll instance creation | ✅ Already present |

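The block I/O rows of this table map directly onto Rust standard-library calls. The sketch below assumes a raw-image backend; `FileBackend` here is a hypothetical stand-in for the type named in this document, and its methods (`capacity`, `read_sectors`, `write_sectors`, `flush`) are illustrative rather than Volt's actual API. On Linux, `FileExt::read_exact_at`/`write_all_at` issue `pread64`/`pwrite64`, `File::sync_data` issues `fdatasync`, and seeking to `SeekFrom::End(0)` issues `lseek`, which is why those entries must stay on the allowlist.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use std::os::unix::fs::FileExt;

/// Hypothetical sector size for this sketch.
const SECTOR_SIZE: u64 = 512;

/// Hypothetical raw-image backend (not Volt's real type).
/// Each method notes the syscall the standard library issues on Linux.
struct FileBackend {
    file: File,
}

impl FileBackend {
    /// Disk capacity in bytes via `lseek(fd, 0, SEEK_END)`.
    fn capacity(&mut self) -> io::Result<u64> {
        self.file.seek(SeekFrom::End(0))
    }

    /// Read whole sectors starting at `sector` with `pread64`.
    fn read_sectors(&self, sector: u64, buf: &mut [u8]) -> io::Result<()> {
        self.file.read_exact_at(buf, sector * SECTOR_SIZE)
    }

    /// Write whole sectors starting at `sector` with `pwrite64`.
    fn write_sectors(&self, sector: u64, buf: &[u8]) -> io::Result<()> {
        self.file.write_all_at(buf, sector * SECTOR_SIZE)
    }

    /// Guest FLUSH request: `fdatasync` persists the written data.
    fn flush(&self) -> io::Result<()> {
        self.file.sync_data()
    }
}
```

Positional reads and writes never touch the shared file offset, so concurrent requests on the same `File` need no cursor coordination; `capacity` is the only method here that moves it.
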
### Syscalls Added in Phase 3

Two additional syscalls were added for production robustness:

| Syscall | Purpose | Why Added |
|---------|---------|-----------|
| `fallocate` | Pre-allocate disk space | Needed for CoW disk backends, qcow2 expansion, and Stellarium CAS storage |
| `ftruncate` | Resize files | Needed for disk resize operations and `FileBackend::create()` |

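On Linux, the Rust standard library implements `File::set_len` with `ftruncate`, so any path that creates or resizes a raw image depends on it. Below is a minimal sketch of such a path, assuming hypothetical helpers `create_disk_image` and `resize_disk_image` (Volt's actual `FileBackend::create()` may differ). It produces a sparse file; reserving the blocks up front would additionally need `fallocate`, which the standard library does not expose (it would come from a crate such as `libc` or `nix`), so that step is omitted here.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Hypothetical helper: create (or truncate) a raw disk image of `size_bytes`.
/// The result is sparse: no data blocks are allocated until the guest writes.
fn create_disk_image(path: &Path, size_bytes: u64) -> io::Result<File> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true) // O_TRUNC: start from an empty file
        .open(path)?;
    file.set_len(size_bytes)?; // ftruncate
    Ok(file)
}

/// Hypothetical resize path: grow or shrink an existing image in place.
fn resize_disk_image(file: &File, new_size: u64) -> io::Result<()> {
    // ftruncate again; shrinking discards everything past `new_size`.
    file.set_len(new_size)
}
```

Without `ftruncate` on the allowlist, either call would hit the default `KILL_PROCESS` action rather than returning an error.
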
### Landlock Configuration

The Landlock filesystem sandbox was verified correct:

- **Kernel image**: Read-only access ✅
- **Rootfs disk**: Read-write access (including `Truncate` flag) ✅
- **Device nodes**: `/dev/kvm`, `/dev/net/tun`, `/dev/vhost-net` with `IoctlDev` ✅
- **`/proc/self`**: Read-only access for fd management ✅
- **Stellarium volumes**: Read-write access when `--volume` is used ✅
- **API socket directory**: Socket creation + removal access ✅

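Landlock reports "partially enforced" on kernel 6.1 because the code targets
ABI V5 (kernel 6.10+) and falls back gracefully. Kernel 6.1 provides only ABI V2,
so the `Truncate` (ABI V3) and `IoctlDev` (ABI V5) rights cannot be restricted there,
while the remaining rules still apply. This is expected and correct.
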
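The fallback can be modeled in a few lines. This is a hypothetical sketch, not Volt's sandbox code or any Landlock library's API: each filesystem right has a minimum Landlock ABI, and a best-effort sandbox drops rights the running kernel does not know instead of failing. The ABI/kernel pairs come from the Landlock documentation (V1: 5.13, V2: 5.19, V3: 6.2 adds `Truncate`, V4: 6.7, V5: 6.10 adds `IoctlDev`); the mapping assumes mainline kernels without backports.

```rust
/// A few Landlock filesystem rights, named after the flags used above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FsRight {
    ReadFile,
    WriteFile,
    ReadDir,
    MakeSock,
    RemoveFile,
    Truncate, // introduced in Landlock ABI V3 (Linux 6.2)
    IoctlDev, // introduced in Landlock ABI V5 (Linux 6.10)
}

impl FsRight {
    /// Lowest Landlock ABI version that can enforce this right.
    fn min_abi(self) -> u32 {
        match self {
            FsRight::Truncate => 3,
            FsRight::IoctlDev => 5,
            _ => 1, // original ABI V1 rights (Linux 5.13)
        }
    }
}

/// Landlock ABI on a mainline kernel, capped at V5 (the version targeted here).
/// Returns 0 when the kernel predates Landlock.
fn landlock_abi(major: u32, minor: u32) -> u32 {
    let v = (major, minor); // tuples compare lexicographically
    if v >= (6, 10) {
        5
    } else if v >= (6, 7) {
        4
    } else if v >= (6, 2) {
        3
    } else if v >= (5, 19) {
        2
    } else if v >= (5, 13) {
        1
    } else {
        0
    }
}

#[derive(Debug, PartialEq)]
enum Enforcement {
    Full,
    /// Enforced, minus the listed rights the kernel cannot restrict.
    Partial(Vec<FsRight>),
    Unsupported,
}

/// Best-effort policy: drop unknown rights instead of refusing to sandbox.
fn enforcement(requested: &[FsRight], abi: u32) -> Enforcement {
    if abi == 0 {
        return Enforcement::Unsupported;
    }
    let dropped: Vec<FsRight> = requested
        .iter()
        .copied()
        .filter(|r| r.min_abi() > abi)
        .collect();
    if dropped.is_empty() {
        Enforcement::Full
    } else {
        Enforcement::Partial(dropped)
    }
}
```

On the 6.1 host used here, `landlock_abi(6, 1)` is 2, so requesting the rights from the list above yields `Partial(vec![Truncate, IoctlDev])`, consistent with the "partially enforced" log line.
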
### Syscall Trace Analysis

Using `strace -f` on the secured VMM, the following 17 unique syscalls were
observed during steady-state operation (all in the allowlist):

```
close, epoll_ctl, epoll_wait, exit_group, fsync, futex, ioctl,
lseek, mprotect, munmap, read, recvfrom, rt_sigreturn,
sched_yield, sendto, sigaltstack, write
```

No `SIGSYS` signals were generated. No syscalls returned `ENOSYS`.

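A list like the one above can be reproduced from a raw `strace -f` capture. Below is a minimal parsing sketch, assuming strace's default line format (no `-t`/`-T` options): an optional `[pid N] ` prefix when writing to a terminal, or a bare `N ` prefix when `-f` output goes to a file via `-o`, followed by `name(args) = ret`. `unique_syscalls` is a hypothetical helper; `strace -f -c` is a built-in alternative that prints a per-syscall summary table instead.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: collect unique syscall names from `strace -f` output.
/// A BTreeSet keeps the names sorted, matching the alphabetical list above.
fn unique_syscalls(trace: &str) -> BTreeSet<String> {
    let mut names = BTreeSet::new();
    for line in trace.lines() {
        let mut rest = line.trim_start();
        if let Some(after) = rest.strip_prefix("[pid ") {
            // Terminal form: "[pid 1234] name(...)".
            match after.find(']') {
                Some(i) => rest = after[i + 1..].trim_start(),
                None => continue,
            }
        } else {
            // File form with -o: "1234  name(...)"; strip the numeric PID.
            let digits = rest.len() - rest.trim_start_matches(|c: char| c.is_ascii_digit()).len();
            if digits > 0 {
                rest = rest[digits..].trim_start();
            }
        }
        // Skip signal deliveries (---), exit notices (+++), and
        // "<... name resumed>" halves already counted on their first line.
        if rest.starts_with("---") || rest.starts_with("+++") || rest.starts_with('<') {
            continue;
        }
        // The syscall name is the identifier before the first '('.
        if let Some(paren) = rest.find('(') {
            let name = &rest[..paren];
            if !name.is_empty() && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
                names.insert(name.to_string());
            }
        }
    }
    names
}
```
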
## Test Results

### With Security (Seccomp + Landlock)
```
$ ./target/release/volt-vmm \
    --kernel comparison/firecracker/vmlinux.bin \
    --rootfs comparison/rootfs.ext4 \
    --memory 128M --cpus 1 --net-backend none

Seccomp filter active: 78 syscalls allowed, all others → KILL_PROCESS
Landlock sandbox partially enforced
VM READY - BOOT TEST PASSED
```

### Without Security (baseline)
```
$ ./target/release/volt-vmm \
    --kernel comparison/firecracker/vmlinux.bin \
    --rootfs comparison/rootfs.ext4 \
    --memory 128M --cpus 1 --net-backend none \
    --no-seccomp --no-landlock

VM READY - BOOT TEST PASSED
```

Both modes produce identical boot results. Three consecutive runs were tested, and all passed.

## Final Allowlist (78 syscalls)

### File I/O (14)
`read`, `write`, `openat`, `close`, `fstat`, `lseek`, `pread64`, `pwrite64`,
`readv`, `writev`, `fsync`, `fdatasync`, `fallocate`★, `ftruncate`★

### Memory (6)
`mmap`, `mprotect`, `munmap`, `brk`, `madvise`, `mremap`

### KVM/Device (1)
`ioctl`

### Threading (7)
`clone`, `clone3`, `futex`, `set_robust_list`, `sched_yield`, `sched_getaffinity`, `rseq`

### Signals (4)
`rt_sigaction`, `rt_sigprocmask`, `rt_sigreturn`, `sigaltstack`

### Networking (18)
`accept4`, `bind`, `listen`, `socket`, `connect`, `recvfrom`, `sendto`,
`recvmsg`, `sendmsg`, `shutdown`, `getsockname`, `getpeername`, `setsockopt`,
`getsockopt`, `epoll_create1`, `epoll_ctl`, `epoll_wait`, `ppoll`

### Process (8)
`exit`, `exit_group`, `getpid`, `gettid`, `prctl`, `arch_prctl`, `prlimit64`, `tgkill`

### Timers (3)
`clock_gettime`, `nanosleep`, `clock_nanosleep`

### Misc (17)
`getrandom`, `eventfd2`, `timerfd_create`, `timerfd_settime`, `pipe2`,
`dup`, `dup2`, `fcntl`, `statx`, `newfstatat`, `access`, `readlinkat`,
`getcwd`, `unlink`, `unlinkat`, `mkdir`, `mkdirat`

★ = Added in Phase 3

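As a cross-check on the counts above, the allowlist can be written down as data. The sketch below is hypothetical: a real seccomp-bpf program matches syscall numbers, not names, and Volt's filter is not structured this way. It groups the 78 names by category, models the filter's two outcomes (allow, or the `KILL_PROCESS` default from the boot log), and reports any syscall from the 17-entry strace list that would have been killed.

```rust
use std::collections::HashSet;

/// The final allowlist by category, mirroring the lists above (x86_64 names).
const ALLOWLIST: &[(&str, &[&str])] = &[
    ("File I/O", &["read", "write", "openat", "close", "fstat", "lseek", "pread64",
        "pwrite64", "readv", "writev", "fsync", "fdatasync", "fallocate", "ftruncate"]),
    ("Memory", &["mmap", "mprotect", "munmap", "brk", "madvise", "mremap"]),
    ("KVM/Device", &["ioctl"]),
    ("Threading", &["clone", "clone3", "futex", "set_robust_list", "sched_yield",
        "sched_getaffinity", "rseq"]),
    ("Signals", &["rt_sigaction", "rt_sigprocmask", "rt_sigreturn", "sigaltstack"]),
    ("Networking", &["accept4", "bind", "listen", "socket", "connect", "recvfrom",
        "sendto", "recvmsg", "sendmsg", "shutdown", "getsockname", "getpeername",
        "setsockopt", "getsockopt", "epoll_create1", "epoll_ctl", "epoll_wait", "ppoll"]),
    ("Process", &["exit", "exit_group", "getpid", "gettid", "prctl", "arch_prctl",
        "prlimit64", "tgkill"]),
    ("Timers", &["clock_gettime", "nanosleep", "clock_nanosleep"]),
    ("Misc", &["getrandom", "eventfd2", "timerfd_create", "timerfd_settime", "pipe2",
        "dup", "dup2", "fcntl", "statx", "newfstatat", "access", "readlinkat", "getcwd",
        "unlink", "unlinkat", "mkdir", "mkdirat"]),
];

/// The 17 syscalls observed in the steady-state `strace -f` run.
const STRACE_OBSERVED: [&str; 17] = [
    "close", "epoll_ctl", "epoll_wait", "exit_group", "fsync", "futex", "ioctl",
    "lseek", "mprotect", "munmap", "read", "recvfrom", "rt_sigreturn",
    "sched_yield", "sendto", "sigaltstack", "write",
];

/// The filter's two outcomes in this model.
#[derive(Debug, PartialEq)]
enum Action {
    Allow,
    KillProcess, // SECCOMP_RET_KILL_PROCESS: the default for unlisted syscalls
}

/// Flatten the categories into one lookup set.
fn allowed_set() -> HashSet<&'static str> {
    ALLOWLIST.iter().flat_map(|(_, names)| names.iter().copied()).collect()
}

fn decide(allowed: &HashSet<&str>, syscall: &str) -> Action {
    if allowed.contains(syscall) { Action::Allow } else { Action::KillProcess }
}

/// Observed syscalls that the filter would kill (expected to be empty).
fn unlisted<'a>(observed: &[&'a str], allowed: &HashSet<&str>) -> Vec<&'a str> {
    observed
        .iter()
        .copied()
        .filter(|s| decide(allowed, s) == Action::KillProcess)
        .collect()
}
```

Because the set has no duplicates across categories, its size equals the sum of the per-category counts, so a miscounted heading shows up as a mismatch against the 78 reported in the boot log.
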
## Phase 2 Handoff Note

The Phase 2 handoff described the VM stalling with "Failed to enable 64-bit or
32-bit DMA" when security was enabled. This issue appears to have been resolved
during Phase 2 development: the final committed code includes all syscalls needed
for virtio-blk I/O. The DMA warning is a guest kernel log message from the Linux
virtio-mmio driver, not a Volt error. It appears in both secured and unsecured boots
and does not prevent boot completion.