volt-vmm/docs/landlock-caps-implementation.md
Karl Clinger 40ed108dd5 Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
2026-03-21 01:04:35 -05:00

# Landlock & Capability Dropping Implementation
**Date:** 2026-03-08
**Status:** Implemented and tested
## Overview
Volt VMM now implements three security hardening layers applied after all
privileged setup is complete (KVM, TAP, sockets) but before the vCPU run loop:
1. **Landlock filesystem sandbox** (kernel 5.13+, optional, default-enabled)
2. **Linux capability dropping** (always)
3. **Seccomp-BPF syscall filtering** (always unless `--no-seccomp`; already implemented before this change)
## Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: Seccomp-BPF (always unless --no-seccomp) │
│ 72 syscalls allowed, KILL_PROCESS on violation │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: Landlock (optional, kernel 5.13+) │
│ Filesystem path restrictions │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Capability dropping (always) │
│ All ambient, bounding, and effective caps dropped │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: PR_SET_NO_NEW_PRIVS (always) │
│ Prevents privilege escalation via execve │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: KVM isolation (inherent) │
│ Hardware virtualization boundary │
└─────────────────────────────────────────────────────────────┘
```
## Files
| File | Purpose |
|------|---------|
| `vmm/src/security/mod.rs` | Module root, `apply_security()` entrypoint, shared types |
| `vmm/src/security/capabilities.rs` | `drop_capabilities()` — prctl + capset |
| `vmm/src/security/landlock.rs` | `apply_landlock()` — Landlock ruleset builder |
| `vmm/src/security/seccomp.rs` | `apply_seccomp_filter()` — seccomp-bpf (pre-existing) |
## Part 1: Capability Dropping
### Implementation (`capabilities.rs`)
The `drop_capabilities()` function performs four operations:
1. **`prctl(PR_SET_NO_NEW_PRIVS, 1)`** — prevents privilege escalation via execve.
   Required by both Landlock and seccomp when the caller lacks `CAP_SYS_ADMIN`.
2. **`prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL)`** — clears all ambient
   capabilities. Gracefully handles EINVAL on kernels without ambient cap support.
3. **`prctl(PR_CAPBSET_DROP, cap)`** — iterates over all capability numbers (0 to 63)
   and drops each from the bounding set. Handles EPERM (expected when running
   as non-root) and EINVAL (cap doesn't exist) gracefully.
4. **`capset()` syscall** — clears the permitted, effective, and inheritable
   capability sets using the v3 capability API (two 32-bit words). Handles EPERM
   for non-root processes.
### Error Handling
- Running as non-root: EPERM on `PR_CAPBSET_DROP` and `capset` is logged as
debug/warning but not treated as fatal, since the process is already unprivileged.
- All other errors are fatal.
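
The four operations and the error tolerance described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of `capabilities.rs`: it declares `prctl` and `capset` directly through FFI instead of going through the `libc` crate, hard-codes constants from the Linux UAPI headers, and the `check` helper and the `CapHeader`/`CapData` names are hypothetical.

```rust
use std::io;
use std::os::raw::{c_int, c_ulong};

// Values from <linux/prctl.h>, <linux/capability.h>, and <errno.h> on Linux.
const PR_CAPBSET_DROP: c_int = 24;
const PR_SET_NO_NEW_PRIVS: c_int = 38;
const PR_CAP_AMBIENT: c_int = 47;
const PR_CAP_AMBIENT_CLEAR_ALL: c_ulong = 4;
const LINUX_CAPABILITY_VERSION_3: u32 = 0x2008_0522;
const EPERM: i32 = 1;
const EINVAL: i32 = 22;

/// Mirrors `struct __user_cap_header_struct` (hypothetical Rust name).
#[repr(C)]
struct CapHeader {
    version: u32,
    pid: c_int,
}

/// Mirrors `struct __user_cap_data_struct`; the v3 API passes two of these.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct CapData {
    effective: u32,
    permitted: u32,
    inheritable: u32,
}

unsafe extern "C" {
    fn prctl(option: c_int, ...) -> c_int;
    fn capset(hdr: *mut CapHeader, data: *const CapData) -> c_int;
}

/// Hypothetical helper: success, or a failure whose errno is in `tolerated`,
/// maps to Ok; every other failure is returned as a fatal error.
fn check(ret: c_int, tolerated: &[i32]) -> io::Result<()> {
    if ret == 0 {
        return Ok(());
    }
    let err = io::Error::last_os_error();
    match err.raw_os_error() {
        Some(code) if tolerated.contains(&code) => Ok(()),
        _ => Err(err),
    }
}

pub fn drop_capabilities() -> io::Result<()> {
    let zero: c_ulong = 0;
    // 1. Forbid gaining privileges through execve; any failure is fatal.
    check(unsafe { prctl(PR_SET_NO_NEW_PRIVS, 1 as c_ulong, zero, zero, zero) }, &[])?;
    // 2. Clear ambient capabilities; EINVAL means the kernel lacks ambient caps.
    check(
        unsafe { prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL, zero, zero, zero) },
        &[EINVAL],
    )?;
    // 3. Drop every capability number from the bounding set.
    //    EPERM: caller is not privileged. EINVAL: capability does not exist.
    for cap in 0..=(63 as c_ulong) {
        check(unsafe { prctl(PR_CAPBSET_DROP, cap, zero, zero, zero) }, &[EPERM, EINVAL])?;
    }
    // 4. Clear the effective, permitted, and inheritable sets (two 32-bit words each).
    let mut hdr = CapHeader { version: LINUX_CAPABILITY_VERSION_3, pid: 0 };
    let data = [CapData::default(); 2];
    check(unsafe { capset(&mut hdr, data.as_ptr()) }, &[EPERM])
}
```

The bounding set is dropped before `capset()` because `PR_CAPBSET_DROP` needs `CAP_SETPCAP` in the effective set, which step 4 removes.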
## Part 2: Landlock Filesystem Sandboxing
### Implementation (`landlock.rs`)
Uses the `landlock` crate (v0.4.4) which provides a safe Rust API over the
Landlock syscalls with automatic ABI version negotiation.
### Allowed Paths
| Path | Access | Purpose |
|------|--------|---------|
| Kernel image | Read-only | Boot the VM |
| Initrd (if specified) | Read-only | Initial ramdisk |
| Disk images (--rootfs) | Read-write | VM storage |
| API socket directory | RW + MakeSock | Unix socket API |
| `/dev/kvm` | RW + IoctlDev | KVM device |
| `/dev/net/tun` | RW + IoctlDev | TAP networking |
| `/dev/vhost-net` | RW + IoctlDev | vhost-net (if present) |
| `/proc/self` | Read-only | Process info, fd access |
| Extra `--landlock-rule` paths | User-specified | Hotplug, custom |
### ABI Compatibility
- **Target ABI:** V5 (kernel 6.10+, includes `IoctlDev`)
- **Minimum:** V1 (kernel 5.13+)
- **Mode:** Best-effort — the crate automatically strips unsupported features
- **Unavailable:** Logs a warning and continues without filesystem sandboxing
On kernel 6.1 (our test system), which implements Landlock ABI V2, the sandbox is
reported as "partially enforced" because access rights introduced by later ABIs
(such as `IoctlDev` from V5) are unavailable. Core filesystem restrictions are
still active.
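
The best-effort behaviour can be pictured with a small standalone model. This sketch does not use or reproduce the `landlock` crate's API; the `Right`, `Status`, `abi_for_kernel`, and `best_effort` names are hypothetical, and the kernel-to-ABI table assumes mainline releases up to ABI V6 (newer kernels are simply reported as V6 here).

```rust
/// A few filesystem access rights, tagged with the ABI that introduced them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Right {
    ReadFile,
    WriteFile,
    MakeSock,
    Refer,
    Truncate,
    IoctlDev,
}

impl Right {
    /// Lowest Landlock ABI version that understands this right.
    pub fn min_abi(self) -> u8 {
        match self {
            Right::ReadFile | Right::WriteFile | Right::MakeSock => 1,
            Right::Refer => 2,
            Right::Truncate => 3,
            Right::IoctlDev => 5,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Status {
    FullyEnforced,
    PartiallyEnforced,
    NotEnforced,
}

/// Landlock ABI offered by a mainline kernel release (0 = no Landlock).
pub fn abi_for_kernel(major: u32, minor: u32) -> u8 {
    match (major, minor) {
        v if v >= (6, 12) => 6,
        v if v >= (6, 10) => 5,
        v if v >= (6, 7) => 4,
        v if v >= (6, 2) => 3,
        v if v >= (5, 19) => 2,
        v if v >= (5, 13) => 1,
        _ => 0,
    }
}

/// Keep only the rights the running kernel supports and report the outcome.
pub fn best_effort(requested: &[Right], kernel_abi: u8) -> (Vec<Right>, Status) {
    if kernel_abi == 0 {
        return (Vec::new(), Status::NotEnforced);
    }
    let kept: Vec<Right> = requested
        .iter()
        .copied()
        .filter(|r| r.min_abi() <= kernel_abi)
        .collect();
    let status = if kept.len() == requested.len() {
        Status::FullyEnforced
    } else {
        Status::PartiallyEnforced
    };
    (kept, status)
}
```

In this model, requesting `ReadFile`, `WriteFile`, and `IoctlDev` on kernel 6.1 keeps the first two rights and yields `PartiallyEnforced`, which matches the warning seen on the test system.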
### CLI Flags
```bash
# Disable Landlock entirely
volt-vmm --kernel vmlinux -m 256M --no-landlock
# Add extra paths for hotplug or shared data
volt-vmm --kernel vmlinux -m 256M \
--landlock-rule /tmp/hotplug:rw \
--landlock-rule /data/shared:ro
```
Rule format: `path:access` where access is:
- `ro`, `r`, `read` — read-only
- `rw`, `w`, `write`, `readwrite` — full access
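
A parser for this rule format might look like the sketch below. The type and function names (`RuleAccess`, `LandlockRule`, `parse_landlock_rule`) are hypothetical, and two behaviours are assumptions rather than documented facts: the split happens at the last `:` so that paths containing a colon still parse, and access keywords are matched case-sensitively.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuleAccess {
    ReadOnly,
    ReadWrite,
}

#[derive(Debug, PartialEq, Eq)]
pub struct LandlockRule {
    pub path: PathBuf,
    pub access: RuleAccess,
}

/// Parse a `path:access` specification as accepted by `--landlock-rule`.
pub fn parse_landlock_rule(spec: &str) -> Result<LandlockRule, String> {
    // Assumption: split at the last ':' so "/mnt/a:b:ro" keeps "/mnt/a:b" as the path.
    let (path, access) = spec
        .rsplit_once(':')
        .ok_or_else(|| format!("expected path:access, got '{spec}'"))?;
    if path.is_empty() {
        return Err(format!("empty path in rule '{spec}'"));
    }
    let access = match access {
        "ro" | "r" | "read" => RuleAccess::ReadOnly,
        "rw" | "w" | "write" | "readwrite" => RuleAccess::ReadWrite,
        other => return Err(format!("unknown access '{other}' in rule '{spec}'")),
    };
    Ok(LandlockRule {
        path: PathBuf::from(path),
        access,
    })
}
```

With this sketch, `/tmp/hotplug:rw` parses to a read-write rule for `/tmp/hotplug`, while a bare `/tmp` or an unknown keyword such as `/tmp:exec` is rejected with an error message.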
### Application Order
The security layers are applied in this order in `main.rs`:
```text
1. All initialization complete (KVM, memory, kernel, devices, API socket)
2. Landlock applied (needs landlock syscalls, sets PR_SET_NO_NEW_PRIVS)
3. Capabilities dropped (needs prctl, capset)
4. Seccomp applied (locks down syscalls, uses TSYNC for all threads)
5. vCPU run loop starts
```
This ordering is critical: Landlock and capability syscalls must be available
before seccomp restricts the syscall set.
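
The ordering constraint can be illustrated with a toy model that only tracks which syscalls each step needs. Everything here is hypothetical: the `Step` enum, the `apply_in_order` function, and in particular `RUN_LOOP_ALLOWLIST`, which is a made-up stand-in for the real 72-syscall filter and assumes that filter excludes the Landlock, `prctl`, and `capset` syscalls.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Landlock,
    DropCaps,
    Seccomp,
}

impl Step {
    /// Syscalls the step itself must issue (illustrative, not exhaustive).
    fn needs(self) -> &'static [&'static str] {
        match self {
            Step::Landlock => &[
                "landlock_create_ruleset",
                "landlock_add_rule",
                "landlock_restrict_self",
            ],
            Step::DropCaps => &["prctl", "capset"],
            Step::Seccomp => &["seccomp"],
        }
    }
}

/// Hypothetical post-lockdown allowlist; the real filter allows 72 syscalls.
const RUN_LOOP_ALLOWLIST: &[&str] = &["ioctl", "read", "write", "epoll_wait", "exit_group"];

/// Apply the steps in the given order, failing as soon as a step needs a
/// syscall that an already-installed seccomp filter would block.
pub fn apply_in_order(order: &[Step]) -> Result<Vec<Step>, String> {
    let mut filter: Option<HashSet<&str>> = None;
    let mut applied = Vec::new();
    for &step in order {
        if let Some(allowed) = &filter {
            if let Some(blocked) = step.needs().iter().find(|s| !allowed.contains(*s)) {
                return Err(format!("{step:?} needs {blocked}, which seccomp already blocks"));
            }
        }
        if step == Step::Seccomp {
            // From here on, only the allowlisted syscalls remain available.
            filter = Some(RUN_LOOP_ALLOWLIST.iter().copied().collect());
        }
        applied.push(step);
    }
    Ok(applied)
}
```

In this model the order Landlock, capabilities, seccomp succeeds, while installing seccomp first makes the Landlock step fail. In the real process a blocked syscall would trigger `KILL_PROCESS` instead of returning an error.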
## Testing
### Test Results (kernel 6.1.0-42-amd64)
```text
# Minimal kernel — boots successfully
$ timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
INFO Applying Landlock filesystem sandbox
WARN Landlock sandbox partially enforced (kernel may not support all features)
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
INFO Applying seccomp-bpf filter (72 syscalls allowed)
INFO Seccomp filter active
Hello from minimal kernel!
OK
# Full Linux kernel — boots successfully
$ timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
INFO Applying Landlock filesystem sandbox
WARN Landlock sandbox partially enforced
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
INFO Applying seccomp-bpf filter (72 syscalls allowed)
[kernel boot messages, VFS panic due to no rootfs — expected]
# --no-landlock flag works
$ volt-vmm --kernel ... -m 128M --no-landlock
WARN Landlock disabled via --no-landlock
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
# --landlock-rule flag works
$ volt-vmm --kernel ... -m 128M --landlock-rule /tmp:rw
DEBUG Landlock: user rule rw access to /tmp
```
## Dependencies Added
```toml
# vmm/Cargo.toml
landlock = "0.4" # Landlock LSM helpers (crates.io, MIT/Apache-2.0)
```
No other new dependencies — `libc` was already present for the prctl/capset calls.
## Future Improvements
1. **Network restrictions** — Landlock ABI V4 (kernel 6.7+) supports TCP port
   filtering for bind and connect. This does not cover the Unix-domain API socket,
   but it could restrict any TCP endpoints the VMM uses to specific ports.
2. **IPC scoping** — Landlock ABI V6 (kernel 6.12+) can scope signals and
abstract Unix sockets.
3. **Root-mode bounding set** — When running as root, the full bounding set
can be dropped. Currently gracefully skips on EPERM.
4. **seccomp + Landlock integration test** — Verify that the seccomp allowlist
includes all syscalls needed after Landlock is active (it does, since Landlock
is applied first, but a regression test would be good).