Volt VMM (Neutron Stardust): source-available under AGPSL v5.0
KVM-based microVMM for the Volt platform:
- Sub-second VM boot times
- Minimal memory footprint
- Landlock LSM + seccomp security
- Virtio device support
- Custom kernel management

Copyright (c) Armored Gates LLC. All rights reserved.
Licensed under AGPSL v5.0
192
docs/landlock-caps-implementation.md
Normal file
@@ -0,0 +1,192 @@
# Landlock & Capability Dropping Implementation

**Date:** 2026-03-08
**Status:** Implemented and tested

## Overview

Volt VMM now implements three security hardening layers, applied after all
privileged setup is complete (KVM, TAP, sockets) but before the vCPU run loop:

1. **Landlock filesystem sandbox** (kernel 5.13+, optional, enabled by default)
2. **Linux capability dropping** (always)
3. **Seccomp-BPF syscall filtering** (always unless `--no-seccomp`; already implemented)

## Architecture

```text
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: Seccomp-BPF (always unless --no-seccomp) │
│ 72 syscalls allowed, KILL_PROCESS on violation │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: Landlock (optional, kernel 5.13+) │
│ Filesystem path restrictions │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Capability dropping (always) │
│ All ambient, bounding, and effective caps dropped │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: PR_SET_NO_NEW_PRIVS (always) │
│ Prevents privilege escalation via execve │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: KVM isolation (inherent) │
│ Hardware virtualization boundary │
└─────────────────────────────────────────────────────────────┘
```

## Files

| File | Purpose |
|------|---------|
| `vmm/src/security/mod.rs` | Module root, `apply_security()` entrypoint, shared types |
| `vmm/src/security/capabilities.rs` | `drop_capabilities()`: prctl + capset |
| `vmm/src/security/landlock.rs` | `apply_landlock()`: Landlock ruleset builder |
| `vmm/src/security/seccomp.rs` | `apply_seccomp_filter()`: seccomp-bpf (pre-existing) |

## Part 1: Capability Dropping

### Implementation (`capabilities.rs`)

The `drop_capabilities()` function performs four operations:

1. **`prctl(PR_SET_NO_NEW_PRIVS, 1)`**: prevents privilege escalation via execve.
   Required by both Landlock and seccomp when the process lacks `CAP_SYS_ADMIN`.

2. **`prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL)`**: clears all ambient
   capabilities. Gracefully handles EINVAL on kernels without ambient cap support.

3. **`prctl(PR_CAPBSET_DROP, cap)`**: iterates over all capability numbers (0–63)
   and drops each from the bounding set. Handles EPERM (expected when running
   as non-root) and EINVAL (cap doesn't exist) gracefully.

4. **`capset()` syscall**: clears the permitted, effective, and inheritable
   capability sets using the v3 capability API (two 32-bit words). Handles EPERM
   for non-root processes.
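
To make the "two 32-bit words" detail in step 4 concrete, the following is a minimal sketch of the data a v3 `capset()` call takes. The struct and function names (`CapHeader`, `CapData`, `cap_word_and_mask`, `empty_capset_args`) are illustrative and not taken from `capabilities.rs`; the layouts mirror the kernel's `__user_cap_header_struct` and `__user_cap_data_struct`. The sketch only builds the arguments and does not issue the syscall.

```rust
/// Mirrors the kernel's `__user_cap_header_struct` (illustrative name).
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CapHeader {
    version: u32,
    pid: i32, // 0 means "the calling thread"
}

/// Mirrors the kernel's `__user_cap_data_struct` (illustrative name).
/// The v3 API passes an array of two of these: word 0 covers
/// capabilities 0..=31 and word 1 covers capabilities 32..=63.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct CapData {
    effective: u32,
    permitted: u32,
    inheritable: u32,
}

/// Value of `_LINUX_CAPABILITY_VERSION_3` in <linux/capability.h>.
const LINUX_CAPABILITY_VERSION_3: u32 = 0x2008_0522;

/// Returns which of the two words holds capability `cap`, and its bit mask.
fn cap_word_and_mask(cap: u32) -> Option<(usize, u32)> {
    if cap >= 64 {
        return None;
    }
    Some(((cap / 32) as usize, 1u32 << (cap % 32)))
}

/// Builds the header and data for a `capset()` call that clears the
/// effective, permitted, and inheritable sets of the calling thread.
fn empty_capset_args() -> (CapHeader, [CapData; 2]) {
    let header = CapHeader {
        version: LINUX_CAPABILITY_VERSION_3,
        pid: 0,
    };
    (header, [CapData::default(); 2])
}
```

For example, `cap_word_and_mask(21)` (the number of `CAP_SYS_ADMIN`) lands in word 0, while any capability numbered 32 or higher lands in word 1. Clearing everything simply means both words stay zero in all three sets.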

### Error Handling

- Running as non-root: EPERM on `PR_CAPBSET_DROP` and `capset` is logged as
  debug/warning but not treated as fatal, since the process is already unprivileged.
- EINVAL on `PR_CAP_AMBIENT` (no ambient capability support) and on
  `PR_CAPBSET_DROP` (capability number does not exist) is also tolerated.
- All other errors are fatal.
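
The tolerated-versus-fatal policy described for the four operations can be written as a single decision table. This is a sketch under assumptions: `CapStep`, `Outcome`, and `classify` are hypothetical names, and the real `drop_capabilities()` may structure its error handling differently. The errno values are the Linux ones (`EPERM` = 1, `EINVAL` = 22).

```rust
// Linux errno values, defined locally so the sketch needs no external crate.
const EPERM: i32 = 1;
const EINVAL: i32 = 22;

/// The four operations performed while dropping capabilities (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CapStep {
    NoNewPrivs,
    AmbientClear,
    BoundingDrop,
    Capset,
}

/// What the caller should do with a failed step (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    /// Log and continue.
    Tolerated,
    /// Abort startup.
    Fatal,
}

/// Maps a failed step and its errno to the policy described in the text.
fn classify(step: CapStep, errno: i32) -> Outcome {
    match (step, errno) {
        // Kernel without ambient capability support.
        (CapStep::AmbientClear, EINVAL) => Outcome::Tolerated,
        // Non-root caller, or a capability number the kernel does not know.
        (CapStep::BoundingDrop, EPERM) | (CapStep::BoundingDrop, EINVAL) => Outcome::Tolerated,
        // Non-root caller: the process is already unprivileged.
        (CapStep::Capset, EPERM) => Outcome::Tolerated,
        // Everything else is fatal.
        _ => Outcome::Fatal,
    }
}
```

Keeping the policy in one `match` makes the "all other errors are fatal" default explicit: a failure of `PR_SET_NO_NEW_PRIVS`, for instance, is never tolerated.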

## Part 2: Landlock Filesystem Sandboxing

### Implementation (`landlock.rs`)

Uses the `landlock` crate (v0.4.4), which provides a safe Rust API over the
Landlock syscalls with automatic ABI version negotiation.

### Allowed Paths

| Path | Access | Purpose |
|------|--------|---------|
| Kernel image | Read-only | Boot the VM |
| Initrd (if specified) | Read-only | Initial ramdisk |
| Disk images (`--rootfs`) | Read-write | VM storage |
| API socket directory | RW + MakeSock | Unix socket API |
| `/dev/kvm` | RW + IoctlDev | KVM device |
| `/dev/net/tun` | RW + IoctlDev | TAP networking |
| `/dev/vhost-net` | RW + IoctlDev | vhost-net (if present) |
| `/proc/self` | Read-only | Process info, fd access |
| Extra `--landlock-rule` paths | User-specified | Hotplug, custom |
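
As an illustration of how the table above could be derived from the VM configuration, here is a minimal sketch that produces the planned rule list as plain data. The types `Access` and `SandboxInputs` and the function `planned_rules` are hypothetical and are not the API of `landlock.rs` or of the `landlock` crate. User-supplied `--landlock-rule` entries are left out, and the presence of `/dev/vhost-net` is passed in as a flag so the sketch stays free of filesystem access.

```rust
use std::path::PathBuf;

/// Access classes used in the "Allowed Paths" table (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    ReadOnly,
    ReadWrite,
    ReadWriteMakeSock,
    ReadWriteIoctlDev,
}

/// Hypothetical subset of the VMM configuration that influences the sandbox.
struct SandboxInputs {
    kernel: PathBuf,
    initrd: Option<PathBuf>,
    disks: Vec<PathBuf>,
    api_socket: Option<PathBuf>,
    vhost_net_present: bool,
}

/// Returns the (path, access) pairs that the table describes.
fn planned_rules(cfg: &SandboxInputs) -> Vec<(PathBuf, Access)> {
    let mut rules = vec![(cfg.kernel.clone(), Access::ReadOnly)];
    if let Some(initrd) = &cfg.initrd {
        rules.push((initrd.clone(), Access::ReadOnly));
    }
    for disk in &cfg.disks {
        rules.push((disk.clone(), Access::ReadWrite));
    }
    if let Some(socket) = &cfg.api_socket {
        // The rule targets the directory that contains the socket.
        if let Some(dir) = socket.parent() {
            rules.push((dir.to_path_buf(), Access::ReadWriteMakeSock));
        }
    }
    rules.push((PathBuf::from("/dev/kvm"), Access::ReadWriteIoctlDev));
    rules.push((PathBuf::from("/dev/net/tun"), Access::ReadWriteIoctlDev));
    if cfg.vhost_net_present {
        rules.push((PathBuf::from("/dev/vhost-net"), Access::ReadWriteIoctlDev));
    }
    rules.push((PathBuf::from("/proc/self"), Access::ReadOnly));
    rules
}
```

Separating "which paths, with which access" from the code that talks to the kernel makes the rule set easy to unit-test without Landlock support on the build machine.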

### ABI Compatibility

- **Target ABI:** V5 (kernel 6.10+, includes `IoctlDev`)
- **Minimum:** V1 (kernel 5.13+)
- **Mode:** Best-effort (the crate automatically strips unsupported features)
- **Unavailable:** Logs a warning and continues without filesystem sandboxing

On kernel 6.1 (our test system), the sandbox is "partially enforced" because
mainline 6.1 provides Landlock ABI V2, so features from later ABIs (such as
`IoctlDev` from V5) are unavailable. Core filesystem restrictions are still active.
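
The relationship between kernel version, Landlock ABI, and the "partially enforced" status can be sketched as a small lookup. This is an illustration only: real negotiation asks the running kernel for its ABI rather than parsing version numbers, distribution kernels may backport features, and Landlock can be disabled at boot. The names `mainline_abi`, `Enforcement`, and `expected_status` are hypothetical; the version table reflects when each ABI first shipped in mainline Linux.

```rust
/// Landlock ABI version paired with the first mainline kernel (major, minor)
/// that ships it.
const ABI_BY_KERNEL: [(u32, (u32, u32)); 6] = [
    (1, (5, 13)),
    (2, (5, 19)),
    (3, (6, 2)),
    (4, (6, 7)),
    (5, (6, 10)),
    (6, (6, 12)),
];

/// Highest ABI a mainline kernel of the given version is expected to offer.
fn mainline_abi(kernel: (u32, u32)) -> Option<u32> {
    let mut best = None;
    for &(abi, min_kernel) in ABI_BY_KERNEL.iter() {
        // Tuples compare lexicographically: major first, then minor.
        if kernel >= min_kernel {
            best = Some(abi);
        }
    }
    best
}

/// Enforcement level, echoing the wording of the log messages (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Enforcement {
    FullyEnforced,
    PartiallyEnforced,
    NotEnforced,
}

/// Expected outcome of a best-effort ruleset that targets `target_abi`.
fn expected_status(target_abi: u32, kernel: (u32, u32)) -> Enforcement {
    match mainline_abi(kernel) {
        None => Enforcement::NotEnforced,
        Some(abi) if abi >= target_abi => Enforcement::FullyEnforced,
        Some(_) => Enforcement::PartiallyEnforced,
    }
}
```

With a target of V5, a 6.1 kernel maps to `PartiallyEnforced`, which matches the warning seen on the test system, and a kernel older than 5.13 maps to `NotEnforced`.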

### CLI Flags

```bash
# Disable Landlock entirely
volt-vmm --kernel vmlinux -m 256M --no-landlock

# Add extra paths for hotplug or shared data
volt-vmm --kernel vmlinux -m 256M \
  --landlock-rule /tmp/hotplug:rw \
  --landlock-rule /data/shared:ro
```

Rule format: `path:access`, where `access` is one of:
- `ro`, `r`, `read`: read-only
- `rw`, `w`, `write`, `readwrite`: full access
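
A parser for the `path:access` format might look like the sketch below. The names `RuleAccess` and `parse_landlock_rule` are hypothetical, and splitting on the last `:` (so that paths containing a colon still parse) is an assumption, not documented behavior of the `--landlock-rule` flag.

```rust
use std::path::PathBuf;

/// Access level requested by a user rule (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuleAccess {
    ReadOnly,
    ReadWrite,
}

/// Parses a `path:access` rule string into its path and access level.
fn parse_landlock_rule(spec: &str) -> Result<(PathBuf, RuleAccess), String> {
    // Assumption: split on the last ':' so paths may themselves contain ':'.
    let (path, access) = spec
        .rsplit_once(':')
        .ok_or_else(|| format!("expected path:access, got '{}'", spec))?;
    if path.is_empty() {
        return Err(format!("empty path in rule '{}'", spec));
    }
    let access = match access {
        "ro" | "r" | "read" => RuleAccess::ReadOnly,
        "rw" | "w" | "write" | "readwrite" => RuleAccess::ReadWrite,
        other => return Err(format!("unknown access '{}' in rule '{}'", other, spec)),
    };
    Ok((PathBuf::from(path), access))
}
```

Under these assumptions, `/tmp/hotplug:rw` parses to a read-write rule for `/tmp/hotplug`, while a bare `/tmp` or an unknown access keyword is rejected with an error message instead of being silently widened to full access.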

### Application Order

The security layers are applied in this order in `main.rs`:

```
1. All initialization complete (KVM, memory, kernel, devices, API socket)
2. Landlock applied (needs landlock syscalls, sets PR_SET_NO_NEW_PRIVS)
3. Capabilities dropped (needs prctl, capset)
4. Seccomp applied (locks down syscalls, uses TSYNC for all threads)
5. vCPU run loop starts
```

This ordering is critical: Landlock and capability syscalls must be available
before seccomp restricts the syscall set.
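
One way to keep this ordering from drifting is to encode it as data and run the steps from that plan. The sketch below is hypothetical: `Step`, `SecurityConfig`, `hardening_plan`, and `run_plan` are illustrative names, not the contents of `main.rs` or `apply_security()`, and the `apply` callback stands in for the real Landlock, capability, and seccomp calls. It models only the `--no-landlock` switch.

```rust
/// Hardening steps in the order they must run (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    Landlock,
    DropCapabilities,
    Seccomp,
}

/// Hypothetical configuration; `landlock` is false when `--no-landlock` is given.
struct SecurityConfig {
    landlock: bool,
}

/// Builds the ordered list of steps. Seccomp is always last, because it
/// removes the syscalls that the earlier steps still need.
fn hardening_plan(cfg: &SecurityConfig) -> Vec<Step> {
    let mut plan = Vec::new();
    if cfg.landlock {
        plan.push(Step::Landlock);
    }
    plan.push(Step::DropCapabilities);
    plan.push(Step::Seccomp);
    plan
}

/// Runs each step in order and stops at the first failure, so the vCPU
/// loop is never entered with a partially applied plan.
fn run_plan<F>(plan: &[Step], mut apply: F) -> Result<(), String>
where
    F: FnMut(Step) -> Result<(), String>,
{
    for &step in plan {
        apply(step)?;
    }
    Ok(())
}
```

A unit test can then assert that `Step::Seccomp` is the final element of every plan, which covers the ordering constraint without needing a kernel that supports Landlock.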

## Testing

### Test Results (kernel 6.1.0-42-amd64)

```
# Minimal kernel: boots successfully
$ timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
INFO Applying Landlock filesystem sandbox
WARN Landlock sandbox partially enforced (kernel may not support all features)
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
INFO Applying seccomp-bpf filter (72 syscalls allowed)
INFO Seccomp filter active
Hello from minimal kernel!
OK

# Full Linux kernel: boots successfully
$ timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
INFO Applying Landlock filesystem sandbox
WARN Landlock sandbox partially enforced
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
INFO Applying seccomp-bpf filter (72 syscalls allowed)
[kernel boot messages, VFS panic due to no rootfs (expected)]

# --no-landlock flag works
$ volt-vmm --kernel ... -m 128M --no-landlock
WARN Landlock disabled via --no-landlock
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully

# --landlock-rule flag works
$ volt-vmm --kernel ... -m 128M --landlock-rule /tmp:rw
DEBUG Landlock: user rule rw access to /tmp
```

## Dependencies Added

```toml
# vmm/Cargo.toml
landlock = "0.4" # Landlock LSM helpers (crates.io, MIT/Apache-2.0)
```

No other new dependencies were needed: `libc` was already present for the prctl/capset calls.

## Future Improvements

1. **Network restrictions**: Landlock ABI V4 (kernel 6.7+) supports TCP port
   filtering. This could restrict the API to specific ports if it is exposed
   over TCP; the Unix socket API listed under Allowed Paths is not covered by
   TCP rules.

2. **IPC scoping**: Landlock ABI V6 (kernel 6.12+) can scope signals and
   abstract Unix sockets.

3. **Root-mode bounding set**: when running as root, the full bounding set
   can be dropped. The current code gracefully skips the drop on EPERM.

4. **seccomp + Landlock integration test**: verify that the seccomp allowlist
   includes all syscalls needed after Landlock is active. It does today, since
   Landlock is applied before seccomp, but a regression test would guard
   against future changes.