KVM-based microVMM for the Volt platform: - Sub-second VM boot times - Minimal memory footprint - Landlock LSM + seccomp security - Virtio device support - Custom kernel management Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0
# Landlock & Capability Dropping Implementation

**Date:** 2026-03-08
**Status:** Implemented and tested

## Overview

Volt VMM now implements three security hardening layers. They are applied after all
privileged setup is complete (KVM, TAP, sockets) but before the vCPU run loop:

1. **Landlock filesystem sandbox** (kernel 5.13+, optional, default-enabled)
2. **Linux capability dropping** (always)
3. **Seccomp-BPF syscall filtering** (enabled unless `--no-seccomp` is given; previously implemented)

## Architecture

```text
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: Seccomp-BPF (always unless --no-seccomp)           │
│          72 syscalls allowed, KILL_PROCESS on violation     │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: Landlock (optional, kernel 5.13+)                  │
│          Filesystem path restrictions                       │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Capability dropping (always)                       │
│          All ambient, bounding, and effective caps dropped  │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: PR_SET_NO_NEW_PRIVS (always)                       │
│          Prevents privilege escalation via execve           │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: KVM isolation (inherent)                           │
│          Hardware virtualization boundary                   │
└─────────────────────────────────────────────────────────────┘
```

## Files

| File | Purpose |
|------|---------|
| `vmm/src/security/mod.rs` | Module root, `apply_security()` entrypoint, shared types |
| `vmm/src/security/capabilities.rs` | `drop_capabilities()` — prctl + capset |
| `vmm/src/security/landlock.rs` | `apply_landlock()` — Landlock ruleset builder |
| `vmm/src/security/seccomp.rs` | `apply_seccomp_filter()` — seccomp-bpf (pre-existing) |

## Part 1: Capability Dropping

### Implementation (`capabilities.rs`)

The `drop_capabilities()` function performs four operations:

1. **`prctl(PR_SET_NO_NEW_PRIVS, 1)`** — prevents privilege escalation via execve.
   Required by both Landlock and seccomp when the process lacks `CAP_SYS_ADMIN`.

2. **`prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL)`** — clears all ambient
   capabilities. Gracefully handles EINVAL on kernels without ambient cap support.

3. **`prctl(PR_CAPBSET_DROP, cap)`** — iterates over all capability numbers (0–63)
   and drops each from the bounding set. Handles EPERM (expected when running
   as non-root) and EINVAL (cap doesn't exist) gracefully.

4. **`capset()` syscall** — clears the permitted, effective, and inheritable
   capability sets using the v3 capability API (two 32-bit words per set). Handles
   EPERM for non-root processes.

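
The "two 32-bit words" in step 4 refer to the layout the kernel expects for a version 3 `capset()` call. The following is a minimal sketch of that layout using only the standard library; the struct and function names are illustrative and are not taken from `capabilities.rs`, and the actual syscall is deliberately left out.

```rust
/// `_LINUX_CAPABILITY_VERSION_3` from <linux/capability.h>.
const LINUX_CAPABILITY_VERSION_3: u32 = 0x2008_0522;

/// Mirrors the kernel's `struct __user_cap_header_struct`.
/// (Illustrative name; not the type used in the VMM.)
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CapHeader {
    version: u32,
    pid: i32, // 0 means "the calling thread"
}

/// Mirrors the kernel's `struct __user_cap_data_struct`:
/// one 32-bit slice of each of the three capability sets.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct CapData {
    effective: u32,
    permitted: u32,
    inheritable: u32,
}

/// Builds the arguments for a `capset()` call that clears every set.
/// Element 0 covers capabilities 0-31, element 1 covers 32-63.
fn cleared_capset_args() -> (CapHeader, [CapData; 2]) {
    let header = CapHeader {
        version: LINUX_CAPABILITY_VERSION_3,
        pid: 0,
    };
    (header, [CapData::default(); 2])
}
```

Passing pointers to these two values to `capset()` would request empty effective, permitted, and inheritable sets for the calling thread.
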
### Error Handling

- Running as non-root: EPERM on `PR_CAPBSET_DROP` and `capset` is logged as
  debug/warning but not treated as fatal, since the process is already unprivileged.
- All other errors are fatal.

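
The policy above, combined with the EINVAL cases mentioned in the implementation steps, can be expressed as a small classification function. This is a sketch under assumptions: the `CapOp` and `Outcome` types and the `classify` function are hypothetical names, and the real code may structure its error handling differently.

```rust
/// errno values on Linux (from <errno.h>).
const EPERM: i32 = 1;
const EINVAL: i32 = 22;

/// The operations performed while dropping capabilities (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CapOp {
    SetNoNewPrivs,
    ClearAmbient,
    DropBounding,
    Capset,
}

/// What the caller should do with a failed operation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    /// Log and continue; the failure is expected in this situation.
    Tolerated,
    /// Abort startup.
    Fatal,
}

/// Classifies an errno according to the policy described in this section.
fn classify(op: CapOp, errno: i32) -> Outcome {
    match (op, errno) {
        // Kernels without ambient capability support reject the prctl.
        (CapOp::ClearAmbient, EINVAL) => Outcome::Tolerated,
        // Non-root callers cannot modify the bounding set, and capability
        // numbers the kernel does not know are reported as EINVAL.
        (CapOp::DropBounding, EPERM) | (CapOp::DropBounding, EINVAL) => Outcome::Tolerated,
        // Non-root callers may be refused by capset().
        (CapOp::Capset, EPERM) => Outcome::Tolerated,
        // Everything else is treated as fatal.
        _ => Outcome::Fatal,
    }
}
```

Keeping the policy in one pure function makes it easy to unit-test without root privileges.
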
## Part 2: Landlock Filesystem Sandboxing

### Implementation (`landlock.rs`)

Uses the `landlock` crate (v0.4.4), which provides a safe Rust API over the
Landlock syscalls with automatic ABI version negotiation.

### Allowed Paths

| Path | Access | Purpose |
|------|--------|---------|
| Kernel image | Read-only | Boot the VM |
| Initrd (if specified) | Read-only | Initial ramdisk |
| Disk images (`--rootfs`) | Read-write | VM storage |
| API socket directory | RW + MakeSock | Unix socket API |
| `/dev/kvm` | RW + IoctlDev | KVM device |
| `/dev/net/tun` | RW + IoctlDev | TAP networking |
| `/dev/vhost-net` | RW + IoctlDev | vhost-net (if present) |
| `/proc/self` | Read-only | Process info, fd access |
| Extra `--landlock-rule` paths | User-specified | Hotplug, custom |

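
One way to picture how the built-in rows of this table are derived from the VM configuration is sketched below. The `SandboxConfig` struct, the `Access` enum, and `default_rules` are hypothetical names introduced for illustration; the actual builder in `landlock.rs` works with the `landlock` crate's types and may differ. User-supplied `--landlock-rule` paths are omitted.

```rust
use std::path::PathBuf;

/// Access classes used in the table (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    ReadOnly,
    ReadWrite,
    ReadWriteMakeSock,
    ReadWriteIoctlDev,
}

/// Hypothetical subset of the VMM configuration relevant to the sandbox.
struct SandboxConfig {
    kernel: PathBuf,
    initrd: Option<PathBuf>,
    disks: Vec<PathBuf>,
    api_socket_dir: PathBuf,
    /// Whether /dev/vhost-net exists on the host (checked by the caller).
    vhost_net_present: bool,
}

/// Collects the built-in (path, access) pairs in table order.
fn default_rules(cfg: &SandboxConfig) -> Vec<(PathBuf, Access)> {
    let mut rules = vec![(cfg.kernel.clone(), Access::ReadOnly)];
    if let Some(initrd) = &cfg.initrd {
        rules.push((initrd.clone(), Access::ReadOnly));
    }
    for disk in &cfg.disks {
        rules.push((disk.clone(), Access::ReadWrite));
    }
    rules.push((cfg.api_socket_dir.clone(), Access::ReadWriteMakeSock));
    rules.push((PathBuf::from("/dev/kvm"), Access::ReadWriteIoctlDev));
    rules.push((PathBuf::from("/dev/net/tun"), Access::ReadWriteIoctlDev));
    if cfg.vhost_net_present {
        rules.push((PathBuf::from("/dev/vhost-net"), Access::ReadWriteIoctlDev));
    }
    rules.push((PathBuf::from("/proc/self"), Access::ReadOnly));
    rules
}
```

Optional entries (initrd, vhost-net) simply do not appear in the list when they are absent, so the sandbox never grants access to a path the VM does not use.
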
### ABI Compatibility

- **Target ABI:** V5 (kernel 6.10+, includes `IoctlDev`)
- **Minimum:** V1 (kernel 5.13+)
- **Mode:** Best-effort — the crate automatically strips unsupported features
- **Landlock unavailable:** Logs a warning and continues without filesystem sandboxing

On kernel 6.1 (our test system), which provides Landlock ABI V2, the sandbox is
reported as "partially enforced" because access rights added in later ABI versions
(such as `IoctlDev` from V5) are unavailable. Core filesystem restrictions are still
active.

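
The best-effort behaviour can be illustrated by masking the requested rights against what a given ABI level supports. The bit values below are the filesystem access-right flags from `<linux/landlock.h>`; the functions and the `Enforcement` enum are a simplified sketch written for this note and are not the `landlock` crate's implementation.

```rust
/// Filesystem access-right bits as defined in <linux/landlock.h>.
const ACCESS_FS_REFER: u64 = 1 << 13; // added in ABI V2
const ACCESS_FS_TRUNCATE: u64 = 1 << 14; // added in ABI V3
const ACCESS_FS_IOCTL_DEV: u64 = 1 << 15; // added in ABI V5
/// The 13 rights available since ABI V1 (bits 0 through 12).
const ACCESS_FS_V1: u64 = (1 << 13) - 1;

/// Outcome of the negotiation (names chosen for this sketch).
#[derive(Debug, PartialEq, Eq)]
enum Enforcement {
    FullyEnforced,
    PartiallyEnforced,
    NotEnforced,
}

/// Returns the filesystem rights a kernel with the given ABI understands.
/// ABI 0 stands for "Landlock unavailable".
fn supported_fs_rights(abi: u32) -> u64 {
    let mut rights = 0;
    if abi >= 1 {
        rights |= ACCESS_FS_V1;
    }
    if abi >= 2 {
        rights |= ACCESS_FS_REFER;
    }
    if abi >= 3 {
        rights |= ACCESS_FS_TRUNCATE;
    }
    // ABI V4 added network rules only, no new filesystem rights.
    if abi >= 5 {
        rights |= ACCESS_FS_IOCTL_DEV;
    }
    rights
}

/// Best-effort negotiation: keep the requested rights the kernel supports
/// and report whether anything had to be dropped.
fn negotiate(requested: u64, abi: u32) -> (u64, Enforcement) {
    let granted = requested & supported_fs_rights(abi);
    let status = if granted == requested {
        Enforcement::FullyEnforced
    } else if granted != 0 {
        Enforcement::PartiallyEnforced
    } else {
        Enforcement::NotEnforced
    };
    (granted, status)
}
```

Requesting the full V5 set on an ABI V2 kernel yields `PartiallyEnforced` with `IoctlDev` and `Truncate` stripped, which matches the warning seen on the 6.1 test system.
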
### CLI Flags

```bash
# Disable Landlock entirely
volt-vmm --kernel vmlinux -m 256M --no-landlock

# Add extra paths for hotplug or shared data
volt-vmm --kernel vmlinux -m 256M \
  --landlock-rule /tmp/hotplug:rw \
  --landlock-rule /data/shared:ro
```

Rule format: `path:access` where access is:

- `ro`, `r`, `read` — read-only
- `rw`, `w`, `write`, `readwrite` — full access

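
A parser for this rule format might look like the following sketch. The `LandlockRule` and `Access` types and the `parse_rule` function are assumptions made for illustration, and splitting on the last `:` (so that paths containing a colon still parse) is a design choice of this sketch, not confirmed behaviour of the CLI.

```rust
use std::path::PathBuf;

/// Access level named in a `--landlock-rule` argument (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    ReadOnly,
    ReadWrite,
}

/// A parsed `path:access` rule (hypothetical struct).
#[derive(Debug, PartialEq, Eq)]
struct LandlockRule {
    path: PathBuf,
    access: Access,
}

/// Parses one `path:access` specification.
fn parse_rule(spec: &str) -> Result<LandlockRule, String> {
    // Split on the last ':' so that paths containing ':' still parse.
    let (path, access) = spec
        .rsplit_once(':')
        .ok_or_else(|| format!("expected `path:access`, got `{spec}`"))?;
    if path.is_empty() {
        return Err(format!("empty path in rule `{spec}`"));
    }
    let access = match access {
        "ro" | "r" | "read" => Access::ReadOnly,
        "rw" | "w" | "write" | "readwrite" => Access::ReadWrite,
        other => return Err(format!("unknown access `{other}` in rule `{spec}`")),
    };
    Ok(LandlockRule {
        path: PathBuf::from(path),
        access,
    })
}
```

With this sketch, `/tmp/hotplug:rw` parses to a read-write rule for `/tmp/hotplug`, while a bare `/tmp` or an unknown access such as `/tmp:exec` is rejected with an error message.
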
### Application Order

The security layers are applied in this order in `main.rs`:

```text
1. All initialization complete (KVM, memory, kernel, devices, API socket)
2. Landlock applied (needs landlock syscalls, sets PR_SET_NO_NEW_PRIVS)
3. Capabilities dropped (needs prctl, capset)
4. Seccomp applied (locks down syscalls, uses TSYNC for all threads)
5. vCPU run loop starts
```

This ordering is critical: Landlock and capability syscalls must be available
before seccomp restricts the syscall set.

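
Because the ordering matters, one option is to encode it in the type system so that a step cannot be skipped or reordered without a compile error. The following typestate sketch only records step names; every type and method in it is hypothetical, and `main.rs` is not known to be structured this way.

```rust
/// Each stage is a distinct type, so the steps can only run in one order.
struct Initialized {
    log: Vec<&'static str>,
}
struct Landlocked {
    log: Vec<&'static str>,
}
struct CapsDropped {
    log: Vec<&'static str>,
}
struct Sealed {
    log: Vec<&'static str>,
}

impl Initialized {
    /// Stands in for "all privileged setup is complete".
    fn new() -> Self {
        Initialized { log: vec!["init"] }
    }

    /// Placeholder for the Landlock step; `enabled` models `--no-landlock`.
    fn apply_landlock(mut self, enabled: bool) -> Landlocked {
        if enabled {
            self.log.push("landlock");
        }
        Landlocked { log: self.log }
    }
}

impl Landlocked {
    /// Placeholder for the prctl/capset step.
    fn drop_capabilities(mut self) -> CapsDropped {
        self.log.push("capabilities");
        CapsDropped { log: self.log }
    }
}

impl CapsDropped {
    /// Placeholder for installing the seccomp filter.
    fn apply_seccomp(mut self) -> Sealed {
        self.log.push("seccomp");
        Sealed { log: self.log }
    }
}

impl Sealed {
    /// Only a fully sealed process may enter the vCPU loop.
    fn run_vcpus(mut self) -> Vec<&'static str> {
        self.log.push("vcpu-loop");
        self.log
    }
}
```

A call chain such as `Initialized::new().apply_landlock(true).drop_capabilities().apply_seccomp().run_vcpus()` compiles, whereas calling `apply_seccomp()` directly on `Initialized` does not, since that method exists only on `CapsDropped`.
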
## Testing

### Test Results (kernel 6.1.0-42-amd64)

```text
# Minimal kernel — boots successfully
$ timeout 10 ./target/release/volt-vmm --kernel comparison/kernels/minimal-hello.elf -m 128M
INFO Applying Landlock filesystem sandbox
WARN Landlock sandbox partially enforced (kernel may not support all features)
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
INFO Applying seccomp-bpf filter (72 syscalls allowed)
INFO Seccomp filter active
Hello from minimal kernel!
OK

# Full Linux kernel — boots successfully
$ timeout 10 ./target/release/volt-vmm --kernel kernels/vmlinux -m 256M
INFO Applying Landlock filesystem sandbox
WARN Landlock sandbox partially enforced
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully
INFO Applying seccomp-bpf filter (72 syscalls allowed)
[kernel boot messages, VFS panic due to no rootfs — expected]

# --no-landlock flag works
$ volt-vmm --kernel ... -m 128M --no-landlock
WARN Landlock disabled via --no-landlock
INFO Dropping Linux capabilities
INFO All capabilities dropped successfully

# --landlock-rule flag works
$ volt-vmm --kernel ... -m 128M --landlock-rule /tmp:rw
DEBUG Landlock: user rule rw access to /tmp
```

## Dependencies Added

```toml
# vmm/Cargo.toml
landlock = "0.4" # Landlock LSM helpers (crates.io, MIT/Apache-2.0)
```

No other new dependencies — `libc` was already present for the prctl/capset calls.

## Future Improvements

1. **Network restrictions** — Landlock ABI V4 (kernel 6.7+) supports restricting
   TCP bind and connect by port. This could limit any TCP endpoints the VMM opens;
   it would not affect the Unix-domain API socket, which is covered by the
   filesystem rules.

2. **IPC scoping** — Landlock ABI V6 (kernel 6.12+) can scope signals and
   abstract Unix sockets.

3. **Root-mode bounding set** — When running as root, the full bounding set
   can be dropped. When running as non-root, the drop is currently skipped
   gracefully on EPERM.

4. **seccomp + Landlock integration test** — Verify that the seccomp allowlist
   still covers every syscall the VMM needs once Landlock is active. This holds
   today because Landlock is applied before seccomp, but a regression test would
   guard against the ordering changing.