KVM-based microVMM for the Volt platform: - Sub-second VM boot times - Minimal memory footprint - Landlock LSM + seccomp security - Virtio device support - Custom kernel management Copyright (c) Armored Gates LLC. All rights reserved. Licensed under AGPSL v5.0
# systemd-networkd Enhanced virtio-net

## Overview

This design enhances Volt's virtio-net implementation by integrating with systemd-networkd for declarative, lifecycle-managed network configuration. Rather than creating and configuring TAP devices itself, Volt writes networkd unit files and lets networkd create the devices and attach them to the bridge.

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                          systemd-networkd                           │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │
│  │ volt-vmm-br0     │  │ vm-{uuid}.netdev │  │ vm-{uuid}.network│   │
│  │ (.netdev bridge) │  │ (TAP definition) │  │ (bridge attach)  │   │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘   │
│           │                     │                     │             │
│           └─────────────────────┼─────────────────────┘             │
│                                 ▼                                   │
│                         ┌───────────────┐                           │
│                         │      br0      │ ◄── Unified bridge        │
│                         │   (bridge)    │     (VMs + Voltainer)     │
│                         └───────┬───────┘                           │
│                                 │                                   │
│               ┌─────────────────┼─────────────────┐                 │
│               ▼                 ▼                 ▼                 │
│          ┌─────────┐       ┌─────────┐       ┌─────────┐            │
│          │  tap0   │       │  veth0  │       │  tap1   │            │
│          │ (VM-1)  │       │ (cont.) │       │ (VM-2)  │            │
│          └────┬────┘       └────┬────┘       └────┬────┘            │
└───────────────┼─────────────────┼─────────────────┼─────────────────┘
                │                 │                 │
                ▼                 ▼                 ▼
           ┌─────────┐       ┌─────────┐       ┌─────────┐
           │  Volt   │       │Voltainer│       │  Volt   │
           │  VM-1   │       │Container│       │  VM-2   │
           └─────────┘       └─────────┘       └─────────┘
```

## Benefits

1. **Declarative Configuration**: Network topology is defined in unit files and can be version-controlled
2. **Automatic Cleanup**: networkd tears down a VM's TAP device once its runtime unit files are removed at VM exit
3. **Lifecycle Integration**: TAP created before VM starts, destroyed after
4. **Unified Networking**: VMs and Voltainer containers share the same bridge infrastructure
5. **vhost-net Acceleration**: Kernel-level packet processing keeps the data path out of VMM userspace
6. **Predictable Naming**: TAP names derived from VM UUID

## Components

### 1. Bridge Infrastructure (One-time Setup)

```ini
# /etc/systemd/network/10-volt-vmm-br0.netdev
[NetDev]
Name=br0
Kind=bridge
MACAddress=52:54:00:00:00:01

[Bridge]
STP=false
ForwardDelaySec=0
```

```ini
# /etc/systemd/network/10-volt-vmm-br0.network
[Match]
Name=br0

[Network]
Address=10.42.0.1/24
IPForward=yes
IPMasquerade=both
ConfigureWithoutCarrier=yes
```

### 2. Per-VM TAP Template

Volt generates these dynamically:

```ini
# /run/systemd/network/50-vm-{uuid}.netdev
[NetDev]
Name=tap-{short_uuid}
Kind=tap
MACAddress=none

[Tap]
User=root
Group=root
VNetHeader=true
MultiQueue=true
PacketInfo=false
```

```ini
# /run/systemd/network/50-vm-{uuid}.network
[Match]
Name=tap-{short_uuid}

[Network]
Bridge=br0
ConfigureWithoutCarrier=yes
```

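
Linux limits interface names to 15 characters, so `tap-{short_uuid}` leaves room for at most 11 characters of the UUID. The following is a minimal sketch of one way to derive such a name. It assumes `short_uuid` means the first 11 hex digits of the VM UUID with dashes dropped; that rule and the helper name `tap_name_from_uuid` are illustrative assumptions, not Volt's actual scheme.

```c
#include <ctype.h>
#include <net/if.h>   /* IF_NAMESIZE (16, including the trailing NUL) */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes "tap-" plus the first 11 hex digits of `uuid`
 * (dashes skipped, lowercased) into `out`.
 * Returns 0 on success, -1 if the buffer is too small, the UUID contains a
 * non-hex character, or it has fewer than 11 hex digits. */
int tap_name_from_uuid(const char *uuid, char *out, size_t out_len)
{
    static const char prefix[] = "tap-";
    const size_t prefix_len = sizeof prefix - 1;
    const size_t max_digits = IF_NAMESIZE - 1 - prefix_len;  /* 15 - 4 = 11 */
    size_t n = 0;

    if (out_len < IF_NAMESIZE)
        return -1;
    memcpy(out, prefix, prefix_len);
    for (const char *p = uuid; *p != '\0' && n < max_digits; p++) {
        if (*p == '-')
            continue;                      /* skip UUID separators */
        if (!isxdigit((unsigned char)*p))
            return -1;
        out[prefix_len + n++] = (char)tolower((unsigned char)*p);
    }
    out[prefix_len + n] = '\0';
    return n == max_digits ? 0 : -1;
}
```

Truncating to 11 digits keeps names stable across restarts of the same VM, at the cost of a small collision risk between UUIDs that share a prefix; a real implementation would need to detect that case.
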
### 3. vhost-net Acceleration

vhost-net offloads packet processing to the kernel:

```
┌─────────────────────────────────────────────────┐
│                    Guest VM                     │
│  ┌─────────────────────────────────────────┐    │
│  │            virtio-net driver            │    │
│  └─────────────────┬───────────────────────┘    │
└───────────────────┬┼────────────────────────────┘
                    ││
         ┌──────────┘│
         │           │ KVM Exit (rare)
         ▼           ▼
┌────────────────────────────────────────────────┐
│               vhost-net (kernel)               │
│                                                │
│  - Processes virtqueue directly in kernel      │
│  - Zero-copy between TAP and guest memory      │
│  - Avoids userspace context switches           │
│  - ~30-50% throughput improvement              │
└────────────────────┬───────────────────────────┘
                     │
                     ▼
              ┌─────────────┐
              │ TAP device  │
              └─────────────┘
```

**Without vhost-net:**

```
Guest → KVM exit → QEMU/Volt userspace → syscall → TAP → kernel → network
```

**With vhost-net:**

```
Guest → vhost-net (kernel) → TAP → network
```

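
Choosing between the two paths can be done with a simple probe at startup. The sketch below assumes the VMM only needs to know whether the vhost-net device node can be opened; the enum and function names (`net_backend`, `select_backend`) are hypothetical and do not come from Volt's code.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical backend selector for illustration. */
enum net_backend { NET_BACKEND_USERSPACE, NET_BACKEND_VHOST };

/* Tries to open the vhost-net device node at `vhost_path`
 * (normally "/dev/vhost-net"). On success stores the fd in *vhost_fd and
 * selects the kernel path; otherwise stores -1 and selects userspace
 * virtio-net. */
enum net_backend select_backend(const char *vhost_path, int *vhost_fd)
{
    *vhost_fd = open(vhost_path, O_RDWR);
    if (*vhost_fd < 0) {
        fprintf(stderr, "vhost-net unavailable (%s): %s, using userspace virtio-net\n",
                vhost_path, strerror(errno));
        return NET_BACKEND_USERSPACE;
    }
    return NET_BACKEND_VHOST;
}
```

Passing the path as a parameter keeps the probe testable on hosts without the device node.
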
## Integration with Voltainer

Both Volt VMs and Voltainer containers connect to the same bridge:

### Voltainer Network Zone

```yaml
# /etc/voltainer/network/zone-default.yaml
kind: NetworkZone
name: default
bridge: br0
subnet: 10.42.0.0/24
gateway: 10.42.0.1
dhcp:
  enabled: true
  range: 10.42.0.100-10.42.0.254
```

### Volt VM Allocation

VMs get static IPs from a reserved range (10.42.0.2-10.42.0.99):

```yaml
network:
  - zone: default
    mac: "52:54:00:ab:cd:ef"
    ipv4: "10.42.0.10/24"
```

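
Because the static VM range and the DHCP range share one subnet, a static address that strays into 10.42.0.100-10.42.0.254 could collide with a container lease. A minimal validation sketch follows; the function name `in_static_vm_range` is hypothetical, and it takes the bare address without the `/24` suffix used in the YAML above.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true only if `ipv4` parses as a dotted-quad address
 * inside the reserved static range 10.42.0.2 - 10.42.0.99. */
bool in_static_vm_range(const char *ipv4)
{
    const uint32_t lo = 0x0A2A0002u;   /* 10.42.0.2  */
    const uint32_t hi = 0x0A2A0063u;   /* 10.42.0.99 */
    struct in_addr addr;

    if (inet_pton(AF_INET, ipv4, &addr) != 1)
        return false;                  /* not a valid IPv4 address */

    uint32_t host = ntohl(addr.s_addr);  /* compare in host byte order */
    return host >= lo && host <= hi;
}
```
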
## File Locations

| File Type | Location | Persistence |
|-----------|----------|-------------|
| Bridge .netdev/.network | `/etc/systemd/network/` | Permanent |
| VM TAP .netdev/.network | `/run/systemd/network/` | Runtime only |
| Voltainer zone config | `/etc/voltainer/network/` | Permanent |
| vhost-net module | Kernel (built in, or loaded as the `vhost_net` module) | N/A |

## Lifecycle

### VM Start

1. Volt generates `.netdev` and `.network` in `/run/systemd/network/`
2. `networkctl reload` triggers networkd to create TAP
3. Wait for TAP interface to appear (`networkctl status tap-XXX`)
4. Open `/dev/net/tun` with `O_RDWR` and attach to the TAP via the `TUNSETIFF` ioctl, using flags that match the `[Tap]` section
5. Open `/dev/vhost-net` and configure vhost-net through its ioctls
6. Boot VM with virtio-net using the TAP fd

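
Step 4 can be sketched as follows. A TAP fd is obtained by opening the clone device `/dev/net/tun` and binding it to the existing interface by name. The flags mirror the `[Tap]` settings shown earlier (`PacketInfo=false`, `VNetHeader=true`, `MultiQueue=true`); the helper names `tap_attach_flags` and `tap_open` are hypothetical, and the caller is assumed to have permission to attach (the template sets `User=root`).

```c
#include <sys/socket.h>
#include <linux/if.h>       /* struct ifreq, IFNAMSIZ */
#include <linux/if_tun.h>   /* TUNSETIFF, IFF_TAP, IFF_NO_PI, ... */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Flags matching the per-VM .netdev template:
 * Kind=tap -> IFF_TAP, PacketInfo=false -> IFF_NO_PI,
 * VNetHeader=true -> IFF_VNET_HDR, MultiQueue=true -> IFF_MULTI_QUEUE. */
short tap_attach_flags(void)
{
    return IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_MULTI_QUEUE;
}

/* Attaches to the TAP interface `name` that networkd already created.
 * Returns an open fd, or -1 on any failure. */
int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd;

    if (strlen(name) >= IFNAMSIZ)
        return -1;                          /* name would not fit */

    fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    ifr.ifr_flags = tap_attach_flags();

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {   /* bind this fd to the interface */
        close(fd);
        return -1;
    }
    return fd;
}
```

With `MultiQueue=true`, each additional queue is another `open()` plus `TUNSETIFF` on the same interface name.
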
### VM Stop

1. Close vhost-net and TAP file descriptors
2. Delete `.netdev` and `.network` from `/run/systemd/network/`
3. `networkctl reload` triggers cleanup
4. networkd removes the TAP interface; if the interface lingers, `networkctl delete tap-XXX` removes it explicitly

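
Step 2 amounts to unlinking the two runtime unit files. A minimal sketch, assuming the `50-vm-{uuid}` naming from the per-VM template; the helper names `vm_unit_path` and `vm_units_remove` are hypothetical, and the caller is expected to run `networkctl reload` afterwards.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Builds "<dir>/50-vm-<uuid>.<ext>" into `out`.
 * Returns 0 on success, -1 if the result does not fit. */
int vm_unit_path(const char *dir, const char *uuid, const char *ext,
                 char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "%s/50-vm-%s.%s", dir, uuid, ext);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Removes both unit files for one VM. Files that are already gone are not
 * treated as errors, so the function is safe to call twice.
 * Returns 0 on success, -1 if any removal failed for another reason. */
int vm_units_remove(const char *dir, const char *uuid)
{
    static const char *const exts[] = { "netdev", "network" };
    char path[4096];
    int rc = 0;

    for (size_t i = 0; i < sizeof exts / sizeof exts[0]; i++) {
        if (vm_unit_path(dir, uuid, exts[i], path, sizeof path) != 0) {
            rc = -1;
            continue;
        }
        if (unlink(path) != 0 && errno != ENOENT)
            rc = -1;
    }
    return rc;
}
```
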
## vhost-net Setup Sequence

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

// Error handling is omitted: every open()/ioctl() below can fail and must be checked.

// 1. Open vhost-net device
int vhost_fd = open("/dev/vhost-net", O_RDWR);

// 2. Set owner (binds this vhost instance to the calling process)
ioctl(vhost_fd, VHOST_SET_OWNER);
// Feature negotiation (VHOST_GET_FEATURES / VHOST_SET_FEATURES) normally happens here.

// 3. Set memory region table
struct vhost_memory *mem = ...;  // Guest memory regions
ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);

// 4. Set vring info for each queue (index 0 = RX, index 1 = TX; index 0 shown)
struct vhost_vring_state state = { .index = 0, .num = queue_size };
ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);

struct vhost_vring_state base = { .index = 0, .num = 0 };  // Start at ring index 0
ioctl(vhost_fd, VHOST_SET_VRING_BASE, &base);

struct vhost_vring_addr addr = {
    .index = 0,
    .desc_user_addr = desc_addr,
    .used_user_addr = used_addr,
    .avail_user_addr = avail_addr,
};
ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);

// 5. Set kick/call eventfds
struct vhost_vring_file kick = { .index = 0, .fd = kick_eventfd };
ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);

struct vhost_vring_file call = { .index = 0, .fd = call_eventfd };
ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);

// 6. Associate the queue with the TAP backend (repeat steps 4-6 for index 1)
struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
```

## Performance Comparison

| Metric | userspace virtio-net | vhost-net |
|--------|---------------------|-----------|
| Throughput (1500 MTU) | ~5 Gbps | ~8 Gbps |
| Throughput (Jumbo 9000) | ~8 Gbps | ~15 Gbps |
| Latency (ping) | ~200 µs | ~80 µs |
| CPU usage | Higher | 30-50% lower |
| Context switches | Many | Minimal |

## Configuration Examples

### Minimal VM with Networking

```json
{
  "vcpus": 2,
  "memory_mib": 512,
  "kernel": "vmlinux",
  "network": [{
    "id": "eth0",
    "mode": "networkd",
    "bridge": "br0",
    "mac": "52:54:00:12:34:56",
    "vhost": true
  }]
}
```

### Multi-NIC VM

```json
{
  "network": [
    {
      "id": "mgmt",
      "bridge": "br-mgmt",
      "vhost": true
    },
    {
      "id": "data",
      "bridge": "br-data",
      "mtu": 9000,
      "vhost": true,
      "multiqueue": 4
    }
  ]
}
```

## Error Handling

| Error | Cause | Recovery |
|-------|-------|----------|
| TAP creation timeout | networkd slow/unresponsive | Retry with backoff, fall back to direct creation |
| vhost-net open fails | Module not loaded | Fall back to userspace virtio-net |
| Bridge not found | Infrastructure not set up | Create bridge or fail with clear error |
| MAC conflict | Duplicate MAC on bridge | Auto-regenerate MAC |

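
The "retry with backoff" recovery for TAP creation could look like the sketch below. It polls with `if_nametoindex()` rather than shelling out to `networkctl`, which is an assumption about the implementation; the function names and the doubling-with-cap policy are likewise illustrative.

```c
#include <net/if.h>   /* if_nametoindex() */
#include <time.h>     /* nanosleep() */

/* Delay before retry number `attempt` (0-based): `base_ms` doubled once per
 * earlier attempt, capped at `cap_ms`. */
unsigned backoff_delay_ms(unsigned attempt, unsigned base_ms, unsigned cap_ms)
{
    unsigned delay = base_ms;
    while (attempt-- > 0 && delay < cap_ms)
        delay *= 2;
    return delay < cap_ms ? delay : cap_ms;
}

/* Polls until interface `name` exists. Returns its index, or 0 if it has
 * not appeared after `max_attempts` polls (the caller then falls back to
 * direct creation or reports the timeout). */
unsigned wait_for_iface(const char *name, unsigned max_attempts,
                        unsigned base_ms, unsigned cap_ms)
{
    for (unsigned i = 0; i < max_attempts; i++) {
        unsigned idx = if_nametoindex(name);
        if (idx != 0)
            return idx;
        if (i + 1 < max_attempts) {          /* no sleep after the last poll */
            unsigned ms = backoff_delay_ms(i, base_ms, cap_ms);
            struct timespec ts = {
                .tv_sec = ms / 1000,
                .tv_nsec = (long)(ms % 1000) * 1000000L,
            };
            nanosleep(&ts, NULL);
        }
    }
    return 0;
}
```

For example, `wait_for_iface("tap-123e4567e89", 8, 10, 500)` would poll up to eight times, sleeping 10, 20, 40, 80, 160, 320, and 500 ms between polls.
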
## Future Enhancements

1. **SR-IOV Passthrough**: Direct VF assignment for bare-metal performance
2. **DPDK Backend**: Alternative to TAP for ultra-low-latency
3. **virtio-vhost-user**: Offload to separate process for isolation
4. **Network Namespace Integration**: Per-VM network namespaces for isolation