
Networking: bridges, TAPs, and the ARP trick

Every VM gets its own network interface, its own IP, and internet access. Users are isolated at L2: VMs belonging to different users live on different bridges and can’t see each other’s traffic. The host itself is unreachable from inside any VM unless the host opened the connection first. None of this requires per-VM iptables management; it’s six rules total, regardless of how many users or VMs you have.

This page is about the host side of networking. The guest side (eth0, DNS, kernel ip=) is short and lives at the end.

  1. Per-user bridges, not a single shared one. Earlier versions of bhatti used a single brbhatti0 and all VMs lived on the same L2 segment. Tenants could ARP-scan and reach each other. The current design gives each user their own brbhatti-N and /24. See Per-user bridges.
  2. Six global iptables rules cover every VM. Cross-bridge isolation, NAT, return traffic, and a DROP rule that prevents guest-initiated connections to the host on INPUT. No per-VM rules, regardless of how many VMs exist. See The six iptables rules.
  3. Kernel ip= configures eth0 before init runs. Solves the chicken-and-egg of “the agent needs the network before the host can reach the agent.” See Kernel ip=.
  4. Pre-populated permanent ARP entry on create. Skips the first ARP retransmit (Linux default retrans_time_ms = 1000), saving a full second of boot time. See The ARP trick.
  5. TAP devices survive Stop(), are destroyed only on Destroy(). The Firecracker snapshot contains virtio-net state that references the TAP device. Destroying it would break resume. See TAP lifecycle.

Per-user bridges

When a user creates their first sandbox, bhatti creates a fresh Linux bridge and a /24 subnet for them. The second user gets another, the third another. Each user is alone on their own L2 segment.

The mapping (pkg/engine/firecracker/network.go:23-52):

User index   Bridge           Subnet             Gateway
1            brbhatti-1       10.0.1.0/24        10.0.1.1
2            brbhatti-2       10.0.2.0/24        10.0.2.1
254          brbhatti-254     10.0.254.0/24      10.0.254.1
255          brbhatti-255     10.1.0.0/24        10.1.0.1
65,024       brbhatti-65024   10.255.254.0/24    10.255.254.1

The subnetFromIndex function in network.go does the math: a 1-based index maps to 10.<hi>.<lo>.0/24, skipping 10.0.0.0/24 to avoid the hi=0, lo=0 ambiguity. Each user can have up to 253 concurrent VMs (.2 through .254; .0 is the network address, .1 the gateway, .255 broadcast). The total number of addressable users on a single host is 65,024. If you ever need more than that on one box, you’ve outgrown bhatti and probably need a real cluster.

A bridge is created on demand when the first sandbox for a user is created (ensureUserBridge). When a user destroys their last sandbox, the bridge is removed (removeUserNetworkIfEmpty in lifecycle.go).
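A minimal sketch of what ensureUserBridge amounts to on the host, expressed as the equivalent ip commands run from Go. The function shape and error handling here are assumptions for illustration, not the real implementation (which also has to be idempotent across daemon restarts):

package main

import (
	"fmt"
	"os/exec"
)

// ensureUserBridge (hypothetical sketch): create the bridge if it's missing,
// assign the gateway address, and bring it up. Needs root to actually run.
func ensureUserBridge(bridge, gatewayCIDR string) error {
	// Idempotent check: does the bridge already exist?
	if exec.Command("ip", "link", "show", bridge).Run() == nil {
		return nil
	}
	for _, args := range [][]string{
		{"link", "add", bridge, "type", "bridge"},
		{"addr", "add", gatewayCIDR, "dev", bridge}, // e.g. 10.0.1.1/24
		{"link", "set", bridge, "up"},
	} {
		if out, err := exec.Command("ip", args...).CombinedOutput(); err != nil {
			return fmt.Errorf("ip %v: %v: %s", args, err, out)
		}
	}
	return nil
}

func main() {
	if err := ensureUserBridge("brbhatti-1", "10.0.1.1/24"); err != nil {
		fmt.Println(err)
	}
}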

Earlier versions of bhatti used a single bridge brbhatti0 on 192.168.137.0/24. All VMs across all users shared it. This was simpler to operate but broke the multi-tenancy model — at L2, alice’s VMs could ARP-scan the bridge and reach bob’s VMs directly. Application-level isolation only.

Per-user bridges solve this at the network layer: alice’s VMs are on brbhatti-1, bob’s on brbhatti-2, and the kernel’s bridge forwarding doesn’t relay frames between them. Even if alice’s VM is compromised at the kernel level, it can only see alice’s other VMs.

If you’re upgrading from an old install, the engine startup runs cleanupOldBridge() once. It removes brbhatti0, the legacy 192.168.137 NAT rule, and the per-bridge FORWARD rules from the old design. Existing sandboxes get reattached to per-user bridges on their next start.

TAP lifecycle

Each VM gets its own TAP device on its user’s bridge. The naming pattern is tap plus the first 8 characters of the sandbox ID (pkg/engine/firecracker/network.go:240-253):

tapName = "tap" + sandboxID[:8]
ip tuntap add <tap> mode tap
ip link set <tap> master <bridge>
ip link set <tap> up

Firecracker binds the VM’s virtio-net device to this TAP. From the guest’s perspective, it’s a regular ethernet interface (eth0); from the host’s perspective, it’s a TAP attached to a bridge.

TAPs are created in Create() and destroyed in Destroy(). They are not destroyed on Stop() (snapshot to disk). The Firecracker snapshot contains virtio-net state that references the TAP by name. If we delete and recreate it, the resumed guest’s network stack ends up referencing a TAP that doesn’t exist anymore, breaking outbound traffic.

So TAPs survive across stop/start cycles. Across a daemon restart, orphan TAPs (created by a previous run that crashed before tearing them down) are cleaned up at engine startup (network.go:286-310) — list all tun-type devices, delete any tap* not in the known set.

Each VM gets a random locally-administered unicast MAC (helpers.go:268-274):

b := make([]byte, 6)
rand.Read(b)
b[0] = (b[0] & 0xfe) | 0x02 // locally administered, unicast

Two bit ops on the first byte:

  • & 0xfe — clears the lowest bit, which would mean “multicast”. MACs starting with an odd byte are multicast addresses; we don’t want that.
  • | 0x02 — sets the second bit, which marks the address as locally administered. The IEEE guarantees that no real hardware vendor will assign a MAC with this bit set, so we can’t collide with real hardware on the same LAN.

The remaining 46 random bits make collisions astronomically unlikely even within bhatti. Not registered with anyone, but it doesn’t need to be — these MACs only need to be unique within the bridge.
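Put together, a self-contained version of the generator might look like this. It assumes crypto/rand; the function name follows generateMAC from the text, and the main function is illustration only:

package main

import (
	"crypto/rand"
	"fmt"
	"net"
)

// generateMAC returns a random locally-administered, unicast MAC,
// built from the two bit operations described above.
func generateMAC() (net.HardwareAddr, error) {
	b := make([]byte, 6)
	if _, err := rand.Read(b); err != nil {
		return nil, err
	}
	b[0] = (b[0] & 0xfe) | 0x02 // clear the multicast bit, set the local bit
	return net.HardwareAddr(b), nil
}

func main() {
	mac, err := generateMAC()
	if err != nil {
		panic(err)
	}
	fmt.Println(mac) // first byte is always even with the 0x02 bit set
}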

The six iptables rules

These six rules are the entire isolation model on the host side. They’re inserted at the top of each chain so they take priority over any UFW or k8s rules (network.go:147-200):

# 1. Cross-bridge isolation: alice's VMs cannot reach bob's VMs.
iptables -I FORWARD 1 -s 10.0.0.0/8 -d 10.0.0.0/8 -j DROP
# 2. VM → internet
iptables -I FORWARD 1 -s 10.0.0.0/8 ! -d 10.0.0.0/8 -j ACCEPT
# 3. Return traffic from internet → VM
iptables -I FORWARD 1 -d 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT
# 4. Allow agent TCP responses to reach the host (host initiated, so
# the SYN-ACK enters INPUT with src 10.0.0.0/8)
iptables -I INPUT 1 -s 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT
# 5. Block VM-initiated connections to the host (API, SSH, everything).
# Only NEW connections are dropped — established ones (from rule 4) survive.
iptables -I INPUT 1 -s 10.0.0.0/8 -m state --state NEW -j DROP
# 6. NAT for outbound
iptables -t nat -I POSTROUTING 1 -s 10.0.0.0/8 -o <default> -j MASQUERADE

Rule 1 enforces tenant isolation. Rule 5 enforces guest-host isolation. The combination: a VM can reach the internet, can reach its own user’s other VMs, and cannot reach anything else, including the host’s bhatti API or SSH.

This is a feature you don’t configure. It’s true the moment the engine starts. The same six rules cover one user, ten users, ten thousand users.

The default outbound interface is auto-detected from ip route show default. If you want to bind to a specific interface on a multi-homed host, set it in your config.

Rule 4 must come before rule 5. Both match INPUT traffic from 10.0.0.0/8; rule 4 matches established connections (the SYN-ACK from a VM responding to a host-initiated SYN), rule 5 matches new connections (a VM trying to reach the host). If they were in the wrong order, rule 5 would kill all agent connections. The code inserts rules in reverse order with -I … 1 so the final chain order is correct.
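A hedged sketch of that insertion pattern, abbreviated to the three FORWARD rules; the loop shape is an assumption, not the code in network.go:

package main

import (
	"log"
	"os/exec"
)

func main() {
	// Inserting each rule at position 1 in reverse order leaves the
	// first rule in the slice on top of the chain when we're done.
	forwardRules := [][]string{
		{"-s", "10.0.0.0/8", "-d", "10.0.0.0/8", "-j", "DROP"},        // rule 1
		{"-s", "10.0.0.0/8", "!", "-d", "10.0.0.0/8", "-j", "ACCEPT"}, // rule 2
		{"-d", "10.0.0.0/8", "-m", "state", "--state", "RELATED,ESTABLISHED", "-j", "ACCEPT"}, // rule 3
	}
	for i := len(forwardRules) - 1; i >= 0; i-- {
		args := append([]string{"-I", "FORWARD", "1"}, forwardRules[i]...)
		if err := exec.Command("iptables", args...).Run(); err != nil {
			log.Fatalf("iptables %v: %v", args, err)
		}
	}
}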

The ARP trick

If you read the create code carefully you’ll see two ip neigh commands you might not expect. Both save boot time; together they explain why bhatti’s first SYN to the agent doesn’t sit in retransmit purgatory.

pkg/engine/firecracker/create.go:386-395:

// Pre-populate ARP with the guest's MAC so the first TCP SYN is sent
// immediately without waiting for ARP resolution.
exec.Command("ip", "neigh", "replace", guestIP, "lladdr", mac,
"dev", userNet.BridgeName, "nud", "permanent").Run()

When the host first tries to dial the agent at 10.0.1.2:1024, the host kernel doesn’t know the MAC address that goes with that IP. So it sends an ARP request: “who has 10.0.1.2?” The guest can’t answer yet — the kernel’s still booting, the agent hasn’t started, eth0 is up but no one is listening. Linux’s ARP retransmit timer (retrans_time_ms = 1000) adds a full second before the next probe, so WaitReady polling spends a second waiting on ARP alone.

We already know the MAC, because we generated it in generateMAC(). So we pre-populate the host’s ARP cache with nud permanent — a permanent neighbor entry that won’t be flushed by the kernel’s normal aging. The first SYN goes out the moment the agent is listening; no ARP round-trip needed.

On Destroy() we delete the entry (lifecycle.go:455-460). Without this cleanup, entries accumulate on the bridge until the bridge itself is destroyed.

pkg/engine/firecracker/create.go:130-135:

// Flush stale ARP entry for this IP. When a sandbox is destroyed and
// a new one reuses the same IP, the host ARP cache still maps the IP
// to the old sandbox's MAC (STALE, gc_stale_time=60s). The new VM
// gets a fresh MAC, so the host sends TCP SYNs to the old MAC —
// which no longer exists on any TAP. WaitReady times out at 30s.
exec.Command("ip", "neigh", "del", guestIP, "dev", userNet.BridgeName).Run()

When a sandbox is destroyed and a new one reuses the same IP within 60 seconds (gc_stale_time default), the host’s ARP cache may still map that IP to the old sandbox’s MAC. The new VM gets a fresh MAC, so the host sends TCP SYNs to the old MAC — which doesn’t exist on any TAP anymore. WaitReady times out at 30 s and the create looks broken.

The flush is one line and it makes the failure go away. This is the kind of bug that takes a day to find and a one-line fix to solve. There are several of these in bhatti. They’re not documented anywhere because they’re invisible when they’re working.

Kernel ip=

The guest’s eth0 is configured by the kernel command line, before init runs. The boot args bhatti emits look like this:

... ip=10.0.1.2::10.0.1.1:255.255.255.0::eth0:off:1.1.1.1:8.8.8.8: init=/usr/local/bin/lohar ...

The ip= parameter is documented in Linux kernel parameters under ip=. Format:

ip=<client>:<server>:<gateway>:<netmask>:<hostname>:<device>:<autoconf>:<dns0>:<dns1>:<ntp>

Most fields are left empty (the <server>, <hostname>, and trailing <ntp> slots). What’s set:

  • <client>: 10.0.1.2, the guest’s IP.
  • <gateway>: 10.0.1.1, the bridge IP on the host.
  • <netmask>: 255.255.255.0 for a /24.
  • <device>: eth0, the virtio-net interface name.
  • <autoconf>: off, no DHCP.
  • <dns0>, <dns1>: DNS servers, defaulted to Cloudflare and Google.
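Assembling that string is one format call. A hypothetical sketch, with variable names and values that are illustrative (they match the example above, not actual code in bhatti):

package main

import "fmt"

func main() {
	guestIP, gatewayIP := "10.0.1.2", "10.0.1.1" // assumed values
	dns0, dns1 := "1.1.1.1", "8.8.8.8"
	// Field order: client:server:gateway:netmask:hostname:device:autoconf:dns0:dns1:ntp
	// Empty slots (server, hostname, ntp) stay as bare colons.
	ipParam := fmt.Sprintf("ip=%s::%s:255.255.255.0::eth0:off:%s:%s:",
		guestIP, gatewayIP, dns0, dns1)
	fmt.Println(ipParam) // ip=10.0.1.2::10.0.1.1:255.255.255.0::eth0:off:1.1.1.1:8.8.8.8:
}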

By the time lohar starts as PID 1, eth0 is up, has its IP, knows its gateway, has a default route. The host can dial 10.0.1.2:1024 immediately — no DHCP server on the bridge, no agent-side configuration step.

This solves a chicken-and-egg problem: if we configured the network from inside the agent, the host couldn’t reach the agent to tell it what IP to use until the agent had configured the network. Kernel ip= is a stock Linux feature that side-steps it entirely.

DNS gets a second pass: the static values from the ip= boot args seed the guest’s resolver configuration, but lohar may overwrite /etc/resolv.conf from the per-sandbox config drive (cmd/lohar/main.go:107-110) if the user passed custom DNS at create time.

Two paths in

Both paths go through the daemon — there’s no direct path from a remote client to a sandbox’s IP, because that IP is on an internal bridge.

GET /sandboxes/:id/proxy/:port/<path>
Authorization: Bearer <token>

The daemon authenticates, looks up the sandbox, calls engine.Tunnel(id, port) to open a TCP tunnel to localhost:<port> inside the VM, and proxies HTTP/WebSocket traffic. Cold sandboxes wake on the first request. Useful for development — you don’t need to publish a port to use one.

ANY https://<alias>.bhatti.sh

When you bhatti publish dev -p 3000 -a my-app, the daemon creates a publish rule mapping my-app → (sandbox=dev, port=3000). Public requests hit :443, get routed by Host header, looked up against the rule cache (pkg/server/public_proxy.go), and tunneled into the VM.

The public proxy has a few extra moves over the authenticated one:

  • In-memory route cache (LRU, 10 K entries) so a hot URL doesn’t query SQLite per request.
  • singleflight.Group for resume coalescing — if 50 concurrent requests hit a cold sandbox, only one wake actually happens; the others wait on it. Without this, you’d get 50 simultaneous snapshot loads. (See the sketch after this list.)
  • Per-alias and global rate limiting — protects against abuse.
  • 5-minute per-request deadline, 50 MB body limit.
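The coalescing pattern is worth seeing concretely. Here is a minimal sketch with golang.org/x/sync/singleflight; the resumeSandbox helper and its signature are assumptions for illustration, not bhatti’s real wake path:

package main

import (
	"fmt"

	"golang.org/x/sync/singleflight"
)

var wakeGroup singleflight.Group

// resumeSandbox stands in for the real snapshot-load path (hypothetical).
func resumeSandbox(id string) error {
	fmt.Println("loading snapshot for", id) // runs once per burst of requests
	return nil
}

// ensureAwake coalesces concurrent wakes: if 50 requests hit the same cold
// sandbox, only one resumeSandbox call runs; the rest block on its result.
func ensureAwake(id string) error {
	_, err, _ := wakeGroup.Do(id, func() (interface{}, error) {
		return nil, resumeSandbox(id)
	})
	return err
}

func main() {
	ensureAwake("dev") // concurrent callers with the same key share this call
}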

Both proxies use httputil.ReverseProxy with a custom transport that treats the engine’s tunnel as an http.RoundTripper. WebSocket connections are hijacked and relayed bidirectionally with a 10-minute idle timeout.
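The transport side has roughly the following shape. tunnelDial is a hypothetical stand-in for however engine.Tunnel hands back a connection, not the actual interface:

package proxy

import (
	"context"
	"net"
	"net/http"
	"net/http/httputil"
)

// newSandboxProxy builds a reverse proxy whose transport dials through a
// tunnel instead of the real network. tunnelDial is hypothetical.
func newSandboxProxy(tunnelDial func(ctx context.Context) (net.Conn, error)) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			r.URL.Scheme = "http"
			r.URL.Host = "sandbox" // never resolved; our dialer ignores it
		},
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return tunnelDial(ctx) // every request rides the engine tunnel
			},
		},
	}
}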

After a snapshot/restore: the conntrack pause


When a VM resumes from cold, the host’s iptables conntrack table still has stale entries for the guest’s IP. Host-initiated connections work fine (rule 4 + new conntrack entries replace the stale ones). But guest-initiated outbound connections — say, the guest does ping 8.8.8.8 — can hit the stale conntrack and the SYN gets stuck in retransmit backoff for tens of seconds.

In normal operation this doesn’t matter, because the host always initiates connections to the agent. But it is a thing you’ll notice if you’re benchmarking outbound throughput right after a resume. An ARP flush helps:

ip neigh flush dev brbhatti-1

We don’t run this automatically because it’s only relevant for a brief window post-resume and most workloads don’t hit it.

Orphan cleanup at startup

If the daemon crashes mid-create or mid-destroy, you can be left with orphan resources: a TAP that’s not attached to any VM, an ARP entry for an IP that’s been freed, a bridge that should have been removed when its last user’s last sandbox was destroyed.

Recovery happens at engine startup (network.go:286-345):

  • cleanupOrphanedTapDevices(knownTaps) — list all tap devices, delete any not in the known-VMs set (sketched after this list). The known-VMs set is empty before sandbox recovery loads, so on a fresh start every tap looks orphaned and gets cleaned. This is intentional: better to start fresh than to leak.
  • cleanupAllUserBridges() — same logic for bridges. Recovery recreates them as it loads sandboxes.
  • cleanupOldBridge() — runs once per startup as a no-op if the legacy brbhatti0 is already gone.
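A minimal sketch of the TAP sweep, assuming ip tuntap list prints lines of the form tapXXXXXXXX: tap ...; the parsing and the best-effort error handling are assumptions:

package main

import (
	"os/exec"
	"strings"
)

// cleanupOrphanedTapDevices deletes every tap* device not present in known.
// Sketch only: the real code also filters on device type and logs failures.
func cleanupOrphanedTapDevices(known map[string]bool) error {
	out, err := exec.Command("ip", "tuntap", "list").Output()
	if err != nil {
		return err
	}
	for _, line := range strings.Split(string(out), "\n") {
		name := strings.TrimSpace(strings.SplitN(line, ":", 2)[0])
		if strings.HasPrefix(name, "tap") && !known[name] {
			exec.Command("ip", "link", "del", name).Run() // best-effort
		}
	}
	return nil
}

func main() {
	// On a fresh start the known set is empty, so every tap gets swept.
	cleanupOrphanedTapDevices(map[string]bool{})
}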
See also

  • Thermal states — how the network stays consistent across snapshot/resume
  • Lohar — how the agent inside the VM uses TCP to listen for host commands
  • Custom domain — TLS for the public proxy, wildcard certs, ACME flow
  • Decisions & learnings — the per-user bridge migration, the ARP trick, and other small bugs that took whole afternoons