Networking: bridges, TAPs, and the ARP trick
Every VM gets its own network interface, its own IP, and internet access. Users are isolated at L2: VMs belonging to different users live on different bridges and can’t see each other’s traffic. The host itself is unreachable from inside any VM unless the host opened the connection first. None of this requires per-VM iptables management; it’s six rules total, regardless of how many users or VMs you have.
This page is about the host side of networking. The guest side
(eth0, DNS, kernel ip=) is short and lives at the end.
Design decisions on this page
- Per-user bridges, not a single shared one. Earlier versions of bhatti used a single `brbhatti0` and all VMs lived on the same L2 segment. Tenants could ARP-scan and reach each other. The current design gives each user their own `brbhatti-N` and `/24`. See Per-user bridges.
- Six global iptables rules cover every VM. Cross-bridge isolation, NAT, return traffic, and a `DROP` rule on `INPUT` that prevents guest-initiated connections to the host. No per-VM rules, regardless of how many VMs exist. See The six iptables rules.
- Kernel `ip=` configures `eth0` before init runs. Solves the chicken-and-egg of "the agent needs the network before the host can reach the agent." See Kernel `ip=`.
- Pre-populated permanent ARP entry on create. Skips the first ARP retransmit (Linux default `retrans_time_ms = 1000`), saving a full second of boot time. See The ARP trick.
- TAP devices survive `Stop()` and are destroyed only on `Destroy()`. The Firecracker snapshot contains virtio-net state that references the TAP device. Destroying it would break resume. See TAP lifecycle.
Per-user bridges
When the second user creates their first sandbox, bhatti creates them
a fresh Linux bridge and a /24 subnet. The third user gets another.
Each user is alone on their own L2 segment.
The mapping
(pkg/engine/firecracker/network.go:23-52):
| User index | Bridge | Subnet | Gateway |
|---|---|---|---|
| 1 | brbhatti-1 | 10.0.1.0/24 | 10.0.1.1 |
| 2 | brbhatti-2 | 10.0.2.0/24 | 10.0.2.1 |
| … | … | … | … |
| 254 | brbhatti-254 | 10.0.254.0/24 | 10.0.254.1 |
| 255 | brbhatti-255 | 10.1.0.0/24 | 10.1.0.1 |
| 65024 | brbhatti-65024 | 10.254.254.0/24 | … |
The subnetFromIndex function in network.go does the math: 1-based
index, 10.<hi>.<lo>.0/24, skipping 10.0.0.0/24 to avoid the
hi=0,lo=0 ambiguity. Each user can have up to 253 concurrent VMs
(.2 through .254 — .0 is the network, .1 is the gateway,
.255 is broadcast). The total number of addressable users on a single host is
65,024. If you ever need more than that on one box, you’ve outgrown
bhatti and probably need a real cluster.
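The math above can be sketched in a few lines of Go. This is a reconstruction of the mapping described here, not the actual `subnetFromIndex` in network.go: splitting the 1-based index as `idx/255` and `idx%255` reproduces the table rows, and skips 10.0.0.0/24 automatically because indices start at 1.

```go
package main

import "fmt"

// subnetFromIndex sketches the index-to-subnet mapping described above.
// hi = idx/255, lo = idx%255; idx >= 1 means (hi=0, lo=0) never occurs,
// so 10.0.0.0/24 is skipped without a special case.
func subnetFromIndex(idx int) (subnet, gateway string) {
	hi := idx / 255
	lo := idx % 255
	subnet = fmt.Sprintf("10.%d.%d.0/24", hi, lo)
	gateway = fmt.Sprintf("10.%d.%d.1", hi, lo)
	return
}

func main() {
	for _, idx := range []int{1, 2, 254, 255} {
		s, g := subnetFromIndex(idx)
		fmt.Printf("user %d → %s (gateway %s)\n", idx, s, g)
	}
}
```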
A bridge is created on demand when the first sandbox for a user is
created
(ensureUserBridge).
When a user destroys their last sandbox, the bridge is removed
(removeUserNetworkIfEmpty in lifecycle.go).
Why per-user, not single-shared
Earlier versions of bhatti used a single bridge brbhatti0 on
192.168.137.0/24. All VMs across all users shared it. This was
simpler to operate but broke the multi-tenancy model — at L2, alice’s
VMs could ARP-scan the bridge and reach bob’s VMs directly.
Application-level isolation only.
Per-user bridges solve this at the network layer: alice’s VMs are on
brbhatti-1, bob’s on brbhatti-2, and the kernel’s bridge
forwarding doesn’t relay frames between them. Even if alice’s VM is
compromised at the kernel level, it can only see alice’s other VMs.
If you’re upgrading from an old install, the engine startup runs
cleanupOldBridge()
once. It removes brbhatti0, the legacy 192.168.137 NAT rule, and
the per-bridge FORWARD rules from the old design. Existing sandboxes
get reattached to per-user bridges on their next start.
TAP devices
Each VM gets its own TAP device on its user's bridge. The naming
pattern is tap plus the first 8 characters of the sandbox ID
(pkg/engine/firecracker/network.go:240-253):
```go
tapName = "tap" + sandboxID[:8]
```

```sh
ip tuntap add <tap> mode tap
ip link set <tap> master <bridge>
ip link set <tap> up
```

Firecracker binds the VM's virtio-net device to this TAP. From the guest's perspective, it's a regular ethernet interface (eth0); from the host's perspective, it's a TAP attached to a bridge.
TAP lifecycle
TAPs are created in Create() and destroyed in Destroy(). They are
not destroyed on Stop() (snapshot to disk). The Firecracker
snapshot contains virtio-net state that references the TAP by name.
If we delete and recreate it, the resumed guest’s network stack ends
up referencing a TAP that doesn’t exist anymore, breaking outbound
traffic.
So TAPs survive across stop/start cycles. Across a daemon restart,
orphan TAPs (created by a previous run that crashed before tearing
them down) are cleaned up at engine startup
(network.go:286-310)
— list all tun-type devices, delete any tap* not in the known
set.
MAC addresses
Each VM gets a random locally-administered unicast MAC
(helpers.go:268-274):
```go
b := make([]byte, 6)
rand.Read(b)
b[0] = (b[0] & 0xfe) | 0x02 // locally administered, unicast
```

Two bit ops on the first byte:

- `& 0xfe` — clears the lowest bit, which would mean "multicast". MACs whose first byte is odd are multicast addresses; we don't want that.
- `| 0x02` — sets the second-lowest bit, which marks the address as locally administered. The IEEE guarantees that no hardware vendor will assign a MAC with this bit set, so we can't collide with real hardware on the same LAN.
The remaining 46 random bits make collisions astronomically unlikely even within bhatti. Not registered with anyone, but it doesn’t need to be — these MACs only need to be unique within the bridge.
The six iptables rules
Six rules are the entire isolation model on the host side. They're
inserted at the top of each chain so they take priority over any UFW
or k8s rules (network.go:147-200):
```sh
# 1. Cross-bridge isolation: alice's VMs cannot reach bob's VMs.
iptables -I FORWARD 1 -s 10.0.0.0/8 -d 10.0.0.0/8 -j DROP

# 2. VM → internet
iptables -I FORWARD 1 -s 10.0.0.0/8 ! -d 10.0.0.0/8 -j ACCEPT

# 3. Return traffic from internet → VM
iptables -I FORWARD 1 -d 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT

# 4. Allow agent TCP responses to reach the host (host initiated, so
#    the SYN-ACK enters INPUT with src 10.0.0.0/8)
iptables -I INPUT 1 -s 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT

# 5. Block VM-initiated connections to the host (API, SSH, everything).
#    Only NEW connections are dropped — established ones (from rule 4) survive.
iptables -I INPUT 1 -s 10.0.0.0/8 -m state --state NEW -j DROP

# 6. NAT for outbound
iptables -t nat -I POSTROUTING 1 -s 10.0.0.0/8 -o <default> -j MASQUERADE
```

Rule 1 enforces tenant isolation. Rule 5 enforces guest-host isolation. The combination: a VM can reach the internet, can reach its own user's other VMs, and cannot reach anything else, including the host's bhatti API or SSH.
This is a feature you don’t configure. It’s true the moment the engine starts. The same six rules cover one user, ten users, ten thousand users.
The default outbound interface is auto-detected from
ip route show default. If you want to bind to a specific interface
on a multi-homed host, set it in your config.
Rule order matters
Rule 4 must come before rule 5. Both match INPUT traffic from
10.0.0.0/8; rule 4 matches established connections (the SYN-ACK
from a VM responding to a host-initiated SYN), rule 5 matches new
connections (a VM trying to reach the host). If they were in the
wrong order, rule 5 would kill all agent connections. The code
inserts rules in reverse order with -I … 1 so the final chain
order is correct.
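The reverse-insertion mechanics can be shown with a tiny simulation. This is a hypothetical sketch of the ordering logic, not bhatti's actual code: `insertAtTop` models what `iptables -I <chain> 1` does to a chain.

```go
package main

import "fmt"

// insertAtTop simulates `iptables -I <chain> 1 <rule>`: the new rule is
// prepended, pushing every existing rule one position down.
func insertAtTop(chain []string, rule string) []string {
	return append([]string{rule}, chain...)
}

func main() {
	// Desired final INPUT order: rule 4 (ESTABLISHED ACCEPT) before
	// rule 5 (NEW DROP). So the code inserts them in reverse.
	var input []string
	input = insertAtTop(input, "-s 10.0.0.0/8 -m state --state NEW -j DROP")                   // rule 5 first
	input = insertAtTop(input, "-s 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT") // rule 4 last
	for i, r := range input {
		fmt.Printf("%d: iptables INPUT %s\n", i+1, r)
	}
}
```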
The ARP trick
Section titled “The ARP trick”If you read the create code carefully you’ll see two ip neigh
commands you might not expect. Both save a second of boot time;
together they explain why bhatti’s first SYN to the agent doesn’t
sit in retransmit purgatory.
Pre-populating the cache
Section titled “Pre-populating the cache”pkg/engine/firecracker/create.go:386-395:
```go
// Pre-populate ARP with the guest's MAC so the first TCP SYN is sent
// immediately without waiting for ARP resolution.
exec.Command("ip", "neigh", "replace", guestIP, "lladdr", mac,
	"dev", userNet.BridgeName, "nud", "permanent").Run()
```

When the host first tries to dial the agent at 10.0.1.2:1024, the
host kernel doesn’t know the MAC address that goes with that IP. So
it sends an ARP request: “who has 10.0.1.2?” The guest can’t answer
yet — the kernel’s still booting, the agent hasn’t started, eth0 is
up but no one is listening. Linux's ARP retransmit timer
(retrans_time_ms = 1000) adds a full second before the next
probe, so WaitReady polling spends a second of every attempt
waiting on ARP alone.
We already know the MAC, because we generated it in generateMAC().
So we pre-populate the host’s ARP cache with nud permanent — a
permanent neighbor entry that won’t be flushed by the kernel’s normal
aging. The first SYN goes out the moment the agent is listening; no
ARP round-trip needed.
On Destroy() we delete the entry
(lifecycle.go:455-460).
Without this cleanup, entries accumulate on the bridge until the
bridge itself is destroyed.
Flushing stale entries on create
Section titled “Flushing stale entries on create”pkg/engine/firecracker/create.go:130-135:
```go
// Flush stale ARP entry for this IP. When a sandbox is destroyed and
// a new one reuses the same IP, the host ARP cache still maps the IP
// to the old sandbox's MAC (STALE, gc_stale_time=60s). The new VM
// gets a fresh MAC, so the host sends TCP SYNs to the old MAC —
// which no longer exists on any TAP. WaitReady times out at 30s.
exec.Command("ip", "neigh", "del", guestIP, "dev", userNet.BridgeName).Run()
```

When a sandbox is destroyed and a new one reuses the same IP within
60 seconds (gc_stale_time default), the host’s ARP cache may still
map that IP to the old sandbox’s MAC. The new VM gets a fresh MAC, so
the host sends TCP SYNs to the old MAC — which doesn’t exist on any
TAP anymore. WaitReady times out at 30 s and the create looks
broken.
The flush is one line and it makes the failure go away. This is the kind of bug that takes a day to find and a one-line fix to solve. There are several of these in bhatti. They’re not documented anywhere because they’re invisible when they’re working.
Kernel ip=
The guest's eth0 is configured by the kernel command line, before
init runs. The boot args bhatti emits look like this:
```
... ip=10.0.1.2::10.0.1.1:255.255.255.0::eth0:off:1.1.1.1:8.8.8.8: init=/usr/local/bin/lohar ...
```

The `ip=` parameter is documented in
Linux kernel parameters
under ip=. Format:
```
ip=<client>:<server>:<gateway>:<netmask>:<hostname>:<device>:<autoconf>:<dns0>:<dns1>:<ntp>
```

Most fields we leave empty (the empty `:server:` and the empty
:hostname: slots). What’s there:
- `<client>` — `10.0.1.2`, the guest's IP.
- `<gateway>` — `10.0.1.1`, the bridge IP on the host.
- `<netmask>` — `255.255.255.0` for a /24.
- `<device>` — `eth0`, the virtio-net interface name.
- `<autoconf>` — `off`, no DHCP.
- `<dns0>`, `<dns1>` — DNS servers, defaulted to Cloudflare and Google.
By the time lohar starts as PID 1, eth0 is up, has its IP, knows its
gateway, has a default route. The host can dial 10.0.1.2:1024
immediately — no DHCP server on the bridge, no agent-side
configuration step.
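For illustration, the boot-arg string above can be assembled like this. `kernelIPArg` is a hypothetical helper, not the one bhatti uses; the field order follows the kernel's nfsroot documentation as quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelIPArg builds the ip= boot parameter in the field order
// client:server:gateway:netmask:hostname:device:autoconf:dns0:dns1:ntp,
// leaving server, hostname, and ntp empty as bhatti does.
func kernelIPArg(client, gateway, netmask, device string, dns []string) string {
	fields := []string{client, "", gateway, netmask, "", device, "off"}
	fields = append(fields, dns...)
	fields = append(fields, "") // empty ntp slot, leaves a trailing colon
	return "ip=" + strings.Join(fields, ":")
}

func main() {
	fmt.Println(kernelIPArg("10.0.1.2", "10.0.1.1", "255.255.255.0", "eth0",
		[]string{"1.1.1.1", "8.8.8.8"}))
	// prints: ip=10.0.1.2::10.0.1.1:255.255.255.0::eth0:off:1.1.1.1:8.8.8.8:
}
```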
This solves a chicken-and-egg problem: if we configured the network
from inside the agent, the host couldn’t reach the agent to tell it
what IP to use until the agent had configured the network. Kernel
ip= is a stock Linux feature that side-steps it entirely.
DNS gets a second pass: the kernel ip= parameter sets /etc/resolv.conf
with the static values from boot args, but lohar may overwrite this
from the per-sandbox config drive
(cmd/lohar/main.go:107-110)
if the user passed custom DNS at create time.
Reaching services inside sandboxes
Section titled “Reaching services inside sandboxes”Two paths in. They both go through the daemon — there’s no direct path from a remote client to a sandbox’s IP, because that IP is on an internal bridge.
Authenticated proxy
```
GET /sandboxes/:id/proxy/:port/<path>
Authorization: Bearer <token>
```

The daemon authenticates, looks up the sandbox, calls
engine.Tunnel(id, port) to open a TCP tunnel to localhost:<port>
inside the VM, and proxies HTTP/WebSocket traffic. Cold sandboxes
wake on the first request. Useful for development — you don’t need
to publish a port to use one.
Public proxy (published URLs)
```
ANY https://<alias>.bhatti.sh
```

When you `bhatti publish dev -p 3000 -a my-app`, the daemon creates a
publish rule mapping my-app → (sandbox=dev, port=3000). Public
requests hit :443, get routed by Host header, looked up against the
rule cache
(pkg/server/public_proxy.go),
tunneled into the VM.
The public proxy has a few extra moves over the authenticated one:
- In-memory route cache (LRU, 10K entries) so a hot URL doesn't query SQLite per request.
- `singleflight.Group` for resume coalescing — if 50 concurrent requests hit a cold sandbox, only one wake actually happens; the others wait on it. Without this, you'd get 50 simultaneous snapshot loads.
- Per-alias and global rate limiting — protects against abuse.
- 5-minute per-request deadline, 50 MB body limit.
Both proxies use httputil.ReverseProxy with a custom transport that
treats the engine’s tunnel as an http.RoundTripper. WebSocket
connections are hijacked and relayed bidirectionally with a 10-minute
idle timeout.
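A stripped-down sketch of that wiring, assuming nothing about bhatti's internals: `newTunnelProxy` and `demo` are hypothetical names, and the "tunnel" here is a plain TCP dial standing in for the engine's.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newTunnelProxy builds a ReverseProxy whose transport dials through a
// caller-supplied function instead of the normal TCP stack.
func newTunnelProxy(target *url.URL, dial func(ctx context.Context) (net.Conn, error)) *httputil.ReverseProxy {
	p := httputil.NewSingleHostReverseProxy(target)
	p.Transport = &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			return dial(ctx) // every request goes through the "tunnel"
		},
	}
	return p
}

// demo proxies one request through the tunnel transport and returns the body.
func demo() (string, error) {
	// Stand-in for a service listening inside the VM.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello from sandbox")
	}))
	defer backend.Close()

	target, err := url.Parse(backend.URL)
	if err != nil {
		return "", err
	}
	// A "tunnel" that just dials the backend directly for the demo.
	dial := func(ctx context.Context) (net.Conn, error) {
		var d net.Dialer
		return d.DialContext(ctx, "tcp", target.Host)
	}

	front := httptest.NewServer(newTunnelProxy(target, dial))
	defer front.Close()

	resp, err := http.Get(front.URL + "/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

The point of the custom `DialContext` is that `ReverseProxy` never needs to know the connection came from a VM tunnel rather than a socket.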
After a snapshot/restore: the conntrack pause
Section titled “After a snapshot/restore: the conntrack pause”When a VM resumes from cold, the host’s iptables conntrack table
still has stale entries for the guest’s IP. Host-initiated connections
work fine (rule 4 + new conntrack entries replace the stale ones).
But guest-initiated outbound connections — say, the guest does
ping 8.8.8.8 — can hit the stale conntrack and the SYN gets stuck
in retransmit backoff for tens of seconds.
In normal operation this doesn’t matter, because the host always initiates connections to the agent. But it is a thing you’ll notice if you’re benchmarking outbound throughput right after a resume. An ARP flush helps:
```sh
ip neigh flush dev brbhatti-1
```

We don't run this automatically because it's only relevant for a brief window post-resume and most workloads don't hit it.
Recovery after crashes
Section titled “Recovery after crashes”If the daemon crashes mid-create or mid-destroy, you can be left with orphan resources: a TAP that’s not attached to any VM, an ARP entry for an IP that’s been freed, a bridge that should have been removed when its last user’s last sandbox was destroyed.
Recovery happens at engine startup
(network.go:286-345):
- `cleanupOrphanedTapDevices(knownTaps)` — list all tap devices, delete any not in the known-VMs set. The known-VMs set is empty before sandbox recovery loads, so on a fresh start every tap looks orphaned and gets cleaned. This is intentional: better to start fresh than to leak.
- `cleanupAllUserBridges()` — same logic for bridges. Recovery recreates them as it loads sandboxes.
- `cleanupOldBridge()` — runs once per startup; a no-op if the legacy `brbhatti0` is already gone.
Where to go next
- Thermal states — how the network stays consistent across snapshot/resume
- Lohar — how the agent inside the VM uses TCP to listen for host commands
- Custom domain — TLS for the public proxy, wildcard certs, ACME flow
- Decisions & learnings — the per-user bridge migration, the ARP trick, and other small bugs that took whole afternoons