Architecture overview
This page is the map. It exists so that when you go into Lohar, Thermal states, Networking, or Decisions, you already have a sense of where each fits.
If you’re using bhatti, you don’t need this page. If you’re evaluating it for production or thinking about contributing, this is the right starting point.
The shape
Bhatti is two binaries — bhatti on the host, lohar inside
each microVM — talking over TCP with a small binary protocol.
bhatti is a single Go binary that does many jobs depending on how
you invoke it:
- bhatti serve — the daemon (HTTP API, Firecracker engine, thermal manager, reverse proxy, all in one process)
- bhatti create, bhatti exec, bhatti shell, … — the CLI client, talking to the daemon’s HTTP API
- bhatti user create, bhatti admin status — admin commands that read the daemon’s SQLite database directly (root only)
lohar runs as PID 1 inside every Firecracker microVM. It handles
exec, file operations, PTY sessions, and pretends to be systemctl
and journalctl for in-guest callers. The full story is in
Lohar.
```
┌─ Host ───────────────────────────────────────────────────────────────────┐
│                                                                          │
│  ┌─ bhatti daemon (bhatti serve, single Go process) ──────────────────┐  │
│  │                                                                    │  │
│  │  REST / WS API     Engine               Store (SQLite, WAL mode)   │  │
│  │  :8080             Firecracker          sandboxes, users, secrets, │  │
│  │  reverse proxy     create / destroy     templates, volumes, events,│  │
│  │  thermal manager   stop / start         publish_rules, fc_state    │  │
│  │                    exec, shell, files                              │  │
│  │                    snapshot, restore                               │  │
│  │                                                                    │  │
│  └─────────────────────────────────┬──────────────────────────────────┘  │
│                                    │                                     │
│        HTTP over Unix socket (Firecracker API)                           │
│        TCP over TAP (agent protocol, port 1024)                          │
│                                    │                                     │
│                                    ▼                                     │
│  ┌─ Firecracker microVM × N ──────────────────────────────────────────┐  │
│  │                                                                    │  │
│  │  vmlinux   rootfs.ext4   config.ext4   vol-*.ext4                  │  │
│  │                                                                    │  │
│  │  ┌─ lohar (PID 1) ──────────────────────────────────────────────┐  │  │
│  │  │                                                              │  │  │
│  │  │  TCP :1024 ── control plane (exec, files, sessions)          │  │  │
│  │  │  TCP :1025 ── forward plane (port forwarding)                │  │  │
│  │  │                                                              │  │  │
│  │  │  sessions   scrollback   atomic file ops                     │  │  │
│  │  │  systemctl shim   journalctl shim   syslog receiver          │  │  │
│  │  │                                                              │  │  │
│  │  └──────────────────────────────────────────────────────────────┘  │  │
│  │                                                                    │  │
│  │  tapXXXXXXXX ── brbhatti-N (per-user) ── iptables NAT              │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
```

The daemon’s HTTP API is the only public surface. Everything below
it — the FC API, the agent protocol, the bridge — is internal. A
client outside the host talks only to :8080 (or :443 in
domain mode).
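To make “only public surface” concrete, here is what a call from outside the host looks like. The endpoint path and auth scheme below are placeholders for illustration, not bhatti’s documented API:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical path and header: stand-ins, not bhatti's real API.
	req, err := http.NewRequest(http.MethodGet, "http://bhatti-host:8080/sandboxes", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer YOUR-API-KEY") // placeholder

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```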
Four decisions that shape the rest
These aren’t all the choices in the codebase, but they’re the ones you can’t understand the rest of the docs without.
1. The daemon is a single Go process
REST/WS API, Firecracker engine, thermal manager, reverse proxy, SQLite store, rate limiter — all in one process. No microservices, no message queue, no separate workers. State lives in SQLite (see Where state lives).
This is a deliberately small box. One binary you can scp. One log
to read. One process to restart. Whatever growth bhatti does, it
won’t be by adding more processes — it’ll be by getting better at
the work the single process already does.
The cost is real: one bhatti daemon manages the sandboxes on one machine. There’s no clustering inside bhatti and no plans for it. If you need to scale across hosts, the way I’d do it (and the way I’m planning to, eventually) is to put a thin proxy in front of several bhatti daemons that records which daemon a sandbox was created on and routes subsequent requests for that sandbox — including wake-on-request — to the same daemon. Each bhatti stays single-process; the proxy is the only piece that knows about the fleet. That’s still outside bhatti itself.
2. We talk to Firecracker directly, no SDK
Firecracker exposes an HTTP API over a Unix domain socket — about
eight endpoints we actually use (/machine-config, /drives,
/network-interfaces, /vsock, /balloon, /boot-source,
/actions, /snapshot/{create,load}).
The official
firecracker-go-sdk
wraps those eight endpoints in ~15,000 lines of generated client
code, plus a containerd-derived vsock helper, plus the
go-openapi runtime, plus cni for networking, plus grpc. It’s
designed for the AWS Lambda use case and pulls in roughly that much
machinery.
Our adapter is pkg/engine/firecracker/fc.go — a few HTTP helpers
(fcPut, fcPatch, fcGet) over an http.Transport whose
DialContext returns a Unix socket connection. About 20 lines.
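The shape of that adapter, as a sketch. The fcPut name comes from the description above; everything else here is illustrative, not the real fc.go:

```go
package fcclient

import (
	"bytes"
	"context"
	"encoding/json"
	"net"
	"net/http"
)

// newClient returns an HTTP client that sends every request to the
// given Unix socket; the URL's host part is just a placeholder.
func newClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// fcPut PUTs a JSON body to an FC endpoint such as /machine-config.
func fcPut(ctx context.Context, c *http.Client, path string, body any) (*http.Response, error) {
	b, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://localhost"+path, bytes.NewReader(b))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return c.Do(req)
}
```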
What I lose: type-safe JSON bodies. A typo in "vcpu_count" shows
up at runtime, not compile time.
What I get back, in order of how much it actually matters:
- The hard parts of the FC integration aren’t in the API surface itself. Snapshot/restore sequencing, the vm.snap path problem, TAP lifecycle, thermal state — none of that gets easier with the SDK. It’s all in code we have to write either way.
- When something goes wrong, the failing call is in code I wrote. Not in 200 lines of generated middleware.
- The integration tests catch typos within seconds. They run on real Firecracker against real microVMs, on two Raspberry Pi 5s with NVMe HATs that I keep at home and run every PR through. Fast feedback on a real kernel is worth more than compile-time checks on a JSON shape.
3. Jailer mode is the production path
Firecracker’s
Jailer
runs each FC process in a chroot, drops it to an unprivileged UID,
puts it in its own PID namespace, and applies a seccomp filter. The
chroot lives at /var/lib/bhatti/jails/firecracker/<id>/ and
contains a hard-linked copy of every file FC needs (kernel, rootfs,
config drive, volumes, snapshot artifacts).
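Populating a chroot like that is mostly hard links. A sketch with a hypothetical helper (not bhatti’s actual jailer code):

```go
package jail

import (
	"os"
	"path/filepath"
)

// linkIntoJail hard-links one host file into the jail chroot so the
// jailed Firecracker can open it by a chroot-relative path. Hard links
// cost no extra disk, but source and jail must be on the same
// filesystem. Illustrative helper only.
func linkIntoJail(jailRoot, hostPath string) error {
	return os.Link(hostPath, filepath.Join(jailRoot, filepath.Base(hostPath)))
}
```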
Without the jailer, FC runs as root and sees the whole host filesystem. A theoretical VM escape lands the attacker as root on the host with full visibility. With the jailer, a successful escape lands them as an unprivileged UID inside a near-empty chroot.
For a multi-tenant deployment, the jailer is not optional — it’s
the entire point of running real VMs instead of containers. The
setting is firecracker_jailer in the server config; the install
script enables it by default. Bhatti only runs on Linux with KVM
(cmd/bhatti/engine_other.go
refuses to start anywhere else), so there’s no “non-jailed
macOS” path to worry about — if bhatti serve runs at all, you’re
on Linux and you should turn the jailer on.
Jailer mode and the snapshot path problem
Bhatti hit v1.0.0 specifically because of an incompatibility between bare and jailed snapshots.
The reason: Firecracker’s vm.snap records the paths of every
file the VMM had open at snapshot time. In bare mode that’s an
absolute host path like
/var/lib/bhatti/sandboxes/abc/rootfs.ext4. In jailed mode it’s a
chroot-relative path like /rootfs.ext4. On resume, the VMM
reopens those exact paths.
Switching modes invalidates every existing snapshot:
- Bare → jailed: /var/lib/bhatti/... doesn’t exist inside the new chroot.
- Jailed → bare: /rootfs.ext4 doesn’t exist as an absolute host path.
There’s no fix-up step that rewrites paths in vm.snap — it’s a
binary blob of FC’s internal device-model state, not something
bhatti can edit. So the migration plan was: destroy every snapshot,
upgrade, recreate. That’s a major version bump in semver terms.
We did it once, called it v1.0, and the codebase has been
jailer-by-default since.
Bare mode is still in the code as a fallback (it’s just FC without the jailer wrapper) and is what some integration tests use. For any deployment a real user touches, jailer mode is what you want.
4. Per-VM mutex with capture-and-release
Concurrency is the part that bites every “single Go process” design eventually. Bhatti’s pattern is small but load-bearing.
Each VM struct has its own stateMu sync.Mutex
(pkg/engine/firecracker/engine.go:75-86).
The engine-level sync.RWMutex only protects the map of VMs — it
doesn’t protect any individual VM’s state.
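In sketch form, using the names these docs already use (stateMu, Thermal, Agent) and illustrative ones for the rest:

```go
package engine

import "sync"

// Agent stands in for the TCP client to lohar; illustrative only.
type Agent struct{}

// The two-level locking shape.
type Engine struct {
	mu  sync.RWMutex   // protects the vms map, nothing else
	vms map[string]*VM // sandbox ID -> VM
}

type VM struct {
	stateMu sync.Mutex // protects this VM's fields
	Thermal string     // "hot", "warm", "cold"
	Agent   *Agent     // replaced only inside Start(), under stateMu
}

// lookup holds the engine lock only long enough to find the VM.
func (e *Engine) lookup(id string) (*VM, bool) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	vm, ok := e.vms[id]
	return vm, ok
}
```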
The pattern every operation follows:
```go
vm.stateMu.Lock()        // 1. take the lock
if vm.Thermal != "hot" { // 2. check invariants
	vm.stateMu.Unlock()
	return errNotHot
}
ag := vm.Agent           // 3. capture references we need
vm.stateMu.Unlock()      // 4. release the lock immediately
return ag.Exec(ctx, cmd) // 5. do the slow work outside the lock
```

The key move is steps 3-4: copy the references, drop the lock,
then make the network call. The lock protects access to vm.Agent
itself, not any operation that uses it.
The reason this matters: a Shell or Tunnel call lasts as long as
the user’s session — minutes, hours. If those calls held
stateMu for their duration, the thermal manager couldn’t pause
any VM (because every cycle iterates the VM map and tries to take
each VM’s lock to read its thermal state). One slow shell would
freeze idle pause / cold snapshot for everything else.
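A sketch of a manager cycle under this rule, built on the struct sketch above (illustrative, not the real thermal manager):

```go
// thermalCycle copies the VM list under the engine lock, then takes
// each VM's stateMu only long enough to read its thermal state. Each
// Lock() here waits at most for the short critical sections in the
// pattern above, never for a network call.
func (e *Engine) thermalCycle() {
	e.mu.RLock()
	vms := make([]*VM, 0, len(e.vms))
	for _, vm := range e.vms {
		vms = append(vms, vm)
	}
	e.mu.RUnlock()

	for _, vm := range vms {
		vm.stateMu.Lock()
		thermal := vm.Thermal
		vm.stateMu.Unlock()

		if thermal == "hot" {
			// decide pause / cold snapshot here, outside the lock
		}
	}
}
```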
The safety argument: vm.Agent is only replaced during Start()
(snapshot restore), and Start() takes the same stateMu to do
it. So if a Shell goroutine grabs vm.Agent, releases the lock,
and then Start() runs to completion and replaces vm.Agent with
a new client — the Shell goroutine still has a valid reference to
the old Agent’s TCP connection. That connection is good until the
Firecracker process is killed, which happens only on Stop() or
Destroy(). By that point, both Stop() and Destroy() will
have closed the connection, and the Shell will see a read error and
clean up.
So: long-running operations don’t block the system, and they don’t explode when state is updated underneath them — they just see the underlying connection close and exit cleanly. The pattern is in every method on the Firecracker engine that touches an agent.
Where state lives
State is in two places: SQLite, on disk; Firecracker memory and disk images, on disk per-sandbox.
SQLite. One database file at /var/lib/bhatti/state.db on a
default server install. Tables:
- sandboxes — id, name, owner, status, IP, MAC, engine ID
- fc_state — per-VM Firecracker config (CPU, memory, drive paths, snapshot paths, agent token), serialized as JSON; this is what recovery reads
- users — name, hashed API key (SHA-256), per-user limits, subnet index
- secrets — name, encrypted value (age), scoped per user
- templates, volumes, publish_rules
- events — audit log
- task_progress — async tasks (image pull)
WAL mode is on so background workers (the thermal manager, port scanner, metrics snapshotter) don’t block HTTP request handlers and vice versa.
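Turning WAL on is one pragma at open time. A sketch, assuming the mattn/go-sqlite3 driver (the docs don’t say which driver bhatti actually uses):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumption: any SQLite driver works here
)

func main() {
	db, err := sql.Open("sqlite3", "/var/lib/bhatti/state.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// In WAL mode readers don't block the writer and the writer
	// doesn't block readers: the property the daemon needs.
	if _, err := db.Exec("PRAGMA journal_mode=WAL;"); err != nil {
		log.Fatal(err)
	}
}
```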
Per-sandbox. Under /var/lib/bhatti/sandboxes/<id>/:
- rootfs.ext4 — CoW copy of the base rootfs
- config.ext4 — 1 MB ext4 with hostname, env, files, volume specs
- vol-<name>.ext4 — attached volumes (per-sandbox; shared volumes are elsewhere — see below)
- firecracker.sock — FC API Unix socket (only while the VM is running)
- mem.snap — memory snapshot (only when cold)
- vm.snap — VM state snapshot (only when cold)

The CoW copy uses cp --reflink=auto. On btrfs or XFS this is
instant and near-zero disk; on ext4 it falls back to a full copy.
If you’re benchmarking bhatti create, run on btrfs.
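The copy itself is one command, and cp handles the fallback. A sketch with a hypothetical helper name:

```go
package rootfs

import (
	"fmt"
	"os/exec"
)

// reflinkCopy clones src to dst. With --reflink=auto, cp makes a CoW
// clone where the filesystem supports it (btrfs, XFS) and silently
// falls back to a full byte copy where it doesn't (ext4).
// Illustrative helper, not bhatti's code.
func reflinkCopy(src, dst string) error {
	out, err := exec.Command("cp", "--reflink=auto", src, dst).CombinedOutput()
	if err != nil {
		return fmt.Errorf("cp --reflink=auto: %v: %s", err, out)
	}
	return nil
}
```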
Other disk locations under /var/lib/bhatti/:
- images/ — read-only base rootfs templates (rootfs-minimal-arm64.ext4 etc.) and the kernel
- volumes/<name>.ext4 — standalone volumes (created by bhatti volume create, attachable to any sandbox)
- snapshots/<user>/<name>/ — named snapshots from bhatti snapshot create (memory + disk artifacts, fully self-contained)
- jails/firecracker/<id>/ — jailer chroots (when jailer is on)
- age.key — the X25519 private key used to encrypt user secrets. It’s generated the first time anyone calls bhatti secret set, not on daemon startup (pkg/secrets/age.go:13-37). Back this up. If you lose it, every encrypted secret on the server is unrecoverable.
The 1 MB config drive is sized that way because ext4’s minimum filesystem size is 1 MB — anything smaller fails to format. The contents (~few hundred bytes of JSON describing hostname, env vars, volume mounts) would fit in less, but the filesystem can’t.
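Creating a drive at that floor looks something like this sketch (hypothetical helper, minimal mkfs flags, not bhatti’s actual invocation):

```go
package configdrive

import (
	"os"
	"os/exec"
)

// makeConfigDrive allocates exactly 1 MB and formats it as ext4.
// Anything smaller fails to format, per the sizing note above.
func makeConfigDrive(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if err := f.Truncate(1 << 20); err != nil { // exactly 1 MB
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return exec.Command("mkfs.ext4", "-q", path).Run()
}
```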
Recovery on startup
When the bhatti daemon starts — fresh boot, after a crash, after a
systemctl restart — every Firecracker process from the previous
run is dead. The daemon’s job at startup is to look at every
sandbox row in SQLite and decide, for each one, whether it can
still be brought back.
There are only two things that need to be true to bring a sandbox back:
- The row is still in sandboxes — we know the sandbox exists and we have its config in fc_state.
- The snapshot files exist on disk (mem.snap and vm.snap) — we have somewhere to restore the VM from.
If both are true, the sandbox is marked stopped in memory and is
ready to wake on the next API call. If the row exists but the
snapshot doesn’t, the sandbox is marked unknown — a clear error
on the next operation, and a hint to the operator to dig in.
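In code terms the per-row decision is small. A sketch with invented names (Row, markStopped, markUnknown); the real loop lives in the engine’s startup path:

```go
package recovery

import (
	"os"
	"path/filepath"
)

// Row and the mark* helpers are invented for illustration.
type Row struct{ ID string }

func markStopped(id string) { /* in-memory status: wake on next API call */ }
func markUnknown(id string) { /* in-memory status: clear error on next op */ }

func recoverOne(row Row) {
	dir := filepath.Join("/var/lib/bhatti/sandboxes", row.ID)
	haveSnap := fileExists(filepath.Join(dir, "mem.snap")) &&
		fileExists(filepath.Join(dir, "vm.snap"))
	if haveSnap {
		markStopped(row.ID)
	} else {
		markUnknown(row.ID) // logged as "recovery: snapshot missing"
	}
}

func fileExists(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}
```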
The normal case is the boring one: the daemon was shut down cleanly
(systemctl stop bhatti), SnapshotAll ran, every running sandbox
got a snapshot, and on next startup every sandbox comes back as
stopped. No data loss, no manual recovery.
The abnormal cases are where it gets interesting: a power cut
between snapshot and disk flush, a btrfs subvolume that disappeared,
a half-finished bhatti destroy that left a row but cleaned up the
files. In each case the sandbox lands in unknown, the operator
sees a recovery: snapshot missing warning in the log, and they
can decide whether to recreate it. Volume data is on separate ext4
images and survives the loss of the memory snapshot — so the
recovery path even from unknown is detach volumes, destroy,
recreate, reattach.
After the sandbox loop, the engine cleans up orphans: TAP devices,
user bridges, stale snapshot temp directories, .ext4 files with
no row in the volumes table. The order matters: sandboxes load
first so that cleanup knows which TAPs and bridges should still
exist.
What’s not on this page
This was the map. The depth is in the topic pages:
- Lohar — the agent inside every VM — PID 1, the systemctl shim, sessions, file ops
- Thermal states — hot / warm / cold, the balloon trick, snapshot mechanics
- Networking — per-user bridges, the six iptables rules, the ARP trick, kernel ip=
- Wire protocol — frame format, all frame types, atomic writes
- Firecracker engine internals — the actual API call sequence during create, jailer specifics, rate limiters
- Decisions & learnings — the bugs we paid for, in narrative form