How it works
You've got a SQLite database, and you want a continuous, point-in-time copy of it somewhere else. This page explains how bakelite does that. If you just want it running, head to Install & deploy instead.
Watching the file for changes
bakelite takes advantage of the fact that a SQLite database is just a couple of
files: it watches the -wal file through the OS (inotify on Linux, FSEvents on
macOS, kqueue on the BSDs) and only wakes when something actually changes. On an
idle database the process just blocks on the watcher until the next write arrives.
A missed notification can't strand you, though. There's a cheap fallback poll
(safety_poll, 30s by default — not a busy loop) as a backstop, so even if the OS
drops an event a sync still happens at least that often. See When things go
wrong for the full failure picture.
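The wake-up logic described above can be sketched with nothing but the standard library. This is a minimal illustration, not bakelite's actual code: it assumes the OS watcher forwards each change notification into a channel, and the names `Wake` and `wait_for_wake` are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the replication loop woke up (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Wake {
    /// The OS watcher reported a change to the -wal file.
    Changed,
    /// No event arrived within `safety_poll`; sync anyway as a backstop.
    SafetyPoll,
    /// The watcher side hung up; the caller should stop.
    Closed,
}

/// Block until a change event arrives or `safety_poll` elapses.
/// The thread sleeps inside `recv_timeout`, so an idle database costs no CPU.
fn wait_for_wake(events: &Receiver<()>, safety_poll: Duration) -> Wake {
    match events.recv_timeout(safety_poll) {
        Ok(()) => Wake::Changed,
        Err(RecvTimeoutError::Timeout) => Wake::SafetyPoll,
        Err(RecvTimeoutError::Disconnected) => Wake::Closed,
    }
}
```

Both `Changed` and `SafetyPoll` would lead to the same sync attempt; the distinction is only useful for logging how often the backstop fires.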
It reads SQLite's files directly
bakelite reads WAL frames and database pages straight from the filesystem; it uses the SQL engine only to control the database (set pragmas, hold locks, drive checkpoints).
- WAL frames and database pages are parsed byte-for-byte from the files (the wal and dbfile modules). This includes SQLite's custom rolling WAL checksum, which is not a CRC.
- The SQL library (rusqlite) is reached only through a deliberately tiny control seam, db::ControlDb: the single place the engine is touched, kept thin so the SQLite dependency stays isolated and easy to reason about.
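For reference, the rolling checksum mentioned above is specified in SQLite's file-format documentation: the input is read as pairs of 32-bit words and folded into two running sums, with the word byte order selected by the WAL header's magic number. The sketch below follows that published algorithm; the names `WordOrder` and `wal_checksum` are illustrative and are not taken from bakelite's wal module.

```rust
/// Byte order used to read each 32-bit word (chosen by the WAL magic number).
#[derive(Clone, Copy)]
enum WordOrder {
    Big,
    Little,
}

fn read_word(order: WordOrder, bytes: &[u8]) -> u32 {
    let raw = [bytes[0], bytes[1], bytes[2], bytes[3]];
    match order {
        WordOrder::Big => u32::from_be_bytes(raw),
        WordOrder::Little => u32::from_le_bytes(raw),
    }
}

/// Fold `data` into the running checksum `seed`.
/// Returns `None` if the length is not a multiple of 8 bytes.
fn wal_checksum(order: WordOrder, seed: (u32, u32), data: &[u8]) -> Option<(u32, u32)> {
    if data.len() % 8 != 0 {
        return None;
    }
    let (mut s0, mut s1) = seed;
    for pair in data.chunks_exact(8) {
        // s0 += x[i] + s1; s1 += x[i+1] + s0, all modulo 2^32.
        s0 = s0.wrapping_add(read_word(order, &pair[0..4])).wrapping_add(s1);
        s1 = s1.wrapping_add(read_word(order, &pair[4..8])).wrapping_add(s0);
    }
    Some((s0, s1))
}
```

The "rolling" part comes from the seed: each frame's checksum starts from the previous frame's result, so feeding the data in two halves gives the same answer as feeding it at once.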
Backups and change-sets
A replica is one full backup (a snapshot) plus the page-changes that came after it. Restore replays them in order, latest change wins.
Everything else is detail. The change-sets (segments) are page images, indexed monotonically across the replica's whole lifetime. There is one global index space, and a re-snapshot is just another full backup at the next index. Two properties follow:
- Snapshots are a db-file + WAL overlay (latest committed page wins), which stays consistent even while a checkpoint runs and doesn't depend on a checkpoint succeeding.
- Restore = the latest full backup + the change-sets after it, replayed in index order (latest page image wins). Restore is keyed on segment index, not WAL offset, so it spans WAL resets and incremental-checkpoint boundaries. It never reads the live database's checkpoint state.
The per-database manifest is an advisory, rebuildable cache, not a source of truth — retention and restore reconcile by listing the objects on the backend.
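The restore rule is small enough to show in full. Here is a minimal in-memory sketch, assuming a database can be modeled as a map from page number to page image; `ChangeSet` and `restore` are hypothetical names, and real change-sets would of course be decoded from backend objects first.

```rust
use std::collections::BTreeMap;

type PageNo = u32;
type PageImage = Vec<u8>;

/// One change-set: the page images written at a given global index.
struct ChangeSet {
    index: u64,
    pages: BTreeMap<PageNo, PageImage>,
}

/// Rebuild a database from the latest full backup plus later change-sets.
fn restore(
    snapshot: BTreeMap<PageNo, PageImage>,
    snapshot_index: u64,
    mut change_sets: Vec<ChangeSet>,
) -> BTreeMap<PageNo, PageImage> {
    // Only change-sets after the snapshot matter; older ones are already in it.
    change_sets.retain(|cs| cs.index > snapshot_index);
    // Replay strictly in index order so the latest page image wins.
    change_sets.sort_by_key(|cs| cs.index);

    let mut db = snapshot;
    for cs in change_sets {
        for (page_no, image) in cs.pages {
            db.insert(page_no, image);
        }
    }
    db
}
```

Nothing here consults WAL offsets or the live database, which is why the same procedure works across WAL resets.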
Leveled compaction
A long-lived backup accumulates many small change-sets. bakelite consolidates
them with leveled, time-windowed compaction: a contiguous run of level-(N−1)
change-sets is promoted into one level-N file once it spans that level's window
(compaction_levels = ["30s", "5m", "1h"] by default). Promoting deletes the
lower-level inputs, so storage and the per-restore object count stay bounded.
The most recent compaction_keep_recent change-sets are held back from
promotion, so recent point-in-time restore stays fine-grained. The merge
carries the lineage chain endpoints (see below) across promotion, so the
chain is preserved exactly without rewriting the tail.
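To make the promotion step concrete, here is a rough sketch of merging a contiguous run of change-sets into one. It assumes the merged object keeps the latest image of each page, takes its parent_hash from the first input and its content_hash from the last; the `Segment` struct and `merge_run` function are illustrative stand-ins, not bakelite's real types.

```rust
use std::collections::BTreeMap;

type Hash = [u8; 32];

#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start: u64,         // first raw index covered (inclusive)
    end: u64,           // last raw index covered (inclusive)
    parent_hash: Hash,  // chain value at the start boundary
    content_hash: Hash, // chain value at the end boundary
    pages: BTreeMap<u32, Vec<u8>>,
}

/// Merge a run of segments into one higher-level segment.
/// Returns `None` for an empty run or one with a gap in the index range.
fn merge_run(run: &[Segment]) -> Option<Segment> {
    let first = run.first()?;
    let last = run.last()?;
    // The run must be contiguous: each segment starts right after the previous one.
    if run.windows(2).any(|w| w[1].start != w[0].end + 1) {
        return None;
    }
    let mut pages = BTreeMap::new();
    for seg in run {
        for (page_no, image) in &seg.pages {
            // Later segments overwrite earlier ones: latest page image wins.
            pages.insert(*page_no, image.clone());
        }
    }
    Some(Segment {
        start: first.start,
        end: last.end,
        parent_hash: first.parent_hash,  // endpoint carried from the first input
        content_hash: last.content_hash, // endpoint carried from the last input
        pages,
    })
}
```

Because only the two endpoints survive, the segment that follows the merged run still finds the parent_hash it expects, and nothing after it needs rewriting.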
Per-object and lineage integrity
Every stored object carries a CRC-32C envelope (catches bit-rot at decode
time) and a BLAKE3 object_hash of its exact stored bytes — a much stronger
check that'll catch a substituted object, not just a flipped bit. On top of
that, change-sets form a BLAKE3 hash chain rooted at the base snapshot: each
carries parent_hash (the chain value at its start boundary) and content_hash
(at its end boundary), so a missing, reordered, or substituted change-set is
detectable. bakelite verify checks
all of this; bakelite verify --deep also does a full restore into a temp file
and runs PRAGMA integrity_check.
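The linkage half of that check can be sketched without any hashing at all. The code below treats hashes as opaque 32-byte values and only walks the chain; recomputing each BLAKE3 value from the stored bytes is deliberately left out. It also assumes the first change-set starts at the index right after the base snapshot. `Link`, `ChainError` and `verify_chain` are hypothetical names.

```rust
type Hash = [u8; 32];

/// The chain-relevant metadata of one stored change-set.
struct Link {
    start: u64,
    end: u64,
    parent_hash: Hash,
    content_hash: Hash,
}

#[derive(Debug, PartialEq)]
enum ChainError {
    /// An index range is missing or out of order after this index.
    Gap { after: u64 },
    /// The change-set starting here does not extend the current chain tip.
    ParentMismatch { at: u64 },
}

/// Walk the chain from the base snapshot and return the final chain value.
fn verify_chain(base: Hash, base_index: u64, links: &[Link]) -> Result<Hash, ChainError> {
    let mut tip = base;
    let mut prev_end = base_index;
    for link in links {
        if link.start != prev_end + 1 {
            return Err(ChainError::Gap { after: prev_end });
        }
        if link.parent_hash != tip {
            return Err(ChainError::ParentMismatch { at: link.start });
        }
        tip = link.content_hash;
        prev_end = link.end;
    }
    Ok(tip)
}
```

A missing or reordered change-set shows up as a `Gap`; a substituted one with the right index range but the wrong lineage shows up as a `ParentMismatch`.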
The replication loop
bakelite holds a long-running read lock so the WAL is append-only with stable salts while it ships segments incrementally, which stops a checkpoint from racing the WAL it's reading.
To bound WAL growth it performs an incremental checkpoint: ship all frames,
freeze writers, then in one tight control-connection operation release the write
lock, run wal_checkpoint(TRUNCATE), and re-read PRAGMA data_version. If
data_version changed (a concurrent commit slipped into the truncate window) or
the WAL didn't reset, it falls back to a full overlay snapshot — a new full backup
at the next index, always safe. Otherwise it keeps shipping change-sets from the
fresh WAL.
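The decision at the end of that sequence boils down to two observations. As a sketch only, assuming the caller has already sampled PRAGMA data_version before and after the checkpoint and knows whether the WAL reset (the function and enum names are made up for illustration):

```rust
/// What the replication loop does once the checkpoint attempt has finished.
#[derive(Debug, PartialEq)]
enum AfterCheckpoint {
    /// The WAL reset cleanly: keep shipping change-sets from the fresh WAL.
    ContinueIncremental,
    /// Something raced the checkpoint: take a new full overlay snapshot.
    FullSnapshot,
}

fn after_checkpoint(version_before: i64, version_after: i64, wal_reset: bool) -> AfterCheckpoint {
    // A changed data_version means a concurrent commit slipped into the window.
    let raced = version_before != version_after;
    if raced || !wal_reset {
        AfterCheckpoint::FullSnapshot
    } else {
        AfterCheckpoint::ContinueIncremental
    }
}
```

The fallback branch is the conservative one: whenever either signal is in doubt, a full snapshot at the next index is taken.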
A full-snapshot rotation happens once per snapshot_interval, which also bounds
restore-chain length and lets retention drop old objects.
bakelite is crash-safe: the replication cursor is persisted after each segment is durable, so a restarted daemon resumes from the cursor rather than re-snapshotting.
When things go wrong
Replication keeps running across the failures that actually happen in production, and none of them risk your committed data:
The backend is unreachable. A sync error — S3 down, a network blip, throttling
— is never fatal to the daemon. It logs the error, marks the database backing off
in bakelite status, and retries with exponential backoff
(1s, doubling, capped at 5 minutes). Throughout, bakelite holds its read lock, so
SQLite can't checkpoint past frames that haven't been shipped yet — committed
data is never dropped to make room. The un-shipped commits wait in the database's
own -wal file on disk, not in bakelite's memory, so the daemon's RSS stays flat
(~tens of MB) however long the outage runs. The real cost of a long outage is a
growing WAL on disk, not runaway RAM; when the backend returns, bakelite ships
the backlog and checkpoints the WAL back down. Your application keeps writing the
whole time — bakelite never blocks your writers.
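The retry schedule above (1s, doubling, capped at 5 minutes) is easy to pin down in code. This is a sketch of the arithmetic only; `backoff_delay` is a hypothetical helper, and whether bakelite adds jitter or resets the counter differently is not covered here.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based): 1s, 2s, 4s, ... capped at 5 minutes.
fn backoff_delay(attempt: u32) -> Duration {
    const BASE: Duration = Duration::from_secs(1);
    const CAP: Duration = Duration::from_secs(300);
    // Clamp the exponent so the multiplication cannot overflow;
    // 2^16 seconds is already far above the cap.
    let exponent = attempt.saturating_sub(1).min(16);
    (BASE * 2u32.pow(exponent)).min(CAP)
}
```

With this schedule the cap is reached on the tenth consecutive failure (512s would exceed 300s), after which every retry waits five minutes.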
The filesystem watcher misses an event. OS change notifications can be dropped
or coalesced, and bakelite doesn't depend on catching every one. The safety_poll
backstop (30s by default) means a missed event delays a sync by at most that
interval rather than stranding it. In normal operation your RPO is set by
max_batch_delay (~1s); safety_poll is only the worst-case upper bound on that delay.
The daemon or the host crashes. The replication cursor is persisted locally
after each change-set is durable on the backend, so a restart resumes from the
cursor rather than re-uploading from scratch. If a crash lands between "change-set
uploaded" and "cursor persisted", the worst case is re-shipping one change-set —
which is idempotent (same index, same bytes), never lost data — and anything
committed but not yet shipped is still in the -wal, picked up on resume. The
manifest is only an advisory cache, so a crash before it's refreshed costs nothing:
restore and retention reconcile by listing the objects, and verify --deep rebuilds
it.
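The "upload first, persist the cursor second" ordering is what makes a crash harmless. Here is a toy model of it, with an in-memory map standing in for the backend and a plain integer standing in for the persisted cursor; `Backend` and `resume` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Stand-in for the object store: index -> stored bytes.
#[derive(Default)]
struct Backend {
    objects: BTreeMap<u64, Vec<u8>>,
}

impl Backend {
    /// Writing the same index with the same bytes twice leaves the store unchanged.
    fn put(&mut self, index: u64, bytes: &[u8]) {
        self.objects.insert(index, bytes.to_vec());
    }
}

/// Ship every pending change-set past `cursor` and return the new cursor.
fn resume(backend: &mut Backend, cursor: u64, pending: &[(u64, Vec<u8>)]) -> u64 {
    let mut cursor = cursor;
    for (index, bytes) in pending {
        if *index <= cursor {
            continue; // already shipped and recorded
        }
        backend.put(*index, bytes); // step 1: durable on the backend
        cursor = *index;            // step 2: only then advance the cursor
    }
    cursor
}
```

If a crash lands between step 1 and step 2, the next `resume` call repeats the `put` for that one index and ends up in exactly the same state.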
On-disk format
Per database, on the backend:
databases/<db>/manifest.json -> advisory, rebuildable cache
databases/<db>/snapshots/<NNNN>.snap
databases/<db>/segments/L<NN>/<start>-<end>.seg
Snapshots and segments share one global, lifetime-wide index space. Segments are
partitioned by compaction level (L00 = raw, higher levels = merged), and each
filename carries the inclusive raw-index range the object covers (e.g.
0000000005-0000000030.seg for an L02 merge of raw segments 5–30) — indices
zero-padded so a lexical listing is already chronological. The manifest.json is an
advisory cache the daemon refreshes; the objects themselves are the source of truth,
reconciled by listing. Snapshots and segments are page-level, zstd-compressed
(configurable), and CRC-32C checked inside a small container format (a magic +
compression + CRC envelope wrapping the page images).
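As an illustration of the naming scheme, the segment key can be built with a single format string. The 10-digit padding and 2-digit level are read off the example filename above; `segment_key` itself is a hypothetical helper, and the snapshot key is left out because its padding width isn't spelled out here.

```rust
/// Build the backend key for a segment covering raw indices `start..=end`.
fn segment_key(db: &str, level: u8, start: u64, end: u64) -> String {
    // Zero-padding keeps a plain lexical listing in chronological order.
    format!("databases/{db}/segments/L{level:02}/{start:010}-{end:010}.seg")
}
```

For example, `segment_key("app", 2, 5, 30)` yields `databases/app/segments/L02/0000000005-0000000030.seg`, and the key for index 9 sorts before the key for index 10, which would not hold without the padding.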
A bug worth calling out
On Unix, close() releases all of a process's POSIX advisory locks on a
file. Reading the database file through a transient File::open/drop therefore
silently drops SQLite's reader lock, letting another connection checkpoint the
WAL out from under us — silent data loss.
bakelite reads the database file only through a single long-lived file
descriptor via positioned reads (pread). A dedicated regression test
(tests/wal_pinning.rs) guards against ever reintroducing a per-read open of the
database file.
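The safe pattern looks roughly like this on Unix. It is a sketch under stated assumptions, not bakelite's dbfile module: `DbFile` is a hypothetical wrapper, the page size is passed in rather than parsed from the database header, and `read_exact_at` from the standard library's `FileExt` trait plays the role of pread.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

/// One long-lived descriptor for the database file, opened once and kept open.
struct DbFile {
    file: File,
    page_size: u64,
}

impl DbFile {
    fn open(path: &Path, page_size: u64) -> io::Result<Self> {
        Ok(Self { file: File::open(path)?, page_size })
    }

    /// Read one page (1-based, as SQLite numbers them) with a positioned read.
    /// No descriptor is opened or closed here, so no POSIX lock is dropped.
    fn read_page(&self, page_no: u32) -> io::Result<Vec<u8>> {
        if page_no == 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "page numbers start at 1",
            ));
        }
        let mut buf = vec![0u8; self.page_size as usize];
        let offset = u64::from(page_no - 1) * self.page_size;
        self.file.read_exact_at(&mut buf, offset)?;
        Ok(buf)
    }
}
```

Positioned reads also leave the descriptor's file offset untouched, so `read_page` can take `&self` and needs no seek bookkeeping.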