Operations & disaster recovery
A good backup tool is one you forget about — until the day you need it, when recovery follows a fixed checklist. This page covers both: how to run bakelite so it pages you the moment a backup stops or goes stale, and the exact steps to bring a database back when one is lost or corrupt.
bakelite is built to run two ways, and a healthy deployment uses both:
- Noninteractive — the daemon backs up continuously, and a handful of scheduled timers prove the replica is alive and restorable, paging you when it isn't. Set once, then ignore.
- Interactive — you inspect (status, list, usage, doctor) and, on the bad day, restore.
The two operating modes
| Mode | You run | Cadence | Surfaces |
|---|---|---|---|
| Noninteractive | bakelite.service (daemon) | continuous | journald logs, /metrics, status files |
| Noninteractive | bakelite-check.timer | ~5 min | exit code → your alert |
| Noninteractive | bakelite-verify.timer | daily | verify marker (shallow) |
| Noninteractive | bakelite-verify-deep.timer | weekly | verify marker (restore drill) |
| Noninteractive | bakelite-repair.timer | weekly | heals bit-rot across destinations |
| Interactive | status / list / usage / doctor | on demand | terminal or --json |
| Interactive | restore | on incident | a restored database file |
The daemon: continuous backup
The daemon (bakelite.service) is the engine: it watches each database's -wal
file and ships changes the instant they commit.
Install and configure it first — see
Install & deploy for the User= /
ReadWritePaths= setup, and Configuration
for tuning. A healthy daemon produces no output, so silence alone proves nothing. The timers
below detect when it stops or falls behind and turn that into an alert.
Scheduled verification and liveness
The daemon ships backups. These timers prove they're alive and restorable, and page you when they're not. The package installs three timer pairs disabled (same as the daemon unit, so you finish configuring first); enable the ones you want:
sudo systemctl enable --now bakelite-check.timer # liveness + RPO + verify-freshness, ~5 min
sudo systemctl enable --now bakelite-verify.timer # daily shallow integrity check
sudo systemctl enable --now bakelite-verify-deep.timer # weekly full restore drill
Each timer just runs an exit-code primitive you could run by hand:
- bakelite-check runs status --check --max-lag 10m --max-verify-age 36h. It fails if any replica's daemon is stopped, stale, or backing off, if lag (RPO) exceeds --max-lag, or if the last clean verify is older than --max-verify-age. It reads only local status files — cheap enough to run every few minutes — and it is the one check that catches a fully dead daemon, which no in-process alerter can.
- bakelite-verify runs verify: it checks every stored object's integrity and the change-set lineage chain, and writes the "last verified" marker that --max-verify-age reads.
- bakelite-verify-deep runs verify --deep: it additionally restores the latest backup to a temp file and runs PRAGMA integrity_check — the end-to-end proof that the replica is actually restorable. It downloads and replays the full history to rebuild that state, so it's weekly by default.
The marker is the link between them: the verify timers prove integrity and stamp
the marker; the check timer gates liveness on the marker's freshness. Set
User=/Group= and ReadWritePaths= on all three units to match the daemon
(they read the markers it writes).
Alerting is yours to wire. Each unit carries a commented
#OnFailure=notify-failure@%n.service line: create your own notify-failure@.service
(it receives %i = the failed unit) and uncomment it to be paged. Add the same line
to bakelite.service to catch the daemon's own death.
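As a minimal sketch of the handler such a template unit might call: the script name, its install path, and the alert wording below are all hypothetical, and it assumes your template unit passes the failed unit name (%i) as the first argument. It only logs at critical priority through logger(1); swap that line for mail, a webhook call, or whatever pager you already run.

```shell
#!/bin/sh
# notify-failure.sh (hypothetical name and path, not shipped with bakelite).
# Assumes your template unit runs it as:
#   ExecStart=/usr/local/bin/notify-failure.sh %i

# Build the one-line alert text for a failed unit on a given host.
format_alert() {
    unit=$1
    host=$2
    printf 'bakelite alert: unit %s failed on %s\n' "$unit" "$host"
}

# Send the alert to syslog/journald at critical priority.
notify_failure() {
    unit=${1:?usage: notify_failure <failed-unit>}
    msg=$(format_alert "$unit" "$(uname -n)")
    logger -t bakelite-alert -p daemon.crit "$msg"
}

# Only act when a unit name was passed on the command line.
if [ "$#" -gt 0 ]; then
    notify_failure "$1"
fi
```

Keeping the message formatting separate from the delivery step makes it easy to change the transport later without touching the wording your on-call rotation is used to.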
No systemd? The same three schedules run from cron — but where the file goes (and its
format) depends on the host. The shipped bakelite.cron is the 6-field form: a user
column plus SHELL/PATH/MAILTO lines, the format a system crontab or cron.d drop-in
expects. The one rule to remember: a system crontab or cron.d file carries the user
column; a per-user crontab does not — so drop that column when you install into one.
- Vixie/cronie Linux (Devuan, Gentoo, Void, Slackware-with-cronie) — copy it in verbatim:
sudo cp /usr/share/bakelite/bakelite.cron /etc/cron.d/bakelite
sudo "$EDITOR" /etc/cron.d/bakelite # set the user + your alert command
- BSD / TrueNAS (FreeBSD, OpenBSD) — there is no /etc/cron.d/, but the system crontab keeps the user column: append the three job lines from bakelite.cron to /etc/crontab. Native cron picks them up; nothing to enable.
- Alpine / busybox crond — no cron.d and no user column: drop the bakelite column (leaving the 5-field form), install as /etc/crontabs/bakelite, then enable crond (off by default): sudo rc-update add crond default && sudo rc-service crond start.
- macOS — no cron.d; cron is semi-deprecated in favor of launchd. For cron, install the 5-field (no-user) form with sudo crontab -u bakelite -e. For the native path, wrap the same commands in launchd StartCalendarInterval agents instead.
(See Monitoring without a scraper for the same recipe written out by hand, and the Prometheus metrics if you'd rather scrape.)
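Dropping the user column by hand is easy to get wrong, so here is one possible sketch of doing it with sed. The function name strip_user_column is hypothetical, and the layout it assumes (five schedule fields, then a user, then the command) is the generic system-crontab form rather than a guarantee about the exact contents of the shipped bakelite.cron. It leaves comments and variable lines such as MAILTO= alone and does not handle @daily-style shortcuts.

```shell
# Convert system-crontab job lines (5 schedule fields + user + command)
# into the per-user form (5 schedule fields + command).
# Comment lines are skipped; lines with fewer than six fields, such as
# SHELL=/PATH=/MAILTO= assignments, do not match and pass through unchanged.
strip_user_column() {
    sed -E '/^[[:space:]]*#/!s/^([[:space:]]*([^[:space:]]+[[:space:]]+){5})[^[:space:]]+[[:space:]]+/\1/'
}

# Usage (review the result before installing it):
#   strip_user_column < /usr/share/bakelite/bakelite.cron
```

For an illustrative job line such as `*/5 * * * * bakelite bakelite status --check`, the first `bakelite` (the user) is removed and the command is kept intact.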
What to alert on
Starting points to tighten to your RPO:
| Signal | Default | Means | Do |
|---|---|---|---|
| status --check non-zero | — | daemon stopped / stale / backing off | page |
| --max-lag exceeded | 10m | RPO breached (replica falling behind) | page if sustained |
| --max-verify-age exceeded | 36h | integrity not re-proven recently | page |
| weekly deep verify fails | — | replica is not restorable | page hard |
Scraping Prometheus instead? The equivalents are bakelite_up == 0,
bakelite_replica_lag_seconds > 300, bakelite_backing_off_seconds present, and
bakelite_verify_ok == 0 — see the metrics list. The cron
exit-code path and the metrics read the same status files; use whichever your
monitoring already speaks.
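The flag thresholds are written as durations (10m, 36h) while the metric thresholds are plain seconds, so it helps to convert one into the other when you tune them. The helper below is a small sketch for that arithmetic only: to_seconds is a hypothetical name, and it handles just a whole number with an optional s, m, h, or d suffix, which may be narrower than what bakelite itself accepts.

```shell
# Convert a simple duration such as 10m or 36h into seconds.
# Accepts: <digits> with an optional single suffix s, m, h or d.
to_seconds() {
    value=${1%[smhd]}        # strip one trailing unit letter, if any
    unit=${1#"$value"}       # whatever was stripped (may be empty)
    case $value in
        ''|*[!0-9]*) printf 'bad duration: %s\n' "$1" >&2; return 1 ;;
    esac
    case $unit in
        ''|s) echo "$value" ;;
        m)    echo $((value * 60)) ;;
        h)    echo $((value * 3600)) ;;
        d)    echo $((value * 86400)) ;;
    esac
}

# to_seconds 10m   prints 600
# to_seconds 36h   prints 129600
```

Note that the default --max-lag of 10m is 600 seconds, so a rule on bakelite_replica_lag_seconds > 300 is twice as strict as the shipped check; pick one number and use it in both places if you run both paths.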
Why exit codes, not a built-in alerter
bakelite deliberately ships exit-code primitives (status --check, verify)
wired to your scheduler, rather than a built-in notifier. Three reasons:
- A backup tool's worst failure is the daemon silently stopping — and an in-process alerter can't fire when its own process is dead. The liveness signal has to come from outside — a cron job or systemd timer.
- Exit codes compose with the alerting you already run — cron mail, systemd OnFailure=, an Alertmanager probe — instead of adding a second thing to configure.
- It's why there's no in-daemon verify scheduler: the timers above run the same verify you'd run by hand, on your schedule, with no surprise bandwidth.
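To make the composition point concrete, here is a sketch of a tiny wrapper a cron job could use. The name alert_on_failure and the alert text are hypothetical; the only bakelite behavior it relies on is the one described above, a non-zero exit code when something is wrong. It prints one line to stderr on failure (which cron mail picks up) and passes the exit code through unchanged.

```shell
# Run a command; on a non-zero exit, print one alert line to stderr and
# return the same exit code so the scheduler still sees the failure.
alert_on_failure() {
    label=$1
    shift
    rc=0
    "$@" || rc=$?
    if [ "$rc" -ne 0 ]; then
        printf 'ALERT %s: exit %d\n' "$label" "$rc" >&2
    fi
    return "$rc"
}

# Example, using the same flags as the shipped check timer:
#   alert_on_failure bakelite-check \
#       bakelite status --check --max-lag 10m --max-verify-age 36h
```

Because the wrapper adds nothing on success, a healthy run stays silent and cron sends no mail.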
Repair: heal bit-rot across destinations
If you back up a database to more than one destination (two SFTP/SSH servers,
local + S3, two regions), bakelite repair keeps every copy honest. It scans every
backup object on each destination, and where one is corrupt or missing while
another holds a good copy, it rewrites the good copy over the bad one — healing
bit-rot in place, before you ever need to restore.
bakelite repair --db app # heal corrupt/missing copies
bakelite repair --db app --dry-run # report what would be healed, change nothing
bakelite repair # every configured database
It's backend-only and safe to run while the daemon is up — the heal is an
idempotent, byte-for-byte copy of an object that already exists, so it races
nothing. A single-destination database is a no-op (no sibling to heal from), and a
fleet-wide repair skips them cleanly. It exits non-zero if any object is corrupt
on every destination (only a fresh backup can replace it) or if a degraded copy
couldn't be overwritten.
Like the verify timers, the package ships a bakelite-repair.timer (and a
matching cron line), installed disabled, set to run weekly just after the deep
verify — enable it once you back up to more than one destination:
sudo systemctl enable --now bakelite-repair.timer # weekly heal, after verify-deep
Because repair reads every object on every destination (and writes the repairs),
it's deliberately a scheduled/manual operation, not an in-daemon loop — the same
cron-first reasoning as verify (no surprise bandwidth, and the liveness signal
stays outside the daemon).
This complements the automatic recovery on the read path: with 2+
destinations, restore and verify already
validate each object and fall through to a healthy copy when one has rotted (see
Redundancy & bit-rot recovery).
verify flags which destination is rotting; repair is how you fix it.
One caveat: on a WORM / S3-Object-Lock destination the rotted object is immutable,
so repair can't overwrite it — it reports that destination as degraded (the replica
is still restorable from the others) rather than failing.
Inspecting a replica
The read commands are backend-only (no daemon required) and all take --json and
exit non-zero on failure:
- status — is replication healthy right now? Lag, state, last error, verify freshness.
- list — what can I restore to? Restore-point spans (and the full structure with --verbose).
- usage — what is it costing, and am I near a limit?
- doctor — will a restore or a config change actually work? Run it before a recovery.
Disaster recovery: restoring a lost database
Work to a checklist. Every step restores to scratch and proves the copy before anything touches production.
- Pre-flight. bakelite doctor --db app — backend reachable, replica format readable, encryption key loads. See doctor.
- Pick a restore point. bakelite list --db app shows restorable UTC spans; paste one into --timestamp, or omit it for the latest state. If this is a corruption incident, restore to a point before it landed — restore is point-in-time, not just "latest". See Choosing a target.
- Restore to scratch. time bakelite restore --db app --output /tmp/dr.db --timestamp '<point>'. Restore walks the lineage chain and verifies each object's hash before writing — a corrupt or substituted object is refused, not applied — then prints what it did:
  Restored "app" -> /tmp/dr.db
  target: 2026-05-30T11:55:03Z -> 2026-05-30T11:55:03Z
  applied: full backup + 14 incremental change-set(s)
  ✓ lineage verified: 14/14 change-set(s), hashes match
  size: 280000 pages x 4096 bytes (1.07 GiB)
  downloaded: 412.5 MiB
  ✓ integrity_check: ok
  Restore also runs PRAGMA integrity_check internally; it never touches the live database.
- Second integrity check, different tool. sqlite3 /tmp/dr.db 'PRAGMA integrity_check;' — run by sqlite3 rather than bakelite, so a bug in one is less likely to pass both.
- App smoke test. Point a throwaway instance at /tmp/dr.db, run your read-path queries, and spot-check row counts against what you expect.
- Swap. Stop the app, replace the database (remove any stale -wal/-shm), restart, confirm — see Integrity & safe swaps. Then let the daemon take a fresh snapshot of the recovered database.
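The mechanical part of the checklist (pre-flight, restore to scratch, second integrity check) can be chained in a small function so nobody skips a step under pressure. This is a sketch under stated assumptions: dr_restore_scratch is a hypothetical name, it uses only the commands and flags shown in the checklist, and it relies on sqlite3 printing exactly "ok" for a healthy file. The smoke test and the swap stay manual on purpose.

```shell
# Pre-flight, restore to a scratch path, then re-check with sqlite3.
# Arguments: <db-name> <scratch-output-path> [restore-point]
# Nothing here touches the live database.
dr_restore_scratch() {
    db=$1
    out=$2
    ts=${3:-}    # optional restore point; empty means the latest state

    # Pre-flight: backend reachable, format readable, key loads.
    bakelite doctor --db "$db" || return 1

    # Restore to scratch, with --timestamp only when a point was given.
    if [ -n "$ts" ]; then
        bakelite restore --db "$db" --output "$out" --timestamp "$ts" || return 1
    else
        bakelite restore --db "$db" --output "$out" || return 1
    fi

    # Second opinion from a different tool.
    result=$(sqlite3 "$out" 'PRAGMA integrity_check;') || return 1
    if [ "$result" != "ok" ]; then
        printf 'integrity_check failed: %s\n' "$result" >&2
        return 1
    fi
    printf 'scratch copy %s passed both checks\n' "$out"
}

# Example:
#   dr_restore_scratch app /tmp/dr.db '<point>'
```

Each step returns early on failure, so a non-zero exit always means the scratch copy has not been proven and must not be swapped in.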
How long to budget. Restore streams in roughly constant memory and is dominated
by download time on object stores. As a rough guide it runs at hundreds of MB/s
locally (a ~1 GB database in a second or two); measure your own with the time
command above. See How long it takes.
Suspect tampering? Prefer --timestamp <recent> over latest (it resolves
independently of the CURRENT pointer), and check verify for a CURRENT-not-newest
warning. For prevention, see
immutable backups with Object Lock.
Boot-time auto-recovery
On disposable infrastructure, make every boot self-healing: seed the database from the replica when the volume is empty, and no-op otherwise. The guarded restore is safe to run unconditionally on every start:
bakelite restore --db app --if-db-not-exists --if-replica-exists \
--config /etc/bakelite/bakelite.toml
--if-db-not-exists makes it a no-op if the database file already exists (the
steady-state reboot); --if-replica-exists makes it a no-op on a brand-new
deployment with nothing backed up yet. With --output defaulting to the configured
path, it's an in-place restore. See Restore on boot.
A Docker entrypoint that seeds then runs the app:
#!/bin/sh
set -e
bakelite restore --db app --if-db-not-exists --if-replica-exists \
--config /etc/bakelite/bakelite.toml
exec /usr/local/bin/myapp
A Kubernetes init-container doing the same against a shared volume, with the bakelite daemon continuing replication as a sidecar (illustrative — adapt to your cluster):
initContainers:
  - name: bakelite-restore
    image: ghcr.io/duggan/bakelite:latest
    args: ["restore", "--db", "app", "--if-db-not-exists", "--if-replica-exists",
           "--config", "/etc/bakelite/bakelite.toml"]
    volumeMounts:
      - { name: data, mountPath: /var/lib/app }
      - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
containers:
  - name: bakelite
    image: ghcr.io/duggan/bakelite:latest
    args: ["daemon", "--config", "/etc/bakelite/bakelite.toml"]
    volumeMounts:
      - { name: data, mountPath: /var/lib/app }
      - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
The daemon takes a per-database advisory lock, so a sidecar daemon and a one-shot restore against the same database won't race — the restore runs in the init phase, before the daemon starts.
Containers: liveness and verification
In a container there's no systemd and usually no cron — so don't reach for the cron fallback above. The daemon is simply the long-running container process (the sidecar shown above), and the orchestrator is both your scheduler and your alerter. The same three schedules map onto container-native primitives:
| systemd timer | Container equivalent |
|---|---|
| bakelite-check (status --check) | Docker HEALTHCHECK / Kubernetes liveness or readiness probe |
| bakelite-verify | a Kubernetes CronJob (or host-scheduled docker run) |
| bakelite-verify-deep | the same, on a weekly schedule |
Liveness — status --check as a probe. It reads only local status files (no
backend round-trip) and exits non-zero when unhealthy, which is exactly a probe's
contract. In Compose:
healthcheck:
  # exec (CMD) form, NOT CMD-SHELL: the shipped image is distroless — there is no shell.
  test: ["CMD", "bakelite", "status", "--check", "--config", "/etc/bakelite/bakelite.toml"]
  interval: 5m
  timeout: 10s
In Kubernetes, split the bare aliveness check from the threshold check (illustrative — adapt to your cluster):
livenessProbe:   # bare check only — restart a genuinely dead daemon
  exec:
    command: ["bakelite", "status", "--check", "--config", "/etc/bakelite/bakelite.toml"]
  periodSeconds: 60
readinessProbe:  # thresholds belong here — mark NotReady, hold a rollout
  exec:
    command: ["bakelite", "status", "--check", "--max-lag", "10m",
              "--max-verify-age", "36h", "--config", "/etc/bakelite/bakelite.toml"]
  periodSeconds: 60
Never gate liveness on --max-lag/--max-verify-age. A liveness failure
restarts the container, and restarting can't fix backend lag or a stale verify — it
just thrashes. Keep thresholds on a readiness probe (or external alerting); a liveness
probe checks only "is the daemon alive".
Verification — verify as a CronJob. It's backend-only, so it runs as a
short-lived container needing just the config and backend creds — not the data
volume:
apiVersion: batch/v1
kind: CronJob
metadata: { name: bakelite-verify }
spec:
  schedule: "0 3 * * *"   # weekly "30 3 * * 0" + args [verify, --deep] for the deep drill
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: ghcr.io/duggan/bakelite:latest
              args: ["verify", "--config", "/etc/bakelite/bakelite.toml"]
              volumeMounts:
                - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
          volumes:
            - { name: bakelite-config, secret: { secretName: bakelite-config } }
verify exits non-zero on any problem, so alert on Job failure directly (e.g.
kube_job_failed from kube-state-metrics) — the orchestrator already watches Jobs, so
you don't need the marker round-trip the systemd story uses. If you do want the
sidecar's status --check --max-verify-age to gate on freshness, mount the same data
volume (a shared PVC, not an emptyDir) into the CronJob so the marker it writes
(.bakelite/<name>.verify.json) lands where the check reads it.
See also
- Restore — the full target-selection rules and timing detail.
- Configuration — tuning, encryption, and storage limits.
- Project status — maturity and how to validate bakelite yourself.