Operations & disaster recovery

A good backup tool is one you forget about — until the day you need it, when recovery follows a fixed checklist. This page covers both: how to run bakelite so it pages you the moment a backup stops or goes stale, and the exact steps to bring a database back when one is lost or corrupt.

bakelite is built to run two ways, and a healthy deployment uses both:

Noninteractive — the daemon backs up continuously, and a handful of scheduled timers prove the replica is alive and restorable, paging you when it isn't. Set once, then ignore.
Interactive — you inspect (status, list, usage, doctor) and, on the bad day, restore.

The two operating modes

Mode	You run	Cadence	Surfaces
Noninteractive	`bakelite.service` (daemon)	continuous	journald logs, `/metrics`, status files
Noninteractive	`bakelite-check.timer`	~5 min	exit code → your alert
Noninteractive	`bakelite-verify.timer`	daily	verify marker (shallow)
Noninteractive	`bakelite-verify-deep.timer`	weekly	verify marker (restore drill)
Noninteractive	`bakelite-repair.timer`	weekly	heals bit-rot across destinations
Interactive	`status` / `list` / `usage` / `doctor`	on demand	terminal or `--json`
Interactive	`restore`	on incident	a restored database file

The daemon: continuous backup

The daemon (bakelite.service) is the engine: it watches each database's -wal file and ships changes the instant they commit. Install and configure it first: see Install & deploy for the User= / ReadWritePaths= setup, and Configuration for tuning. A healthy daemon produces no output, so silence alone can't tell you it's working. The timers below exist to detect when it stops or falls behind and to turn that into an alert.

Scheduled verification and liveness

The daemon ships backups. These timers prove they're alive and restorable, and page you when they're not. The package installs three timer pairs disabled (same as the daemon unit, so you finish configuring first); enable the ones you want:

sudo systemctl enable --now bakelite-check.timer        # liveness + RPO + verify-freshness, ~5 min
sudo systemctl enable --now bakelite-verify.timer       # daily shallow integrity check
sudo systemctl enable --now bakelite-verify-deep.timer  # weekly full restore drill

Each timer just runs an exit-code primitive you could run by hand:

bakelite-check runs status --check --max-lag 10m --max-verify-age 36h. It fails if any replica's daemon is stopped, stale, or backing off, if lag (RPO) exceeds --max-lag, or if the last clean verify is older than --max-verify-age. It reads only local status files — cheap enough to run every few minutes — and it is the one check that catches a fully dead daemon, which no in-process alerter can.
bakelite-verify runs verify: it checks every stored object's integrity and the change-set lineage chain, and writes the "last verified" marker that --max-verify-age reads.
bakelite-verify-deep runs verify --deep: it additionally restores the latest backup to a temp file and runs PRAGMA integrity_check — the end-to-end proof that the replica is actually restorable. It downloads and replays the full history to rebuild that state, so it's weekly by default.

The marker is the link between them: the verify timers prove integrity and stamp the marker; the check timer gates liveness on the marker's freshness. Set User=/Group= and ReadWritePaths= on all three units to match the daemon (they read the markers it writes).
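To see what the check timer sees, you can run the same primitive by hand and branch on its exit code. A minimal sketch, assuming bakelite is on your PATH; run_check is a hypothetical helper name, and the only behavior it relies on is the documented zero/non-zero exit status. Extra arguments (such as --config) are passed through unchanged.

```shell
#!/bin/sh
# Run the same primitive bakelite-check.timer runs and report by exit code.
# The thresholds are the defaults quoted above; run_check itself is illustrative.

run_check() {
    if bakelite status --check --max-lag 10m --max-verify-age 36h "$@"; then
        echo "bakelite: healthy"
    else
        rc=$?   # exit status of the failed check
        echo "bakelite: UNHEALTHY (exit $rc)" >&2
        return "$rc"
    fi
}
```

For example, `run_check --config /etc/bakelite/bakelite.toml || echo "would page here"` gives you a quick feel for how a scheduler will react before you enable the timer.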

Alerting is yours to wire. Each unit carries a commented #OnFailure=notify-failure@%n.service line: create your own notify-failure@.service (it receives %i = the failed unit) and uncomment it to be paged. Add the same line to bakelite.service to catch the daemon's own death.
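What the notify unit actually runs is up to you. One possible shape, sketched as a shell handler: it assumes your notify unit passes the failed unit name (%i) as the first argument, and that ALERT_CMD (a hypothetical environment variable, not a bakelite setting) names a command that reads the message on stdin, such as a mail or webhook wrapper. With ALERT_CMD unset it just prints the message.

```shell
#!/bin/sh
# Hypothetical handler a notify unit could call, e.g.
#   ExecStart=/usr/local/bin/notify-failure.sh %i
# ALERT_CMD is an assumption: any command that reads a message on stdin.

notify_failure() {
    unit="$1"
    host="$(hostname 2>/dev/null || echo unknown-host)"
    # ALERT_CMD is left unquoted on purpose so it may carry arguments.
    printf 'bakelite alert: unit %s failed on %s\n' "$unit" "$host" \
        | ${ALERT_CMD:-cat}
}

# Only act when called with a unit name, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    notify_failure "$1"
fi
```

Keeping the handler this small is deliberate: the decision about whether something failed has already been made by the exit code, so the handler only has to deliver the message.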

No systemd? The same three schedules run from cron — but where the file goes (and its format) depends on the host. The shipped bakelite.cron is the 6-field form: a user column plus SHELL/PATH/MAILTO lines, the format a system crontab or cron.d drop-in expects. The one rule to remember: a system crontab or cron.d file carries the user column; a per-user crontab does not — so drop that column when you install into one.

Vixie/cronie Linux (Devuan, Gentoo, Void, Slackware-with-cronie) — copy it in verbatim:

sudo cp /usr/share/bakelite/bakelite.cron /etc/cron.d/bakelite
sudo "$EDITOR" /etc/cron.d/bakelite   # set the user + your alert command

BSD / TrueNAS (FreeBSD, OpenBSD) — there is no /etc/cron.d/, but the system crontab keeps the user column: append the three job lines from bakelite.cron to /etc/crontab. Native cron picks them up; nothing to enable.
Alpine / busybox crond: no cron.d and no user column. Drop the user column (the bakelite field) from each job line, leaving the 5-field form, install the result as /etc/crontabs/bakelite, then enable crond (off by default): sudo rc-update add crond default && sudo rc-service crond start.
macOS — no cron.d; cron is semi-deprecated in favor of launchd. For cron, install the 5-field (no-user) form with sudo crontab -u bakelite -e. For the native path, wrap the same commands in launchd StartCalendarInterval agents instead.
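Dropping the user column by hand is easy to get wrong across several lines. A small sketch of doing it with awk, assuming job lines use the plain `min hour dom mon dow user command` layout (no @daily-style shortcuts); strip_user_column is a hypothetical helper, and the sample line in the usage note is illustrative rather than the shipped file's exact content.

```shell
#!/bin/sh
# Convert a 6-field (system crontab / cron.d) file on stdin to the
# 5-field per-user form on stdout by removing the user column.

strip_user_column() {
    awk '
        /^[ \t]*#/ || /^[ \t]*$/            { print; next }  # comments, blanks
        /^[ \t]*[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ { print; next }  # SHELL=, PATH=, MAILTO=
        NF < 7                              { print; next }  # not a 6-field job line
        {
            sched = $1 " " $2 " " $3 " " $4 " " $5
            cmd = $0
            # Peel off the five schedule fields and the user field,
            # leaving the command text exactly as written.
            for (i = 1; i <= 6; i++) sub(/^[ \t]*[^ \t]+/, "", cmd)
            sub(/^[ \t]+/, "", cmd)
            print sched " " cmd
        }
    '
}
```

For instance, `strip_user_column < /usr/share/bakelite/bakelite.cron` would turn a line like `*/5 * * * * bakelite bakelite status --check` into `*/5 * * * * bakelite status --check`. Review the result before installing it.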

(See Monitoring without a scraper for the same recipe written out by hand, and the Prometheus metrics if you'd rather scrape.)

What to alert on

Starting points; tighten them to match your RPO:

Signal	Default	Means	Do
`status --check` non-zero	—	daemon stopped / stale / backing off	page
`--max-lag` exceeded	10m	RPO breached (replica falling behind)	page if sustained
`--max-verify-age` exceeded	36h	integrity not re-proven recently	page
weekly deep verify fails	—	replica is not restorable	page hard

Scraping Prometheus instead? The equivalents are bakelite_up == 0, bakelite_replica_lag_seconds > 600 (the same 10m as --max-lag), bakelite_backing_off_seconds present, and bakelite_verify_ok == 0; see the metrics list. The cron exit-code path and the metrics read the same status files, so use whichever your monitoring already speaks.

Why exit codes, not a built-in alerter

bakelite deliberately ships exit-code primitives (status --check, verify) wired to your scheduler, rather than a built-in notifier. Three reasons:

A backup tool's worst failure is the daemon silently stopping — and an in-process alerter can't fire when its own process is dead. The liveness signal has to come from outside — a cron job or systemd timer.
Exit codes compose with the alerting you already run — cron mail, systemd OnFailure=, an Alertmanager probe — instead of adding a second thing to configure.
It keeps scheduling in your hands. That's why there's no in-daemon verify scheduler: the timers above run the same verify you'd run by hand, on your schedule, with no surprise bandwidth.

Repair: heal bit-rot across destinations

If you back up a database to more than one destination (two SFTP/SSH servers, local + S3, two regions), bakelite repair keeps every copy honest. It scans every backup object on each destination, and where one is corrupt or missing while another holds a good copy, it rewrites the good copy over the bad one — healing bit-rot in place, before you ever need to restore.

bakelite repair --db app            # heal corrupt/missing copies
bakelite repair --db app --dry-run  # report what would be healed, change nothing
bakelite repair                     # every configured database

It's backend-only and safe to run while the daemon is up — the heal is an idempotent, byte-for-byte copy of an object that already exists, so it races nothing. A single-destination database is a no-op (no sibling to heal from), and a fleet-wide repair skips them cleanly. It exits non-zero if any object is corrupt on every destination (only a fresh backup can replace it) or if a degraded copy couldn't be overwritten.
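To make the healing rule concrete, here is a toy model of it. This is not bakelite's implementation: two local directories stand in for two destinations, the expected SHA-256 is assumed to be known from elsewhere, and heal_object and sha_of are hypothetical names. It also assumes GNU sha256sum is available.

```shell
#!/bin/sh
# Toy model of cross-destination healing (illustration only).

sha_of() { sha256sum "$1" 2>/dev/null | cut -d' ' -f1; }

heal_object() {
    name="$1"; want="$2"; a="$3"; b="$4"
    sum_a="$(sha_of "$a/$name")"   # empty if the object is missing
    sum_b="$(sha_of "$b/$name")"
    if [ "$sum_a" = "$want" ] && [ "$sum_b" = "$want" ]; then
        echo "ok"
    elif [ "$sum_a" = "$want" ]; then
        cp "$a/$name" "$b/$name" && echo "healed $b"
    elif [ "$sum_b" = "$want" ]; then
        cp "$b/$name" "$a/$name" && echo "healed $a"
    else
        # Corrupt or missing everywhere: nothing left to copy from.
        echo "unrecoverable" >&2
        return 1
    fi
}
```

Running it twice on the same object shows the idempotence described above: the first call copies the good bytes over the bad ones, the second finds both copies matching and changes nothing.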

Like the verify timers, the package ships a bakelite-repair.timer (and a matching cron line), installed disabled, set to run weekly just after the deep verify — enable it once you back up to more than one destination:

sudo systemctl enable --now bakelite-repair.timer   # weekly heal, after verify-deep

Because repair reads every object on every destination (and writes the repairs), it's deliberately a scheduled/manual operation, not an in-daemon loop — the same cron-first reasoning as verify (no surprise bandwidth, and the liveness signal stays outside the daemon).

This complements the automatic recovery on the read path: with 2+ destinations, restore and verify already validate each object and fall through to a healthy copy when one has rotted (see Redundancy & bit-rot recovery). verify flags which destination is rotting; repair is how you fix it.

One caveat: on a WORM / S3-Object-Lock destination the rotted object is immutable, so repair can't overwrite it — it reports that destination as degraded (the replica is still restorable from the others) rather than failing.

Inspecting a replica

The read commands are backend-only (no daemon required) and all take --json and exit non-zero on failure:

status — is replication healthy right now? Lag, state, last error, verify freshness.
list — what can I restore to? Restore-point spans (and the full structure with --verbose).
usage — what is it costing, and am I near a limit?
doctor — will a restore or a config change actually work? Run it before a recovery.

Disaster recovery: restoring a lost database

Work to a checklist. Every step restores to scratch and proves the copy before anything touches production.

Pre-flight. bakelite doctor --db app — backend reachable, replica format readable, encryption key loads. See doctor.
Pick a restore point. bakelite list --db app shows restorable UTC spans; paste one into --timestamp, or omit it for the latest state. If this is a corruption incident, restore to a point before it landed — restore is point-in-time, not just "latest". See Choosing a target.
Restore to scratch. time bakelite restore --db app --output /tmp/dr.db --timestamp '<point>'. Restore walks the lineage chain and verifies each object's hash before writing — a corrupt or substituted object is refused, not applied — then prints what it did:
```
Restored "app" -> /tmp/dr.db
  target: 2026-05-30T11:55:03Z -> 2026-05-30T11:55:03Z
  applied: full backup + 14 incremental change-set(s)
  ✓ lineage verified: 14/14 change-set(s), hashes match
  size: 280000 pages x 4096 bytes (1.07 GiB)
  downloaded: 412.5 MiB
  ✓ integrity_check: ok
```
Restore also runs PRAGMA integrity_check internally; it never touches the live database.
Second integrity check, different tool. sqlite3 /tmp/dr.db 'PRAGMA integrity_check;' — run by sqlite3 rather than bakelite, so a bug in one is less likely to pass both.
App smoke test. Point a throwaway instance at /tmp/dr.db, run your read-path queries, and spot-check row counts against what you expect.
Swap. Stop the app, replace the database (remove any stale -wal/-shm), restart, confirm — see Integrity & safe swaps. Then let the daemon take a fresh snapshot of the recovered database.
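The first four steps of the checklist can be chained into one script so a drill is repeatable. A sketch under stated assumptions: bakelite and the sqlite3 CLI are on your PATH, dr_restore_to_scratch is a hypothetical helper name, and the return codes 1/2/3 are this sketch's own convention. It deliberately stops before the smoke test and swap, which stay manual.

```shell
#!/bin/sh
# Pre-flight, restore to scratch, then a second integrity check with sqlite3.
# Usage: dr_restore_to_scratch <db> <output-path> [timestamp]

dr_restore_to_scratch() {
    db="$1"; out="$2"; ts="${3:-}"

    # 1. Pre-flight.
    bakelite doctor --db "$db" || { echo "pre-flight failed" >&2; return 1; }

    # 2-3. Restore to scratch; omit --timestamp for the latest state.
    if [ -n "$ts" ]; then
        bakelite restore --db "$db" --output "$out" --timestamp "$ts" || return 2
    else
        bakelite restore --db "$db" --output "$out" || return 2
    fi

    # 4. Second opinion from a different tool; a clean database prints "ok".
    result="$(sqlite3 "$out" 'PRAGMA integrity_check;')" || return 3
    [ "$result" = "ok" ] || { echo "integrity_check: $result" >&2; return 3; }

    echo "scratch copy verified: $out"
}
```

For example, `time dr_restore_to_scratch app /tmp/dr.db` gives you both a verified scratch copy and a measured restore duration for your own runbook.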

How long to budget. Restore streams in roughly constant memory and is dominated by download time on object stores. As a rough guide it runs at hundreds of MB/s locally (a ~1 GB database in a second or two); measure your own with the time command above. See How long it takes.

Suspect tampering? Prefer --timestamp <recent> over latest (it resolves independently of the CURRENT pointer), and check verify for a CURRENT-not-newest warning. For prevention, see immutable backups with Object Lock.

Boot-time auto-recovery

On disposable infrastructure, make every boot self-healing: seed the database from the replica when the volume is empty, and no-op otherwise. The guarded restore is safe to run unconditionally on every start:

bakelite restore --db app --if-db-not-exists --if-replica-exists \
  --config /etc/bakelite/bakelite.toml

--if-db-not-exists makes it a no-op if the database file already exists (the steady-state reboot); --if-replica-exists makes it a no-op on a brand-new deployment with nothing backed up yet. With --output defaulting to the configured path, it's an in-place restore. See Restore on boot.

A Docker entrypoint that seeds then runs the app:

#!/bin/sh
set -e
bakelite restore --db app --if-db-not-exists --if-replica-exists \
  --config /etc/bakelite/bakelite.toml
exec /usr/local/bin/myapp

A Kubernetes init-container doing the same against a shared volume, with the bakelite daemon continuing replication as a sidecar (illustrative — adapt to your cluster):

initContainers:
  - name: bakelite-restore
    image: ghcr.io/duggan/bakelite:latest
    args: ["restore", "--db", "app", "--if-db-not-exists", "--if-replica-exists",
           "--config", "/etc/bakelite/bakelite.toml"]
    volumeMounts:
      - { name: data, mountPath: /var/lib/app }
      - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
containers:
  - name: bakelite
    image: ghcr.io/duggan/bakelite:latest
    args: ["daemon", "--config", "/etc/bakelite/bakelite.toml"]
    volumeMounts:
      - { name: data, mountPath: /var/lib/app }
      - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }

The daemon takes a per-database advisory lock, so a sidecar daemon and a one-shot restore against the same database won't race. In this layout they never overlap anyway: the restore runs in the init phase, before the daemon starts.

Containers: liveness and verification

In a container there's no systemd and usually no cron — so don't reach for the cron fallback above. The daemon is simply the long-running container process (the sidecar shown above), and the orchestrator is both your scheduler and your alerter. The same three schedules map onto container-native primitives:

systemd timer	Container equivalent
`bakelite-check` (`status --check`)	Docker `HEALTHCHECK` / Kubernetes liveness or readiness probe
`bakelite-verify`	a Kubernetes `CronJob` (or host-scheduled `docker run`)
`bakelite-verify-deep`	the same, on a weekly schedule

Liveness — status --check as a probe. It reads only local status files (no backend round-trip) and exits non-zero when unhealthy, which is exactly a probe's contract. In Compose:

healthcheck:
  # exec (CMD) form, NOT CMD-SHELL: the shipped image is distroless — there is no shell.
  test: ["CMD", "bakelite", "status", "--check", "--config", "/etc/bakelite/bakelite.toml"]
  interval: 5m
  timeout: 10s

In Kubernetes, split the bare aliveness check from the threshold check (illustrative — adapt to your cluster):

livenessProbe:        # bare check only — restart a genuinely dead daemon
  exec:
    command: ["bakelite", "status", "--check", "--config", "/etc/bakelite/bakelite.toml"]
  periodSeconds: 60
readinessProbe:       # thresholds belong here — mark NotReady, hold a rollout
  exec:
    command: ["bakelite", "status", "--check", "--max-lag", "10m",
              "--max-verify-age", "36h", "--config", "/etc/bakelite/bakelite.toml"]
  periodSeconds: 60

Never gate liveness on --max-lag/--max-verify-age. A liveness failure restarts the container, and restarting can't fix backend lag or a stale verify — it just thrashes. Keep thresholds on a readiness probe (or external alerting); a liveness probe checks only "is the daemon alive".

Verification — verify as a CronJob. It's backend-only, so it runs as a short-lived container needing just the config and backend creds — not the data volume:

apiVersion: batch/v1
kind: CronJob
metadata: { name: bakelite-verify }
spec:
  schedule: "0 3 * * *"              # daily at 03:00; for the weekly deep drill use "30 3 * * 0" with args [verify, --deep]
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: ghcr.io/duggan/bakelite:latest
              args: ["verify", "--config", "/etc/bakelite/bakelite.toml"]
              volumeMounts:
                - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
          volumes:
            - { name: bakelite-config, secret: { secretName: bakelite-config } }

verify exits non-zero on any problem, so alert on Job failure directly (e.g. kube_job_failed from kube-state-metrics) — the orchestrator already watches Jobs, so you don't need the marker round-trip the systemd story uses. If you do want the sidecar's status --check --max-verify-age to gate on freshness, mount the same data volume (a shared PVC, not an emptyDir) into the CronJob so the marker it writes (.bakelite/<name>.verify.json) lands where the check reads it.
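If the readiness check keeps reporting a stale verify after the CronJob succeeds, the first thing to rule out is that the marker isn't landing on the shared volume. A rough debugging sketch: it uses the file's modification time as a stand-in for freshness, which is an assumption (bakelite may well read a timestamp inside the JSON instead), and marker_is_fresh is a hypothetical helper name.

```shell
#!/bin/sh
# Is the verify marker present, and was it touched recently enough?
# Usage: marker_is_fresh <marker-path> <max-age-minutes>
# Returns 0 if fresh, 1 if stale, 2 if missing.

marker_is_fresh() {
    marker="$1"; max_min="$2"
    [ -f "$marker" ] || { echo "missing: $marker" >&2; return 2; }
    # find prints the path only if it was modified MORE than max_min minutes ago.
    if [ -n "$(find "$marker" -mmin +"$max_min" 2>/dev/null)" ]; then
        echo "stale: $marker" >&2
        return 1
    fi
    echo "fresh: $marker"
}
```

From inside the sidecar's pod you might run `marker_is_fresh /var/lib/app/.bakelite/app.verify.json 2160` (36h is 2160 minutes). The path assumes the data volume is mounted at /var/lib/app and the database is named app; since the shipped image has no shell, run this from a debug container that mounts the same volume.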