Pre-release. bakelite is unreleased and still under active testing — docs and behaviour may change without notice.

Operations & disaster recovery

A good backup tool is one you forget about — until the day you need it, when recovery follows a fixed checklist. This page covers both: how to run bakelite so it pages you the moment a backup stops or goes stale, and the exact steps to bring a database back when one is lost or corrupt.

bakelite is built to run two ways, and a healthy deployment uses both:

The two operating modes

| Mode | You run | Cadence | Surfaces |
|---|---|---|---|
| Noninteractive | bakelite.service (daemon) | continuous | journald logs, /metrics, status files |
| Noninteractive | bakelite-check.timer | ~5 min | exit code → your alert |
| Noninteractive | bakelite-verify.timer | daily | verify marker (shallow) |
| Noninteractive | bakelite-verify-deep.timer | weekly | verify marker (restore drill) |
| Noninteractive | bakelite-repair.timer | weekly | heals bit-rot across destinations |
| Interactive | status / list / usage / doctor | on demand | terminal or --json |
| Interactive | restore | on incident | a restored database file |

The daemon: continuous backup

The daemon (bakelite.service) is the engine: it watches each database's -wal file and ships changes the instant they commit. Install and configure it first; see Install & deploy for the User= / ReadWritePaths= setup, and Configuration for tuning. A healthy daemon produces no output, so silence alone can't tell you whether it is working. The timers below exist to detect when it stops or falls behind and to turn that into an alert.

Scheduled verification and liveness

The daemon ships backups. These timers prove they're alive and restorable, and page you when they're not. The package installs three timer pairs disabled (same as the daemon unit, so you finish configuring first); enable the ones you want:

sudo systemctl enable --now bakelite-check.timer        # liveness + RPO + verify-freshness, ~5 min
sudo systemctl enable --now bakelite-verify.timer       # daily shallow integrity check
sudo systemctl enable --now bakelite-verify-deep.timer  # weekly full restore drill

Each timer just runs an exit-code primitive that you could also run by hand.
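As a sketch, the three primitives can be written as shell functions. The flags are the ones used elsewhere on this page (status --check with --max-lag and --max-verify-age, verify, and verify --deep). The function names are illustrative, and the shipped unit files may pass different arguments.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, and the shipped unit files
# may pass different arguments. Flags are the ones used elsewhere on this page.

# bakelite-check.timer: liveness + RPO + verify freshness
check_liveness() {
    bakelite status --check --max-lag 10m --max-verify-age 36h
}

# bakelite-verify.timer: daily shallow integrity check
verify_shallow() {
    bakelite verify
}

# bakelite-verify-deep.timer: weekly full restore drill
verify_deep() {
    bakelite verify --deep
}
```

Each function returns bakelite's own exit status, so something like `check_liveness || echo "backup unhealthy" >&2` is enough to try the alert path by hand.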

The marker is the link between them: the verify timers prove integrity and stamp the marker; the check timer gates liveness on the marker's freshness. Set User=/Group= and ReadWritePaths= on all three units to match the daemon (they read the markers it writes).

Alerting is yours to wire. Each unit carries a commented #OnFailure=notify-failure@%n.service line: create your own notify-failure@.service (it receives %i = the failed unit) and uncomment it to be paged. Add the same line to bakelite.service to catch the daemon's own death.

No systemd? The same three schedules run from cron — but where the file goes (and its format) depends on the host. The shipped bakelite.cron is the 6-field form: a user column plus SHELL/PATH/MAILTO lines, the format a system crontab or cron.d drop-in expects. The one rule to remember: a system crontab or cron.d file carries the user column; a per-user crontab does not — so drop that column when you install into one.
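One way to apply that rule mechanically is a small filter that drops the sixth field from job lines. This is a sketch under assumptions: it expects plain five-field schedules followed by a user column (no @daily-style shortcuts), and the helper name strip_user_column is hypothetical.

```shell
#!/bin/sh
# Sketch: convert system-crontab (cron.d) job lines into per-user crontab lines
# by dropping the sixth field (the user column). Comment lines and VAR=value
# lines pass through unchanged because they do not start with five schedule
# fields followed by a user. @daily-style shortcuts are not handled.
strip_user_column() {
    sed -E 's/^(([^#[:space:]][^[:space:]]*[[:space:]]+){5})[^[:space:]]+[[:space:]]+/\1/'
}
```

For example, `strip_user_column < bakelite.cron > bakelite.user.cron` writes a per-user variant that you can review before installing it with crontab.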

(See Monitoring without a scraper for the same recipe written out by hand, and the Prometheus metrics if you'd rather scrape.)

What to alert on

Starting points to tighten to your RPO:

| Signal | Default | Means | Do |
|---|---|---|---|
| status --check non-zero | n/a | daemon stopped / stale / backing off | page |
| --max-lag exceeded | 10m | RPO breached (replica falling behind) | page if sustained |
| --max-verify-age exceeded | 36h | integrity not re-proven recently | page |
| weekly deep verify fails | n/a | replica is not restorable | page hard |
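"Page if sustained" needs a little state between runs. A minimal sketch, assuming you are happy to keep a failure streak in a small file: the helper name, state-file location, and threshold are illustrative choices, not bakelite features.

```shell
#!/bin/sh
# Sketch: report failure only once a check has failed N runs in a row.
# usage: sustained_failure STATE_FILE THRESHOLD COMMAND [ARGS...]
# Returns 0 while healthy or below the threshold, 1 once the streak reaches it.
sustained_failure() {
    state_file=$1; threshold=$2; shift 2
    if "$@"; then
        rm -f "$state_file"          # a healthy run resets the streak
        return 0
    fi
    count=$(cat "$state_file" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$state_file"
    [ "$count" -lt "$threshold" ]    # non-zero once the streak is long enough
}
```

With a 5-minute schedule, `sustained_failure /var/tmp/bakelite-lag.streak 3 bakelite status --check --max-lag 10m || your-pager-command` would page after roughly 15 minutes of continuous lag; your-pager-command is a placeholder for whatever you already use.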

Scraping Prometheus instead? The equivalents are bakelite_up == 0, bakelite_replica_lag_seconds > 300, bakelite_backing_off_seconds present, and bakelite_verify_ok == 0 — see the metrics list. The cron exit-code path and the metrics read the same status files; use whichever your monitoring already speaks.

Why exit codes, not a built-in alerter

bakelite deliberately ships exit-code primitives (status --check, verify) wired to your scheduler, rather than a built-in notifier. Three reasons:

  1. A backup tool's worst failure is the daemon silently stopping — and an in-process alerter can't fire when its own process is dead. The liveness signal has to come from outside — a cron job or systemd timer.
  2. Exit codes compose with the alerting you already run — cron mail, systemd OnFailure=, an Alertmanager probe — instead of adding a second thing to configure.
  3. The same reasoning is why there's no in-daemon verify scheduler: the timers above run the same verify you'd run by hand, on your schedule, with no surprise bandwidth.

Repair: heal bit-rot across destinations

If you back up a database to more than one destination (two SFTP/SSH servers, local + S3, two regions), bakelite repair keeps every copy honest. It scans every backup object on each destination, and where one is corrupt or missing while another holds a good copy, it rewrites the good copy over the bad one — healing bit-rot in place, before you ever need to restore.

bakelite repair --db app            # heal corrupt/missing copies
bakelite repair --db app --dry-run  # report what would be healed, change nothing
bakelite repair                     # every configured database

It's backend-only and safe to run while the daemon is up — the heal is an idempotent, byte-for-byte copy of an object that already exists, so it races nothing. A single-destination database is a no-op (no sibling to heal from), and a fleet-wide repair skips them cleanly. It exits non-zero if any object is corrupt on every destination (only a fresh backup can replace it) or if a degraded copy couldn't be overwritten.

Like the verify timers, the package ships a bakelite-repair.timer (and a matching cron line), installed disabled, set to run weekly just after the deep verify — enable it once you back up to more than one destination:

sudo systemctl enable --now bakelite-repair.timer   # weekly heal, after verify-deep

Because repair reads every object on every destination (and writes the repairs), it's deliberately a scheduled/manual operation, not an in-daemon loop — the same cron-first reasoning as verify (no surprise bandwidth, and the liveness signal stays outside the daemon).

This complements the automatic recovery on the read path: with 2+ destinations, restore and verify already validate each object and fall through to a healthy copy when one has rotted (see Redundancy & bit-rot recovery). verify flags which destination is rotting; repair is how you fix it.

One caveat: on a WORM / S3-Object-Lock destination the rotted object is immutable, so repair can't overwrite it — it reports that destination as degraded (the replica is still restorable from the others) rather than failing.

Inspecting a replica

The read commands are backend-only (no daemon required); all take --json and exit non-zero on failure.
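Because the read commands share that contract, a small wrapper can keep JSON output only from runs that succeeded. This is a sketch: the read_json helper and the file paths are hypothetical, and it assumes --json is accepted after the subcommand's other arguments.

```shell
#!/bin/sh
# Sketch: run a read command with --json and keep the output only on success.
# usage: read_json OUTFILE SUBCOMMAND [ARGS...]
read_json() {
    out=$1; shift
    if bakelite "$@" --json > "$out.tmp"; then
        mv "$out.tmp" "$out"         # success: publish the captured JSON
    else
        rc=$?                        # failure: keep bakelite's exit status
        rm -f "$out.tmp"
        return "$rc"
    fi
}
```

For example, `read_json /tmp/spans.json list --db app` leaves /tmp/spans.json in place only if list exited zero.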

Disaster recovery: restoring a lost database

Work to a checklist. Every step restores to scratch and proves the copy before anything touches production.

  1. Pre-flight. bakelite doctor --db app — backend reachable, replica format readable, encryption key loads. See doctor.
  2. Pick a restore point. bakelite list --db app shows restorable UTC spans; paste one into --timestamp, or omit it for the latest state. If this is a corruption incident, restore to a point before it landed — restore is point-in-time, not just "latest". See Choosing a target.
  3. Restore to scratch. time bakelite restore --db app --output /tmp/dr.db --timestamp '<point>'. Restore walks the lineage chain and verifies each object's hash before writing — a corrupt or substituted object is refused, not applied — then prints what it did:
    Restored "app" -> /tmp/dr.db
      target: 2026-05-30T11:55:03Z -> 2026-05-30T11:55:03Z
      applied: full backup + 14 incremental change-set(s)
      ✓ lineage verified: 14/14 change-set(s), hashes match
      size: 280000 pages x 4096 bytes (1.07 GiB)
      downloaded: 412.5 MiB
      ✓ integrity_check: ok
    Restore also runs PRAGMA integrity_check internally; it never touches the live database.
  4. Second integrity check, different tool. sqlite3 /tmp/dr.db 'PRAGMA integrity_check;' — run by sqlite3 rather than bakelite, so a bug in one is less likely to pass both.
  5. App smoke test. Point a throwaway instance at /tmp/dr.db, run your read-path queries, and spot-check row counts against what you expect.
  6. Swap. Stop the app, replace the database (remove any stale -wal/-shm), restart, confirm — see Integrity & safe swaps. Then let the daemon take a fresh snapshot of the recovered database.
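Steps 1, 3 and 4 of the checklist are mechanical enough to script for drills. A minimal sketch, using only the commands shown above: the dr_drill name and argument order are illustrative, and choosing the restore point, the smoke test, and the swap stay manual.

```shell
#!/bin/sh
# Sketch of checklist steps 1, 3 and 4: pre-flight, restore to scratch, then an
# independent integrity check. Stops at the first failing step.
# usage: dr_drill DB SCRATCH_PATH [TIMESTAMP]
dr_drill() {
    db=$1; scratch=$2; ts=${3:-}

    # step 1: pre-flight
    bakelite doctor --db "$db" || return 1

    # step 3: restore to scratch (latest state when no timestamp is given)
    if [ -n "$ts" ]; then
        bakelite restore --db "$db" --output "$scratch" --timestamp "$ts" || return 1
    else
        bakelite restore --db "$db" --output "$scratch" || return 1
    fi

    # step 4: second opinion from sqlite3; a healthy database prints exactly "ok"
    result=$(sqlite3 "$scratch" 'PRAGMA integrity_check;') || return 1
    [ "$result" = "ok" ]
}
```

Run it as `dr_drill app /tmp/dr.db '<point>'`; a zero exit means the scratch copy passed both checks and is ready for the smoke test.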

How long to budget. Restore streams in roughly constant memory and is dominated by download time on object stores. As a rough guide it runs at hundreds of MB/s locally (a ~1 GB database in a second or two); measure your own with the time command above. See How long it takes.
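For object stores, a back-of-envelope lower bound is download size divided by sustained link speed. The helper below is a sketch with a hypothetical name; it ignores request latency, hash verification, and disk writes, so real restores take longer.

```shell
#!/bin/sh
# Sketch: lower-bound download time in seconds.
# usage: estimate_download_seconds SIZE_MIB LINK_MBIT_PER_S
estimate_download_seconds() {
    # MiB -> bits, divided by link speed in bits per second
    awk -v mib="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / (mbps * 1000000) }'
}
```

With the 412.5 MiB download from the sample restore output and a 100 Mbit/s link, `estimate_download_seconds 412.5 100` prints 34.6.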

Suspect tampering? Prefer --timestamp <recent> over latest (it resolves independently of the CURRENT pointer), and check verify for a CURRENT-not-newest warning. For prevention, see immutable backups with Object Lock.

Boot-time auto-recovery

On disposable infrastructure, make every boot self-healing: seed the database from the replica when the volume is empty, and no-op otherwise. The guarded restore is safe to run unconditionally on every start:

bakelite restore --db app --if-db-not-exists --if-replica-exists \
  --config /etc/bakelite/bakelite.toml

--if-db-not-exists makes it a no-op if the database file already exists (the steady-state reboot); --if-replica-exists makes it a no-op on a brand-new deployment with nothing backed up yet. With --output defaulting to the configured path, it's an in-place restore. See Restore on boot.

A Docker entrypoint that seeds then runs the app:

#!/bin/sh
set -e
bakelite restore --db app --if-db-not-exists --if-replica-exists \
  --config /etc/bakelite/bakelite.toml
exec /usr/local/bin/myapp

A Kubernetes init-container doing the same against a shared volume, with the bakelite daemon continuing replication as a sidecar (illustrative — adapt to your cluster):

initContainers:
  - name: bakelite-restore
    image: ghcr.io/duggan/bakelite:latest
    args: ["restore", "--db", "app", "--if-db-not-exists", "--if-replica-exists",
           "--config", "/etc/bakelite/bakelite.toml"]
    volumeMounts:
      - { name: data, mountPath: /var/lib/app }
      - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
containers:
  - name: bakelite
    image: ghcr.io/duggan/bakelite:latest
    args: ["daemon", "--config", "/etc/bakelite/bakelite.toml"]
    volumeMounts:
      - { name: data, mountPath: /var/lib/app }
      - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }

The daemon takes a per-database advisory lock, so a daemon and a one-shot restore against the same database won't race. In this layout they can't overlap anyway: the restore runs in the init phase, before the sidecar daemon starts.

Containers: liveness and verification

In a container there's no systemd and usually no cron — so don't reach for the cron fallback above. The daemon is simply the long-running container process (the sidecar shown above), and the orchestrator is both your scheduler and your alerter. The same three schedules map onto container-native primitives:

| systemd timer | Container equivalent |
|---|---|
| bakelite-check (status --check) | Docker HEALTHCHECK / Kubernetes liveness or readiness probe |
| bakelite-verify | a Kubernetes CronJob (or host-scheduled docker run) |
| bakelite-verify-deep | the same, on a weekly schedule |

Liveness — status --check as a probe. It reads only local status files (no backend round-trip) and exits non-zero when unhealthy, which is exactly a probe's contract. In Compose:

healthcheck:
  # exec (CMD) form, NOT CMD-SHELL: the shipped image is distroless — there is no shell.
  test: ["CMD", "bakelite", "status", "--check", "--config", "/etc/bakelite/bakelite.toml"]
  interval: 5m
  timeout: 10s

In Kubernetes, split the bare aliveness check from the threshold check (illustrative — adapt to your cluster):

livenessProbe:        # bare check only — restart a genuinely dead daemon
  exec:
    command: ["bakelite", "status", "--check", "--config", "/etc/bakelite/bakelite.toml"]
  periodSeconds: 60
readinessProbe:       # thresholds belong here — mark NotReady, hold a rollout
  exec:
    command: ["bakelite", "status", "--check", "--max-lag", "10m",
              "--max-verify-age", "36h", "--config", "/etc/bakelite/bakelite.toml"]
  periodSeconds: 60

Never gate liveness on --max-lag/--max-verify-age. A liveness failure restarts the container, and restarting can't fix backend lag or a stale verify — it just thrashes. Keep thresholds on a readiness probe (or external alerting); a liveness probe checks only "is the daemon alive".

Verification — verify as a CronJob. It's backend-only, so it runs as a short-lived container needing just the config and backend creds — not the data volume:

apiVersion: batch/v1
kind: CronJob
metadata: { name: bakelite-verify }
spec:
  schedule: "0 3 * * *"              # weekly "30 3 * * 0" + args [verify, --deep] for the deep drill
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: ghcr.io/duggan/bakelite:latest
              args: ["verify", "--config", "/etc/bakelite/bakelite.toml"]
              volumeMounts:
                - { name: bakelite-config, mountPath: /etc/bakelite, readOnly: true }
          volumes:
            - { name: bakelite-config, secret: { secretName: bakelite-config } }

verify exits non-zero on any problem, so alert on Job failure directly (e.g. kube_job_failed from kube-state-metrics) — the orchestrator already watches Jobs, so you don't need the marker round-trip the systemd story uses. If you do want the sidecar's status --check --max-verify-age to gate on freshness, mount the same data volume (a shared PVC, not an emptyDir) into the CronJob so the marker it writes (.bakelite/<name>.verify.json) lands where the check reads it.

See also