Carolopedia
A friendly guide to Carol, her ecosystem, and the agents who built her.
Infrastructure & Backups: Architecture
🎯Key functional considerations
This service is the safety net under every other service, so its architecture is shaped by what it must guarantee end-to-end:
- Nothing is lost. Every service's critical state is snapshotted on a schedule and shipped off the machine, so a single-host failure never loses data.
- The right history is kept. Retention keeps enough versions to recover from and prunes the rest, so backups protect without growing unbounded.
- The machine stays healthy. Disk, temp files, stale working files, logs and databases are tidied continuously so nothing silently fills up or rots.
- Services stay reachable. The systemd units, the web/proxy layer and the secure tunnel that expose each app are kept up; a down app is detected and relaunched rather than left dark.
- Cost is attributable. Protection is charged in proportion to the data protected, so the service is accountable to a cost center.
🧰Technologies used
- Python 3 tooling for the backup, cleanup and app-steward droids, run on the Carol host behind nginx.
- SQLite (WAL) holds the state being protected: the snapshot captures the registry, design store, plan-generator, initiatives and constitution databases, plus laptop-critical assets.
- Git / GitHub is the offsite for code: unpushed commits are pushed so the remote is a durable copy.
- systemd and cron schedule the recurring droids (daily snapshot, daily janitor); systemd units plus a secure tunnel are the runtime layer that keeps each app reachable.
- The registry is the source of truth for which apps exist and must stay alive, and for cost-center attribution.
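As an illustration of snapshotting a live WAL-mode database, here is a minimal sketch that uses the online backup API in Python's standard `sqlite3` module. The function name `snapshot_sqlite` and the paths are hypothetical; the actual Backup Custodian droid may work differently.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(source: Path, dest: Path) -> Path:
    """Copy a live SQLite database to dest (hypothetical helper, not the droid's code)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(source)
    dst = sqlite3.connect(dest)
    try:
        # Connection.backup copies a consistent view of the database,
        # including committed pages that still live in the WAL file.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return dest
```

Copying the `.db` file directly can miss committed data that has not yet been checkpointed out of the WAL, which is why the sketch goes through the backup API instead of a plain file copy.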
🏗Solution architecture
The service is a set of blocks — the distinct steps shown in the Blocks section of the service page — each owned by an agent and carried out by that agent's droids. It is a direct instance of Carolverse's agent-centric modular architecture.
- Snapshot, then ship. One path takes the snapshot of every service's data; a second pushes code commits offsite, with a dry-run twin that reviews push state before the real push.
- Continuous housekeeping. A separate, always-running cleanup path reclaims disk space by file age and guards a disk-usage threshold, so backup and runtime are never starved of space.
- Self-healing runtime. The app-steward path watches the service's registered apps and relaunches any that go down, rather than waiting for a human to notice.
- Schedule-driven, not request-driven. The work is recurring and time-triggered; there is no public request surface to protect data on demand.
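The housekeeping path above can be sketched as two small functions: one prunes files by age, the other checks a disk-usage threshold. The function names, the age limit and the threshold value are assumptions for illustration, not the Daily Janitor's actual settings.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def prune_older_than(root: Path, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Delete regular files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(root.rglob("*")):
        if path.is_file() and not path.is_symlink() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


def disk_over_threshold(path: Path, max_used_fraction: float) -> bool:
    """Return True when the filesystem holding path is fuller than the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > max_used_fraction
```

A scheduled run would typically call `prune_older_than` first and then treat a `True` result from `disk_over_threshold` as the signal to raise an alert.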
📐Design principles followed
- Single source of truth. The set of apps to keep alive and the data to protect come from the live registry — the shared principle described on the Carolverse Architecture page.
- Offsite by default. A backup that only lives on the same machine is not a backup; snapshots and commits are shipped off-host.
- Self-heal over block-and-wait. A down app is relaunched and stale disk is reclaimed automatically rather than escalated.
- Agent-centric modular architecture. Every block has an accountable agent and a doing droid.
- Keep the right history, prune the rest. Retention is deliberate, not unbounded growth.
- Observability first. Scheduled droids emit run-audit so a failed backup or cleanup is visible, not silent.
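One way a scheduled droid could emit a run-audit record is a context manager that appends one JSON line per run, including failed runs. The record fields and the JSON-lines log are assumptions; the page does not describe the real run-audit format.

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def run_audit(droid: str, audit_log: Path):
    """Append one audit record per run (hypothetical format: one JSON object per line)."""
    record = {"droid": droid, "started_at": time.time(), "status": "ok"}
    try:
        yield record
    except Exception as exc:
        # Record the failure, then re-raise so the scheduler still sees it.
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["finished_at"] = time.time()
        with audit_log.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
```

Wrapping a droid's main routine in `with run_audit("daily-janitor", log_path):` means a crash still leaves a `failed` record behind instead of silence.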
✅Success criteria
- Every protected service's critical state has a recent off-machine snapshot that can be restored.
- Unpushed commits do not accumulate — code is durably mirrored to GitHub.
- Disk never silently fills — stale temp and caches are reclaimed and the disk-usage threshold holds.
- Registered apps stay up — a down app is detected and relaunched without operator action.
- Each recurring droid's runs are auditable, so a missed or failed run surfaces on a monitor.
🛡Service-specific policies
- Backups are scheduled, never ad-hoc — the snapshot and shipping run on their schedule, owned by the accountable agent.
- Ship offsite — snapshots and commits must leave the host to count as protected.
- Retention is enforced — keep the right window of history and prune the rest.
- Every action is tagged to a droid under the owning agent; recurring work emits run-audit so it is observable.
- Protection is charged in proportion to the data protected, against this service's cost center.
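The retention policy can be illustrated with a simple "keep the newest N" rule. This page does not state the real retention window, so both the rule and the assumption that snapshot names sort chronologically (for example, date-prefixed names) are illustrative only.

```python
from typing import Iterable, List, Tuple


def split_by_retention(snapshot_names: Iterable[str],
                       keep_last: int) -> Tuple[List[str], List[str]]:
    """Return (keep, prune) lists, keeping the keep_last newest snapshots.

    Assumes names sort chronologically, e.g. '2024-01-03-registry.db'.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = sorted(snapshot_names, reverse=True)  # newest first
    return ordered[:keep_last], ordered[keep_last:]
```

Separating the decision (which snapshots to prune) from the deletion itself makes it easy to log or review the prune list before anything is removed.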
📦End-user deliverables
Current
- Daily off-machine snapshots of Carol + BB database state (registry, designs, plan-generator, initiatives, constitution) plus laptop-critical assets — Hagrid via the Backup Custodian droid (Backups & Shipping block).
- Offsite code shipping — unpushed commits pushed to GitHub, with a dry-run review of push state first — Hagrid via the Shipper and Shipper Twin droids.
- Disk reclamation and guarding: pruning stale `/tmp` files and regenerable caches by age, then guarding the disk-usage threshold. Hagrid via the Daily Janitor droid (Cleanup & Housekeeping block).
- App liveness: keeping the service's registered apps alive by detecting down apps and relaunching them via `shared/app_steward.py`. Hagrid via the App Steward droid (Infrastructure & Runtime block).
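The detect-and-relaunch behavior can be sketched against systemd as follows. This is not the contents of `shared/app_steward.py`; the function names and unit names are hypothetical, and the sketch assumes each registered app maps to one systemd unit.

```python
import subprocess
from typing import Callable, Iterable, List

Runner = Callable[[List[str]], int]


def systemctl(args: List[str]) -> int:
    """Run systemctl with the given arguments and return its exit code."""
    return subprocess.run(["systemctl", *args], check=False).returncode


def heal_units(units: Iterable[str], run: Runner = systemctl) -> List[str]:
    """Restart every unit that is not active; return the units that were restarted."""
    relaunched = []
    for unit in units:
        # `systemctl is-active --quiet` exits 0 only when the unit is active.
        if run(["is-active", "--quiet", unit]) != 0:
            run(["restart", unit])
            relaunched.append(unit)
    return relaunched
```

In a real steward the unit list would come from the registry rather than being hard-coded, and the `run` parameter lets the logic be tested without touching systemd.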
Future (on demand)
- Restore drills — periodic test restores from a snapshot to prove backups are recoverable, not just present.
- Retention reporting — a clear view of what history exists per protected service and how it is pruned.
- Per-service protection cost surfaced against the cost center, so charges are visible alongside the data protected.
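A restore drill for a SQLite snapshot could look like the sketch below: copy the snapshot to a scratch directory, open the copy, and run SQLite's built-in integrity check. Restore drills are listed as future work, so this is only an assumption about one possible shape, and `restore_drill` is a hypothetical name.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def restore_drill(snapshot: Path) -> bool:
    """Restore a snapshot into a scratch directory and verify it opens cleanly."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / snapshot.name
        shutil.copy2(snapshot, restored)  # never open the snapshot itself
        con = sqlite3.connect(restored)
        try:
            rows = con.execute("PRAGMA integrity_check").fetchall()
        except sqlite3.DatabaseError:
            return False  # the file is not a readable SQLite database
        finally:
            con.close()
    # A healthy database reports a single row containing 'ok'.
    return rows == [("ok",)]
```

A fuller drill would also run a few application-level queries against the restored copy, since a structurally valid database can still be missing expected tables.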
📘End-user run book
This service has no public agent-facing tools — it is schedule-driven, run by the owning agent Hagrid and his droids.
Operate
- Backups & Shipping runs on its daily schedule (snapshot + GitHub push, with the dry-run twin reviewing push state).
- Cleanup & Housekeeping runs the daily disk reclaim and guards the disk-usage threshold.
- Infrastructure & Runtime keeps systemd units, the proxy layer and the secure tunnel up; the app-steward relaunches a down app via `shared/app_steward.py`.
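The push step and its dry-run twin can be sketched with standard git commands: count commits that exist locally but not on the upstream branch, then run the push with or without `--dry-run`. The function names are hypothetical, and the sketch assumes the current branch has an upstream configured.

```python
import subprocess
from typing import Callable, List

GitRunner = Callable[[str, List[str]], str]


def git(repo: str, args: List[str]) -> str:
    """Run a git command inside repo and return its standard output."""
    result = subprocess.run(["git", "-C", repo, *args],
                            check=True, capture_output=True, text=True)
    return result.stdout


def unpushed_commit_count(repo: str, run: GitRunner = git) -> int:
    """Count commits on HEAD that the upstream branch does not have yet."""
    return int(run(repo, ["rev-list", "--count", "@{upstream}..HEAD"]).strip())


def ship(repo: str, dry_run: bool = True, run: GitRunner = git) -> int:
    """Push pending commits (or only rehearse the push); return how many were pending."""
    pending = unpushed_commit_count(repo, run)
    if pending:
        run(repo, ["push", "--dry-run"] if dry_run else ["push"])
    return pending
```

Defaulting `dry_run` to `True` mirrors the review-first ordering described above: the real push only happens when a caller opts in explicitly.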
Check health
- Confirm each recurring droid's latest run-audit is recent and green; a missing or failed run is the signal to investigate.
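The health check above can be sketched as a function that takes audit records and reports which droids are failed, stale or missing. The JSON-lines record shape (`droid`, `status`, `finished_at`) and the droid identifiers are assumptions for illustration; the real run-audit format is not described on this page.

```python
import json
import time
from typing import Dict, Iterable, Optional


def audit_problems(audit_lines: Iterable[str], expected_droids: Iterable[str],
                   max_age_seconds: float,
                   now: Optional[float] = None) -> Dict[str, str]:
    """Map each unhealthy droid to 'failed', 'stale' or 'missing'."""
    now = time.time() if now is None else now
    latest: Dict[str, dict] = {}
    for line in audit_lines:
        rec = json.loads(line)
        prev = latest.get(rec["droid"])
        # Keep only the most recent record per droid.
        if prev is None or rec["finished_at"] > prev["finished_at"]:
            latest[rec["droid"]] = rec
    problems: Dict[str, str] = {}
    for droid in expected_droids:
        rec = latest.get(droid)
        if rec is None:
            problems[droid] = "missing"
        elif rec["status"] != "ok":
            problems[droid] = "failed"
        elif now - rec["finished_at"] > max_age_seconds:
            problems[droid] = "stale"
    return problems
```

Passing the expected droid list explicitly matters: a droid that never ran leaves no record at all, so it can only be caught by comparing against what should exist.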
Where the rules live
- The Cookbook is operational law; retention, offsite and observability obligations bind every block here.