Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.

Infrastructure & Backups Architecture

The defined architecture of the Infrastructure & Backups service, in eight standard sections.

🎯Key functional considerations

This service is the safety net under every other service, so its architecture is shaped by what it must guarantee end-to-end:

  • Nothing is lost. Every service's critical state is snapshotted on a schedule and shipped off the machine, so a single-host failure never loses data.
  • The right history is kept. Retention keeps enough versions to recover from and prunes the rest, so backups protect without growing unbounded.
  • The machine stays healthy. Disk, temp files, stale working files, logs and databases are tidied continuously so nothing silently fills up or rots.
  • Services stay reachable. The systemd units, the web/proxy layer and the secure tunnel that expose each app are kept up; a down app is detected and relaunched rather than left dark.
  • Cost is attributable. Protection is charged according to the data protected, so the service is accountable to a cost center.

🧰Technologies used

  • Python 3 tooling for the backup, cleanup and app-steward droids, run on the Carol host behind nginx.
  • SQLite in WAL mode holds the state being protected: the snapshot captures the registry, design store, plan-generator, initiatives and constitution databases, plus laptop-critical assets.
  • Git / GitHub is the offsite destination for code: unpushed commits are pushed so that the remote holds a durable copy.
  • systemd and cron schedule the recurring droids (daily snapshot, daily janitor); systemd units plus a secure tunnel are the runtime layer that keeps each app reachable.
  • The registry is the source of truth for which apps exist and must stay alive, and for cost-center attribution.
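
Snapshotting a WAL-mode SQLite database can be sketched with the online backup API in Python's standard `sqlite3` module, which copies a consistent view even while the source is in use. This is only an illustration, not the Backup Custodian's actual code: the function name `snapshot_sqlite` and all paths are hypothetical.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(src: Path, dest: Path) -> Path:
    """Copy a live SQLite database to `dest` via the online backup API.

    Hypothetical helper: the real droid's names and paths are not shown
    on this page.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    src_conn = sqlite3.connect(str(src))
    dest_conn = sqlite3.connect(str(dest))
    try:
        # Connection.backup() copies a consistent view of the source,
        # including committed pages that still sit in the WAL file.
        src_conn.backup(dest_conn)
    finally:
        dest_conn.close()
        src_conn.close()
    return dest
```

A plain file copy of the `.db` file can miss committed data that has not yet been checkpointed out of the WAL file, which is why the sketch goes through the backup API instead.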

🏗Solution architecture

The service is a set of blocks — the distinct steps shown in the Blocks section of the service page — each owned by an agent and carried out by that agent's droids. It is a direct instance of Carolverse's agent-centric modular architecture.

  • Snapshot, then ship. One path takes the snapshot of every service's data; a second pushes code commits offsite, with a dry-run twin that reviews push state before the real push.
  • Continuous housekeeping. A separate, always-running cleanup path reclaims disk space by file age and guards a disk-usage threshold, so backup and runtime are never starved of space.
  • Self-healing runtime. The app-steward path watches the service's registered apps and relaunches any that go down, rather than waiting for a human to notice.
  • Schedule-driven, not request-driven. The work is recurring and time-triggered; there is no public request surface to protect data on demand.
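
The "dry-run twin, then real push" pattern described above can be sketched as one function with a `dry_run` switch. This is a minimal sketch under assumptions: the names `git_runner`, `unpushed_commits` and `ship` are hypothetical, and the page does not show how the Shipper droids are actually implemented. The git command is injected as a callable so the logic can be exercised without a repository.

```python
import subprocess
from typing import Callable, List

# A runner takes git arguments and returns stdout. Injecting it keeps the
# logic testable without a real repository.
Runner = Callable[[List[str]], str]


def git_runner(repo: str) -> Runner:
    """Build a runner that executes git inside `repo` (hypothetical path)."""
    def run(args: List[str]) -> str:
        result = subprocess.run(
            ["git", "-C", repo, *args],
            check=True, capture_output=True, text=True,
        )
        return result.stdout
    return run


def unpushed_commits(run: Runner) -> List[str]:
    """List commits on HEAD that the upstream branch does not have yet."""
    out = run(["log", "--oneline", "@{u}..HEAD"])
    return [line for line in out.splitlines() if line.strip()]


def ship(run: Runner, dry_run: bool = True) -> List[str]:
    """Report pending commits; push them only when dry_run is False."""
    pending = unpushed_commits(run)
    if pending and not dry_run:
        run(["push"])
    return pending
```

Calling `ship(git_runner("/path/to/repo"), dry_run=True)` plays the role of the twin: it reports what would be pushed and changes nothing. Note that `@{u}` compares against the locally known upstream ref, so a fetch beforehand gives a more accurate picture.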

📐Design principles followed

  • Single source of truth. The set of apps to keep alive and the data to protect come from the live registry — the shared principle described on the Carolverse Architecture page.
  • Offsite by default. A backup that only lives on the same machine is not a backup; snapshots and commits are shipped off-host.
  • Self-heal over block-and-wait. A down app is relaunched and stale disk is reclaimed automatically rather than escalated.
  • Agent-centric modular architecture. Every block has an accountable agent and a doing droid.
  • Keep the right history, prune the rest. Retention is deliberate, not unbounded growth.
  • Observability first. Scheduled droids emit run-audit so a failed backup or cleanup is visible, not silent.
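
"Keep the right history, prune the rest" can be illustrated with a keep-the-newest-N rule. This is a sketch only: the page does not state the real retention window, so `keep` is a hypothetical policy knob, and `prune_snapshots` is an illustrative name rather than an existing function.

```python
from pathlib import Path
from typing import List


def prune_snapshots(snapshot_dir: Path, keep: int) -> List[Path]:
    """Delete all but the `keep` newest files in `snapshot_dir`.

    Returns the paths that were removed. `keep` is a hypothetical policy
    value; the real retention window is not given on this page.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1 so a restore point remains")
    snapshots = sorted(
        (p for p in snapshot_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = snapshots[keep:]
    for path in stale:
        path.unlink()
    return stale
```

Refusing `keep < 1` is a deliberate guard: a retention rule that can delete every snapshot would defeat the purpose of the service.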

Success criteria

  • Every protected service's critical state has a recent off-machine snapshot that can be restored.
  • Unpushed commits do not accumulate — code is durably mirrored to GitHub.
  • Disk never silently fills — stale temp and caches are reclaimed and the disk-usage threshold holds.
  • Registered apps stay up — a down app is detected and relaunched without operator action.
  • Each recurring droid's runs are auditable, so a missed or failed run surfaces on a monitor.

🛡Service-specific policies

  • Backups are scheduled, never ad-hoc — the snapshot and shipping run on their schedule, owned by the accountable agent.
  • Ship offsite — snapshots and commits must leave the host to count as protected.
  • Retention is enforced — keep the right window of history and prune the rest.
  • Every action is tagged to a droid under the owning agent; recurring work emits run-audit so it is observable.
  • Protection is charged according to the data protected, against this service's cost center.
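
The tagging and run-audit policy above could look like the following sketch, in which every run appends one JSON line naming the agent and droid and recording how the run ended. The record fields (`agent`, `droid`, `status`, `started_at`, `finished_at`) and the JSON-lines file are assumptions made for illustration; the page does not describe the real run-audit format.

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def run_audit(log_path: Path, agent: str, droid: str):
    """Append one JSON line per run, recording who ran and how it ended.

    The record layout is hypothetical, chosen only for this sketch.
    """
    record = {"agent": agent, "droid": droid, "started_at": time.time()}
    try:
        yield record
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise  # the failure is recorded, not swallowed
    finally:
        record["finished_at"] = time.time()
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
```

A droid would wrap its work in `with run_audit(path, "Hagrid", "Daily Janitor"): ...`, so a failed run leaves a `failed` record and still raises.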

📦End-user deliverables

Current

  • Daily off-machine snapshots of Carol + BB database state (registry, designs, plan-generator, initiatives, constitution) plus laptop-critical assets. Delivered by Hagrid via the Backup Custodian droid (Backups & Shipping block).
  • Offsite code shipping: unpushed commits are pushed to GitHub, with a dry-run review of push state first. Delivered by Hagrid via the Shipper and Shipper Twin droids.
  • Disk reclamation and guarding: stale /tmp files and regenerable caches are pruned by age, then the disk-usage threshold is guarded. Delivered by Hagrid via the Daily Janitor droid (Cleanup & Housekeeping block).
  • App liveness: the service's registered apps are kept alive by detecting down apps and relaunching them via shared/app_steward.py. Delivered by Hagrid via the App Steward droid (Infrastructure & Runtime block).
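
The disk reclamation deliverable combines two checks: prune by age, then compare disk usage against a threshold. A minimal sketch of both follows. The function names, the age limit and the threshold value are hypothetical; the page does not give the Daily Janitor's real settings.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def prune_older_than(root: Path, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Remove files under `root` whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Materialise the listing first so deletion does not disturb the walk.
    for path in list(root.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


def disk_over_threshold(mount: Path, max_used_fraction: float) -> bool:
    """True when used space on `mount` exceeds the given fraction."""
    usage = shutil.disk_usage(mount)
    return usage.used / usage.total > max_used_fraction
```

For example, `prune_older_than(Path("/tmp"), max_age_days=7)` followed by `disk_over_threshold(Path("/"), 0.85)` mirrors the "reclaim, then guard" order; both numbers are placeholders.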

Future (on demand)

  • Restore drills — periodic test restores from a snapshot to prove backups are recoverable, not just present.
  • Retention reporting — a clear view of what history exists per protected service and how it is pruned.
  • Per-service protection cost surfaced against the cost center, so charges are visible alongside the data protected.

📘End-user run book

This service has no public agent-facing tools — it is schedule-driven, run by the owning agent Hagrid and his droids.

Operate

  • Backups & Shipping runs on its daily schedule (snapshot + GitHub push, with the dry-run twin reviewing push state).
  • Cleanup & Housekeeping runs the daily disk reclaim and guards the disk-usage threshold.
  • Infrastructure & Runtime keeps systemd units, the proxy layer and the secure tunnel up; the app-steward relaunches a down app via shared/app_steward.py.
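
The detect-and-relaunch loop can be sketched with `systemctl is-active` and `systemctl restart`. This is not the content of shared/app_steward.py, which this page does not show; the function names and unit names below are hypothetical, and in practice the unit list would come from the registry.

```python
import subprocess
from typing import Callable, Iterable, List

# Runs a systemctl subcommand and returns its exit code. Injected so the
# logic can be exercised on a machine without systemd.
RunCmd = Callable[[List[str]], int]


def systemctl(args: List[str]) -> int:
    """Run `systemctl <args>` and return the exit code."""
    return subprocess.run(["systemctl", *args]).returncode


def relaunch_down_units(units: Iterable[str],
                        run: RunCmd = systemctl) -> List[str]:
    """Restart every unit that `systemctl is-active` reports as not active."""
    relaunched = []
    for unit in units:
        # `is-active --quiet` exits 0 only when the unit is active.
        if run(["is-active", "--quiet", unit]) != 0:
            run(["restart", unit])
            relaunched.append(unit)
    return relaunched
```

Returning the list of relaunched units gives the caller something concrete to write into its run-audit.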

Check health

  • Confirm each recurring droid's latest run-audit is recent and green; a missing or failed run is the signal to investigate.
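
The "recent and green" check can be expressed as a small function over run-audit records. This sketch assumes each record is a dict with `droid`, `status` and `finished_at` (epoch seconds) fields, which is a hypothetical layout; the 26-hour default, one day plus some slack for a daily schedule, is likewise an assumption.

```python
from typing import Dict, Iterable


def unhealthy_droids(runs: Iterable[dict], expected: Iterable[str],
                     now: float, max_age_s: float = 26 * 3600) -> Dict[str, str]:
    """Map each expected droid with a problem to 'missing', 'failed' or 'stale'."""
    # Keep only the most recent record per droid.
    latest: Dict[str, dict] = {}
    for run in runs:
        prev = latest.get(run["droid"])
        if prev is None or run["finished_at"] > prev["finished_at"]:
            latest[run["droid"]] = run

    problems: Dict[str, str] = {}
    for droid in expected:
        run = latest.get(droid)
        if run is None:
            problems[droid] = "missing"   # never ran
        elif run["status"] != "ok":
            problems[droid] = "failed"    # latest run was not green
        elif now - run["finished_at"] > max_age_s:
            problems[droid] = "stale"     # green, but too long ago
    return problems
```

An empty result means every expected droid has a recent, successful run; anything else is the signal to investigate.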

Where the rules live

  • The Cookbook is operational law; retention, offsite and observability obligations bind every block here.