Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.


Infrastructure & Backups Architecture

The defined architecture of the Infrastructure & Backups service, in eight standard sections.

🎯Key functional considerations

This service is the safety net under every other service, so its architecture is shaped by what it must guarantee end-to-end:

Nothing is lost. Every service's critical state is snapshotted on a schedule and shipped off the machine, so a single-host failure never loses data.
The right history is kept. Retention keeps enough versions to recover from and prunes the rest, so backups protect without growing unbounded.
The machine stays healthy. Disk, temp files, stale working files, logs and databases are tidied continuously so nothing silently fills up or rots.
Services stay reachable. The systemd units, the web/proxy layer and the secure tunnel that expose each app are kept up; a down app is detected and relaunched rather than left dark.
Cost is attributable. Protection is charged according to the data protected, so the service is accountable to a cost center.

🧰Technologies used

Python 3 tooling for the backup, cleanup and app-steward droids, run on the Carol host behind nginx.
SQLite in WAL mode holds the state being protected: the snapshot captures the registry, design store, plan-generator, initiatives and constitution databases, plus laptop-critical assets.
Git / GitHub is the offsite store for code: unpushed commits are pushed so that the remote holds a durable copy.
systemd and cron schedule the recurring droids (daily snapshot, daily janitor); systemd units plus a secure tunnel are the runtime layer that keeps each app reachable.
The registry is the source of truth for which apps exist and must stay alive, and for cost-center attribution.
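Because the protected databases run in WAL mode, a plain file copy can miss pages that still sit in the write-ahead log. The sketch below shows one way a snapshot step could take a consistent copy using the standard library's online backup API. The function name and the paths are hypothetical, and this is not a description of how the Backup Custodian droid is actually written.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(source: Path, dest: Path) -> Path:
    """Copy a live SQLite database to `dest` using the online backup API.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(source)
    dst = sqlite3.connect(dest)
    try:
        # Connection.backup() copies a consistent view of the database,
        # including changes that have not yet been checkpointed out of the WAL.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return dest
```

The resulting file is a self-contained database, so it can be shipped off the host without its companion -wal and -shm files.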

🏗Solution architecture

The service is a set of blocks — the distinct steps shown in the Blocks section of the service page — each owned by an agent and carried out by that agent's droids. It is a direct instance of Carolverse's agent-centric modular architecture.

Snapshot, then ship. One path takes the snapshot of every service's data; a second pushes code commits offsite, with a dry-run twin that reviews push state before the real push.
Continuous housekeeping. A separate, always-running cleanup path reclaims disk space by file age and guards a disk-usage threshold, so backup and runtime are never starved of space.
Self-healing runtime. The app-steward path watches the service's registered apps and relaunches any that go down, rather than waiting for a human to notice.
Schedule-driven, not request-driven. The work is recurring and time-triggered; there is no public request surface to protect data on demand.
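The self-healing path described above can be pictured as a single pass over the registered apps: probe each one, relaunch the ones that are down, and report the outcome. The sketch below is a minimal illustration that assumes each app maps to a systemd unit. The function names and the report format are hypothetical, and nothing here reflects the real contents of shared/app_steward.py.

```python
import subprocess
from typing import Callable, Dict, Iterable


def systemd_is_active(unit: str) -> bool:
    """True when `systemctl is-active --quiet` exits 0 for the unit."""
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0


def systemd_restart(unit: str) -> bool:
    """Ask systemd to restart the unit; True on a zero exit status."""
    return subprocess.run(["systemctl", "restart", unit]).returncode == 0


def heal_once(
    units: Iterable[str],
    is_active: Callable[[str], bool] = systemd_is_active,
    relaunch: Callable[[str], bool] = systemd_restart,
) -> Dict[str, str]:
    """Probe every unit once and relaunch the ones that are down.

    Returns a per-unit status: "up", "relaunched", or "failed".
    """
    report: Dict[str, str] = {}
    for unit in units:
        if is_active(unit):
            report[unit] = "up"
        elif relaunch(unit):
            report[unit] = "relaunched"
        else:
            # The relaunch itself failed; this is the case worth surfacing.
            report[unit] = "failed"
    return report
```

Passing the probe and relaunch functions as parameters keeps the decision logic testable without touching systemd, and a scheduler can simply call `heal_once` on each tick.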

📐Design principles followed

Single source of truth. The set of apps to keep alive and the data to protect come from the live registry — the shared principle described on the Carolverse Architecture page.
Offsite by default. A backup that only lives on the same machine is not a backup; snapshots and commits are shipped off-host.
Self-heal over block-and-wait. A down app is relaunched and stale disk is reclaimed automatically rather than escalated.
Agent-centric modular architecture. Every block has an accountable agent and a doing droid.
Keep the right history, prune the rest. Retention is deliberate, not unbounded growth.
Observability first. Scheduled droids emit a run-audit record for every run, so a failed backup or cleanup is visible, not silent.
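The "keep the right history, prune the rest" principle can be illustrated with a simple count-based policy. The sketch below assumes snapshot files carry an ISO date in their names, so that sorting by name is also sorting by age. The function name, the naming scheme and the keep-newest-N rule are all assumptions, since the page does not state the actual retention window.

```python
from pathlib import Path
from typing import List


def prune_snapshots(snapshot_dir: Path, keep: int) -> List[Path]:
    """Delete all but the `keep` newest snapshot files and return what was removed.

    Assumes file names sort chronologically (for example registry-2024-01-05.db).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snapshots = sorted(
        (p for p in snapshot_dir.iterdir() if p.is_file()),
        key=lambda p: p.name,
        reverse=True,  # newest first
    )
    removed: List[Path] = []
    for old in snapshots[keep:]:
        old.unlink()
        removed.append(old)
    return removed
```

Refusing `keep < 1` guards against a misconfiguration that would otherwise delete every snapshot.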

✅Success criteria

Every protected service's critical state has a recent off-machine snapshot that can be restored.
Unpushed commits do not accumulate — code is durably mirrored to GitHub.
Disk never silently fills — stale temp and caches are reclaimed and the disk-usage threshold holds.
Registered apps stay up — a down app is detected and relaunched without operator action.
Each recurring droid's runs are auditable, so a missed or failed run surfaces on a monitor.

🛡Service-specific policies

Backups are scheduled, never ad-hoc — the snapshot and shipping run on their schedule, owned by the accountable agent.
Ship offsite — snapshots and commits must leave the host to count as protected.
Retention is enforced — keep the right window of history and prune the rest.
Every action is tagged to a droid under the owning agent; recurring work emits a run-audit record so it is observable.
Protection is charged against this service's cost center according to the data protected.

📦End-user deliverables

Current

Daily off-machine snapshots of Carol + BB database state (registry, designs, plan-generator, initiatives, constitution) plus laptop-critical assets. Delivered by Hagrid via the Backup Custodian droid (Backups & Shipping block).
Offsite code shipping: unpushed commits are pushed to GitHub after a dry-run review of push state. Delivered by Hagrid via the Shipper and Shipper Twin droids.
Disk reclamation and guarding: stale /tmp files and regenerable caches are pruned by age, then the disk-usage threshold is guarded. Delivered by Hagrid via the Daily Janitor droid (Cleanup & Housekeeping block).
App liveness: down apps among the service's registered apps are detected and relaunched via shared/app_steward.py. Delivered by Hagrid via the App Steward droid (Infrastructure & Runtime block).
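The disk reclamation deliverable combines two checks: delete files that have aged out, then confirm that disk usage is under a threshold. The sketch below illustrates both with the standard library. The function names, the age limit and the threshold value are hypothetical, and this is not the Daily Janitor droid's actual code.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def reclaim_stale(root: Path, max_age_days: float, now: Optional[float] = None) -> List[Path]:
    """Delete regular files under `root` whose mtime is older than `max_age_days`."""
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed: List[Path] = []
    # Materialise the listing first so deletions do not disturb the walk.
    for path in list(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue  # leave directories and symlinks alone
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


def disk_over_threshold(path: Path, max_used_fraction: float) -> bool:
    """True when the filesystem holding `path` is fuller than the given fraction."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > max_used_fraction
```

A janitor run could call `reclaim_stale(Path("/tmp"), max_age_days=7)` and then raise an alert if `disk_over_threshold(Path("/"), 0.85)` is still true. Both numbers are placeholders.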

Future (on demand)

Restore drills — periodic test restores from a snapshot to prove backups are recoverable, not just present.
Retention reporting — a clear view of what history exists per protected service and how it is pruned.
Per-service protection cost surfaced against the cost center, so charges are visible alongside the data protected.

📘End-user run book

This service has no public agent-facing tools — it is schedule-driven, run by the owning agent Hagrid and his droids.

Operate

Backups & Shipping runs on its daily schedule (snapshot + GitHub push, with the dry-run twin reviewing push state).
Cleanup & Housekeeping runs the daily disk reclaim and guards the disk-usage threshold.
Infrastructure & Runtime keeps systemd units, the proxy layer and the secure tunnel up; the app-steward relaunches a down app via shared/app_steward.py.
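The dry-run twin in the Backups & Shipping step can be thought of as the same routine as the real push with the final action switched off. The sketch below assumes the repository has an upstream branch configured and lists the commits that the upstream lacks before deciding whether to push. The function names are hypothetical and do not describe the Shipper or Shipper Twin droids' real code.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def _git(repo: Path, args: List[str]) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def unpushed_commits(repo: Path, run: Callable[[Path, List[str]], str] = _git) -> List[str]:
    """One-line summaries of commits on HEAD that the upstream branch lacks."""
    out = run(repo, ["log", "--oneline", "@{u}..HEAD"])
    return [line for line in out.splitlines() if line.strip()]


def ship(
    repo: Path,
    dry_run: bool = True,
    run: Callable[[Path, List[str]], str] = _git,
) -> List[str]:
    """Report pending commits, and push them unless this is a dry run."""
    pending = unpushed_commits(repo, run)
    if pending and not dry_run:
        run(repo, ["push"])
    return pending
```

Defaulting `dry_run` to `True` means the review is the safe default and the real push has to be requested explicitly.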

Check health

Confirm each recurring droid's latest run-audit is recent and green; a missing or failed run is the signal to investigate.
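The health check above reduces to three questions per droid: is there a run at all, did the latest one succeed, and is it recent enough. The sketch below shows that logic over an in-memory list of records. The `RunAudit` shape, the droid identifiers and the 26-hour freshness window are assumptions for illustration, since the page does not define the run-audit format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class RunAudit:
    """Hypothetical run-audit record: which droid ran, when it finished, and whether it passed."""
    droid: str
    finished_at: datetime
    ok: bool


def unhealthy_droids(
    expected: Iterable[str],
    audits: Iterable[RunAudit],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: daily schedule plus slack
) -> Dict[str, str]:
    """Map each droid needing attention to "missing", "failed", or "stale"."""
    latest: Dict[str, RunAudit] = {}
    for audit in audits:
        seen = latest.get(audit.droid)
        if seen is None or audit.finished_at > seen.finished_at:
            latest[audit.droid] = audit

    problems: Dict[str, str] = {}
    for droid in expected:
        last = latest.get(droid)
        if last is None:
            problems[droid] = "missing"
        elif not last.ok:
            problems[droid] = "failed"
        elif now - last.finished_at > max_age:
            problems[droid] = "stale"
    return problems
```

An empty result means every expected droid has a recent, successful run. Anything else names the droid to investigate and why.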

Where the rules live

The Cookbook is operational law; retention, offsite and observability obligations bind every block here.