Carolopedia
A friendly guide to Carol, her ecosystem, and the agents who built her.
Process Monitoring: Architecture
🎯Key functional considerations
The watch has to cover every kind of process without raising false alarms, so its architecture is shaped by one core idea: each process class is judged on a different signal.
- The right signal per class. A clock-driven job is judged on cadence (did it run on time, did the last run finish, is it overdue); a triggered job is judged on its last invocation (never-ran is normal, so only a failed or hung invocation counts); an embedded worker has no run of its own, so it is judged by whether its parent agent is alive.
- Audit writes must not contend with real work. Recurring droids emit a per-minute burst of start/finish heartbeats; that write traffic is kept out of the pipeline database so it never competes with build work.
- Coverage across all agents. The classes span every process in the ecosystem, not just one team's: any process that exists must fall into exactly one class and be watched.
- A failure becomes work, not just an alert. When a process is judged failed, the architecture's job is to file a fix-initiative, so the breakage enters the same accountable build lifecycle rather than sitting in a log.
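The per-class signals above can be sketched as a single judging function. This is a minimal illustration, not the service's actual code: the names `ProcessClass`, `ProcessState` and `judge`, the verdict strings, and the default `interval` and `hang_after` thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ProcessClass(Enum):
    SCHEDULED = "scheduled/ongoing"
    TRIGGERED = "triggered/on-demand"
    EMBEDDED = "embedded"


@dataclass
class ProcessState:
    """Hypothetical snapshot of what the watch knows about one process."""
    process_class: ProcessClass
    last_start: Optional[datetime] = None
    last_finish: Optional[datetime] = None
    last_ok: bool = True                         # outcome of the last finished run
    parent_alive: bool = True                    # only meaningful for embedded workers
    interval: timedelta = timedelta(days=1)      # assumed cadence for scheduled jobs
    hang_after: timedelta = timedelta(hours=1)   # assumed hang threshold for triggered jobs


def judge(state: ProcessState, now: datetime) -> str:
    """Return 'ok' or a failure reason, using only the signal of the process's class."""
    if state.process_class is ProcessClass.EMBEDDED:
        # No run of its own: judged purely on parent liveness.
        return "ok" if state.parent_alive else "parent-dead"

    started, finished = state.last_start, state.last_finish
    running = started is not None and (finished is None or finished < started)

    if state.process_class is ProcessClass.TRIGGERED:
        if started is None:
            return "ok"  # never-ran is normal for a triggered job
        if running:
            return "hung" if now - started > state.hang_after else "ok"
        return "ok" if state.last_ok else "failed"

    # SCHEDULED: judged on cadence.
    late = started is None or now - started > state.interval
    if running and late:
        return "stalled"   # last run never finished and the next one is due
    if late:
        return "overdue"
    if running:
        return "ok"        # still inside its window
    return "ok" if state.last_ok else "failed"
```

A never-ran triggered job and an embedded worker with a live parent both come back as `"ok"`, which is the false-alarm case the classes exist to avoid.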
🧰Technologies used
- Python 3 services behind nginx, in the shared Carol stack.
- SQLite (WAL) datastores. Run-audit lives in a separate heartbeat database, deliberately kept out of the planner/pipeline database so the per-minute write burst from recurring droids never contends with build work.
- systemd and cron schedule the recurring sweep and the droids it watches.
- Claude is available for any reasoning step in judging or fix-filing; the registry and design store are the binding sources of truth for what processes exist and who owns them.
- The Build Initiatives service is the downstream sink: a detected failure is filed there as a fix-initiative (/dev/carolopedia/wiki/service/build-initiatives).
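To make the "separate heartbeat database" idea concrete, here is a small sketch of opening a dedicated SQLite file in WAL mode. The table layout, column names, and the function `open_heartbeat_db` are hypothetical; the page does not describe the real schema.

```python
import sqlite3

# Hypothetical schema: one row per start or finish event.
HEARTBEAT_SCHEMA = """
CREATE TABLE IF NOT EXISTS heartbeat (
    process TEXT NOT NULL,
    event   TEXT NOT NULL CHECK (event IN ('start', 'finish')),
    ok      INTEGER,          -- NULL on start, 1 or 0 on finish
    at      TEXT NOT NULL     -- ISO-8601 UTC timestamp
)
"""


def open_heartbeat_db(path: str) -> sqlite3.Connection:
    """Open (or create) the run-audit store as its own file, in WAL mode."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5 s on a locked file
    conn.execute("PRAGMA journal_mode=WAL")    # readers do not block the writer
    conn.execute(HEARTBEAT_SCHEMA)
    conn.commit()
    return conn
```

Because the heartbeat store is a different file from the pipeline database, a burst of heartbeat inserts takes its write lock on this file only.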
🏗Solution architecture
The service is a class-based watch wired to a fix-filing path. It is a direct instance of Carolverse's agent-centric modular architecture: each class of process is a block owned by Hermione (see the service's blocks), and the watch over each class is carried out by her droids.
- One classifier, distinct verdicts. Every process is placed into exactly one class — scheduled/ongoing, triggered/on-demand, or embedded — and judged on that class's signal (cadence vs last-invocation vs parent-liveness). The verdict logic differs by class; the filing path does not.
- A separate heartbeat plane. Scheduled droids record start/finish to a dedicated run-audit database. The daily liveness sweep reads that plane to decide what ran, what stalled, and what is overdue; because heartbeats are written only there, none of that write traffic touches the pipeline database.
- Failure → fix-initiative. A failed verdict is turned into a fix-initiative filed on the build service, so recovery is accountable and observable rather than a silent log line.
- A check on the watcher. Inspector verifies the sweep itself ran and covered its set, so the monitor cannot fail silently.
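The point that verdict logic differs by class while the filing path does not can be illustrated with one filing function that every class shares. Everything here is an assumption for the sake of the sketch: the `Verdict` type, the initiative fields, and the `sink` callable standing in for the Build Initiatives service, whose real interface is not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Verdict:
    """Hypothetical result of judging one process."""
    process: str
    process_class: str
    reason: str  # "ok", or a failure reason such as "overdue" or "hung"


def file_if_failed(
    verdict: Verdict,
    owners: Dict[str, str],
    sink: Callable[[dict], None],
) -> Optional[dict]:
    """Turn a failed verdict into a fix-initiative; the path is the same for every class."""
    if verdict.reason == "ok":
        return None
    initiative = {
        "kind": "fix-initiative",
        "title": f"Fix {verdict.process}: {verdict.reason}",
        "process": verdict.process,
        "process_class": verdict.process_class,
        "owner": owners.get(verdict.process),  # None marks an unowned process
    }
    sink(initiative)  # stand-in for filing on the build service
    return initiative
```

Passing the sink in as a callable keeps the sketch independent of how the build service is actually reached.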
📐Design principles followed
- Right signal per class. Cadence, last-invocation, and parent-liveness are not interchangeable; each class is judged on the one that is meaningful for it.
- Don't contend with the thing you watch. Run-audit lives in its own heartbeat store so observing the system never slows it.
- Self-heal over alert-and-wait. A failure is filed as a fix-initiative, not left as a notification — the shared principle described on the Carolverse Architecture page.
- Agent-centric modular architecture. Every watched class has an accountable owner and a doing droid.
- Observability first. A process that exists but is in no class is a coverage gap; the watcher is itself watched.
- Single source of truth. What exists and who owns it comes from the live registry and design store, never a hand-kept list.
✅Success criteria
- Every process that exists falls into exactly one class and is judged on that class's correct signal.
- No false alarms — a never-ran triggered job and a healthy embedded worker are not reported as failures.
- A genuinely failed or overdue process is detected on the next sweep and surfaces as a filed fix-initiative, not a buried log entry.
- Run-audit writes never slow or lock the pipeline database.
- The sweep's own liveness is verified, so the watch cannot go dark unnoticed.
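As an illustration of the last criterion, a check on the sweep might compare what the sweep reports against what it was expected to cover. This is a sketch under assumptions: `SweepReport`, `inspect_sweep`, and the 25-hour freshness limit are invented for the example and are not Inspector's actual interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class SweepReport:
    """Hypothetical record the sweep leaves behind after each run."""
    finished_at: datetime
    judged: FrozenSet[str]  # names of the processes the sweep actually judged


def inspect_sweep(
    report: Optional[SweepReport],
    expected: Iterable[str],
    now: datetime,
    max_age: timedelta = timedelta(hours=25),  # assumed: daily sweep plus one hour of slack
) -> List[str]:
    """Return a list of problems with the sweep itself; empty means it ran and covered its set."""
    if report is None:
        return ["sweep never reported"]
    problems = []
    if now - report.finished_at > max_age:
        problems.append("sweep is stale")
    missing = sorted(set(expected) - report.judged)
    if missing:
        problems.append("not covered: " + ", ".join(missing))
    return problems
```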
🛡Service-specific policies
- A failure becomes an initiative. A failed verdict is filed on the Build Initiatives service through the sanctioned path, never patched in place silently.
- Run-audit stays out of the pipeline database. Heartbeats are written only to the dedicated heartbeat store.
- Every watched process is owned. Coverage is derived from the registry; an unowned or unregistered process is a violation to be filed, not ignored.
- Bypass skips the planner, not the standards — any fix the watch files still carries the full template, review and observability of an autonomous run.
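The ownership and coverage policy can be sketched as a reconciliation between the registry and what is observed. The data shapes are assumptions: `registry` maps a process name to its owner (or `None`), `class_members` maps each class to the names it watches, and `observed` is the set of process names actually seen; none of these reflect the real registry format.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple


def coverage_violations(
    registry: Dict[str, Optional[str]],
    class_members: Dict[str, Set[str]],
    observed: Set[str],
) -> List[Tuple[str, str]]:
    """List (process, violation) pairs; each one would be filed rather than ignored."""
    # How many classes claim each process; the policy requires exactly one.
    counts = Counter(name for members in class_members.values() for name in members)
    violations = []
    for name in sorted(set(registry) | set(observed)):
        if name not in registry:
            violations.append((name, "unregistered"))
            continue
        if not registry[name]:
            violations.append((name, "unowned"))
        if counts[name] == 0:
            violations.append((name, "no class"))
        elif counts[name] > 1:
            violations.append((name, "multiple classes"))
    return violations
```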
📦End-user deliverables
Current
- A class-based watch that judges scheduled & ongoing processes on cadence — owned by Hermione.
- A watch over triggered & on-demand processes on their last invocation (failed or hung), so never-ran is correctly treated as normal.
- A watch over embedded workers by their parent agent's liveness, since they have no run of their own.
- A dedicated run-audit heartbeat database that recurring droids write start/finish to, read by the daily liveness sweep — kept out of the pipeline database by design.
- Verification of the sweep itself by Inspector.
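For the heartbeat database that the daily sweep reads, one plausible read path is a single aggregate query that yields each process's latest start and finish. The sketch assumes a hypothetical `heartbeat` table with `process`, `event` and `at` columns, where `at` is an ISO-8601 string in one consistent format so that text comparison matches time order.

```python
import sqlite3
from typing import Dict, Optional, Tuple


def last_runs(conn: sqlite3.Connection) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each process to (latest start, latest finish); either may be None."""
    rows = conn.execute(
        """
        SELECT process,
               MAX(CASE WHEN event = 'start'  THEN at END) AS last_start,
               MAX(CASE WHEN event = 'finish' THEN at END) AS last_finish
        FROM heartbeat
        GROUP BY process
        """
    ).fetchall()
    return {process: (start, finish) for process, start, finish in rows}
```

A process whose latest start is newer than its latest finish has a run still open, which is the raw material for a "stalled" judgement; a process with no rows at all does not appear in the result.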
Future (on demand)
- Named droids registered against each class and the daily sweep, so the watch's own workers appear on the roster.
- Agent-facing tools to query a process's class and last verdict, and to re-run the sweep on demand.
- Tighter coverage reconciliation — automatically flagging any registered process that no class watches.
📘End-user run book
This service has no agent-facing tools registered yet; today it is operated through its owner's sweep and the heartbeat store.
Operate the watch
- The daily liveness sweep reads the run-audit heartbeat database, judges each process by its class signal, and files a fix-initiative for any failure.
- A recurring process is made visible to the watch by emitting start/finish heartbeats to the run-audit store on every run; a process that emits none is invisible to the cadence check.
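One way a recurring process could emit its start/finish heartbeats is to wrap each run in a context manager. This is a sketch only: the `heartbeat` helper, the table it writes to, and the `ok` flag are assumptions, and the connection is expected to point at the run-audit store rather than the pipeline database.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def _emit(conn: sqlite3.Connection, process: str, event: str, ok) -> None:
    """Write one heartbeat row to the (hypothetical) heartbeat table and commit it."""
    at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO heartbeat (process, event, ok, at) VALUES (?, ?, ?, ?)",
        (process, event, ok, at),
    )
    conn.commit()


@contextmanager
def heartbeat(conn: sqlite3.Connection, process: str):
    """Record a start row on entry and a finish row on exit, even if the run raises."""
    _emit(conn, process, "start", None)
    try:
        yield
    except BaseException:
        _emit(conn, process, "finish", 0)  # finished, but not cleanly
        raise
    _emit(conn, process, "finish", 1)
```

A droid would then run its work as `with heartbeat(conn, "nightly-report"): ...`, where `"nightly-report"` is a made-up process name. A run that is killed outright writes a start with no finish, so an open start row is still visible to the sweep.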
Where failures go
- A failed verdict is filed as a fix-initiative on the Build Initiatives service and tracked there to close.