Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.

Process Monitoring Architecture

The defined architecture of the Process Monitoring service, in eight standard sections.

🎯Key functional considerations

The watch has to cover every kind of process without raising false alarms, so its architecture is shaped by one core idea: each process class is judged on a different signal.

  • The right signal per class. A clock-driven job is judged on cadence (did it run on time, did the last run finish, is it overdue); a triggered job is judged on its last invocation (never-ran is normal, so only a failed or hung invocation counts); an embedded worker has no run of its own, so it is judged by whether its parent agent is alive.
  • Audit writes must not contend with real work. Recurring droids emit a per-minute burst of start/finish heartbeats; that write traffic is kept out of the pipeline database so it never competes with build work.
  • Coverage across all agents. The classes span every process in the ecosystem, not one team's — a process that exists must fall into exactly one class and be watched.
  • A failure becomes work, not just an alert. When a process is judged failed, the architecture's job is to file a fix-initiative, so the breakage enters the same accountable build lifecycle rather than sitting in a log.
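The per-class signals above can be sketched as a single judging function. This is a minimal illustration, not the service's actual code: the `Observation` fields, the verdict strings, the two-interval grace period, the ten-minute hang threshold, and the choice to treat a never-run scheduled job as overdue are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ProcessClass(Enum):
    SCHEDULED = "scheduled"   # clock-driven, judged on cadence
    TRIGGERED = "triggered"   # on-demand, judged on its last invocation
    EMBEDDED = "embedded"     # no run of its own, judged on parent liveness


@dataclass
class Observation:
    """Hypothetical snapshot of what the watch knows about one process."""
    last_start: Optional[datetime] = None
    last_finish: Optional[datetime] = None
    last_exit_ok: Optional[bool] = None
    parent_alive: Optional[bool] = None


def judge(cls, obs, now, interval=timedelta(minutes=1),
          hang_after=timedelta(minutes=10)):
    """Return 'ok' or a failure reason, using only the signal for this class."""
    if cls is ProcessClass.EMBEDDED:
        return "ok" if obs.parent_alive else "parent-dead"

    if obs.last_start is None:
        # Never-ran is normal for a triggered job; assumed overdue for a scheduled one.
        return "ok" if cls is ProcessClass.TRIGGERED else "overdue"

    finished = obs.last_finish is not None and obs.last_finish >= obs.last_start
    if not finished:
        return "hung" if now - obs.last_start > hang_after else "ok"
    if obs.last_exit_ok is False:
        return "failed"
    # Only scheduled jobs have a cadence to miss (grace of two intervals is assumed).
    if cls is ProcessClass.SCHEDULED and now - obs.last_start > 2 * interval:
        return "overdue"
    return "ok"
```

Keeping the class check at the top means an embedded worker never reaches the cadence logic, and a triggered job can never be reported as overdue, which is how the false alarms described above are avoided in this sketch.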

🧰Technologies used

  • Python 3 services behind nginx, in the shared Carol stack.
  • SQLite (WAL) datastores. Run-audit lives in a separate heartbeat database, deliberately kept out of the planner/pipeline database so the per-minute write burst from recurring droids never contends with build work.
  • systemd and cron schedule the recurring sweep and the droids it watches.
  • Claude is available for any reasoning step in judging or fix-filing; the registry and design store are the binding sources of truth for what processes exist and who owns them.
  • The Build Initiatives service is the downstream sink — a detected failure is filed there as a fix-initiative (/dev/carolopedia/wiki/service/build-initiatives).
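The separation of the heartbeat store from the pipeline database can be illustrated with Python's standard `sqlite3` module. The file names and table layouts below are hypothetical; the point of the sketch is only that a write lock held on one SQLite file does not block writes to another.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a SQLite file in WAL mode and return the connection."""
    # isolation_level=None gives autocommit, so BEGIN/COMMIT below are explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


workdir = tempfile.mkdtemp()
heartbeat = open_wal(os.path.join(workdir, "heartbeat.db"))  # hypothetical name
pipeline = open_wal(os.path.join(workdir, "pipeline.db"))    # hypothetical name

heartbeat.execute("CREATE TABLE heartbeat (process TEXT, event TEXT, at TEXT)")
pipeline.execute("CREATE TABLE build (id INTEGER PRIMARY KEY, state TEXT)")

# Hold a write lock on the heartbeat store, as a burst of droid writes would.
heartbeat.execute("BEGIN IMMEDIATE")
heartbeat.execute(
    "INSERT INTO heartbeat VALUES ('droid-a', 'start', '2024-01-01T00:00:00Z')"
)

# The pipeline database is a different file, so this write is not blocked.
pipeline.execute("INSERT INTO build (state) VALUES ('queued')")

heartbeat.execute("COMMIT")
```

Had both tables lived in one file, the second insert would have had to wait for the open write transaction, since SQLite allows a single writer per database even in WAL mode.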

🏗Solution architecture

The service is a class-based watch wired to a fix-filing path. It is a direct instance of Carolverse's agent-centric modular architecture: each class of process is a block owned by Hermione (see the service's blocks above), and her droids carry out the watch over each block.

  • One classifier, distinct verdicts. Every process is placed into exactly one class — scheduled/ongoing, triggered/on-demand, or embedded — and judged on that class's signal (cadence vs last-invocation vs parent-liveness). The verdict logic differs by class; the filing path does not.
  • A separate heartbeat plane. Scheduled droids record start/finish to a dedicated run-audit database. The daily liveness sweep reads that database to decide what ran, what stalled, and what is overdue, so heartbeat traffic stays off the pipeline database.
  • Failure → fix-initiative. A failed verdict is turned into a fix-initiative filed on the build service, so recovery is accountable and observable rather than a silent log line.
  • A check on the watcher. Inspector verifies the sweep itself ran and covered its set, so the monitor cannot fail silently.
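The last two points, a single filing path and a check on the watcher, can be sketched together. Everything named here is hypothetical: `SweepReport`, `run_sweep`, `inspect_sweep`, the `file_initiative` callback standing in for the Build Initiatives service, and the 25-hour staleness window are assumptions for illustration, not the service's real interfaces.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class SweepReport:
    """Hypothetical record the sweep leaves behind for Inspector to check."""
    ran_at: datetime
    covered: set = field(default_factory=set)
    filed: list = field(default_factory=list)


def run_sweep(verdicts, file_initiative, now):
    """verdicts maps process name -> 'ok' or a failure reason."""
    report = SweepReport(ran_at=now)
    for name, verdict in sorted(verdicts.items()):
        report.covered.add(name)
        if verdict != "ok":
            # One filing path for every class; only the verdict logic differs.
            report.filed.append(file_initiative(name, verdict))
    return report


def inspect_sweep(report, expected, now, max_age=timedelta(days=1, hours=1)):
    """Return a list of problems with the sweep itself; empty means healthy."""
    if report is None:
        return ["sweep never ran"]
    problems = []
    if now - report.ran_at > max_age:
        problems.append("sweep is stale")
    missing = sorted(set(expected) - report.covered)
    if missing:
        problems.append("not covered: " + ", ".join(missing))
    return problems
```

Because `inspect_sweep` compares the covered set against an independently supplied `expected` set, a sweep that ran but silently skipped processes is reported just like one that never ran.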

📐Design principles followed

  • Right signal per class. Cadence, last-invocation, and parent-liveness are not interchangeable; each class is judged on the one that is meaningful for it.
  • Don't contend with the thing you watch. Run-audit lives in its own heartbeat store so observing the system never slows it.
  • Self-heal over alert-and-wait. A failure is filed as a fix-initiative, not left as a notification — the shared principle described on the Carolverse Architecture page.
  • Agent-centric modular architecture. Every watched class has an accountable owner and a doing droid.
  • Observability first. A process that exists but is in no class is a coverage gap; the watcher is itself watched.
  • Single source of truth. What exists and who owns it comes from the live registry and design store, never a hand-kept list.

Success criteria

  • Every running process falls into exactly one class and is judged on that class's correct signal.
  • No false alarms — a never-ran triggered job and a healthy embedded worker are not reported as failures.
  • A genuinely failed or overdue process is detected on the next sweep and surfaces as a filed fix-initiative, not a buried log entry.
  • Run-audit writes never slow or lock the pipeline database.
  • The sweep's own liveness is verified, so the watch cannot go dark unnoticed.

🛡Service-specific policies

  • A failure becomes an initiative. A failed verdict is filed on the Build Initiatives service through the sanctioned path, never patched in place silently.
  • Run-audit stays out of the pipeline database. Heartbeats are written only to the dedicated heartbeat store.
  • Every watched process is owned. Coverage is derived from the registry; an unowned or unregistered process is a violation to be filed, not ignored.
  • Bypass skips the planner, not the standards — any fix the watch files still carries the full template, review and observability of an autonomous run.
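The ownership policy above amounts to a reconciliation between the registry and what is actually observed. A minimal sketch follows, assuming a registry shaped as a dict of name to `{'owner', 'class'}` entries; that shape, the class labels, and the violation strings are illustrative assumptions, not the real registry format.

```python
VALID_CLASSES = {"scheduled", "triggered", "embedded"}


def coverage_violations(registry, observed):
    """Return (process, reason) pairs that should be filed, not ignored.

    registry: name -> {'owner': ..., 'class': ...} (hypothetical shape)
    observed: set of process names seen running
    """
    violations = []
    # A running process the registry does not know about.
    for name in sorted(observed - registry.keys()):
        violations.append((name, "unregistered"))
    for name, entry in sorted(registry.items()):
        if not entry.get("owner"):
            violations.append((name, "unowned"))
        # A registered process that no class watches is a coverage gap.
        if entry.get("class") not in VALID_CLASSES:
            violations.append((name, "no class"))
    return violations
```

For example, a registry holding an owned and classified `sweep`, an ownerless `orphan`, and an unclassified `stray`, checked against an observed set that also contains an unknown `ghost`, yields one violation each for `ghost`, `orphan`, and `stray`.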

📦End-user deliverables

Current

  • A class-based watch that judges scheduled & ongoing processes on cadence — owned by Hermione.
  • A watch over triggered & on-demand processes on their last invocation (failed or hung), so never-ran is correctly treated as normal.
  • A watch over embedded workers by their parent agent's liveness, since they have no run of their own.
  • A dedicated run-audit heartbeat database that recurring droids write start/finish to, read by the daily liveness sweep — kept out of the pipeline database by design.
  • Verification of the sweep itself by Inspector.

Future (on demand)

  • Named droids registered against each class and the daily sweep, so the watch's own workers appear on the roster.
  • Agent-facing tools to query a process's class and last verdict, and to re-run the sweep on demand.
  • Tighter coverage reconciliation — automatically flagging any registered process that no class watches.

📘End-user run book

This service has no agent-facing tools registered yet; today it is operated through its owner's sweep and the heartbeat store.

Operate the watch

  • The daily liveness sweep reads the run-audit heartbeat database, judges each process by its class signal, and files a fix-initiative for any failure.
  • A recurring process is made visible to the watch by emitting start/finish heartbeats to the run-audit store on every run; a process that emits none is invisible to the cadence check.
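A recurring process could emit its start/finish heartbeats with a small wrapper like the one below. This is a sketch under assumptions: the `heartbeat` table layout, the `ok` column, and the `heartbeat()` helper are hypothetical, and a real droid would connect to the dedicated run-audit store rather than an in-memory database.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone

# Hypothetical table layout for the run-audit store.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS heartbeat "
    "(process TEXT, event TEXT, ok INTEGER, at TEXT)"
)


def _emit(conn, process, event, ok=None):
    at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO heartbeat VALUES (?, ?, ?, ?)", (process, event, ok, at)
        )


@contextmanager
def heartbeat(conn, process):
    """Record 'start' on entry and 'finish' on exit, even if the body raises."""
    _emit(conn, process, "start")
    try:
        yield
    except Exception:
        _emit(conn, process, "finish", ok=0)
        raise
    else:
        _emit(conn, process, "finish", ok=1)
```

A droid would wrap each run in `with heartbeat(conn, "droid-a"): ...`. A run that is killed outright leaves a `start` row with no matching `finish`, which is the pattern a sweep would read as stalled.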

Where failures go

  • A failed verdict is filed as a fix-initiative on the Build Initiatives service and tracked there to close.

Owner / support