Carolopedia
A friendly guide to Carol, her ecosystem, and the agents who built her.
📖About
Per Ninad 2026-06-21. Today the Daily Process Sweep only reconciles the 30 jobs in the scheduler registry, out of 194 registered processes (droids). About 55 scheduled droids are not in the scheduler registry; 71 triggered + 32 on-demand + 6 embedded have NO silent-death monitor at all (this is how the Architect failure watcher went dark for a month).
GOAL: Hermione monitors EVERY registered process for liveness.
Scope:
- (1) The sweep enumerates ALL droids from the process registry, not just the scheduler jobs.
- (2) A type-appropriate liveness rule per process_type. Scheduled and ongoing: cadence + never-ran. Triggered and on-demand: last-seen staleness vs expected trigger or heartbeat. Embedded: parent-process alive.
- (3) Auto-register or otherwise cover the ~55 scheduled droids currently outside the scheduler registry.
- (4) A watcher-of-the-watcher, so a silent death of the monitoring sweep or the scheduler itself is detected (self-monitoring).
- (5) Explicit coverage of the Inspector (agt_037) and its processes.
- (6) Failures file a single deduped incident initiative, as today.
Subsumes the narrower never-ran-scheduled flag CAROL-INI-0966. Confirm planner-vs-bypass at build time.
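As a quick sanity check, the counts quoted in the About text can be tallied. This is a minimal sketch that assumes the five buckets partition the 194 registered processes; the dictionary keys are illustrative labels, not identifiers from the registry.

```python
# Counts quoted in the About text; the keys are hypothetical labels.
counts = {
    "scheduler_jobs": 30,               # the only droids the sweep reconciles today
    "scheduled_outside_scheduler": 55,  # quoted as "about 55"
    "triggered": 71,
    "on_demand": 32,
    "embedded": 6,
}

total = sum(counts.values())
reconciled = counts["scheduler_jobs"]
coverage_pct = round(100 * reconciled / total, 1)

print(f"{reconciled}/{total} processes reconciled ({coverage_pct}%)")
```

Under that assumption the buckets add up to the stated 194, and the sweep currently reconciles roughly 15% of them.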
⚖️Decisions
- Enumerate ALL droids from registry.db (the SST of processes), not just the 30 jobs in scheduler.db. The registry is the source of truth for what processes exist; the sweep iterates it and dispatches a type-appropriate liveness check per droid. — Root of the gap: the sweep reconciled only the scheduler registry, leaving 55 scheduled + 71 triggered + 32 on-demand + 6 embedded droids with no silent-death monitor (how the Albus failure watcher went dark for a month). (orion)
- Type-appropriate liveness: scheduled/ongoing = reconcile cadence vs droid_runs (failed/stuck/overdue); triggered/on_demand = flag only failed or stuck last run (never-ran is NOT a failure, they run on demand); embedded = parent-alive via the owner agent recent run-audit. — Each process type fails differently. Applying a cadence expectation to on-demand processes would false-positive en masse. (orion)
- Scheduled droids outside scheduler.db derive their expected max-age from the freeform schedule text (Daily/Hourly/Every N min/Every Nh/Continuous); reconciled against droid_runs only if they have run-audit history. — The 55 non-registry scheduled droids have no precise cadence row; the schedule text is the only cadence signal available without inventing one. (orion)
- Scheduled droids that emit NO run-audit at all are surfaced as ONE consolidated coverage-gap incident (deduped by a single marker monitor-coverage:uninstrumented), never one incident per droid. — Ninad decision 2026-06-21. ~52 such droids; one-per-droid would be a 52-incident storm contradicting the anti-bloat work in 1868. Real ran-then-died silent deaths still get their own per-process incident. (ninad)
- Self-monitor (watcher-of-the-watcher): extend the Inspector (in-01) to also check the Daily Process Sweep (op-s1) freshness, not just the Hermione Scheduler. The sweep in turn enumerates in-01, so the Inspector is itself covered — mutual watching across two independent cron domains. — Closes criteria 4 and 5. The sweep cannot reliably detect its own death; an independent cron (Inspector) must, and someone must watch the Inspector. (orion)
- Per-process incidents keep the existing uat-target dedup (one open initiative per failing process). requested_mode=bypass; net-new capability (remediates nothing). CAROL-INI-0966 (process monitor must flag never-ran) is subsumed and will be redirected to 1875 via Elrond. — Reuse the proven CAROL-INI-1757 dedup; do not fork it. Lifecycle op on 0966 routes through Elrond, not bypass (cookbook 68). (orion)
- [status-router] planned -> active | event=bypass_active | bypass transition (or-bx-01)
- [HYGIENE GATE apps_registered] skip: work_type=B does not require app registration (orion.bypass)
- [HYGIENE GATE design_filed] skip: no design_id provided (orion.bypass)
- [HYGIENE GATE architecture_compliance] pass: references architecture design #146 (agent-centric modular architecture) (orion.bypass)
- [HYGIENE GATE policy_check] fail: no policy_ids tagged — caller should explicitly acknowledge applicable policies or waive (orion.bypass)
- [HYGIENE GATE constitution_aligned] pass: soft-default pass; caller responsible for asserting via decision row (orion.bypass)
- [HYGIENE GATE cookbook_entry] pass: cookbook 327: All-process liveness coverage — the sweep monitors the registry, not the scheduler (CAROL-INI-1875) (orion.bypass)
- [HYGIENE GATE logbook_entry] fail: logbook session CLI-005: NOT FOUND (orion.bypass)
- [status-router] active -> reviewing | event=bypass_reviewing | bypass transition (or-bx-01)
- [status-router] reviewing -> closed | event=operator_signoff | Auto-accepted (CAROL-INI-1859): Orion-initiated, >2 days in reviewing with no objection. (el-srac-01)
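The type-appropriate liveness rules and the schedule-text parsing described in the decisions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the sweep's actual code: the function names, the `LastRun` shape, the status strings, the verdict labels, the accepted schedule spellings, and the `CONTINUOUS_MAX_AGE` value are all hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the source gives no max age for "Continuous"; 15 minutes is a placeholder.
CONTINUOUS_MAX_AGE = timedelta(minutes=15)


@dataclass
class LastRun:
    """Hypothetical shape of a droid's most recent run-audit row."""
    at: datetime
    status: str  # assumed values: "ok", "failed", "stuck"


def expected_max_age(schedule_text: str) -> Optional[timedelta]:
    """Derive an expected max age from freeform schedule text, or None if unrecognized."""
    text = schedule_text.strip().lower()
    fixed = {
        "daily": timedelta(days=1),
        "hourly": timedelta(hours=1),
        "continuous": CONTINUOUS_MAX_AGE,
    }
    if text in fixed:
        return fixed[text]
    minutes = re.fullmatch(r"every (\d+) ?min(?:ute)?s?", text)
    if minutes:
        return timedelta(minutes=int(minutes.group(1)))
    hours = re.fullmatch(r"every (\d+) ?h(?:(?:ou)?rs?)?", text)
    if hours:
        return timedelta(hours=int(hours.group(1)))
    return None


def check_liveness(
    process_type: str,
    last_run: Optional[LastRun],
    max_age: Optional[timedelta],
    now: datetime,
    parent_alive: bool = False,
) -> str:
    """Return a verdict label for one droid, using the rule for its process type."""
    if process_type == "embedded":
        # Embedded droids are judged only by whether their parent is alive.
        return "ok" if parent_alive else "parent_dead"
    cadence_driven = process_type in ("scheduled", "ongoing")
    if last_run is None:
        # Never-ran is not a failure for triggered/on-demand droids.
        return "uninstrumented" if cadence_driven else "ok"
    if last_run.status in ("failed", "stuck"):
        return last_run.status
    if cadence_driven and max_age is not None and now - last_run.at > max_age:
        return "overdue"
    return "ok"


now = datetime(2026, 6, 21, 12, 0)
stale = LastRun(at=now - timedelta(days=2), status="ok")
print(check_liveness("scheduled", stale, expected_max_age("Daily"), now))  # overdue
print(check_liveness("on_demand", None, None, now))  # ok
```

Keeping the cadence check behind `cadence_driven` is what prevents on-demand droids from being flagged en masse; unrecognized schedule text yields `None`, so no cadence is invented for it.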
✅Success criteria
- Hermione process sweep enumerates ALL registered processes (scheduled, ongoing, triggered, on-demand, embedded), not just the scheduler jobs (must_have)
- Each process type has an appropriate liveness check (cadence/never-ran for scheduled; last-seen/heartbeat for triggered and on-demand; parent-alive for embedded) (must_have)
- The ~55 scheduled droids currently outside the scheduler registry are covered (must_have)
- Hermione self-monitors (watcher-of-the-watcher): a silent death of her own sweep or scheduler is detected (must_have)
- The Inspector (agt_037) and its processes are covered (must_have)
- Detected failures file a single deduped incident initiative per process (must_have)
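The dedup behaviour behind the last criterion can be sketched as below: one open incident per failing process, plus a single consolidated incident for all scheduled droids that emit no run-audit. Only the `monitor-coverage:uninstrumented` marker comes from the decisions; the `process:<id>` marker format, the verdict labels, the droid ids, and the function name are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

# Marker named in the decisions for the consolidated coverage-gap incident.
COVERAGE_GAP_MARKER = "monitor-coverage:uninstrumented"


def plan_incidents(
    findings: Iterable[Tuple[str, str]],
    open_markers: Set[str],
) -> List[str]:
    """Return the dedup markers that need a new incident filed.

    findings: (process_id, verdict) pairs; "ok" means healthy and
    "uninstrumented" means a scheduled droid with no run-audit (assumed labels).
    open_markers: markers that already have an open incident.
    """
    seen = set(open_markers)
    to_file: List[str] = []
    for process_id, verdict in findings:
        if verdict == "ok":
            continue
        if verdict == "uninstrumented":
            marker = COVERAGE_GAP_MARKER      # all such droids share one incident
        else:
            marker = f"process:{process_id}"  # hypothetical per-process marker
        if marker not in seen:
            seen.add(marker)
            to_file.append(marker)
    return to_file


findings = [
    ("droid-a", "uninstrumented"),
    ("droid-b", "uninstrumented"),
    ("droid-c", "overdue"),
    ("droid-d", "ok"),
    ("droid-e", "failed"),
]
new_incidents = plan_incidents(findings, open_markers={"process:droid-e"})
print(new_incidents)
```

In this example the two uninstrumented droids collapse into one coverage-gap incident, `droid-c` gets its own, and `droid-e` is skipped because it already has an open incident.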