Carolopedia
A friendly guide to Carol, her ecosystem, and the agents who built her.
📖About
Three architectural fixes Albus needs so that 'Albus attends all failures' actually holds, plus one side-fix:
- (1) SINGLE FAILURE SURFACE: replace al_watch_01's narrow polling of failed executions with one canonical 'pipeline_failures' table (or a SQL view that unions all stuck-state queries). Albus polls that one source. The stuck states are: failed_execs without recovery; open handshakes_to_albus older than 5 min; open merlin_watch_decisions with action=request_task_replan unprocessed for >5 min; initiatives with status=blocked and no recovery in flight; plan_steps with status=failed and no Albus run.
- (2) CHECKLIST TEMPLATE: author a new type_code 'AL' template in checklist_templates with 6 phases: investigate -> diagnose -> verdict -> apply (per-verdict subflow) -> test -> re-initiate. Same shape as the A/D/E/G templates. It constrains the meta-workflow and leaves the content variable.
- (3) END-TO-END OWNERSHIP: refactor Albus's troubleshooter from 'verdict producer + handshake delegator' to 'recovery executor'. For DETAIL_REPLAN_MERLIN: rewrite the planner_prompt and flip the step status to pending himself (no delegation via handshake). For ENV_FIX_RADAGAST: invoke Radagast's admin droid in-process and verify. For PREREQ_INITIATIVE: file the prereq, drive its dispatch, wait, and return. For DIRECT_FIX: edit code and test (he already does this). For ESCALATE_ORION: notify_orion is the only true delegation. After applying any fix, Albus runs a smoke test, verifies the original failed step is unblocked, flips it to pending, and returns success.
- SIDE-FIX: bypass_end() should set exec status='completed' (not 'reviewing') so bypass executions show up in the Recent Executions monitor card alongside planner-mode work.
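To make the single failure surface concrete, here is a minimal sketch of a SQLite view that unions the five stuck-state queries listed above. The view name pipeline_failures_v and the table names handshakes_to_albus, merlin_watch_decisions, initiatives, and plan_steps come from this page. Everything else is an assumption for illustration: the executions table, every column name, the 'open' status value, and the timestamp format (UTC text as produced by SQLite's datetime()). The real schema may differ.

```python
import sqlite3

# Hypothetical schema: all column names below are assumptions, not Carol's real schema.
SCHEMA = """
CREATE TABLE executions (id INTEGER PRIMARY KEY, status TEXT, recovery_id INTEGER);
CREATE TABLE handshakes_to_albus (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT);
CREATE TABLE merlin_watch_decisions (
    id INTEGER PRIMARY KEY, action TEXT, processed INTEGER, created_at TEXT);
CREATE TABLE initiatives (id INTEGER PRIMARY KEY, status TEXT, recovery_id INTEGER);
CREATE TABLE plan_steps (id INTEGER PRIMARY KEY, status TEXT, albus_run_id INTEGER);
"""

# One SELECT per stuck state; adding a new state means adding one more branch.
VIEW = """
CREATE VIEW pipeline_failures_v AS
SELECT 'failed_exec' AS kind, id AS ref_id FROM executions
 WHERE status = 'failed' AND recovery_id IS NULL
UNION ALL
SELECT 'stale_handshake', id FROM handshakes_to_albus
 WHERE status = 'open' AND created_at < datetime('now', '-5 minutes')
UNION ALL
SELECT 'stale_replan_request', id FROM merlin_watch_decisions
 WHERE action = 'request_task_replan' AND processed = 0
   AND created_at < datetime('now', '-5 minutes')
UNION ALL
SELECT 'blocked_initiative', id FROM initiatives
 WHERE status = 'blocked' AND recovery_id IS NULL
UNION ALL
SELECT 'failed_step', id FROM plan_steps
 WHERE status = 'failed' AND albus_run_id IS NULL;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema and a few demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA + VIEW)
    conn.executescript("""
    INSERT INTO executions VALUES (1, 'failed', NULL), (2, 'failed', 7), (3, 'completed', NULL);
    INSERT INTO handshakes_to_albus VALUES
        (10, 'open', datetime('now', '-10 minutes')),
        (11, 'open', datetime('now'));
    INSERT INTO plan_steps VALUES (42, 'failed', NULL);
    """)
    return conn


def poll_failures(conn: sqlite3.Connection) -> list:
    """The single query Albus would poll instead of several narrow ones."""
    return conn.execute(
        "SELECT kind, ref_id FROM pipeline_failures_v ORDER BY kind, ref_id"
    ).fetchall()
```

In this demo, execution 2 is excluded because it already has a recovery attached, and handshake 11 is excluded because it is younger than five minutes. Because the view is computed on read, a row drops out as soon as its underlying state changes, with no cleanup job.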
⚖️Decisions
- Single failure surface = a SQLite VIEW (pipeline_failures_v) that UNIONs the stuck-state queries; not a new table. — Views are zero-maintenance — they always reflect live underlying state. New tables require writers + cleaners. The view union pattern is also easier to extend (add a new SELECT clause, no migration). (Orion)
- Albus checklist template type_code=AL with phases decide(investigate+diagnose+verdict)/execute(apply per-verdict subflow)/review(test+verify+re-initiate). — Same 3-phase shape as every other Carol template. decide-phase steps are Claude-bounded reasoning; execute-phase steps are mechanical actions; review-phase steps verify the fix held. This is the meta-workflow Ninad named. (Orion)
- Albus owns each verdict end-to-end. DETAIL_REPLAN_MERLIN -> Albus rewrites planner_prompt + flips status to pending himself; ENV_FIX_RADAGAST -> Albus invokes Radagast in-process + verifies; PREREQ_INITIATIVE -> Albus files + dispatches + waits; DIRECT_FIX -> already self-owned; ESCALATE_ORION -> only true delegation. — Hand-off via handshake creates wiring gaps (today's bug B). Ownership-by-default preserves the recovery contract. Hand-off is reserved for cases where another agent has uniquely-required authority (Orion for out-of-scope admin). (Ninad)
- Side-fix: bypass_end() exec status=completed (not reviewing) so bypass execs show in Recent Executions card. — Bypass execs are terminal once bypass_end fires. Marking them reviewing was a hack that overloaded a non-terminal status. completed is the correct terminal for any exec. (Orion)
- Gap G zombie sweep (CAROL-INI-479): initiative active with no dispatch path since 2026-05-14 17:44:11. Transitioning to blocked for operator triage or manual re-activation. (elrond.zombie_sweep)
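The end-to-end ownership decision above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not Albus's actual code: the verdict names come from this page, while the Step class, the handler functions, smoke_test, and the return strings are hypothetical stubs that only record what Albus would do. How an escalated step is treated after notify_orion is also an assumption (here it is left untouched).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Verdict(Enum):
    DETAIL_REPLAN_MERLIN = "DETAIL_REPLAN_MERLIN"
    ENV_FIX_RADAGAST = "ENV_FIX_RADAGAST"
    PREREQ_INITIATIVE = "PREREQ_INITIATIVE"
    DIRECT_FIX = "DIRECT_FIX"
    ESCALATE_ORION = "ESCALATE_ORION"


@dataclass
class Step:
    """Hypothetical stand-in for a failed plan_steps row."""
    step_id: int
    status: str = "failed"
    actions: List[str] = field(default_factory=list)  # audit trail for the demo


# Stub handlers: each records the work Albus performs himself for that verdict.
def replan(step: Step) -> None:
    step.actions.append("rewrite planner_prompt")


def env_fix(step: Step) -> None:
    step.actions.append("invoke Radagast in-process and verify")


def prereq(step: Step) -> None:
    step.actions.append("file prereq, dispatch, wait")


def direct_fix(step: Step) -> None:
    step.actions.append("edit code and test")


OWNED: Dict[Verdict, Callable[[Step], None]] = {
    Verdict.DETAIL_REPLAN_MERLIN: replan,
    Verdict.ENV_FIX_RADAGAST: env_fix,
    Verdict.PREREQ_INITIATIVE: prereq,
    Verdict.DIRECT_FIX: direct_fix,
}


def smoke_test(step: Step) -> bool:
    """Placeholder check; a real one would confirm the failed step is unblocked."""
    step.actions.append("smoke test")
    return True


def recover(step: Step, verdict: Verdict) -> str:
    """Apply the fix for a verdict, verify it, and re-queue the step."""
    if verdict is Verdict.ESCALATE_ORION:
        # The only true delegation: hand off and leave the step as it is.
        step.actions.append("notify_orion")
        return "escalated"
    OWNED[verdict](step)
    if not smoke_test(step):
        return "failed"
    step.status = "pending"  # flip only after the fix is verified
    return "success"
```

Keeping every owned verdict in one table makes the contract easy to audit: any verdict other than ESCALATE_ORION must have a handler, and no handler returns before the smoke test and the status flip have run.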
✅Success criteria
- SQL view pipeline_failures_v exists and returns the union of stuck-state rows (must_have)
- Re-running Albus on INI-029 step 764 closes it without operator intervention (smoke) (must_have)
- /api/monitor/recent returns at least one bypass exec (the last bypass we shipped) (must_have)
- checklist_templates has at least 6 rows with type_code='AL' (decide, execute, and review phases populated) (must_have)
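The two database-level criteria above (the view exists, and the AL template is populated) can be checked mechanically. The sketch below assumes a SQLite database and a checklist_templates table with type_code, phase, and step_name columns. Only type_code is named on this page; phase and step_name are hypothetical. The other two criteria need the live system and are not covered here.

```python
import sqlite3


def check_db_criteria(conn: sqlite3.Connection) -> dict:
    """Report whether pipeline_failures_v exists and how many AL template rows there are."""
    view_exists = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master "
        "WHERE type = 'view' AND name = 'pipeline_failures_v'"
    ).fetchone()[0] == 1
    al_rows = conn.execute(
        "SELECT COUNT(*) FROM checklist_templates WHERE type_code = 'AL'"
    ).fetchone()[0]
    return {
        "view_exists": view_exists,
        "al_template_rows": al_rows,
        "al_template_ok": al_rows >= 6,
    }


def make_criteria_demo() -> sqlite3.Connection:
    """Demo database: a placeholder view plus six AL rows across the three phases."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE checklist_templates (type_code TEXT, phase TEXT, step_name TEXT);
    INSERT INTO checklist_templates VALUES
        ('AL', 'decide', 'investigate'),
        ('AL', 'decide', 'diagnose'),
        ('AL', 'decide', 'verdict'),
        ('AL', 'execute', 'apply'),
        ('AL', 'review', 'test'),
        ('AL', 'review', 're-initiate');
    -- Placeholder body; the real view would union the stuck-state queries.
    CREATE VIEW pipeline_failures_v AS SELECT 'placeholder' AS kind, 0 AS ref_id;
    """)
    return conn
```

Calling check_db_criteria(make_criteria_demo()) reports the view as present and six AL rows. Against a database without the view it reports view_exists as False rather than raising, since it only reads sqlite_master.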