Carolopedia
A friendly guide to Carol, her ecosystem, and the agents who built her.
📖About
GAP: Review failures (SP-01 / PV-S1 returning a fail verdict) are treated as fatal and final by the pipeline. There is no second opinion and no autonomous root-cause analysis. Tooling bugs (truncation, missing context, evidence-blindness) become operator-blocking issues that require a manual override. Albus is currently blind to this class of failure because:
1. Albus is event-driven (handshake_requests WHERE to_agent=albus).
2. notify_albus is only called from worker-failure paths (ex_dev_01, pe_dev_01, tp_ar_01, ir_s1).
3. Reviewers (SP-01, PV-S1) NEVER call notify_albus; grep returns zero matches.
So when ex-dev-01 succeeds but SP-01 returns a fail verdict (because of truncated context or missing evidence), the pipeline marks the exec failed and stops. There is no autonomous recovery.
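The blind spot can be reproduced with a toy model. This is a sketch under stated assumptions: the handshake_requests columns below and the notify_albus helper are hypothetical stand-ins (the real table and function are not shown in this document). The point it illustrates is that Albus's query only sees rows that someone inserted, so a reviewer that never calls notify_albus leaves nothing for Albus to act on.

```python
import json
import sqlite3

# Hypothetical minimal schema; the real handshake_requests columns are not documented here.
SCHEMA = """
CREATE TABLE handshake_requests (
    id INTEGER PRIMARY KEY,
    to_agent TEXT NOT NULL,
    kind TEXT NOT NULL,
    payload TEXT NOT NULL
)
"""


def notify_albus(conn: sqlite3.Connection, kind: str, payload: dict) -> None:
    """Hypothetical stand-in for notify_albus: enqueue one handshake for Albus."""
    conn.execute(
        "INSERT INTO handshake_requests (to_agent, kind, payload) VALUES (?, ?, ?)",
        ("albus", kind, json.dumps(payload)),
    )


def albus_inbox(conn: sqlite3.Connection) -> list[str]:
    """Albus only reacts to rows addressed to it (WHERE to_agent = 'albus')."""
    rows = conn.execute(
        "SELECT kind FROM handshake_requests WHERE to_agent = 'albus' ORDER BY id"
    )
    return [kind for (kind,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Worker-failure paths (ex_dev_01, pe_dev_01, ...) call notify_albus.
notify_albus(conn, "worker_failure", {"exec_id": "ex-1"})

# SP-01 / PV-S1 return verdict=fail without calling notify_albus,
# so no row is written and the review failure never reaches Albus.

print(albus_inbox(conn))  # ['worker_failure']
```

The kind name "worker_failure" is illustrative only.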
CAPABILITY: Albus catches review failures, diagnoses the root cause, AUTONOMOUSLY creates a fix initiative (INSERT INTO initiatives + plan_steps + decisions, using the CI-S1 estimator for budget, exactly as Orion does for bypass initiatives), and EXECUTES the fix via the bypass methodology (bypass_start, bypass_step, bypass_activity, bypass_end). On bypass success, Albus marks the auto-created initiative reviewing and retries the original failing exec. On bypass failure (after retries), it flips the initiative to status=blocked and calls notify_orion, so the Escalation Queue surfaces it (CAROL-INI-048).
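The two terminal outcomes of an autonomous fix can be sketched as a small state transition. This is a minimal illustration, not Albus's implementation: FixInitiative, close_bypass and the follow-up labels are hypothetical, and the sketch assumes the bypass run (including any retries) has already finished and produced a single success/failure result.

```python
from dataclasses import dataclass, field


@dataclass
class FixInitiative:
    """Hypothetical in-memory stand-in for an auto-created fix initiative."""

    initiative_id: str
    original_exec_id: str
    status: str = "active"
    follow_ups: list[str] = field(default_factory=list)


def close_bypass(ini: FixInitiative, bypass_succeeded: bool) -> FixInitiative:
    """Apply the CAPABILITY rules once the bypass run is over."""
    if bypass_succeeded:
        ini.status = "reviewing"
        # Re-run the exec whose review originally failed.
        ini.follow_ups.append(f"retry_exec:{ini.original_exec_id}")
    else:
        ini.status = "blocked"
        # Escalate; the Escalation Queue (CAROL-INI-048) surfaces the blocked initiative.
        ini.follow_ups.append("notify_orion")
    return ini


# INI-AUTO-* and ex-* identifiers are made up for the example.
ok = close_bypass(FixInitiative("INI-AUTO-1", "ex-42"), bypass_succeeded=True)
failed = close_bypass(FixInitiative("INI-AUTO-2", "ex-43"), bypass_succeeded=False)
print(ok.status, ok.follow_ups)          # reviewing ['retry_exec:ex-42']
print(failed.status, failed.follow_ups)  # blocked ['notify_orion']
```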
THE FIRST PROOF (Phase 4 verification): INI-030 step 3 is currently in the failed state. The truncation caps in po_s1, ex_dev_01 and pipeline.py were reverted to their original values (500/2000) per Ninad 2026-04-29, as a pure validation setup. Albus's first job after this initiative ships: detect INI-030 step 3's review_verdict_fail, diagnose it as a tooling bug (output truncated, so SP-01 cannot see the runner.py diff), create the initiative "Bump po_s1 + ex_dev_01 + pipeline result_summary caps", bypass-execute it, restart INI-030 step 3, and mark the autonomous fix initiative reviewing.
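Why truncation reads as a review failure can be shown with a toy reproduction. It assumes the cap is a plain character-prefix cut and uses 2000 as the cap; how po_s1, ex_dev_01 and pipeline.py actually truncate, and which of the 500/2000 values applies where, is not specified here. The result_summary content is invented for the example.

```python
def truncate(text: str, cap: int) -> str:
    """Hard character cap, standing in for the result_summary caps."""
    return text[:cap]


# Hypothetical result_summary: 3000 characters of log noise, then the runner.py diff.
log_noise = "PASS test_case\n" * 200  # 15 chars x 200 = 3000 chars
diff = "--- a/runner.py\n+++ b/runner.py\n"
result_summary = log_noise + diff

seen_by_reviewer = truncate(result_summary, 2000)

print("runner.py" in result_summary)    # True: the work output exists
print("runner.py" in seen_by_reviewer)  # False: the reviewer cannot see it
```

This is exactly the TOOLING-BUG shape: the evidence exists, but it never reaches the reviewer.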
PHASES:
PHASE 1 (bypass): Wire reviewers to notify Albus on verdict-fail
- sp_01.py + pv_s1.py: when verdict=fail, call notify_albus(kind="review_verdict_fail", payload={exec_id, step_id, initiative_id, verdict_text, success_criterion, evidence_offered_to_reviewer}). The pipeline continues its normal flow (mark exec failed); Albus operates as a parallel layer.
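A minimal sketch of the Phase 1 hook, assuming the reviewer can be handed a notify_albus callable. report_verdict is a hypothetical name and the real notify_albus signature in sp_01.py / pv_s1.py may differ; the payload keys are the ones listed in the bullet above. The verdict is returned unchanged, so the pipeline's own flow is untouched and Albus stays a parallel layer.

```python
from typing import Any, Callable


def report_verdict(
    verdict: str,
    *,
    exec_id: str,
    step_id: str,
    initiative_id: str,
    verdict_text: str,
    success_criterion: str,
    evidence_offered_to_reviewer: str,
    notify_albus: Callable[..., None],
) -> str:
    """Hypothetical reviewer hook: on verdict=fail, also notify Albus."""
    if verdict == "fail":
        notify_albus(
            kind="review_verdict_fail",
            payload={
                "exec_id": exec_id,
                "step_id": step_id,
                "initiative_id": initiative_id,
                "verdict_text": verdict_text,
                "success_criterion": success_criterion,
                "evidence_offered_to_reviewer": evidence_offered_to_reviewer,
            },
        )
    # Unchanged verdict: the pipeline still marks the exec failed as before.
    return verdict


# Usage with a fake notifier that records calls.
sent: list[dict[str, Any]] = []


def fake_notify(**kwargs: Any) -> None:
    sent.append(kwargs)


context = {
    "exec_id": "ex-1",
    "step_id": "3",
    "initiative_id": "INI-030",
    "verdict_text": "cannot see the runner.py diff",
    "success_criterion": "runner.py change verified",
    "evidence_offered_to_reviewer": "(truncated)",
}
report_verdict("pass", **context, notify_albus=fake_notify)  # no notification
report_verdict("fail", **context, notify_albus=fake_notify)  # one notification
print(len(sent), sent[0]["kind"])  # 1 review_verdict_fail
```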
PHASE 2 (bypass): Build Albus Self-Healer droid (al-sh-01)
- New file agents/agt_001/droids/al_sh_01.py
- Triggered by handshake kind=review_verdict_fail (added to albus_watcher routing)
- Uses Claude Opus reasoning to: (a) inspect the actual artifacts (regression.db rows, file contents, monitor state), (b) re-evaluate the success criterion independently, (c) categorize the failure as OVERRIDE-PASS / TOOLING-BUG / REAL-FAIL / AMBIGUOUS
- OVERRIDE-PASS: mark step done with note + create lightweight tracking initiative
- TOOLING-BUG: create full fix initiative via CI-S1 estimator (GET /api/estimate?steps=N), bypass-execute, retry original exec
- REAL-FAIL: let failure stand (no override)
- AMBIGUOUS: notify_orion (surfaces in the Escalation Queue)
- Bypass failure (3 retries exhausted): notify_orion + auto-initiative status=blocked
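The routing and category-to-action rules above can be sketched as two lookup tables. Everything named here is hypothetical: ROUTES stands in for the albus_watcher routing, and the action labels simply mirror the Phase 2 bullets. The retry limit of 3 follows the "3 retries exhausted" bullet.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    OVERRIDE_PASS = "OVERRIDE-PASS"
    TOOLING_BUG = "TOOLING-BUG"
    REAL_FAIL = "REAL-FAIL"
    AMBIGUOUS = "AMBIGUOUS"


# Hypothetical albus_watcher routing table: handshake kind -> droid.
ROUTES = {"review_verdict_fail": "al-sh-01"}

# Hypothetical action labels mirroring the Phase 2 bullets.
ACTIONS = {
    Category.OVERRIDE_PASS: ["mark_step_done_with_note", "create_tracking_initiative"],
    Category.TOOLING_BUG: [
        "estimate_budget_ci_s1",
        "create_fix_initiative",
        "bypass_execute",
        "retry_original_exec",
    ],
    Category.REAL_FAIL: [],  # let the failure stand, no override
    Category.AMBIGUOUS: ["notify_orion"],  # surfaces in the Escalation Queue
}


def route(kind: str) -> Optional[str]:
    """Return the droid that handles a handshake kind, if any."""
    return ROUTES.get(kind)


def triage(category: Category, bypass_failures: int = 0, max_retries: int = 3) -> list[str]:
    """Map an al-sh-01 category to its follow-up actions."""
    if category is Category.TOOLING_BUG and bypass_failures >= max_retries:
        # Bypass retries exhausted: escalate and block the auto-initiative.
        return ["notify_orion", "set_auto_initiative_blocked"]
    return list(ACTIONS[category])


print(route("review_verdict_fail"))  # al-sh-01
print(triage(Category.TOOLING_BUG, bypass_failures=3))  # ['notify_orion', 'set_auto_initiative_blocked']
```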
PHASE 3 (bypass): Documentation
- agt_001/identity.md: add "Review verdict triage + autonomous self-healing" capability + new al-sh-01 droid in droids list
- Cookbook #34 (Authority Matrix): Albus's diagnose/fix/verify scope explicitly includes review-failure triage and autonomous initiative creation for tooling fixes (within a bounded scope; no product changes)
- New cookbook entry: "Albus self-healing — when and how" (categories Albus auto-fixes vs hands off, retry/escalation rules, audit trail expectations)
PHASE 4 (verification): Validate end-to-end on the staged INI-030 step 3 failure
- Confirm SP-01 fires notify_albus on verdict-fail
- Confirm al-sh-01 picks up the handshake
- Confirm al-sh-01 diagnoses TOOLING-BUG (output truncated)
- Confirm al-sh-01 creates fix initiative with CI-S1-estimated budget
- Confirm al-sh-01 bypass-executes the fix end-to-end
- Confirm INI-030 step 3 retries successfully after the fix lands
- Confirm autonomous fix initiative shows in Recent Executions (success path)
- Confirm that, had the fix bypass failed instead, the auto-created initiative would have shown in the Escalation Queue (INI-048)
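The Phase 4 checks form an ordered chain, so one way to verify them is to look for the expected events, in order, in an audit log. A sketch under assumptions: the event names and the existence of such a log are hypothetical; only the order of the checks comes from the list above.

```python
from typing import Sequence

# Hypothetical event names for the Phase 4 chain; the real audit trail may differ.
EXPECTED_CHAIN = [
    "sp_01.notify_albus:review_verdict_fail",
    "al_sh_01.handshake_picked_up",
    "al_sh_01.diagnosed:TOOLING-BUG",
    "al_sh_01.fix_initiative_created",
    "al_sh_01.bypass_completed",
    "ini_030.step_3.retry_passed",
]


def missing_in_order(observed: Sequence[str], expected: Sequence[str]) -> list[str]:
    """Return expected events that do not appear, in order, in the observed log."""
    events = list(observed)
    missing: list[str] = []
    pos = 0
    for event in expected:
        try:
            # Search only after the previous match so ordering is enforced.
            pos = events.index(event, pos) + 1
        except ValueError:
            missing.append(event)
    return missing


observed_log = [
    "sp_01.notify_albus:review_verdict_fail",
    "albus_watcher.routed",  # unrelated events are ignored
    "al_sh_01.handshake_picked_up",
    "al_sh_01.diagnosed:TOOLING-BUG",
    "al_sh_01.fix_initiative_created",
    "al_sh_01.bypass_completed",
    "ini_030.step_3.retry_passed",
]
print(missing_in_order(observed_log, EXPECTED_CHAIN))       # []
print(missing_in_order(observed_log[:-1], EXPECTED_CHAIN))  # ['ini_030.step_3.retry_passed']
```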
⚖️Decisions
- Albus has authority to create initiatives and bypass-execute them ONLY for tooling/review-blind issues, not for product changes. Definition of TOOLING-BUG: the work output exists but the reviewer could not verify it (truncation, missing context, format mismatch). REAL-FAIL or AMBIGUOUS verdicts escalate to Orion via the Escalation Queue (INI-048). (Ninad)
- Albus uses CI-S1 estimator (GET /api/estimate?steps=N) for budget — same as Orion does for bypass initiatives. No special budget authority. (Ninad)
- On bypass failure: notify_orion via existing notify_orion mechanism; auto-created initiative status=blocked; surfaces in Escalation Queue (CAROL-INI-048). No infinite retry loop — single attempt per autonomous fix. (Ninad)
- INI-030 step 3 is the LIVE PROOF. Truncation bumps in po_s1/ex_dev_01/pipeline.py were reverted to the original values (500/2000) per Ninad 2026-04-29. Albus must detect, diagnose and fix the failure, and the original exec must then retry successfully, all autonomously and without operator intervention. (Ninad)
- INI-048 (Escalation Queue) is a hard prerequisite — must ship FIRST so INI-047 escalation path has a surfacing mechanism. (Orion)
- Gap G zombie sweep (CAROL-INI-479): initiative active with no dispatch path since 2026-05-14 17:44:11. Transitioning to blocked for operator triage or manual re-activation. (elrond.zombie_sweep)
- requester rewritten ninad -> orion per CAROL-INI-744: orion is the only human-CLI requester. This is a backfill of historical rows after INI-744 added API-level refusal of requester=ninad. Orion is Ninad's CLI agent; all human-originated initiatives are filed with requester=orion. (orion)
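Several of the decisions above are mechanical rules that can be written as guard checks. This sketch is illustrative only: the Verdict enum, the function names and the CI_S1_BASE host are hypothetical. The TOOLING-BUG definition, the /api/estimate?steps=N path and the refusal of requester=ninad come from the decisions; the non-TOOLING-BUG branches of classify are assumptions about how the inputs map to the other categories.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlencode


class Verdict(Enum):
    OVERRIDE_PASS = "OVERRIDE-PASS"
    TOOLING_BUG = "TOOLING-BUG"
    REAL_FAIL = "REAL-FAIL"
    AMBIGUOUS = "AMBIGUOUS"


def classify(output_exists: bool, reviewer_saw_output: bool,
             meets_criterion: Optional[bool]) -> Verdict:
    """TOOLING-BUG rule from the decisions; the other branches are assumptions."""
    if meets_criterion is None:
        return Verdict.AMBIGUOUS      # independent re-check was inconclusive
    if not output_exists or not meets_criterion:
        return Verdict.REAL_FAIL
    if not reviewer_saw_output:
        return Verdict.TOOLING_BUG    # output exists, reviewer could not verify it
    return Verdict.OVERRIDE_PASS      # reviewer saw valid output and still failed it


def may_self_fix(verdict: Verdict) -> bool:
    """Albus may create and bypass-execute fix initiatives only for tooling bugs."""
    return verdict is Verdict.TOOLING_BUG


# Hypothetical host; the decisions only give the path /api/estimate?steps=N.
CI_S1_BASE = "http://localhost:8000"


def estimate_url(steps: int, base: str = CI_S1_BASE) -> str:
    """Build the CI-S1 budget request (no special budget authority for Albus)."""
    if steps < 1:
        raise ValueError("steps must be a positive integer")
    return f"{base}/api/estimate?{urlencode({'steps': steps})}"


def validate_requester(requester: str) -> str:
    """API-level refusal of requester=ninad; human initiatives are filed as orion."""
    if requester == "ninad":
        raise ValueError("requester=ninad is refused; use requester=orion")
    return requester


print(classify(True, False, True).value)           # TOOLING-BUG
print(may_self_fix(classify(True, True, False)))   # False
print(estimate_url(4))  # http://localhost:8000/api/estimate?steps=4
```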
✅Success criteria
- sp_01.py contains notify_albus call on verdict=fail; pv_s1.py same. (must_have)
- al_sh_01.py exists in agents/agt_001/droids/; albus_watcher routes review_verdict_fail handshakes to it. (must_have)
- INI-030 step 3 (currently failed, truncation bumps reverted): Albus autonomously detects -> diagnoses TOOLING-BUG -> creates fix initiative -> bypass-executes -> INI-030 step 3 retries and passes. All operator-free. (must_have)
- Auto-created fix initiative appears in Recent Executions on success (mode=bypass). (must_have)
- On a staged failure of an Albus bypass attempt, the auto-created initiative shows in the Escalation Queue (INI-048) and Orion is notified via notify_orion. (must_have)
- Cookbook #34 + new "Albus self-healing" cookbook entry both reflect the new capability. (must_have)
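The first criterion can be checked more precisely than with the grep mentioned in the GAP note. A hedged sketch using Python's standard ast module: calls_notify_albus is a hypothetical helper that reports whether a source file contains a notify_albus(...) call (plain or attribute access) with kind="review_verdict_fail". It is a presence check only; it does not prove the call sits inside the verdict=fail branch.

```python
import ast


def calls_notify_albus(source: str, kind: str = "review_verdict_fail") -> bool:
    """True if the source contains notify_albus(kind=<kind>, ...)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both notify_albus(...) and something.notify_albus(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "notify_albus":
            continue
        for kw in node.keywords:
            if kw.arg == "kind" and isinstance(kw.value, ast.Constant) and kw.value.value == kind:
                return True
    return False


reviewer_src = '''
def review(verdict, ctx):
    if verdict == "fail":
        hooks.notify_albus(kind="review_verdict_fail", payload=ctx)
    return verdict
'''
print(calls_notify_albus(reviewer_src))                        # True
print(calls_notify_albus("def review(v):\n    return v\n"))   # False
```

In practice this would be run over the contents of sp_01.py and pv_s1.py.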