Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.



CAROL-INI-0047-00: Albus Autonomous Self-Healing — autonomous initiative creation + bypass execution for review-blind failures

Initiative

📖About

GAP: Review failures (SP-01 / PV-S1 returning a fail verdict) are treated as fatal and final by the pipeline. There is no second opinion and no autonomous root-cause analysis, so tooling bugs (truncation, missing context, evidence-blindness) become operator-blocking issues that require a manual override. Albus is currently blind to this class of failure because:

1. Albus is event-driven (handshake_requests WHERE to_agent=albus).
2. notify_albus is only called from worker-failure paths (ex_dev_01, pe_dev_01, tp_ar_01, ir_s1).
3. Reviewers (SP-01, PV-S1) never call notify_albus; grep returns zero matches.

So when ex-dev-01 succeeds but SP-01 returns a fail verdict (because of truncated context or missing evidence), the pipeline marks the exec as failed and stops. There is no autonomous recovery.

CAPABILITY: Albus catches review failures, diagnoses the root cause, autonomously creates a fix initiative (INSERT INTO initiatives + plan_steps + decisions, using the CI-S1 estimator for budget, exactly as Orion does for bypass initiatives), and executes the fix via the bypass methodology (bypass_start, bypass_step, bypass_activity, bypass_end). On bypass success, Albus marks the auto-created initiative as reviewing and retries the original failing exec. On bypass failure (after retries), Albus flips the initiative to status=blocked and calls notify_orion, so the escalation queue (CAROL-INI-048) surfaces it.

THE FIRST PROOF (Phase 4 verification): INI-030 step 3 is currently failed. The truncation caps in po_s1, ex_dev_01, and pipeline.py were reverted to their original values (500/2000) per Ninad on 2026-04-29, purely as a validation setup. Albus's first job after this initiative ships: detect INI-030 step 3's review_verdict_fail, diagnose it as a tooling bug (output truncated, SP-01 cannot see the runner.py diff), create the initiative "Bump po_s1 + ex_dev_01 + pipeline result_summary caps", bypass-execute it, restart INI-030 step 3, and mark the autonomous fix initiative as reviewing.
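To make the diagnosis concrete, here is a minimal sketch of how a character cap on a result summary can hide the evidence a reviewer needs. The helper names and the sample summary are hypothetical, and the source does not say which of the two caps (500 or 2000) applies to which component. The sketch uses 500 as the cap and an arbitrary larger value (8000, not a figure from the initiative) for comparison.

```python
def truncate_summary(summary: str, cap: int) -> str:
    """Keep at most `cap` characters, as a simple result_summary cap would."""
    return summary[:cap]


def evidence_visible(summary: str, cap: int, marker: str) -> bool:
    """True if `marker` survives truncation and is visible to the reviewer."""
    return marker in truncate_summary(summary, cap)


# Hypothetical worker output: 600 characters of log noise, then the diff header.
noise = ("log line\n" * 67)[:600]
full_summary = noise + "diff --git a/runner.py b/runner.py\n+ fixed"

# With the 500-character cap, the diff is cut off before the reviewer sees it.
print(evidence_visible(full_summary, 500, "runner.py"))   # False
# With an illustrative larger cap (assumed value), the diff is visible.
print(evidence_visible(full_summary, 8000, "runner.py"))  # True
```

In this situation the work output exists but the reviewer cannot verify it, which matches the initiative's definition of a tooling bug rather than a real failure.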

PHASES:

PHASE 1 (bypass): Wire reviewers to notify Albus on verdict-fail

In sp_01.py and pv_s1.py: when verdict=fail, call notify_albus(kind="review_verdict_fail", payload={exec_id, step_id, initiative_id, verdict_text, success_criterion, evidence_offered_to_reviewer}). The pipeline continues its normal flow (marks the exec failed); Albus operates as a parallel layer.
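A minimal sketch of what the reviewer-side hook might look like. The kind value and the payload keys come from the Phase 1 description. The function name report_verdict, the context dictionary, and the best-effort error handling are assumptions made for illustration, and notify_albus is replaced by a stand-in because its real implementation is not shown here.

```python
from typing import Any, Callable, Dict, List

# Payload keys as listed in the Phase 1 description.
PAYLOAD_KEYS = (
    "exec_id",
    "step_id",
    "initiative_id",
    "verdict_text",
    "success_criterion",
    "evidence_offered_to_reviewer",
)


def report_verdict(
    verdict: str,
    context: Dict[str, Any],
    notify: Callable[..., None],
) -> bool:
    """Notify Albus on a fail verdict. Returns True if a notification was sent."""
    if verdict != "fail":
        return False
    payload = {key: context.get(key) for key in PAYLOAD_KEYS}
    try:
        notify(kind="review_verdict_fail", payload=payload)
    except Exception:
        # Assumption: notification is best-effort, so a notify error never
        # blocks the pipeline's normal flow of marking the exec failed.
        return False
    return True


# Stand-in for notify_albus that records calls instead of writing a handshake.
sent: List[Dict[str, Any]] = []


def fake_notify_albus(kind: str, payload: Dict[str, Any]) -> None:
    sent.append({"kind": kind, "payload": payload})


ctx = {"exec_id": 1, "step_id": 3, "initiative_id": "INI-030",
       "verdict_text": "cannot see diff", "success_criterion": "diff applied",
       "evidence_offered_to_reviewer": "truncated summary"}
report_verdict("fail", ctx, fake_notify_albus)  # sends one notification
report_verdict("pass", ctx, fake_notify_albus)  # sends nothing
```

Passing the notifier in as an argument keeps the hook testable without a live handshake table. The reviewer still returns its verdict unchanged, so Albus stays a parallel layer.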

PHASE 2 (bypass): Build Albus Self-Healer droid (al-sh-01)

New file agents/agt_001/droids/al_sh_01.py
Triggered by handshake kind=review_verdict_fail (added to albus_watcher routing)
Uses Claude Opus reasoning to: (a) inspect the actual artifacts (regression.db rows, file contents, monitor state), (b) re-evaluate the success criterion independently, (c) categorize the failure as OVERRIDE-PASS / TOOLING-BUG / REAL-FAIL / AMBIGUOUS
OVERRIDE-PASS: mark step done with note + create lightweight tracking initiative
TOOLING-BUG: create full fix initiative via CI-S1 estimator (GET /api/estimate?steps=N), bypass-execute, retry original exec
REAL-FAIL: let failure stand (no override)
AMBIGUOUS: notify_orion (the escalation queue surfaces it)
Bypass failure (3 retries exhausted): notify_orion + auto-initiative status=blocked
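The category handling above can be sketched as a small dispatch table. The four category names come from the list. The Triage enum, the plan_actions function, and the action labels it returns are hypothetical names chosen for illustration, and the sketch does not model how many bypass attempts are made before a failure is declared.

```python
from enum import Enum
from typing import List, Optional


class Triage(Enum):
    OVERRIDE_PASS = "OVERRIDE-PASS"
    TOOLING_BUG = "TOOLING-BUG"
    REAL_FAIL = "REAL-FAIL"
    AMBIGUOUS = "AMBIGUOUS"


def plan_actions(
    category: Triage,
    bypass_succeeded: Optional[bool] = None,
) -> List[str]:
    """Map a triage category to an ordered list of hypothetical action labels.

    `bypass_succeeded` only matters for TOOLING_BUG: None means the bypass
    has not finished yet, True/False is its final outcome.
    """
    if category is Triage.OVERRIDE_PASS:
        return ["mark_step_done_with_note", "create_tracking_initiative"]
    if category is Triage.REAL_FAIL:
        return []  # the failure stands, no override
    if category is Triage.AMBIGUOUS:
        return ["notify_orion"]

    # TOOLING_BUG: create the fix initiative and run it via bypass.
    actions = ["create_fix_initiative", "bypass_execute"]
    if bypass_succeeded is True:
        actions += ["mark_fix_initiative_reviewing", "retry_original_exec"]
    elif bypass_succeeded is False:
        actions += ["mark_fix_initiative_blocked", "notify_orion"]
    return actions


print(plan_actions(Triage.TOOLING_BUG, bypass_succeeded=False))
```

Returning a plan instead of performing side effects keeps the triage decision easy to audit. A real al-sh-01 would still need to execute each step and record it in the audit trail.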

PHASE 3 (bypass): Documentation

agt_001/identity.md: add "Review verdict triage + autonomous self-healing" capability + new al-sh-01 droid in droids list
Cookbook #34 (Authority Matrix): Albus diagnose/fix/verify scope explicitly includes review-failure triage + autonomous initiative creation for tooling fixes (within bounded scope — not product changes)
New cookbook entry: "Albus self-healing — when and how" (categories Albus auto-fixes vs hands off, retry/escalation rules, audit trail expectations)

PHASE 4 (verification): Validate end-to-end on the staged INI-030 step 3 failure

Confirm SP-01 fires notify_albus on verdict-fail
Confirm al-sh-01 picks up the handshake
Confirm al-sh-01 diagnoses TOOLING-BUG (output truncated)
Confirm al-sh-01 creates fix initiative with CI-S1-estimated budget
Confirm al-sh-01 bypass-executes the fix end-to-end
Confirm INI-030 step 3 retries successfully after the fix lands
Confirm autonomous fix initiative shows in Recent Executions (success path)
Confirm that, had the fix bypass failed instead, the auto-created initiative would have appeared in the Escalation Queue (INI-048)

⚖️Decisions

Albus has authority to create initiatives + bypass-execute them ONLY for tooling/review-blind issues, not for product changes. Definition of tooling-bug: work output exists but reviewer could not verify it (truncation, missing context, format mismatch). REAL-FAIL or AMBIGUOUS verdicts escalate to Orion via the Escalation Queue (INI-048). (Ninad)
Albus uses CI-S1 estimator (GET /api/estimate?steps=N) for budget — same as Orion does for bypass initiatives. No special budget authority. (Ninad)
On bypass failure: notify Orion via the existing notify_orion mechanism; the auto-created initiative is set to status=blocked and surfaces in the Escalation Queue (CAROL-INI-048). No infinite retry loop: single attempt per autonomous fix. (Ninad)
INI-030 step 3 is the LIVE PROOF. Truncation bumps in po_s1/ex_dev_01/pipeline.py reverted to original (500/2000) per Ninad 2026-04-29. Albus must detect, diagnose, fix, and the original exec must retry successfully — all autonomously, without operator intervention. (Ninad)
INI-048 (Escalation Queue) is a hard prerequisite — must ship FIRST so INI-047 escalation path has a surfacing mechanism. (Orion)
Gap G zombie sweep (CAROL-INI-479): initiative active with no dispatch path since 2026-05-14 17:44:11. Transitioning to blocked for operator triage or manual re-activation. (elrond.zombie_sweep)
Requester rewritten ninad -> orion per CAROL-INI-744 (orion is the only human-CLI requester): backfill of historical rows after INI-744 added API-level refusal of requester=ninad. Orion is Ninad's CLI agent; all human-originated initiatives are filed with requester=orion. (orion)

✅Success criteria

sp_01.py and pv_s1.py each contain a notify_albus call on verdict=fail. (must_have)
al_sh_01.py exists in agents/agt_001/droids/; albus_watcher routes review_verdict_fail handshakes to it. (must_have)
INI-030 step 3 (currently failed, truncation bumps reverted): Albus autonomously detects -> diagnoses TOOLING-BUG -> creates fix initiative -> bypass-executes -> INI-030 step 3 retries and passes. All operator-free. (must_have)
Auto-created fix initiative appears in Recent Executions on success (mode=bypass). (must_have)
When an Albus bypass attempt is staged to fail, the auto-created initiative shows in the Escalation Queue (INI-048) and Orion is notified via notify_orion. (must_have)
Cookbook #34 + new "Albus self-healing" cookbook entry both reflect the new capability. (must_have)

Sourced live from the initiatives ledger · initiative 999933