Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.

INI-999900427

CAROL-INI-2189-00: al_watch_01 runtime guard counts its own Hermione launcher — watcher aborts every 60s tick

Initiative

📖 About

NARROW RUNTIME HOTFIX. This is NOT the INI-113 architecture redesign, which is separate, broad, and still planned.

The watcher already exists and is scheduled every 60s, but it is non-functional. _pgrep_self_already_running() (al_watch_01.py ~657-674) runs pgrep -f agents/agt_001/al_watch_01.py and excludes only os.getpid(). Hermione's he_sched_01 launches the watcher via Popen(python3 shared/run_wrapper.py , shell=True) with the script path embedded in the command, so both the run_wrapper python process AND the sh -c shell match the pattern. The guard therefore sees 2+ other processes and exits 0 without scanning. Albus has done zero failure triage since ~13:35 2026-06-30 (14k no-op ticks).

FIX SCOPE (one file, agents/agt_001/al_watch_01.py): the guard must count only the real python interpreter running the script, excluding run_wrapper.py and shell wrappers, so that exactly one watcher passes per tick while true overlap is still caught.
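The fix scope above can be sketched roughly as follows. This is a minimal illustration, not the code in al_watch_01.py: it assumes the real watcher shows up as its own interpreter process whose first positional argument is the script path, and it reads /proc/<pid>/cmdline (Linux-only) instead of calling pgrep so that each process's argv can be inspected exactly. The names runs_watcher_directly, other_watcher_pids and read_proc_table, and the sample process table, are hypothetical.

```python
import os

# Script path as it appears in the pgrep pattern quoted in the description.
WATCHER_SCRIPT = "agents/agt_001/al_watch_01.py"

# Interpreter options that consume the following argument.
_OPTS_WITH_VALUE = {"-W", "-X"}


def runs_watcher_directly(argv, script=WATCHER_SCRIPT):
    """Return True only if argv is a python interpreter whose script is the watcher."""
    if not argv or not os.path.basename(argv[0]).startswith("python"):
        return False  # sh -c wrappers and other programs are never counted
    args = iter(argv[1:])
    for arg in args:
        if arg in ("-c", "-m"):
            return False  # inline code or a module, not a script file
        if arg in _OPTS_WITH_VALUE:
            next(args, None)  # skip the option's value
            continue
        if arg.startswith("-"):
            continue
        # The first positional argument is the script the interpreter executes,
        # so "python3 shared/run_wrapper.py <watcher path>" is rejected here.
        return arg == script or arg.endswith("/" + script)
    return False


def other_watcher_pids(procs, self_pid):
    """procs: iterable of (pid, argv) pairs. Return PIDs of other real watchers."""
    return [pid for pid, argv in procs
            if pid != self_pid and runs_watcher_directly(argv)]


def read_proc_table():
    """Linux-only: yield (pid, argv) for every readable /proc/<pid>/cmdline."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable
        yield int(entry), [p.decode(errors="replace") for p in raw.split(b"\0") if p]


if __name__ == "__main__":
    # Hypothetical process table: shell wrapper, run_wrapper, and the watcher itself.
    sample = [
        (100, ["sh", "-c", "python3 shared/run_wrapper.py " + WATCHER_SCRIPT]),
        (101, ["python3", "shared/run_wrapper.py", WATCHER_SCRIPT]),
        (102, ["python3", WATCHER_SCRIPT]),
    ]
    assert other_watcher_pids(sample, self_pid=102) == []  # wrappers are ignored
    sample.append((200, ["/usr/bin/python3", "-u", "/srv/" + WATCHER_SCRIPT]))
    assert other_watcher_pids(sample, self_pid=102) == [200]  # real overlap is caught
```

In a guard, the watcher would exit early only when other_watcher_pids(read_proc_table(), os.getpid()) is non-empty. Matching on the parsed argv rather than on a substring of the command line is what keeps the launcher and shell from being counted.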

⚖️ Decisions

  • Auto-detected remediation target INI-999999 from title/description scan (matched CAROL-INI-0113 -> row id 999999 (CAROL-INI-0113-00: Universal Albus failure-watcher + reload-friendly pipeline wa)); override by setting remediates_initiative_id explicitly at bypass_start. (system-auto-detect)
  • Elrond's bypass methodology checklist (a reminder, not a gate -- you've got this):
      0. File it with requested_mode='bypass' (planner-vs-bypass is a deliberate choice). bypass_start REFUSES a non-bypass initiative (CAROL-INI-1846), and the dispatcher only skips the bypass lane when the mode says bypass -- a 'planner' mistag lets Merlin's pipeline grab the placeholder step and block your finished work.
      1. File it with planned status -- let the bypass claim/activate it; never file it as active.
      2. Open the bypass (bypass_start) with your droid id + the remediation answer (remediates_initiative_id=NNN, or remediates_nothing=True).
      3. Work the blocks for your work-type: template -> design -> code -> test -> review. Do the real work; record decisions on the initiative as you make them.
      4. Reality is recorded for you at close -- code (files changed), each decision, and the twin-review verdict become real activities tied to this initiative and show in the Activity Tracker like a planner run (CAROL-INI-1840). No dummy rows.
      5. Keep the initiative status moving; it parks in 'reviewing' and is tagged uat-pending for you at close (CAROL-INI-1836), so the stuck-watchdog leaves it alone until UAT.
      6. Close runs the gates (design/architecture compliance + caller-audit). If a gate flags something pre-existing or unrelated to your change, waive it with a clear written rationale -- audit, don't skip.
      7. Bypass skips the planner's auto-orchestration, NOT the standards. Same template checklist, same review, same observability as a planner run.
    (elrond)
  • Scope reshaped after investigation + Ninad direction. The al_watch_01 pgrep guard bug is real but the watcher itself is now obsolete: per CAROL-INI-520 the failure-detection moved to inline invocation by Merlin (step execution) and Elrond (everywhere else); the watcher only retained 3 residual janitorial jobs. Decision: RETIRE al_watch_01 entirely after REHOMING its 3 residual jobs to rightful owners: (1) dead-letter Albus-inbox archiving -> NEW Albus-owned cleanup droid (agt_001); (2) drain-unapplied-Merlin-replans -> Merlin (agt_020), his business; (3) execution-less orphan droid_run reap -> Elrond (agt_011) by default (unowned pipeline chore). Then disable the al-watch-01 scheduler job + mark the droid retired, and reconcile the stale cookbook (#124/#125 + the watcher catalogue entry) to the invoked-not-polled truth. — Ninad ownership calls 2026-06-30: Albus owns his own inbox cleanup; replan-drain is Merlin business; unowned pipeline chores default to Elrond until a rightful owner appears. Rehome-before-retire guarantees no coverage gap (subagent verified jobs 3a/3b have no other live owner). (orion)
  • [status-router] planned -> executing | event=bypass_executing | bypass transition (or-bx-01)
  • [status-router] executing -> reviewing | event=uat_open | pipeline_uat uat_open (uat)
  • [status-router] reviewing -> blocked | event=operator_put | PUT /api/initiatives (operator)
  • [status-router] blocked -> reviewing | event=reviewer_verdict | twin reviewer re-run = pass; prior block was a relay-timeout crash in the auto-review wrapper, not a review failure (CAROL-INI-2189) (or-bx-01)
  • Extracted _invoke_albus + helpers from al_watch_01.py into new agents/agt_001/al_invoke.py. Updated 3 importers (po_s1.py, sr_s1.py, ir_s1.py) and cleaned al_en_01.py dead import. Log label changed from [al_watch_01] to [al_invoke_01]. Zero remaining imports of al_watch_01. File can now be deleted. — The inline Albus invocation helper was the last live dependency on the retired watcher file. Moving it to its own module completes the INI-2189 retirement. (orion)
  • Completed INI-2198 with a backfill fix in sprint_planning.py: the target list is now built from preflight-passable items only, not from the first depth items in the ordering. Previously the top 3 ordered items were all preflight-blocked (999952 sudo, 999999 systemd, 1000164 systemd), so the dispatch window was permanently empty despite 219 planned initiatives. The fix scans the full ordered list until depth passable items are found, then sets target to only those passable IDs. Retraction logic is unchanged; it correctly keeps passable items in the window. -- The preflight gate (2198) correctly blocked items, but target was still built from the first depth items regardless of block status. Combined with retraction, this made the window permanently empty. (orion)
  • COOKBOOK UPDATE: RSI initiatives file as planner-mode with status=planned. Sprint planner backfill promotes them to dispatched when a slot is available (depth=3). RSI items are exempt from retraction once dispatched. RSI tab displays them as Albus bypass (UI override). Normal pipeline dispatch flow. (orion)
  • Elrond re-scoped success criterion 1 (replace) on Albus's prescription — Policy P.01.02.04.16 (Elrond edits the initiative definition ONLY on Albus's prescription). Albus diagnosis: The original criterion was implicit (make it not abort). A precise criterion ensures the fix is verifiable and bounded — no endless loop of failing to detect the self-detection bug. (elrond)
  • Elrond re-scoped success criterion 999900427 (replace) on Albus's prescription — Policy P.01.02.04.16 (Elrond edits the initiative definition ONLY on Albus's prescription). Albus diagnosis: The original criterion 'zero false aborts on every 60s tick for 3 consecutive cycles' is unreachable because the codebase has existing unrelated abort sources (INI-113 scope) that have not been fixed. The hotfix must be scoped ONLY to the self-counting bug, not to all watcher aborts. Striking the perfection threshold allows a focused fix to pass step review. (elrond)
  • [status-router] reviewing -> blocked | event=operator_put | PUT /api/initiatives (operator)
  • [status-router] blocked -> diagnosis | event=diagnosis_start | RSI loop: oldest blocked (since 2026-06-30 22:59:22); Albus diagnosis INI 999900540 (el-rsi-loop-01)
  • Orion remediation in progress: INI-999900540 bypass opened — CAROL-INI-696: an Orion-driven bypass has been opened to remediate this parent. The canonical Orion remediated: marker will be posted on close — see cookbook 156 / 155. (shared.bypass.bypass_start)
  • Albus RSI diagnosis (root cause): [procedural, confidence high] Procedural, not a work failure. Mid-execution the scope was legitimately reshaped (Orion + Ninad, 2026-06-30 22:53): instead of fixing the pgrep guard, al_watch_01 is to be RETIRED after rehoming its 3 residual jobs — and the rehoming/extraction (al_invoke.py, importers updated, 'file can now be deleted') was actually completed. But the initiative's recorded success criteria still describe the original narrow guard-fix ('watcher passes self-guard and runs a full failure scan'), which the retirement work can never satisfy. The operator blocked it at reviewing twice (22:59:22 and again 2026-07-03 05:13:24 after or-bx-01's un-block) because the delivered work doesn't match the stale criteria and no retirement evidence was captured; the single 'failed' execution entry is an idle-close timeout, (albus)
  • Albus RSI recommendations: - Before any code work, replace the initiative's success criteria to match the accepted retirement scope: (1) al_watch_01.py deleted (or stubbed) with zero remaining imports, (2) its 3 residual janitorial jobs rehomed and demonstrably running under their new owners, (3) the he_sched_01 60s schedule entry for al_watch_01 removed so Hermione no longer launches it. - Verify and finish the retirement: confirm agents/agt_001/al_invoke.py exists and po_s1.py/sr_s1.py/ir_s1.py import from it, confirm the 3 residual jobs' new homes, then delete agents/agt_001/al_watch_01.py and remove its scheduler entry. - Capture explicit evidence in the initiative record before requesting review: grep output showing zero al_watch_01 imports, the scheduler config diff, and one log line from each rehomed job executing under its new owner. - Update the initiative title/description (or add a decision note) statin || Next attempt succeeds because: The operator blocks were caused by a criteria/scope mismatch and missing evidence, not defective work; once the criteria are rewritten to the retirement scope and the deletion + scheduler removal are evidenced, the review has concrete, matching artifacts to pass against. (albus)
  • Orion remediated: INI-999900540 bypass closed — CAROL-INI-696 close-marker: the Orion bypass INI-999900540 filed against this parent reached terminal state (closed). This row's literal prefix Orion remediated: is the canonical signal the cookbook-155 dispatcher gate looks for. (shared.bypass.bypass_end)
  • Orion remediated: Albus RSI diagnosis: [procedural, confidence high] Procedural, not a work failure. Mid-execution the scope was legitimately reshaped (Orion + Ninad, 2026-06-30 22:53): instead of fixing the pgrep guard, al_watch_01 is to be RETIRED after rehoming its 3 residual jobs — and the rehoming/extraction (al_invoke.py, importers updated, 'file can now be deleted') was actually completed. But the initiative's recorded success criteria still describe the original narrow guard-fix ('watcher passes self-guard and runs a full failure scan'), which the retirement work can never satisfy. The operato (orion)
  • [rsi-retrigger-failed] {'ok': False, 'reason': 'create_returned_no_id: {\'error\': \'INI2205_BAD_CRITERIA: All success criteria appear process-only (LLM confirmed). Each must describe a measurable user-visible outcome. FAIL\', \'criteria\': [\'al_watch_01.py is deleted (or stubbed) with zero remaining imports\', \'Its three residual janitorial jobs are rehomed and demonstrably running under their new owners\', \'The he_ (elrond.rsi_loop)
  • Orion remediated: Albus RSI diagnosis: [procedural, confidence high] Procedural, not a work failure. Mid-execution the scope was legitimately reshaped (Orion + Ninad, 2026-06-30 22:53): instead of fixing the pgrep guard, al_watch_01 is to be RETIRED after rehoming its 3 residual jobs — and the rehoming/extraction (al_invoke.py, importers updated, 'file can now be deleted') was actually completed. But the initiative's recorded success criteria still describe the original narrow guard-fix ('watcher passes self-guard and runs a full failure scan'), which the retirement work can never satisfy. The operato (orion)
  • [rsi-retrigger-failed] {'ok': False, 'reason': 'create_returned_no_id: {\'error\': \'INI2205_BAD_CRITERIA: All success criteria appear process-only (LLM confirmed). Each must describe a measurable user-visible outcome. FAIL\', \'criteria\': [\'al_watch_01.py is deleted (or stubbed) with zero remaining imports across the codebase\', \'The three residual janitorial jobs are rehomed to al_invoke.py or appropriate new owner (elrond.rsi_loop)
  • Orion remediated: Albus RSI diagnosis: [procedural, confidence high] Procedural, not a work failure. Mid-execution the scope was legitimately reshaped (Orion + Ninad, 2026-06-30 22:53): instead of fixing the pgrep guard, al_watch_01 is to be RETIRED after rehoming its 3 residual jobs — and the rehoming/extraction (al_invoke.py, importers updated, 'file can now be deleted') was actually completed. But the initiative's recorded success criteria still describe the original narrow guard-fix ('watcher passes self-guard and runs a full failure scan'), which the retirement work can never satisfy. The operato (orion)
  • [status-router] diagnosis -> closed | event=operator_put | PUT /api/initiatives (operator)
  • Closed: superseded by follow-on INI 999900542 (CAROL-INI-2189-01: al_watch_01 runtime guard counts its own Hermione launcher — watcher aborts every 60s tick) (elrond.initiative_author)
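
The INI-2198 backfill decision recorded in the log above (scan the full ordered list until depth passable items are found) can be sketched as below. This is an illustration under assumptions, not the code in sprint_planning.py: build_target, build_target_before_fix and the is_passable callback are hypothetical names, and the three passable IDs in the demo are invented. Only the three blocked IDs and depth=3 come from the log.

```python
from typing import Callable, Iterable, List


def build_target(ordered_ids: Iterable[int],
                 is_passable: Callable[[int], bool],
                 depth: int = 3) -> List[int]:
    """Collect the first `depth` preflight-passable IDs, scanning the whole ordering."""
    target: List[int] = []
    if depth <= 0:
        return target
    for ini_id in ordered_ids:
        if is_passable(ini_id):
            target.append(ini_id)
            if len(target) >= depth:
                break  # window is full, stop scanning
    return target


def build_target_before_fix(ordered_ids: Iterable[int], depth: int = 3) -> List[int]:
    """Behaviour described as the bug: take the first `depth` items, blocked or not."""
    return list(ordered_ids)[:depth]


if __name__ == "__main__":
    blocked = {999952, 999999, 1000164}               # preflight-blocked IDs from the log
    ordered = [999952, 999999, 1000164, 11, 12, 13, 14]  # 11..14 are made-up IDs

    def passable(ini_id: int) -> bool:
        return ini_id not in blocked

    old = build_target_before_fix(ordered)
    assert all(i in blocked for i in old)   # every slot is taken by a blocked item
    assert build_target(ordered, passable) == [11, 12, 13]
```

With the pre-fix selection every slot goes to a blocked item, which retraction then removes, so the window stays empty. Filtering before counting fills the window whenever at least one passable item exists.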

Success criteria

  • al_watch_01 passes its self-guard and runs a full failure scan under the Hermione scheduler (no "already running" abort when only the wrapper/shell are present) (must_have)
  • genuine overlap of two real watcher interpreters is still detected and the second exits (must_have)