
Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.


CAROL-INI-0121-00: al_watch_01 orphan-watchdog for stuck droid_runs

Initiative

📖About

When an al-auto-01 (Autonomous Troubleshooter) Claude Code session dies, crashes, or is killed mid-investigation, its droid_runs row stays at status=running with completed_at=NULL forever. This has two visible consequences today:

  • shared/initiative_monitor.py::_detect_live_initiatives() sees the orphaned running row and renders the underlying initiative as currently executing in the Monitor, which confuses operators. Observed 2026-05-07: droid_run 924 from 2026-05-05 23:09 still flagged INI-026 as live ~25h after the al-auto-01 process disappeared.
  • al_watch_01 idempotency keys on (event_id, trigger_kind), so it does not re-fire on the same event, but the orphan still blocks accurate liveness detection and audit.

Fix: add a watchdog pass to the existing al_watch_01 30s tick that marks any droid_run with status=running and started_at older than ORPHAN_THRESHOLD_MIN minutes (default 30) as failed, with completed_at=now and a result_summary tag. Also clean up the immediate orphan (row 924).
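
The watchdog pass described above could look roughly like the following. This is a minimal sketch, not the actual al_watch_01 code. It assumes a SQLite-style droid_runs table with timestamps stored as ISO-8601 UTC strings. The schema, the `id` column, and the function name `reap_orphans` are hypothetical. The summary text is adapted from the Decisions entry so that it reflects the configured threshold. Only the `[orphan-watchdog]` prefix is treated as the grep-able tag.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumed schema: only the column names status, started_at, completed_at and
# result_summary come from the initiative text; types and `id` are guesses.
ASSUMED_SCHEMA = """
CREATE TABLE droid_runs (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    started_at TEXT NOT NULL,
    completed_at TEXT,
    result_summary TEXT
)
"""


def reap_orphans(conn: sqlite3.Connection, now: datetime, threshold_min: int = 30) -> int:
    """Mark running rows older than the threshold as failed; return rows reaped."""
    stamp = now.isoformat(timespec="seconds")
    # Fixed-format UTC ISO strings compare correctly as plain text.
    cutoff = (now - timedelta(minutes=threshold_min)).isoformat(timespec="seconds")
    summary = (
        f"[orphan-watchdog] reaped: running >{threshold_min}min with no live process"
    )
    cur = conn.execute(
        "UPDATE droid_runs "
        "SET status = 'failed', completed_at = ?, result_summary = ? "
        "WHERE status = 'running' AND started_at < ?",
        (stamp, summary, cutoff),
    )
    conn.commit()
    return cur.rowcount
```

Because the sweep is a single UPDATE filtered on status = 'running', running it on every 30s tick is idempotent: a row that has already been reaped no longer matches.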

⚖️Decisions

  • Place watchdog inside the existing al_watch_01 30s tick rather than a separate timer — al_watch_01 already polls droid_runs / failed execs every 30s and owns the carol-albus-watch.timer. Adding a tick-local sweep avoids new systemd units, keeps logic where Albus's enabler boundary already lives, and inherits the pause-honor semantic (CAROL-INI-103) for free. (Ninad)
  • Threshold = 30 min, configurable via ORPHAN_THRESHOLD_MIN env var: the al-auto-01 hardcoded timeout is 900s (15 min) per Session 2 handover, so 30 min gives 2x headroom. A slow-but-live session is never reaped, while orphans still get cleared within one Monitor refresh window. The env override leaves operators an escape hatch. (Orion)
  • Mark orphans status='failed' with result_summary='[orphan-watchdog] reaped — running >30min with no live process' — failed (vs cancelled) because the work did not complete — same status code Albus uses for genuinely failed runs, so Monitor and audit treat them uniformly. result_summary tag makes the reaping action grep-able for post-incident analysis. (Orion)
  • requester rewritten ninad -> orion per CAROL-INI-744: orion is the only human-CLI requester. This is a backfill of historical rows after CAROL-INI-744 added API-level refusal of requester=ninad. Orion is Ninad's CLI agent; all human-originated initiatives are filed with requester=orion. (orion)
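
The threshold decision above (30 min by default, overridable through ORPHAN_THRESHOLD_MIN) might be read along these lines. This is a sketch under stated assumptions: the helper name `orphan_threshold_min` is hypothetical, and falling back to the default on a missing, non-numeric, or non-positive value is an illustrative choice that the initiative does not specify.

```python
import os
from typing import Mapping, Optional

DEFAULT_ORPHAN_THRESHOLD_MIN = 30  # default stated in the initiative


def orphan_threshold_min(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the reaping threshold in minutes, falling back to the default."""
    source = os.environ if env is None else env
    raw = source.get("ORPHAN_THRESHOLD_MIN", "").strip()
    if not raw:
        return DEFAULT_ORPHAN_THRESHOLD_MIN
    try:
        value = int(raw)
    except ValueError:
        # Assumption: a malformed override is ignored rather than crashing the tick.
        return DEFAULT_ORPHAN_THRESHOLD_MIN
    # Assumption: zero or negative thresholds would reap live runs, so reject them.
    return value if value > 0 else DEFAULT_ORPHAN_THRESHOLD_MIN
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise in tests, for example `orphan_threshold_min({"ORPHAN_THRESHOLD_MIN": "45"})`.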

Success criteria

  • After deploy, GET /api/monitor/all returns [] when no droid_run is genuinely running (orphan 924 cleared) (must_have)
  • Synthetic test: insert a fake droid_runs row with status=running and started_at=now-31min; al_watch_01 single tick reaps it (status=failed, result_summary tag set) (must_have)
  • Synthetic test: insert a fake droid_runs row with status=running and started_at=now-29min; al_watch_01 single tick leaves it untouched (must_have)
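
The two synthetic criteria above can be exercised together against an in-memory database. The following is only a sketch: the table layout, the `id` column, and both function names are hypothetical, and `tick_reap` is a stand-in for whatever sweep al_watch_01 actually runs on a single tick, using a shortened summary that keeps only the `[orphan-watchdog]` tag.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def tick_reap(conn: sqlite3.Connection, now: datetime, threshold_min: int = 30) -> None:
    """Stand-in for one watchdog tick: fail running rows older than the threshold."""
    cutoff = (now - timedelta(minutes=threshold_min)).isoformat(timespec="seconds")
    conn.execute(
        "UPDATE droid_runs SET status = 'failed', completed_at = ?, "
        "result_summary = '[orphan-watchdog] reaped' "
        "WHERE status = 'running' AND started_at < ?",
        (now.isoformat(timespec="seconds"), cutoff),
    )
    conn.commit()


def run_synthetic_checks() -> dict:
    """Insert one 31-minute-old and one 29-minute-old running row, then tick once."""
    now = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE droid_runs (id INTEGER PRIMARY KEY, status TEXT, "
        "started_at TEXT, completed_at TEXT, result_summary TEXT)"
    )
    for run_id, age_min in ((1, 31), (2, 29)):
        started = (now - timedelta(minutes=age_min)).isoformat(timespec="seconds")
        conn.execute(
            "INSERT INTO droid_runs (id, status, started_at) VALUES (?, 'running', ?)",
            (run_id, started),
        )
    tick_reap(conn, now)
    rows = conn.execute("SELECT id, status, result_summary FROM droid_runs")
    return {run_id: (status, summary) for run_id, status, summary in rows}
```

Pinning `now` to a fixed timestamp keeps the 31-minute and 29-minute cases deterministic instead of depending on the wall clock at test time.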