Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.

CAROL-INI-0425-00: Add cron-based health watchdog for critical Carol apps (initiatives API crash had no auto-restart)

Initiative

📖About

DIAGNOSIS: The initiatives API (port 7167) crashed and stayed down because nothing restarts it automatically. Its system service, carol-initiatives.service, is masked, and carol-apps.service is a oneshot unit with no process supervision. While the API was down, all pipeline operations (dispatching, reviewing, monitoring) halted. Secondary damage: bypass executions for INI 1000351 and INI 1000352 were orphaned and left in stale executing/running states.

EVIDENCE:

  • curl http://127.0.0.1:7167 returned exit code 7 (connection refused)
  • pgrep -f uvicorn.*7167 returned nothing
  • systemctl is-active carol-initiatives.service -> inactive; is-enabled -> masked
  • carol-apps.service is Type=oneshot with no Restart=
  • 3 executions in plangenerator.db stuck in the executing state for 16-26 minutes, with updated_at == created_at (never updated after creation)
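
The first two checks above both ask whether anything accepts TCP connections on 127.0.0.1:7167. A minimal Python sketch of that probe follows, assuming a plain TCP connect is an adequate liveness signal (it does not confirm that the API answers requests correctly). The helper name is_listening is hypothetical and not part of Carol.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (the condition curl reports as exit
        # code 7), timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("initiatives API listening:", is_listening("127.0.0.1", 7167))
```

A connect probe is cheap enough to run every few minutes, but it would miss a process that holds the port open while hung.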

PROPOSED FIX: Create a cron-based health watchdog script that runs every 5 minutes, checks whether each critical app (especially initiatives on port 7167) is listening, and restarts any that are down. Add the script to the caroladmin crontab.
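
One pass of such a watchdog could look roughly like the Python sketch below. Only port 7167 comes from the diagnosis. The registry CRITICAL_APPS, the restart command path, and the function names are placeholders invented for illustration, and the real restart mechanism for each app would need to be filled in.

```python
import socket
import subprocess
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: app name -> (port, restart command).
# The restart command is a placeholder, not a real Carol script.
CRITICAL_APPS: Dict[str, Tuple[int, List[str]]] = {
    "initiatives": (7167, ["/path/to/restart-initiatives.sh"]),
}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_restart(command: List[str]) -> bool:
    """Run the restart command; return True only if it exits with status 0."""
    try:
        return subprocess.run(command, timeout=60).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def watchdog_pass(
    apps: Dict[str, Tuple[int, List[str]]],
    probe: Callable[[int], bool] = port_open,
    restart: Callable[[List[str]], bool] = run_restart,
) -> List[str]:
    """Check each app once and attempt a restart for any that is not listening.

    Returns the names of the apps for which a restart was attempted.
    """
    attempted = []
    for name, (port, command) in apps.items():
        if probe(port):
            continue
        restart(command)
        attempted.append(name)
    return attempted


if __name__ == "__main__":
    for name in watchdog_pass(CRITICAL_APPS):
        print(f"watchdog: {name} was down, restart attempted")
```

Assuming the script is saved at a path such as /path/to/carol_watchdog.py (hypothetical), a crontab entry of the form `*/5 * * * * /usr/bin/python3 /path/to/carol_watchdog.py` would run it every 5 minutes. Passing the probe and restart functions as parameters keeps the loop testable without touching real services.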

⚖️Decisions

  • INI-411 orphan sweep: all bypass executions for this initiative were swept as orphaned (>1200s with no owning process). Transitioning to blocked for operator or resume-loop triage. (elrond.orphan_sweep)
  • Escalated by autonomous loop: bypass INI 1000353 is blocked; parked INI 1000351 cannot resume. Operator triage needed. (albus)
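
The orphan criterion quoted in the sweep note (more than 1200 seconds with no owning process) can be sketched as a small predicate. This is an illustration only, not the elrond.orphan_sweep implementation. It assumes the age is measured from the execution's updated_at timestamp, which the note does not state, and the name is_orphaned is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold quoted in the sweep note.
ORPHAN_AFTER = timedelta(seconds=1200)


def is_orphaned(
    updated_at: datetime,
    now: datetime,
    has_owning_process: bool,
    threshold: timedelta = ORPHAN_AFTER,
) -> bool:
    """Return True if the execution has no owner and is older than threshold.

    Assumption: age is measured from updated_at.
    """
    return (not has_owning_process) and (now - updated_at) > threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stuck_since = now - timedelta(minutes=26)
    print(is_orphaned(stuck_since, now, has_owning_process=False))
```

Under this reading, an execution stuck for 26 minutes with no owner exceeds the threshold, while one stuck for 16 minutes does not.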