
Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.

Monitoring the execution

Block · Pipeline stage in Build Initiatives

📖About & Usage

About

Monitoring the execution — watching pipeline health: detecting dead or stuck processes, checking agent-droid wellbeing, and verifying the watchers themselves.

Where it fits

This is one stage of the Build Initiatives service. The owner and the agents who run it are listed under the team below, and the other blocks of the service are linked at the bottom of this page.

🛠️Team & droids

Elrond · Block owner

Monitoring is the pipeline watching its own health, and Elrond's contribution is supervisory: making sure his direct reports are actually alive and doing their jobs, and that handover gaps between agents don't silently stall an initiative. His Wellbeing Monitor runs once daily, checks that each of his reports is active, enabled and performing, and files exactly one initiative per run when it finds issues. That gives him a supervisor's duty that goes beyond aggregating status. The Wellbeing Monitor (On-Demand Twin) is its urgent-run counterpart: it can execute any of the scheduled supervisor wellbeing checks on request, for testing or when something can't wait for the daily cycle. The Elrond Handover Watchdog used to be his block-spanning gap detector, auto-dispatching or surfacing three handover gaps: a bypass that was filed but never dispatched, steps marked done with no reviewer invoked, and a continue verdict with no next-phase plan. It is retired under CAROL-INI-0924 and has been replaced by the agent-blind Handover Gap Detector, a shared service scheduled by Orion's monitor runner that routes each action to the owning agent. This matters because a supervisor who only aggregates status would miss reports that have quietly gone dark, or work that has stalled mid-handover. The wellbeing sweep fires daily and the twin fires on demand; both escalate the issues they find by filing initiatives rather than fixing them in place.
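The daily wellbeing sweep's key rule, one filed initiative per run however many reports have issues, can be sketched in Python. This is a minimal illustration under stated assumptions, not Elrond's actual monitor: the `Report` and `Initiative` records and the `wellbeing_sweep` function are hypothetical, and so is the idea that each check reduces to a single boolean, since this page does not say how "active", "enabled" or "performing" are measured.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical snapshot of one direct report's state."""
    name: str
    active: bool
    enabled: bool
    performing: bool


@dataclass
class Initiative:
    """Hypothetical stand-in for a filed initiative."""
    title: str
    findings: list[str] = field(default_factory=list)


def wellbeing_sweep(reports: list[Report]) -> Initiative | None:
    """Check every report; bundle all findings into at most one initiative."""
    findings = []
    for r in reports:
        # Collect every failed check for this report, not just the first one.
        problems = [
            label
            for ok, label in (
                (r.active, "not active"),
                (r.enabled, "disabled"),
                (r.performing, "not performing"),
            )
            if not ok
        ]
        if problems:
            findings.append(f"{r.name}: {', '.join(problems)}")
    if not findings:
        return None  # a clean run files nothing
    # One initiative per run, no matter how many reports have problems.
    return Initiative(f"Wellbeing: {len(findings)} report(s) with issues", findings)
```

Returning `None` on a clean run means a healthy day files nothing, while folding every finding into a single object keeps the sweep within the one-initiative-per-run limit instead of flooding the backlog with one initiative per problem.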

Albus

Merlin

In monitoring, Merlin watches the execution layer he runs, catching the dead and hung processes that would otherwise leave a task silently stuck forever. The Foreman is his scheduled watcher. Every 2 minutes it checks that all running pipeline tasks still have live, responsive processes, and it marks a task failed if its process is dead or has not made progress within that droid type's timeout window. It also scans for orphaned tasks on service startup and logs every finding to the activity record. The Foreman Twin is the on-demand counterpart and delivers an instant health snapshot: it lists the droids currently executing, counts failures from the last hour, and spots crashed processes that were never cleaned up, so Merlin can diagnose and choose between replanning and recovery without waiting for the next scheduled sweep. His Wellbeing Monitor runs once daily to confirm that his own direct reports are active, enabled and performing, and files one initiative per run when issues surface. This matters because Merlin's whole execution engine depends on knowing which background processes are truly alive; without the Foreman, a crashed executor would masquerade as a task that simply never finishes. The Foreman fires every 2 minutes, its twin on demand, and the wellbeing sweep daily, while the startup orphan scan covers recovery after a restart. When the Foreman detects a problem, it marks the task failed and alerts Albus for recovery.
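A single Foreman pass and its startup orphan scan might look like the Python sketch below. Everything here is an assumption for illustration: the `Task` record, the `sweep` and `startup_orphan_scan` functions, and the values in `TIMEOUTS_S` are hypothetical, because the page names a per-droid-type timeout window but gives neither its values nor how the Foreman probes a process. Liveness is therefore passed in as an `is_alive` callable.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable

SWEEP_INTERVAL_S = 120  # "every 2 minutes": the interval a scheduler would use
# Hypothetical per-droid-type timeouts; the real values are not documented here.
TIMEOUTS_S = {"executor": 600.0, "reviewer": 300.0}
DEFAULT_TIMEOUT_S = 900.0


@dataclass
class Task:
    task_id: str
    droid_type: str
    pid: int
    last_progress: float  # epoch seconds of the last observed progress
    status: str = "running"


def sweep(
    tasks: list[Task],
    is_alive: Callable[[int], bool],
    log: Callable[[str], None],
    now: float | None = None,
) -> list[Task]:
    """One scheduled pass: fail running tasks whose process is dead or stalled."""
    now = time.time() if now is None else now
    failed = []
    for t in tasks:
        if t.status != "running":
            continue
        timeout = TIMEOUTS_S.get(t.droid_type, DEFAULT_TIMEOUT_S)
        idle = now - t.last_progress
        if not is_alive(t.pid):
            reason = "process dead"
        elif idle > timeout:
            reason = f"no progress for {idle:.0f}s (limit {timeout:.0f}s)"
        else:
            continue  # alive and progressing: leave it alone
        t.status = "failed"
        failed.append(t)
        log(f"{t.task_id}: marked failed, {reason}")
    return failed  # the caller hands these to recovery


def startup_orphan_scan(
    tasks: list[Task], is_alive: Callable[[int], bool], log: Callable[[str], None]
) -> list[Task]:
    """On service start, fail tasks still marked running whose process is gone."""
    orphans = [t for t in tasks if t.status == "running" and not is_alive(t.pid)]
    for t in orphans:
        t.status = "failed"
        log(f"{t.task_id}: orphaned at startup, marked failed")
    return orphans
```

A scheduler would call `sweep` every `SWEEP_INTERVAL_S` seconds and pass the returned tasks on for recovery. On POSIX systems, `os.kill(pid, 0)` is one common basis for `is_alive`, since it raises `ProcessLookupError` when no such process exists. Checking for a dead process before checking the timeout means a crashed executor is reported as dead rather than merely slow.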

Sage

Sage's monitoring role is the meta-check: verifying that the watchers themselves actually work, so the pipeline never trusts a health monitor that has quietly broken. His Foreman Reviewer runs a test suite against Merlin's Foreman to confirm that its core detection behaviors are sound: dead processes are correctly identified, healthy processes are not mistaken for dead ones, the startup sweep correctly marks orphaned processes, and recovery actions trigger when they should. It reports pass/fail results to the activity log. This matters because a monitor that mis-detects is worse than no monitor: it either lets real failures slide or kills healthy work, so someone has to keep the watcher honest. It runs on demand, whenever Sage needs to confirm the Foreman is operating correctly, rather than continuously. Normally it returns a clean pass; when a check fails, it reports the specific broken behavior so the detection logic can be investigated before the pipeline keeps relying on it.
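The four behaviors the Foreman Reviewer checks can be expressed as a small self-contained Python harness. This is a hedged sketch, not Sage's actual suite: `StubForeman` is a hypothetical stand-in that models only process liveness (no timeouts), and `review_foreman`, its check names and the PASS/FAIL log format are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    pid: int
    status: str = "running"


@dataclass
class StubForeman:
    """Hypothetical Foreman stand-in: models process liveness only."""
    is_alive: Callable[[int], bool]
    on_failed: Callable[[Task], None] = lambda task: None  # recovery hook

    def sweep(self, tasks: list[Task]) -> None:
        for t in tasks:
            if t.status == "running" and not self.is_alive(t.pid):
                t.status = "failed"
                self.on_failed(t)

    def startup_scan(self, tasks: list[Task]) -> None:
        # After a restart, a task still "running" with no live process is an orphan.
        self.sweep(tasks)


def review_foreman(
    make_foreman: Callable[..., StubForeman], log: Callable[[str], None]
) -> dict[str, bool]:
    """Run the four detection checks against fresh foremen and log PASS/FAIL."""
    results: dict[str, bool] = {}

    # 1. A dead process must be identified.
    dead = Task("dead", pid=1)
    make_foreman(lambda pid: False, lambda task: None).sweep([dead])
    results["detects_dead"] = dead.status == "failed"

    # 2. A healthy process must not be mistaken for a dead one.
    healthy = Task("healthy", pid=2)
    make_foreman(lambda pid: True, lambda task: None).sweep([healthy])
    results["spares_healthy"] = healthy.status == "running"

    # 3. The startup sweep marks orphans and leaves live tasks alone.
    orphan, live = Task("orphan", pid=3), Task("live", pid=4)
    make_foreman(lambda pid: pid == 4, lambda task: None).startup_scan([orphan, live])
    results["marks_orphans"] = orphan.status == "failed" and live.status == "running"

    # 4. Recovery fires for the failed task and only for it.
    recovered: list[Task] = []
    make_foreman(lambda pid: pid != 5, recovered.append).sweep([Task("x", 5), Task("y", 6)])
    results["triggers_recovery"] = [t.task_id for t in recovered] == ["x"]

    for name, ok in results.items():
        log(f"foreman-review {name}: {'PASS' if ok else 'FAIL'}")
    return results
```

Calling `review_foreman(StubForeman, print)` logs four PASS lines. Taking a factory rather than a single instance lets each check build a foreman with its own fake liveness probe, so a detector that fails every task it sees would pass the dead-process check but fail on sparing healthy work, which is the kind of mis-detection this review exists to catch.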

👤Owner

Elrond · Head of Engineering

🧱Other blocks in Build Initiatives

Filing an initiative
Sprint Planning
Planning an initiative
Planning the execution of a step
Executing the step
Reviewing the step
Reviewing the initiative
Judging the initiative
Replanning the initiative
Troubleshooting the initiative
User Acceptance Testing
Support