
Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.


CAROL-INI-2128-00: Dispatch pre-verification gate — fail fast on stale dependencies before marking a step as executing

Initiative

About

Initiative sixteen (the Frankfurt delivery concierge) was permanently blocked because its operator task was silently dispatched into a cancelled queue slot left over from a prior restore operation, producing no task evidence and exhausting all recovery attempts. The root cause is structural: the dispatcher transitions a step to 'executing' without first confirming that its queue slot, required artifacts, and operator availability are live and valid. This initiative adds a lightweight pre-verification gate at dispatch time so that any step whose dependencies are stale or cancelled fails fast with a re-dispatchable error rather than burning through recovery attempts on a void execution.
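
As a rough illustration of the idea, here is a minimal sketch of a gate that runs before the status change. Every name in it (`Step`, `SlotState`, `pre_verify`, `dispatch`, `RedispatchableError`, `STALE_AFTER_S`) is hypothetical and not taken from Carol's codebase, and the threshold value is a placeholder, since the real staleness constant is not given here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class SlotState(Enum):
    ACTIVE = "active"
    CANCELLED = "cancelled"


class RedispatchableError(Exception):
    """Hypothetical error type: the step was held and may be dispatched again."""


# Placeholder value. The real system is said to reuse one shared constant.
STALE_AFTER_S = 3600.0


@dataclass
class Step:
    slot_state: SlotState
    # Artifact name -> age in seconds, or None when the artifact is missing.
    artifact_ages_s: Dict[str, Optional[float]]
    operator_reachable: bool
    status: str = "dispatched"


def pre_verify(step: Step, stale_after_s: float = STALE_AFTER_S) -> Optional[str]:
    """Return None when the step may execute, otherwise a short failure reason."""
    # 1. The queue slot must be live.
    if step.slot_state is not SlotState.ACTIVE:
        return "queue_slot_not_active"
    # 2. Every declared artifact must be present and fresh.
    for name, age in step.artifact_ages_s.items():
        if age is None:
            return f"artifact_missing:{name}"
        if age > stale_after_s:
            return f"artifact_stale:{name}"
    # 3. Someone must be available to take the task.
    if not step.operator_reachable:
        return "no_operator_available"
    return None


def dispatch(step: Step) -> None:
    """Run the gate synchronously, then (and only then) mark the step executing."""
    reason = pre_verify(step)
    if reason is not None:
        # Status is left untouched, so no recovery attempt is spent.
        raise RedispatchableError(reason)
    step.status = "executing"
```

In this sketch a step that lands in a cancelled slot raises `RedispatchableError("queue_slot_not_active")` and keeps its previous status, which is the fail-fast behavior the initiative is after.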

Decisions

  • The gate runs synchronously at dispatch time, before any status transition to 'executing', and is not an async background check. (orion)
  • A failed gate emits a re-dispatchable error (not a permanent block), returning the step to a state where it can be retried once the dependency is restored. (orion)
  • The gate checks three conditions in order: (1) queue slot is in an active, non-cancelled state; (2) all declared artifact dependencies from previous steps are present and non-stale; (3) there is a reachable operator or executor available to take the task. (orion)
  • Stale-artifact detection reuses the existing artifact-age threshold already defined for the cleanup process that triggered INI-510/511, so there is one canonical staleness definition system-wide. (orion)
  • The gate result is recorded as an audit event on the step regardless of pass or fail, so post-mortems can always reconstruct whether the gate ran. (orion)
  • [status-router] planned -> dispatched | event=dispatch | dispatcher queued (ds-s1)
  • [status-router] dispatched -> planned | event=dispatch_retract | No longer in the top-3 dispatch window (CAROL-INI-1972). (spb-01)
  • [status-router] planned -> dispatched | event=operator_dispatch | RSI: immediate dispatch (INI-2198) (or-bx-01)
  • [status-router] dispatched -> executing | event=dispatcher_transition | dispatcher state change (ds-s1)
  • [status-router] executing -> blocked | event=operator_block | RSI bypass realignment (or-bx-01)
  • [status-router] blocked -> executing | event=operator_unblock | RSI reconfig complete (or-bx-01)
  • [status-router] executing -> dispatched | event=dispatch | dispatcher queued (ds-s1)
  • [status-router] dispatched -> executing | event=dispatcher_transition | dispatcher state change (ds-s1)
  • [status-router] executing -> reviewing | event=review_inferred | review row present (el-review-inferer)
  • [status-router] reviewing -> blocked | event=stuck_10min_no_activity | Elrond safety net: initiative has had no activity for 10+ minutes. Blocking under the parallel safety mechanism. (el-watchdog)
  • Elrond blocked initiative under the CAROL-INI-2162 dead-Albus protocol. Albus was supposed to wake for step 0 (cause=albus_no_show) but did not respond. Cause: albus_no_show. Reason: Elrond safety net: initiative stranded 10+ min. Albus wake failed or produced no useful result. (el-s1)
  • RSI diagnosed: 2026-07-01 07:10:39 -> improvement #(none). ROOT CAUSE: The initiative transitioned to "executing" without verifying that a critical pre-step (Albus in step 0) was ready, causing an "albus_no_show" timeout that forced Elrond's safety net to block it. IMPROVEMENT: Introduce a mandatory pre-verification gate before marking a step… (el-rsi-eng-01)
  • [status-router] blocked -> executing | event=operator_unblock | RSI: was missed in unblock batch (or-bx-01)
  • [status-router] executing -> blocked | event=operator_put | PUT /api/initiatives (operator)
  • Elrond stuck-watchdog: 3 consecutive failed recovery attempts since 2 strikes recorded. Initiative idle past 600s with no live queue row; Albus invoked 3 times without progress. Flipping to blocked and surfacing on operator queue per CAROL-INI-403. (elrond.handover_watchdog)
  • RSI diagnosed: 2026-07-01 16:05:41 -> improvement #(none). ROOT CAUSE: The initiative was allowed to unblock and enter "executing" without confirming that Albus (step 0) was available and ready to handle the transition, causing a no-show and subsequent stuck watchdogs. IMPROVEMENT: Add a mandatory pre-verification gate before any initiative tr… (el-rsi-eng-01)
  • [status-router] blocked -> diagnosis | event=diagnosis_start | RSI loop: oldest blocked (since 2026-07-01 04:37:58); Albus diagnosis INI 999900509 (el-rsi-loop-01)
  • Orion remediation in progress: INI-999900509 bypass opened — CAROL-INI-696: an Orion-driven bypass has been opened to remediate this parent. The canonical Orion remediated: marker will be posted on close — see cookbook 156 / 155. (shared.bypass.bypass_start)
  • Albus RSI diagnosis (root cause): [work, confidence high] The initiative blocked because the dispatcher transitioned a step to 'executing' without verifying the queue slot was active, causing it to land in a cancelled slot left from a prior restore. The executor then ran with no live state, producing no evidence and wasting recovery attempts until Elrond's watchdog triggered a block. (albus)
  • Albus RSI recommendations: (1) Implement the pre-verification gate that checks queue slot state, artifact freshness, and operator availability before any status transition to 'executing'. (2) Use the single shared staleness constant from the artifact cleanup process for the gate's staleness check. (3) Record every gate decision (pass/fail) as an audit event on the step, surfaced on the initiative monitor UI. (4) Ensure the gate returns a re-dispatchable error on failure so steps can be retried without manual override. (5) Add a targeted test that reproduces the exact backfill-into-cancelled-slot failure from INI-16. (6) Run this attempt's execution with a stronger model (opus instead of sonnet) to reduce the risk of incomplete or incorrect gate logic. Next attempt succeeds because: The pre-verification gate directly prevents the exact stale-dependency and cancelled-slot failure observed, and the audit trail and test coverage ensure correct behavior and early detection of regressions. (albus)
  • Orion remediated: INI-999900509 bypass closed — CAROL-INI-696 close-marker: the Orion bypass INI-999900509 filed against this parent reached terminal state (closed). This row's literal prefix Orion remediated: is the canonical signal the cookbook-155 dispatcher gate looks for. (shared.bypass.bypass_end)
  • Orion remediated: Albus RSI diagnosis: [work, confidence high] The initiative blocked because the dispatcher transitioned a step to 'executing' without verifying the queue slot was active, causing it to land in a cancelled slot left from a prior restore. The executor then ran with no live state, producing no evidence and wasting recovery attempts until Elrond's watchdog triggered a block. (orion)
  • [status-router] diagnosis -> closed | event=operator_put | PUT /api/initiatives (operator)
  • Closed: superseded by follow-on INI 999900512 (CAROL-INI-2128-02: Dispatch pre-verification gate — fail fast on stale dependencies before marking a step as executing) (elrond.initiative_author)
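
Two of the decisions above, recording an audit event on every gate outcome and returning a failed step to a retryable state, can be sketched together. This is an illustrative sketch only: `AuditEvent`, `Step`, `gate_and_transition`, and the check names are hypothetical, and using `planned` as the re-dispatchable state is an assumption, since the decisions do not name that state.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional, Tuple

# A check returns None on success, or a short failure reason.
Check = Callable[[], Optional[str]]


@dataclass
class AuditEvent:
    at: datetime
    passed: bool
    reason: Optional[str]


@dataclass
class Step:
    status: str = "dispatched"
    audit: List[AuditEvent] = field(default_factory=list)


def gate_and_transition(step: Step, checks: List[Tuple[str, Check]]) -> bool:
    """Run checks in order, audit the outcome, then set the step status."""
    reason: Optional[str] = None
    for name, check in checks:
        failure = check()
        if failure is not None:
            reason = f"{name}: {failure}"
            break  # Stop at the first failing condition.
    passed = reason is None
    # The audit event is written on both pass and fail.
    step.audit.append(AuditEvent(datetime.now(timezone.utc), passed, reason))
    # Assumption: "planned" stands in for the re-dispatchable state.
    step.status = "executing" if passed else "planned"
    return passed


# Usage: the first attempt is held, the retry passes once the slot is live.
slot = {"state": "cancelled"}
checks: List[Tuple[str, Check]] = [
    ("queue_slot", lambda: None if slot["state"] == "active" else "slot is " + slot["state"]),
    ("artifacts", lambda: None),
    ("operator", lambda: None),
]
step = Step()
gate_and_transition(step, checks)  # held, one failing audit event recorded
slot["state"] = "active"           # dependency restored
gate_and_transition(step, checks)  # passes, second audit event recorded
```

Because the event is appended before the status is set, a post-mortem reading `step.audit` can tell whether the gate ran even when the step never reached execution.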

Success criteria

  • A step whose queue slot is in a cancelled state is never transitioned to 'executing'; it is instead returned to a re-dispatchable error state with a clear reason recorded. (must_have)
  • A step with one or more stale or missing artifact dependencies is held at the gate and does not enter execution, preventing evidence-free runs and wasted recovery attempts. (must_have)
  • Every gate decision — pass or fail — is recorded as an audit event on the step, visible in the initiative history. (must_have)
  • A step that fails the gate can be re-dispatched once its dependencies are restored, without requiring a manual status override or operator escalation. (must_have)
  • The existing regression suite passes with no new failures introduced by the gate; the gate adds no false positives on healthy dispatch paths. (must_have)
  • The gate's staleness threshold is expressed via the single shared constant already used by the artifact cleanup process, so no duplicate definitions exist in code. (must_have)
  • A targeted test covering the exact failure mode from INI-16 (backfill into a cancelled queue slot following a restore) passes green. (must_have)
  • The audit trail for any gate failure is surfaced on the initiative monitor UI alongside the step's status, so operators do not need to dig through logs to understand why a step was held. (nice_to_have)
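
The targeted-test criterion could look something like the sketch below. It is a toy model, not the real dispatcher: `Queue`, `Step`, `dispatch`, and the slot name `slot-1` are hypothetical, and the assumption that a restore leaves pre-existing slots cancelled is inferred from the failure description rather than from the actual restore code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Queue:
    # Slot id -> state ("active" or "cancelled").
    slots: Dict[str, str] = field(default_factory=dict)

    def restore(self) -> None:
        """Assumed behavior: a restore leaves pre-existing slots cancelled."""
        for slot_id in self.slots:
            self.slots[slot_id] = "cancelled"


@dataclass
class Step:
    slot_id: str
    status: str = "dispatched"
    audit: List[str] = field(default_factory=list)


def dispatch(queue: Queue, step: Step) -> bool:
    """Gate on slot state before marking the step as executing."""
    state = queue.slots.get(step.slot_id, "missing")
    if state != "active":
        step.audit.append(f"gate_fail: slot {step.slot_id} is {state}")
        return False
    step.audit.append("gate_pass")
    step.status = "executing"
    return True


def test_backfill_into_cancelled_slot_is_held() -> None:
    queue = Queue(slots={"slot-1": "active"})
    queue.restore()  # Leaves slot-1 cancelled.
    step = Step(slot_id="slot-1")

    # Backfill into the cancelled slot must be held at the gate.
    assert dispatch(queue, step) is False
    assert step.status != "executing"
    assert step.audit == ["gate_fail: slot slot-1 is cancelled"]

    # Once the slot is live again, re-dispatch needs no manual override.
    queue.slots["slot-1"] = "active"
    assert dispatch(queue, step) is True
    assert step.status == "executing"
    assert step.audit[-1] == "gate_pass"
```

The test function follows the pytest naming convention, so a runner would collect it automatically. It can also be called directly.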