
Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.

Troubleshooting the initiative


Block · Pipeline stage in Build Initiatives

📖About & Usage

About

Troubleshooting the initiative — triaging pipeline failures and auto-remediating them, including frontend fixes, resume/escalation of parked work, and retriggering fixed initiatives. Optional block.

Where it fits

This is one stage of the Build Initiatives service. The owner and the agents who run it are listed under the team below, and the other blocks of the service are linked at the bottom of this page.

🛠️Team & droids

Albus · Block owner

Troubleshooting is the optional block where pipeline failures get triaged and auto-remediated. Albus is its primary first-line engineer: the agent who investigates a broken execution and either fixes the root cause or produces a precise handoff.

The Albus Failure Watcher is a one-shot scan, fired by a 30-second timer, over executions, droid_runs and handshakes. It detects pipeline-stopping failures (terminal exec failures, timed-out runs, idempotency-suppressed handshakes) and routes them to the troubleshooter with dedup.

The Autonomous Troubleshooter then works under a default 10-minute, low-cost bound with Orion's read-mostly toolkit (Bash, Read, Edit, sqlite3, grep, find, git). It investigates the failed execution and either fixes the root cause in code/SQL within its write-scope or produces an actionable handoff that names the file and line and proposes a patch. It traces upstream when the immediate cause is downstream, and it cites real evidence for every claim.

The Albus Reviewer validates those verdicts against per-verdict evidence requirements in a bounded reject-retry loop (max 2) before forcing ESCALATE_ORION. The Initiative History droid surfaces prior Albus fix history at exec/step/initiative granularity so that the same verdict is not re-attempted. The Albus Resume Watcher closes the self-healing loop by waiting on a parked initiative's bypass outcome and either retriggering it (bypass closed) or escalating both to the operator queue (bypass blocked).

This matters because without an autonomous triager, every pipeline failure would stall until a human looked. The block fires on the 30-second detection timer when a failure appears, with the evidence-review, escalate, and resume/escalate paths covering the unproven-verdict and bypass-outcome scenarios.
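The Albus Reviewer's bounded reject-retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the verdict kinds FIXED and HANDOFF, the evidence keys, and the names `Verdict`, `missing_evidence` and `review_verdict` are all hypothetical, and "max 2" is read here as two retries after the first rejected verdict. Only ESCALATE_ORION comes from this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# "max 2" from the guide, assumed to mean two retries after the first verdict.
MAX_REJECT_RETRIES = 2

# Hypothetical per-verdict evidence requirements; the real ones are not listed here.
REQUIRED_EVIDENCE = {
    "FIXED": {"patch_ref", "rerun_log"},
    "HANDOFF": {"file_line", "proposed_patch"},
}


@dataclass
class Verdict:
    kind: str
    evidence: dict = field(default_factory=dict)


def missing_evidence(verdict: Verdict) -> List[str]:
    """Return the evidence keys a verdict still lacks (sorted for stable feedback)."""
    if verdict.kind not in REQUIRED_EVIDENCE:
        return ["known_verdict_kind"]  # unknown verdicts are never accepted
    required = REQUIRED_EVIDENCE[verdict.kind]
    return sorted(key for key in required if not verdict.evidence.get(key))


def review_verdict(produce_verdict: Callable[[List[str]], Verdict]) -> Verdict:
    """Ask for a verdict, reject it while evidence is missing, escalate when retries run out."""
    feedback: List[str] = []
    for _ in range(1 + MAX_REJECT_RETRIES):
        verdict = produce_verdict(feedback)
        feedback = missing_evidence(verdict)
        if not feedback:
            return verdict  # every required piece of evidence is present
    # Still unproven after the retry budget: force escalation.
    return Verdict("ESCALATE_ORION", {"rejected_for": feedback})
```

The troubleshooter is passed in as a callable that receives the previous rejection's feedback, so in this sketch each retry knows which evidence was missing.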

Elrond

Elrond's troubleshooting role is retriggering: getting a blocked or closed initiative moving again once the pipeline that broke it has been fixed.

The Initiative Retrigger files a follow-on of the blocked or closed parent. It loads the parent, runs the blocked-parent remediation gate (refusing with a structured 409 unless override_blocked_parent_gate is set), parses the parent title, builds the retry intent, and routes it through the Author plus Validator orchestrator to create the follow-on. It then auto-enqueues that follow-on so the autonomous resume loop completes end to end. It carries its own DB handle and title parser to stay self-contained.

This matters because a fix to the pipeline is worthless if the initiatives it unblocked just sit there closed or blocked. Retrigger is what turns a repair into resumed work, while the remediation gate stops premature retriggers of parents that aren't actually ready.

It fires when an operator retriggers an initiative through the retrigger endpoint, in the optional troubleshoot block, after the underlying failure has been remediated. The structured-409 path covers the case where the blocked-parent gate refuses and no override is given, and the auto-enqueue ensures the follow-on doesn't land in planned-but-undispatched limbo.
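The retrigger flow above, gate first and auto-enqueue last, might look something like this in outline. It is a sketch under stated assumptions: only override_blocked_parent_gate and the 409 status come from this page. The `Parent` fields, the `remediated` flag the gate checks, the response shapes, the "Retry:" title format and the 201 status are hypothetical, and the Author plus Validator orchestrator is reduced to a plain dictionary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Parent:
    id: int
    title: str
    status: str        # "blocked" or "closed"
    remediated: bool   # hypothetical: whether the underlying failure has been fixed


def retrigger(parent: Parent, override_blocked_parent_gate: bool = False) -> Tuple[int, dict]:
    """Return (http_status, body) for a retrigger request on an already loaded parent."""
    # Blocked-parent remediation gate: refuse with a structured 409 unless overridden.
    gate_refuses = parent.status == "blocked" and not parent.remediated
    if gate_refuses and not override_blocked_parent_gate:
        return 409, {
            "error": "blocked_parent_gate",
            "parent_id": parent.id,
            "hint": "set override_blocked_parent_gate to bypass",
        }

    # Parse the parent title and build the retry intent (hypothetical title format).
    retry_title = f"Retry: {parent.title.strip()}"

    # Stand-in for the Author plus Validator orchestrator creating the follow-on.
    follow_on = {"parent_id": parent.id, "title": retry_title, "status": "planned"}

    # Auto-enqueue so the follow-on is not left planned but undispatched.
    follow_on["status"] = "enqueued"
    return 201, follow_on
```

Returning the refusal as a status plus a structured body, rather than raising, keeps the gate's reason machine-readable for whoever called the endpoint.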

Sage

Sage's troubleshooting role is first-line frontend repair: diagnosing and fixing Layer 1 and Layer 2 frontend check failures before they need to be escalated.

The Frontend Troubleshooter receives reports of failed Layer 1 (static) and Layer 2 (browser) frontend checks, reads the failing app's source files, uses Claude to diagnose the root cause and generate a fix, applies that fix to the app's code, and re-runs the checks. It repeats the diagnose-and-fix cycle up to 3 times if the first attempt doesn't work.

This matters because frontend failures are common and mechanical enough that having Sage hand-fix each one would waste his strategic attention. Automating the first-line repair keeps the pipeline moving and reserves human-or-architect time for genuinely hard cases.

It fires in the optional troubleshoot block when frontend checks fail, after the review gates have flagged a static or browser problem. The three-attempt loop covers transient or partial fixes, and any failure still unresolved after those attempts is handed off to Albus for second-line support.
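The diagnose-and-fix cycle above is a bounded retry loop with a handoff at the end. A minimal sketch, assuming the checks and the Claude-backed fixer are supplied as plain callables; `repair_frontend`, the status strings and the result shape are hypothetical names chosen for illustration:

```python
from typing import Callable, List

MAX_ATTEMPTS = 3  # "up to 3 times" from the guide


def repair_frontend(
    run_checks: Callable[[], List[str]],
    diagnose_and_fix: Callable[[List[str]], None],
) -> dict:
    """Re-run checks after each fix attempt; hand off to Albus if failures remain."""
    failures = run_checks()
    attempts = 0
    while failures and attempts < MAX_ATTEMPTS:
        diagnose_and_fix(failures)  # diagnose the root cause and apply a fix
        attempts += 1
        failures = run_checks()     # Layer 1 and Layer 2 checks run again
    if failures:
        # Unresolved after the attempt budget: second-line support takes over.
        return {"status": "handoff_to_albus", "attempts": attempts, "failures": failures}
    return {"status": "fixed", "attempts": attempts, "failures": []}
```

Because the checks are re-run after every attempt, a partial fix shortens the failure list passed to the next attempt instead of restarting from the original report.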

👤Owner

Albus · Architect

🧱Other blocks in Build Initiatives

Filing an initiative · Sprint Planning · Planning an initiative · Planning the execution of a step · Executing the step · Reviewing the step · Reviewing the initiative · Judging the initiative · Monitoring the execution · Replanning the initiative · User Acceptance Testing · Support