The Self-Healing Agentic System
Hermione is Carol's central watcher — she continuously monitors every scheduled process, records each trigger, and instantly detects failures: a job that never ran, a sync that timed out, a health check that hung. But Hermione doesn't diagnose the root cause or prescribe the fix. She files one symptom-initiative (which process, what went wrong) and hands it to the pipeline. This separation of duties is the spine of the system: the agent that sees everything can't also know how to fix everything. Elrond and the build team diagnose and repair; only when they're stuck does Albus, the architect, step in. By centralizing detection but distributing expertise, the system stays lean, accountable, and trustworthy.
When a scheduled process fails, Hermione detects it and files a symptom-initiative into the standard pipeline. Elrond formalizes the plan, the build team implements the fix, Argus tests it. But here's the critical step: the user-acceptance test doesn't just check the code — it re-runs the exact process that originally failed and confirms it now succeeds. If Hermione's monitor is green again, the initiative closes. Healing isn't a shipped patch; it's a verified return to normal operations.
In older systems, a scheduled job that silently failed to run could hide for weeks — the absence was invisible, just a void in the logs. Hermione changed that: every scheduled trigger is now recorded as an intent before it fires. If the trigger fails to execute, the intent sits unfinished — a structural fact the system sees and alerts on. That asymmetry — intent without outcome — is how silent gaps become detectable failures. One missed job becomes a filed initiative, diagnosed and healed without a human noticing.
Most process failures are routine: a timeout, a filled quota, a config that drifted. The standard pipeline diagnoses and fixes these without hesitation. But some failures are tangled — the symptoms are complex, the system is in an unmapped state, the standard troubleshooting reaches a wall. When that happens, Albus (the architect) steps in. He gathers the full picture, diagrams what's really going wrong, and works out the real fix. The pipeline doesn't replace human judgment; it only reaches its limits and asks for it.
Updates
When multiple agents watch the same process, they must coordinate or you've made the blind spot bigger than before. Elrond's stuck-watcher was running under both systemd and cron — two independent schedulers, each unaware the other was watching. With no coordination, neither one saw the full picture, and Git Backup was also generating false alarms as concurrent updates corrupted the signal. The fix was straightforward: consolidate the watchers and add a lock. But the deeper principle stands: if your supervisors can't see each other, you've hidden the failures you were trying to detect. Supervision infrastructure must be as visible and coordinated as the system it protects.
A safety gate that reads stale data is no gate at all — it's a false wall your agents trust while the real threat walks past. We just found one: a preflight gate that was still checking an on-disk database that the relay had cut over from weeks ago. The gate saw green every time. It wasn't protecting anything — it was just a warm feeling. In an agentic pipeline, every gate and watcher must prove it still reads live data, or it becomes worse than useless: it becomes a reason to stop looking.
In an agentic system, supervision infrastructure has a blind spot: it can fail silently if no one is watching it. The Albus failure watcher — which detects when the architect gets stuck on unmapped problems — was orphaned during a scheduler migration and operated without oversight for weeks. If Albus had gotten trapped, no alarm would sound. The watcher is now restored to Hermione's supervision, but the lesson stands: every agent's accountability must be monitored, including the monitors themselves. A blind detection system isn't a safety net; it's risk hidden as trust.