Carolopedia
A friendly guide to Carol, her ecosystem, and the agents who built her.
📖About
Milder cousin of CAROL-INI-1645, which fixed the planner_db shared-writer leak/permanent wedge. Hermione's scheduler (he_sched_01) holds the single planner-DB write lock across its per-tick dispatch loop. That loop makes HTTP calls, so the write lock is held across network I/O and stays held near-continuously. Effect: other writers are starved. Even a 1-row scheduled_jobs UPDATE could not land; 20 attempts over ~30s all failed with "database is locked". This is not a permanent wedge, since wal_checkpoint(PASSIVE) still succeeds, but the scheduler monopolizes the single writer. Fix (same shape as INI-1397): do the DB write under the lock, then run dispatch, HTTP, and propagation outside the lock (or in the background). Acceptance: a concurrent 1-row planner-DB write completes within busy_timeout while the scheduler is dispatching. Side task once fixed: move gen-tasks-01 daily_at from 02:00 to 00:05 (currently blocked by this issue).
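The shape of the fix can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: the function names (claim_due_jobs, tick), the dispatch callback, and the columns job_id, url, next_run, and last_run are all assumptions made for the example. The point it shows is that the write transaction is committed before any network call starts.

```python
import sqlite3
import time
from typing import Callable, List, Optional, Tuple

Job = Tuple[str, str]  # (job_id, url); hypothetical columns for this sketch


def claim_due_jobs(conn: sqlite3.Connection, now: float) -> List[Job]:
    """Select due jobs and stamp last_run inside one short write transaction.

    Assumes conn was opened with isolation_level=None so that BEGIN/COMMIT
    are issued explicitly.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        due = conn.execute(
            "SELECT job_id, url FROM scheduled_jobs WHERE next_run <= ?",
            (now,),
        ).fetchall()
        conn.executemany(
            "UPDATE scheduled_jobs SET last_run = ? WHERE job_id = ?",
            [(now, job_id) for job_id, _ in due],
        )
        conn.execute("COMMIT")  # lock is released before any network I/O
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return due


def tick(
    conn: sqlite3.Connection,
    dispatch: Callable[[str, str], None],
    now: Optional[float] = None,
) -> int:
    """Run one scheduler tick and return the number of jobs dispatched."""
    now = time.time() if now is None else now
    due = claim_due_jobs(conn, now)  # the only place the write lock is held
    for job_id, url in due:
        dispatch(job_id, url)  # HTTP happens here, with no transaction open
    return len(due)
```

With this split, the lock is held for roughly the duration of one SELECT and a handful of UPDATEs per tick, so a concurrent writer waiting on busy_timeout should get its turn between ticks and during dispatch.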
⚖️Decisions
- Move ONLY the scheduled_jobs (job-list) table out of plangenerator.db into a dedicated scheduler.db; leave droid_runs (the run-audit ledger) in place. — Map of the codebase shows scheduled_jobs is referenced in exactly one module (scheduler_registry.py) with zero cross-table joins, so it moves cleanly. droid_runs is JOINed against executions in 16+ places and written by ~10 files — it is a genuinely shared pipeline ledger, not incidental co-location, so moving it would break joins everywhere and not reduce this contention. The reported starvation symptom was specifically a scheduled_jobs update failing, which this fixes directly. (orion)
- scheduler.db lives at /home/caroladmin/dev/data/ alongside the other canonical DBs (registry.db etc.) so it is covered by the existing backup. — Keeps the new file in the backed-up data home rather than under an app dir. (orion)
- Old scheduled_jobs table in plangenerator.db is left intact (unused) as a one-session fallback, not dropped in the same change as the migration. — feedback_prod_fallback: never remove a data source in the same deploy as the migration that replaces it. (orion)
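The migration these decisions describe could look something like the sketch below. It is an assumption-laden illustration rather than the change that was shipped: the function name migrate_scheduled_jobs and the schema alias legacy are made up for the example, and it copies only the table definition and rows (any indexes or triggers would need separate handling). The old table is only read, never dropped or modified.

```python
import sqlite3


def migrate_scheduled_jobs(old_db_path: str, new_db_path: str) -> int:
    """Copy scheduled_jobs from the old DB into the new one.

    Returns the number of rows copied. The old table is left intact.
    """
    conn = sqlite3.connect(new_db_path, isolation_level=None)
    try:
        # ATTACH must run outside a transaction; the old DB is only read.
        conn.execute("ATTACH DATABASE ? AS legacy", (old_db_path,))
        row = conn.execute(
            "SELECT sql FROM legacy.sqlite_master "
            "WHERE type = 'table' AND name = 'scheduled_jobs'"
        ).fetchone()
        if row is None:
            raise RuntimeError("scheduled_jobs not found in the old database")

        # Deferred BEGIN: the write lock is taken on the new DB only.
        conn.execute("BEGIN")
        try:
            conn.execute(row[0])  # replay the original CREATE TABLE in main
            conn.execute(
                "INSERT INTO main.scheduled_jobs "
                "SELECT * FROM legacy.scheduled_jobs"
            )
            (old_n,) = conn.execute(
                "SELECT COUNT(*) FROM legacy.scheduled_jobs"
            ).fetchone()
            (new_n,) = conn.execute(
                "SELECT COUNT(*) FROM main.scheduled_jobs"
            ).fetchone()
            if old_n != new_n:
                raise RuntimeError(f"row count mismatch: {old_n} != {new_n}")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
        return new_n
    finally:
        conn.close()
```

Because the copy and the count check run in one transaction on the new file, a failed migration leaves scheduler.db without a half-filled table, and the fallback copy in plangenerator.db is untouched either way.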
✅Success criteria
- A concurrent 1-row planner-DB write completes within busy_timeout while the scheduler is dispatching; the scheduler holds the write lock only for the DB write, not across HTTP/dispatch. (must_have)
- The job-list table is served from a dedicated scheduler.db separate from plangenerator.db. (must_have)
- A job-list update lands within busy_timeout while plangenerator.db is under a concurrent write lock (contention test passes). (must_have)
- All 27 scheduled jobs are migrated intact, and list_jobs returns them from the new DB. (must_have)
- A live scheduler tick dispatches due jobs and updates last-run state in the new DB with no errors. (must_have)
- Grep confirms no caller outside scheduler_registry.py touches the job-list table. (must_have)
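One way to exercise the contention criterion is sketched below. It is a hedged example, not the project's actual test: the helper name write_lands_under_contention and the columns job_id and last_run are hypothetical. It holds a write lock on one database file with BEGIN IMMEDIATE and checks whether a 1-row UPDATE on a target file commits within the busy timeout.

```python
import sqlite3


def write_lands_under_contention(
    locked_db: str,
    target_db: str,
    job_id: str,
    busy_timeout_s: float = 5.0,
) -> bool:
    """Return True if a 1-row UPDATE on target_db commits while another
    connection holds the write lock on locked_db."""
    holder = sqlite3.connect(locked_db, isolation_level=None)
    # timeout= sets the busy timeout for this connection, in seconds.
    writer = sqlite3.connect(
        target_db, timeout=busy_timeout_s, isolation_level=None
    )
    try:
        holder.execute("BEGIN IMMEDIATE")  # take and keep the write lock
        try:
            # Autocommit mode: this UPDATE is its own write transaction.
            writer.execute(
                "UPDATE scheduled_jobs SET last_run = 0 WHERE job_id = ?",
                (job_id,),
            )
            return True
        except sqlite3.OperationalError as exc:
            if "locked" in str(exc):
                return False  # starved: busy timeout expired
            raise
    finally:
        if holder.in_transaction:
            holder.execute("ROLLBACK")
        holder.close()
        writer.close()
```

Passing the same path for both arguments reproduces the original starvation symptom (the call returns False once the timeout expires), while passing plangenerator.db as the locked file and scheduler.db as the target should return True after the split.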