
Carolopedia

A friendly guide to Carol, her ecosystem, and the agents who built her.


CAROL-INI-2074-00: Backup source-archive step hangs/times out on Carol's ~390MB git archive over ssh

Initiative

📖About

The daily Orion backup's source-code-archive step (git archive --format=tar.gz HEAD of Carol's ~390MB dev repo, streamed over ssh to a remote tmp, then rsynced) repeatedly times out (Operation timed out / Connection reset / Broken pipe) and now hangs for 15+ minutes, which blocks the backup from reaching its manifest and off-site git push. The problem is pre-existing (the retry + remote-tmp logic from CAROL-INI-457 is already present) and separate from the DB path fixes in CAROL-INI-2071. Investigate: repo/.git bloat, archive size, ssh keepalive/timeout tuning, or a switch to an incremental/rsync-of-worktree approach.
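
As a starting point for the repo-bloat part of that investigation, the sketch below totals on-disk bytes per file extension for a list of tracked paths, so that large non-source categories stand out. It is an illustrative helper, not part of the backup script: the function name `size_by_extension` is hypothetical, and the path list is assumed to come from `git ls-files -z`.

```python
import os
from collections import defaultdict


def size_by_extension(root, paths):
    """Sum on-disk bytes per file extension for repo-relative paths.

    `paths` is assumed to be the tracked-file list (for example the
    NUL-split output of `git ls-files -z`); `root` is the worktree root.
    """
    totals = defaultdict(int)
    for rel in paths:
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            continue  # tracked but missing from the worktree, or an empty entry
        ext = os.path.splitext(rel)[1].lower() or "(none)"
        totals[ext] += os.path.getsize(full)
    # Largest first, so bloat candidates surface at the top.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Feeding it the tracked-file list and printing the first few rows gives a quick view of which extensions dominate the archive size.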

⚖️Decisions

  • Elrond's bypass methodology checklist (a reminder, not a gate -- you've got this):
      0. File it requested_mode='bypass' (planner-vs-bypass is a deliberate choice). bypass_start REFUSES a non-bypass initiative (CAROL-INI-1846), and the dispatcher only skips the bypass lane when the mode says bypass -- a 'planner' mistag lets Merlin's pipeline grab the placeholder step and block your finished work.
      1. Filed as planned status -- let the bypass claim/activate it; never file active.
      2. Open the bypass (bypass_start) with your droid id + the remediation answer (remediates_initiative_id=NNN, or remediates_nothing=True).
      3. Work the blocks for your work-type: template -> design -> code -> test -> review. Do the real work; record decisions on the initiative as you make them.
      4. Reality is recorded for you at close -- code (files changed), each decision, and the twin-review verdict become real activities tied to this initiative and show in the Activity Tracker like a planner run (CAROL-INI-1840). No dummy rows.
      5. Keep the initiative status moving; it parks in 'reviewing' and is tagged uat-pending for you at close (CAROL-INI-1836), so the stuck-watchdog leaves it alone until UAT.
      6. Close runs the gates (design/architecture compliance + caller-audit). If a gate flags something pre-existing or unrelated to your change, waive it with a clear written rationale -- audit, don't skip.
      7. Bypass skips the planner's auto-orchestration, NOT the standards. Same template checklist, same review, same observability as a planner run. (elrond)
  • [status-router] planned -> executing | event=bypass_executing | bypass transition (or-bx-01)
  • Root cause: ~1.5GB of NON-source content is committed to the dev repo (data/ store, *.db and *.db.* snapshots, hundreds of 2MB avatar PNG/MP4 assets under apps/*/static/avatars, and *.pre-/*.bak per-edit backup copies). git archive HEAD gzipped all of it into an ~880MB tarball that took 3+ min to produce, and the idle ssh stream broke (broken pipe) before it finished, so all 3 retries were exhausted and the backup never reached the manifest/off-site push. (orion)
  • Fix: source backup now streams only TRACKED SOURCE from the working tree (git ls-files -z | tar) excluding data, DBs, media (png/jpg/gif/webp/mp4/mov/mmdb/pdf/zip/tar.gz) and *.pre/*.bak/.cleanup backups; single-quoted pathspecs so the remote shell does not glob; set -o pipefail; ssh ServerAliveInterval keepalive. Result: 6.6MB carol / 754KB bb in ~40s each. Excluded media/DBs are recoverable from the dedicated DB/media steps + the git remotes. (orion)
  • [status-router] executing -> reviewing | event=bypass_reviewing | bypass transition (or-bx-01)
  • Verified: full backup run produced carol/source.tar.gz (6.6MB, 922 py files) + bb/source.tar.gz (754KB) and progressed PAST [2d] for the first time. Repo bloat itself (committed avatars/backups) is a separate hygiene issue, not fixed here. (orion)
  • [status-router] reviewing -> closed | event=operator_signoff | Auto-accepted (CAROL-INI-1859): Orion-initiated, >2 days in reviewing with no objection. (el-srac-01)
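
The fix recorded in the decisions above (tracked files only, minus data, DBs, media and backup copies, sent over an ssh connection with keepalives) can be approximated in Python as follows. This is a sketch under stated assumptions, not the backup script itself, which per the decision log uses git ls-files -z | tar in shell. The exclusion lists are paraphrased from the fix note, the rule that any directory named `data` is skipped is a guess, the keepalive values 30 and 4 are placeholders, and every function name here is hypothetical.

```python
import fnmatch
import os
import shlex
import subprocess
import tarfile

# Assumed exclusion rules, paraphrased from the fix note; the real list may differ.
EXCLUDED_DIRS = {"data"}
EXCLUDED_SUFFIXES = (".png", ".jpg", ".gif", ".webp", ".mp4", ".mov",
                     ".mmdb", ".pdf", ".zip", ".tar.gz")
EXCLUDED_PATTERNS = ("*.db", "*.db.*", "*.pre", "*.pre-*", "*.bak", "*.cleanup")


def parse_ls_files(raw: bytes) -> list[str]:
    """Split the NUL-delimited output of `git ls-files -z` into paths."""
    return [p.decode("utf-8", "surrogateescape") for p in raw.split(b"\0") if p]


def is_source(path: str) -> bool:
    """Return True when a tracked path belongs in the source archive."""
    *dirs, name = path.split("/")
    if EXCLUDED_DIRS.intersection(dirs):
        return False
    name = name.lower()
    if name.endswith(EXCLUDED_SUFFIXES):
        return False
    return not any(fnmatch.fnmatchcase(name, pat) for pat in EXCLUDED_PATTERNS)


def tracked_source(repo: str) -> list[str]:
    """List tracked files in `repo` that pass the source filter."""
    raw = subprocess.run(["git", "-C", repo, "ls-files", "-z"],
                         check=True, capture_output=True).stdout
    return [p for p in parse_ls_files(raw) if is_source(p)]


def write_archive(repo: str, paths: list[str], out_path: str) -> None:
    """Write the given repo-relative paths into a gzipped tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        for rel in paths:
            full = os.path.join(repo, rel)
            if os.path.lexists(full):  # skip files tracked but deleted locally
                tar.add(full, arcname=rel, recursive=False)


def ssh_upload_command(host: str, remote_path: str) -> list[str]:
    """Build an ssh command that writes stdin to `remote_path`.

    The remote path is quoted so the remote shell does not glob or split it;
    the keepalive values are placeholders, not the deployed settings.
    """
    remote = "cat > " + shlex.quote(remote_path)
    return ["ssh", "-o", "ServerAliveInterval=30",
            "-o", "ServerAliveCountMax=4", host, remote]
```

Filtering on the tracked-file list, rather than archiving HEAD wholesale, is what keeps the committed bloat out of the stream. Quoting the remote path with `shlex.quote` plays the same role as the single-quoted pathspecs mentioned in the fix.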

Success criteria

  • Daily backup source-archive step completes without hanging (bounded time, no broken-pipe retries exhausted) (must_have)
  • Source archive contains the real code (>=900 .py files) and excludes the bloat (data DBs, media, committed backup copies) (must_have)
  • Backup run reaches its manifest + off-site push step (no longer blocked on source archive) (must_have)
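
The second criterion lends itself to an automated check on the finished tarball. The sketch below counts .py members and reports any member whose suffix looks like excluded bloat. It is an illustrative check only: the name `check_source_archive` and the banned-suffix list are assumptions, and it does not cover the timing or manifest criteria.

```python
import tarfile

# Assumed suffixes that should never appear in the source archive.
BANNED_SUFFIXES = (".db", ".png", ".jpg", ".gif", ".webp",
                   ".mp4", ".mov", ".bak", ".pre")


def check_source_archive(archive_path, min_py_files=900):
    """Check .py count and bloat exclusion for a gzipped source tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [member.name for member in tar if member.isfile()]
    py_files = sum(1 for name in names if name.endswith(".py"))
    leaked = sorted(n for n in names if n.lower().endswith(BANNED_SUFFIXES))
    return {
        "py_files": py_files,
        "leaked": leaked,
        "ok": py_files >= min_py_files and not leaked,
    }
```

Running it against carol/source.tar.gz after a backup would report the .py count alongside any leaked files, which makes a regression in the exclusion rules visible without unpacking the archive by hand.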