Plan rollback before you need it. The middle of an incident is not the time to figure out which command reverts which database, or whether your snapshots are recent enough.
## When to roll back
Roll back when:

- `/health` returns non-ok and you can’t fix it within minutes (see the sketch after this list).
- Error rate or latency is sharply worse than baseline and not improving.
- A regression eval that passed pre-upgrade is now failing.
- Customer reports confirm the new version is broken.

Don’t roll back for:

- One outlier metric that hasn’t actually impacted users.
- Behavior changes that were expected per the changelog.
- Issues that are clearly local to one bad request rather than the service as a whole.
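A minimal sketch of the first check, assuming the Engine exposes `/health` locally on port 8080 and “within minutes” means a five-minute budget — the URL, window, and poll interval are placeholders:

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # placeholder: your Engine's address
WINDOW_SECONDS = 300                         # the "fix it within minutes" budget
POLL_INTERVAL = 15

def health_ok() -> bool:
    """Return True if /health answers 200 within a short timeout."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def should_roll_back() -> bool:
    """Poll /health for the whole window; recommend rollback if it never recovers."""
    deadline = time.monotonic() + WINDOW_SECONDS
    while time.monotonic() < deadline:
        if health_ok():
            return False  # recovered: keep diagnosing instead of rolling back
        time.sleep(POLL_INTERVAL)
    return True  # non-ok for the entire window: roll back

if __name__ == "__main__":
    print("roll back" if should_roll_back() else "hold")
```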
## What rollback looks like
Rollback has two layers:

- Image rollback — switch the Engine container back to the previous image tag.
- Data rollback — restore the brain database to a pre-upgrade snapshot, if migrations made the new version’s schema incompatible with the old.
### Image rollback
If the new version’s migrations didn’t change the brain schema in a way the old version doesn’t understand, an image rollback alone is enough:
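A minimal sketch, assuming the Engine runs under Docker Compose and the image tag is interpolated from an `ENGINE_TAG` environment variable — the service name `engine`, the variable, and the tag are placeholders for whatever your deployment actually uses:

```python
import os
import subprocess

PREVIOUS_TAG = "v2.3.1"  # placeholder: the tag you were on before the upgrade

def image_rollback(tag: str) -> None:
    """Re-deploy the Engine on the previous image tag."""
    env = {**os.environ, "ENGINE_TAG": tag}
    # Pull first so the actual swap is fast.
    subprocess.run(["docker", "compose", "pull", "engine"], env=env, check=True)
    subprocess.run(["docker", "compose", "up", "-d", "engine"], env=env, check=True)

image_rollback(PREVIOUS_TAG)
```

Verify `/health` against the old version before calling it done.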
### Brain rollback

When a forward-only migration shipped in the new version (most migrations), the old Engine version can’t read the upgraded brain. You have to restore the brain to the pre-upgrade state.

#### Restoring from snapshot
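A sketch under the assumption that the brain is a single SQLite file on a mounted volume and your snapshots are plain file copies — both paths are placeholders. If the brain lives in a database server instead, substitute its own restore tooling (e.g. `pg_restore` for Postgres):

```python
import shutil
import subprocess
from pathlib import Path

BRAIN_PATH = Path("/var/lib/engine/brain.db")               # placeholder
SNAPSHOT = Path("/var/backups/brain/brain-pre-upgrade.db")  # placeholder

def restore_brain(snapshot: Path) -> None:
    """Stop the Engine, swap in the pre-upgrade brain, restart on the old image."""
    subprocess.run(["docker", "compose", "stop", "engine"], check=True)
    # Keep the bad state around for the postmortem instead of destroying it.
    BRAIN_PATH.rename(BRAIN_PATH.parent / (BRAIN_PATH.name + ".bad"))
    shutil.copy2(snapshot, BRAIN_PATH)
    subprocess.run(["docker", "compose", "up", "-d", "engine"], check=True)

restore_brain(SNAPSHOT)
```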
#### What gets lost
Restoring a pre-upgrade snapshot discards everything written after it. Specifically:

- Any episodes, knowledge, or social-graph entries written during the bad window.
- Any conversation history from the bad window.
- Any feedback submitted during the bad window.
#### What’s preserved
- The pre-upgrade brain state, exactly.
- All long-term memory from before the upgrade.
- All migrations applied before the upgrade.
### Volume snapshot rollback
If your infrastructure provides volume snapshots (EBS, GCP persistent disks, etc.), you can roll the brain volume back wholesale:
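A sketch of the EBS variant using boto3 — every ID, the device name, and the availability zone are placeholders from your own infrastructure records, and the Engine should already be stopped (drained, as below) before the volume swap:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholders: fill in from your own records.
SNAPSHOT_ID = "snap-0123456789abcdef0"  # pre-upgrade snapshot of the brain volume
INSTANCE_ID = "i-0123456789abcdef0"
BAD_VOLUME_ID = "vol-0123456789abcdef0"  # the upgraded (bad) volume
DEVICE = "/dev/xvdf"
AZ = "us-east-1a"

# 1. Create a fresh volume from the pre-upgrade snapshot.
vol = ec2.create_volume(SnapshotId=SNAPSHOT_ID, AvailabilityZone=AZ)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# 2. Swap it in for the bad volume (keep the bad one for forensics).
ec2.detach_volume(VolumeId=BAD_VOLUME_ID, InstanceId=INSTANCE_ID, Device=DEVICE)
ec2.get_waiter("volume_available").wait(VolumeIds=[BAD_VOLUME_ID])
ec2.attach_volume(VolumeId=vol["VolumeId"], InstanceId=INSTANCE_ID, Device=DEVICE)
```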
### Drain before stopping

The Engine has graceful-shutdown logic. Stopping with a drain timeout lets in-flight turns complete:
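For example, with Docker Compose’s stop timeout (the 60-second budget is a placeholder; size it to your slowest in-flight turn):

```python
import subprocess

DRAIN_SECONDS = 60  # placeholder: long enough for your slowest in-flight turn

# SIGTERM triggers the Engine's graceful shutdown; the timeout is how long
# compose waits for in-flight turns to finish before escalating to SIGKILL.
subprocess.run(
    ["docker", "compose", "stop", "--timeout", str(DRAIN_SECONDS), "engine"],
    check=True,
)
```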
## Communicating

While rolling back:

- Post in the incident channel: “Rolling back to v2.3.1 due to $symptom.”
- Update the status page if customer impact was real.
- After rollback completes: confirm `/health` is ok and errors have recovered, then post the all-clear.
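A sketch of those posts, assuming the incident channel has a Slack-style incoming webhook — the URL and messages are placeholders:

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post(text: str) -> None:
    """Post one status line to the incident channel via an incoming webhook."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

post("Rolling back to v2.3.1 due to elevated error rate.")
# ...after the rollback lands and /health is ok again:
post("Rollback complete: /health ok, error rate back to baseline. All clear.")
```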
## Post-rollback
Once the rollback is in and stable:

- Capture the diagnostics: logs, traces, and metrics from the bad window. You’ll need them to fix forward.
- Don’t redeploy in panic. Whatever broke needs a root cause first.
- File a postmortem ticket. See Postmortems.
- Add a regression eval: the exact failure mode should become a permanent test case. See Regression, and the sketch below.
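A placeholder shape for that regression eval, assuming pytest; replace the request and assertion with the exact behavior the bad release broke:

```python
# test_rollback_regression.py — skeleton only; pin the incident's real failure mode.
import urllib.request

ENGINE_URL = "http://localhost:8080"  # placeholder

def test_health_stays_ok_after_upgrade():
    """The pre-upgrade behavior the bad release broke, kept as a permanent test."""
    with urllib.request.urlopen(f"{ENGINE_URL}/health", timeout=5) as resp:
        assert resp.status == 200
```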
## When rollback isn’t possible
Some failure modes can’t be cleanly rolled back:

- The brain is corrupted and snapshots are unusable. You’re now in a data-recovery scenario; involve whoever owns the brain.
- Migrations dropped data that the old version expects. Same scenario.
- The new version sent destructive MCP calls (deleted external resources, sent emails). The Engine’s state is recoverable; the external state isn’t.
## Practice rollback
Run a rollback drill in staging quarterly. The exercise:

- Bring staging up on the current production version.
- Take a snapshot.
- Upgrade staging to a deliberately broken version (e.g. an internal build with a known bug).
- Roll back using the procedure above.
- Time how long it took and identify what slowed you down (see the timing sketch below).
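One way to get honest numbers out of the last step: wrap each drill step in a timer. A minimal sketch; the step bodies are stubs to fill in with your actual procedure:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def step(name: str):
    """Time one drill step so the retro has real numbers, not guesses."""
    start = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - start

with step("snapshot"):
    ...  # take the staging snapshot
with step("upgrade to broken build"):
    ...  # deploy the deliberately broken version
with step("rollback"):
    ...  # run the rollback procedure above

# Slowest steps first: these are what to fix before the next drill.
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.1f}s")
```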
## See also
- Upgrade — how to do the upgrade so rollback isn’t needed.
- Postmortems — what comes after.
- Database — backup and restore details.

