Upgrading the Engine is straightforward when you do it deliberately and catastrophic when you don’t. This page covers the standard upgrade path plus what to do when something doesn’t go to plan.
Pre-flight
Before any upgrade (a scripted sketch of the checkable steps follows the list):
- Read the Changelog. Identify breaking changes.
- Read the relevant migration guide.
- Snapshot the brain volume. This is your rollback insurance.
- Run regression evals on the current version to establish a baseline.
- Plan a rollout window. Even patch releases can shift behavior.
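Two of these items (the health check and the eval baseline) are easy to script. A minimal sketch, assuming an HTTP Engine at localhost:8080 and a stand-in eval harness; both the address and the harness are assumptions, not documented interfaces:

```python
import json
import urllib.request

ENGINE = "http://localhost:8080"  # assumed address; not specified on this page

def check_health() -> None:
    # The page only promises that /health "returns ok"; checking for a
    # 200 status is an assumption about the response shape.
    with urllib.request.urlopen(f"{ENGINE}/health", timeout=5) as resp:
        assert resp.status == 200, f"Engine unhealthy before upgrade: {resp.status}"

def run_regression_evals() -> dict:
    # Stand-in for your real eval harness; returns metric -> score.
    return {"pass_rate": 0.97}

def record_baseline(path: str = "evals-baseline.json") -> None:
    # Persist current-version scores so the post-upgrade run has a baseline.
    with open(path, "w") as f:
        json.dump(run_regression_evals(), f, indent=2)

check_health()
record_baseline()
```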
Standard upgrade — patch and minor versions
Patch (X.Y.Z → X.Y.Z+1) and minor (X.Y → X.Y+1) versions are backwards-compatible. Behavior may shift; the API and brain schema don’t break.

Steps
Restart the Engine on the new image, then verify. If /health returns ok and evals are green, you’re done.
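For example, a small poll loop that waits for the restarted Engine to come back healthy before you call the upgrade done; the URL, port, and timings below are assumptions:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str = "http://localhost:8080/health",
                    timeout_s: float = 120.0) -> None:
    # Poll /health until it answers 200, or give up after timeout_s.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # not up yet; keep polling
        time.sleep(2)
    raise TimeoutError(f"{url} not healthy after {timeout_s}s")

wait_for_health()
```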
Standard upgrade — major versions
Major versions (vN → vN+1) include breaking changes. You’ll typically need code changes (new request shapes, new env vars) plus the upgrade.

Steps
- Read the migration guide. All of it.
- Update your application code to match the new API.
- Test in staging. Bring up the new Engine in a staging environment with a copy of production’s brain. Confirm /health, send representative requests, watch for regressions (a smoke-test sketch follows these steps).
- Run full evals on staging.
- Deploy with a rollout strategy. Canary first; flip the rest if the canary is clean.
- Monitor for the first hour. Look for elevated errors, latency spikes, unusual log patterns.
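The “send representative requests” step can be a small replay script. A sketch; the staging host, request path, and payload shape are invented for illustration and are not the real Engine API:

```python
import json
import urllib.request

STAGING = "http://engine-staging.internal:8080"  # assumed staging address

# Representative requests captured from production traffic. The /turn path
# and payload shape below are placeholders, not the documented API.
SMOKE_REQUESTS = [
    {"path": "/turn", "payload": {"input": "summarize yesterday's notes"}},
    {"path": "/turn", "payload": {"input": "what's on my calendar?"}},
]

def smoke_test() -> None:
    for req in SMOKE_REQUESTS:
        body = json.dumps(req["payload"]).encode()
        r = urllib.request.Request(
            STAGING + req["path"], data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(r, timeout=30) as resp:
            assert resp.status == 200, f"{req['path']} -> {resp.status}"

smoke_test()
```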
Canary rollout
For deployments that serve real users, deploy to a small percentage of users first:

Per-user-engine model
If you run one Engine per user (the typical hosted model), pick 1–5% of users and route them to the new Engine version. Watch their session traces; if behavior is normal, expand.
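One way to pick the 1–5% deterministically is to hash user IDs into buckets, so a given user stays on the same version across sessions. A sketch; the version labels are placeholders:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    # Stable bucketing: the same user always hashes to the same bucket,
    # so nobody flips between Engine versions mid-session.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:2], "big") % 100 < percent

# Example: route ~5% of users to the new Engine version.
for uid in ("user-017", "user-042", "user-893"):
    target = "engine:new" if in_canary(uid, 5) else "engine:current"
    print(uid, "->", target)
```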
Single Engine

If you run a single Engine for shared internal use, you don’t have canary granularity. Deploy at a low-traffic time, watch closely, roll back if needed.
Rolling back

When the new version is bad, roll back fast.

Image rollback
Restart the Engine on the previous image tag. This works unless the new version ran brain migrations; in that case, expect MIGRATION_REQUIRED errors when you try to start the old version, and fall back to a brain rollback.
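If the deployment is container-based, the rollback is a stop-and-restart on the previous tag. A sketch assuming a Docker-style setup; the container name, image name, and volume layout are invented:

```python
import subprocess

def rollback_image(container: str, old_tag: str) -> None:
    # Stop the bad version and restart on the previous image tag.
    # `docker stop` sends SIGTERM first, letting graceful shutdown drain.
    subprocess.run(["docker", "stop", container], check=True)
    subprocess.run(["docker", "rm", container], check=True)
    subprocess.run([
        "docker", "run", "-d", "--name", container,
        "-v", "brain:/data",  # assumed brain volume mount
        f"registry.example/engine:{old_tag}",  # assumed image name
    ], check=True)

rollback_image("engine", "1.4.2")  # hypothetical previous version
```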
Brain rollback
If migrations broke the brain or the new version corrupted data, stop the Engine and restore the pre-upgrade snapshot. You lose whatever the brain accumulated after the snapshot, but you get back a known-good state.
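A sketch of the restore, assuming the snapshot is a plain directory copy of the brain volume; the paths are illustrative:

```python
import shutil
from pathlib import Path

def restore_brain(snapshot: Path, brain: Path) -> None:
    # Stop the Engine first; never swap the brain under a running process.
    if brain.exists():
        shutil.rmtree(brain)          # discard the broken/corrupted state
    shutil.copytree(snapshot, brain)  # restore the pre-upgrade snapshot

restore_brain(Path("/snapshots/brain-pre-upgrade"), Path("/data/brain"))
```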
Migrations that aren’t reversible

Most Engine migrations are forward-only. They add tables, add columns with defaults, populate caches. They don’t drop or rename anything critical. If a migration is destructive (drops a column, transforms data), the migration guide will say so. For destructive migrations:
- Always snapshot first. Treat the snapshot as irreplaceable; keep it for at least a week post-upgrade.
- Don’t rely on rollback. A destructive migration’s reverse is “restore from snapshot,” which loses recent data.
Upgrade across many Engines
In a per-user-engine deployment with hundreds of Engines:

Sequential rollout
Upgrade Engines one at a time, in a controlled order. Slow but safe. Use this for the first few hours of a new release.

Parallel rollout
Upgrade in batches (10%, 50%, 100%). Faster, slightly riskier.
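A sketch of the batching logic; `upgrade` and `verify` are stand-ins for your per-engine upgrade and health checks, not real helpers:

```python
def batches(engines: list[str], steps=(0.10, 0.50, 1.00)):
    # Yield successively larger slices of the fleet: 10%, then up to 50%,
    # then everything that's left.
    done = 0
    for frac in steps:
        upto = max(done + 1, int(len(engines) * frac))
        yield engines[done:upto]
        done = upto

def upgrade(engine: str) -> None:
    print("upgrading", engine)  # stand-in: drain + restart on the new image

def verify(batch: list[str]) -> None:
    print("verified", len(batch))  # stand-in: /health + error rate checks

fleet = [f"engine-{i:03d}" for i in range(200)]
for batch in batches(fleet):
    for engine in batch:
        upgrade(engine)
    verify(batch)  # stop here if the batch looks bad before widening
```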
Drain and restart

For each Engine: drain in-flight turns, restart with the new image. The graceful-shutdown path drains existing requests up to a deadline before exiting.
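What that shutdown path amounts to, sketched engine-side; the in-flight counter, deadline, and signal choice are illustrative, not the Engine’s actual internals:

```python
import signal
import threading
import time

in_flight = 0                    # turns currently executing (illustrative)
lock = threading.Lock()
shutting_down = threading.Event()
DRAIN_DEADLINE_S = 30            # assumed deadline, not a documented default

def handle_sigterm(signum, frame):
    # Stop accepting new turns, then wait for in-flight ones to finish,
    # up to the deadline, before exiting.
    shutting_down.set()
    deadline = time.monotonic() + DRAIN_DEADLINE_S
    while time.monotonic() < deadline:
        with lock:
            if in_flight == 0:
                break
        time.sleep(0.1)
    raise SystemExit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```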
Post-upgrade checklist

After every upgrade:
- /health returns ok.
- Run regression evals, compare to baseline (a comparison sketch follows the list).
- Check error rate over the first hour.
- Check P99 latency over the first hour.
- Check cache hit ratio. A sudden drop suggests prompt structure changed.
- Check the logs for warnings the previous version didn’t emit.
- Spot-check a couple of high-value workflows manually.
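The eval comparison can be mechanical. A sketch that diffs post-upgrade scores against the baseline file from the pre-flight sketch; the file layout and tolerance are assumptions:

```python
import json

def compare_to_baseline(baseline_path: str = "evals-baseline.json",
                        current_path: str = "evals-current.json",
                        tolerance: float = 0.02) -> None:
    # Flag any numeric metric that regressed past the tolerance.
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    for metric, old in baseline.items():
        new = current.get(metric)
        if isinstance(old, (int, float)) and isinstance(new, (int, float)):
            assert new >= old - tolerance, f"{metric} regressed: {old} -> {new}"

compare_to_baseline()
```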
See also
- Rollback — when an upgrade goes wrong.
- Migration guides — per-version specifics.
- Production deploy — the initial deploy this upgrade is changing.

