The local quickstart gets you running on a laptop. Production adds TLS, real secret management, durable backups, and observability. This page covers the standard production setup.
What you need
- A Linux host with Docker Engine (or Kubernetes — the patterns are the same).
- PostgreSQL 16+ (managed or self-hosted).
- An HTTPS-terminating reverse proxy upstream of the orchestrator.
- Outbound HTTPS to your LLM providers from each engine container.
- A secret manager (Vault, AWS Secrets Manager, GCP Secret Manager, k8s Secrets).
1. Build and tag the engine image
The orchestrator launches engines from `ORCH_ENGINE_IMAGE`. Build
once, tag deliberately: pin an explicit version (e.g. `2.3.0`), not
`latest`, so canary upgrades work. See Engine deploy for the upgrade
flow.
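The build-and-tag step can be sketched as follows. The Dockerfile path, registry, and image names are placeholders, not documented values; substitute your own:

```shell
# Build the engine image and pin an explicit version tag.
docker build -t september-engine:2.3.0 -f Dockerfile.engine .
docker tag september-engine:2.3.0 registry.internal/september-engine:2.3.0
docker push registry.internal/september-engine:2.3.0

# Then point the orchestrator at the pinned tag:
# ORCH_ENGINE_IMAGE=registry.internal/september-engine:2.3.0
```

Pinning the exact tag is what makes a canary meaningful: two orchestrators can run two known versions side by side.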
2. Build and tag the orchestrator image
The `github_token` build secret is needed once, at pip-install time,
to pull `september-engine` from the private repo.
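One conventional way to pass a one-time build secret is Docker BuildKit's `--secret` flag, a sketch under the assumption that the Dockerfile mounts the secret during its `pip install` step; the file path and image name are placeholders:

```shell
# BuildKit forwards the token only at build time; it never lands in an image layer.
# The Dockerfile side would read it with:
#   RUN --mount=type=secret,id=github_token pip install ...
printf '%s' "$GITHUB_TOKEN" > /tmp/github_token
docker build \
  --secret id=github_token,src=/tmp/github_token \
  -t bap-orchestrator:2.3.0 .
rm /tmp/github_token
```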
3. Provision Postgres
bap-engine reads and writes one database. Don’t share it with anything else.

- Database name: `orchestrator`.
- User: `orch_user` with full DDL on its own database.
- Storage: 50 GB+ for moderate fleets. The bulk is `audit_log`.
- Backups: daily snapshot, 7-day retention. Hourly point-in-time recovery if your provider supports it.
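A minimal provisioning sketch using the names above; the host and password are placeholders, and your managed provider may handle this through its own console instead:

```shell
psql -h pg.internal -U postgres <<'SQL'
CREATE USER orch_user WITH PASSWORD 'change-me';
CREATE DATABASE orchestrator OWNER orch_user;
-- As owner of its own database, orch_user already has full DDL there.
SQL
```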
4. Generate secrets
Two critical orchestrator secrets: `ORCH_MASTER_KEY` and `ORCH_ADMIN_KEY`.
Don’t rotate ORCH_MASTER_KEY without a re-encrypt step. Rotating
it without re-encrypting every engine’s engine_key_enc invalidates
every stored engine key.
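Both keys are opaque random strings; one conventional way to generate them, assuming `openssl` is available:

```shell
# 32 random bytes each, hex-encoded. Store both in your secret manager,
# never in the deployment manifest or image.
ORCH_MASTER_KEY="$(openssl rand -hex 32)"
ORCH_ADMIN_KEY="$(openssl rand -hex 32)"
```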
5. Configure environment
Minimum production environment: the Postgres connection settings plus the variables named on this page (`ORCH_ENGINE_IMAGE`, `ORCH_MASTER_KEY`, `ORCH_ADMIN_KEY`, `ORCH_DATA_ROOT_PATH`, and the engine port range). Inject the two keys from your secret manager rather than baking them into images.

6. Run the orchestrator

The orchestrator container needs:
- The Docker socket — to create/start/stop engine containers.
- The catalog directory mounted at `/data/catalog`, read-only — passed through to engine containers.
- The data root mounted at `/data/engine-data` — engine brain volumes live here.
- Membership in `engine_net` — to reach engine `/health` endpoints.
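Putting the mounts and network together, a launch sketch. The listening port, image tags, and registry names are placeholders, not documented values:

```shell
docker run -d --name bap-orchestrator \
  --network engine_net \
  -p 127.0.0.1:8000:8000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /data/catalog:/data/catalog:ro \
  -v /data/engine-data:/data/engine-data \
  -e ORCH_ENGINE_IMAGE=registry.internal/september-engine:2.3.0 \
  -e ORCH_MASTER_KEY -e ORCH_ADMIN_KEY \
  bap-orchestrator:2.3.0
```

Note the `-e ORCH_MASTER_KEY` form with no value: it forwards the variable from the launching environment, so the key never appears in shell history or the manifest.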
7. TLS termination
Run an HTTPS-terminating reverse proxy upstream. The orchestrator and
engine ports bind to 127.0.0.1 by default; keep them there. The product’s
traffic to the engine goes over the same internal network as the orchestrator.
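A minimal reverse-proxy sketch in nginx; the server name, certificate paths, and upstream port are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name orch.example.internal;
    ssl_certificate     /etc/nginx/tls/fullchain.pem;
    ssl_certificate_key /etc/nginx/tls/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;  # orchestrator stays bound to localhost
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```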
8. Register the first product
Once bap-engine is up, register your first product. The response
includes a `platform_api_key` — that’s the credential your product
uses for every subsequent call.
9. Health checks and probes
Wire the upstream proxy and any orchestrator monitoring to:

- `GET /health` on the orchestrator. Returns `{"status":"ok"}`.
- `GET /status` (with platform key) — fleet snapshot.
- Postgres connectivity test (your monitoring’s standard check).
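A probe sketch matching the checks above: it treats anything other than an HTTP 200 with the exact body `{"status":"ok"}` as unhealthy. The URL is a placeholder:

```shell
# Body check split out so it can be validated on its own.
health_body_ok() { [ "$1" = '{"status":"ok"}' ]; }

check_orch_health() {
  local body
  body="$(curl -fsS --max-time 2 "$1")" || return 1
  health_body_ok "$body"
}

# Example: check_orch_health "http://127.0.0.1:8000/health" && echo ok
```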
10. Backups
Two volumes matter.

Postgres

The orchestrator’s source of truth. Back up daily; verify monthly. Restore drill: bring up a fresh Postgres, restore the snapshot, point a fresh orchestrator at it, confirm `/status` returns the expected
fleet.
Engine data volumes
Each user’s brain lives on disk under `ORCH_DATA_ROOT_PATH`. Back up
nightly:

- Volume snapshots (EBS, GCP PD, k8s VolumeSnapshot) — fastest.
- File-level (`tar` of the brain directory) — fallback.
- Per-brain export via the engine’s `GET /memory/export` — slowest, but portable.
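The file-level fallback can be sketched as a small function. It assumes one brain per subdirectory of the data root, which is a guess about the layout, so verify against your deployment:

```shell
# Tar each brain directory under $1 into a date-stamped archive under $2.
backup_brains() {
  local data_root="$1" dest="$2"
  local stamp
  stamp="$(date +%Y%m%d)"
  mkdir -p "$dest"
  for brain in "$data_root"/*/; do
    [ -d "$brain" ] || continue
    local name
    name="$(basename "$brain")"
    tar -czf "$dest/${name}-${stamp}.tar.gz" -C "$data_root" "$name"
  done
}

# Example: backup_brains /data/engine-data /backups/engine-data
```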
11. Observability
For each layer, ship telemetry to your stack:

- Orchestrator logs — structured JSON to stdout. Ship via your log pipeline.
- Audit log (`audit_log` table) — periodic export to a long-term store for compliance.
- Fleet metrics — scrape `GET /metrics` or query the audit table directly for provisions/restarts/crashes.
- Engine logs — each engine container writes to its own stdout. Ship per container.
- Postgres — your provider’s standard metrics.
| Signal | Sustained for | Severity |
|---|---|---|
| Orchestrator /health non-ok | 1 min | Sev1 |
| Postgres unreachable | 1 min | Sev1 |
| Engine restart rate > 5/min | 5 min | Sev2 |
| Engines failed count > 0 sustained | 5 min | Sev2 |
| Port allocation usage > 80% | 1 hour | Sev3 |
| Audit log table size > N GB | 1 day | Sev3 |
12. Rolling out a new orchestrator version
Standard zero-downtime swap:

- Push the new image.
- Update the deployment manifest with the new image tag.
- Drain the old container with `docker stop --time 60` so in-flight requests complete.
- Start the new container. The new orchestrator picks up state from Postgres on boot — no migration of in-memory state.
- Confirm `/health` is ok and `/status` returns the expected fleet.
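The swap above as a sketch; container names, tags, and the health URL are placeholders:

```shell
# Pull the new tag, drain the old container, start the new one.
docker pull registry.internal/bap-orchestrator:2.4.0
docker stop --time 60 bap-orchestrator   # drain: in-flight requests get 60 s
docker rm bap-orchestrator
docker run -d --name bap-orchestrator \
  registry.internal/bap-orchestrator:2.4.0   # plus the same mounts/env as step 6
curl -fsS https://orch.example.internal/health   # confirm before declaring done
```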
What goes wrong
| Symptom | Likely cause |
|---|---|
| Orchestrator can’t start engines | Docker socket not mounted, or ORCH_ENGINE_IMAGE not pulled. |
| Engine health_failures climbing | Network issue between orchestrator and engine container, or the engine itself unhealthy. Check engine logs. |
| PORT_EXHAUSTION errors | Port range too narrow. Increase ORCH_PORT_MAX. |
| INVALID_PLATFORM_KEY for known products | Postgres restored from a backup older than the latest product registrations. |
| Engine API keys reject after restart | ORCH_MASTER_KEY changed; encrypted keys can’t be decrypted. |
See also
- Quickstart — the local version.
- Engine contract — what the engine has to do.
- Health — auto-restart specifics.
- Security — keys, encryption, rotation.

