Engines fail. Containers crash. A network blip drops a probe. The orchestrator runs a background health loop that detects failures quickly and restarts failed engines with exponential backoff. This page covers the loop and how to tune it.
The loop
Every ORCH_HEALTH_CHECK_INTERVAL_S seconds (30 by default), the orchestrator probes each engine's GET /health endpoint. Failed probes are counted per engine; after ORCH_HEALTH_MAX_FAILURES consecutive failures the engine is marked failed and an auto-restart is scheduled with exponential backoff.
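A simplified sketch of the loop, assuming a synchronous probe and placeholder registry/scheduling helpers; none of the names below are the orchestrator's real API:

```python
import os
import time

import requests

# Illustrative sketch of one sweep, not the orchestrator's real code.
# `engines` and `schedule_restart` are placeholder names.
INTERVAL = int(os.environ.get("ORCH_HEALTH_CHECK_INTERVAL_S", 30))
TIMEOUT = int(os.environ.get("ORCH_HEALTH_CHECK_TIMEOUT_S", 10))
MAX_FAILURES = int(os.environ.get("ORCH_HEALTH_MAX_FAILURES", 3))
BACKOFF_BASE = int(os.environ.get("ORCH_RESTART_BACKOFF_BASE_S", 5))

def sweep(engines, schedule_restart):
    for engine in engines:
        try:
            resp = requests.get(f"{engine.url}/health", timeout=TIMEOUT)
            healthy = 200 <= resp.status_code < 300 and resp.json().get("status") == "ok"
        except (requests.RequestException, ValueError):
            healthy = False

        if healthy:
            engine.health_failures = 0
        else:
            engine.health_failures += 1
            if engine.health_failures >= MAX_FAILURES:
                engine.status = "failed"                      # audit: health_failed
                schedule_restart(engine, delay=BACKOFF_BASE)  # first retry 5s later

def run_loop(engines, schedule_restart):
    while True:
        sweep(engines, schedule_restart)
        time.sleep(INTERVAL)
```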
Configuration
The loop is tuned via env vars. Defaults are sensible for most deployments; tune for special cases.

| Variable | Default | Effect |
|---|---|---|
| ORCH_HEALTH_CHECK_INTERVAL_S | 30 | How often the loop sweeps every engine. Lower = faster failure detection, more probe traffic. |
| ORCH_HEALTH_CHECK_TIMEOUT_S | 10 | Per-probe timeout. Should be < interval. |
| ORCH_HEALTH_MAX_FAILURES | 3 | Consecutive failed probes before marking failed. Higher = tolerates more flakes. |
| ORCH_RESTART_BACKOFF_BASE_S | 5 | First retry waits this long. Doubles each attempt. |
| ORCH_RESTART_BACKOFF_MAX_S | 300 | Cap on backoff (5 min). |
| ORCH_RESTART_MAX_ATTEMPTS | 8 | Give up auto-restart after this many failed attempts. |
| ORCH_BOOT_TIMEOUT_S | 60 | Provision-time boot deadline (separate from steady-state). |
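To sanity-check what a deployment will actually run with, you can read the knobs back with their documented defaults. The parsing below is illustrative, not the orchestrator's own code:

```python
import os

# Documented defaults from the table above.
DEFAULTS = {
    "ORCH_HEALTH_CHECK_INTERVAL_S": 30,
    "ORCH_HEALTH_CHECK_TIMEOUT_S": 10,
    "ORCH_HEALTH_MAX_FAILURES": 3,
    "ORCH_RESTART_BACKOFF_BASE_S": 5,
    "ORCH_RESTART_BACKOFF_MAX_S": 300,
    "ORCH_RESTART_MAX_ATTEMPTS": 8,
    "ORCH_BOOT_TIMEOUT_S": 60,
}

config = {name: int(os.environ.get(name, default)) for name, default in DEFAULTS.items()}

# Sanity check called out in the table: the probe timeout should stay below the interval.
assert config["ORCH_HEALTH_CHECK_TIMEOUT_S"] < config["ORCH_HEALTH_CHECK_INTERVAL_S"]
```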
What “healthy” means
The orchestrator probes GET /health on the engine. A probe counts as a failure on any of:
- HTTP timeout (connection or read).
- Non-2xx status code.
- 200 status but status != "ok" in the body.
/health is implemented in
engine/src/server.py. It checks: database reachable, LLM provider
reachable (last call succeeded), Asset Directory reachable. If any
subsystem is down, the body’s status is non-"ok" and the
orchestrator counts it as a failure.
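Put together, a single probe classifies roughly like this (an illustrative helper, not the orchestrator's actual code):

```python
import requests

def probe_failed(base_url: str, timeout: float) -> bool:
    """Classify one probe the way this page describes. Illustrative only."""
    try:
        resp = requests.get(f"{base_url}/health", timeout=timeout)
    except requests.RequestException:
        return True                                # timeout or connection error
    if not 200 <= resp.status_code < 300:
        return True                                # non-2xx status code
    try:
        return resp.json().get("status") != "ok"   # 200 but body status != "ok"
    except ValueError:
        return True                                # unparseable body treated as a failure here
```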
What “failed” means
After ORCH_HEALTH_MAX_FAILURES consecutive failures, the orchestrator:
- Sets status='failed' in the registry.
- Writes an audit row with action health_failed.
- Schedules an auto-restart for BACKOFF_BASE * 2^0 seconds later (5s by default).
A failed engine is invisible to the product unless it explicitly queries: GET /admit returns admitted: false with reason engine_unhealthy. If auto_provision=true is passed and policy allows, a new engine takes its place; otherwise the product sees the failure.
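From the product side the check looks roughly like the snippet below. The base URL and query parameter names are assumptions for illustration; only admitted, reason, and auto_provision come from this page:

```python
import requests

ORCH_URL = "http://orchestrator:8080"   # placeholder address, not from this page

# Hypothetical call shape; the real parameter names may differ.
resp = requests.get(
    f"{ORCH_URL}/admit",
    params={"engine_id": "eng-123", "auto_provision": "true"},
    timeout=5,
)
body = resp.json()
if not body["admitted"]:
    print("not admitted:", body.get("reason"))   # "engine_unhealthy" for a failed engine
```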
Auto-restart
The orchestrator restarts a failed engine by calling the backend's start(). For Docker, this is docker.containers.start().
After each restart attempt:
- If GET /health returns ok within ORCH_HEALTH_CHECK_TIMEOUT_S, reset to status='running', health_failures=0. Audit auto_restart_success.
- If the restart itself fails, or /health keeps failing, increment the attempt counter and schedule the next retry with doubled backoff.
After ORCH_RESTART_MAX_ATTEMPTS (default 8) consecutive failed restarts, the orchestrator stops trying. The engine stays failed. Audit auto_restart_gave_up. Ops alerts should fire here.
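That bookkeeping, compressed into an illustrative helper (the audit and scheduling callables and the engine fields are placeholders, not the real API):

```python
import os

BACKOFF_BASE = int(os.environ.get("ORCH_RESTART_BACKOFF_BASE_S", 5))
BACKOFF_MAX = int(os.environ.get("ORCH_RESTART_BACKOFF_MAX_S", 300))
MAX_ATTEMPTS = int(os.environ.get("ORCH_RESTART_MAX_ATTEMPTS", 8))

def after_restart_attempt(engine, probe_ok: bool, audit, schedule_retry):
    """Illustrative handling of one restart attempt's outcome."""
    if probe_ok:
        engine.status = "running"
        engine.health_failures = 0
        audit(engine, "auto_restart_success")
        return
    engine.restart_attempts += 1
    if engine.restart_attempts >= MAX_ATTEMPTS:
        audit(engine, "auto_restart_gave_up")   # engine stays failed; ops should alert
        return
    delay = min(BACKOFF_BASE * 2 ** engine.restart_attempts, BACKOFF_MAX)
    schedule_retry(engine, delay)               # doubled backoff, capped at the max
```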
The total elapsed time before giving up: with the defaults, the backoff waits are 5 + 10 + 20 + 40 + 80 + 160 + 300 + 300 = 915 seconds, roughly 15 minutes, plus the time the restart attempts and probes themselves take.
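The same arithmetic, if you want to check your own settings:

```python
# Total backoff wait before giving up, excluding the attempts and probes themselves.
base_s, cap_s, attempts = 5, 300, 8
delays = [min(base_s * 2 ** i, cap_s) for i in range(attempts)]
print(delays, sum(delays))   # [5, 10, 20, 40, 80, 160, 300, 300] -> 915 seconds
```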
What auto-restart fixes
- Container OOM kills.
- LLM provider transient outage (the engine's /health will go ok again once the provider recovers).
- Brief network partitions.
- Engines that crashed during a deploy and haven’t been bumped.
What it doesn’t fix
- Misconfiguration. If LLM_API_KEY is wrong, restarting won't help.
- Disk full. Restart hits the same disk error.
- Deeper bugs that surface every time. Restart loops them.
Concurrency
The health loop runs all probes for one tick in parallel, up to a semaphore of 50. For larger fleets, the worst-case per-tick wall time is (num_engines / 50) × ORCH_HEALTH_CHECK_TIMEOUT_S. With 1000 engines and a 10s timeout, that's 200 seconds per sweep, which is uncomfortable when the interval is 30s.
If you run >500 engines, raise the semaphore (a code change today;
config in a future version) or increase the interval to give the
loop time to finish.
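The bounded fan-out looks roughly like the sketch below (httpx and the helper names are assumptions, not the orchestrator's actual code); the constant 50 mirrors today's hard-coded semaphore:

```python
import asyncio

import httpx

SEMAPHORE_SIZE = 50   # hard-coded today; a config knob is planned

async def probe_ok(client: httpx.AsyncClient, url: str, timeout: float) -> bool:
    try:
        resp = await client.get(f"{url}/health", timeout=timeout)
        return resp.status_code == 200 and resp.json().get("status") == "ok"
    except (httpx.HTTPError, ValueError):
        return False

async def sweep(urls: list[str], timeout: float) -> list[bool]:
    sem = asyncio.Semaphore(SEMAPHORE_SIZE)
    async with httpx.AsyncClient() as client:
        async def bounded(url: str) -> bool:
            async with sem:   # at most SEMAPHORE_SIZE probes in flight
                return await probe_ok(client, url, timeout)
        return await asyncio.gather(*(bounded(u) for u in urls))
```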
Tuning by use case
Demo / small fleet
With a handful of engines, probe traffic is negligible. Lower ORCH_HEALTH_CHECK_INTERVAL_S (keeping the timeout below it) for faster failure detection.
Production / large fleet
Keep the defaults, but make sure a full sweep fits inside the interval (see Concurrency above). Past a few hundred engines, raise the interval or the semaphore rather than shrinking the timeout.
Provider outage tolerance
If the LLM provider blips regularly, raise ORCH_HEALTH_MAX_FAILURES and ORCH_RESTART_BACKOFF_MAX_S so engines ride out the outage instead of burning through restart attempts.
Observability
Watch:
- engines.health_failures: sustained > 0 across many engines means a systemic issue (provider, network).
- Audit auto_restart_* actions: a high rate means thrashing.
- engines.status='failed' count over time: should approach 0 in steady state.
Alert on:
- Restart rate > 5/min (Sev2).
- Failed engines > 1% of fleet sustained for 5 min (Sev2).
- A single engine in failed for > 1 hour (Sev3).
See also
- Lifecycle — where the failed state fits in the state machine.
- Architecture — the implementation.
- Audit — the audit record of every restart.
- Config: env vars — every knob.

