Service Level Objectives (SLOs) are the reliability commitments we
make to ourselves and to customers. This page defines the SLIs we
measure (the indicators), the SLOs we aim for (the targets), and the
error budgets that drive deploy decisions.
The numbers below are starting points. We will revise them as we
accumulate production data: what’s “achievable but not free” today may
become “easy” or “impossible” tomorrow.
SLIs (what we measure)
Availability
SLI: 1 - (5xx responses on /execute) / (all responses on /execute)
Excludes:
- 4xx (caller errors).
- 503 from migration-required (operational, not failure).
A 5xx is any status code from 500 through 599. A streamed error event
also counts as a 5xx if the agent never sent thread_lifecycle: completed.
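The counting rules above can be sketched as follows. This is a minimal illustration, not the production query: the `Response` record and its field names are hypothetical, and it assumes excluded responses (4xx, migration-required 503) stay in the denominator because the formula divides by all responses.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int                        # HTTP status returned by /execute
    migration_required: bool = False   # hypothetical flag for the excluded 503
    stream_error: bool = False         # an error event appeared in the stream
    completed: bool = True             # thread_lifecycle: completed was sent


def is_failure(r: Response) -> bool:
    """True if the response counts against availability."""
    if r.status == 503 and r.migration_required:
        return False  # operational, not a failure
    if 500 <= r.status <= 599:
        return True
    # A streamed error counts as a 5xx only if the turn never completed.
    return r.stream_error and not r.completed


def availability(responses: list[Response]) -> float:
    if not responses:
        return 1.0  # assumption: no traffic means no failures
    failures = sum(is_failure(r) for r in responses)
    return 1 - failures / len(responses)


sample = (
    [Response(200)] * 96
    + [Response(500)]
    + [Response(503, migration_required=True)]
    + [Response(400)]
    + [Response(200, stream_error=True, completed=False)]
)
print(f"{availability(sample):.2%}")  # 98.00%
```

Of the 100 sample responses, only the plain 500 and the uncompleted streamed error count as failures.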
Latency (request)
SLI: time from POST /execute to first text_delta event (TTFT)
SLO targets:
p50 < 1.5s
p99 < 5s
This is time-to-first-token (TTFT): the point at which the user first
perceives that “the agent is working”.
Latency (turn)
SLI: time from POST /execute to thread_lifecycle: completed
SLO targets:
p50 < 8s
p99 < 60s
Total turn time. Some turns are legitimately long (multi-step coding
sessions); the p99 is the right tail to watch.
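As a rough sketch, both latency SLIs can be checked against their targets with a nearest-rank percentile. The percentile method, the `TARGETS` table layout, and the function names are assumptions for illustration; a metrics backend may interpolate differently.

```python
import math

# Targets in seconds, copied from the two latency sections above.
TARGETS = {
    "ttft": {50: 1.5, 99: 5.0},
    "turn": {50: 8.0, 99: 60.0},
}


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency(kind: str, samples: list[float]) -> dict[int, bool]:
    """Map each percentile to True if the observed value is under its target."""
    return {p: percentile(samples, p) < limit for p, limit in TARGETS[kind].items()}


ttft_samples = [1.0] * 98 + [4.0, 7.0]
turn_samples = [5.0] * 98 + [70.0, 90.0]
print(check_latency("ttft", ttft_samples))  # {50: True, 99: True}
print(check_latency("turn", turn_samples))  # {50: True, 99: False}
```

The second call shows why the tail matters: the median is comfortably under target while the p99 is not.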
Cost
SLI: cost per turn = input_tokens × input_price + output_tokens × output_price (token counts for that turn)
SLO targets:
p50 < $0.05
p99 < $0.50
This is a soft target: informational, not a deploy gate. p99 spikes are
still worth investigating.
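A minimal sketch of the per-turn cost formula. The per-token prices below are placeholders chosen for illustration, not actual provider pricing.

```python
# Hypothetical prices in dollars per token (assumptions, not real pricing).
INPUT_PRICE = 3e-6
OUTPUT_PRICE = 15e-6


def turn_cost(input_tokens: int, output_tokens: int,
              input_price: float = INPUT_PRICE,
              output_price: float = OUTPUT_PRICE) -> float:
    """Dollar cost of a single turn."""
    return input_tokens * input_price + output_tokens * output_price


cost = turn_cost(input_tokens=10_000, output_tokens=1_000)
print(f"${cost:.3f}", "within p50 target" if cost < 0.05 else "above p50 target")
```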
Cache hit ratio
SLI: sum(cache_hit_tokens) / sum(input_tokens)
SLO target: > 60% across all turns over 24 h
A drop in cache hit ratio is a leading indicator of cost spikes and
behavioral drift.
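The SLI is a ratio of sums, so it is token-weighted rather than an average of per-turn ratios. A small sketch, assuming each turn is recorded as a hypothetical `(cache_hit_tokens, input_tokens)` pair:

```python
def cache_hit_ratio(turns: list[tuple[int, int]]) -> float:
    """Token-weighted ratio: sum(cache_hit_tokens) / sum(input_tokens)."""
    hits = sum(hit for hit, _ in turns)
    total = sum(inp for _, inp in turns)
    return hits / total if total else 0.0  # assumption: no input means 0.0


# One small, well-cached turn and one large, poorly cached turn.
turns = [(900, 1_000), (100, 9_000)]
ratio = cache_hit_ratio(turns)
print(f"{ratio:.0%}", "meets target" if ratio > 0.60 else "below target")
```

Averaging the two per-turn ratios would give roughly 46%, while the token-weighted SLI is 10%: large turns dominate, which matches how they dominate cost.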
SLOs (what we commit to)
Per-month windows.
| SLO | Target |
|---|---|
| Availability | 99.5% |
| TTFT p99 | < 5s |
| Turn p99 | < 60s |
99.5% over a 30-day window allows ~3.6 hours of downtime. This is the
right starting target — we’re not running 24/7 tier-1 support yet.
When we onboard production customers with stricter SLAs, the target
moves to 99.9% (which allows ~43 minutes/month).
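The downtime figures follow directly from the target and the window length. A quick check, assuming a 30-day window as stated above:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime a given availability target permits."""
    return (1 - slo) * window_days * 24 * 60


for target in (0.995, 0.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.1%}: {minutes:.1f} min ({minutes / 60:.2f} h)")
```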
Error budgets
The error budget is the difference between 100% and the SLO target.
| SLO | Target | Error budget per month |
|---|---|---|
| Availability | 99.5% | 0.5% (3.6 h) |
| TTFT p99 | < 5s | n/a (latency, not availability) |
When the error budget is healthy:
- Deploy freely.
- Take risks on infrastructure changes.
- Run experimental Engines in parallel.
When the error budget is more than 50% used:
- Slow down deploys; require canary.
- Defer non-critical infrastructure work.
- Investigate what’s burning the budget.
When the error budget is exhausted:
- Stop non-critical deploys.
- All hands on reliability.
- Cool-down period until the next monthly window.
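The three tiers could be encoded as a simple deploy gate. This is a sketch under assumptions: the enum names are hypothetical, and the middle tier is taken to begin once more than half the budget is consumed.

```python
from enum import Enum


class DeployPolicy(Enum):
    FREE = "deploy freely"
    CAUTIOUS = "slow down; canary required"
    FROZEN = "non-critical deploys stopped"


def deploy_policy(budget_used: float) -> DeployPolicy:
    """budget_used is the fraction of the monthly error budget consumed (0.0 to 1.0+)."""
    if budget_used >= 1.0:
        return DeployPolicy.FROZEN
    if budget_used > 0.5:  # assumed boundary for the middle tier
        return DeployPolicy.CAUTIOUS
    return DeployPolicy.FREE


for used in (0.20, 0.76, 1.05):
    print(f"{used:.0%} used -> {deploy_policy(used).value}")
```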
This is a self-imposed gate. It’s not magic; it’s discipline.
Burn-rate alerts
The error budget burns at a rate dependent on how bad the failures
are. We alert on burn rate, not just absolute error rate, because a
2% error rate sustained for an hour burns budget at a different rate
than 100% for a minute.
| Window | Burn rate threshold | Severity |
|---|---|---|
| 1 hour | 14× normal | Sev1 |
| 6 hours | 6× normal | Sev2 |
| 24 hours | 3× normal | Sev3 |
A “1× normal” burn rate is the rate at which the budget would deplete
exactly at the end of the month if it continued.
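One way to make this concrete: burn rate is the observed error rate divided by the budget (1 minus the SLO target). The sketch below assumes the 99.5% target and a 720-hour window; the function names, the input shape, and the "at or above threshold" comparison are illustrative assumptions.

```python
SLO = 0.995
WINDOW_HOURS = 30 * 24

# (window in hours, burn-rate threshold, severity), from the table above.
THRESHOLDS = [(1, 14, "Sev1"), (6, 6, "Sev2"), (24, 3, "Sev3")]


def burn_rate(error_rate: float, slo: float = SLO) -> float:
    """Multiples of the 1x rate that would exhaust the budget exactly at month end."""
    return error_rate / (1 - slo)


def budget_consumed(error_rate: float, hours: float) -> float:
    """Fraction of the monthly budget consumed by an incident of this size."""
    return burn_rate(error_rate) * hours / WINDOW_HOURS


def severity(error_rate_by_window: dict[int, float]):
    """Return the most severe alert whose threshold is met, or None."""
    for hours, threshold, sev in THRESHOLDS:
        rate = error_rate_by_window.get(hours)
        if rate is not None and burn_rate(rate) >= threshold:
            return sev
    return None


print(f"2% for 1 h:     {budget_consumed(0.02, 1):.2%} of budget")
print(f"100% for 1 min: {budget_consumed(1.0, 1 / 60):.2%} of budget")
print(severity({1: 0.08}))                      # Sev1
print(severity({1: 0.02, 6: 0.02, 24: 0.02}))   # Sev3
```

Under these assumptions a 2% error rate is a 4× burn and a full outage is a 200× burn, so an hour of the former and a minute of the latter each consume roughly half a percent of the monthly budget.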
Per-customer SLOs
For deployments with multiple customers, per-customer SLOs may apply
on top of the global ones. A specific customer might have:
- Tighter availability (99.9%).
- Tighter latency (TTFT < 3s p99).
- Reserved capacity (their Engine never shares resources).
These are negotiated separately and tracked in the customer’s
contract; they live in their own dashboards.
Scope
These SLOs cover the Engine itself. They do not cover:
- LLM provider availability. We can’t promise what we don’t control.
- MCP server availability. Each connector has its own SLO with its
own provider.
- Network availability between customer and Engine. That’s the
customer’s network and ours, jointly.
For customer-facing SLAs, factor in upstream dependencies and leave an
appropriate cushion.
Reporting
Per-month SLO compliance is reported in the engineering review:
April 2026 SLO report
- Availability: 99.62% (target 99.5%) ✓
- TTFT p99: 4.8s (target < 5s) ✓
- Turn p99: 78s (target < 60s) ✗
- Error budget: 76% used
Notable: Turn p99 missed due to a regression in compaction (fixed
Apr 22). Action items in postmortem PM-2026-04-22.
Misses don’t trigger panic; they trigger investigation. The pattern
matters more than any one month.
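The "error budget used" figure in a report like the one above can be derived from measured availability and the target. A minimal sketch, with a hypothetical function name:

```python
def budget_used(measured_availability: float, slo: float = 0.995) -> float:
    """Fraction of the error budget consumed: observed unavailability / allowed."""
    return (1 - measured_availability) / (1 - slo)


used = budget_used(0.9962)
print(f"Error budget: {used:.0%} used")  # Error budget: 76% used
```

A measured availability of 99.62% against a 99.5% target leaves 0.38 of the 0.5 percentage points spent, which is the 76% shown in the example report.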