The Engine has 110 test files covering the agent loop, sandbox, memory, LLM integration, asset directory, learning centre, security, and the edges between them. This page covers how the test suite is organized, the rules we follow when adding tests, and how to run them.

The pyramid

We test at three levels, with deliberate weight at each:
Level         Count (approx)   What
Unit          ~60%             Pure functions, single classes, narrow behavior.
Integration   ~30%             Real DB, real sandbox, real services.
End-to-end    ~10%             Full /execute runs against real LLM providers.
The pyramid is wider at the integration level than in most codebases because the Engine is mostly integration logic: the agent loop, memory retrieval, sandbox dispatch. Unit tests of the plumbing miss the bugs.

Where tests live

tests/
  conftest.py                  shared fixtures
  test_agent_loop_*.py         agent loop and execution
  test_context_*.py            context engineering and compaction
  test_compaction_*.py         compaction orchestrator specifics
  test_cached_snip.py          prompt cache heuristics
  test_bash_security.py        sandbox bash analysis
  test_bash_paths.py           path extraction
  test_dangerous_removal.py    permission system
  test_asset_directory_*.py    MCP connectors
  test_oauth_*.py              OAuth flows
  test_episodic_memory.py      episodes
  test_knowledge_store.py      knowledge facts
  test_social_graph.py         social graph
  test_conversational_memory.py     turns and continuity
  test_herald_*.py             LLM provider abstraction
  test_cache_hit_monitor.py    cache observability
  test_lc_*.py                 Learning Centre
  test_infra_*.py              SQLite pool, migrations
  test_migrations.py           migration runner
  test_command_*.py            command execution
  test_activity_*.py           activity manager
  test_fork_*.py               sub-process behavior
  stress_*.py                  long-running stress tests
  test_e2e_bap_v2.py           end-to-end against BAP shape
  test_integration_boot.py     Engine boot

conftest.py

The session conftest provides:
  • A db fixture that creates a temporary SQLite database, loads sqlite-vec, runs migrations, and yields a connection pool.
  • A seccomp probe that checks whether the test environment supports the BPF filters the sandbox uses. Tests that require seccomp are skipped on incompatible hosts (notably some QEMU-emulated environments).
  • Async-mode configuration (asyncio_mode: strict).
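A trimmed sketch of the shape of those fixtures follows; the names, the seccomp probe, and the migration stub are illustrative, and the real conftest yields a connection pool rather than a single connection:
# illustrative sketch only, not the real conftest.py
import os
import sqlite3
import pytest
import sqlite_vec  # Python bindings that load the sqlite-vec extension

def _seccomp_available() -> bool:
    # Cheap probe: modern kernels list the available seccomp actions here.
    # The real probe is more direct; QEMU-emulated hosts commonly fail it.
    return os.path.exists("/proc/sys/kernel/seccomp/actions_avail")

requires_seccomp = pytest.mark.skipif(
    not _seccomp_available(),
    reason="host does not support the sandbox's seccomp BPF filters",
)

def _run_migrations(conn):
    # Stand-in for the Engine's migration runner (exercised by test_migrations.py).
    pass

@pytest.fixture(scope="session")
def db(tmp_path_factory):
    path = tmp_path_factory.mktemp("engine") / "engine.db"
    conn = sqlite3.connect(path)
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)          # real sqlite-vec, no mocks
    conn.enable_load_extension(False)
    _run_migrations(conn)
    yield conn                     # the real fixture yields a pool, not one connection
    conn.close()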

Rules

No mocks for LLM calls

We do not mock LLM provider responses. Tests that exercise the agent loop hit real model APIs. This is a deliberate constraint inherited from the project’s CLAUDE.md. The rationale: mocked LLM responses pass the test and fail in production at the worst possible moment. If a test requires a model, it makes a real call. The cost is real (in dollars and time). We mitigate by:
  • Pinning to cheap models (Haiku-class) for tests that don’t need the strong model.
  • Caching where possible.
  • Running expensive tests less frequently (nightly, not per-PR).
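For illustration, a loop test pinned to a Haiku-class model might look like this; the engine_client fixture, its execute call, and the model id are assumptions rather than the Engine's real API:
import pytest

CHEAP_MODEL = "claude-3-5-haiku-latest"  # example Haiku-class id, cheap enough for per-PR runs

@pytest.mark.requires_llm
@pytest.mark.asyncio
async def test_agent_loop_completes_trivial_task(engine_client):
    # Real provider call, but routed to the cheap model because the test
    # exercises loop mechanics, not model strength.
    result = await engine_client.execute("say the word 'done'", model=CHEAP_MODEL)
    assert result.status == "completed"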

Real database

The shared db fixture creates a real SQLite + sqlite-vec instance per session. Tests against memory, asset directory, and any persisted state hit real SQL. No mock query builders, no faked rows.
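For instance, a memory test writes and reads through the same fixture with plain SQL; the table and column names here are assumptions for the sketch:
def test_knowledge_fact_roundtrip(db):
    db.execute(
        "INSERT INTO knowledge_facts (subject, fact) VALUES (?, ?)",
        ("engine", "tests run against real sqlite-vec"),
    )
    row = db.execute(
        "SELECT fact FROM knowledge_facts WHERE subject = ?",
        ("engine",),
    ).fetchone()
    assert row[0] == "tests run against real sqlite-vec"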

Real sandbox

Sandbox tests run real bubblewrap. The test container is privileged because bwrap requires capabilities the default Docker security profile doesn’t grant.

One assert per concept

Tests with five unrelated asserts hide which behavior actually broke. Split into multiple tests when the asserts cover different concepts.
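A sketch of the split, in the spirit of the naming example below; compact, context, and BUDGET are illustrative names, not real Engine symbols:
# bad: three concepts in one test; a failure doesn't say which one broke
def test_compaction():
    result = compact(context)
    assert result.summary
    assert result.file_contents_preserved
    assert result.token_count < BUDGET

# good: one concept per test
def test_compaction_produces_summary():
    assert compact(context).summary

def test_compaction_preserves_file_contents():
    assert compact(context).file_contents_preserved

def test_compaction_stays_under_token_budget():
    assert compact(context).token_count < BUDGET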

Descriptive names

# bad
def test_compaction():

# good
def test_compaction_preserves_file_contents_when_summary_collapses():
The name tells you what’s expected to happen and under what condition. When the test fails, the name is your first clue.

Running tests

Full suite

docker compose run --rm test pytest tests/ --tb=short
Takes 5-15 minutes depending on the host. Hits real LLM providers; needs valid API keys in the environment.

Single file

docker compose run --rm test pytest tests/test_agent_loop.py --tb=short

By name

docker compose run --rm test pytest -k compaction
Runs every test with “compaction” in the name. Useful while iterating.

Without LLM dependencies

docker compose run --rm test pytest tests/ -m "not requires_llm"
The marker requires_llm is added to tests that hit a provider. Run without them when the API key is unavailable or when iterating on non-LLM logic.
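If the marker isn't declared in pytest's config, one common way to register it is in conftest.py; whether the Engine registers it there or in its pytest settings is an assumption:
def pytest_configure(config):
    # Register the marker so -m "not requires_llm" filters cleanly and
    # pytest doesn't warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        "requires_llm: test makes a real call to an LLM provider",
    )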

With coverage

docker compose run --rm test pytest --cov=src --cov-report=term-missing
Coverage targets:
  • New code: 80%+ (enforced in CI).
  • Critical paths (agent loop, sandbox, permissions): 95%+.
  • Total: 75%+ (informational).
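For local runs, pytest-cov can turn the total target into a hard failure with --cov-fail-under; how CI computes the per-PR new-code figure (likely a diff-coverage step) isn't shown here:
docker compose run --rm test pytest --cov=src --cov-report=term-missing --cov-fail-under=75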

Stress tests

stress_*.py files run for minutes and exercise high-concurrency or large-data paths. They’re not in the regular suite — they run nightly:
  • stress_agent_loop_concurrency.py — many concurrent /execute calls.
  • stress_memory_search.py — large brain, fast retrieval.
  • stress_compaction.py — long contexts, repeated compaction.
Failures in stress tests are usually performance regressions, not correctness bugs. Triage accordingly.
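For a sense of shape, the concurrency stress run looks roughly like this; the client fixture, the endpoint, and the thresholds are assumptions, not the real file:
import asyncio
import time

CONCURRENCY = 50  # illustrative; the real stress files pick their own numbers

async def stress_agent_loop(engine_client):
    async def one_call(i: int):
        return await engine_client.execute(f"trivial task {i}")

    start = time.monotonic()
    results = await asyncio.gather(*(one_call(i) for i in range(CONCURRENCY)))
    elapsed = time.monotonic() - start

    assert all(r.status == "completed" for r in results)
    # Stress failures are usually latency, not correctness: assert on time as well.
    assert elapsed < 120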

Test data

Test fixtures live alongside the tests. We don’t ship a “test data” directory; each test that needs setup creates it inline. This keeps tests independent. For shared setup (a brain with N episodes), the fixture is a function in conftest.py that builds the state programmatically. Don’t commit SQLite blobs.
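Such a fixture might look like this sketch; the episodes table and the fixture name are assumptions:
import pytest

@pytest.fixture
def brain_with_episodes(db):
    # Build shared state in code rather than committing a SQLite blob.
    def _build(n: int):
        for i in range(n):
            db.execute(
                "INSERT INTO episodes (title, body) VALUES (?, ?)",
                (f"episode {i}", f"synthetic body {i}"),
            )
        return db
    return _build
A test then calls brain_with_episodes(500) and asserts against retrieval.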

When to add a test

Situation                             Add a test?
Fixed a bug                           Yes. Encode the bug as a test that fails before, passes after.
Added a new endpoint                  Yes. Smoke test plus at least one error-path test.
Added a new tool                      Yes. Test that the tool registers, runs, and returns valid output.
Refactored without behavior change    No. The existing tests should still pass; that's the proof.
Added an internal helper              Sometimes. If the helper has non-trivial logic, yes.
Tweaked a prompt                      No (this is what evals are for, see Evaluation).

Flaky tests

A test that passes some runs and fails others is a problem. Two paths:
  1. Fix the flakiness. Usually it’s a race or a timing assumption.
  2. Quarantine it. Mark it @pytest.mark.flaky and stop blocking on it. Flag it for investigation; don't let tolerating flakes become normal.
A growing flaky list is a quality signal. Audit periodically.
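Quarantining looks like this in practice; the test name and the failure note are illustrative:
import pytest

@pytest.mark.flaky  # suspected race between sandbox teardown and the next test's setup
def test_sandbox_releases_network_namespace():
    ...
CI can then deselect quarantined tests with -m "not flaky" while the race is investigated.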

CI

CI runs on every PR:
  • The full suite (in privileged mode, with LLM keys).
  • Coverage report.
  • Lint and formatting.
CI runs nightly:
  • Full suite.
  • Stress tests.
  • Long-running integration tests.
PRs that drop coverage below the floor are blocked.

See also

  • Evaluation — testing agent behavior, distinct from testing code.
  • Eval harness — the runner for behavior tests.
  • Common tasks — shortcuts for running tests locally.