Harnesses
A harness is the boundary between production behavior and eval infrastructure. It should be thin: call the same app entrypoint you use in production, then return the output, messages, tool calls, usage, and artifacts that tests and judges need.
Use the first-party adapter that matches the runtime your app already uses. Use
createHarness() when the app is not built on a supported SDK or when you need
full control over normalized run data.
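As a rough illustration of the thin boundary described above, the sketch below wraps a stand-in app entrypoint and reshapes its reply into the fields a harness returns. It deliberately does not call createHarness() or any first-party adapter, since their signatures are not shown here. The names answerQuestion, AppReply, HarnessResult, and runHarness are hypothetical, and a real entrypoint would usually be asynchronous.

```typescript
// Hypothetical app entrypoint: the same function production code would call.
interface AppReply {
  text: string;
  toolCalls: { name: string; args: Record<string, unknown> }[];
  tokens: number;
}

function answerQuestion(question: string): AppReply {
  // Stand-in for real model and tool calls.
  return {
    text: `You asked: ${question}`,
    toolCalls: [{ name: "searchDocs", args: { query: question } }],
    tokens: question.length,
  };
}

// Assumed normalized shape, using the field names listed under Normalized Results.
interface HarnessResult {
  output: string;
  messages: { role: "user" | "assistant"; content: string }[];
  toolCalls: { name: string; args: Record<string, unknown> }[];
  usage: { tokens: number };
  artifacts: Record<string, unknown>;
}

// The harness adds no behavior of its own: it calls the entrypoint
// and reshapes the reply for tests and judges.
function runHarness(input: string): HarnessResult {
  const reply = answerQuestion(input);
  return {
    output: reply.text,
    messages: [
      { role: "user", content: input },
      { role: "assistant", content: reply.text },
    ],
    toolCalls: reply.toolCalls,
    usage: { tokens: reply.tokens },
    artifacts: {},
  };
}

const result = runHarness("What is a harness?");
console.log(result.output);
```

Because the wrapper contains no logic beyond reshaping, a failing eval points at the app rather than at the harness.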
Available Harnesses
| Harness | When to use |
|---|---|
| AI SDK | Use when your app calls generateText, streamText, or an AI SDK wrapper. |
| OpenAI Agents | Use when your app owns an Agent and runs it with a Runner. |
| Pi | Use when your app exposes a Pi agent, toolset, or runtime-compatible entrypoint. |
| Custom Harnesses | Use for workflows, service functions, CLIs, RAG pipelines, and custom agents. |

Normalized Results
Every harness returns a JSON-serializable result. Judges and reports read the same shape regardless of runtime:
| Field | Purpose |
|---|---|
| output | The domain value your tests usually assert on. |
| messages | Conversation transcript or normalized app messages. |
| toolCalls | Deterministic tool activity for tool judges and replay. |
| usage | Stable usage units such as provider, model, tokens, tools, and retries. |
| artifacts | Report-only details such as fixture ids, retrieved records, or traces. |
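One way to picture the shape in the table is as a TypeScript interface paired with a check that a value contains only JSON-safe data. This is a sketch under assumptions: the nested field types, the NormalizedResult name, and the isJsonSerializable helper are illustrative and are not taken from the library.

```typescript
// Assumed nested types; only the five top-level field names come from the table.
interface NormalizedResult<T> {
  output: T;
  messages: { role: string; content: string }[];
  toolCalls: { name: string; args: Record<string, unknown> }[];
  usage: { provider: string; model: string; tokens: number; tools: number; retries: number };
  artifacts: Record<string, unknown>;
}

// Hypothetical helper: true only for null, strings, booleans, finite numbers,
// arrays, and plain object literals built from those. It is stricter than
// JSON.stringify, which silently drops undefined and rewrites Date values.
function isJsonSerializable(value: unknown): boolean {
  if (value === null) return true;
  switch (typeof value) {
    case "string":
    case "boolean":
      return true;
    case "number":
      return isFinite(value); // rejects NaN and Infinity
    case "object": {
      if (Array.isArray(value)) return value.every((item) => isJsonSerializable(item));
      // Reject Date, Map, class instances, and other non-plain objects.
      if (Object.getPrototypeOf(value) !== Object.prototype) return false;
      const record = value as Record<string, unknown>;
      return Object.keys(record).every((key) => isJsonSerializable(record[key]));
    }
    default:
      return false; // undefined, function, symbol, bigint
  }
}

const sample: NormalizedResult<{ total: number }> = {
  output: { total: 42 },
  messages: [{ role: "user", content: "Sum the invoice lines." }],
  toolCalls: [{ name: "sumLines", args: { invoiceId: "inv-1" } }],
  usage: { provider: "example", model: "example-model", tokens: 120, tools: 1, retries: 0 },
  artifacts: { fixtureId: "inv-1" },
};

console.log(isJsonSerializable(sample));
```

A check like this could run in a test for a custom harness, so that a stray Date or class instance is caught before a report tries to store the run.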
Metadata is per-run test context. Put expected values, scenario labels, fixture
ids, and judge configuration in metadata instead of hiding them in prompts.
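To make the metadata guidance concrete, the sketch below keeps the expected value and judge settings next to the input instead of embedding them in the prompt, and a judge reads them from there. All names (RunMetadata, EvalCase, exactMatchJudge) and the field layout are hypothetical and do not reflect the library's actual types.

```typescript
// Hypothetical per-run metadata: everything the judge needs, nothing the model sees.
interface RunMetadata {
  scenario: string;
  fixtureId: string;
  expected: string;
  judge: { caseSensitive: boolean };
}

interface EvalCase {
  input: string; // the only text sent to the app
  metadata: RunMetadata;
}

// Compares the harness output with the expected value stored in metadata.
function exactMatchJudge(output: string, metadata: RunMetadata): boolean {
  const normalize = (text: string): string =>
    metadata.judge.caseSensitive ? text.trim() : text.trim().toLowerCase();
  return normalize(output) === normalize(metadata.expected);
}

const refundCase: EvalCase = {
  input: "Can I get a refund for order 1042?",
  metadata: {
    scenario: "refund-eligible",
    fixtureId: "order-1042",
    expected: "Approved",
    judge: { caseSensitive: false },
  },
};

// The output string here stands in for whatever the harness returned.
const verdict = exactMatchJudge(" approved ", refundCase.metadata);
console.log(verdict); // true
```

Keeping the expected value out of the prompt means the app sees the same input it would see in production, while the judge still has what it needs.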