Harnesses

A harness is the boundary between production behavior and eval infrastructure. It should be thin: call the same app entrypoint you use in production, then return the output, messages, tool calls, usage, and artifacts that tests and judges need.

Use the first-party adapter that matches the runtime your app already uses. Use createHarness() when the app is not built on a supported SDK or when you need full control over normalized run data.
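To make "thin" concrete, here is a minimal hand-rolled sketch of a harness around an app entrypoint. It deliberately does not use createHarness() or any first-party adapter, because their signatures are not shown on this page. The names `answerQuestion`, `AppReply`, `RunResult`, and `runThinHarness` are hypothetical, and the entrypoint is synchronous only to keep the example short; a real entrypoint is usually async and the harness would await it.

```typescript
// Hypothetical production entrypoint; stands in for whatever your app already exports.
type AppReply = { text: string; toolsUsed: string[]; tokens: number };

function answerQuestion(question: string): AppReply {
  // A real app would call a model here; this stub keeps the sketch runnable.
  return { text: `You asked: ${question}`, toolsUsed: ["search"], tokens: 12 };
}

// Assumed result shape based on the fields this page lists; not the library's actual types.
type RunResult = {
  output: string;
  messages: { role: "user" | "assistant"; content: string }[];
  toolCalls: { name: string }[];
  usage: { tokens: number; tools: number };
  artifacts: Record<string, unknown>;
};

// The harness adds no behavior of its own: it calls the entrypoint and maps the reply.
function runThinHarness(input: string): RunResult {
  const reply = answerQuestion(input);
  return {
    output: reply.text,
    messages: [
      { role: "user", content: input },
      { role: "assistant", content: reply.text },
    ],
    toolCalls: reply.toolsUsed.map((name) => ({ name })),
    usage: { tokens: reply.tokens, tools: reply.toolsUsed.length },
    artifacts: {},
  };
}
```

Because the harness only maps data, any logic change in `answerQuestion` shows up in evals without touching the harness.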

Available Harnesses

Harness	When to use
AI SDK	Use when your app calls generateText, streamText, or an AI SDK wrapper.
OpenAI Agents	Use when your app owns an Agent and runs it with a Runner.
Pi	Use when your app exposes a Pi agent, toolset, or runtime-compatible entrypoint.
Custom Harnesses	Use for workflows, service functions, CLIs, RAG pipelines, and custom agents.

Normalized Results

Every harness returns a JSON-serializable result. Judges and reports read the same shape regardless of runtime:

Field	Purpose
`output`	The domain value your tests usually assert on.
`messages`	Conversation transcript or normalized app messages.
`toolCalls`	Deterministic tool activity for tool judges and replay.
`usage`	Stable usage data such as provider, model, tokens, tools, and retries.
`artifacts`	Report-only details such as fixture ids, retrieved records, or traces.
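
As an illustration of the table above, the sketch below writes the five fields as a TypeScript interface and adds a small check that a value contains only JSON-representable data. The field types are assumptions (this page names the fields but does not specify their types), and `NormalizedResult`, `isJsonValue`, and `sampleResult` are hypothetical names, not part of any library.

```typescript
// Assumed field types; only the field names come from the table above.
interface NormalizedResult {
  output: unknown;
  messages: { role: string; content: string }[];
  toolCalls: { name: string; args: unknown; result?: unknown }[];
  usage: { provider?: string; model?: string; tokens?: number; tools?: number; retries?: number };
  artifacts: Record<string, unknown>;
}

// Returns true only for null, strings, booleans, finite numbers,
// arrays of such values, and plain objects of such values.
// Cyclic structures are not handled and would recurse without end.
function isJsonValue(value: unknown): boolean {
  if (value === null) return true;
  switch (typeof value) {
    case "string":
    case "boolean":
      return true;
    case "number":
      return Number.isFinite(value); // NaN and Infinity have no JSON form
    case "object":
      if (Array.isArray(value)) return value.every(isJsonValue);
      // Reject class instances such as Date or Map; only plain objects pass.
      if (Object.getPrototypeOf(value) !== Object.prototype) return false;
      return Object.values(value as Record<string, unknown>).every(isJsonValue);
    default:
      return false; // undefined, function, symbol, bigint
  }
}

const sampleResult: NormalizedResult = {
  output: "30 days",
  messages: [{ role: "assistant", content: "Our refund window is 30 days." }],
  toolCalls: [{ name: "lookupPolicy", args: { topic: "refunds" } }],
  usage: { provider: "example-provider", model: "example-model", tokens: 42, tools: 1, retries: 0 },
  artifacts: { fixtureId: "policy-v1" },
};
```

A check like `isJsonValue(sampleResult)` can run in a unit test for a custom harness, so that a stray `Date`, `Map`, or `undefined` is caught before a judge or report reads the result.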

Metadata is per-run test context. Put expected values, scenario labels, fixture ids, and judge configuration in metadata instead of hiding them in prompts.
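
The sketch below shows one way this could look: the prompt stays exactly what a user would send, while the expected value, scenario label, fixture id, and judge configuration travel in metadata. All names here (`RunMetadata`, `EvalCase`, `judgeExactMatch`, `refundCase`) and the metadata keys are hypothetical; this page does not define a metadata schema.

```typescript
// Hypothetical per-run metadata; the key names are illustrative only.
type RunMetadata = {
  scenario: string;
  fixtureId: string;
  expected: string;
  judge: { caseSensitive: boolean };
};

type EvalCase = { input: string; metadata: RunMetadata };

// A toy judge: compares the harness output with metadata.expected,
// using the comparison mode configured in metadata.judge.
function judgeExactMatch(output: string, metadata: RunMetadata): boolean {
  const normalize = (s: string): string =>
    metadata.judge.caseSensitive ? s.trim() : s.trim().toLowerCase();
  return normalize(output) === normalize(metadata.expected);
}

const refundCase: EvalCase = {
  // Nothing about the expected answer is hidden in the prompt.
  input: "What is your refund window?",
  metadata: {
    scenario: "refund-policy",
    fixtureId: "policy-v1",
    expected: "30 days",
    judge: { caseSensitive: false },
  },
};
```

Keeping the expectation out of `input` means the app sees the same prompt in evals as in production, and the judge can change its comparison rules without editing any prompt.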