Harnesses

A harness is the boundary between production behavior and eval infrastructure. It should be thin: call the same app entrypoint you use in production, then return the output, messages, tool calls, usage, and artifacts that tests and judges need.
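As a rough illustration of what "thin" can look like, the sketch below wraps a hypothetical production entrypoint, `answerQuestion`, and only reshapes what it returns. The function names, the `AppReply` type, and the stubbed reply are assumptions made for this example; they are not taken from any SDK or adapter.

```typescript
// Hypothetical production reply type; a real app would import this.
type AppReply = {
  text: string;
  toolCalls: { name: string; args: Record<string, unknown> }[];
  tokens: { input: number; output: number };
};

// Hypothetical production entrypoint. The body is a stand-in for real
// model and tool work so that the sketch runs offline.
async function answerQuestion(question: string): Promise<AppReply> {
  return {
    text: `You asked: ${question}`,
    toolCalls: [{ name: "lookupOrder", args: { orderId: "A-1" } }],
    tokens: { input: 12, output: 5 },
  };
}

// The harness adds no behavior of its own: it calls the entrypoint and
// maps the reply onto the fields that tests and judges read.
async function runHarness(input: string) {
  const reply = await answerQuestion(input);
  return {
    output: reply.text,
    messages: [
      { role: "user", content: input },
      { role: "assistant", content: reply.text },
    ],
    toolCalls: reply.toolCalls,
    usage: { tokens: reply.tokens.input + reply.tokens.output },
    artifacts: {},
  };
}
```

Because `runHarness` contains no prompt logic or branching of its own, any behavior a test observes comes from `answerQuestion`, which is the same code path production uses.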

Use the first-party adapter that matches the runtime your app already uses. Use createHarness() when the app is not built on a supported SDK or when you need full control over normalized run data.

Every harness returns a JSON-serializable result. Judges and reports read the same shape regardless of runtime:

| Field | Purpose |
| --- | --- |
| output | The domain value your tests usually assert on. |
| messages | Conversation transcript or normalized app messages. |
| toolCalls | Deterministic tool activity for tool judges and replay. |
| usage | Stable usage units such as provider, model, tokens, tools, and retries. |
| artifacts | Report-only details such as fixture ids, retrieved records, or traces. |
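One way to picture the shape in the table above is as a TypeScript interface, paired with a small check that a value really is JSON-serializable. This is a sketch under assumptions: only the five top-level field names come from the table, while the `HarnessRun` and `Json` names, the nested field types, and the `isJson` helper are hypothetical and may differ from what a real adapter returns.

```typescript
// Plain JSON values only: no Date, Map, class instances, or functions.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumed shape, inferred from the field table; nested types are illustrative.
interface HarnessRun {
  output: Json;
  messages: { role: string; content: string }[];
  toolCalls: { name: string; arguments: { [key: string]: Json }; result?: Json }[];
  usage: { provider?: string; model?: string; tokens?: number; tools?: number; retries?: number };
  artifacts: { [key: string]: Json };
}

// Recursively checks that a value is built only from JSON-compatible parts.
function isJson(value: unknown): boolean {
  if (value === null) return true;
  const t = typeof value;
  if (t === "string" || t === "boolean") return true;
  if (t === "number") return isFinite(value as number); // rejects NaN and Infinity
  if (Array.isArray(value)) return value.every((item) => isJson(item));
  if (t === "object") {
    const proto = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) return false; // Date, Map, classes
    const record = value as Record<string, unknown>;
    return Object.keys(record).every((key) => isJson(record[key]));
  }
  return false; // undefined, function, symbol, bigint
}

const run: HarnessRun = {
  output: { refundApproved: true },
  messages: [{ role: "user", content: "Refund order A-1" }],
  toolCalls: [{ name: "lookupOrder", arguments: { orderId: "A-1" }, result: { status: "delivered" } }],
  usage: { provider: "example-provider", model: "example-model", tokens: 42, tools: 1, retries: 0 },
  artifacts: { fixtureId: "refund-happy-path" },
};
```

A check like `isJson(run)` could sit at the end of a custom harness to catch values such as `Date` objects or class instances before they reach judges and reports.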

Metadata is per-run test context. Put expected values, scenario labels, fixture ids, and judge configuration in metadata instead of hiding them in prompts.
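To make the metadata guidance concrete, here is a minimal sketch in which a toy tool judge reads its expectations and configuration from metadata rather than from anything embedded in the prompt. The `RunMetadata` shape, its field names, and the `judgeToolCalls` function are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical metadata shape; field names are illustrative only.
interface RunMetadata {
  scenario: string;
  fixtureId: string;
  expected: { toolNames: string[] };
  judge: { allowExtraTools: boolean };
}

interface ToolCall {
  name: string;
}

// A toy tool judge: compares observed tool calls with the expectation
// carried in metadata and reports the first mismatch it finds.
function judgeToolCalls(
  toolCalls: ToolCall[],
  metadata: RunMetadata
): { pass: boolean; reason: string } {
  const seen = toolCalls.map((call) => call.name);
  const wanted = metadata.expected.toolNames;
  const missing = wanted.filter((name) => seen.indexOf(name) === -1);
  const extra = seen.filter((name) => wanted.indexOf(name) === -1);
  if (missing.length > 0) {
    return { pass: false, reason: `missing: ${missing.join(", ")}` };
  }
  if (extra.length > 0 && !metadata.judge.allowExtraTools) {
    return { pass: false, reason: `unexpected: ${extra.join(", ")}` };
  }
  return { pass: true, reason: "tool calls match expectation" };
}

// Expected values, scenario label, fixture id, and judge configuration
// all live here, so the prompt can stay identical to production input.
const metadata: RunMetadata = {
  scenario: "refund-happy-path",
  fixtureId: "order-A-1",
  expected: { toolNames: ["lookupOrder", "issueRefund"] },
  judge: { allowExtraTools: false },
};
```

Keeping expectations in a structure like this means a scenario can change its expected tools or judge settings without touching the text sent to the app.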