
JudgeContext

Full normalized context passed to every judge.

Scenario-owned judge criteria should live on input. Use metadata for per-run expectations or harness configuration that is not part of the scenario payload.

type RefundContext = JudgeContext<
  string,
  { status: "approved" | "denied" },
  { expected: { status: "approved" | "denied" } }
>;

const RefundStatusJudge = createJudge(
  "RefundStatusJudge",
  ({ output, metadata }: RefundContext) => ({
    score: output.status === metadata.expected.status ? 1 : 0,
  }),
);
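The example above reads its expectation from metadata. As a complementary sketch of the other case, where the criterion belongs to the scenario and therefore travels on input, the following uses stand-in types rather than the library's own. `SketchContext`, `RefundInput`, `RefundMetadata`, the `strict` flag, and `judgeRefund` are all hypothetical names introduced for illustration, and the partial-credit rule is an assumption, not documented behavior.

```typescript
// Stand-in for illustration only; the real JudgeContext carries more fields.
type SketchContext<TInput, TOutput, TMetadata> = {
  input: TInput;
  output: TOutput;
  metadata: Readonly<TMetadata>;
};

type RefundStatus = "approved" | "denied";

// Scenario-owned criterion: it is part of the scenario payload, so it sits on input.
type RefundInput = { message: string; expectedStatus: RefundStatus };

// Per-run harness configuration (hypothetical flag): it sits on metadata.
type RefundMetadata = { strict: boolean };

type RefundSketchContext = SketchContext<
  RefundInput,
  { status: RefundStatus },
  RefundMetadata
>;

function judgeRefund({ input, output, metadata }: RefundSketchContext): { score: number } {
  if (output.status === input.expectedStatus) {
    return { score: 1 };
  }
  // Assumed rule for this sketch: a non-strict run gives partial credit on mismatch.
  return { score: metadata.strict ? 0 : 0.5 };
}

const refundResult = judgeRefund({
  input: { message: "Refund order 123", expectedStatus: "approved" },
  output: { status: "approved" },
  metadata: { strict: true },
});
console.log(refundResult.score); // 1
```

Keeping the expected status on the input means every run of the scenario is judged against the same criterion, while the flag on metadata can change from run to run without touching the scenario.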

TInput = unknown

TOutput extends JsonValue | undefined = JsonValue | undefined

TMetadata extends HarnessMetadata = HarnessMetadata

THarness extends Harness<TInput, TOutput, TMetadata> | undefined = Harness<TInput, TOutput, TMetadata> | undefined

harness: THarness

Harness associated with this judge context.


input: TInput

Original eval input passed to the harness.


metadata: Readonly<TMetadata>

Per-run expectations or configuration passed to run(input, { metadata }).
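Because the property is typed as `Readonly<TMetadata>`, a judge cannot reassign top-level metadata fields at compile time. A minimal sketch of what that means in plain TypeScript follows; the `RefundExpectations` type and its `strict` flag are hypothetical and not part of the library. Note that `Readonly` is shallow and purely a type-level constraint, so nested objects are not protected by it.

```typescript
type RefundExpectations = {
  expected: { status: "approved" | "denied" };
  strict: boolean; // hypothetical harness flag, for illustration only
};

const runMetadata: Readonly<RefundExpectations> = {
  expected: { status: "approved" },
  strict: true,
};

// runMetadata.strict = false; // rejected by the compiler: read-only property

// Derive a new object instead of mutating the one the judge received.
const relaxedMetadata: RefundExpectations = { ...runMetadata, strict: false };

console.log(runMetadata.strict, relaxedMetadata.strict); // true false
```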


output: TOutput

App-facing output returned by the harness.


run: HarnessRun<TOutput>

Complete normalized harness run being judged.


optional runJudge?: RunJudge

Runs the configured matcher, judge, or suite judge harness with run-scoped context.


session: HarnessRun<TOutput>["session"]

Normalized transcript associated with the harness run.


toolCalls: ToolCallRecord[]

Flattened tool calls observed in the normalized session.
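A flattened list makes simple tool-usage checks straightforward. The sketch below assumes each record exposes a `name` field; that shape, the `SketchToolCall` type, and the `scoreToolUsage` helper are hypothetical, and the real `ToolCallRecord` may differ.

```typescript
// Hypothetical record shape; check the actual ToolCallRecord definition before relying on it.
type SketchToolCall = { name: string };

// Scores 1 when the required tool appears at least once in the flattened calls, otherwise 0.
function scoreToolUsage(
  toolCalls: SketchToolCall[],
  requiredTool: string,
): { score: number } {
  const used = toolCalls.some((call) => call.name === requiredTool);
  return { score: used ? 1 : 0 };
}

const observedCalls: SketchToolCall[] = [
  { name: "lookupOrder" },
  { name: "issueRefund" },
];

console.log(scoreToolUsage(observedCalls, "issueRefund").score); // 1
console.log(scoreToolUsage(observedCalls, "cancelOrder").score); // 0
```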