Judges

Judges score the run that the harness already captured. Harnesses own execution; judges own rubrics, parser logic, model calls, score thresholds, and failure messages.

Use deterministic judges before reaching for model-scored rubrics. Put judges at the suite level when every case should satisfy the same contract, and use explicit judge assertions when one case needs an extra check.
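To make "deterministic" concrete, here is a minimal sketch of a judge that needs no model call: it checks whether the captured output contains an expected string and returns a score with an explanation. The `JudgeInput` and `JudgeResult` shapes and the `containsExpectedJudge` function are hypothetical names chosen for illustration; they are not the vitest-evals judge interface, so adapt them to whatever contract your harness exposes.

```typescript
// Hypothetical shapes for illustration only; not the vitest-evals types.
interface JudgeInput {
  output: string;   // text the harness captured from the run
  expected: string; // substring the output must contain
}

interface JudgeResult {
  score: number;       // 1 when the check passes, 0 otherwise
  explanation: string; // message to surface when the case fails
}

// Deterministic judge: the same input always yields the same score.
function containsExpectedJudge({ output, expected }: JudgeInput): JudgeResult {
  const normalize = (s: string) => s.trim().toLowerCase();
  const hit = normalize(output).includes(normalize(expected));
  return {
    score: hit ? 1 : 0,
    explanation: hit
      ? `Output contains "${expected}".`
      : `Output is missing "${expected}".`,
  };
}

const verdict = containsExpectedJudge({
  output: "The capital of France is Paris.",
  expected: "paris",
});
console.log(verdict.score, verdict.explanation);
```

A check like this is cheap, repeatable, and easy to debug, which is why it is worth exhausting before paying for a model-scored rubric.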

Suite-level judges run after each run(...). This example reuses the Paris question from the harness pages and scores it with the built-in factuality judge.

evals/capital.eval.ts
import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { expect } from "vitest";
import {
  describeEval,
  FactualityJudge,
} from "vitest-evals";
import { qaHarness } from "./qaHarness";

const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});

const factualityJudge = FactualityJudge({ judgeHarness });

describeEval(
  "capital questions",
  {
    harness: qaHarness,
    judges: [factualityJudge],
    judgeThreshold: 0.6,
  },
  (it) => {
    it("knows the capital of France", async ({ run }) => {
      const result = await run("What is the capital of France?", {
        metadata: {
          expected: "Paris is the capital of France.",
        },
      });

      expect(result.output).toContain("Paris");
    });
  },
);

Keep deterministic judges close to the harness contract. Use custom judges for rubrics that need domain-specific scoring or reusable explanations.
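As a rough illustration of domain-specific scoring with a reusable explanation, the sketch below combines weighted criteria into one score and compares it to a threshold. Everything here is an assumption made for the example: `RubricCriterion`, `RubricVerdict`, `scoreRubric`, and `capitalRubric` are hypothetical names, not vitest-evals exports, and the `score >= threshold` comparison is a guess at how a setting like `judgeThreshold` might be applied rather than documented behavior.

```typescript
// Hypothetical types for illustration; not part of vitest-evals.
interface RubricCriterion {
  name: string;
  weight: number;
  check: (output: string) => boolean;
}

interface RubricVerdict {
  score: number;       // weighted fraction of criteria met, 0 to 1
  passed: boolean;     // assumed rule: score >= threshold
  explanation: string; // names the failed criteria so the message is reusable
}

function scoreRubric(
  output: string,
  criteria: RubricCriterion[],
  threshold: number,
): RubricVerdict {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const failed: string[] = [];
  let earned = 0;
  for (const criterion of criteria) {
    if (criterion.check(output)) earned += criterion.weight;
    else failed.push(criterion.name);
  }
  // An empty rubric scores 0 rather than dividing by zero.
  const score = totalWeight === 0 ? 0 : earned / totalWeight;
  return {
    score,
    passed: score >= threshold,
    explanation:
      failed.length === 0
        ? "All criteria met."
        : `Failed criteria: ${failed.join(", ")}.`,
  };
}

// Example rubric for the capital question; weights are arbitrary.
const capitalRubric: RubricCriterion[] = [
  { name: "mentions Paris", weight: 3, check: (o) => /\bparis\b/i.test(o) },
  {
    name: "single sentence",
    weight: 1,
    check: (o) => o.trim().split(/[.!?]\s+/).length === 1,
  },
];

const rubricVerdict = scoreRubric(
  "Paris is the capital of France.",
  capitalRubric,
  0.6,
);
console.log(rubricVerdict);
```

Keeping the criteria as data means the same scoring function and explanation format can be shared across suites, with only the rubric changing per domain.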