Custom Judges

Use createJudge() when a rubric should be reused across suites. A custom judge receives the same normalized JudgeContext as built-ins: input, output, metadata, tool calls, session, run, app harness, and a curried runJudge function when the suite has a judge harness.

Keep deterministic checks close to the normalized output. Return stable score units and JSON-safe metadata that explains the result.

evals/judges/capitalJudge.ts
import { createJudge, type JudgeContext } from "vitest-evals";

export const CapitalJudge = createJudge(
  "CapitalJudge",
  async ({ output }: JudgeContext<string, string>) => {
    const passed = output.toLowerCase().includes("paris");
    return {
      score: passed ? 1 : 0,
      metadata: {
        rationale: passed
          ? "The answer names Paris."
          : `Expected Paris, got: ${output}`,
      },
    };
  },
);

Prefer 0 and 1 for pass/fail checks. Reserve intermediate values for true partial credit.
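
As a rough illustration of true partial credit, the sketch below scores an answer by the fraction of expected keywords it mentions. `scoreKeywordCoverage`, `JudgeResult`, and the `missing` field are hypothetical names chosen for this example, not part of vitest-evals. The returned object only mirrors the `{ score, metadata }` shape used by the judge above, so it could plausibly be returned from a `createJudge` callback.

```typescript
// Hypothetical helper, not part of vitest-evals: scores an answer by the
// fraction of expected keywords it mentions.
type JudgeResult = {
  score: number;
  metadata: { rationale: string; missing: string[] };
};

function scoreKeywordCoverage(output: string, expected: string[]): JudgeResult {
  const haystack = output.toLowerCase();
  // Collect every expected keyword that does not appear in the answer.
  const missing = expected.filter(
    (keyword) => !haystack.includes(keyword.toLowerCase()),
  );
  // With no expected keywords there is nothing to miss, so treat it as a pass.
  const score =
    expected.length === 0
      ? 1
      : (expected.length - missing.length) / expected.length;
  return {
    score,
    metadata: {
      rationale:
        missing.length === 0
          ? "All expected keywords are present."
          : `Missing keywords: ${missing.join(", ")}`,
      missing,
    },
  };
}
```

The score stays in the 0 to 1 range and the metadata holds only strings and arrays of strings, so it remains JSON-safe.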

Attach reusable judges at the suite level when every case should satisfy the same contract.

evals/capital.eval.ts
import { describeEval } from "vitest-evals";
import { qaHarness } from "./qaHarness";
import { CapitalJudge } from "./judges/capitalJudge";

describeEval("capital questions", {
  harness: qaHarness,
  judges: [CapitalJudge],
  judgeThreshold: 1,
});

Use explicit assertions when a single case needs an extra judge or a different threshold.

evals/capital.eval.ts
await expect(result).toSatisfyJudge(CapitalJudge, {
  threshold: null,
});

If a custom judge needs an LLM call, configure a judgeHarness on the matcher, the judge object, or the suite, then call ctx.runJudge(...) from the judge. Core curries the current abort signal into that function. Matcher options win over a judge default, and a judge default wins over the suite default. Explicit matcher calls can also reuse a single unambiguous judge-level harness from the suite’s automatic judges, but automatic judges do not inherit inferred harnesses from sibling judges. Leave judgeHarness unset for suites that only use deterministic judges.

evals/judges/rubricJudge.ts
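import { createJudge } from "vitest-evals";

export const RubricJudge = createJudge({
  name: "RubricJudge",
  async assess(ctx) {
    // runJudge is only present when a judgeHarness has been configured.
    if (!ctx.runJudge) {
      throw new Error("RubricJudge requires a configured judgeHarness.");
    }
    const verdict = await ctx.runJudge({
      prompt: `Grade this answer: ${JSON.stringify(ctx.output)}`,
      responseFormat: { type: "json" },
    });
    // parseRubricVerdict is a helper you supply (not shown here); it is
    // assumed to turn the verdict into a { score, metadata } result.
    return parseRubricVerdict(verdict);
  },
});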
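
The harness precedence described before this example (matcher options, then the judge default, then the suite default) can be modeled in a few lines. This is only a sketch of the documented ordering, not the library's implementation: `Harness`, `HarnessSources`, and `resolveJudgeHarness` are hypothetical names, and the sketch deliberately leaves out the case where an explicit matcher call reuses a judge-level harness from the suite's automatic judges.

```typescript
// Illustrative only: models the documented precedence, not vitest-evals code.
type Harness = { name: string }; // hypothetical stand-in for a judge harness

interface HarnessSources {
  matcher?: Harness; // harness passed in matcher options
  judge?: Harness; // default harness set on the judge object
  suite?: Harness; // default harness set on the suite
}

function resolveJudgeHarness(sources: HarnessSources): Harness | undefined {
  // Matcher options win over the judge default, which wins over the suite default.
  return sources.matcher ?? sources.judge ?? sources.suite;
}

// Example: the judge-level default is used when the matcher sets nothing.
const chosen = resolveJudgeHarness({
  judge: { name: "judge-default" },
  suite: { name: "suite-default" },
});
```

When none of the three is set, the function returns `undefined`, which corresponds to the deterministic-only case where `ctx.runJudge` is absent.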

Custom judges fail when the returned score falls below the active threshold. Use explicit thresholds for domain rubrics so a future reader can see whether a score is advisory or blocking.