Custom Judges
Use createJudge() when a rubric should be reused across suites. A custom
judge receives the same normalized JudgeContext as built-ins: input, output,
metadata, tool calls, session, run, app harness, and a curried runJudge
function when the suite has a judge harness.
Keep deterministic checks close to the normalized output. Return stable score units and JSON-safe metadata that explains the result.

    import { createJudge, type JudgeContext } from "vitest-evals";

    export const CapitalJudge = createJudge(
      "CapitalJudge",
      async ({ output }: JudgeContext<string, string>) => {
        const passed = output.toLowerCase().includes("paris");

        return {
          score: passed ? 1 : 0,
          metadata: {
            rationale: passed
              ? "The answer names Paris."
              : `Expected Paris, got: ${output}`,
          },
        };
      },
    );

Prefer 0 and 1 for pass/fail checks. Reserve intermediate values for true
partial credit.
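
For a case where partial credit is genuinely meaningful, a minimal sketch might score an answer by the fraction of required facts it mentions. The helper name `scoreRequiredFacts` and the `missing` metadata field are illustrative assumptions, not part of vitest-evals; the return shape simply mirrors the `{ score, metadata }` object used above.

```typescript
// Hypothetical helper (not a vitest-evals export): partial credit as the
// fraction of required facts found in the output.
type FactScore = {
  score: number;
  metadata: { rationale: string; missing: string[] };
};

function scoreRequiredFacts(output: string, required: string[]): FactScore {
  const normalized = output.toLowerCase();
  const missing = required.filter(
    (fact) => !normalized.includes(fact.toLowerCase()),
  );

  // Guard against an empty rubric so the score is never NaN.
  const score =
    required.length === 0
      ? 1
      : (required.length - missing.length) / required.length;

  return {
    score,
    metadata: {
      rationale:
        missing.length === 0
          ? "All required facts are present."
          : `Missing: ${missing.join(", ")}`,
      missing,
    },
  };
}
```

A function like this could be called from inside a createJudge callback, which keeps the deterministic logic testable on its own and the metadata JSON-safe.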
Suite Use
Attach reusable judges at the suite level when every case should satisfy the same contract.

    import { describeEval } from "vitest-evals";
    import { qaHarness } from "./qaHarness";
    import { CapitalJudge } from "./judges/capitalJudge";

    describeEval("capital questions", {
      harness: qaHarness,
      judges: [CapitalJudge],
      judgeThreshold: 1,
    });

Use explicit assertions when a single case needs an extra judge or a different threshold.

    await expect(result).toSatisfyJudge(CapitalJudge, {
      threshold: null,
    });

Model-Backed Judges
If a custom judge needs an LLM call, configure a judgeHarness on the matcher,
the judge object, or the suite, then call ctx.runJudge(...) from the judge.
Core curries the current abort signal into that function. Matcher options win
over a judge default, and a judge default wins over the suite default. Explicit
matcher calls can also reuse a single unambiguous judge-level harness from the
suite’s automatic judges, but automatic judges do not inherit inferred
harnesses from sibling judges. Leave judgeHarness unset for suites that only
use deterministic judges.
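
The basic precedence rule above (matcher, then judge, then suite) can be pictured with a small sketch. The `Harness` type, the `HarnessSources` shape, and `resolveJudgeHarness` are hypothetical names used only for illustration; the library resolves this internally and its real implementation may differ. The sketch also leaves out the special case where an explicit matcher call reuses a single unambiguous judge-level harness from the suite's automatic judges.

```typescript
// Hypothetical model of the documented precedence; not the library's code.
type Harness = { name: string };

type HarnessSources = {
  matcher?: Harness; // judgeHarness passed in matcher options
  judge?: Harness; // default judgeHarness set on the judge object
  suite?: Harness; // default judgeHarness set on the suite
};

function resolveJudgeHarness(sources: HarnessSources): Harness | undefined {
  // Nullish coalescing picks the first configured source in priority order.
  // An undefined result corresponds to a suite with deterministic judges only.
  return sources.matcher ?? sources.judge ?? sources.suite;
}
```

Under this model, a judge that needs an LLM call and finds no harness at any level would see no runJudge function, which is why the example judge that follows checks for it before calling it.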

    import { createJudge } from "vitest-evals";

    export const RubricJudge = createJudge({
      name: "RubricJudge",
      async assess(ctx) {
        if (!ctx.runJudge) {
          throw new Error("RubricJudge requires a configured judgeHarness.");
        }

        const verdict = await ctx.runJudge({
          prompt: `Grade this answer: ${JSON.stringify(ctx.output)}`,
          responseFormat: { type: "json" },
        });

        // parseRubricVerdict is an application-defined helper that maps the
        // model's reply to a { score, metadata } result.
        return parseRubricVerdict(verdict);
      },
    });

Failure Behavior
Custom judges fail when the returned score falls below the active threshold. Use explicit thresholds for domain rubrics so a future reader can see whether a score is advisory or blocking.
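
Because the pass/fail decision depends on the returned score, a model-backed judge benefits from a strict parser that keeps scores on a stable scale. The RubricJudge example calls parseRubricVerdict without defining it; the sketch below is one possible shape. It assumes the model replies with a JSON object containing a numeric `score` and an optional `rationale` string, and that the reply may arrive either as a string or as an already-parsed object. Both are assumptions, since the verdict format is not specified here.

```typescript
// Hypothetical implementation of the parseRubricVerdict helper.
// The { score, rationale } reply format is an assumption.
type RubricVerdict = { score: number; metadata: { rationale: string } };

function parseRubricVerdict(raw: unknown): RubricVerdict {
  // Accept either a JSON string or an already-parsed object.
  const value: unknown = typeof raw === "string" ? JSON.parse(raw) : raw;
  if (typeof value !== "object" || value === null) {
    throw new Error("Rubric verdict must be a JSON object.");
  }

  const { score, rationale } = value as {
    score?: unknown;
    rationale?: unknown;
  };
  if (typeof score !== "number" || !Number.isFinite(score)) {
    throw new Error("Rubric verdict is missing a numeric score.");
  }

  return {
    // Clamp to [0, 1] so threshold comparisons stay on one scale.
    score: Math.min(1, Math.max(0, score)),
    metadata: {
      rationale:
        typeof rationale === "string" ? rationale : "No rationale provided.",
    },
  };
}
```

Throwing on a malformed verdict, rather than silently returning 0, is a design choice in this sketch: it keeps a parsing problem distinguishable from an answer that legitimately scored below the threshold.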