Custom Judges

Use createJudge() when a rubric should be reused across suites. A custom judge receives the same normalized JudgeContext as the built-in judges: input, output, metadata, tool calls, session, run, and app harness, plus a curried runJudge function when a judge harness is configured.

Usage

Keep deterministic checks close to the normalized output. Return scores on a consistent scale, along with JSON-safe metadata that explains the result.

import { createJudge, type JudgeContext } from "vitest-evals";

export const CapitalJudge = createJudge(
  "CapitalJudge",
  async ({ output }: JudgeContext<string, string>) => {
    const passed = output.toLowerCase().includes("paris");

    return {
      score: passed ? 1 : 0,
      metadata: {
        rationale: passed
          ? "The answer names Paris."
          : `Expected Paris, got: ${output}`,
      },
    };
  },
);

Prefer 0 and 1 for pass/fail checks. Reserve intermediate values for true partial credit.
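
When a rubric has several independent requirements, partial credit can be expressed as the fraction of requirements met. The following is a minimal sketch under stated assumptions: scoreKeywordCoverage and the JudgeResult type are hypothetical names, not part of vitest-evals, and the return shape simply mirrors the { score, metadata } object used above.

```typescript
// Hypothetical helper (not a vitest-evals API): scores an answer by the
// fraction of required terms it mentions, compared case-insensitively.
type JudgeResult = {
  score: number;
  metadata: { rationale: string; missing: string[] };
};

function scoreKeywordCoverage(output: string, required: string[]): JudgeResult {
  const haystack = output.toLowerCase();
  const missing = required.filter(
    (term) => !haystack.includes(term.toLowerCase()),
  );

  // An empty requirement list is treated as trivially satisfied.
  const score =
    required.length === 0
      ? 1
      : (required.length - missing.length) / required.length;

  return {
    score,
    metadata: {
      rationale:
        missing.length === 0
          ? "All required terms are present."
          : `Missing terms: ${missing.join(", ")}`,
      missing, // a plain string array, so the metadata stays JSON-safe
    },
  };
}
```

A judge body could return this result directly. The score stays in the 0 to 1 range, so 1 still means a full pass and 0 a full failure.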

Suite Use

Attach reusable judges at the suite level when every case should satisfy the same contract.

import { describeEval } from "vitest-evals";
import { qaHarness } from "./qaHarness";
import { CapitalJudge } from "./judges/capitalJudge";

describeEval("capital questions", {
  harness: qaHarness,
  judges: [CapitalJudge],
  judgeThreshold: 1,
});

Use explicit assertions when a single case needs an extra judge or a different threshold.

await expect(result).toSatisfyJudge(CapitalJudge, {
  threshold: null,
});

Model-Backed Judges

If a custom judge needs an LLM call, configure a judgeHarness on the matcher, the judge object, or the suite, then call ctx.runJudge(...) from the judge. Core curries the current abort signal into that function. Matcher options win over a judge default, and a judge default wins over the suite default. Explicit matcher calls can also reuse a single unambiguous judge-level harness from the suite’s automatic judges, but automatic judges do not inherit inferred harnesses from sibling judges. Leave judgeHarness unset for suites that only use deterministic judges.

import { createJudge } from "vitest-evals";

export const RubricJudge = createJudge({
  name: "RubricJudge",
  async assess(ctx) {
    if (!ctx.runJudge) {
      throw new Error("RubricJudge requires a configured judgeHarness.");
    }

    const verdict = await ctx.runJudge({
      prompt: `Grade this answer: ${JSON.stringify(ctx.output)}`,
      responseFormat: { type: "json" },
    });

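    // parseRubricVerdict is an application-defined helper that maps the
    // model's verdict to a { score, metadata } result.
    return parseRubricVerdict(verdict);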
  },
});
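
The example above relies on a parseRubricVerdict helper that the application has to supply. One possible shape is sketched below. It assumes the model returns either a JSON string or an already-parsed object with a numeric score and an optional rationale. That verdict shape is an assumption for illustration, not a documented contract of ctx.runJudge.

```typescript
// Hypothetical sketch of parseRubricVerdict. The verdict shape
// ({ score, rationale }) is an assumption, not a vitest-evals contract.
type RubricResult = { score: number; metadata: { rationale: string } };

function parseRubricVerdict(verdict: unknown): RubricResult {
  let parsed: unknown = verdict;

  // Accept a raw JSON string as well as an already-parsed value.
  if (typeof verdict === "string") {
    try {
      parsed = JSON.parse(verdict);
    } catch {
      return { score: 0, metadata: { rationale: "Verdict was not valid JSON." } };
    }
  }

  const record =
    typeof parsed === "object" && parsed !== null
      ? (parsed as Record<string, unknown>)
      : {};
  const rawScore = record.score;

  if (typeof rawScore !== "number" || !Number.isFinite(rawScore)) {
    return { score: 0, metadata: { rationale: "Verdict had no numeric score." } };
  }

  return {
    // Clamp to the 0..1 range so a misbehaving model cannot exceed full credit.
    score: Math.min(1, Math.max(0, rawScore)),
    metadata: {
      rationale:
        typeof record.rationale === "string"
          ? record.rationale
          : "No rationale given.",
    },
  };
}
```

Scoring a malformed verdict as 0 is a design choice in this sketch. Throwing instead would surface the problem as an error rather than a failed score, which may suit stricter suites better.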

Failure Behavior

Custom judges fail when the returned score falls below the active threshold. Use explicit thresholds for domain rubrics so a future reader can see whether a score is advisory or blocking.
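
The comparison described above can be sketched as a small pure function. This is an illustration only: judgeOutcome is a hypothetical name, and treating a null threshold as advisory (recorded but never failing) is an assumption made for this sketch, not confirmed library behavior.

```typescript
// Hypothetical helper (not a vitest-evals API). A null threshold is
// ASSUMED to mean "advisory": the score is reported but cannot fail.
type Outcome = "pass" | "fail" | "advisory";

function judgeOutcome(score: number, threshold: number | null): Outcome {
  if (threshold === null) {
    return "advisory";
  }
  // A score fails only when it falls below the threshold; equality passes.
  return score >= threshold ? "pass" : "fail";
}
```

Under these assumptions, judgeOutcome(0.8, 1) is a blocking failure, while judgeOutcome(0.8, null) is only advisory.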