Judges

Judges score the run that the harness already captured. Harnesses own execution; judges own rubrics, parser logic, model calls, score thresholds, and failure messages.

Use deterministic judges before reaching for model-scored rubrics. Put judges at the suite level when every case should satisfy the same contract, and use explicit judge assertions when one case needs an extra check.
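To make "deterministic" concrete, here is a minimal sketch of a judge that scores a captured run without calling a model. The `CapturedRun` and `JudgeResult` shapes and the `containsExpectedJudge` function are hypothetical names invented for this illustration; they are not the vitest-evals types, so treat this as a picture of the idea rather than the library's API.

```typescript
// Hypothetical shapes for illustration only; not the vitest-evals types.
interface CapturedRun {
  output: string;
  metadata: { expected?: string };
}

interface JudgeResult {
  score: number; // 0 (fail) or 1 (pass) in this sketch
  message: string;
}

// Deterministic judge: the same captured run always yields the same score,
// and no model call is involved.
function containsExpectedJudge(run: CapturedRun): JudgeResult {
  const expected = run.metadata.expected;
  if (expected === undefined) {
    return { score: 0, message: "No expected value in run metadata." };
  }
  // Case-insensitive substring match against the captured output.
  const found = run.output.toLowerCase().includes(expected.toLowerCase());
  return {
    score: found ? 1 : 0,
    message: found
      ? `Output contains "${expected}".`
      : `Output does not contain "${expected}".`,
  };
}

const sample = containsExpectedJudge({
  output: "The capital of France is Paris.",
  metadata: { expected: "Paris" },
});
console.log(sample.score, sample.message);
```

A check like this is cheap and repeatable, which is why it is worth trying before a model-scored rubric.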

Judge Types

FactualityJudge: Use a grading model to compare output against an expert answer from normalized run context.
ToolCallJudge: Check tool names, order, and arguments from normalized tool calls.
StructuredOutputJudge: Check JSON-safe output fields against expected metadata or matcher options.
Custom Judges: Create reusable rubrics and attach thresholds where failures should occur.
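As a rough illustration of what checking tool names, order, and arguments can involve, the sketch below compares a list of normalized tool calls against an expected sequence. The `ToolCall` and `ExpectedToolCall` shapes and the `checkToolCalls` helper are assumptions made up for this example; they do not describe how ToolCallJudge is implemented or configured.

```typescript
// Hypothetical shapes for illustration only; not the vitest-evals types.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface ExpectedToolCall {
  name: string;
  // Only the listed arguments are compared; extra actual arguments are ignored.
  arguments?: Record<string, unknown>;
}

// Returns a list of failure messages; an empty list means the calls match.
function checkToolCalls(
  actual: ToolCall[],
  expected: ExpectedToolCall[],
): string[] {
  const failures: string[] = [];
  if (actual.length !== expected.length) {
    failures.push(
      `Expected ${expected.length} tool calls, got ${actual.length}.`,
    );
  }
  expected.forEach((want, i) => {
    const got = actual[i];
    if (got === undefined) return; // already reported by the length check
    if (got.name !== want.name) {
      failures.push(
        `Call ${i}: expected tool "${want.name}", got "${got.name}".`,
      );
      return;
    }
    for (const [key, value] of Object.entries(want.arguments ?? {})) {
      // JSON comparison is enough for JSON-safe values in this sketch,
      // but it is sensitive to key order inside nested objects.
      if (JSON.stringify(got.arguments[key]) !== JSON.stringify(value)) {
        failures.push(`Call ${i}: argument "${key}" does not match.`);
      }
    }
  });
  return failures;
}

const failures = checkToolCalls(
  [{ name: "search", arguments: { query: "capital of France" } }],
  [{ name: "search", arguments: { query: "capital of France" } }],
);
console.log(failures.length === 0 ? "tool calls match" : failures.join("\n"));
```

Because the comparison is positional, two correct calls made in the wrong order are reported as two name mismatches.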

Usage

Suite-level judges run after each run(...) call. This example reuses the Paris question from the harness pages and scores it with the built-in factuality judge.

import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { expect } from "vitest";
import {
  describeEval,
  FactualityJudge,
} from "vitest-evals";
import { qaHarness } from "./qaHarness";

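// Grading model for the judge; temperature 0 reduces run-to-run variation in scores.
const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});
// Built-in factuality judge, backed by the grading model above.
const factualityJudge = FactualityJudge({ judgeHarness });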

describeEval(
  "capital questions",
  {
    harness: qaHarness,
    judges: [factualityJudge],
    judgeThreshold: 0.6,
  },
  (it) => {
    it("knows the capital of France", async ({ run }) => {
      const result = await run("What is the capital of France?", {
        metadata: {
          expected: "Paris is the capital of France.",
        },
      });

      expect(result.output).toContain("Paris");
    });
  },
);

Keep deterministic judges close to the harness contract. Use custom judges for rubrics that need domain-specific scoring or reusable explanations.
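The idea of a reusable rubric with a threshold and an explanation can be sketched as a weighted checklist. Everything here (`Criterion`, `RubricResult`, `scoreRubric`, and the example criteria) is hypothetical and written only for illustration; the library's own custom judge API may look quite different.

```typescript
// Hypothetical shapes for illustration only; not the vitest-evals types.
interface Criterion {
  name: string;
  weight: number;
  check: (output: string) => boolean;
}

interface RubricResult {
  score: number; // earned weight divided by total weight, 0 to 1
  passed: boolean; // true when score >= threshold
  explanation: string;
}

function scoreRubric(
  output: string,
  criteria: Criterion[],
  threshold: number,
): RubricResult {
  const total = criteria.reduce((sum, c) => sum + c.weight, 0);
  const missed: string[] = [];
  let earned = 0;
  for (const c of criteria) {
    if (c.check(output)) {
      earned += c.weight;
    } else {
      missed.push(c.name);
    }
  }
  // An empty rubric scores 0 rather than dividing by zero.
  const score = total === 0 ? 0 : earned / total;
  return {
    score,
    passed: score >= threshold,
    explanation:
      missed.length === 0
        ? "All criteria met."
        : `Missed: ${missed.join(", ")}.`,
  };
}

// Example rubric for the capital question; the weights are arbitrary.
const capitalRubric: Criterion[] = [
  { name: "mentions Paris", weight: 3, check: (o) => o.includes("Paris") },
  { name: "is concise", weight: 1, check: (o) => o.length <= 200 },
];

const rubricResult = scoreRubric("Paris is the capital.", capitalRubric, 0.6);
console.log(rubricResult.passed, rubricResult.explanation);
```

Keeping the criteria as data makes the same rubric reusable across suites, and the explanation string gives a failure message that names the criteria that were missed.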