Current package: 0.10.0

AI evals that feel like Vitest.

Pick the harness for your runtime, run your real agent or app once, and score the normalized result with ordinary Vitest assertions and judges.

PASS apps/qa/evals/capital.eval.ts
  knows the capital of France  142 tok · 1 judge · 1.8s
    CapitalJudge 1.00
    expected Paris
    { "output": "Paris is the capital of France." }

Choose a Harness

Start with the adapter that matches the code you already run in production. Each harness page uses the same Paris example so the runtime differences are easy to compare.

The Core Loop

Shape the app

Keep the production agent, service, or workflow in app code.

Configure a harness

The harness calls that app once and normalizes its output, session, usage, and artifacts into a single result.

Judge the same run

Built-in and custom judges read output, session, tool calls, usage, and metadata without re-running the app.
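
The three steps above can be sketched without any library at all. What follows is a minimal, self-contained illustration, not the vitest-evals API: `NormalizedRun`, `runOnce`, `containsParis`, `underTokenBudget`, and `scoreAll` are hypothetical names, and the stand-in app returns a canned answer instead of calling a model.

```typescript
// Hypothetical shapes for illustration only; not the vitest-evals types.
interface NormalizedRun {
  output: string;
  usage: { totalTokens: number };
  toolCalls: string[];
}

// Step 1: the "production" app stays in app code. This stand-in counts
// its invocations so we can confirm that judging does not re-run it.
let appCalls = 0;
async function qaApp(question: string): Promise<{ text: string; tokens: number }> {
  appCalls += 1;
  const text = question.includes("France")
    ? "Paris is the capital of France."
    : "I don't know.";
  return { text, tokens: 142 };
}

// Step 2: a harness-like wrapper calls the app once and normalizes the result.
async function runOnce(input: string): Promise<NormalizedRun> {
  const raw = await qaApp(input);
  return { output: raw.text, usage: { totalTokens: raw.tokens }, toolCalls: [] };
}

// Step 3: judges are plain functions over the recorded run.
type Judge = (run: NormalizedRun) => { score: number };

const containsParis: Judge = (run) => ({
  score: run.output.toLowerCase().includes("paris") ? 1 : 0,
});

const underTokenBudget: Judge = (run) => ({
  score: run.usage.totalTokens <= 200 ? 1 : 0,
});

// Score one recorded run with every judge; the app is never called here.
function scoreAll(run: NormalizedRun, judges: Judge[]): number[] {
  return judges.map((judge) => judge(run).score);
}
```

Because every judge reads the same recorded run, adding another judge in this sketch adds no further calls to the app.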

Example

capital.eval.ts
import { expect } from "vitest";
import {
  createJudge,
  describeEval,
  type JudgeContext,
} from "vitest-evals";
import { qaHarness } from "./qaHarness";

// Custom judge: scores 1 when the output mentions Paris (case-insensitive), otherwise 0.
const CapitalJudge = createJudge(
  "CapitalJudge",
  async ({ output }: JudgeContext<string, string>) => ({
    score: output.toLowerCase().includes("paris") ? 1 : 0,
  }),
);

describeEval("capital questions", { harness: qaHarness }, (it) => {
  it("knows the capital of France", async ({ run }) => {
    // Run the app through the harness; everything below reads this one result.
    const result = await run("What is the capital of France?");

    // An ordinary Vitest assertion and a judge, both applied to the same run.
    expect(result.output).toContain("Paris");
    await expect(result).toSatisfyJudge(CapitalJudge);
  });
});

GitHub Actions

The action reads Vitest JSON reports and publishes summaries, workflow annotations, and optional Check Runs.

.github/workflows/evals.yml
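# Write a JSON report in addition to the eval reporter output; the action below reads it.
- name: Run evals
  run: |
    pnpm exec vitest run --config vitest.evals.config.ts \
      --reporter=vitest-evals/reporter \
      --reporter=json \
      --outputFile.json=vitest-results.json

# always() lets this step publish results even when the eval step above fails.
- uses: getsentry/vitest-evals@v0
  if: always()
  with:
    results: vitest-results.json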
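
To inspect the same report locally before wiring up the action, something like the following could work. It is a sketch that assumes the Jest-compatible layout written by Vitest's JSON reporter (`numTotalTests`, `testResults[].assertionResults[]`, and so on). `summarizeReport` is a hypothetical helper, not part of vitest-evals or the action, and the types declare only the fields it reads.

```typescript
// Subset of the JSON reporter layout, declared here as an assumption.
interface AssertionResult {
  fullName: string;
  status: string; // e.g. "passed" or "failed"
}
interface FileResult {
  name: string;
  assertionResults: AssertionResult[];
}
interface JsonReport {
  numTotalTests: number;
  numPassedTests: number;
  numFailedTests: number;
  testResults: FileResult[];
}

// Hypothetical helper: one headline line plus one line per failed test.
function summarizeReport(report: JsonReport): string[] {
  const lines = [
    `${report.numPassedTests}/${report.numTotalTests} passed, ${report.numFailedTests} failed`,
  ];
  for (const file of report.testResults) {
    for (const test of file.assertionResults) {
      if (test.status === "failed") {
        lines.push(`FAIL ${file.name} > ${test.fullName}`);
      }
    }
  }
  return lines;
}
```

In a real script the report object would come from parsing `vitest-results.json`, for example with `JSON.parse` on the file contents. Verify the field names against the report your Vitest version actually writes.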