Current package: 0.10.0
AI evals that feel like Vitest.
Pick the harness for your runtime, run your real agent or app once, and score the normalized result with ordinary Vitest assertions and judges.
PASS apps/qa/evals/capital.eval.ts
  ✓ knows the capital of France
    CapitalJudge 1.00
    expected Paris
    { "output": "Paris is the capital of France." }
Choose a Harness
Start with the adapter that matches the code you already run in production. Each harness page uses the same Paris example so the runtime differences are easy to compare.
The Core Loop
Shape the app
Keep the production agent, service, or workflow in app code.
Configure a harness
The harness calls that app and normalizes output, session, usage, and artifacts.
Judge the same run
Built-in and custom judges read output, session, tool calls, usage, and metadata without re-running the app.
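The three steps can be sketched without any library code. Everything below is a hypothetical stand-in chosen for illustration (`NormalizedRun`, `qaApp`, `runOnce`, `evaluate`, and both judges are assumptions, not vitest-evals types or APIs); the point is only that the app runs once and every judge reads the same normalized result.

```typescript
// Hypothetical shapes for illustration only; these are NOT the vitest-evals types.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface NormalizedRun {
  output: string;
  toolCalls: ToolCall[];
  usage: { inputTokens: number; outputTokens: number };
}

type Judge = (run: NormalizedRun) => { name: string; score: number };

// Step 1: the "production" app stays in app code. This stub stands in for it
// and counts how many times it is invoked.
let appCalls = 0;
async function qaApp(
  question: string,
): Promise<{ text: string; tokensIn: number; tokensOut: number }> {
  appCalls += 1;
  const text = question.includes("France")
    ? "Paris is the capital of France."
    : "I don't know.";
  return { text, tokensIn: 9, tokensOut: 7 };
}

// Step 2: a harness-like adapter calls the app and normalizes its raw result.
async function runOnce(input: string): Promise<NormalizedRun> {
  const raw = await qaApp(input);
  return {
    output: raw.text,
    toolCalls: [],
    usage: { inputTokens: raw.tokensIn, outputTokens: raw.tokensOut },
  };
}

// Step 3: judges only read the normalized run; none of them call the app.
const mentionsParis: Judge = (run) => ({
  name: "mentionsParis",
  score: run.output.toLowerCase().includes("paris") ? 1 : 0,
});

const usesNoTools: Judge = (run) => ({
  name: "usesNoTools",
  score: run.toolCalls.length === 0 ? 1 : 0,
});

// Run the app once, then score that single result with every judge.
async function evaluate(input: string, judges: Judge[]) {
  const run = await runOnce(input);
  return judges.map((judge) => judge(run));
}
```

Adding another judge in this sketch costs no extra app invocations, which is the property the loop above is built around.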
Example
capital.eval.ts
import { expect } from "vitest";
import {
  createJudge,
  describeEval,
  type JudgeContext,
} from "vitest-evals";
import { qaHarness } from "./qaHarness";

const CapitalJudge = createJudge(
  "CapitalJudge",
  async ({ output }: JudgeContext<string, string>) => ({
    score: output.toLowerCase().includes("paris") ? 1 : 0,
  }),
);

describeEval("capital questions", { harness: qaHarness }, (it) => {
  it("knows the capital of France", async ({ run }) => {
    const result = await run("What is the capital of France?");
    expect(result.output).toContain("Paris");
    await expect(result).toSatisfyJudge(CapitalJudge);
  });
});

GitHub Actions
The action reads Vitest JSON reports and publishes summaries, workflow annotations, and optional Check Runs.
.github/workflows/evals.yml
- name: Run evals
  run: |
    pnpm exec vitest run --config vitest.evals.config.ts \
      --reporter=vitest-evals/reporter \
      --reporter=json \
      --outputFile.json=vitest-results.json
- uses: getsentry/vitest-evals@v0
  if: always()
  with:
    results: vitest-results.json
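To get a feel for what a JSON report contains, the sketch below tallies passes and failures from a report object. It assumes the Jest-compatible layout that Vitest's JSON reporter writes (`testResults[].assertionResults[]` with `fullName` and `status`) and models only that subset. The `summarize` helper and the sample data are hypothetical, and this is not how the action itself is implemented.

```typescript
// Assumed subset of the Vitest JSON report; the real file carries more fields.
interface AssertionResult {
  fullName: string;
  status: string; // e.g. "passed" or "failed"
}

interface FileResult {
  name: string; // path of the test file
  assertionResults: AssertionResult[];
}

interface JsonReport {
  testResults: FileResult[];
}

interface Summary {
  passed: number;
  failed: number;
  failures: string[];
}

// Hypothetical helper: count passed and failed tests and list the failures.
// Tests with any other status (for example skipped ones) are ignored.
function summarize(report: JsonReport): Summary {
  const summary: Summary = { passed: 0, failed: 0, failures: [] };
  for (const file of report.testResults) {
    for (const test of file.assertionResults) {
      if (test.status === "passed") {
        summary.passed += 1;
      } else if (test.status === "failed") {
        summary.failed += 1;
        summary.failures.push(`${file.name}: ${test.fullName}`);
      }
    }
  }
  return summary;
}

// Made-up sample data shaped like the assumed report layout.
const sampleReport: JsonReport = {
  testResults: [
    {
      name: "apps/qa/evals/capital.eval.ts",
      assertionResults: [
        { fullName: "capital questions knows the capital of France", status: "passed" },
        { fullName: "capital questions knows the capital of Spain", status: "failed" },
      ],
    },
  ],
};

const sampleSummary = summarize(sampleReport);
```

In a real workflow the input would come from `JSON.parse` over the contents of `vitest-results.json` rather than an inline object.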