AI SDK Harness

Use the AI SDK harness when your app already builds prompts, models, and tools with the Vercel AI SDK. This page follows one small end-to-end story: a geography assistant answers “What is the capital of France?”, the harness turns that AI SDK result into eval output, and a judge checks that the answer names Paris.

Install

pnpm add -D vitest-evals @vitest-evals/harness-ai-sdk

App Shape

Keep the agent factory in app code. The harness should call the same model, prompt, tools, and parser that production uses.

import { openai } from "@ai-sdk/openai";

export function createQuestionAgent() {
  return {
    model: openai("gpt-4o-mini"),
    system:
      "Answer geography questions directly. Keep answers short.",
  };
}

export function textAnswer(text: string): string {
  return text.trim();
}

createQuestionAgent() owns the runtime setup. textAnswer() is the small app parser that turns provider text into the value tests and judges should read.
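As a quick illustration of what the parser contributes, the sketch below repeats textAnswer() so it runs on its own and applies it to a made-up provider string. The sample text is an assumption for illustration, not real model output.

```typescript
// Same parser as in the app code above, repeated so this snippet is self-contained.
function textAnswer(text: string): string {
  return text.trim();
}

// Hypothetical raw provider text with stray leading and trailing whitespace.
const raw = "  Paris.\n";
const answer = textAnswer(raw);

// JSON.stringify makes any remaining whitespace visible in the log.
console.log(JSON.stringify(answer)); // prints "Paris."
```

Because the harness and production both go through the same function, any later change to parsing (for example, stripping markdown) is picked up by the evals without touching the harness.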

Configure Harness

The harness wraps the app shape and defines the normalized output. In this case, the normalized output is just the trimmed answer text returned by textAnswer().

import { aiSdkHarness } from "@vitest-evals/harness-ai-sdk";
import { createQuestionAgent, textAnswer } from "../src/questionAgent";

export const qaHarness = aiSdkHarness({
  agent: () => createQuestionAgent(),
  output: ({ result }) => textAnswer(result.text),
});

Use output for the value your evals care about. Messages, tool calls, usage, native trace spans, and errors remain available on the normalized harness run for judges, replay, and reports. Tool-call transcript data comes from AI SDK steps. Custom run entrypoints that do not return steps use normalized transcript events for local tool executions instead. Custom runs that return only an output still get synthesized input and output messages; when an eval needs exact control over the transcript beyond that, return a normalized session.

Writing Evals

Call run(input) once, then judge that same normalized result. The judge is deliberately simple so the harness story stays clear.

import { expect } from "vitest";
import { createJudge, describeEval } from "vitest-evals";
import { qaHarness } from "./qaHarness";

const CapitalJudge = createJudge<string, string>(
  "CapitalJudge",
  async ({ output }) => {
    const passed = output.toLowerCase().includes("paris");

    return {
      score: passed ? 1 : 0,
      metadata: {
        rationale: passed
          ? "The answer names Paris."
          : `Expected Paris, got: ${output}`,
      },
    };
  },
);

describeEval("capital questions", { harness: qaHarness }, (it) => {
  it("knows the capital of France", async ({ run }) => {
    const result = await run("What is the capital of France?");

    expect(result.output).toContain("Paris");
    await expect(result).toSatisfyJudge(CapitalJudge);
  });
});
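The judge's scoring rule can also be exercised without a model call. The sketch below lifts the same case-insensitive check into a standalone function. Both scoreCapitalAnswer and JudgeResult are hypothetical names introduced for illustration; they are not part of vitest-evals.

```typescript
// Hypothetical result shape mirroring what the CapitalJudge callback returns above.
interface JudgeResult {
  score: number;
  metadata: { rationale: string };
}

// Hypothetical helper: the same rule as the CapitalJudge body, with no harness involved.
function scoreCapitalAnswer(output: string): JudgeResult {
  // Lowercasing first makes the match case-insensitive.
  const passed = output.toLowerCase().includes("paris");

  return {
    score: passed ? 1 : 0,
    metadata: {
      rationale: passed
        ? "The answer names Paris."
        : `Expected Paris, got: ${output}`,
    },
  };
}

console.log(scoreCapitalAnswer("PARIS").score); // prints 1
console.log(scoreCapitalAnswer("Lyon").metadata.rationale); // prints Expected Paris, got: Lyon
```

Note that the judge matches case-insensitively, while expect(result.output).toContain("Paris") is case-sensitive. An answer of "paris" would satisfy the judge but fail the plain assertion, so keep whichever check matches how strict the eval should be.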