Tool Replay

Tool replay lets evals keep testing real agent behavior without paying for every external dependency on every run. Use it for local tool calls that are expensive, slow, rate-limited, unstable, or hard to reproduce.

Replay local tools and service calls, not the entire model interaction. Good candidates include web search, retrieval, third-party APIs, browser fetches, and internal service lookups.

evals/qaHarness.ts
export const qaHarness = aiSdkHarness({
  agent: () => createQuestionAgent(),
  tools: {
    lookupCapital,
  },
  toolReplay: {
    lookupCapital: true,
  },
  output: ({ result }) => textAnswer(result.text),
});

Keep provider model calls live unless your app exposes them as local tools. That keeps evals sensitive to prompt and model behavior while removing noise from external dependencies.

Choose a mode with VITEST_EVALS_REPLAY_MODE.

Mode      Behavior
auto      Replay when a recording exists; otherwise call live and record. This is the default.
record    Always call live and overwrite recordings. Use this to refresh fixtures intentionally.
off       Call live tools and do not read or write recordings.
strict    Require an existing recording and fail when one is missing.
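
The four modes reduce to a small decision table over two inputs: the mode and whether a recording already exists. The sketch below is only an illustration of that table, not the library's implementation; `ReplayMode`, `ReplayAction`, and `decideReplayAction` are hypothetical names introduced here.

```typescript
// Hypothetical sketch of the mode semantics; not the library's actual code.
type ReplayMode = "auto" | "record" | "off" | "strict";
type ReplayAction = "replay" | "call-live" | "call-live-and-record" | "fail";

function decideReplayAction(mode: ReplayMode, hasRecording: boolean): ReplayAction {
  switch (mode) {
    case "off":
      // Never read or write recordings.
      return "call-live";
    case "record":
      // Always call live and overwrite any existing recording.
      return "call-live-and-record";
    case "strict":
      // Replay only; a missing recording is an error.
      return hasRecording ? "replay" : "fail";
    case "auto":
      // Replay when possible; otherwise call live and save the result.
      return hasRecording ? "replay" : "call-live-and-record";
  }
}
```

Read this way, auto and strict differ only in what happens when a recording is missing, while record and off ignore existing recordings entirely.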

Use a replay config object when a cache key needs to ignore unstable values, recordings need a version, or outputs need redaction.

evals/qaHarness.ts
export const qaHarness = aiSdkHarness({
  agent: () => createQuestionAgent(),
  tools: {
    lookupCapital,
  },
  toolReplay: {
    lookupCapital: {
      version: "v1",
      // Build the cache key from the stable argument only.
      key: (args) => ({
        country: args.country,
      }),
      // Keep only the output fields worth committing.
      sanitize: (recording) => ({
        ...recording,
        output: {
          capital: recording.output.capital,
          country: recording.output.country,
        },
      }),
    },
  },
  output: ({ result }) => textAnswer(result.text),
});
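
To see why a `key` projection helps, consider how a lookup key might be derived from it. The sketch below is an assumption-laden illustration: the library's real key format is not described here, `requestId` is a made-up unstable argument, `cacheKey` and `stableStringify` are hypothetical helpers, and folding the version into the key (so that bumping it stops matching old recordings) is an assumption.

```typescript
// Hypothetical illustration; the library's real key format may differ.
type LookupArgs = { country: string; requestId: string };

// Same projection as the `key` option above: keep only the stable argument.
const key = (args: LookupArgs) => ({ country: args.country });

// Serialize with sorted property names so property order never changes the key.
function stableStringify(value: Record<string, string>): string {
  const sorted = Object.keys(value)
    .sort()
    .map((name) => [name, value[name]]);
  return JSON.stringify(sorted);
}

// Assumption: tool name and version are part of the lookup key.
function cacheKey(tool: string, version: string, args: LookupArgs): string {
  return `${tool}:${version}:${stableStringify(key(args))}`;
}

const first = cacheKey("lookupCapital", "v1", { country: "France", requestId: "a1" });
const second = cacheKey("lookupCapital", "v1", { country: "France", requestId: "b2" });
const bumped = cacheKey("lookupCapital", "v2", { country: "France", requestId: "a1" });
// first and second are equal because requestId is ignored;
// bumped differs from both because the version changed.
```

Under these assumptions, two calls that differ only in an unstable argument share one recording, and a version bump forces a fresh one.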

Keep recordings small and reviewable. Redact secrets, customer data, unstable timestamps, and high-cardinality response fields before committing fixtures.
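
The `sanitize` example earlier uses an allow-list of output fields. For larger responses, a deny-list redactor can be a useful complement. The following is a minimal sketch, not part of the library: `redact`, `REDACTED_KEYS`, and the field names `apiKey`, `email`, and `fetchedAt` are hypothetical and chosen only for illustration.

```typescript
// Hypothetical helper; the field names are illustrative, not from the library.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const REDACTED_KEYS = new Set(["apiKey", "email", "fetchedAt"]);

// Walk a JSON value and replace the value of every denied key with a placeholder.
function redact(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [name, inner] of Object.entries(value)) {
      out[name] = REDACTED_KEYS.has(name) ? "[redacted]" : redact(inner);
    }
    return out;
  }
  return value;
}

const cleaned = redact({
  capital: "Paris",
  country: "France",
  fetchedAt: "2024-01-01T00:00:00Z",
  source: { apiKey: "secret", url: "https://example.com" },
});
```

A helper like this could be called from a `sanitize` function, for example by passing `recording.output` through `redact` before returning the recording. Because the placeholder is constant, reruns in record mode would not produce fixture diffs from timestamps alone.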

Most of the time, auto is all you need: recordings accumulate on the first run and replay automatically after that. When an intentional app change needs new tool responses, rerun in record mode and review the fixture diff alongside the code change.

refresh tool recordings
VITEST_EVALS_REPLAY_MODE=record pnpm evals