Tool Replay
Tool replay lets evals keep testing real agent behavior without paying for every external dependency on every run. Use it for local tool calls that are expensive, slow, rate-limited, unstable, or hard to reproduce.
What to Replay
Replay local tools and service calls, not the entire model interaction. Good candidates include web search, retrieval, third-party APIs, browser fetches, and internal service lookups.
```typescript
export const qaHarness = aiSdkHarness({
  agent: () => createQuestionAgent(),
  tools: {
    lookupCapital,
  },
  // Enable replay for this tool with default settings.
  toolReplay: {
    lookupCapital: true,
  },
  output: ({ result }) => textAnswer(result.text),
});
```

Keep live provider model calls live unless your app exposes them as local tools. That keeps evals sensitive to prompt and model behavior while removing noise from dependencies.
Replay Modes
Choose a mode with the VITEST_EVALS_REPLAY_MODE environment variable.
| Mode | Behavior |
|---|---|
| auto | Replay when a recording exists; otherwise call live and record. This is the default. |
| record | Always call live and overwrite recordings. Use this to refresh fixtures intentionally. |
| off | Call live tools and do not read or write recordings. |
| strict | Require an existing recording and fail when one is missing. |
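
The table above can be read as a small decision function. The following is a minimal sketch of that logic, not the library's internal implementation; the names ReplayMode, ReplayAction, and decideAction are hypothetical and exist only for illustration.

```typescript
// Hypothetical names throughout: this mirrors the mode table,
// not the library's actual internals.
type ReplayMode = "auto" | "record" | "off" | "strict";

type ReplayAction =
  | "replay" // return the stored recording and skip the live call
  | "call-and-record" // call the live tool, then write or overwrite the recording
  | "call-only" // call the live tool and leave recordings untouched
  | "fail"; // no recording is available, so raise an error

function decideAction(mode: ReplayMode, hasRecording: boolean): ReplayAction {
  switch (mode) {
    case "auto":
      // Replay when possible; otherwise go live and capture a fixture.
      return hasRecording ? "replay" : "call-and-record";
    case "record":
      // Always refresh the fixture, even when one already exists.
      return "call-and-record";
    case "off":
      // Bypass the recording store entirely.
      return "call-only";
    case "strict":
      // Never go live: a missing recording is an error.
      return hasRecording ? "replay" : "fail";
  }
}
```

Under this reading, auto and strict differ only when a recording is missing: auto falls back to a live call, while strict fails.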
Cache Keys and Redaction
Use a replay config object when a cache key needs to ignore unstable values, recordings need a version, or outputs need redaction.
```typescript
export const qaHarness = aiSdkHarness({
  agent: () => createQuestionAgent(),
  tools: {
    lookupCapital,
  },
  toolReplay: {
    lookupCapital: {
      version: "v1",
      // Build the cache key from stable arguments only.
      key: (args) => ({
        country: args.country,
      }),
      // Keep only the output fields worth committing.
      sanitize: (recording) => ({
        ...recording,
        output: {
          capital: recording.output.capital,
          country: recording.output.country,
        },
      }),
    },
  },
  output: ({ result }) => textAnswer(result.text),
});
```

Keep recordings small and reviewable. Redact secrets, customer data, unstable timestamps, and high-cardinality response fields before committing fixtures.
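
To illustrate why a key function helps, here is a minimal sketch of how a tool name, a version, and the key function's result could combine into a stable cache key. Everything in it is an assumption made for illustration: stableStringify, cacheKey, keyFor, and the requestId argument are hypothetical, and the sketch assumes the version participates in the key, which the library may or may not do.

```typescript
// Hypothetical sketch: the real library may derive cache keys differently.
type Json = string | number | boolean | null | Json[] | { [k: string]: Json };

// Serialize with sorted object keys so property order cannot change the key.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const entries = Object.entries(value)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
  return `{${entries.join(",")}}`;
}

interface LookupArgs {
  country: string;
  requestId: string; // hypothetical per-call value that should not affect the key
}

// Assumption: tool name, version, and the key function's result form the key.
function cacheKey(tool: string, version: string, keyPart: Json): string {
  return stableStringify({ tool, version, key: keyPart });
}

// Same shape as the key option in the config above: keep only stable arguments.
const keyFor = (args: LookupArgs): Json => ({ country: args.country });

const first = cacheKey("lookupCapital", "v1", keyFor({ country: "France", requestId: "a1" }));
const second = cacheKey("lookupCapital", "v1", keyFor({ country: "France", requestId: "b2" }));
const bumped = cacheKey("lookupCapital", "v2", keyFor({ country: "France", requestId: "a1" }));

console.log(first === second); // true: requestId is excluded from the key
console.log(first === bumped); // false: a new version yields a new key
```

In this sketch, two calls that differ only in requestId share one recording, and bumping the version leaves old recordings unmatched rather than silently reused.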
Workflow
Most of the time, auto is all you need: recordings accumulate on the first run and replay automatically after that. When an intentional app change needs new tool responses, rerun in record mode and review the fixture diff alongside the code change.
```sh
VITEST_EVALS_REPLAY_MODE=record pnpm evals
```