Prerequisites
- A working Pipecat agent that uses create_transport() and the development runner (the standard pattern from the quickstart and all Pipecat examples), with its usual service API keys in .env.
- The Pipecat CLI: uv tool install "pipecat-ai[cli]".
- A judge LLM. Either:
  - Ollama (local, the default): install Ollama and run ollama pull gemma2:9b, or
  - OpenAI: set OPENAI_API_KEY and point the scenario’s judge: block at it (shown below).
Run your agent with the eval transport
If your agent uses create_transport(), it supports the eval transport with a one-line addition to its transport_params. Then start the agent with -t eval.
Instead of connecting to Daily or WebRTC, the agent now hosts a local WebSocket server and waits for the eval harness to connect. Nothing else in the agent changes: same pipeline, same services, same event handlers.
The harness talks to your agent over RTVI. PipelineWorker adds an RTVIProcessor and RTVIObserver automatically, so the standard agent setup needs no extra wiring. All Pipecat example agents already include the "eval" transport entry.
Write a scenario
A scenario is a YAML file describing a scripted conversation and the behavior you expect. Save this as scenarios/capital_question.yaml:
- Ollama judge (default)
- OpenAI judge
Each turn optionally sends a user utterance and lists the events expected in response. The eval: field is a natural-language criterion checked by the judge LLM, so the test passes whether the agent says “Berlin is the capital of Germany” or “That would be Berlin!”.
This scenario runs in text mode (the default): the user turn is sent as text and the agent’s TTS is skipped automatically, so the whole conversation costs nothing in audio services and finishes in seconds.
Ollama with gemma2:9b is the default judge, which is why the first tab has no judge: block. To use a different judge LLM, add a judge.eval: block as in the OpenAI tab.
Run the eval
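The harness expects the agent's local WebSocket server to be up before it connects. As a minimal pre-flight sketch, assuming the default address the harness uses (localhost, port 7860), the check below confirms that something is accepting TCP connections there. `port_is_open` is a hypothetical helper, not part of Pipecat, and it does not verify that the listener actually speaks RTVI.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections at host:port.

    The defaults mirror the harness's default bot URL; adjust them if the
    agent was started on a different address (an assumption of this sketch).
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `port_is_open()` before launching the eval gives a quick, explicit signal when the agent was never started or is bound to another port.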
With the agent still running, run the scenario from another terminal.
The harness connects to ws://localhost:7860 (override with --bot-url), drives the conversation, and reports the result. Pass -v to watch each turn resolve.
The command exits 0 when everything passes and 1 otherwise, so it slots directly into scripts and CI. Each scenario also writes a decision trace to <scenario>.eval.log, which shows every event the harness saw and why each assertion passed or failed.
Make it fail (optional but recommended)
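A deliberately failing scenario is also a convenient way to confirm the exit-code contract that scripts and CI rely on: 0 when everything passes, 1 otherwise. The sketch below wraps that check in Python. The eval command is passed in as an argument list rather than hard-coded, because its exact form is not assumed here; `run_eval` is a hypothetical helper, not part of Pipecat.

```python
import subprocess
import sys
from typing import Sequence


def run_eval(command: Sequence[str]) -> bool:
    """Run an eval command and report whether it passed.

    `command` is the full argument list of the eval invocation, supplied
    by the caller (this sketch makes no assumption about its exact form).
    """
    completed = subprocess.run(list(command))
    # Exit status 0 means every expectation passed; anything else is a failure.
    return completed.returncode == 0
```

A CI step could then end with `sys.exit(0 if run_eval(cmd) else 1)`, where `cmd` is whatever command you ran by hand, so a failing criterion fails the build.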
Change the criterion to something false, for example "the response says the capital of Germany is Madrid", and run again.
A failing eval tells you which turn, which expectation, and why. That message (plus the .eval.log trace) is what you, or your AI coding assistant, iterate against.
Where to go next
- Learn the full scenario format, including multi-turn conversations, function call assertions, interruptions, latency budgets, and text vs audio modes, in Writing Scenarios.
- Have many scenarios or agents? Let Pipecat spawn the agents for you with Eval Suites.
- Want your coding assistant to run these for you? See Agent Self-Improvement.