Why evals matter
Voice agents are probabilistic systems. The same agent can answer differently run to run, and a prompt tweak, a model upgrade, or a service swap can quietly break behavior that used to work: a function that no longer gets called, context that stops carrying across turns, an interruption that derails the conversation. Manual testing catches some of this, but it’s slow, unrepeatable, and impractical to run on every change. Evals make agent behavior testable the way unit tests make code testable:

- Regression safety: run your scenarios after every prompt, model, or pipeline change and catch breakage before users do.
- Fast iteration: text-mode evals skip STT and TTS entirely, so a full conversation test runs in seconds with no audio service cost.
- Semantic assertions: an LLM judge checks meaning (“the response says the capital is Berlin”), not exact strings, so tests don’t break when wording changes.
- A feedback signal for AI coding assistants: evals give a coding assistant a command it can run and a pass/fail result it can read, closing the loop between writing agent code and verifying it. See Agent Self-Improvement.
How it works
Pipecat Evals has two halves:

- The eval transport. Your agent runs unchanged with the eval transport. If your agent uses `create_transport()` and the development runner, this is already built in: start it with `-t eval` and it hosts a local WebSocket server speaking RTVI, instead of connecting to Daily, WebRTC, or telephony.
- The eval harness. The harness connects to that transport as an RTVI client, plays the scenario’s user turns (as text, or as synthesized speech in audio mode), collects the events your agent emits, and asserts on them in order: transcriptions, LLM responses, spoken output, function calls, and timing.
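To make "asserts on them in order" concrete, here is a minimal sketch of in-order matching over a collected event stream. It is an illustration only, not the harness's implementation: the `Event` shape, the event kind names, the `lookup_capital` function name, and the case-insensitive substring match are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One collected event (hypothetical shape, for illustration only)."""
    kind: str     # e.g. "transcription", "function_call", "llm_response"
    payload: str  # response text, or the name of the called function


def match_in_order(events: list[Event], expected: list[tuple[str, str]]) -> bool:
    """Return True if each (kind, substring) expectation is satisfied by some
    event, and the satisfying events appear in the same order as the
    expectations. Unrelated events in between are skipped."""
    stream = iter(events)
    for kind, needle in expected:
        # any() consumes the shared iterator up to and including the first
        # match, so the next expectation can only match a later event.
        if not any(
            e.kind == kind and needle.lower() in e.payload.lower()
            for e in stream
        ):
            return False
    return True


events = [
    Event("transcription", "What's the capital of Germany?"),
    Event("function_call", "lookup_capital"),
    Event("llm_response", "The capital of Germany is Berlin."),
]

# The function call happens before the reply, so this order matches.
ok = match_in_order(
    events, [("function_call", "lookup_capital"), ("llm_response", "berlin")]
)
# The same expectations in the opposite order do not match.
reordered = match_in_order(
    events, [("llm_response", "berlin"), ("function_call", "lookup_capital")]
)
print(ok, reordered)
```

Matching as an ordered subsequence rather than an exact sequence is one reasonable design for a probabilistic agent: extra events between the expected ones do not fail the check, but a missing or out-of-order event does.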
Text and audio modes
Every scenario runs in one of two modes:

| Mode | User input | Agent output | Best for |
|---|---|---|---|
| Text (default) | Sent as text, bypassing the STT | LLM text; TTS is skipped automatically | Fast, cheap iteration on prompts, logic, and function calling |
| Audio | Synthesized by a TTS the harness runs (local by default) | Real synthesized speech, transcribed by an STT the harness runs | True end-to-end coverage of the full STT, LLM, and TTS pipeline |
What you can test
- Response content: substring checks (`text_contains`) or semantic judging (`eval`) of the agent’s replies.
- Multi-turn context: verify the agent remembers earlier turns.
- Function calling: assert that specific tools were called, with specific arguments.
- Interruptions: barge in mid-response and verify the agent recovers (`send_after`).
- Latency: per-event budgets with `within_ms`.
- Vision: serve an image when the agent requests one and judge its description.
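As a rough illustration of how a substring check and a per-event latency budget can combine into one pass/fail result, the sketch below evaluates a single expectation against timestamped events. Only the names `text_contains` and `within_ms` come from the scenario format; the `TimedEvent` type, the `check_expectation` helper, the choice to measure time from when the user turn was sent, and the case-insensitive match are assumptions for this example, not the library's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimedEvent:
    """Hypothetical event record with an arrival time."""
    kind: str
    text: str
    at_ms: float  # milliseconds since the user turn was sent (assumed reference point)


def check_expectation(
    events: list[TimedEvent],
    kind: str,
    text_contains: Optional[str] = None,
    within_ms: Optional[float] = None,
) -> tuple[bool, str]:
    """Find the first event of `kind` whose text contains `text_contains`
    (case-insensitive here, by assumption) and check it against the budget."""
    for e in events:
        if e.kind != kind:
            continue
        if text_contains is not None and text_contains.lower() not in e.text.lower():
            continue
        if within_ms is not None and e.at_ms > within_ms:
            return False, f"{kind} arrived at {e.at_ms:.0f} ms, budget {within_ms} ms"
        return True, "ok"
    return False, f"no matching {kind} event"


events = [TimedEvent("llm_response", "The capital is Berlin.", 820.0)]

passed = check_expectation(events, "llm_response", text_contains="berlin", within_ms=1000)
too_slow = check_expectation(events, "llm_response", text_contains="berlin", within_ms=500)
missing = check_expectation(events, "llm_response", text_contains="paris")
print(passed, too_slow, missing)
```

Returning a reason string alongside the boolean keeps a failure readable for both a person and an automated caller, since a content miss and a blown latency budget call for different fixes.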
YAML or Python
Scenarios are YAML files, so they’re easy to write, review, and share. Everything is also available as a library: load and run scenarios programmatically, build them in code, inject a custom judge, or orchestrate whole suites from your own tooling. See Using the Library.
Requirements
- Pipecat CLI: the `pipecat eval` commands ship with the CLI extra: `uv tool install "pipecat-ai[cli]"`. The same commands are available as `python -m pipecat.evals`.
- A judge LLM (for `eval:` assertions): Ollama by default (`ollama pull gemma2:9b`), or point the scenario’s `judge:` block at OpenAI or any OpenAI-compatible endpoint.
- Audio services (audio mode only): the harness needs a TTS to synthesize the user’s voice and an STT to transcribe the agent’s speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `uv add "pipecat-ai[kokoro,moonshine]"` or `uv add "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren’t supported here, which keeps the harness simple.
- Your agent’s own credentials: the agent under test is your real agent, so it needs the same service API keys it normally would.
Production evaluation
Pipecat Evals is built for development: fast, local, repeatable, and run on every change. Once your agent is deployed, third-party evaluation platforms complement it with testing and monitoring at production scale:

- Simulations: scripted or AI-driven test calls over API, WebSocket, or telephony, exercising multi-turn flows, edge cases, and real phone-network conditions before they reach users.
- Observability: continuous evaluation of live traffic, with automated quality scoring of calls and transcripts, and metrics tracked over time to catch quality drift.
Coval
AI-native simulation and evaluation platform for voice agents, trusted by
QA, Engineering, Operations, AI, and Executive teams.
Bluejay
Simulation, observability, and evaluation platform with native Pipecat Cloud
integration. Supports no-code API, WebSocket, and telephony testing.
Cekura
Automated testing and monitoring platform with native Pipecat integration
for WebRTC and text-based testing, plus support for Mock Tools, Custom
Dynamic Variables, and more.
Building an evaluation integration for Pipecat? We welcome contributions to
this page. Open a PR on the docs
repository.
Next steps
Quickstart
Run your first eval against an existing agent in a few minutes.
Writing Scenarios
The full scenario format: turns, expectations, modalities, and the judge.
Eval Suites
Spawn multiple agents and run many scenarios concurrently from a manifest.
Agent Self-Improvement
Close the loop: let an AI coding assistant write agent code, run the evals,
and fix what fails.