Pipecat Evals is the framework’s built-in system for testing agent behavior. You describe a conversation and the behavior you expect, and Pipecat runs it against your real agent (the same pipeline, the same services, the same code) and tells you whether the expectation still holds.
capital_question.yaml
name: capital_question

turns:
  - user: "What is the capital of Germany?"
    expect:
      - event: response
        eval: "the response says the capital of Germany is Berlin"

pipecat eval run capital_question.yaml

Why evals matter

Voice agents are probabilistic systems. The same agent can answer differently run to run, and a prompt tweak, a model upgrade, or a service swap can quietly break behavior that used to work: a function that no longer gets called, context that stops carrying across turns, an interruption that derails the conversation. Manual testing catches some of this, but it’s slow, unrepeatable, and impractical to run on every change. Evals make agent behavior testable the way unit tests make code testable:
  • Regression safety: run your scenarios after every prompt, model, or pipeline change and catch breakage before users do.
  • Fast iteration: text-mode evals skip STT and TTS entirely, so a full conversation test runs in seconds with no audio service cost.
  • Semantic assertions: an LLM judge checks meaning (“the response says the capital is Berlin”), not exact strings, so tests don’t break when wording changes.
  • A feedback signal for AI coding assistants: evals give a coding assistant a command it can run and a pass/fail result it can read, closing the loop between writing agent code and verifying it. See Agent Self-Improvement.
Pipecat itself relies on this framework: before every release, an eval suite drives 100+ example agents end to end.

How it works

Pipecat Evals has two halves:
  1. The eval transport. Your agent runs unchanged with the eval transport. If your agent uses create_transport() and the development runner, this is already built in: start it with -t eval and it hosts a local WebSocket server speaking RTVI, instead of connecting to Daily, WebRTC, or telephony.
  2. The eval harness. The harness connects to that transport as an RTVI client, plays the scenario’s user turns (as text, or as synthesized speech in audio mode), collects the events your agent emits, and asserts on them in order: transcriptions, LLM responses, spoken output, function calls, and timing.
When a scenario asserts on meaning rather than exact text, a judge LLM evaluates the agent’s response against a natural-language criterion. The judge runs locally with Ollama by default, or against OpenAI or any OpenAI-compatible endpoint.
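
The "asserts on them in order" step can be pictured with a small, self-contained sketch. This is not Pipecat's implementation or API: the Event class, its kind values, and the match_in_order helper are hypothetical names, and treating "in order" as an ordered-subsequence match is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for an event the harness collected from the agent."""
    kind: str        # e.g. "transcription", "function_call", "response"
    text: str = ""


def match_in_order(events, expected_kinds):
    """Return True if expected_kinds occur in events as an ordered subsequence.

    A single iterator is shared across expectations, so each expectation can
    only match events that come after the previous match (assumed semantics).
    """
    remaining = iter(events)
    return all(
        any(event.kind == kind for event in remaining)
        for kind in expected_kinds
    )


collected = [
    Event("transcription", "What is the capital of Germany?"),
    Event("function_call", "lookup_capital"),
    Event("response", "The capital of Germany is Berlin."),
]

in_order = match_in_order(collected, ["function_call", "response"])
out_of_order = match_in_order(collected, ["response", "function_call"])
```

Here the first call succeeds because the function call precedes the response, while the second fails because no function call follows the response.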

Text and audio modes

Every scenario runs in one of two modes:
Mode | User input | Agent output | Best for
Text (default) | Sent as text, bypassing the STT | LLM text; TTS is skipped automatically | Fast, cheap iteration on prompts, logic, and function calling
Audio | Synthesized by a TTS the harness runs (local by default) | Real synthesized speech, transcribed by an STT the harness runs | True end-to-end coverage of the full STT, LLM, and TTS pipeline
Text mode exercises your agent’s actual pipeline and context handling while skipping the audio services, so it costs nothing in TTS or STT usage and runs fast. Audio mode synthesizes the user’s voice, streams it through your agent’s real STT, and transcribes the agent’s actual spoken audio for judging, catching issues that only surface with real speech (turn detection, homophones, barge-in).

What you can test

  • Response content: substring checks (text_contains) or semantic judging (eval) of the agent’s replies.
  • Multi-turn context: verify the agent remembers earlier turns.
  • Function calling: assert that specific tools were called, with specific arguments.
  • Interruptions: barge in mid-response and verify the agent recovers (send_after).
  • Latency: per-event budgets with within_ms.
  • Vision: serve an image when the agent requests one and judge its description.
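
To make two of the checks above concrete, here is a minimal sketch of how a substring check and a latency budget might be evaluated for a single event. The parameter names text_contains and within_ms are borrowed from the list above, but the check_event function, its signature, and the case-insensitive comparison are assumptions for illustration, not Pipecat's actual behavior.

```python
def check_event(text, elapsed_ms, text_contains=None, within_ms=None):
    """Return a list of failure messages for one event (empty means pass).

    Hypothetical helper: the semantics are assumed, not taken from Pipecat.
    """
    failures = []
    # Substring check, assumed here to be case-insensitive.
    if text_contains is not None and text_contains.lower() not in text.lower():
        failures.append(f"missing substring: {text_contains!r}")
    # Latency budget: the event must arrive within the allowed milliseconds.
    if within_ms is not None and elapsed_ms > within_ms:
        failures.append(f"took {elapsed_ms} ms, budget was {within_ms} ms")
    return failures


reply = "The capital of Germany is Berlin."
passing = check_event(reply, elapsed_ms=420, text_contains="berlin", within_ms=1000)
failing = check_event(reply, elapsed_ms=1500, text_contains="Paris", within_ms=1000)
```

Returning a list of failures rather than a single boolean lets one run report every broken expectation at once, which is usually more useful when iterating on a prompt.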

YAML or Python

Scenarios are YAML files, so they’re easy to write, review, and share. Everything is also available as a library: load and run scenarios programmatically, build them in code, inject a custom judge, or orchestrate whole suites from your own tooling. See Using the Library.
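
As a rough picture of what "inject a custom judge" could mean, the sketch below defines a judge as any callable that takes a response and a criterion and returns pass or fail. The Judge protocol, KeywordJudge, and run_eval_assertion are all hypothetical names invented for this example; the real library interface is documented in Using the Library and may look quite different.

```python
from typing import Protocol


class Judge(Protocol):
    """Hypothetical judge interface: decide whether a response meets a criterion."""

    def __call__(self, response: str, criterion: str) -> bool: ...


class KeywordJudge:
    """Toy deterministic judge for offline tests.

    Passes when every keyword registered for the criterion appears in the
    response. A real judge would ask an LLM instead.
    """

    def __init__(self, keywords_by_criterion):
        self.keywords_by_criterion = keywords_by_criterion

    def __call__(self, response, criterion):
        keywords = self.keywords_by_criterion.get(criterion, [])
        # Unknown criteria fail rather than pass silently.
        return bool(keywords) and all(
            keyword.lower() in response.lower() for keyword in keywords
        )


def run_eval_assertion(judge, response, criterion):
    """Apply any judge-shaped callable and report "pass" or "fail"."""
    return "pass" if judge(response, criterion) else "fail"


criterion = "the response says the capital of Germany is Berlin"
judge = KeywordJudge({criterion: ["Berlin"]})

good = run_eval_assertion(judge, "It's Berlin.", criterion)
bad = run_eval_assertion(judge, "It's Munich.", criterion)
```

Keeping the judge behind a narrow callable boundary is what makes it swappable: the same assertion code can run against a local model, a hosted endpoint, or a deterministic stub like this one.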

Requirements

  • Pipecat CLI: the pipecat eval commands ship with the CLI extra, installed with uv tool install "pipecat-ai[cli]". The same commands are also available as python -m pipecat.evals.
  • A judge LLM (for eval: assertions): Ollama by default (ollama pull gemma2:9b), or point the scenario’s judge: block at OpenAI or any OpenAI-compatible endpoint.
  • Audio services (audio mode only): the harness needs a TTS to synthesize the user’s voice and an STT to transcribe the agent’s speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with uv add "pipecat-ai[kokoro,moonshine]" or uv add "pipecat-ai[kokoro,whisper]"), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren’t supported here, which keeps the harness simple.
  • Your agent’s own credentials: the agent under test is your real agent, so it needs the same service API keys it normally would.

Production evaluation

Pipecat Evals is built for development: fast, local, repeatable, and run on every change. Once your agent is deployed, third-party evaluation platforms complement it with testing and monitoring at production scale:
  • Simulations: scripted or AI-driven test calls over API, WebSocket, or telephony, exercising multi-turn flows, edge cases, and real phone-network conditions before they reach users.
  • Observability: continuous evaluation of live traffic, with automated quality scoring of calls and transcripts, and metrics tracked over time to catch quality drift.

Coval

AI-native simulation and evaluation platform for voice agents, trusted by QA, Engineering, Operations, AI, and Executive teams.

Bluejay

Simulation, observability, and evaluation platform with native Pipecat Cloud integration. Supports no-code API, WebSocket, and telephony testing.

Cekura

Automated testing and monitoring platform with native Pipecat integration for WebRTC- and text-based testing, plus support for mock tools, custom dynamic variables, and more.
Building an evaluation integration for Pipecat? We welcome contributions to this page. Open a PR on the docs repository.
Pipecat’s other building blocks feed into any evaluation workflow: Metrics for TTFB, processing time, and usage; Saving Transcripts for offline analysis; OpenTelemetry for latency traces; and Observers for custom instrumentation.

Next steps

Quickstart

Run your first eval against an existing agent in a few minutes.

Writing Scenarios

The full scenario format: turns, expectations, modalities, and the judge.

Eval Suites

Spawn multiple agents and run many scenarios concurrently from a manifest.

Agent Self-Improvement

Close the loop: let an AI coding assistant write, run, and fix against evals.