Everything the pipecat eval CLI does is available as a library under pipecat.evals. Use it to run evals from your own test runner (pytest, a CI script, a custom dashboard), to build scenarios in code instead of YAML, or to customize pieces like the judge LLM.
Running a scenario
EvalScenario.load() parses a scenario file, and EvalSession.from_scenario() builds a ready-to-run session, constructing the judge, user speech, and transcriber the scenario calls for:
The agent must already be running with its eval transport (python bot.py -t eval), just as with pipecat eval run.
The result
run() returns an EvalResult:
| Field | Description |
|---|---|
| `scenario_name` | Name of the scenario that ran. |
| `passed` | Whether every assertion passed. |
| `failures` | The failed assertions, each with the turn index, expectation index, event name, and reason. |
| `duration_ms` | Wall-clock time the run took, in milliseconds. |
| `events_seen` | Every semantic event observed, for diagnostics. |
| `debug_log` | The harness's timestamped decision trace (what the CLI writes to `<scenario>.eval.log`). |
| `skipped` | Set (with a reason) when the scenario was not run; such a result is neither pass nor fail. |
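
Because a skipped result is neither a pass nor a failure, a test runner needs three outcomes rather than two. The sketch below shows one way to classify results. It does not import pipecat: `ResultStub` is a hypothetical stand-in that mirrors only a few of the fields in the table above, and it assumes `skipped` is empty or `None` when the scenario actually ran.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResultStub:
    """Hypothetical stand-in for EvalResult; only the fields used below."""
    scenario_name: str
    passed: bool
    failures: list = field(default_factory=list)
    duration_ms: float = 0.0
    skipped: Optional[str] = None  # assumed: reason text when not run


def classify(result) -> str:
    """Return 'skipped', 'passed', or 'failed'. Check skipped first."""
    if result.skipped:
        return "skipped"
    return "passed" if result.passed else "failed"


def summarize(results) -> dict:
    """Count how many results fall into each of the three outcomes."""
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    for r in results:
        counts[classify(r)] += 1
    return counts


# Illustrative data only; the failure entry is a placeholder string.
results = [
    ResultStub("greeting", passed=True, duration_ms=812.0),
    ResultStub("refund", passed=False, failures=["placeholder failure"]),
    ResultStub("audio_only", passed=False, skipped="not run"),
]
summary = summarize(results)
print(summary)
```

Checking `skipped` before `passed` keeps a scenario that never ran from being reported as a failure.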
Building scenarios in code
Scenarios are plain dataclasses, so you can construct them programmatically, generating turns from a dataset, parameterizing a template, or skipping YAML entirely:
The modality-agnostic response event is resolved while parsing YAML. When constructing scenarios in code, use llm_response for text mode directly (or response only when you also configure audio judging).
Customizing the judge
from_scenario() builds the judge from the scenario’s judge: block, but you can inject your own. EvalJudge works with any Pipecat LLM service backed by an OpenAI-compatible API:
You can likewise override the user speech (speech=, wrapping any TTSService in an EvalSpeech) and the transcriber used for the agent's spoken audio (transcriber=, wrapping any STTService in an EvalTranscriber). The wrapped services can be local models or HTTP-based; WebSocket-streaming services are rejected, since they need a running pipeline to manage their connection lifecycle.
Observing progress
Pass on_progress to get a callback as each turn and expectation resolves, which is how the CLI implements its --verbose output:
Orchestrating suites
EvalManifest and EvalSuite are the library behind pipecat eval suite: the suite spawns each agent with its eval transport on its own port, runs its scenarios, and executes several runs concurrently:
Each entry in suite.runs exposes its current state (status, result, error, duration_ms), so a live display can render directly from suite.runs.
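
As a rough illustration of rendering from run state, the sketch below formats one display row per run. It is not the library's display code: `RunStub` is a hypothetical stand-in, the `name` field and the status strings are assumptions made for the example, and only `status`, `result`, `error`, and `duration_ms` come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunStub:
    """Hypothetical stand-in for one entry of suite.runs."""
    name: str                      # assumed label, not a documented field
    status: str                    # assumed values such as "running", "done"
    result: Optional[object] = None
    error: Optional[str] = None
    duration_ms: Optional[float] = None


def render_row(run) -> str:
    """Format one run as a fixed-width row: name, status, detail."""
    if run.error:
        detail = f"error: {run.error}"
    elif run.duration_ms is not None:
        detail = f"{run.duration_ms / 1000:.1f}s"
    else:
        detail = "-"  # still running, no duration yet
    return f"{run.name:<12} {run.status:<8} {detail}"


runs = [
    RunStub("greeting", "done", duration_ms=1500.0),
    RunStub("refund", "failed", error="spawn failed"),
    RunStub("handoff", "running"),
]
rows = [render_row(r) for r in runs]
for row in rows:
    print(row)
```

A live display would call something like `render_row` on each entry at every refresh, which works because all the state it needs lives on the run itself.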
EvalManifest.load() accepts keyword overrides for every manifest value (concurrency, base_port, spawn, scenarios_dir, and so on), mirroring the CLI flags.