Creating an eval
Evals are created in a short wizard:

- Target: what to run on. Choose the level: `trace` (score the whole run) or `span` (score individual steps). Optionally filter by agent, trace name, and, for span-level evals, span type and model.
- Check: what to verify. Pick a preset (below).
- Score: how to score. For LLM judges, pick a judge model; for parameterized checks, set the parameter (a substring, pattern, or max length). Set a sample rate (1%–100%) to control how much matching traffic is scored.
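
As a rough illustration of what the wizard collects, the three steps can be modeled as a small configuration object plus a sampling decision. This is a minimal sketch, not the product's API: `EvalConfig`, its field names, and `should_score` are hypothetical, and the sample rate is assumed to be stored as a fraction between 0.01 and 1.0.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalConfig:
    """Hypothetical container mirroring the Target, Check, and Score steps."""
    level: str                         # Target: "trace" or "span"
    check: str                         # Check: name of the chosen preset
    agent: Optional[str] = None        # Target filter
    trace_name: Optional[str] = None   # Target filter
    span_type: Optional[str] = None    # Target filter, span-level only
    model: Optional[str] = None        # Target filter, span-level only
    parameter: Optional[object] = None # Score: substring, pattern, or max length
    judge_model: Optional[str] = None  # Score: only for LLM judges
    sample_rate: float = 1.0           # Score: assumed fraction, 0.01 to 1.0

    def __post_init__(self) -> None:
        if self.level not in ("trace", "span"):
            raise ValueError("level must be 'trace' or 'span'")
        if not 0.01 <= self.sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.01 and 1.0")
        if self.level == "trace" and (self.span_type or self.model):
            raise ValueError("span_type and model filters apply only at span level")


def should_score(config: EvalConfig,
                 rng: Callable[[], float] = random.random) -> bool:
    """Decide whether one matching item is scored, given the sample rate."""
    return rng() < config.sample_rate


# Example: score roughly a quarter of matching spans against a length budget.
config = EvalConfig(level="span", check="max_length", parameter=500, sample_rate=0.25)
```

Passing the random source as an argument keeps the sampling decision easy to test with a fixed value.
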
Code checks
Deterministic, free, and run with no external calls:

| Preset | Checks |
|---|---|
| No PII | Output is free of emails, phone numbers, SSNs, card numbers, and IP addresses. |
| No secret leak | Output contains no API-key / token / private-key shapes. |
| Valid JSON | Output parses as JSON. |
| No refusal | Output isn’t a refusal. |
| Non-empty | Output isn’t empty. |
| Max length | Output is within a character budget. |
| Contains / Excludes text | Output does (or doesn’t) contain a substring. |
| Regex match | Output matches a pattern. |
| Tool args valid | (span-only) A tool call’s input is a valid JSON object. |
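
To make the table concrete, here is a minimal sketch of how a few of these checks could be written as pure functions. The function names are hypothetical and the details are assumptions: whether whitespace-only output counts as empty, and whether "Regex match" means a match anywhere in the output or a full match, are not specified above.

```python
import json
import re


def valid_json(output: str) -> bool:
    """Valid JSON: the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except (ValueError, TypeError):
        return False


def non_empty(output: str) -> bool:
    """Non-empty: assumes whitespace-only output counts as empty."""
    return bool(output and output.strip())


def max_length(output: str, limit: int) -> bool:
    """Max length: the output is within a character budget."""
    return len(output) <= limit


def contains(output: str, substring: str) -> bool:
    """Contains text: the output contains the substring."""
    return substring in output


def excludes(output: str, substring: str) -> bool:
    """Excludes text: the output does not contain the substring."""
    return substring not in output


def regex_match(output: str, pattern: str) -> bool:
    """Regex match: assumes a match anywhere in the output is enough."""
    return re.search(pattern, output) is not None


def tool_args_valid(tool_input: str) -> bool:
    """Tool args valid: the tool call's input parses to a JSON object."""
    try:
        return isinstance(json.loads(tool_input), dict)
    except (ValueError, TypeError):
        return False
```

Each function depends only on its arguments, which is what makes checks of this kind deterministic and free of external calls.
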
LLM judges
Judges send the input/output to a model that returns a normalized 0.00–1.00 score or a pass/fail verdict with a reason. Presets cover relevance, helpfulness, coherence, conciseness, instruction-following, completeness, toxicity/safety, tool selection, and RAG-oriented checks (faithfulness, context relevance, correctness vs. a reference).

LLM judges are bring-your-own-key. Add a provider key (below) before creating one. An eval with no usable key shows the status `needs key` and doesn’t score until a key is added. The set of available judge models is defined by the deployment.
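
The shape of a judge result and the key requirement can be sketched as follows. Everything here is illustrative: `JudgeResult`, `normalize_score`, and `eval_status` are hypothetical names, the raw rubric scale (for example 1 to 5) is an assumption about how a judge might score before normalization, and `"active"` is a placeholder for whatever status a keyed eval actually shows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    """Hypothetical result: a normalized score or a verdict, plus a reason."""
    score: Optional[float]   # 0.00 to 1.00 when the judge returns a score
    passed: Optional[bool]   # set when the judge returns a pass/fail verdict
    reason: str


def normalize_score(raw: float, low: float, high: float) -> float:
    """Map a raw score on an assumed [low, high] scale onto 0.00 to 1.00."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(raw, low), high)   # out-of-range scores are clamped
    return round((clamped - low) / (high - low), 2)


def eval_status(provider_key: Optional[str]) -> str:
    """Return 'needs key' when no usable key is present."""
    # "active" is a placeholder; only "needs key" is named in the text above.
    return "active" if provider_key else "needs key"


result = JudgeResult(score=normalize_score(4, low=1, high=5),
                     passed=None,
                     reason="Answers the question but omits one step.")
```

In this sketch a raw 4 on a 1 to 5 rubric becomes 0.75, and an eval created without a key reports `needs key` instead of producing scores.
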

