Agent CI/CD Pipeline
PC5: Agent CI/CD Pipeline
1. Problem Statement
The consumer-agent’s eval system is mature on the runtime side — Opik integration, 14 judge rule files in configs/rules/, an evaluation_manifest.yaml mapping dataset categories to judges, LLMJudgeMetric base class extending Opik’s GEval, full CLI for managing rules and datasets. But the CI/CD pipeline that gates sub-agent changes against those judges isn’t codified. Today: judges run when someone invokes them; thresholds are author-set per rule; promotion to production is implicit (whoever merges decides).
PC5 closes the gap. Three load-bearing commitments:
- A sub-agent change moves through PR → eval gate → deploy → rollback as a pipeline, not as ad-hoc judgment. The lifecycle is bounded; the gates fire automatically at named milestones; rollback is explicit.
- Judge templates + threshold conventions are platform-owned and reusable across verticals. Verticals shouldn’t reinvent “what does a safety judge look like” or “what threshold gates promotion.” PC5 ships the templates and the threshold conventions; verticals instantiate them against their own datasets.
- The eval-config API is the wiring surface — not an opaque project starter. Verticals own their Opik project (their dataset, their traces, their judge invocations). PC5 exposes an API (get_metrics_for_category, judge-by-ID retrieval, manifest queries) that verticals call against their project.
Without PC5, four things break:
- PC6 (Agent Variant CI/CD) can’t reference a stable eval-gate definition. PC6 §5.8 already says it reads from PC5’s gates at three milestones; without PC5’s API, PC6 invents the contract.
- PF1 (Sub-Agent Lifecycle) has no promotion criteria. PF1 specifies the lifecycle (dev → test → promote → rollback); PC5 specifies what gates each transition.
- Verticals re-implement eval wiring per onboarding. Each new vertical recreates manifest entries, rule files, threshold guesses — wasteful and drifts from platform conventions.
- The deployed configs/rules/ and evaluation_manifest.yaml infrastructure stays informal. It works in production but isn’t codified — new contributors read source code to understand the patterns. PC5 lifts the patterns into a citeable contract.
Companion: Platform Spec Lab row PC5. The spec is the source of truth.
2. Capabilities Source
Per the Platform Spec Lab, PC5 owns the Agent CI/CD Pipeline capability for the AI Assistant Platform’s consumer-agent runtime. The capability has four components:
- Pipeline — PR → eval gate → deploy → rollback lifecycle for sub-agent changes.
- Judge templates — reusable LLM-as-Judge definitions; verticals instantiate against their own datasets.
- Threshold conventions — what scores gate promotion (pre-ramp vs pre-full).
- Eval-config API — the surface verticals consume to wire judges + thresholds into their own Opik project.
PC5 codifies existing infrastructure:
- configs/rules/*.yaml — 14 judge rule files in production today (capability_alignment, conversation_flow, data_integrity, fetch_legal, greeting_quality, jailbreaking, privacy_location, prompt_suggestion, response_quality, safety_restricted, sensitive_topics, tool_compliance, tool_compliance_xml, ux_quality). PC5 commits the rules file schema and the registry pattern.
- configs/evaluation_manifest.yaml — the manifest mapping dataset categories to judges (PLT-619). PC5 commits the manifest schema.
- metrics/llm_judge/base.py — LLMJudgeMetric(GEval) loader + get_metrics_for_category(category) lookup. PC5 commits this as the runtime API contract.
- cli/opik/rules.py + cli/opik/eval.py + cli/opik/dataset.py — CLI surface. PC5 commits the CLI as the operational wiring path.
PC5 introduces new commitments:
- Pipeline-stage gate semantics — the existing infrastructure runs judges, but the pipeline that consumes their verdicts at PR-merge, ramp-start, and full-rollout milestones is informal. PC5 codifies the three gates.
- Threshold conventions — current rule files declare scores; PC5 declares what thresholds gate promotion.
- Vertical Opik project ownership model — verticals bring their own Opik project; PC5’s API is the wiring surface, not an opaque starter.
- Rollback semantics — without redeploying consumer-agent (via rules/manifest version pinning).
3. Background & Context
3.1 Today’s reality
File-backed judge rule registry (production):
consumer-agent/configs/rules/*.yaml — 14 production judge rules. Each file is a complete LLM-as-Judge definition. Schema (from capability_alignment.yaml as the worked example):
    name: Capability Alignment Judge
    model: gpt-5-mini
    temperature: 1.0
    sampling_rate: 1.0
    filter:
      field: metadata
      key: agent_id
      operator: "="
      value: prompt-suggestions
    enabled: true
    score_name: Capability Alignment & Feature Boundaries
    score_type: INTEGER # INTEGER | FLOAT | BOOLEAN
    description: <one-line summary>
    task_introduction: <judge framing>
    variables:
      offline: {input: input, output: output, expected_output: expected_output}
      online: {input: input.messages[-1].content, output: output.messages[-1].content[-1].text}
      playground: {input: metadata.dataset_item_data.input, output: output.output, expected_output: metadata.dataset_item_data.expected_output}
    prompt: |
      <the actual rubric text the LLM judge evaluates against>

The variables.{offline, online, playground} mapping is load-bearing — it lets the same judge fire in three contexts (dataset evaluation, production trace scoring, Opik playground manual testing) without rewriting the rubric.
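To make the binding step concrete, the sketch below resolves dotted paths such as input.messages[-1].content against a record. The helper names resolve_path and bind_variables are hypothetical and are not taken from the consumer-agent codebase; the path syntax is assumed to consist only of dotted keys and integer indexes, as in the rule file above.

```python
import re
from typing import Any

# One token is either a key name or a bracketed integer index such as [-1].
_TOKEN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(-?\d+)\]")


def resolve_path(record: Any, path: str) -> Any:
    """Walk a dotted path (with optional [index] segments) through dicts and lists."""
    current = record
    for name, index in _TOKEN.findall(path):
        current = current[name] if name else current[int(index)]
    return current


def bind_variables(bindings: dict, record: dict) -> dict:
    """Map each rubric variable to the value found at its bound path."""
    return {var: resolve_path(record, path) for var, path in bindings.items()}


# Online bindings copied from the rule file excerpt; the trace shape is illustrative.
ONLINE = {
    "input": "input.messages[-1].content",
    "output": "output.messages[-1].content[-1].text",
}
trace = {
    "input": {"messages": [{"content": "hi"}, {"content": "find shoes"}]},
    "output": {"messages": [{"content": [{"text": "Here are some options."}]}]},
}
bound = bind_variables(ONLINE, trace)
```

With this trace, bound maps input to the last user message and output to the last text part of the response, so the same rubric variables can be filled from a differently shaped offline dataset item by swapping in the offline bindings.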
Evaluation manifest (production):
consumer-agent/configs/evaluation_manifest.yaml — declares dataset metadata + category-to-judges mapping. Excerpt:
    dataset:
      name: consumer-agent-eval
      version: 7
      items: 345

    schema:
      input:
        type: string
        required: true
      expected_output: ...
      metadata:
        id: ...
        category: ...

Plus (per inspection): the category-to-judges mapping that get_metrics_for_category(category) reads.
Runtime loader (production):
consumer-agent/src/consumer_agent/metrics/llm_judge/base.py:
    class LLMJudgeMetric(GEval):
        # config-driven judge; loaded from configs/rules/<name>.yaml
        ...

    def get_metrics_for_category(category: str) -> list[LLMJudgeMetric]:
        """Reads the eval manifest to determine which judges apply to a category,
        plus any global metrics. Returns instances ready to score."""
        ...

This is PC5’s runtime API contract.
CLI (production):
consumer-agent/src/consumer_agent/cli/opik/:
- rules.py — register / list / update / delete judge rules in Opik
- eval.py — run evals against a dataset
- dataset.py — manage Opik datasets
- prompt.py — manage Opik prompts (PC6 consumes this)
- traces.py — query production traces
Eval-report templates (production):
consumer-agent/src/consumer_agent/evaluation/templates/:
- report.md.j2 — Jinja eval report template
- summary.md.j2 — Jinja eval summary template
What’s NOT yet codified:
- The pipeline lifecycle (PR → eval gate → deploy → rollback) — runs informally
- Threshold conventions — each rule’s threshold is author-set; no platform default
- Vertical-onboarding API — verticals copy patterns from existing rules; no documented “here’s how to wire your own dataset”
- Three-milestone gate firing — gates fire when someone runs them, not at lifecycle transitions
- Rollback mechanics — no codified path to revert a bad rules/manifest change without redeploying
3.2 What PC1 leaves to PC5
PC5 inherits from PC1 (Agent Composition):
- Prompt-block registry pattern (PC1 §5.7 / Decision 9) — file-backed, git-versioned, PR-reviewed. PC5’s configs/rules/ follows the same pattern.
- Agent Definition shape (PC1 §5.2) — sub-agent changes are PR-driven changes to Agent Definitions; PC5’s pipeline gates those changes.
3.3 What PC5 defers / partners with
- PC6 (Agent Variant CI/CD) — consumes PC5’s eval-gate API at three milestones (PC6 §5.8). PC5’s API surface must support pre_merge, pre_ramp, pre_full firing.
- PF1 (Sub-Agent Lifecycle) — owns the lifecycle states; PC5 specifies the gates between them.
- PF5 (Vertical Scaffolding + Validation Tools) — scaffolds the rules/manifest stubs verticals fill in. PC5 commits the schema PF5 scaffolds from.
- PS5 (Trace + Event Store) — judge scores produced by PC5 land in PS5’s store; PS5 owns durable storage.
3.4 Vocabulary
| Term | Meaning |
|---|---|
| Judge | An LLM-as-Judge instance — LLMJudgeMetric extending Opik’s GEval, loaded from a configs/rules/<name>.yaml file. |
| Judge template | A reusable judge definition (rule file). Templates are platform-owned; verticals instantiate them against their own datasets. |
| Rule file | A YAML file in configs/rules/. One judge per file. |
| Manifest | configs/evaluation_manifest.yaml. Declares dataset metadata + category → judges mapping + thresholds. |
| Category | A dataset-item label that determines which judges fire. Set on each dataset item’s metadata.category. |
| Gate | An eval-gate invocation at a pipeline milestone — runs the configured judges, compares scores against thresholds, returns pass/fail. |
| Milestone | A named point in the pipeline where a gate fires. Three of them: pre_merge, pre_ramp, pre_full. |
| Threshold | The minimum score (per judge) required to pass a gate. Declared in the manifest, per-judge or per-(judge, milestone). |
| Sampling rate | A judge’s per-trace sampling rate (declared in the rule file). 1.0 means every applicable trace gets scored; lower values reduce eval load. |
| Offline / Online / Playground | The three contexts a judge can fire in. offline = dataset evaluation; online = production trace scoring; playground = Opik manual testing. Variable bindings differ per context. |
| Vertical Opik project | The vertical-owned Opik project containing their dataset, traces, and judge invocations. PC5 supplies templates + manifest; verticals own the project. |
4. Requirements
4.1 Functional requirements
FR-1 — Pipeline lifecycle. A sub-agent change MUST move through PF1’s four-state lifecycle (dev → test → promote → rollback). The lifecycle states themselves are PF1’s contract; PC5 commits the gate-firing semantics that transition between them and the ramp-percentage progression inside promote.
FR-2 — Three named gate milestones. PC5 MUST expose three gate-firing milestones consumed by PC6 (and any future caller):
- pre_merge — fires on PR open and on every push. Runs the configured judges against the dataset. Fail → PR is blocked from merge.
- pre_ramp — fires when the change attempts to begin a ramped rollout (PC6 §5.5). Runs judges against a recent production-trace sample. Fail → ramp halts at 0%.
- pre_full — fires when ramp attempts to advance to 100%. Same as pre_ramp on a larger trace sample. Fail → ramp halts at the last passing step.
FR-3 — Judge rule file schema. Every judge rule file in configs/rules/ MUST conform to the schema: name, model, temperature, sampling_rate, filter (optional; trace filter for when the judge fires), enabled, score_name, score_type (INTEGER | FLOAT | BOOLEAN), description, task_introduction, variables (offline/online/playground field bindings), prompt. Plus the threshold-discipline + enforcement fields: floor: float (minimum acceptable score), tolerance: float (regression allowance from baseline), baseline_source: Literal["jade_calibration", "production_distribution", "provisional_seed"], calibration_ref: str (commit hash or experiment ID), recalibration_due: date (next required recalibration), enforcement: dict[Literal["pre_merge", "pre_ramp", "pre_full"], Literal["warn", "block"]] (per-milestone enforcement policy).
FR-4 — Manifest schema. The evaluation manifest MUST conform to: dataset (name, version, item count), schema (per-item field shape), categories (category name → list of judge IDs), global_metrics (judges that apply across categories), thresholds (per-judge or per-(judge, milestone) score thresholds).
FR-5 — Judges queryable by ID and by category. The runtime MUST expose two retrieval APIs:
- By category — get_metrics_for_category(category: str) -> list[LLMJudgeMetric] (existing) — returns the manifest-declared judges for a category plus global metrics.
- By ID — get_metric_by_id(judge_id: str) -> LLMJudgeMetric (new) — returns a single judge by name. Used by PC6’s gate-by-ID invocation.
FR-6 — Threshold convention: per-milestone overrides allowed. A judge’s threshold MAY differ across milestones — e.g., a strict threshold for pre_full and a looser threshold for pre_ramp. Manifest schema supports both thresholds.<judge_id>: <score> (single threshold) and thresholds.<judge_id>.<milestone>: <score> (per-milestone).
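A minimal sketch of the lookup this implies, under stated assumptions. The function name lookup_threshold and its explicit thresholds argument are illustrative (the spec's get_threshold takes only judge_id and milestone), and the default key used as the fallback inside a per-milestone mapping is hypothetical, since the manifest excerpt does not name one.

```python
from typing import Optional, Union

Threshold = Union[int, float, bool]


def lookup_threshold(thresholds: dict, judge_id: str,
                     milestone: Optional[str] = None) -> Threshold:
    """Return the threshold for a judge, preferring a milestone-specific value."""
    entry = thresholds[judge_id]  # KeyError if the judge declares no threshold
    if not isinstance(entry, dict):
        return entry  # single-threshold form applies at every milestone
    if milestone is not None and milestone in entry:
        return entry[milestone]  # per-milestone override
    if "default" in entry:  # hypothetical fallback key, not confirmed by the spec
        return entry["default"]
    raise KeyError(f"no threshold for judge {judge_id!r} at milestone {milestone!r}")


# Values mirror the manifest example; the dict stands in for the parsed YAML.
THRESHOLDS = {
    "capability_alignment": 4,
    "response_quality": {"pre_merge": 4, "pre_ramp": 3, "pre_full": 4},
    "jailbreaking": True,
}
ramp_bar = lookup_threshold(THRESHOLDS, "response_quality", "pre_ramp")
```

Raising on a missing threshold, rather than returning a permissive default, is a design choice made here so that a gate cannot silently pass a judge that has no declared bar.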
FR-7 — Vertical Opik project model. Verticals MUST own their Opik project; PC5 MUST NOT write into vertical projects. The eval-config API exposes templates + manifest as values the vertical configures into their project via the existing CLI (cli/opik/rules.py create, etc.).
FR-8 — Eval-config API surface. PC5 MUST expose the following operations, callable from the consumer-agent runtime and from CI:
- get_metrics_for_category(category) — runtime judge lookup by category
- get_metric_by_id(judge_id) — runtime judge lookup by ID
- load_manifest() — return the parsed manifest as a dict
- get_threshold(judge_id, milestone) -> float | int | bool — threshold lookup; falls back to manifest-default if milestone-specific not set
- list_rules() -> list[str] — enumerate rule IDs
- validate_rule_file(path) — schema validation per FR-3 (for CI use)
- validate_manifest(path) — schema validation per FR-4 (for CI use)
FR-9 — Rules/manifest registry is file-backed. Judge rules and the manifest MUST live as files in configs/rules/ and configs/evaluation_manifest.yaml. Versioned via git, reviewed via PR. Opik holds judge instances synced from these files; the files are the source of truth. Same pattern as PC1 §5.7 prompt-block registry and PC3 §5.4 typed status event registry.
FR-10 — Rollback without redeploy. Rolling back a bad rules or manifest change MUST be achievable by reverting the change in git AND syncing the prior version to Opik via cli/opik/rules.py create. No consumer-agent redeploy required. The runtime reads judge config from Opik at startup (cached) and refreshes on operator action.
FR-11 — Sampling rate semantics by milestone.
- pre_merge — runs against the full configured dataset (manifest’s dataset.name at dataset.version); sampling_rate on the rule file does NOT apply (full coverage).
- pre_ramp and pre_full — run against a configurable production-trace sample window; the rule file’s sampling_rate applies to trace selection within that window.
FR-12 — Judge-template change deprecation policy. Changes to a judge template (rule file content, threshold, sampling rate, prompt text) MUST go through PR review. Major changes (prompt rewrite that would change verdicts on the same input) MUST trigger a re-baseline of the dataset’s expected outputs OR an explicit acknowledgment in the PR that prior baselines no longer apply.
FR-13 — Eval-manifest source of truth for declared capabilities (cross-section with PC6). Per PC6 FR-11, the eval manifest reads the set of declared agent capabilities from XML prompt components. PC5’s manifest MAY reference XML prompt component IDs as the source of “what to test”; PC6 commits the source-of-truth shift, PC5 commits the consumption shape.
FR-14 — CI-enforced rule + manifest validation. On every PR touching configs/rules/** or configs/evaluation_manifest.yaml, CI MUST run validate_rule_file and validate_manifest. Fail → PR is blocked from merge.
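The validation that FR-3 and FR-14 call for could look roughly like the sketch below. It checks an already-parsed rule (a dict) rather than a file path, returns plain error strings instead of the spec's ValidationResult, and the name validate_rule is hypothetical; only the field names, score types, milestones, and policies come from FR-3.

```python
REQUIRED_FIELDS = (
    "name", "model", "temperature", "sampling_rate", "enabled", "score_name",
    "score_type", "description", "task_introduction", "variables", "prompt",
)
SCORE_TYPES = {"INTEGER", "FLOAT", "BOOLEAN"}
MILESTONES = {"pre_merge", "pre_ramp", "pre_full"}
POLICIES = {"warn", "block"}


def validate_rule(rule: dict) -> list:
    """Return one error string per violation, each prefixed with the field name."""
    errors = [f"{field}: missing required field"
              for field in REQUIRED_FIELDS if field not in rule]
    if "score_type" in rule and rule["score_type"] not in SCORE_TYPES:
        errors.append(f"score_type: invalid value {rule['score_type']!r}")
    for milestone, policy in rule.get("enforcement", {}).items():
        if milestone not in MILESTONES:
            errors.append(f"enforcement.{milestone}: unknown milestone")
        elif policy not in POLICIES:
            errors.append(f"enforcement.{milestone}: invalid policy {policy!r}")
    return errors


# A deliberately broken rule: no prompt, bad score_type, bad enforcement entries.
broken = dict.fromkeys(REQUIRED_FIELDS[:-1], "x")
broken["score_type"] = "STRING"
broken["enforcement"] = {"pre_merge": "warn", "pre_launch": "block", "pre_full": "halt"}
problems = validate_rule(broken)
```

A CI job could fail the PR whenever the returned list is non-empty and print the entries, which gives the author the violated field name directly.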
4.2 Non-functional requirements
NFR-1 — Judge-lookup latency. get_metrics_for_category(category) and get_metric_by_id(judge_id) MUST return in under one millisecond once process startup has completed. The manifest and the rules are each loaded once and cached. The cache is invalidated only on operator action; there is no per-call file I/O.
NFR-2 — Pre-merge gate latency. pre_merge MUST complete within a CI budget reasonable for PR turnaround — target under 5 minutes per PR on the production dataset (345 items at PC5 commit time). Concrete production p95 measurement is a follow-up.
NFR-3 — Pre-ramp / pre-full gate latency. Trace-sample evaluation MUST complete within 10 minutes on a 100-trace sample. The exact budget is subject to operational tuning.
NFR-4 — Judge rule cardinality. Total judge rules in configs/rules/ MUST remain enumerable from the directory listing. Soft target: under 50 across all verticals at maturity (currently 14). Growth beyond that triggers a registry-organization review.
NFR-5 — Threshold-tuning audit trail. Every threshold change in the manifest MUST be PR-reviewable with rationale in the PR description (why was this threshold raised / lowered, what production signal justified it). Not CI-enforced; review discipline.
NFR-6 — Rollback latency. Reverting a bad rules/manifest change in git AND syncing to Opik MUST complete in under 15 minutes end-to-end (including CI re-run). No consumer-agent redeploy required.
4.3 Acceptance criteria
AC-1 — Given a new judge rule file added to configs/rules/, when CI runs, validate_rule_file MUST pass on the new file. The judge MUST be retrievable by get_metric_by_id(<new-rule-name>) after process restart.
AC-2 — Given a category added to configs/evaluation_manifest.yaml with a list of judge IDs, when get_metrics_for_category(<new-category>) is called, the runtime MUST return LLMJudgeMetric instances for each declared judge, plus any global_metrics.
AC-3 — Given a PR proposing a sub-agent change, when the pre_merge gate fires, the runtime MUST evaluate all judges declared in the manifest for affected categories against the configured dataset. If any judge’s score is below its threshold, the PR MUST be blocked from merge.
AC-4 — Given a PC6-driven rollout attempting to begin a ramp, when the pre_ramp gate fires, the runtime MUST evaluate the configured judges against a recent production-trace sample. Fail → ramp halts at 0%, operator notified.
AC-5 — Given a manifest entry with both thresholds.<judge_id> and thresholds.<judge_id>.pre_full, when a gate fires at pre_full, get_threshold(<judge_id>, pre_full) MUST return the milestone-specific value, not the default.
AC-6 — Given a rule file with a malformed schema (missing required field, invalid score_type), validate_rule_file(path) MUST return a structured error naming the violated field. CI MUST fail the PR. Plus enforcement field validation: each milestone key MUST be one of {pre_merge, pre_ramp, pre_full}, each value MUST be one of {warn, block}; unknown keys or values MUST fail validation with a structured error naming the violated field.
AC-7 — Given a bad rules change in production (e.g., a prompt rewrite that breaks judge verdicts), reverting the change in git + running cli/opik/rules.py create --name <rule> --file configs/rules/<rule>.yaml MUST restore the prior judge behavior within 15 minutes. consumer-agent MUST NOT require a redeploy.
AC-8 — Given a vertical wiring their own Opik project, get_metrics_for_category and get_metric_by_id MUST work against their project once cli/opik/rules.py create has been run to sync the rules. PC5 MUST NOT require write access to vertical projects.
AC-9 — Given the same judge defined in configs/rules/<name>.yaml with both offline and online variable bindings, the judge MUST fire correctly in both contexts (offline = dataset eval; online = production trace scoring) with the same rubric text and threshold.
AC-10 — Given the eval manifest at steady state, validate_manifest(path) MUST pass on the current production manifest. New manifest entries MUST also pass the same validation.
5. Solution Design
5.1 The architectural through-line
Judge templates are file-backed; verticals own their Opik projects; gates fire at three named milestones; rollback is a git revert + sync, not a redeploy.
Three properties hold across every PC5 contract:
- File-backed registry is the source of truth. configs/rules/*.yaml and configs/evaluation_manifest.yaml are the change-control surfaces. Opik holds synced instances; the files come first. Same discipline as PC1 §5.7 prompt-block registry.
- API surface, not project starter. PC5 ships templates + the runtime + the CLI to wire them. Verticals own their Opik project, dataset, and traces; PC5 doesn’t write into them. Verticals integrate via get_metrics_for_category + get_metric_by_id + the CLI, not by forking a template repo.
- Three gates, three contexts. pre_merge runs against the configured dataset (offline); pre_ramp and pre_full run against production-trace samples (online). Same judge, same rubric, different variable bindings — the rule file’s variables mapping handles the context translation.
5.2 Rule file schema
The canonical schema for a judge rule (codifies the deployed configs/rules/capability_alignment.yaml shape):
    # Required fields
    name: <Human-readable name>          # displayed in Opik UI
    model: <model id>                    # e.g., gpt-5-mini
    temperature: <float>                 # judge model sampling temperature
    sampling_rate: <float, 0.0-1.0>      # per-trace sampling for online gate
    enabled: <bool>                      # global on/off

    score_name: <Score display name>     # what shows up in Opik scores
    score_type: INTEGER | FLOAT | BOOLEAN

    description: <one-line summary>
    task_introduction: <judge framing — sent to the LLM judge as the role>

    variables:
      offline:                           # dataset evaluation context
        input: <dotted path into dataset item>
        output: <dotted path into agent response>
        expected_output: <dotted path>
      online:                            # production trace scoring context
        input: <dotted path into trace>
        output: <dotted path into trace>
      playground:                        # Opik playground manual testing
        input: <dotted path>
        output: <dotted path>
        expected_output: <dotted path>

    prompt: |
      <multi-line rubric the LLM judge evaluates against>

    # Optional fields
    filter:                              # when the judge fires (Opik trace filter)
      field: <metadata | input | output | ...>
      key: <dotted path>
      operator: "=" | "!=" | "contains" | ...
      value: <comparison value>

    # Threshold discipline (required for every judge that participates in gates)
    floor: <float>                       # absolute floor; below this is always a block
    tolerance: <float>                   # max acceptable regression vs rolling baseline
    baseline_source: jade_calibration | production_distribution | provisional_seed
    calibration_ref: <ticket-or-doc-ref> # required when baseline_source == jade_calibration
    recalibration_due: <YYYY-MM-DD>      # quarterly default; provisional seeds sooner

    # Enforcement policy per milestone
    enforcement:
      pre_merge: warn | block
      pre_ramp: warn | block
      pre_full: warn | block

Why these fields are load-bearing:
- variables.{offline,online,playground} — same judge fires in three contexts without rubric duplication. Online traces and offline dataset items have different field shapes; the variable bindings translate.
- filter — optional trace-level filter so a judge fires only on relevant traces (e.g., capability_alignment only fires on prompt-suggestions agent traces). Without filter, judge fires on every trace.
- score_type — Opik distinguishes integer / float / boolean scores. Threshold comparison logic depends on type.
- sampling_rate — production-trace sampling rate for pre_ramp / pre_full. pre_merge ignores this (full dataset coverage).
- floor + tolerance — hybrid threshold. Floor catches catastrophic regressions; tolerance against a rolling-3 main baseline catches drift. The rolling average smooths per-run noise.
- baseline_source — provenance discipline. Every threshold cites its source: JADE calibration (PLT-596-style report), production distribution (p5 of last N runs with std-dev), or provisional seed (mean − 2σ from a small bootstrap run with a recalibration timeline). No judge ships to a gate without a declared source.
- enforcement — per-judge milestone-tiered policy. Default by judge classification: safety/refusal judges block at every milestone; quality judges warn at pre_merge, block at pre_ramp and pre_full. Per-judge YAML may override the milestone-tier defaults; override path: PR label + spec-owner sign-off + auto-logged audit entry.
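The floor + tolerance check described above can be sketched as follows. This is one possible reading, not the platform's implementation: the function name passes_hybrid_threshold is hypothetical, "rolling-3" is assumed to mean the mean of the last three main-branch scores, comparisons are assumed inclusive, and an empty baseline history is assumed to leave only the floor in force.

```python
from statistics import mean


def passes_hybrid_threshold(score: float, floor: float, tolerance: float,
                            recent_main_scores: list) -> bool:
    """Apply the absolute floor first, then the regression allowance vs. baseline."""
    if score < floor:
        return False  # catastrophic regression: below the absolute floor
    if not recent_main_scores:
        return True  # assumption: with no baseline yet, only the floor applies
    baseline = mean(recent_main_scores[-3:])  # rolling average of the last 3 runs
    return score >= baseline - tolerance  # drift check against the baseline


# Illustrative numbers only: floor 3.0, tolerance 0.3, four prior main-branch runs.
history = [4.2, 4.0, 4.1, 4.3]
within_tolerance = passes_hybrid_threshold(3.9, 3.0, 0.3, history)
drifted = passes_hybrid_threshold(3.7, 3.0, 0.3, history)
```

Here the baseline is the mean of 4.0, 4.1, and 4.3, so a score of 3.9 stays within the 0.3 allowance while 3.7 does not, even though both are well above the floor.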
5.3 Manifest schema
    # Dataset metadata
    dataset:
      name: <Opik dataset name>
      version: <integer>
      items: <expected count, used for sanity check>

    # Per-item schema
    schema:
      input:
        type: string
        required: true
        description: <human-readable>
      expected_output:
        type: string
        required: true
      metadata:
        id:
          type: string
          required: true
        category:
          type: string
          required: true
        # additional metadata fields as the dataset grows

    # Category → judges mapping
    categories:
      shopping_query:
        judges: [capability_alignment, response_quality, ux_quality, tool_compliance]
      safety_test:
        judges: [safety_restricted, fetch_legal, privacy_location]
      # ... one entry per category

    # Judges that fire across all categories
    global_metrics:
      judges: [data_integrity, jailbreaking]

    # Threshold conventions
    thresholds:
      # Single threshold for the judge
      capability_alignment: 4        # INTEGER score ≥ 4 passes
      # Per-milestone threshold override
      response_quality:
        pre_merge: 4
        pre_ramp: 3                  # looser for production-trace eval
        pre_full: 4
      # Boolean threshold
      jailbreaking: true             # BOOLEAN score must be true to pass

Why these fields are load-bearing:
- categories.<category>.judges — the set fired for a dataset item with metadata.category == <category>. get_metrics_for_category(category) reads this.
- global_metrics.judges — fire on every category. The “across-the-board” judges (data integrity, jailbreaking) live here.
- thresholds.<judge_id> — the single-threshold form. Most judges only need one threshold.
- thresholds.<judge_id>.<milestone> — milestone-specific override. Used when the appropriate score bar differs by gate (e.g., looser at ramp-start, stricter at full-rollout).
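As a rough illustration of the category lookup, the sketch below computes which judge IDs apply to a category from a parsed manifest. It returns IDs rather than LLMJudgeMetric instances, and the name judge_ids_for_category is hypothetical; deduplicating a judge listed both per-category and globally is an assumption, as is raising KeyError for an unknown category.

```python
def judge_ids_for_category(manifest: dict, category: str) -> list:
    """Category-specific judges first, then global judges, without duplicates."""
    category_judges = manifest["categories"][category]["judges"]
    global_judges = manifest.get("global_metrics", {}).get("judges", [])
    # dict.fromkeys keeps first-seen order while dropping repeated IDs.
    return list(dict.fromkeys([*category_judges, *global_judges]))


# Stand-in for the parsed manifest; entries copied from the example above.
MANIFEST = {
    "categories": {
        "shopping_query": {"judges": ["capability_alignment", "response_quality",
                                      "ux_quality", "tool_compliance"]},
        "safety_test": {"judges": ["safety_restricted", "fetch_legal",
                                   "privacy_location"]},
    },
    "global_metrics": {"judges": ["data_integrity", "jailbreaking"]},
}
safety_judges = judge_ids_for_category(MANIFEST, "safety_test")
```

For safety_test this yields the three category judges followed by data_integrity and jailbreaking; a real loader would then instantiate each ID from its rule file.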
5.4 The eval-config API
PC5’s API surface (Python, runtime-accessible from consumer-agent and from CI scripts):
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Literal

    # Runtime judge retrieval
    def get_metrics_for_category(category: str) -> list[LLMJudgeMetric]: ...
    def get_metric_by_id(judge_id: str) -> LLMJudgeMetric: ...

    # Manifest access
    def load_manifest() -> dict: ...
    def get_threshold(judge_id: str, milestone: str | None = None) -> int | float | bool: ...

    # Rule listing
    def list_rules() -> list[str]: ...

    # CI validation
    def validate_rule_file(path: Path) -> ValidationResult: ...
    def validate_manifest(path: Path) -> ValidationResult: ...

    # Gate evaluation (high-level entry point consumed by PC6 §5.8)
    def evaluate_gate(
        milestone: Literal["pre_merge", "pre_ramp", "pre_full"],
        traces_or_dataset: TracesOrDataset,
        judge_ids: list[str],
    ) -> GateVerdict: ...

GateVerdict shape (returned by evaluate_gate; consumed by PC6’s pipeline):

    @dataclass(frozen=True)
    class JudgeScore:
        score: float
        threshold: float
        passed: bool
        enforcement: Literal["warn", "block"]

    @dataclass(frozen=True)
    class GateVerdict:
        per_judge_scores: dict[str, JudgeScore]
        verdict: Literal["pass", "warn", "fail"]
        failing_judges: list[str]
        milestone: Literal["pre_merge", "pre_ramp", "pre_full"]

PC6 consumes verdict (to decide pipeline halt) and failing_judges (to surface in operator notifications); per_judge_scores is retained for diagnostic output and observability.
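One way the overall verdict could be derived from the per-judge results is sketched below. The aggregation rule is an assumption, since the spec does not state it: any failing judge with block enforcement yields fail, failures limited to warn judges yield warn, and failing_judges lists every judge that did not pass. The function name decide_verdict is hypothetical.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class JudgeScore:
    score: float
    threshold: float
    passed: bool
    enforcement: Literal["warn", "block"]


def decide_verdict(per_judge_scores: dict) -> tuple:
    """Return (verdict, failing_judges) for a mapping of judge ID to JudgeScore."""
    failing = [jid for jid, s in per_judge_scores.items() if not s.passed]
    if any(per_judge_scores[jid].enforcement == "block" for jid in failing):
        return "fail", failing  # at least one blocking judge failed
    return ("warn" if failing else "pass"), failing


# Illustrative scores: a quality judge misses its bar but only warns at this milestone.
scores = {
    "response_quality": JudgeScore(3.6, 4.0, False, "warn"),
    "jailbreaking": JudgeScore(1.0, 1.0, True, "block"),
}
verdict, failing_judges = decide_verdict(scores)
```

In this example the gate returns warn with response_quality as the only failing judge; flipping that judge's enforcement to block would turn the same scores into a fail.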
This is the “API not opaque project starter” framing. A vertical onboarding doesn’t fork a template; it:
- Creates their own Opik project (verticals own this).
- Runs cli/opik/rules.py create --name <rule> --file configs/rules/<rule>.yaml against their project for each judge they want to instantiate.
- Writes their own manifest in their project (or extends consumer-agent’s manifest with a new category).
- Calls get_metrics_for_category(<their-category>) from their eval code to get judges.
- Gets gate verdicts via evaluate_gate(milestone, traces_or_dataset, judge_ids).
PC5 supplies templates (rule file content), the runtime (LLMJudgeMetric + lookup APIs), and the CLI (rules.py + eval.py + dataset.py). Verticals supply project + dataset + integration.
5.5 Three-gate pipeline semantics
Each gate fires at a specific pipeline milestone with specific scope:
pre_merge — fires on PR open + every push to the PR branch.
- Scope: full configured dataset (dataset.name at dataset.version).
- Judges: those declared by categories.<each category in dataset>.judges + global_metrics.
- Score aggregation: per-judge aggregate over all dataset items (e.g., mean score, fail rate); compare against thresholds.<judge_id>.pre_merge (or default threshold).
- Verdict: fail → PR blocked from merge. Operator can override with an explicit “accept-baseline-shift” annotation if the failure is a known expected change.
pre_ramp — fires when a PC6-driven rollout attempts to begin ramped traffic (ramp_steps first non-zero step).
- Scope: recent production-trace sample (configurable window; default last 24h).
- Judges: those configured by PC6’s experiment config eval_gates.pre_ramp list (PC6 §5.6).
- Score aggregation: per-judge aggregate over sampled traces; compare against thresholds.<judge_id>.pre_ramp.
- Verdict: fail → ramp halts at 0%, operator notified.
pre_full — fires when ramp attempts to advance to 100%.
- Scope: larger production-trace sample (configurable; default last 7 days).
- Judges: PC6’s experiment config eval_gates.pre_full list.
- Score aggregation + verdict: same as pre_ramp but against thresholds.<judge_id>.pre_full.
Why three gates instead of one continuous monitor:
Each gate has different statistical power needs (small dataset → quick PR turnaround; large production sample → confident promotion verdict). Splitting lets you tune each independently. Single continuous monitor would either be too slow at PR time or too noisy at ramp time.
Trace set composition per milestone (coverage-graduated, not quantity-graduated):
| Milestone | Composition | Sample frozen? |
|---|---|---|
| pre_merge | Golden set only (curated dataset, ~50 traces). Reproducibility-first. | n/a — fixed |
| pre_ramp | Golden set + recent production sample (~200 total). Broader coverage. | Frozen per milestone-run; production sample window refreshed weekly. |
| pre_full | Golden + production sample + adversarial set (~500-1000 total). Robustness. | Same freezing rule. |
Multi-turn coherence evaluation is explicitly out of scope for PC5 gates — it belongs to PLT-610’s eval surface. Conflating multi-turn coherence with single-turn capability would produce false-signal regressions in the gate.
Adversarial set curation is owned by the platform team; verticals contribute domain-specific edge cases.
Gate trigger classification — which PR triggers which gate:
Path-based defaults with author override + reviewer sign-off:
- All gates: prompts/**, configs/rules/**, configs/evaluation_manifest.yaml, src/consumer_agent/agent/**, src/consumer_agent/factory.py, src/consumer_agent/utils/tools.py
- pre_merge only: src/consumer_agent/api/**, src/consumer_agent/utils/helpers.py, configs/*.yaml (non-rule)
- No gate: tests/**, docs/**, *.md, .github/**, setup.py, pyproject.toml, Makefile
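The path-based defaults could be evaluated along the lines of the sketch below, which uses the standard library's fnmatch patterns. Several choices are assumptions rather than stated policy: tiers are checked from most to least restrictive so that configs/rules/** wins over configs/*.yaml, a path matching no pattern falls back to all gates, and a PR requires the union of gates across its changed files. The function names are hypothetical.

```python
from fnmatch import fnmatchcase

ALL_GATES = ["pre_merge", "pre_ramp", "pre_full"]

# (patterns, gates) tiers, checked in order; the first matching tier decides.
TIERS = [
    (("prompts/**", "configs/rules/**", "configs/evaluation_manifest.yaml",
      "src/consumer_agent/agent/**", "src/consumer_agent/factory.py",
      "src/consumer_agent/utils/tools.py"), ALL_GATES),
    (("src/consumer_agent/api/**", "src/consumer_agent/utils/helpers.py",
      "configs/*.yaml"), ["pre_merge"]),
    (("tests/**", "docs/**", "*.md", ".github/**", "setup.py",
      "pyproject.toml", "Makefile"), []),
]


def gates_for_path(path: str) -> list:
    """Gates triggered by a single changed file."""
    for patterns, gates in TIERS:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return list(gates)
    return list(ALL_GATES)  # assumption: unclassified paths get the full pipeline


def gates_for_pr(changed_paths: list) -> list:
    """Union of gates over all changed files, in pipeline order."""
    required = {gate for path in changed_paths for gate in gates_for_path(path)}
    return [gate for gate in ALL_GATES if gate in required]


pr_gates = gates_for_pr(["docs/guide.md", "src/consumer_agent/api/routes.py"])
```

Note that fnmatch lets * match across directory separators, which is why the tier order matters: configs/*.yaml would otherwise also capture rule files under configs/rules/.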
Override: PR label eval-skip-pre-ramp or eval-skip-pre-full plus a required eval-skip-justification block in the PR description. PR template checkbox confirms reviewer agreement. Every override is logged centrally and audited quarterly to detect abuse patterns.
Per-Agent-Definition lifecycle threads (PF-1 FR-12 alignment): Each Agent Definition (identified by agent_id + commit) maintains its own gate-firing history. When a major change requires authoring a NEW Agent Definition (per PF-1 FR-12), the new Agent Definition re-enters at pre_merge independent of any prior Agent Definition’s gate state. The prior Agent Definition’s gate history is preserved (audit + rollback target) but does not influence the new Agent Definition’s promotion. In-place mutation of a promoted Agent Definition is prohibited by PF-1 FR-12; PC-5 gate-firing assumes Agent Definitions are immutable post-promotion.
5.6 Vertical-Opik project ownership model
Verticals bring their own Opik project; PC5 supplies the templates and the API to wire it up. Concretely:
| Asset | Owner | Source of truth |
|---|---|---|
| Opik project (workspace) | Vertical | Vertical’s Opik account |
| Dataset (items, expected outputs, categories) | Vertical | Vertical’s project; manifest references it |
| Judge instances (synced to Opik) | Vertical | Vertical’s project; created by cli/opik/rules.py create |
| Judge templates (rule file content) | Platform | consumer-agent/configs/rules/*.yaml |
| Manifest schema | Platform | PC5’s spec |
| Manifest content (categories, thresholds) | Vertical OR Platform | depends — see below |
| Eval runtime (LLMJudgeMetric, lookup APIs) | Platform | consumer-agent/src/consumer_agent/metrics/llm_judge/ |
| CLI (rules.py, eval.py, dataset.py) | Platform | consumer-agent/src/consumer_agent/cli/opik/ |
Manifest ownership is the nuanced one:
- The schema is platform-owned (PC5’s §5.3).
- The manifest content (which judges fire on which categories, what thresholds gate promotion) is shared — consumer-agent’s central manifest declares the cross-vertical conventions; verticals MAY extend it with their own categories + thresholds in their own Opik project’s manifest, or MAY add to the central one via PR.
PC5 has no write access to vertical projects: its API reads from the vertical’s project (via the Opik SDK, using the vertical’s credentials) and never writes. The CLI is operator-driven, not platform-automated.
5.7 Rollback mechanics
A bad rules or manifest change rolls back via git + sync, not redeploy:
- Detect: production judge scores regress, or operator detects a bad rule via manual review.
- Revert: `git revert <commit>` on the rule file or manifest change.
- Sync: re-run `cli/opik/rules.py create --name <rule> --file configs/rules/<rule>.yaml` for each affected rule (or `manifest sync` for manifest changes).
- Cache invalidate: operator triggers a cache refresh on the runtime (or the runtime’s cache TTL expires; configurable, operational).
- Verify: production scores return to pre-regression baseline.
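The revert and sync steps above can be sketched as a dry-run planner that only builds the commands an operator would run. This is a hedged illustration: `plan_rollback` is a hypothetical helper, invoking the CLI as `python cli/opik/rules.py` is an assumption, and so is the idea that a rule's name equals its file stem. The manifest sync command is left out because its exact invocation is not spelled out here.

```python
from pathlib import PurePosixPath

RULES_DIR = PurePosixPath("configs/rules")


def plan_rollback(commit, changed_paths):
    """Return the revert + per-rule sync commands as argv lists (nothing is executed)."""
    commands = [["git", "revert", "--no-edit", commit]]
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Only rule files directly under configs/rules/ need an Opik re-sync.
        if path.parent == RULES_DIR and path.suffix == ".yaml":
            commands.append([
                "python", "cli/opik/rules.py", "create",
                "--name", path.stem,  # assumption: rule name == file stem
                "--file", str(path),
            ])
    return commands
```

Returning argv lists rather than running them keeps the operator in the loop, which matches the operator-driven CLI model described in §5.6.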
Why no consumer-agent redeploy: judge config is loaded from files at startup AND refreshable at runtime; rule changes don’t require new code, only new config. The runtime LLMJudgeMetric class is stable; only the YAML content changes.
NFR-6 commits 15 minutes end-to-end (git revert + CI re-run + Opik sync + cache refresh). Concrete production measurement is operational.
Post-deploy regression response (tiered by judge classification):
When a regression is detected after pre_full passed and the change is live in production (overnight scheduled eval, ad-hoc eval triggered by a user-reported issue, or new judge added to the suite), the response depends on the judge’s classification:
| Regression tier | Response |
|---|---|
| Safety / refusal judge floor breach | Automatic ramp-down to 0% on the affected agent/component + PagerDuty page. Auto-ramp-down (not auto-rollback) is the safer failure mode — reversible if it turns out to be a false-positive judge run. |
| Quality judge floor breach | Slack alert + Grafana annotation + 24h scheduled triage. No page; quality regressions don’t justify a 3am response. |
Both tiers write an entry to a regression-tracking note (docs/post-deploy-regressions.md) for trend analysis. The automatic ramp-down path depends on PF1’s ramping primitive being live; until PF1 lands, the automatic path degrades to “alert plus manual ramp-down via Feature Flipper.”
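The tiered response can be sketched as a small planner that maps a regression tier to an ordered list of actions. The tier keys and action strings are hypothetical labels invented for illustration, and treating the PagerDuty page as the "alert" in the degraded path is an assumption.

```python
REGRESSION_LOG = "docs/post-deploy-regressions.md"


def plan_regression_response(tier, ramping_primitive_live):
    """Return the ordered actions for a post-deploy regression (labels are illustrative)."""
    actions = []
    if tier == "safety_refusal":
        if ramping_primitive_live:
            actions.append("auto_ramp_down_to_0_percent")
        else:
            # Degraded path until PF1's ramping primitive is live.
            actions.append("manual_ramp_down_via_feature_flipper")
        actions.append("pagerduty_page")
    elif tier == "quality":
        actions.extend(["slack_alert", "grafana_annotation", "schedule_triage_24h"])
    else:
        raise ValueError(f"unknown regression tier: {tier!r}")
    # Both tiers record the event for trend analysis.
    actions.append(f"append_entry:{REGRESSION_LOG}")
    return actions
```

Unknown tiers raise instead of silently defaulting, so a newly added judge classification cannot slip through without an explicit response policy.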
5.8 PC5 ↔ PC6 contract surface
PC6 §5.8 says: “PC5 owns the eval-gate definition. PC6 specifies when gates fire.”
PC5’s commitment (read by PC6):
| PC6 reads | PC5 exposes |
|---|---|
| Judge by ID (PC6 §5.6 `eval_gates.pre_ramp: [judge_ids...]`) | `get_metric_by_id(judge_id) -> LLMJudgeMetric` |
| Threshold per milestone | `get_threshold(judge_id, milestone) -> int \| float \| bool` |
| Gate firing (run judges, return verdict) | `evaluate_gate(milestone, traces_or_dataset, judge_ids) -> GateVerdict` |
evaluate_gate is the high-level entry point PC6’s pipeline calls; it composes the lower-level get_metric_by_id + get_threshold + score aggregation into one verdict.
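The composition described above can be sketched as follows. This is a simplified illustration, not PC5's implementation: `sketch_evaluate_gate` is a hypothetical name, scoring and threshold lookup are injected as callables rather than reading traces or a dataset, the `"block"` / `"warn"` enforcement values are assumed, and "pass means score >= threshold" is an assumed aggregation rule. The field names follow the `GateVerdict` shape recorded in §11.2.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Score = Union[int, float, bool]


@dataclass
class JudgeScore:
    score: Score
    threshold: Score
    passed: bool
    enforcement: str  # assumption: "block" or "warn"


@dataclass
class GateVerdict:
    milestone: str
    verdict: str  # "pass", "warn", or "fail"
    failing_judges: List[str]
    per_judge_scores: Dict[str, JudgeScore]


def sketch_evaluate_gate(milestone, judge_ids, score_judge, get_threshold, enforcement=None):
    """Compose per-judge scores and thresholds into one verdict."""
    enforcement = enforcement or {}
    per_judge = {}
    for judge_id in judge_ids:
        score = score_judge(judge_id)
        threshold = get_threshold(judge_id, milestone)
        # bool is a subclass of int, so boolean judges are checked first.
        if isinstance(threshold, bool):
            passed = score == threshold
        else:
            passed = score >= threshold
        per_judge[judge_id] = JudgeScore(
            score, threshold, passed, enforcement.get(judge_id, "block")
        )
    failing = [jid for jid, js in per_judge.items() if not js.passed]
    if any(per_judge[jid].enforcement == "block" for jid in failing):
        verdict = "fail"
    elif failing:
        verdict = "warn"
    else:
        verdict = "pass"
    return GateVerdict(milestone, verdict, failing, per_judge)
```

A caller such as PC6 would read only `verdict` and `failing_judges`; `per_judge_scores` is there for diagnostics.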
5.9 PF1 ↔ PC5 partnership
PF1 owns sub-agent lifecycle states (dev → test → promote → rollback); PC5 owns gate semantics that fire across those transitions and inside promote:
| PF1 lifecycle transition or sub-state | PC5 gate(s) |
|---|---|
| `dev → test` | None (developer-driven) |
| `test → promote` (PR merge + operator initiates ramp) | `pre_merge` (PR-driven, dataset-scoped) |
| Ramp progression inside `promote` (first non-zero step) | `pre_ramp` (PC6-driven, trace-sampled) |
| Ramp progression inside `promote` (100% step) | `pre_full` (PC6-driven, larger trace sample) |
| `* → rollback` | None (operator-driven; rollback per §5.7) |
PF1 owns the lifecycle vocabulary (states and transitions). PC5 references PF1’s vocabulary by name to anchor gate firing.
6. Cross-Section Impact
| Spec | Citation |
|---|---|
| PC1 (Agent Composition) | Inherits file-backed registry pattern (PC1 §5.7 / Decision 9). PC5’s configs/rules/ and evaluation_manifest.yaml follow the same git-versioned, PR-reviewed discipline. |
| PC6 (Agent Variant CI/CD) | PC6 §5.8 reads from PC5’s gate API at three milestones. PC5 commits the API surface (get_metric_by_id, get_threshold, evaluate_gate); PC6 commits the trigger points. |
| PF1 (Sub-Agent Lifecycle) | PF1 owns lifecycle states; PC5 owns the gates between them. Promotion criteria reference PC5’s three milestones. |
| PF5 (Vertical Scaffolding + Validation Tools) | PF5 scaffolds rules/manifest stubs verticals fill in. PC5 commits the schema PF5 scaffolds from. |
| PS5 (Trace + Event Store) | Judge scores land in PS5’s store; per-team trace metadata composes with PC5’s score data for vertical-specific dashboards. |
| PD3 (DM Type Registry) | Independent of PC5; DM rollout uses PD3’s analytics labels + experiment-arm machinery, not PC5’s judge gates. |
7. Dependencies
Platform spec dependencies: PC1 (Agent Composition), PF1 (Sub-Agent Lifecycle).
Implementation dependencies:
- Opik: LLM-as-Judge runtime (`GEval`), automation rules, datasets, traces. Already integrated.
- GPT-5 family: judge models (current rules use `gpt-5-mini`). Operational.
- PyYAML: manifest + rule file parsing. Standard.
External dependencies: None.
Cross-section soft dependencies:
- PLT-619 (eval manifest categories) — PC5 commits the manifest schema; PLT-619 implementation consumes the schema.
- PC6 §5.8 eval-gate hookup — PC6 reads PC5’s gate API.
8. Risks & Open Questions
8.1 Risks
R-1: Vertical Opik project misconfiguration. Verticals own their projects; PC5 has no write access. If a vertical misconfigures their project (wrong dataset version, missing categories, stale judge sync), the gate API returns errors but PC5 can’t auto-fix. Mitigated by clear CLI errors + PF5’s scaffolding (correct shape from the start).
R-2: Judge-template change blast radius. A platform-owned judge template change (e.g., updating response_quality.yaml’s rubric) affects every vertical consuming it. Mitigated by FR-12’s deprecation policy: major changes require re-baseline acknowledgment in PR, plus per-vertical CI gate runs to catch unexpected verdict shifts before merge.
R-3: Manifest ↔ rules-file drift. A category in the manifest references a judge_id that doesn’t have a matching rule file (or vice versa). CI-enforced via FR-14’s validate_manifest, which cross-checks against list_rules(). Mitigated structurally.
R-4: Threshold-tuning noise. Thresholds set too tight produce noisy gate failures (PR blockers that aren’t real regressions); too loose miss real regressions. Threshold-tuning audit trail (NFR-5) and operator override path (AC-3 “accept-baseline-shift” annotation) provide the levers. No automated tuning in v1.
R-5: Sampling-rate confusion across milestones. sampling_rate in the rule file applies only to online gates (pre_ramp, pre_full), not pre_merge. Subtle. Mitigated by §5.2 + FR-11 making the semantics explicit and FR-14 CI validation catching schema misuse.
R-6: Pre-merge gate slowdown as dataset grows. 345 items today × N judges = M LLM calls per PR. Growth in either dimension expands the budget. Mitigated by NFR-2’s target + per-judge enabled: false toggle for emergency throttling.
R-7: Cache invalidation lag during rollback. §5.7’s rollback path includes “cache invalidate” as Step 4. If the runtime’s cache TTL is long (operational), rollback latency stretches. Mitigated by operator-triggerable manual cache refresh + NFR-6’s 15-minute target.
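The cross-check behind R-3 can be sketched as a set comparison between judge IDs referenced by the manifest and rule files actually present. This is an illustration only: `find_manifest_drift` is a hypothetical helper, and the manifest is assumed to reduce to a mapping from category name to a list of judge IDs, which may not match the real schema in §5.3.

```python
def find_manifest_drift(manifest_categories, rule_ids):
    """Compare judge IDs referenced by the manifest with the known rule IDs."""
    referenced = {
        judge_id
        for judge_ids in manifest_categories.values()
        for judge_id in judge_ids
    }
    known = set(rule_ids)
    return {
        # Manifest points at a judge with no rule file.
        "missing_rule_files": sorted(referenced - known),
        # Rule file exists but no category references it.
        "unreferenced_rules": sorted(known - referenced),
    }


# Example with one drifted entry in each direction ("tone_check" is made up).
drift = find_manifest_drift(
    {"safety": ["safety_restricted"], "quality": ["response_quality", "tone_check"]},
    ["safety_restricted", "response_quality", "capability_alignment"],
)
```

A CI check would fail the PR when either list is non-empty, or only on `missing_rule_files` if unreferenced rules are considered acceptable.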
8.2 Open Questions
OQ-1: Threshold-tuning automation. NFR-5 commits the audit trail (PR-review discipline). Long-term, threshold tuning could be automated based on production signal (auto-raise thresholds when prod outperforms; auto-lower when prod regresses). Out of scope for v1; surface as follow-on if production noise warrants.
OQ-2: Cross-vertical judge sharing. Verticals reuse judge templates from the central configs/rules/ (e.g., safety_restricted applies to every vertical). What’s the platform team’s review obligation on a vertical PR that changes a shared judge? Lean: any change to a configs/rules/*.yaml requires platform-team review (cross-vertical impact); vertical-specific judge additions can be vertical-owned. Needs platform-team owner input on review-board boundary.
OQ-3: Production trace sampling window for pre_ramp / pre_full. NFR-3 sets a target latency (10 minutes for 100 traces) but doesn’t pin the window (24h vs 7d vs other). Lean: 24h for pre_ramp (fast iteration); 7d for pre_full (confident promotion). Operational tuning; surface if defaults don’t fit production needs.
OQ-4: Vertical-side dataset versioning. PC5 commits the manifest schema (dataset.version); verticals reference their own dataset versions. When a vertical bumps their dataset version, do PC5’s gates auto-pick up the new version, or does the vertical signal the version bump? Lean: vertical signals (explicit manifest PR); auto-discovery is too magical. Needs PF5 reviewer input on the scaffolding side.
9. Testing Strategy
9.1 Unit tests
- Rule-file schema validation: `validate_rule_file` accepts production rule files; rejects malformed schemas with structured errors per AC-6
- Manifest schema validation: `validate_manifest` accepts production manifest; rejects malformed manifests
- Cross-validation: `validate_manifest` cross-checks category-referenced `judge_id`s against `list_rules()` output; flags drift
- Judge retrieval: `get_metric_by_id(<existing>)` returns the loaded instance; `get_metric_by_id(<missing>)` raises a clear error
- Threshold lookup: `get_threshold` returns milestone-specific override when present; falls back to default; handles BOOLEAN / INTEGER / FLOAT score types
- Manifest caching: `load_manifest()` returns the same instance on repeat calls; cache invalidation works
- Rule caching: per-rule load is cached after first call; cache invalidation works
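The threshold-lookup behavior these tests exercise can be sketched as below. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the manifest's `thresholds` block is assumed to hold either a single value per judge or a mapping with a `default` key plus optional per-milestone keys. The real layout is defined in §5.3 and may differ.

```python
def lookup_threshold(thresholds, judge_id, milestone):
    """Return the milestone override if present, else the judge's default."""
    entry = thresholds[judge_id]  # KeyError if the judge has no threshold entry
    if isinstance(entry, dict):
        if milestone in entry:
            return entry[milestone]
        return entry["default"]
    return entry  # single threshold shared by all milestones


def check_threshold_type(value, score_type):
    """Raise TypeError when a threshold does not fit the judge's score type."""
    is_bool = isinstance(value, bool)
    if score_type == "BOOLEAN":
        ok = is_bool
    elif score_type == "INTEGER":
        ok = isinstance(value, int) and not is_bool  # bool is an int subclass
    elif score_type == "FLOAT":
        ok = isinstance(value, (int, float)) and not is_bool
    else:
        raise ValueError(f"unknown score_type: {score_type!r}")
    if not ok:
        raise TypeError(f"threshold {value!r} is not valid for score_type {score_type}")
    return value
```

Excluding `bool` explicitly matters because Python treats `True` as an `int`, which would otherwise let a boolean threshold pass validation on an INTEGER or FLOAT judge.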
9.2 Integration tests
- End-to-end `pre_merge`: PR opens with sub-agent change → gate fires → judges score against dataset → verdict returned within NFR-2 budget
- End-to-end `pre_ramp`: rollout begins ramp → gate fires → judges score against trace sample → verdict returned within NFR-3 budget
- End-to-end `pre_full`: ramp advances → gate fires → judges score → verdict returned
- Gate failure → PC6 pipeline halt: simulated judge failure → PC6 §5.5 ramp halts at 0%
- Rollback flow: bad rule change → git revert → CLI sync → cache refresh → judge behavior restored within NFR-6 budget
9.3 Eval coverage (Opik)
- Per-vertical eval coverage on declared judge categories (verticals own this; PC5 supplies the framework)
- Judge-template regression coverage: when a judge template changes (FR-12), the dataset’s expected outputs are re-baselined and the diff reviewed
- Manifest-mapping coverage: every category in the manifest has at least one dataset item exercising it
9.4 Contract tests
- PC6: PC5’s `get_metric_by_id` + `get_threshold` + `evaluate_gate` signatures match PC6 §5.8’s expected invocation
- PF1: PC5’s three milestones correspond to PF1’s lifecycle transitions per §5.9 mapping
- PF5: PC5’s rule file + manifest schemas match what PF5’s scaffolding generates
- PS5: judge-score event shape persists correctly in PS5’s store
9.5 Failure-mode testing
- Vertical Opik project unreachable: `get_metrics_for_category` returns a clear error; PC5 doesn’t crash the runtime
- Rule file syntax error mid-PR: CI fails the PR via `validate_rule_file` before merge
- Manifest references missing judge_id: CI fails via `validate_manifest` cross-check
- Threshold misconfigured for score_type (e.g., FLOAT threshold on a BOOLEAN judge): `get_threshold` raises a clear type error; CI catches via manifest validation
- Judge model API outage: gate fails closed (rollout halts); operator notified
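The fail-closed behavior in the last case can be sketched as a wrapper that turns any judge-side exception into a halting `error` verdict. This is illustrative only: `run_gate_fail_closed` and the returned dictionary keys are hypothetical, and treating `warn` as a verdict that lets the rollout proceed is an assumption.

```python
PROCEED_VERDICTS = {"pass", "warn"}  # assumption: "warn" does not halt the rollout


def run_gate_fail_closed(run_judges):
    """Run a gate callable; any exception or unknown verdict halts the rollout."""
    try:
        verdict = run_judges()
    except Exception as exc:  # e.g. judge model API outage or timeout
        return {
            "verdict": "error",
            "halt_rollout": True,
            "notify_operator": True,
            "reason": repr(exc),
        }
    return {
        "verdict": verdict,
        # Anything other than an explicit proceed verdict halts.
        "halt_rollout": verdict not in PROCEED_VERDICTS,
        "notify_operator": False,
        "reason": None,
    }
```

Testing membership in an allow-list, rather than checking for `"fail"`, is what makes the gate fail closed: an unexpected verdict string halts the rollout instead of letting it through.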
10. Rollout & Observability
10.1 Rollout phases
Phase 1: Spec validation. PC5 reviewed and approved; cross-section contracts confirmed with PC1, PC6, PF1, PF5 reviewers.
Phase 2: API surface. Add `get_metric_by_id`, `get_threshold`, `validate_rule_file`, `validate_manifest`, `evaluate_gate` to `consumer-agent/src/consumer_agent/metrics/`. Existing `get_metrics_for_category` stays unchanged.
Phase 3: Manifest schema enforcement. Add CI validation per FR-14 on PRs touching `configs/rules/**` or `configs/evaluation_manifest.yaml`.
Phase 4: Three-milestone gate semantics. Wire `pre_merge` to PR CI (block merge on judge failure); wire `pre_ramp` and `pre_full` to PC6’s rollout pipeline.
Phase 5: Threshold convention rollout. Audit existing rule files; add `thresholds.<judge_id>` entries to manifest for each. Per-milestone overrides added as production signal justifies.
Phase 6: Vertical onboarding documentation. Author the “how to add a vertical Opik project + wire judges” guide. PF5 scaffolds the stubs.
10.2 Observability metrics
- `eval.gate.fired_total` by `milestone` (`pre_merge` / `pre_ramp` / `pre_full`): gate invocation volume
- `eval.gate.verdict_total` by `milestone`, `verdict` (`pass` / `fail` / `error`): pass/fail rates per milestone
- `eval.judge.score_distribution` by `judge_id`, `milestone`: score distribution per judge per milestone; surfaces threshold-tuning candidates
- `eval.gate.duration_seconds` by `milestone`: gate latency; feeds NFR-2 / NFR-3
- `eval.rule.validation_failed_total` by `rule_id`, `error_type`: CI validation failures; should be zero in steady state (catches at PR time)
- `eval.manifest.drift_total`: manifest references missing rule (or vice versa); should be zero
- `eval.rollback.duration_seconds`: rollback latency; feeds NFR-6
10.3 Rollback
PC5 is a contract spec, not deployable code. Rollback semantics:
- Rule file rollback: per §5.7. Git revert + CLI sync + cache refresh. NFR-6 commits 15 minutes.
- Manifest rollback: same path; manifest is a single YAML file.
- API surface rollback: if a new API method (`evaluate_gate`, `get_metric_by_id`) produces unexpected behavior, deprecate via standard Python deprecation; consumers fall back to per-judge invocation. Not expected at v1.
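One common reading of "standard Python deprecation" is emitting a `DeprecationWarning` through the standard-library `warnings` module. The decorator below is a minimal sketch of that approach; the `deprecated` helper and the placeholder `evaluate_gate` body are illustrative and are not the actual PC5 code.

```python
import functools
import warnings


def deprecated(replacement):
    """Mark a function as deprecated and point callers at a replacement."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorate


@deprecated("per-judge invocation")
def evaluate_gate(milestone, traces_or_dataset, judge_ids):
    # Placeholder body for illustration only.
    return None
```

The function keeps working while it warns, so consumers can migrate to per-judge invocation on their own schedule before the method is removed.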
11. Appendix
11.1 Source references
- PC1: Agent Composition (file-backed prompt-block registry pattern PC5 mirrors)
- PC6: Agent Variant CI/CD + Experiment-Gated Rollout (consumes PC5’s gate API at three milestones)
- Platform Spec Lab (Wave 1 sequencing; PC5 scope row)
- PLT-619: Eval manifest categories (the manifest schema PC5 codifies)
- `consumer-agent/configs/rules/*.yaml` (14 production judge rule files; the schema PC5 commits to)
- `consumer-agent/configs/evaluation_manifest.yaml` (the production manifest PC5 codifies)
- `consumer-agent/src/consumer_agent/metrics/llm_judge/base.py` (`LLMJudgeMetric` runtime)
- `consumer-agent/src/consumer_agent/cli/opik/{rules,eval,dataset}.py` (CLI surface)
11.2 Decisions resolved during design
| # | Decision | Resolution |
|---|---|---|
| 1 | Rule file schema | Codify the deployed capability_alignment.yaml shape (per §5.2). Required + optional fields documented; CI-validated. |
| 2 | Manifest schema | Codify dataset + schema + categories + global_metrics + thresholds (per §5.3). Single + per-milestone thresholds both supported. |
| 3 | Gate milestones | Three: pre_merge, pre_ramp, pre_full. PC6 commits the trigger points; PC5 commits the gate semantics. |
| 4 | API surface | get_metrics_for_category (existing) + get_metric_by_id (new) + load_manifest + get_threshold + evaluate_gate + validation functions. |
| 5 | Vertical Opik project ownership | Verticals own their project; PC5 has no write access. PC5 supplies templates + runtime + CLI; verticals integrate via API. |
| 6 | Rollback mechanics | Git revert + CLI sync + cache refresh. No consumer-agent redeploy. NFR-6 commits 15-minute target. |
| 7 | Sampling-rate semantics | pre_merge ignores sampling_rate (full dataset coverage); pre_ramp / pre_full apply the rule’s sampling_rate to trace selection. |
| 8 | Judge-template change discipline | FR-12: major changes require re-baseline acknowledgment in PR. Cross-vertical impact requires platform-team review (OQ-2). |
| 9 | evaluate_gate return shape | GateVerdict carries per_judge_scores (per-judge JudgeScore with score, threshold, passed, enforcement), composite verdict (pass / warn / fail), failing_judges list, and milestone. PC6 consumes verdict + failing_judges; diagnostic detail via per_judge_scores. |
11.3 Migration receipts
- From PC1 §5.7 (prompt-block registry pattern): PC5 inherits the file-backed + git-versioned + PR-reviewed discipline. Same pattern, different content domain.
- No content migration from other specs: existing `configs/rules/` and `evaluation_manifest.yaml` are codified, not migrated.