Agent CI/CD Pipeline

The consumer-agent’s eval system is mature on the runtime side — Opik integration, 14 judge rule files in configs/rules/, an evaluation_manifest.yaml mapping dataset categories to judges, LLMJudgeMetric base class extending Opik’s GEval, full CLI for managing rules and datasets. But the CI/CD pipeline that gates sub-agent changes against those judges isn’t codified. Today: judges run when someone invokes them; thresholds are author-set per rule; promotion to production is implicit (whoever merges decides).

PC5 closes the gap. Three load-bearing commitments:

  1. A sub-agent change moves through PR → eval gate → deploy → rollback as a pipeline, not as ad-hoc judgment. The lifecycle is bounded; the gates fire automatically at named milestones; rollback is explicit.

  2. Judge templates + threshold conventions are platform-owned and reusable across verticals. Verticals shouldn’t reinvent “what does a safety judge look like” or “what threshold gates promotion.” PC5 ships the templates and the threshold conventions; verticals instantiate them against their own datasets.

  3. The eval-config API is the wiring surface — not an opaque project starter. Verticals own their Opik project (their dataset, their traces, their judge invocations). PC5 exposes an API (get_metrics_for_category, judge-by-ID retrieval, manifest queries) that verticals call against their project.

Without PC5, four things break:

  1. PC6 (Agent Variant CI/CD) can’t reference a stable eval-gate definition. PC6 §5.8 already says it reads from PC5’s gates at three milestones; without PC5’s API, PC6 invents the contract.
  2. PF1 (Sub-Agent Lifecycle) has no promotion criteria. PF1 specifies the lifecycle (dev → test → promote → rollback); PC5 specifies what gates each transition.
  3. Verticals re-implement eval wiring per onboarding. Each new vertical recreates manifest entries, rule files, threshold guesses — wasteful and drifts from platform conventions.
  4. The deployed configs/rules/ and evaluation_manifest.yaml infrastructure stays informal. It works in production but isn’t codified — new contributors read source code to understand the patterns. PC5 lifts the patterns into a citeable contract.

Companion: Platform Spec Lab row PC5. The spec is the source of truth.

Per the Platform Spec Lab, PC5 owns the Agent CI/CD Pipeline capability for the AI Assistant Platform’s consumer-agent runtime. The capability has four components:

  • Pipeline — PR → eval gate → deploy → rollback lifecycle for sub-agent changes.
  • Judge templates — reusable LLM-as-Judge definitions; verticals instantiate against their own datasets.
  • Threshold conventions — what scores gate promotion (pre-ramp vs pre-full).
  • Eval-config API — the surface verticals consume to wire judges + thresholds into their own Opik project.

PC5 codifies existing infrastructure:

  • configs/rules/*.yaml — 14 judge rule files in production today (capability_alignment, conversation_flow, data_integrity, fetch_legal, greeting_quality, jailbreaking, privacy_location, prompt_suggestion, response_quality, safety_restricted, sensitive_topics, tool_compliance, tool_compliance_xml, ux_quality). PC5 commits the rules file schema and the registry pattern.
  • configs/evaluation_manifest.yaml — the manifest mapping dataset categories to judges (PLT-619). PC5 commits the manifest schema.
  • metrics/llm_judge/base.py: LLMJudgeMetric(GEval) loader + get_metrics_for_category(category) lookup. PC5 commits this as the runtime API contract.
  • cli/opik/rules.py + cli/opik/eval.py + cli/opik/dataset.py — CLI surface. PC5 commits the CLI as the operational wiring path.

PC5 introduces new commitments:

  • Pipeline-stage gate semantics — the existing infrastructure runs judges, but the pipeline that consumes their verdicts at PR-merge, ramp-start, and full-rollout milestones is informal. PC5 codifies the three gates.
  • Threshold conventions — current rule files declare scores; PC5 declares what thresholds gate promotion.
  • Vertical Opik project ownership model — verticals bring their own Opik project; PC5’s API is the wiring surface, not an opaque starter.
  • Rollback semantics: a bad rules or manifest change can be reverted without redeploying consumer-agent (via rules/manifest version pinning).

File-backed judge rule registry (production):

consumer-agent/configs/rules/*.yaml — 14 production judge rules. Each file is a complete LLM-as-Judge definition. Schema (from capability_alignment.yaml as the worked example):

name: Capability Alignment Judge
model: gpt-5-mini
temperature: 1.0
sampling_rate: 1.0
filter:
  field: metadata
  key: agent_id
  operator: "="
  value: prompt-suggestions
enabled: true
score_name: Capability Alignment & Feature Boundaries
score_type: INTEGER  # INTEGER | FLOAT | BOOLEAN
description: <one-line summary>
task_introduction: <judge framing>
variables:
  # paths containing [ ] are quoted so they stay valid inside YAML flow mappings
  offline: {input: input, output: output, expected_output: expected_output}
  online: {input: "input.messages[-1].content", output: "output.messages[-1].content[-1].text"}
  playground: {input: metadata.dataset_item_data.input, output: output.output, expected_output: metadata.dataset_item_data.expected_output}
prompt: |
  <the actual rubric text the LLM judge evaluates against>

The variables.{offline, online, playground} mapping is load-bearing — it lets the same judge fire in three contexts (dataset evaluation, production trace scoring, Opik playground manual testing) without rewriting the rubric.
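To make the context translation concrete, here is a minimal sketch of how one set of bindings could be resolved against a record. It assumes the binding paths use dotted keys plus bracketed list indexes, as in the online example above. The helpers resolve_path and bind_variables are hypothetical names for illustration, not part of the deployed loader.

```python
import re
from typing import Any

# One token is either a key (between dots) or a bracketed list index like [-1].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(-?\d+)\]")


def resolve_path(record: Any, path: str) -> Any:
    """Walk a path such as 'output.messages[-1].content[-1].text' (hypothetical helper)."""
    current = record
    for match in _TOKEN.finditer(path):
        key, index = match.groups()
        if key is not None:
            current = current[key]          # mapping lookup
        else:
            current = current[int(index)]   # list index, negatives allowed
    return current


def bind_variables(bindings: dict[str, str], record: dict) -> dict[str, Any]:
    """Resolve every rubric variable for one context (offline, online, or playground)."""
    return {name: resolve_path(record, path) for name, path in bindings.items()}


# The online bindings from the example rule file, applied to a made-up trace.
online = {
    "input": "input.messages[-1].content",
    "output": "output.messages[-1].content[-1].text",
}
trace = {
    "input": {"messages": [{"content": "hi"}, {"content": "find shoes"}]},
    "output": {"messages": [{"content": [{"text": "draft"}, {"text": "Here are shoes"}]}]},
}
print(bind_variables(online, trace))
```

The rubric only ever sees the resolved input/output values, which is why the same prompt text can serve all three contexts.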

Evaluation manifest (production):

consumer-agent/configs/evaluation_manifest.yaml — declares dataset metadata + category-to-judges mapping. Excerpt:

dataset:
  name: consumer-agent-eval
  version: 7
  items: 345
schema:
  input:
    type: string
    required: true
  expected_output: ...
  metadata:
    id: ...
    category: ...

The manifest also holds (per inspection) the category-to-judges mapping that get_metrics_for_category(category) reads.

Runtime loader (production):

consumer-agent/src/consumer_agent/metrics/llm_judge/base.py:

class LLMJudgeMetric(GEval):
    # config-driven judge; loaded from configs/rules/<name>.yaml
    ...

def get_metrics_for_category(category: str) -> list[LLMJudgeMetric]:
    """Reads the eval manifest to determine which judges apply to a category,
    plus any global metrics. Returns instances ready to score."""

This is PC5’s runtime API contract.
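As an illustration of the lookup described in that docstring, the sketch below resolves judge IDs only (not LLMJudgeMetric instances) from an already-parsed manifest. It assumes the manifest exposes categories.<category>.judges and global_metrics.judges. The function name judge_ids_for_category, the de-duplication, and the behavior for an unknown category are assumptions, not the deployed implementation.

```python
def judge_ids_for_category(manifest: dict, category: str) -> list[str]:
    """Return category judges followed by global judges, without duplicates.

    Hypothetical helper. Assumption: an unknown category yields only the
    global judges rather than raising.
    """
    category_judges = manifest.get("categories", {}).get(category, {}).get("judges", [])
    global_judges = manifest.get("global_metrics", {}).get("judges", [])

    ordered: list[str] = []
    seen: set[str] = set()
    for judge_id in [*category_judges, *global_judges]:
        if judge_id not in seen:  # a judge listed in both places is scored once
            seen.add(judge_id)
            ordered.append(judge_id)
    return ordered


# Made-up manifest fragment for illustration.
manifest = {
    "categories": {"safety_test": {"judges": ["safety_restricted", "jailbreaking"]}},
    "global_metrics": {"judges": ["data_integrity", "jailbreaking"]},
}
print(judge_ids_for_category(manifest, "safety_test"))
```

A real loader would then map each ID to its rule file and instantiate the judge. That step is left out here because it depends on the Opik-backed base class.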

CLI (production):

consumer-agent/src/consumer_agent/cli/opik/:

  • rules.py — register / list / update / delete judge rules in Opik
  • eval.py — run evals against a dataset
  • dataset.py — manage Opik datasets
  • prompt.py — manage Opik prompts (PC6 consumes this)
  • traces.py — query production traces

Eval-report templates (production):

consumer-agent/src/consumer_agent/evaluation/templates/:

  • report.md.j2 — Jinja eval report template
  • summary.md.j2 — Jinja eval summary template

What’s NOT yet codified:

  • The pipeline lifecycle (PR → eval gate → deploy → rollback) — runs informally
  • Threshold conventions — each rule’s threshold is author-set; no platform default
  • Vertical-onboarding API — verticals copy patterns from existing rules; no documented “here’s how to wire your own dataset”
  • Three-milestone gate firing — gates fire when someone runs them, not at lifecycle transitions
  • Rollback mechanics — no codified path to revert a bad rules/manifest change without redeploying

PC5 inherits from PC1 (Agent Composition):

  • Prompt-block registry pattern (PC1 §5.7 / Decision 9) — file-backed, git-versioned, PR-reviewed. PC5’s configs/rules/ follows the same pattern.
  • Agent Definition shape (PC1 §5.2) — sub-agent changes are PR-driven changes to Agent Definitions; PC5’s pipeline gates those changes.
  • PC6 (Agent Variant CI/CD) — consumes PC5’s eval-gate API at three milestones (PC6 §5.8). PC5’s API surface must support pre_merge, pre_ramp, pre_full firing.
  • PF1 (Sub-Agent Lifecycle) — owns the lifecycle states; PC5 specifies the gates between them.
  • PF5 (Vertical Scaffolding + Validation Tools) — scaffolds the rules/manifest stubs verticals fill in. PC5 commits the schema PF5 scaffolds from.
  • PS5 (Trace + Event Store) — judge scores produced by PC5 land in PS5’s store; PS5 owns durable storage.

| Term | Meaning |
| --- | --- |
| Judge | An LLM-as-Judge instance: LLMJudgeMetric extending Opik’s GEval, loaded from a configs/rules/<name>.yaml file. |
| Judge template | A reusable judge definition (rule file). Templates are platform-owned; verticals instantiate them against their own datasets. |
| Rule file | A YAML file in configs/rules/. One judge per file. |
| Manifest | configs/evaluation_manifest.yaml. Declares dataset metadata + category → judges mapping + thresholds. |
| Category | A dataset-item label that determines which judges fire. Set on each dataset item’s metadata.category. |
| Gate | An eval-gate invocation at a pipeline milestone: runs the configured judges, compares scores against thresholds, returns pass/fail. |
| Milestone | A named point in the pipeline where a gate fires. Three of them: pre_merge, pre_ramp, pre_full. |
| Threshold | The minimum score (per judge) required to pass a gate. Declared in the manifest, per-judge or per-(judge, milestone). |
| Sampling rate | A judge’s per-trace sampling rate (declared in the rule file). 1.0 means every applicable trace gets scored; lower values reduce eval load. |
| Offline / Online / Playground | The three contexts a judge can fire in. offline = dataset evaluation; online = production trace scoring; playground = Opik manual testing. Variable bindings differ per context. |
| Vertical Opik project | The vertical-owned Opik project containing their dataset, traces, and judge invocations. PC5 supplies templates + manifest; verticals own the project. |

FR-1 — Pipeline lifecycle. A sub-agent change MUST move through PF1’s four-state lifecycle (dev → test → promote → rollback). The lifecycle states themselves are PF1’s contract; PC5 commits the gate-firing semantics that transition between them and the ramp-percentage progression inside promote.

FR-2 — Three named gate milestones. PC5 MUST expose three gate-firing milestones consumed by PC6 (and any future caller):

  • pre_merge — fires on PR open and on every push. Runs the configured judges against the dataset. Fail → PR is blocked from merge.
  • pre_ramp — fires when the change attempts to begin a ramped rollout (PC6 §5.5). Runs judges against a recent production-trace sample. Fail → ramp halts at 0%.
  • pre_full — fires when ramp attempts to advance to 100%. Same as pre_ramp on a larger trace sample. Fail → ramp halts at the last passing step.

FR-3 — Judge rule file schema. Every judge rule file in configs/rules/ MUST conform to the schema: name, model, temperature, sampling_rate, filter (optional; trace filter for when the judge fires), enabled, score_name, score_type (INTEGER | FLOAT | BOOLEAN), description, task_introduction, variables (offline/online/playground field bindings), prompt. Plus the threshold-discipline + enforcement fields: floor: float (minimum acceptable score), tolerance: float (regression allowance from baseline), baseline_source: Literal["jade_calibration", "production_distribution", "provisional_seed"], calibration_ref: str (commit hash or experiment ID), recalibration_due: date (next required recalibration), enforcement: dict[Literal["pre_merge", "pre_ramp", "pre_full"], Literal["warn", "block"]] (per-milestone enforcement policy).

FR-4 — Manifest schema. The evaluation manifest MUST conform to: dataset (name, version, item count), schema (per-item field shape), categories (category name → list of judge IDs), global_metrics (judges that apply across categories), thresholds (per-judge or per-(judge, milestone) score thresholds).

FR-5 — Judges queryable by ID and by category. The runtime MUST expose two retrieval APIs:

  • By category: get_metrics_for_category(category: str) -> list[LLMJudgeMetric] (existing). Returns the manifest-declared judges for a category plus global metrics.
  • By ID: get_metric_by_id(judge_id: str) -> LLMJudgeMetric (new). Returns a single judge by name. Used by PC6’s gate-by-ID invocation.

FR-6 — Threshold convention: per-milestone overrides allowed. A judge’s threshold MAY differ across milestones — e.g., a strict threshold for pre_full and a looser threshold for pre_ramp. Manifest schema supports both thresholds.<judge_id>: <score> (single threshold) and thresholds.<judge_id>.<milestone>: <score> (per-milestone).

FR-7 — Vertical Opik project model. Verticals MUST own their Opik project; PC5 MUST NOT write into vertical projects. The eval-config API exposes templates + manifest as values the vertical configures into their project via the existing CLI (cli/opik/rules.py create, etc.).

FR-8 — Eval-config API surface. PC5 MUST expose the following operations, callable from the consumer-agent runtime and from CI:

  • get_metrics_for_category(category) — runtime judge lookup by category
  • get_metric_by_id(judge_id) — runtime judge lookup by ID
  • load_manifest() — return the parsed manifest as a dict
  • get_threshold(judge_id, milestone) -> float | int | bool — threshold lookup; falls back to manifest-default if milestone-specific not set
  • list_rules() -> list[str] — enumerate rule IDs
  • validate_rule_file(path) — schema validation per FR-3 (for CI use)
  • validate_manifest(path) — schema validation per FR-4 (for CI use)

FR-9 — Rules/manifest registry is file-backed. Judge rules and the manifest MUST live as files in configs/rules/ and configs/evaluation_manifest.yaml. Versioned via git, reviewed via PR. Opik holds judge instances synced from these files; the files are the source of truth. Same pattern as PC1 §5.7 prompt-block registry and PC3 §5.4 typed status event registry.

FR-10 — Rollback without redeploy. Rolling back a bad rules or manifest change MUST be achievable by reverting the change in git AND syncing the prior version to Opik via cli/opik/rules.py create. No consumer-agent redeploy required. The runtime reads judge config from Opik at startup (cached) and refreshes on operator action.

FR-11 — Sampling rate semantics by milestone.

  • pre_merge — runs against the full configured dataset (manifest’s dataset.name at dataset.version); sampling_rate on the rule file does NOT apply (full coverage).
  • pre_ramp and pre_full — run against a configurable production-trace sample window; the rule file’s sampling_rate applies to trace selection within that window.
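One way to read FR-11 is sketched below: pre_merge takes every item, while the trace-based milestones keep a deterministic fraction of the window. The hash-based selection and the names is_sampled and select_for_gate are assumptions for illustration. The spec does not say how sampling is implemented.

```python
import hashlib
from typing import Iterable


def is_sampled(trace_id: str, sampling_rate: float) -> bool:
    """Deterministically keep roughly `sampling_rate` of all trace IDs (assumed scheme)."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be within [0.0, 1.0]")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < sampling_rate


def select_for_gate(milestone: str, item_ids: Iterable[str], sampling_rate: float) -> list[str]:
    """pre_merge ignores sampling_rate (full coverage); pre_ramp / pre_full apply it."""
    if milestone == "pre_merge":
        return list(item_ids)
    return [item_id for item_id in item_ids if is_sampled(item_id, sampling_rate)]


window = [f"trace-{n}" for n in range(100)]
print(len(select_for_gate("pre_merge", window, 0.1)))  # full coverage regardless of rate
print(len(select_for_gate("pre_ramp", window, 0.1)))   # a subset of the window
```

Hashing the trace ID rather than drawing random numbers means a re-run of the same gate scores the same traces, which keeps verdicts comparable across retries.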

FR-12 — Judge-template change deprecation policy. Changes to a judge template (rule file content, threshold, sampling rate, prompt text) MUST go through PR review. Major changes (prompt rewrite that would change verdicts on the same input) MUST trigger a re-baseline of the dataset’s expected outputs OR an explicit acknowledgment in the PR that prior baselines no longer apply.

FR-13 — Eval-manifest source of truth for declared capabilities (cross-section with PC6). Per PC6 FR-11, the eval manifest reads the set of declared agent capabilities from XML prompt components. PC5’s manifest MAY reference XML prompt component IDs as the source of “what to test”; PC6 commits the source-of-truth shift, PC5 commits the consumption shape.

FR-14 — CI-enforced rule + manifest validation. On every PR touching configs/rules/** or configs/evaluation_manifest.yaml, CI MUST run validate_rule_file and validate_manifest. Fail → PR is blocked from merge.

NFR-1 — Judge-lookup latency. get_metrics_for_category(category) and get_metric_by_id(judge_id) MUST return in sub-millisecond after process startup. Manifest is loaded once and cached; rules are loaded once and cached. Cache invalidation on operator action; no per-call file I/O.

NFR-2 — Pre-merge gate latency. pre_merge MUST complete within a CI budget reasonable for PR turnaround — target under 5 minutes per PR on the production dataset (345 items at PC5 commit time). Concrete production p95 measurement is a follow-up.

NFR-3 — Pre-ramp / pre-full gate latency. Trace-sample evaluation MUST complete within 10 minutes on a 100-trace sample. Operational tuning.

NFR-4 — Judge rule cardinality. Total judge rules in configs/rules/ MUST remain enumerable from the directory listing. Soft target: under 50 across all verticals at maturity (currently 14). Growth beyond that triggers a registry-organization review.

NFR-5 — Threshold-tuning audit trail. Every threshold change in the manifest MUST be PR-reviewable with rationale in the PR description (why was this threshold raised / lowered, what production signal justified it). Not CI-enforced; review discipline.

NFR-6 — Rollback latency. Reverting a bad rules/manifest change in git AND syncing to Opik MUST complete in under 15 minutes end-to-end (including CI re-run). No consumer-agent redeploy required.

AC-1 — Given a new judge rule file added to configs/rules/, when CI runs, validate_rule_file MUST pass on the new file. The judge MUST be retrievable by get_metric_by_id(<new-rule-name>) after process restart.

AC-2 — Given a category added to configs/evaluation_manifest.yaml with a list of judge IDs, when get_metrics_for_category(<new-category>) is called, the runtime MUST return LLMJudgeMetric instances for each declared judge, plus any global_metrics.

AC-3 — Given a PR proposing a sub-agent change, when the pre_merge gate fires, the runtime MUST evaluate all judges declared in the manifest for affected categories against the configured dataset. If any judge’s score is below its threshold, the PR MUST be blocked from merge.

AC-4 — Given a PC6-driven rollout attempting to begin a ramp, when the pre_ramp gate fires, the runtime MUST evaluate the configured judges against a recent production-trace sample. Fail → ramp halts at 0%, operator notified.

AC-5 — Given a manifest entry with both thresholds.<judge_id> and thresholds.<judge_id>.pre_full, when a gate fires at pre_full, get_threshold(<judge_id>, pre_full) MUST return the milestone-specific value, not the default.

AC-6 — Given a rule file with a malformed schema (missing required field, invalid score_type), validate_rule_file(path) MUST return a structured error naming the violated field. CI MUST fail the PR. Plus enforcement field validation: each milestone key MUST be one of {pre_merge, pre_ramp, pre_full}, each value MUST be one of {warn, block}; unknown keys or values MUST fail validation with a structured error naming the violated field.
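A minimal sketch of the structured-error behavior AC-6 describes, operating on an already-parsed rule (a dict) so that it needs no YAML dependency. The FieldError type and the validate_rule name are hypothetical. The spec names ValidationResult but does not define its shape.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = (
    "name", "model", "temperature", "sampling_rate", "enabled", "score_name",
    "score_type", "description", "task_introduction", "variables", "prompt",
)
SCORE_TYPES = {"INTEGER", "FLOAT", "BOOLEAN"}
MILESTONES = {"pre_merge", "pre_ramp", "pre_full"}
POLICIES = {"warn", "block"}


@dataclass(frozen=True)
class FieldError:
    """Hypothetical structured error: which field was violated, and why."""
    path: str
    message: str


def validate_rule(rule: dict) -> list[FieldError]:
    """Return one error per violated field; an empty list means the rule is valid."""
    errors: list[FieldError] = []
    for name in REQUIRED_FIELDS:
        if name not in rule:
            errors.append(FieldError(name, "missing required field"))
    if "score_type" in rule and rule["score_type"] not in SCORE_TYPES:
        errors.append(FieldError("score_type", f"invalid value {rule['score_type']!r}"))
    for milestone, policy in rule.get("enforcement", {}).items():
        if milestone not in MILESTONES:
            errors.append(FieldError(f"enforcement.{milestone}", "unknown milestone"))
        elif policy not in POLICIES:
            errors.append(FieldError(f"enforcement.{milestone}", f"invalid policy {policy!r}"))
    return errors
```

In CI, a non-empty list would fail the PR and the path values would be printed so the author can see exactly which field to fix.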

AC-7 — Given a bad rules change in production (e.g., a prompt rewrite that breaks judge verdicts), reverting the change in git + running cli/opik/rules.py create --name <rule> --file configs/rules/<rule>.yaml MUST restore the prior judge behavior within 15 minutes. consumer-agent MUST NOT require a redeploy.

AC-8 — Given a vertical wiring their own Opik project, get_metrics_for_category and get_metric_by_id MUST work against their project once cli/opik/rules.py create has been run to sync the rules. PC5 MUST NOT require write access to vertical projects.

AC-9 — Given the same judge defined in configs/rules/<name>.yaml with both offline and online variable bindings, the judge MUST fire correctly in both contexts (offline = dataset eval; online = production trace scoring) with the same rubric text and threshold.

AC-10 — Given the eval manifest at steady state, validate_manifest(path) MUST pass on the current production manifest. New manifest entries MUST also pass the same validation.

Judge templates are file-backed; verticals own their Opik projects; gates fire at three named milestones; rollback is a git revert + sync, not a redeploy.

Three properties hold across every PC5 contract:

  1. File-backed registry is the source of truth. configs/rules/*.yaml and configs/evaluation_manifest.yaml are the change-control surfaces. Opik holds synced instances; the files come first. Same discipline as PC1 §5.7 prompt-block registry.

  2. API surface, not project starter. PC5 ships templates + the runtime + the CLI to wire them. Verticals own their Opik project, dataset, and traces; PC5 doesn’t write into them. Verticals integrate via get_metrics_for_category + get_metric_by_id + the CLI, not by forking a template repo.

  3. Three gates, three contexts. pre_merge runs against the configured dataset (offline); pre_ramp and pre_full run against production-trace samples (online). Same judge, same rubric, different variable bindings — the rule file’s variables mapping handles the context translation.

The canonical schema for a judge rule (codifies the deployed configs/rules/capability_alignment.yaml shape):

# Required fields
name: <Human-readable name>  # displayed in Opik UI
model: <model id>  # e.g., gpt-5-mini
temperature: <float>  # judge model sampling temperature
sampling_rate: <float, 0.0-1.0>  # per-trace sampling for online gate
enabled: <bool>  # global on/off
score_name: <Score display name>  # what shows up in Opik scores
score_type: INTEGER | FLOAT | BOOLEAN
description: <one-line summary>
task_introduction: <judge framing, sent to the LLM judge as the role>
variables:
  offline:  # dataset evaluation context
    input: <dotted path into dataset item>
    output: <dotted path into agent response>
    expected_output: <dotted path>
  online:  # production trace scoring context
    input: <dotted path into trace>
    output: <dotted path into trace>
  playground:  # Opik playground manual testing
    input: <dotted path>
    output: <dotted path>
    expected_output: <dotted path>
prompt: |
  <multi-line rubric the LLM judge evaluates against>

# Optional fields
filter:  # when the judge fires (Opik trace filter)
  field: <metadata | input | output | ...>
  key: <dotted path>
  operator: "=" | "!=" | "contains" | ...
  value: <comparison value>

# Threshold discipline (required for every judge that participates in gates)
floor: <float>  # absolute floor; below this is always a block
tolerance: <float>  # max acceptable regression vs rolling baseline
baseline_source: jade_calibration | production_distribution | provisional_seed
calibration_ref: <ticket-or-doc-ref>  # required when baseline_source == jade_calibration
recalibration_due: <YYYY-MM-DD>  # quarterly default; provisional seeds sooner

# Enforcement policy per milestone
enforcement:
  pre_merge: warn | block
  pre_ramp: warn | block
  pre_full: warn | block

Why these fields are load-bearing:

  • variables.{offline,online,playground} — same judge fires in three contexts without rubric duplication. Online traces and offline dataset items have different field shapes; the variable bindings translate.
  • filter — optional trace-level filter so a judge fires only on relevant traces (e.g., capability_alignment only fires on prompt-suggestions agent traces). Without filter, judge fires on every trace.
  • score_type — Opik distinguishes integer / float / boolean scores. Threshold comparison logic depends on type.
  • sampling_rate — production-trace sampling rate for pre_ramp / pre_full. pre_merge ignores this (full dataset coverage).
  • floor + tolerance — hybrid threshold. Floor catches catastrophic regressions; tolerance against a rolling-3 main baseline catches drift. The rolling average smooths per-run noise.
  • baseline_source — provenance discipline. Every threshold cites its source: JADE calibration (PLT-596-style report), production distribution (p5 of last N runs with std-dev), or provisional seed (mean − 2σ from a small bootstrap run with a recalibration timeline). No judge ships to a gate without a declared source.
  • enforcement — per-judge milestone-tiered policy. Default by judge classification: safety/refusal judges block at every milestone; quality judges warn at pre_merge, block at pre_ramp and pre_full. Per-judge YAML may override the milestone-tier defaults; override path: PR label + spec-owner sign-off + auto-logged audit entry.
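The floor + tolerance rule can be worked through with a short sketch. It assumes the "rolling-3 main baseline" is the mean of the last three scores recorded on main, and that only the floor applies when no baseline exists yet. The function name and both of those readings are assumptions.

```python
from statistics import mean


def passes_hybrid_threshold(
    score: float,
    floor: float,
    tolerance: float,
    recent_main_scores: list[float],
) -> bool:
    """Hybrid check: an absolute floor plus a bounded regression against a rolling baseline."""
    if score < floor:
        return False  # catastrophic regression, regardless of baseline
    if not recent_main_scores:
        return True   # assumption: with no baseline yet, only the floor applies
    baseline = mean(recent_main_scores[-3:])  # rolling-3 average smooths per-run noise
    return score >= baseline - tolerance


main_runs = [4.2, 4.0, 4.1]  # made-up scores; rolling baseline is about 4.1
print(passes_hybrid_threshold(3.95, floor=3.5, tolerance=0.2, recent_main_scores=main_runs))
print(passes_hybrid_threshold(3.80, floor=3.5, tolerance=0.2, recent_main_scores=main_runs))
print(passes_hybrid_threshold(3.40, floor=3.5, tolerance=0.2, recent_main_scores=main_runs))
```

With a baseline near 4.1 and a tolerance of 0.2, a score of 3.95 stays within the allowed drift, 3.80 is above the floor but has drifted too far, and 3.40 is rejected by the floor alone.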

The canonical schema for the manifest (configs/evaluation_manifest.yaml):

# Dataset metadata
dataset:
  name: <Opik dataset name>
  version: <integer>
  items: <expected count, used for sanity check>

# Per-item schema
schema:
  input:
    type: string
    required: true
    description: <human-readable>
  expected_output:
    type: string
    required: true
  metadata:
    id:
      type: string
      required: true
    category:
      type: string
      required: true
    # additional metadata fields as the dataset grows

# Category → judges mapping
categories:
  shopping_query:
    judges: [capability_alignment, response_quality, ux_quality, tool_compliance]
  safety_test:
    judges: [safety_restricted, fetch_legal, privacy_location]
  # ... one entry per category

# Judges that fire across all categories
global_metrics:
  judges: [data_integrity, jailbreaking]

# Threshold conventions
thresholds:
  # Single threshold for the judge
  capability_alignment: 4  # INTEGER score ≥ 4 passes
  # Per-milestone threshold override
  response_quality:
    pre_merge: 4
    pre_ramp: 3  # looser for production-trace eval
    pre_full: 4
  # Boolean threshold
  jailbreaking: true  # BOOLEAN score must be true to pass

Why these fields are load-bearing:

  • categories.<category>.judges — the set fired for a dataset item with metadata.category == <category>. get_metrics_for_category(category) reads this.
  • global_metrics.judges — fire on every category. The “across-the-board” judges (data integrity, jailbreaking) live here.
  • thresholds.<judge_id> — the single-threshold form. Most judges only need one threshold.
  • thresholds.<judge_id>.<milestone> — milestone-specific override. Used when the appropriate score bar differs by gate (e.g., looser at ramp-start, stricter at full-rollout).
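The two threshold forms imply a small fallback rule, sketched here against an already-parsed manifest. The function is named lookup_threshold to keep it distinct from the real get_threshold. The "default" key inside the per-milestone mapping is an assumption, since the spec does not say how a default and a milestone override coexist under one judge ID.

```python
from typing import Optional, Union

Threshold = Union[int, float, bool]


def lookup_threshold(manifest: dict, judge_id: str, milestone: Optional[str] = None) -> Threshold:
    """Resolve a judge's threshold, preferring a milestone-specific value when present."""
    entry = manifest["thresholds"][judge_id]  # KeyError if the judge declares no threshold
    if not isinstance(entry, dict):
        return entry  # single-threshold form applies at every milestone
    if milestone is not None and milestone in entry:
        return entry[milestone]  # per-milestone override
    if "default" in entry:  # assumed key name, not confirmed by the spec
        return entry["default"]
    raise KeyError(f"no threshold for {judge_id!r} at milestone {milestone!r}")


# Made-up manifest fragment mirroring the schema's threshold forms.
manifest = {
    "thresholds": {
        "capability_alignment": 4,
        "response_quality": {"pre_merge": 4, "pre_ramp": 3, "pre_full": 4},
        "jailbreaking": True,
        "ux_quality": {"default": 3, "pre_full": 4},
    }
}
print(lookup_threshold(manifest, "response_quality", "pre_ramp"))
print(lookup_threshold(manifest, "ux_quality", "pre_ramp"))
```

Raising when neither a milestone value nor a default exists keeps a gate from silently passing a judge that has no declared bar.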

PC5’s API surface (Python, runtime-accessible from consumer-agent and from CI scripts):

# Runtime judge retrieval
def get_metrics_for_category(category: str) -> list[LLMJudgeMetric]: ...
def get_metric_by_id(judge_id: str) -> LLMJudgeMetric: ...

# Manifest access
def load_manifest() -> dict: ...
def get_threshold(judge_id: str, milestone: str | None = None) -> int | float | bool: ...

# Rule listing
def list_rules() -> list[str]: ...

# CI validation
def validate_rule_file(path: Path) -> ValidationResult: ...
def validate_manifest(path: Path) -> ValidationResult: ...

# Gate evaluation (high-level entry point consumed by PC6 §5.8)
def evaluate_gate(
    milestone: Literal["pre_merge", "pre_ramp", "pre_full"],
    traces_or_dataset: TracesOrDataset,
    judge_ids: list[str],
) -> GateVerdict: ...

GateVerdict shape (returned by evaluate_gate; consumed by PC6’s pipeline):

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class JudgeScore:
    score: float
    threshold: float
    passed: bool
    enforcement: Literal["warn", "block"]


@dataclass(frozen=True)
class GateVerdict:
    per_judge_scores: dict[str, JudgeScore]
    verdict: Literal["pass", "warn", "fail"]
    failing_judges: list[str]
    milestone: Literal["pre_merge", "pre_ramp", "pre_full"]

PC6 consumes verdict (to decide pipeline halt) and failing_judges (to surface in operator notifications); per_judge_scores is retained for diagnostic output and observability.
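How per-judge results might roll up into the three-valued verdict is sketched below. The rule used here is an assumption inferred from the warn/block enforcement values: any failing judge with block enforcement yields "fail", failures under warn only yield "warn", and otherwise the gate passes. The roll_up function is hypothetical, and the dataclasses are repeated so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Literal

Milestone = Literal["pre_merge", "pre_ramp", "pre_full"]


@dataclass(frozen=True)
class JudgeScore:
    score: float
    threshold: float
    passed: bool
    enforcement: Literal["warn", "block"]


@dataclass(frozen=True)
class GateVerdict:
    per_judge_scores: dict[str, JudgeScore]
    verdict: Literal["pass", "warn", "fail"]
    failing_judges: list[str]
    milestone: Milestone


def roll_up(milestone: Milestone, per_judge_scores: dict[str, JudgeScore]) -> GateVerdict:
    """Combine per-judge outcomes into one gate verdict (assumed aggregation rule)."""
    failing = sorted(judge for judge, s in per_judge_scores.items() if not s.passed)
    if any(per_judge_scores[judge].enforcement == "block" for judge in failing):
        verdict = "fail"   # at least one blocking judge missed its threshold
    elif failing:
        verdict = "warn"   # only warn-level judges missed
    else:
        verdict = "pass"
    return GateVerdict(per_judge_scores, verdict, failing, milestone)


scores = {
    "response_quality": JudgeScore(3.6, 4.0, False, "warn"),
    "safety_restricted": JudgeScore(5.0, 4.0, True, "block"),
}
print(roll_up("pre_merge", scores).verdict)
```

Listing warn-level misses in failing_judges as well is a choice made for this sketch, so that operator notifications show every judge that fell short rather than only the blocking ones.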

This is the “API not opaque project starter” framing. A vertical onboarding doesn’t fork a template; it:

  1. Creates their own Opik project (verticals own this).
  2. Runs cli/opik/rules.py create --name <rule> --file configs/rules/<rule>.yaml against their project for each judge they want to instantiate.
  3. Writes their own manifest in their project (or extends consumer-agent’s manifest with a new category).
  4. Calls get_metrics_for_category(<their-category>) from their eval code to get judges.
  5. Gets gate verdicts via evaluate_gate(milestone, traces_or_dataset, judge_ids).

PC5 supplies templates (rule file content), the runtime (LLMJudgeMetric + lookup APIs), and the CLI (rules.py + eval.py + dataset.py). Verticals supply project + dataset + integration.

Each gate fires at a specific pipeline milestone with specific scope:

pre_merge — fires on PR open + every push to the PR branch.

  • Scope: full configured dataset (dataset.name at dataset.version).
  • Judges: those declared by categories.<each category in dataset>.judges + global_metrics.
  • Score aggregation: per-judge aggregate over all dataset items (e.g., mean score, fail rate); compare against thresholds.<judge_id>.pre_merge (or default threshold).
  • Verdict: fail → PR blocked from merge. Operator can override with an explicit “accept-baseline-shift” annotation if the failure is a known expected change.

pre_ramp — fires when a PC6-driven rollout attempts to begin ramped traffic (ramp_steps first non-zero step).

  • Scope: recent production-trace sample (configurable window; default last 24h).
  • Judges: those configured by PC6’s experiment config eval_gates.pre_ramp list (PC6 §5.6).
  • Score aggregation: per-judge aggregate over sampled traces; compare against thresholds.<judge_id>.pre_ramp.
  • Verdict: fail → ramp halts at 0%, operator notified.

pre_full — fires when ramp attempts to advance to 100%.

  • Scope: larger production-trace sample (configurable; default last 7 days).
  • Judges: PC6’s experiment config eval_gates.pre_full list.
  • Score aggregation + verdict: same as pre_ramp but against thresholds.<judge_id>.pre_full.

Why three gates instead of one continuous monitor:

Each gate has different statistical power needs (small dataset → quick PR turnaround; large production sample → confident promotion verdict). Splitting them lets each be tuned independently. A single continuous monitor would be either too slow at PR time or too noisy at ramp time.

Trace set composition per milestone (coverage-graduated, not quantity-graduated):

| Milestone | Composition | Sample frozen? |
| --- | --- | --- |
| pre_merge | Golden set only (curated dataset, ~50 traces). Reproducibility-first. | n/a (fixed) |
| pre_ramp | Golden set + recent production sample (~200 total). Broader coverage. | Frozen per milestone-run; production sample window refreshed weekly. |
| pre_full | Golden + production sample + adversarial set (~500-1000 total). Robustness. | Same freezing rule. |

Multi-turn coherence evaluation is explicitly out of scope for PC5 gates — it belongs to PLT-610’s eval surface. Conflating multi-turn coherence with single-turn capability would produce false-signal regressions in the gate.

Adversarial set curation is owned by the platform team; verticals contribute domain-specific edge cases.

Gate trigger classification — which PR triggers which gate:

Path-based defaults with author override + reviewer sign-off:

  • All gates: prompts/**, configs/rules/**, configs/evaluation_manifest.yaml, src/consumer_agent/agent/**, src/consumer_agent/factory.py, src/consumer_agent/utils/tools.py
  • pre_merge only: src/consumer_agent/api/**, src/consumer_agent/utils/helpers.py, configs/*.yaml (non-rule)
  • No gate: tests/**, docs/**, *.md, .github/**, setup.py, pyproject.toml, Makefile

Override: PR label eval-skip-pre-ramp or eval-skip-pre-full plus a required eval-skip-justification block in the PR description. PR template checkbox confirms reviewer agreement. Every override is logged centrally and audited quarterly to detect abuse patterns.
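The path-based defaults can be expressed as an ordered pattern table, sketched below with the standard-library fnmatch module. Two things are assumptions: a path that matches no listed pattern falls back to all gates (the conservative reading), and a PR's gate set is the union over its changed files. Label overrides are not modeled.

```python
from fnmatch import fnmatchcase

ALL_GATES = ("pre_merge", "pre_ramp", "pre_full")

# Checked top to bottom; the first tier with a matching pattern wins.
TIERS = [
    (ALL_GATES, [
        "prompts/**", "configs/rules/**", "configs/evaluation_manifest.yaml",
        "src/consumer_agent/agent/**", "src/consumer_agent/factory.py",
        "src/consumer_agent/utils/tools.py",
    ]),
    (("pre_merge",), [
        "src/consumer_agent/api/**", "src/consumer_agent/utils/helpers.py",
        "configs/*.yaml",  # rule files and the manifest were already caught above
    ]),
    ((), [
        "tests/**", "docs/**", "*.md", ".github/**",
        "setup.py", "pyproject.toml", "Makefile",
    ]),
]


def gates_for_path(path: str) -> set[str]:
    """Gates triggered by one changed file."""
    for gates, patterns in TIERS:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return set(gates)
    return set(ALL_GATES)  # assumption: unlisted paths get the strictest treatment


def gates_for_pr(changed_paths: list[str]) -> set[str]:
    """Union of the gates triggered by every changed file in the PR."""
    gates: set[str] = set()
    for path in changed_paths:
        gates |= gates_for_path(path)
    return gates


print(sorted(gates_for_pr(["docs/overview.md", "src/consumer_agent/api/routes.py"])))
```

Ordering the tiers from strictest to loosest matters because fnmatch's * also matches path separators, so configs/*.yaml would otherwise swallow files under configs/rules/.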

Per-Agent-Definition lifecycle threads (PF-1 FR-12 alignment): Each Agent Definition (identified by agent_id + commit) maintains its own gate-firing history. When a major change requires authoring a NEW Agent Definition (per PF-1 FR-12), the new Agent Definition re-enters at pre_merge independent of any prior Agent Definition’s gate state. The prior Agent Definition’s gate history is preserved (audit + rollback target) but does not influence the new Agent Definition’s promotion. In-place mutation of a promoted Agent Definition is prohibited by PF-1 FR-12; PC-5 gate-firing assumes Agent Definitions are immutable post-promotion.

Verticals bring their own Opik project; PC5 supplies the templates and the API to wire it up. Concretely:

| Asset | Owner | Source of truth |
| --- | --- | --- |
| Opik project (workspace) | Vertical | Vertical’s Opik account |
| Dataset (items, expected outputs, categories) | Vertical | Vertical’s project; manifest references it |
| Judge instances (synced to Opik) | Vertical | Vertical’s project; created by cli/opik/rules.py create |
| Judge templates (rule file content) | Platform | consumer-agent/configs/rules/*.yaml |
| Manifest schema | Platform | PC5’s spec |
| Manifest content (categories, thresholds) | Vertical OR Platform | depends (see below) |
| Eval runtime (LLMJudgeMetric, lookup APIs) | Platform | consumer-agent/src/consumer_agent/metrics/llm_judge/ |
| CLI (rules.py, eval.py, dataset.py) | Platform | consumer-agent/src/consumer_agent/cli/opik/ |

Manifest ownership is the nuanced one:

  • The schema is platform-owned (PC5’s §5.3).
  • The manifest content (which judges fire on which categories, what thresholds gate promotion) is shared — consumer-agent’s central manifest declares the cross-vertical conventions; verticals MAY extend it with their own categories + thresholds in their own Opik project’s manifest, or MAY add to the central one via PR.

No vertical write access from PC5: PC5’s API reads from the vertical’s project (via the Opik SDK, with the vertical’s credentials) and never writes to it. The CLI is operator-driven, not platform-automated.

A bad rule or manifest change rolls back via git + sync, not redeploy:

  1. Detect: production judge scores regress, or operator detects a bad rule via manual review.
  2. Revert: git revert <commit> on the rule file or manifest change.
  3. Sync: re-run cli/opik/rules.py create --name <rule> --file configs/rules/<rule>.yaml for each affected rule (or manifest sync for manifest changes).
  4. Cache invalidate: the operator triggers a cache refresh on the runtime, or waits for the runtime’s cache TTL to expire (the TTL is configurable and set operationally).
  5. Verify: production scores return to pre-regression baseline.
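Steps 2 and 3 above can be sketched as a command plan. This is an illustration only: `rollback_plan` is a hypothetical helper, invoking the CLI through `python` is an assumption about how the script is run, and the manifest sync command is omitted because its invocation is not given here. The function builds the commands without executing them.

```python
from __future__ import annotations


def rollback_plan(commit: str, rules: list[str]) -> list[list[str]]:
    """Build (but do not run) the revert and per-rule sync commands."""
    # Step 2: revert the offending commit without opening an editor.
    plan = [["git", "revert", "--no-edit", commit]]
    # Step 3: re-sync each affected rule from its reverted file.
    for rule in rules:
        plan.append([
            "python", "cli/opik/rules.py", "create",  # "python" prefix is an assumption
            "--name", rule,
            "--file", f"configs/rules/{rule}.yaml",
        ])
    return plan
```

An operator script could pass each entry to `subprocess.run(cmd, check=True)` so that a failed revert stops the sequence before any sync runs.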

Why no consumer-agent redeploy: judge config is loaded from files at startup AND refreshable at runtime; rule changes don’t require new code, only new config. The runtime LLMJudgeMetric class is stable; only the YAML content changes.

NFR-6 commits to 15 minutes end-to-end (git revert + CI re-run + Opik sync + cache refresh). Measuring the actual figure in production is left to operations.

Post-deploy regression response (tiered by judge classification):

When a regression is detected after pre_full has passed and the change is live in production (for example by an overnight scheduled eval, an ad-hoc eval triggered by a user-reported issue, or a new judge added to the suite), the response depends on the judge’s classification:

| Regression tier | Response |
|---|---|
| Safety / refusal judge floor breach | Automatic ramp-down to 0% on the affected agent/component + PagerDuty page. Auto-ramp-down (not auto-rollback) is the safer failure mode: it is reversible if the breach turns out to be a false-positive judge run. |
| Quality judge floor breach | Slack alert + Grafana annotation + 24h scheduled triage. No page; quality regressions don’t justify a 3am response. |

Both tiers write an entry to a regression-tracking note (docs/post-deploy-regressions.md) for trend analysis. The automatic ramp-down path depends on PF1’s ramping primitive being live; until PF1 lands, the automatic path degrades to “alert plus manual ramp-down via Feature Flipper.”
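The tiering can be sketched as a dispatch on judge classification. This is a hedged illustration: the function name, the classification labels ("safety", "refusal", "quality"), and the action strings are all hypothetical; only the tier-to-response mapping and the degraded path before PF1 lands come from the text above.

```python
from __future__ import annotations

SAFETY_CLASSES = {"safety", "refusal"}  # assumed classification labels


def respond_to_floor_breach(judge_class: str, ramping_live: bool) -> tuple[str, ...]:
    """Return the ordered response actions for a post-deploy floor breach."""
    if judge_class in SAFETY_CLASSES:
        if ramping_live:
            # PF1's ramping primitive is available: ramp down automatically.
            return ("auto_ramp_down_to_0", "pagerduty_page", "log_regression")
        # Until PF1 lands, the automatic path degrades to alert + manual action.
        return ("alert", "manual_ramp_down_via_feature_flipper", "log_regression")
    # Quality tier: no page, triage on a 24h schedule.
    return ("slack_alert", "grafana_annotation", "triage_within_24h", "log_regression")
```

Every branch ends with the regression-log entry, matching the requirement that both tiers record the event for trend analysis.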

PC6 §5.8 says: “PC5 owns the eval-gate definition. PC6 specifies when gates fire.”

PC5’s commitment (read by PC6):

| PC6 reads | PC5 exposes |
|---|---|
| Judge by ID (PC6 §5.6 eval_gates.pre_ramp: [judge_ids...]) | get_metric_by_id(judge_id) -> LLMJudgeMetric |
| Threshold per milestone | get_threshold(judge_id, milestone) -> int \| float \| bool |
| Gate firing (run judges, return verdict) | evaluate_gate(milestone, traces_or_dataset, judge_ids) -> GateVerdict |

evaluate_gate is the high-level entry point PC6’s pipeline calls; it composes the lower-level get_metric_by_id + get_threshold + score aggregation into one verdict.
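The composition can be sketched as follows. This is not the real evaluate_gate: the sketch injects scoring, threshold, and enforcement lookups as callables so it stays self-contained, and several details are assumptions labeled in the comments (mean aggregation, `score >= threshold` as the pass rule, the "blocking" / "advisory" enforcement values, and `failing_judges` listing every judge that did not pass). The GateVerdict and JudgeScore field names follow the return shape recorded in this spec's decision table; failing closed on a scoring error follows the judge-model-outage behaviour listed in the test plan.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class JudgeScore:
    judge_id: str
    score: float
    threshold: float
    passed: bool
    enforcement: str  # assumed values: "blocking" or "advisory"


@dataclass(frozen=True)
class GateVerdict:
    milestone: str
    verdict: str  # "pass", "warn", or "fail"
    failing_judges: list[str]
    per_judge_scores: list[JudgeScore]


def evaluate_gate_sketch(
    milestone: str,
    samples: Sequence[Any],
    judge_ids: Sequence[str],
    *,
    score_fn: Callable[[str, Any], float],
    threshold_fn: Callable[[str, str], float],
    enforcement_fn: Callable[[str], str],
) -> GateVerdict:
    per_judge: list[JudgeScore] = []
    for judge_id in judge_ids:
        threshold = threshold_fn(judge_id, milestone)
        try:
            scores = [float(score_fn(judge_id, s)) for s in samples]
        except Exception:
            scores = []  # fail closed: a scoring error counts as not passed
        # Assumption: per-judge aggregation is the mean over all samples.
        mean = sum(scores) / len(scores) if scores else 0.0
        passed = bool(scores) and mean >= threshold
        per_judge.append(
            JudgeScore(judge_id, mean, threshold, passed, enforcement_fn(judge_id))
        )

    failing = [s.judge_id for s in per_judge if not s.passed]
    blocking_failed = any(
        not s.passed and s.enforcement == "blocking" for s in per_judge
    )
    # Assumption: advisory-only failures downgrade the verdict to "warn".
    verdict = "fail" if blocking_failed else ("warn" if failing else "pass")
    return GateVerdict(milestone, verdict, failing, per_judge)
```

Injecting the lookups keeps the aggregation logic testable without an Opik project; a real implementation would presumably bind them to get_metric_by_id and get_threshold.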

PF1 owns sub-agent lifecycle states (dev → test → promote → rollback); PC5 owns gate semantics that fire across those transitions and inside promote:

| PF1 lifecycle transition or sub-state | PC5 gate(s) |
|---|---|
| dev → test | None (developer-driven) |
| test → promote (PR merge + operator initiates ramp) | pre_merge (PR-driven, dataset-scoped) |
| Ramp progression inside promote, first non-zero step | pre_ramp (PC6-driven, trace-sampled) |
| Ramp progression inside promote, 100% step | pre_full (PC6-driven, larger trace sample) |
| * → rollback | None (operator-driven; rollback per §5.7) |

PF1 owns the lifecycle vocabulary (states and transitions). PC5 references PF1’s vocabulary by name to anchor gate firing.

| Spec | Citation |
|---|---|
| PC1 (Agent Composition) | Inherits file-backed registry pattern (PC1 §5.7 / Decision 9). PC5’s configs/rules/ and evaluation_manifest.yaml follow the same git-versioned, PR-reviewed discipline. |
| PC6 (Agent Variant CI/CD) | PC6 §5.8 reads from PC5’s gate API at three milestones. PC5 commits the API surface (get_metric_by_id, get_threshold, evaluate_gate); PC6 commits the trigger points. |
| PF1 (Sub-Agent Lifecycle) | PF1 owns lifecycle states; PC5 owns the gates between them. Promotion criteria reference PC5’s three milestones. |
| PF5 (Vertical Scaffolding + Validation Tools) | PF5 scaffolds rules/manifest stubs verticals fill in. PC5 commits the schema PF5 scaffolds from. |
| PS5 (Trace + Event Store) | Judge scores land in PS5’s store; per-team trace metadata composes with PC5’s score data for vertical-specific dashboards. |
| PD3 (DM Type Registry) | Independent of PC5; DM rollout uses PD3’s analytics labels + experiment-arm machinery, not PC5’s judge gates. |

Platform spec dependencies: PC1 (Agent Composition), PF1 (Sub-Agent Lifecycle).

Implementation dependencies:

  • Opik — LLM-as-Judge runtime (GEval), automation rules, datasets, traces. Already integrated.
  • GPT-5 family — judge models (current rules use gpt-5-mini). Operational.
  • PyYAML — manifest + rule file parsing. Standard.

External dependencies: None.

Cross-section soft dependencies:

  • PLT-619 (eval manifest categories) — PC5 commits the manifest schema; PLT-619 implementation consumes the schema.
  • PC6 §5.8 eval-gate hookup — PC6 reads PC5’s gate API.

R-1: Vertical Opik project misconfiguration. Verticals own their projects; PC5 has no write access. If a vertical misconfigures their project (wrong dataset version, missing categories, stale judge sync), the gate API returns errors but PC5 can’t auto-fix. Mitigated by clear CLI errors + PF5’s scaffolding (correct shape from the start).

R-2: Judge-template change blast radius. A platform-owned judge template change (e.g., updating response_quality.yaml’s rubric) affects every vertical consuming it. Mitigated by FR-12’s deprecation policy: major changes require re-baseline acknowledgment in PR, plus per-vertical CI gate runs to catch unexpected verdict shifts before merge.

R-3: Manifest ↔ rules-file drift. A category in the manifest references a judge_id that doesn’t have a matching rule file (or vice versa). CI-enforced via FR-14’s validate_manifest, which cross-checks against list_rules(). Mitigated structurally.
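The cross-check behind R-3 amounts to a two-way set difference. The sketch below is an assumption-laden illustration, not validate_manifest itself: `manifest_judge_ids` and `find_drift` are hypothetical names, and the manifest shape (categories mapping to lists of judge IDs, plus a global_metrics list) is assumed from the field names this spec mentions rather than taken from the actual schema.

```python
from __future__ import annotations


def manifest_judge_ids(manifest: dict) -> set[str]:
    """Collect every judge ID the manifest references (assumed shape)."""
    ids: set[str] = set(manifest.get("global_metrics", []))
    for judge_ids in manifest.get("categories", {}).values():
        ids.update(judge_ids)
    return ids


def find_drift(manifest: dict, rule_ids: set[str]) -> dict[str, list[str]]:
    """Report drift in both directions between manifest and rule files."""
    referenced = manifest_judge_ids(manifest)
    return {
        # Manifest references a judge with no rule file behind it.
        "missing_rule_file": sorted(referenced - rule_ids),
        # Rule file exists but nothing in the manifest references it.
        "unreferenced_rule": sorted(rule_ids - referenced),
    }
```

In CI, a non-empty list in either direction would fail the PR; `rule_ids` would come from whatever list_rules() returns.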

R-4: Threshold-tuning noise. Thresholds set too tight produce noisy gate failures (PR blockers that aren’t real regressions); too loose miss real regressions. Threshold-tuning audit trail (NFR-5) and operator override path (AC-3 “accept-baseline-shift” annotation) provide the levers. No automated tuning in v1.

R-5: Sampling-rate confusion across milestones. sampling_rate in the rule file applies only to online gates (pre_ramp, pre_full), not pre_merge. Subtle. Mitigated by §5.2 + FR-11 making the semantics explicit and FR-14 CI validation catching schema misuse.

R-6: Pre-merge gate slowdown as dataset grows. 345 items today × N judges = M LLM calls per PR. Growth in either dimension expands the budget. Mitigated by NFR-2’s target + per-judge enabled: false toggle for emergency throttling.
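The budget arithmetic in R-6 can be made concrete with a small estimator. This is a rough sketch: `pre_merge_llm_calls` is a hypothetical helper, it assumes one LLM call per (item, judge) pair, and the per-category split in the usage note is invented for illustration. The only figures taken from this document are the 345 dataset items and the 14 rule files.

```python
from __future__ import annotations


def pre_merge_llm_calls(
    items_per_category: dict[str, int],
    judges_per_category: dict[str, int],
) -> int:
    """Estimate LLM calls per PR: items x judges, summed per category."""
    return sum(
        count * judges_per_category.get(category, 0)
        for category, count in items_per_category.items()
    )


# Upper bound: every one of the 14 rule files runs on all 345 items.
upper_bound = pre_merge_llm_calls({"all": 345}, {"all": 14})
```

Under that worst case the estimate is 345 × 14 = 4830 calls per PR. Because the manifest maps categories to judges, the real figure should be lower; a hypothetical split of 100 items judged by 3 judges and 245 items judged by 2 would give 790.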

R-7: Cache invalidation lag during rollback. §5.7’s rollback path includes “cache invalidate” as Step 4. If the runtime’s cache TTL is long (operational), rollback latency stretches. Mitigated by operator-triggerable manual cache refresh + NFR-6’s 15-minute target.

OQ-1: Threshold-tuning automation. NFR-5 commits the audit trail (PR-review discipline). Long-term, threshold tuning could be automated based on production signal (auto-raise thresholds when prod outperforms; auto-lower when prod regresses). Out of scope for v1; surface as follow-on if production noise warrants.

OQ-2: Cross-vertical judge sharing. Verticals reuse judge templates from the central configs/rules/ (e.g., safety_restricted applies to every vertical). What’s the platform team’s review obligation on a vertical PR that changes a shared judge? Lean: any change to a configs/rules/*.yaml requires platform-team review (cross-vertical impact); vertical-specific judge additions can be vertical-owned. Needs platform-team owner input on review-board boundary.

OQ-3: Production trace sampling window for pre_ramp / pre_full. NFR-3 sets a target latency (10 minutes for 100 traces) but doesn’t pin the window (24h vs 7d vs other). Lean: 24h for pre_ramp (fast iteration); 7d for pre_full (confident promotion). Operational tuning; surface if defaults don’t fit production needs.

OQ-4: Vertical-side dataset versioning. PC5 commits the manifest schema (dataset.version); verticals reference their own dataset versions. When a vertical bumps their dataset version, do PC5’s gates auto-pick up the new version, or does the vertical signal the version bump? Lean: vertical signals (explicit manifest PR); auto-discovery is too magical. Needs PF5 reviewer input on the scaffolding side.

  • Rule-file schema validation: validate_rule_file accepts production rule files; rejects malformed schemas with structured errors per AC-6
  • Manifest schema validation: validate_manifest accepts production manifest; rejects malformed manifests
  • Cross-validation: validate_manifest cross-checks category-referenced judge_ids against list_rules() output; flags drift
  • Judge retrieval: get_metric_by_id(<existing>) returns the loaded instance; get_metric_by_id(<missing>) raises a clear error
  • Threshold lookup: get_threshold returns milestone-specific override when present; falls back to default; handles BOOLEAN / INTEGER / FLOAT score types
  • Manifest caching: load_manifest() returns the same instance on repeat calls; cache invalidation works
  • Rule caching: per-rule load is cached after first call; cache invalidation works
  • End-to-end pre_merge: PR opens with sub-agent change → gate fires → judges score against dataset → verdict returned within NFR-2 budget
  • End-to-end pre_ramp: rollout begins ramp → gate fires → judges score against trace sample → verdict returned within NFR-3 budget
  • End-to-end pre_full: ramp advances → gate fires → judges score → verdict returned
  • Gate failure → PC6 pipeline halt: simulated judge failure → PC6 §5.5 ramp halts at 0%
  • Rollback flow: bad rule change → git revert → CLI sync → cache refresh → judge behavior restored within NFR-6 budget
  • Per-vertical eval coverage on declared judge categories (verticals own this; PC5 supplies the framework)
  • Judge-template regression coverage: when a judge template changes (FR-12), the dataset’s expected outputs are re-baselined and the diff reviewed
  • Manifest-mapping coverage: every category in the manifest has at least one dataset item exercising it
  • PC6: PC5’s get_metric_by_id + get_threshold + evaluate_gate signatures match PC6 §5.8’s expected invocation
  • PF1: PC5’s three milestones correspond to PF1’s lifecycle transitions per §5.9 mapping
  • PF5: PC5’s rule file + manifest schemas match what PF5’s scaffolding generates
  • PS5: judge-score event shape persists correctly in PS5’s store
  • Vertical Opik project unreachable: get_metrics_for_category returns a clear error; PC5 doesn’t crash the runtime
  • Rule file syntax error mid-PR: CI fails the PR via validate_rule_file before merge
  • Manifest references missing judge_id: CI fails via validate_manifest cross-check
  • Threshold misconfigured for score_type (e.g., FLOAT threshold on a BOOLEAN judge): get_threshold raises a clear type error; CI catches via manifest validation
  • Judge model API outage: gate fails closed (rollout halts); operator notified

Phase 1 — Spec validation. PC5 reviewed and approved; cross-section contracts confirmed with PC1, PC6, PF1, PF5 reviewers.

Phase 2 — API surface. Add get_metric_by_id, get_threshold, validate_rule_file, validate_manifest, evaluate_gate to consumer-agent/src/consumer_agent/metrics/. Existing get_metrics_for_category stays unchanged.

Phase 3 — Manifest schema enforcement. Add CI validation per FR-14 on PRs touching configs/rules/** or configs/evaluation_manifest.yaml.

Phase 4 — Three-milestone gate semantics. Wire pre_merge to PR CI (block merge on judge failure); wire pre_ramp and pre_full to PC6’s rollout pipeline.

Phase 5 — Threshold convention rollout. Audit existing rule files; add thresholds.<judge_id> entries to manifest for each. Per-milestone overrides added as production signal justifies.

Phase 6 — Vertical onboarding documentation. Author the “how to add a vertical Opik project + wire judges” guide. PF5 scaffolds the stubs.
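The thresholds.<judge_id> entries added in Phase 5 imply a lookup with a per-milestone override and a default fallback. The sketch below shows one way that lookup could behave. It is not the real get_threshold: the explicit `thresholds` and `score_type` parameters, the `"default"` key name, and the two entry shapes (a bare value or a mapping) are assumptions made so the example is self-contained.

```python
from __future__ import annotations

from typing import Union

Threshold = Union[int, float, bool]


def _matches(value: object, score_type: str) -> bool:
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if score_type == "BOOLEAN":
        return isinstance(value, bool)
    if score_type == "INTEGER":
        return isinstance(value, int) and not isinstance(value, bool)
    if score_type == "FLOAT":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"unknown score_type: {score_type}")


def get_threshold_sketch(
    thresholds: dict, judge_id: str, milestone: str, score_type: str
) -> Threshold:
    """Milestone override when present, else the default; type-checked."""
    if judge_id not in thresholds:
        raise KeyError(f"no threshold configured for judge {judge_id!r}")
    entry = thresholds[judge_id]
    if isinstance(entry, dict):
        # Assumed shape: {"default": ..., "<milestone>": ...}
        value = entry.get(milestone, entry.get("default"))
        if value is None:
            raise KeyError(f"{judge_id!r} has no default or {milestone!r} threshold")
    else:
        value = entry  # single threshold shared by all milestones
    if not _matches(value, score_type):
        raise TypeError(
            f"threshold {value!r} for {judge_id!r} does not fit score_type {score_type}"
        )
    return value
```

Raising TypeError on a mismatch mirrors the failure mode where a FLOAT threshold is configured on a BOOLEAN judge.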

  • eval.gate.fired_total by milestone (pre_merge / pre_ramp / pre_full) — gate invocation volume
  • eval.gate.verdict_total by milestone, verdict (pass / fail / error) — pass/fail rates per milestone
  • eval.judge.score_distribution by judge_id, milestone — score distribution per judge per milestone; surfaces threshold-tuning candidates
  • eval.gate.duration_seconds by milestone — gate latency; feeds NFR-2 / NFR-3
  • eval.rule.validation_failed_total by rule_id, error_type — CI validation failures; should be zero in steady state (catches at PR time)
  • eval.manifest.drift_total — manifest references missing rule (or vice versa); should be zero
  • eval.rollback.duration_seconds — rollback latency; feeds NFR-6

PC5 is a contract spec, not deployable code. Rollback semantics:

  • Rule file rollback: per §5.7. Git revert + CLI sync + cache refresh. NFR-6 commits to 15 minutes.
  • Manifest rollback: same path; manifest is a single YAML file.
  • API surface rollback: if a new API method (evaluate_gate, get_metric_by_id) produces unexpected behavior, deprecate via standard Python deprecation; consumers fall back to per-judge invocation. Not expected at v1.
  • PC1: Agent Composition — file-backed prompt-block registry pattern PC5 mirrors
  • PC6: Agent Variant CI/CD + Experiment-Gated Rollout — consumes PC5’s gate API at three milestones
  • Platform Spec Lab — Wave 1 sequencing; PC5 scope row
  • PLT-619 — Eval manifest categories — the manifest schema PC5 codifies
  • consumer-agent/configs/rules/*.yaml — 14 production judge rule files; the schema PC5 commits to
  • consumer-agent/configs/evaluation_manifest.yaml — the production manifest PC5 codifies
  • consumer-agent/src/consumer_agent/metrics/llm_judge/base.py: LLMJudgeMetric runtime
  • consumer-agent/src/consumer_agent/cli/opik/{rules,eval,dataset}.py — CLI surface

| # | Decision | Resolution |
|---|---|---|
| 1 | Rule file schema | Codify the deployed capability_alignment.yaml shape (per §5.2). Required + optional fields documented; CI-validated. |
| 2 | Manifest schema | Codify dataset + schema + categories + global_metrics + thresholds (per §5.3). Single + per-milestone thresholds both supported. |
| 3 | Gate milestones | Three: pre_merge, pre_ramp, pre_full. PC6 commits the trigger points; PC5 commits the gate semantics. |
| 4 | API surface | get_metrics_for_category (existing) + get_metric_by_id (new) + load_manifest + get_threshold + evaluate_gate + validation functions. |
| 5 | Vertical Opik project ownership | Verticals own their project; PC5 has no write access. PC5 supplies templates + runtime + CLI; verticals integrate via API. |
| 6 | Rollback mechanics | Git revert + CLI sync + cache refresh. No consumer-agent redeploy. NFR-6 commits 15-minute target. |
| 7 | Sampling-rate semantics | pre_merge ignores sampling_rate (full dataset coverage); pre_ramp / pre_full apply the rule’s sampling_rate to trace selection. |
| 8 | Judge-template change discipline | FR-12: major changes require re-baseline acknowledgment in PR. Cross-vertical impact requires platform-team review (OQ-2). |
| 9 | evaluate_gate return shape | GateVerdict carries per_judge_scores (per-judge JudgeScore with score, threshold, passed, enforcement), composite verdict (pass / warn / fail), failing_judges list, and milestone. PC6 consumes verdict + failing_judges; diagnostic detail via per_judge_scores. |

  • From PC1 §5.7 (prompt-block registry pattern): PC5 inherits the file-backed + git-versioned + PR-reviewed discipline. Same pattern, different content domain.
  • No content migration from other specs: existing configs/rules/ and evaluation_manifest.yaml are codified, not migrated.