PF9: Eval Foundation
1. Problem Statement
The Rewards Assistant runs a working end-to-end eval system today. A golden dataset (v7, 345 items) sits behind a manifest that maps user-message categories to judge sets. A judge registry runs roughly a dozen LLM judges per category, each with calibrated thresholds drawn from JADE Round 1 against 301 human-annotated production traces. A three-gate CI pipeline (PC-5, pre_merge / pre_ramp / pre_full) blocks promotion on judge regressions. Production thumbs feedback lands in DynamoDB. The system works.
But that system was designed for one vertical. Every meaningful eval primitive (the judge registry, the calibration methodology, the dataset schema, the signal collection plumbing) lives inside consumer-agent and assumes a single owning team. PC-2 turns on new verticals at runtime via dispatch; PC-5 exposes an eval-config API for new verticals to declare their gates; PC-6 lets verticals add capability declarations via XML prompt components. None of these specs name where a new vertical gets a calibrated judge from, how it should classify its judges, how its user signals correlate back to those judges, or what the inheritance path looks like when a vertical wants the substrate without rebuilding it.
Without a foundation spec for this substrate, three failure modes are likely as verticals come online:
- Methodology drift. Each vertical invents its own threshold-derivation discipline. One vertical uses production p5 minus two sigma; another bootstraps from a single calibration run with no recalibration date; a third hand-picks numbers. The cross-vertical quality story fragments because the thresholds are not comparable.
- Signal-collection duplication. Each vertical builds its own thumbs-to-judge correlation pipeline against the existing DynamoDB store. Three implementations of the same logic, each slightly different, each requiring separate maintenance.
- Judge taxonomy fragmentation. PC-5 implicitly distinguishes safety-and-refusal judges (block at pre_merge) from quality judges (warn at pre_merge, block later), but the distinction is never formalized. Without a foundation-level taxonomy, every vertical re-decides which judges are safety and which are quality, and the gate-enforcement defaults break down.
PF-9 documents the existing substrate, names its contracts, prescribes the discipline around methodology and taxonomy, and defines the inheritance pattern so a new vertical can plug in and operate within hours rather than building a parallel eval stack.
2. Capabilities Source
PF-9 governs the eval substrate that consumes capability declarations rather than producing them. Capability declarations themselves are owned by PC-6 (XML prompt component authoring path replacing the deprecated capabilities.md surface). PF-9’s eval-manifest discovery surface reads from the same XML components PC-6 specifies, so the source of capability truth is single and shared.
Concretely, the eval manifest at configs/evaluation_manifest.yaml today declares categories (product_discovery, offer_discovery, general_qa, etc.) and the judges that apply per category. As capabilities move to XML prompt components per PC-6, the manifest’s category list becomes derivable from the XML rather than hand-maintained. PF-9 specifies the read-path discipline; PC-5 exposes the API surface that returns the resolved manifest; PC-6 owns the XML authoring path.
3. Background & Context
3.1 Today’s reality
The eval system in consumer-agent has five operating components:
- Golden dataset. Opik-hosted dataset consumer-agent-eval version 7, 345 items. Each item carries input, expected_output, and metadata: id (SHA256 of input), category (one of 10 categories: product_discovery, offer_discovery, general_qa, greeting, safety, shopping_list, location_based, personalization, image_analysis, limitation_handling), expected_tools, applicable_metrics, difficulty, source (human, ai, prod). The contract for this dataset is declared in configs/evaluation_manifest.yaml.
- Judge registry. Roughly a dozen LLM judges defined as rule files. The current registry includes response_quality (composite of 4 sub-judges after PLT-580 removed three unreliable ones), tool_compliance, data_integrity, ux_quality, privacy_location, fetch_legal, jailbreaking, safety_restricted, sensitive_topics, shopping_list_quality, product_routing, and others. Each judge has a calibrated threshold drawn from JADE Round 1 against 301 human-annotated production traces (PLT-596, Confluence page 6331367452).
- Manifest-driven judge selection. The manifest maps each category to the metrics that should score items in that category. The mapping is hand-maintained today; PC-6 now commits XML prompt components as the canonical capability declaration surface, which means the manifest’s category list becomes derivable from that source.
- Three-gate CI pipeline. PC-5 owns the pipeline mechanics: pre_merge runs on every PR with golden-only traces and warning-level enforcement for quality judges; pre_ramp runs before partial rollout against golden plus production samples; pre_full runs before full rollout against golden plus production plus adversarial. Alerting is in progress under PLT-585.
- Production signal collection. Thumbs-up and thumbs-down events from mobile clients land in DynamoDB with UserId, MessageId, and turn metadata. Reroll events (user re-submitting the same message) are detectable in the history middleware. Human-reviewer annotations on traces exist in Opik annotation queues, where JADE rounds consume them directly. Today there is no formal correlation pipeline tying user signals back to per-turn judge scores.
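As an illustration of the golden-dataset item contract described above, the following is a minimal sketch of an item validator. It assumes the id is the hex SHA256 digest of the UTF-8 encoded input (the spec only says "SHA256 of input") and that the listed fields other than input and expected_output sit under a metadata key. The example item, its difficulty value, and the helper names item_id and validate_item are hypothetical.

```python
import hashlib

# Category and source vocabularies as listed in the dataset description.
CATEGORIES = {
    "product_discovery", "offer_discovery", "general_qa", "greeting", "safety",
    "shopping_list", "location_based", "personalization", "image_analysis",
    "limitation_handling",
}
SOURCES = {"human", "ai", "prod"}


def item_id(input_text: str) -> str:
    """Assumed derivation: hex SHA256 digest of the UTF-8 encoded input."""
    return hashlib.sha256(input_text.encode("utf-8")).hexdigest()


def validate_item(item: dict) -> list:
    """Return the problems found in one golden-dataset item (empty if none)."""
    problems = []
    meta = item.get("metadata", {})
    if meta.get("id") != item_id(item.get("input", "")):
        problems.append("metadata.id is not the SHA256 of input")
    if meta.get("category") not in CATEGORIES:
        problems.append("unknown category: %r" % meta.get("category"))
    if meta.get("source") not in SOURCES:
        problems.append("unknown source: %r" % meta.get("source"))
    return problems


# Hypothetical item; field values are illustrative only.
example = {
    "input": "Any offers on oat milk near me?",
    "expected_output": "A short list of current oat milk offers.",
    "metadata": {
        "id": item_id("Any offers on oat milk near me?"),
        "category": "offer_discovery",
        "expected_tools": [],
        "applicable_metrics": ["response_quality"],
        "difficulty": "easy",
        "source": "human",
    },
}
print(validate_item(example))
```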
3.2 What PC-5 leaves to PF-9
PC-5 commits the YAML schema (rule files, manifest), the eval-config API, the three-gate enforcement model, and the per-judge applicability filter mechanism (the filter field on rule files that gates a judge to specific sub_agent_id values). PC-5 explicitly defers four areas to a future eval-foundation spec:
- Calibration methodology. PC-5 §5.2 names three valid baseline_source values (jade_calibration, production_distribution, provisional_seed) but does not define the derivation rules for any of them: what counts as a valid JADE run, the sampling window and percentile and std-dev rule for production distributions, or the recalibration timing for provisional seeds.
- Judge classification taxonomy. PC-5 §5.2 enforcement defaults distinguish safety-refusal from quality judges, but the classification itself is never formalized. PF-9 owns the taxonomy.
- Per-sub-agent judge applicability policy. PC-5 §5.2 commits the per-judge applicability filter mechanism (trace-level filter on sub_agent_id that gates when a judge fires) and §5.3 commits the categories.<category>.judges mapping. PF-9 owns the policy that decides which judges should apply to which sub-agent archetype (e.g., product-discovery sub-agent vs. shopping-list sub-agent). An archetype is a group of sub_agent_id values that share the same judge selection; the per-judge applicability filter operates on the specific sub_agent_id, but the policy that authors those filters thinks in archetypes.
- User-signal correlation pipeline. Entirely absent from PC-5. PF-9 owns the methodology and pipeline design; PS-5 stores the resulting correlation records.
3.3 What PF-9 defers
PF-9 explicitly does not own:
- The YAML schema fields and the eval-config API surface (PC-5).
- The three-gate pipeline mechanics (PC-5).
- Trace and event storage (PS-5; PF-9’s correlation pipeline reads joined views via PS-5’s reconstruct() and Tier-1/Tier-2 metadata, but writes user-signal feedback directly to Opik’s native feedback_scores per Decision 14).
- Per-vertical Grafana panels (PF-8 supplies the panel conventions; PF-9 supplies the judge categories PF-8 visualizes).
- Vertical scaffolding CLI and stubs (PF-5; PF-9 supplies the contract shape PF-5 generates against).
- Audit query path over eval data (PF-4; PF-4 reads from PS-5’s joined store).
3.4 Vocabulary
| Term | Definition |
|---|---|
| Eval substrate | The methodology, contract, and registration surfaces that make eval consumable as a foundation primitive. PF-9 governs the substrate; PC-5 surfaces it as configuration; PF-5 scaffolds it for verticals. |
| Judge | An LLM-based scorer that runs against a (input, expected_output, actual_output) triple and returns a score plus rationale. Defined as a rule file consumed by the registry. |
| Judge classification | The two-tier taxonomy that distinguishes safety-refusal judges (always block) from quality judges (configurable per milestone). Owned by PF-9; consumed by PC-5’s enforcement defaults. |
| Calibration provenance | The source of a judge’s threshold value: JADE calibration, production distribution, or provisional seed. PC-5 schemas the field; PF-9 defines what each source requires. |
| JADE | Judge Alignment with Domain Experts. A calibration round that compares judge scores against human-annotated production traces to derive thresholds and identify inverted or drifting judges. Round 1 completed 2026-04-04 against 301 traces. |
| Vertical | A user-facing product surface (Rewards Assistant today, plus future receipts, shopping, etc.) that consumes the eval substrate. Each vertical owns its own Opik project, dataset items, and per-category judge selection (operating model per PC-5 FR-7 and §5.6). |
| Correlation record | The per-turn view that joins PF-9’s user-signal feedback scores (user_signal_thumbs, user_signal_reroll) with judge scores on the same Opik trace. Not a separately stored object; the trace plus its feedback_scores + metadata IS the record. Produced when PF-9’s pipeline calls Opik’s log_traces_feedback_scores(); retrieved via PS-5’s reconstruct(response_id, purpose="feedback"); consumed by recalibration cycles, inverted-judge detection, and PF-8 dashboards. |
4. Requirements
4.1 Functional Requirements
FR-1: Judge registry as foundation primitive. The platform MUST expose a single registry of available judges, with each judge declared via a rule file whose schema is set by PC-5. The registry MUST be queryable by judge ID, by judge classification, and by per-sub-agent applicability. Verticals consume the registry; they do not maintain parallel registries.
FR-2: Judge classification taxonomy. The platform MUST classify every judge as either safety-refusal or quality. Safety-refusal judges enforce a baseline that no vertical may relax. Quality judges have per-vertical thresholds with calibration provenance. The classification MUST be declared in the judge’s rule file and MUST be machine-readable so PC-5 enforcement defaults can apply automatically.
FR-3: Calibration methodology discipline. Every judge threshold MUST cite its calibration source per PC-5 §5.2 (jade_calibration, production_distribution, or provisional_seed). PF-9 defines what each source requires:
- A jade_calibration source MUST cite a calibration ticket and a Confluence report. The report MUST document the human-annotated trace count, the agreement metric, and any inverted-judge findings.
- A production_distribution source MUST declare its sampling window, its percentile, and its std-dev rule. The default is “p5 of last 30 days minus 2 sigma.”
- A provisional_seed source MUST declare a recalibration-due date no more than 90 days from the seed date. 90 days is a maximum, not a default; verticals MAY recalibrate sooner.
FR-4: Recalibration cadence. Every judge with a calibrated threshold MUST declare a recalibration_due date. Provisional seeds have a maximum cadence of 90 days (per FR-3). JADE-calibrated and production-distribution-derived thresholds have a default cadence of 180 days. Verticals MAY shorten any cadence; they MUST NOT lengthen the provisional-seed maximum and MUST NOT lengthen the JADE/production default beyond 180 days.
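The cadence limits in FR-3 and FR-4 can be checked mechanically. The sketch below is one possible shape for that check, not the linter PF-9 ships. It assumes a calibrated_on date is available for each threshold (FR-3 mentions a "seed date", but no such rule-file field is named in this spec), and the function name cadence_violation is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# Maximum days between calibration and recalibration_due, per FR-3 and FR-4.
MAX_CADENCE_DAYS = {
    "provisional_seed": 90,
    "jade_calibration": 180,
    "production_distribution": 180,
}


def cadence_violation(baseline_source: str, calibrated_on: date,
                      recalibration_due: date) -> Optional[str]:
    """Return an error message if recalibration_due is too far out, else None."""
    limit = MAX_CADENCE_DAYS.get(baseline_source)
    if limit is None:
        return "unknown baseline_source: %r" % baseline_source
    latest_allowed = calibrated_on + timedelta(days=limit)
    if recalibration_due > latest_allowed:
        return "recalibration_due %s exceeds the %d-day limit (latest allowed %s)" % (
            recalibration_due.isoformat(), limit, latest_allowed.isoformat())
    return None


# Hypothetical usage: a provisional seed created on the JADE Round 1 date.
seeded = date(2026, 4, 4)
print(cadence_violation("provisional_seed", seeded, seeded + timedelta(days=60)))
print(cadence_violation("provisional_seed", seeded, seeded + timedelta(days=120)))
```

Shorter cadences pass unchanged, which matches the rule that verticals MAY shorten but MUST NOT lengthen.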
FR-5: Per-sub-agent judge applicability policy. PF-9 MUST define the policy that maps sub-agent archetype (a group of sub_agent_id values that share the same judge selection) to judge set. The policy is expressed as applies_to values on rule files plus PC-5’s per-judge applicability filter mechanism (the filter field). Rule-file PRs are the review surface: a reviewer sees the policy change as a diff to applies_to or filter values in the affected rule files, with archetypes named in commit messages and PR descriptions.
FR-6: User-signal correlation pipeline. PF-9 MUST define a pipeline that joins user signals (thumbs, reroll) with judge scores on the same Opik trace. The pipeline MUST write user-signal values via Opik’s native log_traces_feedback_scores() API (the same surface judge scores use), keyed on the trace’s response_id. PS-5 provides the V1 slicing dimensions (vertical, sub_agent, prompt_version) as Tier-1/Tier-2 metadata already attached to the trace at sub-agent factory init. Variant-arm slicing for experiment analysis joins with Snowflake assignment events (per Decision 15), not trace-side fields. V1 scope is thumbs and reroll; annotation-exclusion rationale in §5.4. Rollout posture: V1 is a scheduled batch job evolving from the existing scripts/correlate_feedback_traces.py prototype - the operator runs it on a cadence (initially daily) and each run sweeps recent DynamoDB thumbs events and middleware reroll events into Opik feedback_scores. No buffer, no continuous pipeline. V2 is event-driven write-on-arrival per signal if operational data shows V1 cadence is insufficient; the move to V2 requires evidence that batch latency materially blocks a consumer (inverted-judge detection, recalibration, PF-8 dashboards), not just a desire for faster updates. Implementation is tracked under PLT-604.
FR-7: Eval-manifest discovery surface. PF-9 MUST define how the eval manifest’s category list is derived. PC-6 commits XML prompt components as the canonical capability declaration surface; PF-9 specifies the read-path discipline that produces the resolved manifest from those declarations. The resolved manifest is the semantic content PC-5’s eval-config API returns.
FR-8: Multi-vertical extension pattern. PF-9 MUST define the steps a new vertical follows to inherit the eval substrate: how to bootstrap an Opik project, how to extend the manifest with vertical-scoped categories, how to declare per-vertical judge thresholds against the registry, and how to consume the user-signal correlation pipeline. The pattern MUST be expressible as a PF-5 scaffold target.
FR-9: Automated variant optimization loop. PF-9 SHOULD define an automated optimization loop that consumes eval outcomes and proposes Agent Definition variant revisions for human review. Variants include prompt forks, model swaps, tuning changes, and tool additions per PC-6; PLT-627’s v1 implementation focuses on prompt-fork variants as the simplest case. The loop MUST NOT auto-merge variants; it MUST surface candidates through PC-6’s experiment-gated rollout path.
FR-10: Inverted-judge detection. The platform MUST detect inverted judges (where the judge score correlates negatively with human-annotated quality). JADE Round 1 surfaced ten of twelve judges as inverted, demonstrating this is a recurring failure mode rather than an edge case. PF-9 MUST specify a periodic inversion check that runs at least quarterly and pages on inversion findings. The inversion check MUST report Pearson correlation, Spearman correlation, sample size, and 95% confidence interval per judge composite; inversion is declared when the upper bound of the 95% CI for Pearson correlation falls below zero against the human-annotated reference set.
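A minimal sketch of the statistics FR-10 asks the inversion check to report, for one judge composite. The spec does not say how the 95% confidence interval is computed; this sketch assumes the Fisher z-transform with a normal critical value of 1.96, which is a common approximation and should be treated as an assumption. All function names are hypothetical.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def inversion_report(judge_scores, human_scores, z_crit=1.96):
    """Report the FR-10 statistics; inverted means the CI upper bound is < 0."""
    n = len(judge_scores)
    if n != len(human_scores) or n < 4:
        raise ValueError("need at least 4 paired scores")
    r = pearson(judge_scores, human_scores)
    # Clamp so atanh stays finite when the correlation is exactly +/-1.
    z = math.atanh(max(min(r, 0.999999), -0.999999))
    se = 1.0 / math.sqrt(n - 3)
    low, high = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
    return {
        "pearson": r,
        "spearman": spearman(judge_scores, human_scores),
        "sample_size": n,
        "ci95": (low, high),
        "inverted": high < 0,
    }


# Hypothetical data: the judge scores fall as human-rated quality rises.
human = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge = [9.5, 9.0, 8.2, 7.0, 6.1, 5.0, 4.2, 3.0, 2.5, 1.0]
print(inversion_report(judge, human)["inverted"])
```

Requiring the upper bound of the interval to fall below zero, rather than the point estimate, keeps small or noisy samples from triggering a page.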
FR-11: Correlation-pipeline join-coverage observability. PF-9 MUST surface coverage gaps on the user-signal correlation pipeline itself. Specifically: when a signal event is observed by the batch job but no Opik trace exists for its response_id (Opik write would fail), or when an Opik trace exists with judge scores but no user signals have been attached to it as of the latest batch run, the pipeline MUST emit per-cause counts under eval.correlation.coverage.{missing_trace,missing_signal,both_present}.count. A vertical’s join-coverage rate (both_present / (both_present + missing_trace + missing_signal)) MUST be readable per vertical Tier-2 slicing. Counts are computed against a configurable lookback window (default 7 days, matching the lookback scripts/correlate_feedback_traces.py uses today). This is distinct from the eval.correlation.deadletter.count write-failure metric in §5.4: deadletter covers Opik-side write failures (Opik down, malformed, trace not found at write time); coverage covers logical join misses at read time.
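The join-coverage arithmetic in FR-11 can be sketched as a set comparison over response_id values within the lookback window. The inputs here are plain lists of IDs for a single vertical; how the batch job gathers them from DynamoDB, the middleware, and Opik is out of scope, and the function names are hypothetical.

```python
def coverage_counts(signal_response_ids, judged_trace_response_ids):
    """Per-cause counts for one vertical over one lookback window."""
    signals = set(signal_response_ids)
    traces = set(judged_trace_response_ids)
    return {
        "both_present": len(signals & traces),    # signal and judged trace joined
        "missing_trace": len(signals - traces),   # signal seen, no Opik trace
        "missing_signal": len(traces - signals),  # judged trace, no user signal
    }


def coverage_rate(counts):
    """both_present / (both_present + missing_trace + missing_signal)."""
    total = counts["both_present"] + counts["missing_trace"] + counts["missing_signal"]
    return counts["both_present"] / total if total else None


# Hypothetical IDs for one vertical.
counts = coverage_counts(["r1", "r2", "r3"], ["r2", "r3", "r4", "r5"])
print(counts, coverage_rate(counts))
```

In a real pipeline each count would be emitted under eval.correlation.coverage.{missing_trace,missing_signal,both_present}.count with the vertical as a slicing dimension.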
FR-12: Curated-export surface for user-feedback examples. PF-9 MUST define an export surface that produces filtered subsets of correlation records for downstream curation. V1 scope: a CLI that reads via PS-5’s reconstruct() and filters on feedback_scores.user_signal_thumbs value plus optional Tier-2 dimensions (vertical, sub_agent), emitting trace_id, user_query, agent_response, all judge_scores, and all feedback_scores as JSONL. The curated exports feed JADE round design (sourcing items where users disagreed with judges), calibration-dataset growth, and per-vertical inverted-judge investigation. Export-specific authorization MUST go through PS-5’s purpose="eval" query path so the existing audit sink (PS-5 FR-12b) captures the access.
FR-13: Judge-quality regression detection between calibration rounds. Between formal JADE recalibration cycles, the platform MUST run a judge-quality regression check on every PR that touches a judge rule file (including its embedded prompt field) OR any prompt component the rule file references via component-id. The check MUST compute KL divergence between the judge’s score distribution on the current validation dataset and its baseline distribution captured at the most recent calibration; it MUST also report ceiling and floor detection (fraction of items scored at the maximum or minimum). The check MUST emit machine-readable pass/fail per judge with the KL score, threshold, and failure reason. The KL threshold per judge MUST be derived under FR-3’s discipline (jade_calibration, production_distribution, or provisional_seed) with the same recalibration cadence FR-4 applies to judge score thresholds; ceiling and floor fractions are reported but PF-9 does not commit a pass/fail cutoff on them in V1 (they surface as warnings until a baseline is established). This is a warning signal for regression, not an automatic exclusion mechanism: the fix for a failing judge is a prompt revision, not its removal from the registry. PF-9’s CI surface runs the regression check and emits the structured pass/fail; PC-5’s pre_merge gate is a separate pipeline that MAY consume the structured output as one of its signals. The composition is “two pipelines, one signal”: PF-9 owns the methodology of what the check computes; PC-5 owns whether to block on the check’s verdict.
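The regression check in FR-13 can be sketched as follows. Several choices here are assumptions the spec does not fix: scores are taken to lie in [0, 1], distributions are compared as 10-bin histograms with a small additive smoothing term so empty bins do not produce infinities, and the divergence direction is KL(current || baseline). The record field names are hypothetical.

```python
import math


def histogram(scores, bins=10, low=0.0, high=1.0, eps=1e-6):
    """Smoothed probability histogram of scores over [low, high]."""
    counts = [0] * bins
    for s in scores:
        index = int((s - low) / (high - low) * bins)
        counts[max(0, min(index, bins - 1))] += 1
    total = len(scores) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def regression_check(judge_id, current, baseline, kl_threshold, low=0.0, high=1.0):
    """Emit one machine-readable record for a judge touched by a PR."""
    if not current or not baseline:
        raise ValueError("both score sets must be non-empty")
    kl = kl_divergence(histogram(current, low=low, high=high),
                       histogram(baseline, low=low, high=high))
    passed = kl <= kl_threshold
    return {
        "judge_id": judge_id,
        "kl": kl,
        "kl_threshold": kl_threshold,
        # Reported only; V1 commits no pass/fail cutoff on these fractions.
        "ceiling_fraction": sum(1 for s in current if s >= high) / len(current),
        "floor_fraction": sum(1 for s in current if s <= low) / len(current),
        "verdict": "pass" if passed else "fail",
        "failure_reason": None if passed else
            "KL divergence %.3f exceeds threshold %.3f" % (kl, kl_threshold),
    }


# Hypothetical data: a prompt change that saturates the judge at the maximum.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] * 10
print(regression_check("response_quality", [1.0] * 90, baseline_scores, 0.1)["verdict"])
```

The check only produces the verdict; whether pre_merge blocks on it remains PC-5's decision.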
FR-14: Inter-annotator agreement on the human reference set. Judge calibration assumes the human reference set is reliable; PF-9 MUST therefore require an inter-annotator agreement measurement on every JADE round before that round’s annotations enter the reference set. The measurement MUST be Krippendorff’s alpha (handles three or more annotators and ordinal severity scores; Cohen’s kappa is a special case that does not generalize to JADE’s typical shape). The pass/fail threshold per judge category MUST be derived from a baseline inter-annotator-agreement pilot - the annotator equivalent of a JADE calibration round - and MUST cite its baseline_source under the same discipline FR-3 imposes on judge thresholds: agreement_calibration (derived from a dedicated pilot), production_annotation_distribution (derived from prior rounds’ empirical distribution per category), or provisional_seed (literature-derived starting value such as Krippendorff’s 0.667 “tentative conclusions” floor, valid for at most 90 days per FR-4 recalibration cadence). The first JADE round under PF-9 will operate on a provisional seed; the second round MUST replace the seed with an empirically-derived threshold per category. Rounds at alpha < threshold MUST be quarantined, not silently merged into the reference set. Quarantine remediation: (a) run a calibration session against the items that drove the lowest pairwise agreement, (b) tighten the codebook entries those items expose, (c) re-score the quarantined items, (d) re-measure alpha. The reference set is only updated when the round’s alpha clears the calibrated threshold for every category. JADE Round 1’s “10 of 12 judges inverted” conclusion is only as strong as the inter-annotator agreement underneath it, which is why this check sits at the substrate.
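For orientation, here is a compact sketch of Krippendorff's alpha computed from a coincidence matrix, followed by the quarantine gate. It defaults to the nominal distance (0 when two ratings match, 1 otherwise). FR-14's ordinal severity scores would call for the ordinal distance metric, which is not implemented here; a different metric can be passed through the distance parameter. The 0.667 threshold is the provisional seed named in FR-14, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def nominal_distance(a, b):
    """0 when two ratings match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def krippendorff_alpha(units, distance=nominal_distance):
    """units: one list of ratings per item, with missing ratings omitted."""
    coincidence = defaultdict(float)
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # an item rated by one annotator cannot be paired
        for a, b in permutations(ratings, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    totals = defaultdict(float)
    for (a, _), weight in coincidence.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")
    observed = sum(w * distance(a, b) for (a, b), w in coincidence.items()) / n
    expected = sum(totals[a] * totals[b] * distance(a, b)
                   for a in totals for b in totals) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1.0 - observed / expected


def should_quarantine(alpha, threshold=0.667):
    """Rounds at alpha < threshold are quarantined, per FR-14."""
    return alpha < threshold


# Hypothetical round: two annotators, four items, one disagreement.
alpha = krippendorff_alpha([[1, 1], [1, 1], [2, 2], [1, 2]])
print(round(alpha, 4), should_quarantine(alpha))
```

In this toy round a single disagreement out of four items is enough to fall under the provisional seed, which shows why the threshold needs an empirical baseline per category rather than a literature value.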
4.2 Non-Functional Requirements
NFR-1: Correlation pipeline freshness. Freshness is set by the FR-6 rollout posture: V1 (scheduled batch job) MUST surface correlation records on the batch cadence (initially daily; operator may tighten); V2 (event-driven write-on-arrival, once shipped per FR-6’s V2 trigger) MUST surface correlation records in Opik within 5 minutes of the underlying user signal for at least 95 percent of records. The V2 target bounds the delay between a user thumbs-down and the recalibration cycle being able to see it; V1’s coarser cadence is acceptable because PF-9’s downstream consumers (recalibration, inverted-judge detection, PF-8 dashboards) read on cadences measured in days or weeks, not minutes.
NFR-2: Judge registry query latency. Judge registry lookups by ID or classification MUST return in under 50ms at p95 from any consumer-agent process. Registry lookups happen on every eval run; latency budget is tight.
NFR-3: Recalibration cycle cost. A full recalibration cycle for a vertical (re-run JADE round, derive new thresholds, regenerate rule files) MUST complete in under 4 hours of wall-clock time. This bounds the operational cost of the discipline FR-4 imposes.
NFR-4: Backward compatibility with single-vertical state. The Rewards Assistant eval suite MUST continue to operate without regression while PF-9’s multi-vertical extension pattern is introduced. Migration is incremental, not big-bang.
NFR-5: Substrate spec coverage. Every section of PF-9’s solution design MUST have a corresponding test suite under tests/evaluation/ or an Opik experiment with a stable ID. No substrate primitive ships untested.
4.3 Acceptance Criteria
AC-1: Judge registry queryable. Given a deployed consumer-agent, when a client calls the registry with a judge ID, then the registry returns the rule file metadata plus its calibration provenance, classification, and recalibration-due date.
AC-2: Classification machine-readable. Given the current set of rule files, when a script reads each rule file, then every rule file declares either classification: safety_refusal or classification: quality. No rule file omits the field.
AC-3: Calibration provenance enforced. Given a rule file with a threshold that has no baseline_source field, when CI runs the rule-file linter, then CI fails with a clear error pointing at PF-9 FR-3.
AC-4: User signal lands on the Opik trace. Given a user thumbs-down event on a turn that ran four judges, when the correlation pipeline runs, then the Opik trace’s feedback_scores field MUST include an entry {name: "user_signal_thumbs", value: -1.0}. A reconstruct(response_id, purpose="feedback") call MUST return the joined view: the turn ID, the four judge scores, the user-signal feedback scores (user_signal_thumbs, user_signal_reroll), and PS-5’s Tier-2 metadata (vertical, sub_agent, prompt_version). Variant-arm slicing, when needed, joins with Snowflake assignment events per Decision 15.
AC-5: New vertical onboarding. Given a new vertical following PF-9’s extension pattern, when the vertical runs the PF-5 scaffold and declares its category and judge selection, then the vertical’s eval runs end-to-end against its Opik project without any code changes to consumer-agent.
AC-6: Inverted-judge detection. Given the historical JADE Round 1 data, when the inversion check runs against the current judge registry, then the check correctly identifies the ten inverted judges that the human Round 1 analysis surfaced.
AC-7: Manifest discovery from XML. Given a set of XML prompt components declaring capabilities, when the eval-manifest discovery surface resolves the manifest, then the manifest’s category list matches the XML-declared capabilities exactly.
AC-8: Multi-vertical isolation. Given two verticals both consuming PF-9’s substrate, when one vertical’s calibration cycle runs, then the other vertical’s judge thresholds are unaffected.
AC-9: Recalibration cadence enforced. Given a rule file with baseline_source: provisional_seed and a recalibration_due date in the past, when CI runs, then CI emits a warning at pre_merge and blocks at pre_ramp.
AC-10: Inversion check reports statistics. Given the inverted-judge detection from FR-10, when the periodic check completes for a judge composite, then the report MUST include Pearson correlation, Spearman correlation, sample size, and 95% confidence interval. Inversion is declared only when the upper bound of the Pearson 95% CI is below zero.
AC-11: Join-coverage observable per vertical. Given a vertical with traces and user signals produced during a 24-hour window, when the join-coverage counters are queried, then the response includes both_present, missing_trace, and missing_signal per vertical Tier-2 slicing for that window; the coverage rate is computable as both_present / (both_present + missing_trace + missing_signal). The counters are distinct from eval.correlation.deadletter.count.
AC-12: Curated export round-trips through PS-5. Given a CLI invocation of the export surface filtering on feedback_scores.user_signal_thumbs < 0 for one vertical, when the CLI runs, then it produces JSONL with one record per matching trace including trace_id, user_query, agent_response, all judge_scores, and all feedback_scores; AND the access MUST appear in PS-5’s ps5.query_audit sink with purpose="eval".
AC-13: Judge-quality regression check emits machine-readable result. Given a PR that touches a judge rule file, when the regression check runs, then the check emits one record per affected judge with: KL divergence value, the KL threshold, ceiling and floor fractions, pass/fail verdict, and a human-readable failure reason on fail. The result is consumable by PC-5’s pre_merge gate; PF-9 does not specify whether the gate hard-blocks (PC-5 mechanics).
AC-14: Inter-annotator agreement gates reference-set update. Given a fresh JADE round with two or more annotators scoring shared items, when the round closes, then the platform MUST compute Krippendorff’s alpha per judge category and emit a report including alpha value, sample size, item-level pairwise-agreement breakdown, the active threshold per category, the threshold’s baseline_source, and pass/fail. Categories at alpha < threshold MUST be flagged for quarantine; the reference set MUST NOT be updated with quarantined items until alpha clears the calibrated threshold on a re-score. CI MUST fail if any active threshold’s baseline_source is provisional_seed and its recalibration_due date is in the past (mirrors AC-9 for judge thresholds).
AC-15: Reserved user-signal prefix rejected by rule-file linter. Given a judge rule file declaring any id: matching the user_signal_* prefix, when CI runs the FR-2 rule-file linter, then CI fails with a clear error pointing at R-4’s prefix reservation and naming the alternative path (new user-signal additions go through PF-9 review, not through the judge registry).
5. Solution Design
5.1 The substrate at a glance
PF-9 names six surfaces that compose into the eval substrate. Each surface is consumed by PC-5, PC-6, PS-5, PF-4, PF-5, or PF-8; none of them duplicates work owned by those specs.
    +-------------------------------------------------------------+
    |                    PF-9 Eval Foundation                     |
    |                                                             |
    |  +---------------------+   +---------------------------+    |
    |  | Judge registry      |   | Classification taxonomy   |    |
    |  | (catalog + lookup)  |   | (safety_refusal, quality) |    |
    |  +----------+----------+   +-------------+-------------+    |
    |             |                            |                  |
    |             v                            v                  |
    |  +---------------------+   +---------------------------+    |
    |  | Calibration         |   | Per-sub-agent judge       |    |
    |  | methodology         |   | applicability policy      |    |
    |  +----------+----------+   +-------------+-------------+    |
    |             |                            |                  |
    |             v                            v                  |
    |  +---------------------+   +---------------------------+    |
    |  | User-signal         |   | Eval-manifest discovery   |    |
    |  | correlation pipeline|   | surface                   |    |
    |  +----------+----------+   +-------------+-------------+    |
    |             |                            |                  |
    +-------------+----------------------------+------------------+
                  |                            |
                  v                            v
    (writes to Opik feedback_scores)  (consumed by PC-5 API)

5.2 Judge registry and classification taxonomy
The judge registry is a content-addressed catalog of rule files. Each rule file declares:
- id (judge identifier, kebab-case)
- classification (one of safety_refusal or quality)
- applies_to (sub-agent archetypes this judge scores; empty means all archetypes)
- threshold (floor + tolerance)
- baseline_source (one of jade_calibration, production_distribution, provisional_seed)
- calibration_ref (ticket or Confluence reference)
- recalibration_due (ISO date)
The schema is owned by PC-5; PF-9 defines the field semantics and the linter rules that enforce them.
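To make the linter rules concrete, here is a minimal sketch that checks a parsed rule file against the field semantics listed above. It operates on a plain dict so it does not depend on a YAML library; the example rule values (floor, tolerance, dates) are hypothetical, and the real schema remains PC-5's. The reserved user_signal_ prefix check mirrors AC-15.

```python
from datetime import date

CLASSIFICATIONS = {"safety_refusal", "quality"}
BASELINE_SOURCES = {"jade_calibration", "production_distribution", "provisional_seed"}
RESERVED_PREFIX = "user_signal_"  # reserved for user-signal feedback scores


def lint_rule_file(rule: dict) -> list:
    """Return lint errors for one parsed rule file (empty list means clean)."""
    errors = []
    judge_id = str(rule.get("id", ""))
    if not judge_id:
        errors.append("missing id")
    if judge_id.startswith(RESERVED_PREFIX):
        errors.append("id uses the reserved user_signal_ prefix")
    if rule.get("classification") not in CLASSIFICATIONS:
        errors.append("classification must be safety_refusal or quality (PF-9 FR-2)")
    if "threshold" in rule and rule.get("baseline_source") not in BASELINE_SOURCES:
        errors.append("threshold has no valid baseline_source (PF-9 FR-3)")
    try:
        date.fromisoformat(str(rule.get("recalibration_due")))
    except ValueError:
        errors.append("recalibration_due must be an ISO date (PF-9 FR-4)")
    return errors


# Hypothetical rule file content, already parsed into a dict.
rule = {
    "id": "response_quality",
    "classification": "quality",
    "applies_to": [],
    "threshold": {"floor": 0.72, "tolerance": 0.03},
    "baseline_source": "jade_calibration",
    "calibration_ref": "PLT-596",
    "recalibration_due": "2026-10-01",
}
print(lint_rule_file(rule))
```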
Classification taxonomy:
- safety_refusal: judges whose floor must never relax across verticals. Examples: jailbreaking, safety_restricted, sensitive_topics. PC-5 enforces these at block for every milestone.
- quality: judges with vertical-tunable thresholds. Examples: response_quality, ux_quality, data_integrity. PC-5 enforces these at warn at pre_merge and block at pre_ramp and later, by default.
A vertical may not change a judge’s classification. A vertical may tighten a quality judge’s threshold above the baseline; it may not loosen the threshold below the baseline.
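The enforcement defaults and the tighten-only rule can be expressed as two small functions. This is a sketch under two assumptions: a higher floor means a stricter judge, and safety_refusal floors may be raised but never lowered (the spec only says they may not be relaxed). The function names are hypothetical, and actual gate enforcement belongs to PC-5.

```python
GATES = ("pre_merge", "pre_ramp", "pre_full")


def default_enforcement(classification: str, gate: str) -> str:
    """Default enforcement level for a judge classification at a CI gate."""
    if gate not in GATES:
        raise ValueError("unknown gate: %r" % gate)
    if classification == "safety_refusal":
        return "block"  # blocks at every milestone
    if classification == "quality":
        return "warn" if gate == "pre_merge" else "block"
    raise ValueError("unknown classification: %r" % classification)


def override_allowed(baseline_floor: float, vertical_floor: float) -> bool:
    """A vertical may tighten (raise) a floor but never loosen (lower) it."""
    return vertical_floor >= baseline_floor


for gate in GATES:
    print(gate, default_enforcement("quality", gate), default_enforcement("safety_refusal", gate))
```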
Central registry, vertical-tunable calibration. The split between what is platform-shared and what is per-vertical is load-bearing:
- Central (platform-shared). Judge logic, prompt text, classification, and the canonical judge id live in the central registry. All verticals see the same judge definition; a vertical does not fork judge logic.
- Per-vertical. Rule files at configs/rules/<vertical>/<judge>.yaml carry calibration (baseline_source, calibration_ref, recalibration_due), threshold, and applicability filter (filter: sub_agent_id == <vertical>). The §5.10 receipts walkthrough is a concrete instance.
- Adding a new judge. A vertical that needs a brand-new judge proposes it through a PR against the central registry under CODEOWNERS review (Decision 18). The new judge enters the platform-shared registry available to every vertical, not forked per-vertical. This avoids the per-vertical-judge-proliferation that would defeat cross-vertical comparability (Decision 10).
5.3 Calibration methodology
PF-9 prescribes three calibration sources and what each requires.
JADE calibration (baseline_source: jade_calibration):
- Sample at least 200 production traces stratified by category.
- Have human reviewers annotate each trace against the judge’s rubric.
- Compute agreement (Cohen’s kappa or Krippendorff’s alpha) between human annotation and judge score.
- Threshold is derived from the score distribution where human-judged-acceptable traces sit; floor is the 5th percentile of acceptable scores.
- Inverted judges (negative correlation) are surfaced and either re-prompted or removed.
- Round 1 (PLT-596, 2026-04-04) is the reference implementation. Future rounds follow the same shape.
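The agreement step in the list above can be illustrated with Cohen's kappa over binary labels. This is a sketch under stated assumptions, not the JADE tooling: it assumes the human annotation is reduced to acceptable / not acceptable and the judge score is binarized at a candidate cut-off, and the function name is hypothetical.

```python
from typing import Sequence


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters scoring the same traces."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    # Observed agreement: fraction of traces where both raters match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement expected by chance given each rater's base rate.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined,
        # so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A typical call would binarize first, for example `cohens_kappa(human_ok, [s >= 0.65 for s in judge_scores])`, where `0.65` is only an illustrative cut-off. A kappa at or below zero means the judge agrees with reviewers no better than chance, which is one way an inverted judge shows up.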
Production distribution (baseline_source: production_distribution):
- Sampling window: 30 days by default; verticals may shorten to 7 days if traffic is high enough.
- Percentile: p5 by default.
- Std-dev rule: subtract 2 sigma from the percentile to set the floor.
- Re-derivation cadence: 180 days default (matches FR-4); verticals may shorten but MUST NOT lengthen. Encoded in `recalibration_due`.
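The floor arithmetic for the JADE and production-distribution sources can be sketched as follows. The spec does not name a percentile interpolation method or say whether sigma is a sample or population standard deviation; this sketch assumes nearest-rank percentiles and the sample standard deviation, and all function names are hypothetical.

```python
import math
import statistics
from datetime import date, timedelta
from typing import Sequence

DEFAULT_CADENCE_DAYS = 180


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method; the spec does not fix one)."""
    if not values:
        raise ValueError("no scores to take a percentile of")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def jade_floor(acceptable_scores: Sequence[float]) -> float:
    """JADE: floor is the 5th percentile of human-acceptable scores."""
    return percentile_nearest_rank(acceptable_scores, 5)


def production_floor(scores: Sequence[float], pct: int = 5) -> float:
    """Production distribution: percentile minus two standard deviations."""
    return percentile_nearest_rank(scores, pct) - 2 * statistics.stdev(scores)


def next_recalibration(calibrated_on: date,
                       cadence_days: int = DEFAULT_CADENCE_DAYS) -> date:
    """Verticals may shorten the cadence but must not lengthen it."""
    if not 0 < cadence_days <= DEFAULT_CADENCE_DAYS:
        raise ValueError("cadence must be between 1 and 180 days")
    return calibrated_on + timedelta(days=cadence_days)
```

The value returned by `next_recalibration` is what a rule file would carry in `recalibration_due`.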
Provisional seed (baseline_source: provisional_seed):
- Allowed only when no JADE round or production distribution is available yet (i.e., a brand-new judge or a brand-new vertical).
- Threshold is the mean of a small bootstrap run minus 2 sigma.
- MUST cite a `recalibration_due` within 90 days.
- CI warns at pre_merge and blocks at pre_ramp if a provisional seed’s recalibration date is past due.
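The provisional-seed rules can be expressed as three small checks. This is a sketch, not the FR-3 linter: the function names are hypothetical, sigma is assumed to be the sample standard deviation, and the behavior at pre_full (treated here as block, like pre_ramp) is an assumption because the text only names pre_merge and pre_ramp.

```python
import statistics
from datetime import date, timedelta
from typing import Sequence

MAX_SEED_DAYS = 90


def provisional_threshold(bootstrap_scores: Sequence[float]) -> float:
    """Mean of a small bootstrap run minus two standard deviations."""
    mean = statistics.mean(bootstrap_scores)
    return mean - 2 * statistics.stdev(bootstrap_scores)


def lint_seed_deadline(declared_on: date, recalibration_due: date) -> None:
    """Reject a provisional seed whose deadline is more than 90 days out."""
    latest = declared_on + timedelta(days=MAX_SEED_DAYS)
    if recalibration_due > latest:
        raise ValueError(
            f"recalibration_due {recalibration_due} is later than {latest}"
        )


def overdue_gate_outcome(recalibration_due: date, today: date, gate: str) -> str:
    """Warn at pre_merge, block at later gates, once the date has passed."""
    if today <= recalibration_due:
        return "pass"
    return "warn" if gate == "pre_merge" else "block"
```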
5.4 User-signal correlation pipeline
What we are proposing. For each turn, automatically pair the judge verdicts (already produced today by async LLM judges against Opik traces) with the user reactions (already captured today in DynamoDB as thumbs and detectable as rerolls in history middleware). The pairing is queryable and continuous rather than only surfacing during quarterly JADE rounds. Inverted-judge detection (FR-10), recalibration cycle input (FR-3), and per-vertical quality dashboards (PF-8) all read from this pairing.
How it composes with PS-5 and Opik. PS-5 owns the joined trace + SSE store, Tier-1/Tier-2 metadata attachment via set_context(), and the reconstruct(response_id, purpose) query API consumers use to read joined turns. PF-9 writes user-signal values via Opik’s native log_traces_feedback_scores() API directly onto the same trace; the correlation “record” is the joined trace + SSE frames + Tier-1/2 metadata + feedback_scores + judge scores returned by reconstruct(). No separate store, no parallel pipeline. The choice of Opik feedback_scores over PS-5 Tier-3 metadata is captured in Decision 14 (§11.2) with the supporting evidence.

                                             ┌──────────────────────────────────────────────────────────┐
                                             │  Opik trace (per response_id)                            │
                                             │                                                          │
    ┌──────────────┐                         │  Tier-1/2 metadata (set by PS-5 set_context)             │
    │ Thumbs       │                         │    response_id, user_id, session_id,                     │
    │ (DynamoDB)   │──┐                      │    vertical, sub_agent, prompt_version, ...              │
    └──────────────┘  │                      │                                                          │
                      │   ┌───────────────┐  │  feedback_scores.user_signal_thumbs ◄──── written by     │
                      ├──►│ PF-9 batch    │──┼─►feedback_scores.user_signal_reroll ◄──── PF-9 batch     │
    ┌──────────────┐  │   │ job - V1:     │  │                                           job            │
    │ Reroll       │──┘   │ daily sweep   │  │  feedback_scores.{judge}            ◄──── written by     │
    │ (history     │      └───────┬───────┘  │                                           async LLM      │
    │  middlewr.)  │              │          │                                           judges         │
    └──────────────┘              │          └────────────────────────────┬─────────────────────────────┘
                                  │ log_traces_feedback_scores()          │
                                  │ per (trace, signal) pair              │
                                  ▼                                       │
                           ┌──────────────┐                               │
                           │ Opik write   │── success ────────────────────┘
                           └──────┬───────┘
                                  │
                                  │ Opik down / malformed /
                                  │ trace not found
                                  ▼
                     ┌──────────────────────────┐
                     │ Dead-letter queue        │
                     │ eval.correlation.        │
                     │ deadletter.count         │
                     └──────────────────────────┘

Read side (consumers query the joined view):

            ┌─────────────────────────────────────────┐
            │ PS-5 reconstruct(response_id,           │
            │   purpose="feedback")                   │
            │ → {trace, metadata, feedback_scores,    │
            │    judge_scores, frames}                │
            └──────────────────┬──────────────────────┘
                               │
            ┌──────────────────┼──────────────────┐
            ▼                  ▼                  ▼
    ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
    │ Inverted-     │  │ Recalibration │  │ PF-8 quality  │
    │ judge detect  │  │ cycles        │  │ dashboards    │
    │ (FR-10)       │  │ (FR-3)        │  │               │
    └───────────────┘  └───────────────┘  └───────────────┘

Two signal inputs (V1 scope):
- Thumbs events. DynamoDB stores thumbs-up and thumbs-down per `MessageId` with the `UserId` and the `turn_id` (OpenAI response_id). Arrive asynchronously after the turn from mobile callbacks.
- Reroll events. A reroll is the user re-submitting the same input (matched by input hash) within 300 seconds of receiving the previous response in the same session. Detected per-turn in the history middleware. The reroll signal attaches to the previous turn (the one rerolled away from), not the new submission. Window threshold is configurable per vertical.
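The reroll rule can be sketched as a comparison between two consecutive turns. This is an illustration only: the `Turn` fields, the SHA-256 hash over the raw input text, and the function names are assumptions, since the text does not specify how the history middleware hashes or normalizes input.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    turn_id: str
    session_id: str
    user_input: str
    submitted_at: float  # epoch seconds when the user sent the input
    responded_at: float  # epoch seconds when the response was delivered


def input_hash(text: str) -> str:
    """Hash of the raw input; normalization is deliberately left out."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rerolled_turn_id(previous: Turn, current: Turn,
                     window_seconds: float = 300.0) -> Optional[str]:
    """Return the turn the reroll signal attaches to, or None."""
    if previous.session_id != current.session_id:
        return None
    if input_hash(previous.user_input) != input_hash(current.user_input):
        return None
    elapsed = current.submitted_at - previous.responded_at
    if 0 <= elapsed <= window_seconds:
        # The signal belongs to the turn that was rerolled away from.
        return previous.turn_id
    return None
```

`window_seconds` is a parameter so a vertical can override the 300-second default.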
Opik annotation events are deliberately excluded from V1’s per-turn correlation tags. Annotations are batch human-reviewer work that arrives days after the turn; they stay in their existing Opik annotation queue and are read by JADE rounds at calibration time. Folding annotations into the per-turn correlation surface would conflate two cadences with different operational shapes (per-turn signals vs days-later human review), and the JADE round is already the right reader for that signal class.
Judge scores arrive asynchronously. Judges run against traces in Opik after the turn completes, not inline during the turn. Correlation can only complete for a given trace after judges have run for that trace; NFR-1’s 5-minute freshness target measures from the time of the last input arrival, not from turn time.
Feedback attachment via Opik feedback_scores. When a signal event arrives, PF-9’s pipeline calls Opik’s native feedback API to attach a typed score to the existing trace:

    # Example: thumbs-down arrives for turn resp_abc
    opik_client.log_traces_feedback_scores([
        {"id": trace_id, "name": "user_signal_thumbs", "value": -1.0, "source": "user_mobile"},
        {"id": trace_id, "name": "user_signal_reroll", "value": 0.0, "source": "history_middleware"},
    ])

Feedback scores live in the trace’s feedback_scores field (separate from metadata), queryable via Opik’s standard retrieval and via consumer-agent CLI filters (`opik traces curate -f 'feedback_scores.user_signal_thumbs < 0'`). PS-5 has already attached Tier-1 + Tier-2 keys at sub-agent factory init via set_context(). Each batch-job run writes one log_traces_feedback_scores() call per (trace, signal) pair it discovers; signals that arrive later than the batch cadence land on a subsequent run and naturally update the same trace. Signals that never land on any trace (Opik write fails, trace not found) surface in FR-11’s missing_signal counter.
The assembled correlation record returned by reconstruct(response_id, purpose="feedback") carries:

    turn_id                                            (Tier-1 metadata)
    user_id, session_id, environment, agent_version    (Tier-1 metadata)
    vertical, sub_agent, prompt_version                (Tier-2 metadata)
    feedback_scores.user_signal_thumbs                 (Opik native; written by PF-9)
    feedback_scores.user_signal_reroll                 (Opik native; written by PF-9)
    judge_scores                                       (Opik native; written by async LLM judges)
    category                                           (resolved via PF-9 manifest discovery)
    timestamp                                          (Tier-1 metadata)

The prompt_version Tier-2 slot is a value-format evolution coordinated PC-6-internally; see Decision 16 (§11.2) for the rationale and Decision 15 for variant-arm slicing via Snowflake assignment events.
Failure semantics. If a feedback-score write cannot land (Opik unavailable, malformed score, trace_id not found), the event lands on a dead-letter queue. A metric (eval.correlation.deadletter.count) alerts when the rate spikes. PF-9 does not retry indefinitely; recalibration cycles tolerate some signal loss but should know about it.
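The failure semantics can be sketched as a single-attempt sweep that routes any failed write to a dead-letter sink. This is not the PLT-604 implementation: `sweep_signals`, `write_score`, and `dead_letter` are hypothetical names, and the single attempt per signal is an assumption consistent with "does not retry indefinitely". In a real job `write_score` would wrap the Opik feedback-score call.

```python
from typing import Callable, Dict, Iterable, Tuple

# One signal per (trace, signal) pair, e.g.
# {"id": trace_id, "name": "user_signal_thumbs", "value": -1.0}
Signal = Dict[str, object]


def sweep_signals(
    signals: Iterable[Signal],
    write_score: Callable[[Signal], None],
    dead_letter: Callable[[Signal, str], None],
) -> Tuple[int, int]:
    """Write each signal once; send failures to the dead-letter sink.

    Returns (written, dead_lettered) so the caller can emit a metric
    such as eval.correlation.deadletter.count.
    """
    written = 0
    dead = 0
    for signal in signals:
        try:
            write_score(signal)
            written += 1
        except Exception as exc:  # Opik down, malformed score, trace not found
            dead_letter(signal, repr(exc))
            dead += 1
    return written, dead
```

Returning the two counts keeps the loss visible: recalibration cycles can tolerate some missing signals, but the rate should be observable.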
What changes operationally:
| Before PF-9 | After PF-9 |
|---|---|
| Judge scores in Opik (queryable per trace via feedback_scores) | Same |
| Thumbs in DynamoDB (queryable per user); not joined to traces | Thumbs additionally written as feedback_scores.user_signal_thumbs on the Opik trace; DynamoDB stays the source of record |
| To answer “did users agree with the response_quality judge on shopping queries last week?”, write an ad-hoc script joining the two stores (this is the existing scripts/correlate_feedback_traces.py pattern) | One PS-5 reconstruct(response_id, purpose="feedback") call, or a direct Opik filter feedback_scores.user_signal_thumbs < 0 AND feedback_scores.response_quality > 0.8, returns everything joined |
| Inverted-judge detection only during quarterly JADE rounds | Continuous via FR-10’s periodic check reading feedback_scores + judge_scores from Opik |
Implementation is tracked under PLT-604.
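A continuous inverted-judge check over joined records could look like the sketch below. It is one possible reading of "negative correlation", not FR-10's definition: it assumes records shaped like the `reconstruct()` result in §5.10, uses a plain Pearson correlation between judge score and thumbs value, and the function names are hypothetical.

```python
import math
from typing import Iterable, Mapping, Optional


def judge_thumbs_correlation(records: Iterable[Mapping],
                             judge_id: str) -> Optional[float]:
    """Pearson correlation between one judge's score and user thumbs."""
    pairs = []
    for record in records:
        thumbs = record.get("feedback_scores", {}).get("user_signal_thumbs")
        score = next(
            (j["score"] for j in record.get("judge_scores", [])
             if j["judge_id"] == judge_id),
            None,
        )
        if thumbs is not None and score is not None:
            pairs.append((float(score), float(thumbs)))
    if len(pairs) < 2:
        return None  # not enough joined turns to say anything
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def is_inverted(records: Iterable[Mapping], judge_id: str) -> bool:
    """Flag a judge whose scores move against user thumbs."""
    r = judge_thumbs_correlation(records, judge_id)
    return r is not None and r < 0
```

A production check would likely add a minimum sample size and a significance margin before flagging a judge; both are omitted here.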
5.5 Multi-vertical extension pattern
A new vertical inherits the eval substrate in seven steps:
- Bootstrap Opik project (per PC-5’s operating model; vertical owns the project)
- Scaffold via PF-5 (generates the vertical’s eval directory with stubs)
- Declare categories (vertical adds to the manifest; reuse-vs-propose-update rule per Decision 17; reviewer-ownership for proposed updates per Decision 18)
- Select judges (safety-refusal judges inherited automatically; quality judges opt-in per category; missing-judge workflow surfaced as follow-up)
- Calibrate thresholds (new verticals start with `provisional_seed` per FR-3 and FR-4; JADE Round 1 is scheduled after 200+ production traces accumulate per §5.3)
- Wire correlation (sub-agent ID registration in PC-2’s dispatch is the only vertical-side action; PS-5’s `set_context()` and PF-9’s correlation pipeline handle the rest)
- Run gates (PC-5’s three-gate pipeline runs against the vertical’s manifest and judge selection)
See §5.10 for the end-to-end worked example with concrete artifacts (receipts vertical).
5.6 Eval-manifest discovery surface
Today, configs/evaluation_manifest.yaml is hand-maintained. As capability declarations move to XML prompt components per PC-6, the manifest’s category list becomes derivable from the XML at consumer-agent startup or at CI time.
The discovery surface:
- Reads XML prompt components from `prompts/components/` (the directory PC-6 owns).
- Extracts the declared capability per component.
- Generates the manifest category list and the per-category default judge selection.
- Verticals may override the default selection via a manifest extension file.
PF-9 prescribes CI generation with the resolved manifest committed to the repo (over consumer-agent startup-time discovery), so the eval surface is reproducible without runtime dependencies. This builds on PC-6’s commitment to XML prompt components as the canonical capability declaration surface.
The regeneration job MUST run on every PR that touches prompts/components/ or any specs/verticals/*/eval/manifest-extension.yaml file, and additionally on a daily cron for drift detection. The job opens an auto-PR with the regenerated configs/evaluation_manifest.yaml when the resolved manifest changes; the auto-PR fails CI if the diff conflicts with a hand-edit so reviewers can reconcile.
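The discovery steps above can be sketched with the standard-library XML parser. The XML shape is an assumption: PC-6 owns the component format, and the `capability` attribute on the root element, the default judge list, and both function names below are hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List, Mapping

# Hypothetical default selection; the real defaults come from the registry.
DEFAULT_QUALITY_JUDGES = ["response_quality", "ux_quality"]


def discover_categories(component_xml: Iterable[str]) -> List[str]:
    """Extract the declared capability from each component document."""
    capabilities = set()
    for text in component_xml:
        root = ET.fromstring(text)
        capability = root.get("capability")  # assumed attribute name
        if capability:
            capabilities.add(capability)
    # Sorted so the committed manifest is stable across CI runs.
    return sorted(capabilities)


def resolve_manifest(
    categories: Iterable[str],
    overrides: Mapping[str, List[str]],
) -> Dict[str, List[str]]:
    """Apply a vertical's manifest-extension overrides to the defaults."""
    return {
        category: list(overrides.get(category, DEFAULT_QUALITY_JUDGES))
        for category in categories
    }
```

Sorting the output is what makes the "fail on diff" behavior of the regeneration job meaningful: an unchanged set of components always yields a byte-identical category list.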
5.7 Automated prompt optimization loop
PLT-627’s APO loop consumes eval outcomes and proposes Agent Definition variant revisions:
- After each eval run, the loop reads judge scores for the categories where the vertical underperforms (below the threshold but above a configurable floor).
- The loop calls a meta-LLM with the current Agent Definition (prompt, tool list, tuning), the failing eval items, and the judge rubrics. The meta-LLM proposes variant revisions. V1 scope is prompt-fork variants only; future iterations may propose tool or tuning changes within the same loop shape.
- Candidate variants are written as draft entries to PC-6’s experiment-gated rollout path. They never auto-merge.
- A human reviewer evaluates the candidate via PC-6’s normal review.
PF-9 specifies the loop shape; PC-6 owns the rollout path; implementation is tracked under PLT-627.
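The first step of the loop, picking the categories worth optimizing, reduces to a band filter. The sketch below is illustrative: `apo_target_categories` is a hypothetical name, and treating both bounds as exclusive is an assumption.

```python
from typing import Dict, List


def apo_target_categories(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    floor: float,
) -> List[str]:
    """Categories scoring below their threshold but above the floor."""
    return sorted(
        category
        for category, score in scores.items()
        if floor < score < thresholds[category]
    )
```

Categories at or below the floor are left out on purpose: a score that low suggests a problem a prompt variant is unlikely to fix, so it goes to humans rather than the meta-LLM.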
Sequencing. PLT-627 is an APO spike, not a production rollout. The spike can prototype the loop shape against the existing eval surface independently. The production APO path (FR-9’s “candidates MUST surface through PC-6’s experiment-gated rollout”) gates on PC-6 being live; if the spike lands before PC-6 ships, candidate output stays in a holding directory rather than entering an experiment registry that does not yet exist.
5.8 PF-9 to PC-5 partnership
PC-5 and PF-9 share the eval surface. The partnership is:
| Surface | PC-5 owns | PF-9 owns |
|---|---|---|
| Rule file schema | The YAML field definitions and types | The semantics of classification, baseline_source, recalibration_due, calibration_ref |
| Eval-config API | The HTTP endpoint and request/response shape | The semantic content the API returns (resolved manifest, judge registry view) |
| Three-gate pipeline | Pipeline mechanics, trace set composition, enforcement | Which judges fire at which gate (derived indirectly from classification taxonomy, not written as a direct value) |
| Manifest format | YAML structure | Category-to-judge mapping content and discovery rules |
| Per-judge applicability filter | The filter field syntax (field, operator, value) | The policy that decides which judges apply to which sub-agent archetype |
When PC-5 and PF-9 disagree on a surface, the rule is: PC-5 owns the shape, PF-9 owns the content.
A note on the third row’s indirection. PF-9 does not directly write “which judges fire at which gate” as a value. PF-9 writes the judge’s classification field (safety_refusal or quality) in the rule file; PC-5’s enforcement defaults then read classification to decide gate behavior (safety_refusal blocks at every milestone; quality warns at pre_merge and blocks at pre_ramp and pre_full). The “which judges fire at which gate” content is derived from classification through PC-5’s enforcement defaults, not authored as a separate value.
A concrete example of the shape-vs-content split. When a vertical calls GET /eval-config?vertical=receipts&category=receipt_scan, the response shape (JSON structure, field names, ordering) is owned by PC-5. The actual values in the response (which judges are listed, what thresholds they have) are owned by PF-9. PC-5 standardizes the envelope; PF-9 fills the contents. A rule file like receipt_quality.yaml carries classification: quality and threshold: 0.65. The field name classification and the slot it occupies in the YAML schema are PC-5. The legal value quality (and the rule that 0.65 was derived per PF-9’s calibration methodology) are PF-9.
PC-5 schema coordination note. PF-9 requires every rule file to declare a classification field (FR-2) and baseline_source / recalibration_due / calibration_ref fields (FR-3). PC-5’s rule-file schema (§5.2) already carries baseline_source, recalibration_due, and calibration_ref per its FR-3. The classification field is a PF-9 addition that requires a small PC-5 schema update to enumerate it as a top-level rule-file field. Tracked as a coordination follow-up between PF-9 and PC-5 owners; not a re-scope of either spec.
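A content-level lint for the fields FR-2 and FR-3 require might look like the sketch below. It operates on an already-parsed rule file (a plain mapping) so it stays independent of PC-5's YAML schema; the function name and error strings are hypothetical, and the legal values are the ones named in §5.2 and §5.3.

```python
from typing import List, Mapping

LEGAL_CLASSIFICATIONS = {"safety_refusal", "quality"}
LEGAL_BASELINE_SOURCES = {
    "jade_calibration",
    "production_distribution",
    "provisional_seed",
}
REQUIRED_FIELDS = (
    "classification",
    "baseline_source",
    "recalibration_due",
    "calibration_ref",
)


def lint_rule_file(rule: Mapping[str, object]) -> List[str]:
    """Return a list of content errors; an empty list means the file passes."""
    errors = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rule
    ]
    classification = rule.get("classification")
    if classification is not None and classification not in LEGAL_CLASSIFICATIONS:
        errors.append(f"illegal classification: {classification}")
    source = rule.get("baseline_source")
    if source is not None and source not in LEGAL_BASELINE_SOURCES:
        errors.append(f"illegal baseline_source: {source}")
    return errors
```

This keeps the split visible in code: the field names are the shape PC-5 owns, while the sets of legal values are the content PF-9 owns.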
5.9 PF-9 to PC-6 partnership
PC-6 owns prompt rollout and XML component authoring. PF-9 reads from PC-6:
- The XML prompt component path is the source of capability declarations. PF-9’s manifest discovery reads from this path.
- PC-6’s experiment-gated rollout is the surface APO writes candidate Agent Definition variants to.
PF-9 does not write to PC-6’s surfaces directly. APO candidates flow through PC-6’s normal review.
5.10 Worked example: receipts vertical onboarding
A concrete walkthrough. Receipts is the next vertical landing under PF-9 conventions. Receipts has a chat sub-agent (receipts) handling receipt-scan confirmations and dispute flows. PF-5 (Vertical Scaffolding + Validation) generates the scaffolding; the receipts team fills in domain values.
Step 1: Bootstrap the vertical’s Opik project. Per PC-5 FR-7 (vertical Opik project ownership model), the vertical owns the project and PC-5 does not write into it. Receipts owner creates the project in the Opik console; the API token lands in consumer-agent’s secrets manager.

    $ opik projects create receipts-eval --org fetch-rewards
    # Output: project_id=opk_receipts_a4f2, api_key=opk_***
    $ aws secretsmanager create-secret \
        --name consumer-agent/eval/receipts/opik-token \
        --secret-string "opk_***"

Step 2: Scaffold the vertical’s eval directory. Run the PF-5 scaffold:

    $ labs scaffold vertical receipts --feature eval
    # Generated:
    #   specs/verticals/receipts/eval/judge-selection.yaml
    #   specs/verticals/receipts/eval/calibration-plan.md

Step 3: Declare categories in the generated manifest-extension.yaml:

    vertical: receipts
    categories:
      receipt_scan_confirmation:
        description: "User confirms a scanned receipt matches expectation"
        inherits_from: general_qa
      receipt_dispute_flow:
        description: "User disputes a scanned receipt or its line items"
        inherits_from: general_qa

The two new categories are receipts-specific; general_qa is reused from the Rewards Assistant manifest by reference.
Step 4: Select judges in judge-selection.yaml. Safety-refusal judges (jailbreaking, safety_restricted, sensitive_topics) are inherited automatically per FR-2 and do not appear in the file. Quality judges are opt-in per category:

    vertical: receipts
    categories:
      receipt_scan_confirmation:
        quality_judges:
          - response_quality
          - tool_compliance
          - ux_quality
          - fetch_legal
      receipt_dispute_flow:
        quality_judges:
          - response_quality
          - tool_compliance
          - ux_quality
          - fetch_legal

Step 5: Document the calibration plan. Receipts has no production traces yet, so every quality judge starts with baseline_source: provisional_seed. The CI linter (FR-3) enforces recalibration_due within 90 days:

    # A receipts-specific judge rule file produced by the scaffold,
    # committed to consumer-agent/configs/rules/receipts/response_quality.yaml
    classification: quality
    baseline_source: provisional_seed
    calibration_ref: PLT-XXXX-receipts-bootstrap
    recalibration_due: 2026-08-22   # 90 days from receipts go-live
    threshold: 0.55                 # mean - 2σ of bootstrap sample
    filter:
      field: sub_agent_id
      operator: equals
      value: receipts

JADE Round 1 is scheduled in calibration-plan.md as a follow-up once receipts accumulates 200+ production traces.
Step 6: Wire correlation (no vertical code). Once the receipts sub-agent is registered in PC-2’s dispatch table, PS-5’s set_context() attaches Tier-2 keys on every receipts trace automatically. PF-9’s correlation pipeline (PLT-604) calls Opik’s log_traces_feedback_scores() when signal events arrive. Receipts owner writes nothing here. A receipts trace after a thumbs-down event:

    # Result of: ps5.reconstruct("resp_recpt_8c1f", purpose="feedback")
    {
        "response_id": "resp_recpt_8c1f",   # Tier-1, auto-attached
        "user_id": "u_***",                 # Tier-1
        "vertical": "receipts",             # Tier-2, set_context()
        "sub_agent": "receipts",            # Tier-2
        "prompt_version": "receipts-v0.1",  # Tier-2
        "feedback_scores": {                # Opik native; written by PF-9 via log_traces_feedback_scores()
            "user_signal_thumbs": -1.0,
            "user_signal_reroll": 0.0
        },
        "judge_scores": [
            {"judge_id": "response_quality", "score": 0.41},
            {"judge_id": "tool_compliance", "score": 1.00},
            {"judge_id": "ux_quality", "score": 0.62},
            {"judge_id": "fetch_legal", "score": 0.95}
        ],
        "category": "receipt_scan_confirmation",
        "timestamp": "2026-05-26T12:00:00Z"
    }

Step 7: Run gates. Next CI run reads the receipts manifest extension, picks up the four quality judges plus the inherited safety-refusal judges, and gates the receipts sub-agent’s promotion through pre_merge (warn-level for quality) / pre_ramp (block for quality) / pre_full (block for everything) per PC-5. Receipts inherits PC-5’s pipeline mechanics; PF-9 provided only the values that fill PC-5’s schema slots.
What receipts wrote vs. inherited. Receipts owner authored: one Opik project, one manifest-extension file (10 lines), one judge-selection file (12 lines), one calibration-plan markdown (deadline + JADE schedule). Inherited automatically: safety-refusal judges, correlation pipeline (writes via Opik’s feedback_scores API), three-gate CI mechanics, PF-8 Grafana panels, PS-5 trace store + set_context() Tier-2 attachment + reconstruct() query API, PC-5 eval-config API. No code changes to consumer-agent, no new pipeline, no new dashboard work.
6. Cross-Section Impact
| Spec | Citation |
|---|---|
| PC-1 | PF-9 FR-5 references per-sub-agent applicability policy; PC-1 declares the sub-agent archetype boundary PF-9 maps against. |
| PC-5 | PF-9 produces content (classification, calibration provenance, methodology) for PC-5’s YAML schema fields and eval-config API surface. Detailed partnership in PF-9 §5.8. |
| PC-6 | PF-9 reads capability declarations from PC-6’s XML prompt component path. PF-9’s APO loop writes Agent Definition variant candidates to PC-6’s experiment-gated rollout. Detailed partnership in PF-9 §5.9. |
| PS-5 | PF-9’s user-signal correlation pipeline reads joined trace + SSE views via PS-5’s reconstruct(response_id, purpose) query API. PF-9 relies on PS-5’s set_context() to attach Tier-1/Tier-2 metadata (vertical, sub_agent, prompt_version, etc.) to traces at sub-agent factory init. User-signal writes go to Opik’s native feedback_scores API directly, not through ps5.tag(); PF-9 does NOT contribute Tier-3 vocabulary keys to ps5-registry.json. PS-5’s out_of_scope names the feedback correlation pipeline as a consumer pattern PS-5 serves; PF-9 formalizes that consumer pattern via the read path. (Per PS-5 FR-3 + R-9, the prompt_version value-format contract belongs to PC-6; PF-9 §5.4 proposes a PC-6-side evolution of the value semantics without requiring any PS-5 spec change.) |
| PF-4 | PF-4’s audit query path reads from PS-5, which holds PF-9’s correlation records. PF-9 does not own the audit surface. |
| PF-5 | PF-5 scaffolds vertical eval directories per PF-9’s multi-vertical extension pattern (§5.5). PF-9 supplies the contract shape; PF-5 generates the stubs. |
| PF-8 | PF-8 defines the per-vertical Grafana panels that visualize PF-9’s judge categories. PF-9 supplies the taxonomy; PF-8 supplies the visualization conventions. |
7. Dependencies
7.1 Spec Dependencies
| Spec ID | What we need from it | Why |
|---|---|---|
| PC-1-agent-composition | Sub-agent archetype definitions | PF-9 FR-5 maps judges to archetypes. PC-1 declares what archetypes exist. |
| PC-5-agent-cicd | Rule file schema, eval-config API surface, three-gate pipeline | PF-9 produces values for the fields PC-5’s schema defines; PF-9 needs PC-5 stable to compose against. |
| PC-6-prompt-cicd | XML prompt component path, experiment-gated rollout | PF-9’s manifest discovery reads from PC-6; PF-9’s APO loop writes through PC-6’s review path. |
| PS-5-trace-event-store | Joined trace + SSE store, Tier-1/Tier-2 metadata attachment via set_context(), reconstruct(response_id, purpose) query API | PF-9’s correlation pipeline reads joined views via reconstruct(); relies on Tier-2 metadata (vertical, sub_agent, prompt_version) being on every trace; recalibration cycles and inverted-judge detection read via PS-5’s query API. PF-9 writes user signals to Opik directly via feedback_scores, not through PS-5’s ps5.tag(). |
7.2 External Dependencies
| Dependency | Owner | Status | Blocker? |
|---|---|---|---|
| Opik SDK and Opik SaaS | Comet ML (vendor) | Live | No |
| DynamoDB signal-event store | Platform team | Live (existing thumbs table) | No |
| PS-5 trace+SSE store, set_context() Tier-1/Tier-2 metadata, reconstruct() query | Frank Luo | See PLT-689 | No; PF-9 reads via PS-5 and writes user signals to Opik’s native feedback_scores API directly. |
| JADE calibration tooling | Eval team (Prakash) | Live (Round 1 done) | No |
8. Risks & Open Questions
8.1 Risks
R-1: Scope creep into PC-5. The boundary between PC-5’s schema and PF-9’s semantics is conceptually clean but operationally fuzzy. A PR that adds a new rule-file field could land in either spec. Mitigation: §5.8 declares the shape-vs-content rule; PR reviews enforce it.
R-2: Methodology overhead. FR-3’s calibration source discipline imposes recurring work (JADE rounds quarterly, production-distribution re-derivation every 180 days by default per §5.3). Verticals may resist. Mitigation: PF-9 SHOULD ship JADE tooling that reduces a calibration round to under 4 hours of wall-clock work (NFR-3).
R-3: Inverted-judge drift. Ten of twelve judges came back inverted in JADE Round 1. The current registry has had judge fixes since, but FR-10’s periodic inversion check could surface more inversions and force re-prompt or removal cycles. Mitigation: FR-10 requires the inversion check; rule-file changes go through normal review.
R-4: Opik feedback_scores schema drift across producers. PF-9’s correlation pipeline (FR-6) writes user signals via Opik’s native log_traces_feedback_scores() API. The same Opik surface is consumed by judge runs (which also write feedback scores for response_quality, tool_compliance, etc.) and by scripts/correlate_feedback_traces.py. The risk is name-collision: if a vertical’s judge happens to overlap a user-signal name or a future producer chooses overlapping names, the per-trace feedback_scores set becomes ambiguous. Mitigation: PF-9 reserves the user_signal_* prefix for user-signal use (V1: user_signal_thumbs, user_signal_reroll; V2 adds user_signal_annotation_severity per Decision 19). Reserving a prefix rather than a named list means future user-signal additions are namespaced automatically without a PF-9 amendment. The FR-2 rule-file linter MUST reject rule files declaring any id: matching the user_signal_* prefix, with a clear error pointing at this reservation (AC-15 is the acceptance criterion). Existing in-flight uses of bare thumbs and reroll are deprecated; migration path: write under the new prefix, mirror to the old names for one release cycle, drop the bare names.
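The prefix reservation in R-4 is simple to enforce in a linter. The sketch below is a minimal illustration with hypothetical names, not the FR-2 linter itself; the error text is invented for the example.

```python
RESERVED_PREFIX = "user_signal_"


def lint_judge_id(judge_id: str) -> None:
    """Reject judge ids that collide with the reserved user-signal namespace."""
    if judge_id.startswith(RESERVED_PREFIX):
        raise ValueError(
            f"judge id {judge_id!r} uses the reserved {RESERVED_PREFIX}* "
            "prefix, which is held for user-signal feedback scores "
            "(see R-4)"
        )
```

Checking the prefix rather than a fixed list of names means a later signal such as `user_signal_annotation_severity` is protected without touching the linter.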
R-5: APO over-reliance. PLT-627’s APO loop could produce a flood of variant candidates that bury human reviewers. Mitigation: FR-9 requires candidates flow through PC-6’s normal review; no auto-merge. APO is an assistant, not an authority.
8.2 Open Questions
None open at this time. Decisions 18 and 19 closed the prior OQ-1 (reviewer policy for category updates) and OQ-2 (signal-storage unification) respectively.
9. Testing Strategy
- Unit tests: Judge registry lookup (`tests/evaluation/test_registry.py`), classification linter (`tests/evaluation/test_classification_lint.py`), calibration-source linter (`tests/evaluation/test_calibration_lint.py`), manifest discovery (`tests/evaluation/test_manifest_discovery.py`).
- Integration tests: Correlation pipeline end-to-end against Opik mocks and DynamoDB moto fixtures (`tests/evaluation/integration/test_correlation_pipeline.py`).
- Opik experiments: Per-FR experiment with stable Opik experiment IDs. NFR-5 requires every solution-design section to have either a test or an experiment.
- JADE round regression: Replay JADE Round 1 data through the inverted-judge detector (FR-10); AC-6 is the acceptance criterion.
- Multi-vertical isolation tests: Two-vertical scenario where one vertical’s calibration cycle runs and the other’s thresholds are asserted unchanged (AC-8).
- Manual validation: First-vertical onboarding dry-run against a sample vertical using the PF-5 scaffold; AC-5 is the criterion.
10. Rollout & Observability
10.1 Rollout Plan
PF-9 rolls out in five phases. None of the phases break the existing Rewards Assistant eval flow (NFR-4).
Phase 1: Substrate documentation in-place. The current Rewards Assistant judge registry, classification, and calibration are documented per PF-9 against the existing system. Rule files are updated to include the new fields (classification, baseline_source, recalibration_due, calibration_ref) by reading the JADE Round 1 report. No behavior change. Linters added as warnings only.
Phase 2: Correlation pipeline online (PLT-604). The correlation pipeline lands, joining the existing thumbs events to judge scores and writing user-signal feedback to Opik via log_traces_feedback_scores() per Decision 14; correlation records are the joined view PS-5’s reconstruct() returns, not a separately written object. Initially read-only consumers (no recalibration cycle reads from it yet); validates the pipeline shape.
Phase 3: Recalibration cycle consumes correlation. Next JADE round (Round 2) consumes correlation records as input alongside the human-annotated traces. Inverted-judge detection (FR-10) runs against the historical data.
Phase 4: APO loop pilot (PLT-627). APO runs against a single vertical (Rewards Assistant) for a single category. Candidates flow through PC-6’s review path; no auto-merge. After 60 days, evaluate signal-to-noise.
Phase 5: Multi-vertical onboarding. First non-Rewards-Assistant vertical inherits the substrate via PF-5 scaffold. This is the validation of the multi-vertical extension pattern (AC-5).
10.2 Observability metrics
- `eval.judge_registry.lookup.latency_ms` (histogram): bound to NFR-2.
- `eval.correlation.pipeline.lag_seconds` (histogram): bound to NFR-1.
- `eval.correlation.records_written` (counter, sliced by vertical / sub_agent / category): rate visible per vertical.
- `eval.recalibration.cycle.duration_seconds` (histogram, per vertical): bound to NFR-3.
- `eval.calibration.recalibration_overdue` (gauge, per judge): alerts when `recalibration_due` is past.
- `eval.inverted_judge.count` (gauge, per vertical): alerts when FR-10’s check surfaces an inversion.
PF-8 specifies the Grafana panel conventions that visualize these.
10.3 Rollback Plan
PF-9’s primary risk during rollout is the correlation pipeline writing malformed records to PS-5 or the manifest discovery surface generating an incorrect manifest. Both have rollback paths:
- Correlation pipeline: feature-flag the pipeline writer (`ai_assistant_eval_correlation_pipeline_enabled`, following PF-8 naming convention); off by default during Phase 2. Phase 3 onwards reads correlation records, so rollback to read-only-from-historical-traces is the fallback.
- Manifest discovery: the CI-generated manifest is committed to the repo, so rollback is reverting the commit. The hand-maintained manifest pattern remains valid as fallback indefinitely.
No data migration is required. Rule file additions are additive; pre-existing fields are preserved.
11. Appendix
11.1 Source references
- PR #267 (PC-1 Agent Composition): https://github.com/fetch-rewards/consumer-agent/pull/267
- PR #291 (PC-5 Agent CI/CD): https://github.com/fetch-rewards/consumer-agent/pull/291
- PR #290 (PC-6 Prompt CI/CD): https://github.com/fetch-rewards/consumer-agent/pull/290
- PR #292 (PF-1 Sub-Agent Lifecycle): https://github.com/fetch-rewards/consumer-agent/pull/292
- PS-5 (Trace + Event Store, no PR yet): https://fetchrewards.atlassian.net/browse/PLT-689
- PF-4 (Security & Auditability, no PR yet): https://fetchrewards.atlassian.net/browse/PLT-693
- PF-5 (Vertical Scaffolding + Validation Tools, no PR yet): https://fetchrewards.atlassian.net/browse/PLT-694
- PF-8 (Feature Flag + Cross-Vertical Observability, no PR yet): https://fetchrewards.atlassian.net/browse/PLT-753
- Platform Spec Lab (Confluence 6452412451): https://fetchrewards.atlassian.net/wiki/spaces/Pilot/pages/6452412451
- JADE Round 1 report (Confluence 6331367452): https://fetchrewards.atlassian.net/wiki/spaces/Pilot/pages/6331367452
- Prior source material: PR #227 (closed; branch `feat/007-eval-system-spec` retained as reference): https://github.com/fetch-rewards/consumer-agent/pull/227
- PLT-770 (this spec’s tracking ticket): https://fetchrewards.atlassian.net/browse/PLT-770
- PLT-604 (user-signal correlation pipeline): https://fetchrewards.atlassian.net/browse/PLT-604
- PLT-627 (APO spike): https://fetchrewards.atlassian.net/browse/PLT-627
- PLT-596 (JADE Round 1 task): https://fetchrewards.atlassian.net/browse/PLT-596
11.2 Decisions resolved during design
| # | Decision | Resolution |
|---|---|---|
| 1 | Spec scope: conventions only vs. conventions + implementation | Both. PF-9 names the substrate (conventions) AND owns the design of new pipeline work (PLT-604, PLT-627). Closest peer is PF-1’s substrate-plus-implementation framing. |
| 2 | PF-9 vs. PC-5 boundary | Shape-vs-content rule (§5.8): PC-5 owns YAML/API shape, PF-9 owns the semantics, taxonomy, and methodology that produce values. |
| 3 | Trace and signal storage ownership | PS-5 owns the store. PF-9’s correlation pipeline writes via PS-5’s per-team metadata mechanism (FR-6); PF-9 does not maintain a parallel store. |
| 4 | Grafana panel ownership | PF-8 owns panel definitions. PF-9 supplies the taxonomy PF-8 visualizes. |
| 5 | Wave placement | Wave 5, alongside PF-1, PC-5, PC-6, PF-8. PF-9 is a vertical-velocity unblocker, not a vertical-existence gate. New verticals can technically run without PF-9 (no eval gates); they cannot be trustworthy without it. |
| 6 | Multi-vertical extension as PF-5 scaffold target | PF-9 supplies the contract shape (§5.5 seven-step pattern); PF-5 generates the stubs. PF-5 reviewer alignment captured in §6 Cross-Section Impact. |
| 7 | APO auto-merge | No. FR-9 requires APO candidates flow through PC-6’s normal review path; APO is an assistant, not an authority. |
| 8 | Manifest discovery cadence | CI generation with the resolved manifest committed to the repo, over consumer-agent startup-time discovery. Trade-off: startup discovery adds boot latency and creates a runtime dependency on the XML component path being readable; CI generation requires a small follow-up PR when capabilities change but keeps the eval surface reproducible without runtime dependencies. |
| 9 | blocks: [] despite Wave 5 placement | PF-9 functionally blocks vertical trustworthiness, not vertical existence. A new vertical can technically run on PC-2 dispatch without PF-9’s substrate (no eval gates, no calibration discipline). The vertical will be merge-able but un-quality-gated. PF-9 chooses blocks: [] to keep the DAG honest about runtime dependencies. PF-5 (vertical scaffolding) is a soft consumer of PF-9’s extension contract; the relationship is declared in §6 Cross-Section Impact rather than as a hard DAG edge so PF-5 can ship its scaffold templates independently. Verticals are downstream consumers, not platform-spec successors; DAG blocks is reserved for platform-spec sequencing only. |
| 10 | Per-vertical judge override of classification | No. Classification is platform-wide; verticals may not redefine a judge from quality to safety-refusal or vice versa. Verticals MAY pin a quality judge to block at pre_merge to be stricter than the default. Per-vertical classification would create cross-vertical comparability problems. |
| 11 | Multi-vertical calibration sharing | Each vertical runs its own JADE round against its own production traces. Shared judge IDs converge in threshold over time; PF-9 reports cross-vertical threshold divergence as a quality signal (large divergence implies the judge is sensitive to vertical context, which is useful information). |
| 12 | Reroll signal weighting | Carried as a separate dimension in the correlation record, not folded into the thumbs field. Recalibration cycles decide weighting per vertical based on observed correlation signal. Avoids baking a fixed weight into the substrate before per-vertical evidence accumulates. |
| 13 | Resolved vs. open question placement | Resolved positions live in this Decisions table; only genuinely open questions remain in §8.2. Reviewers who disagree with a resolved position push back via PR comments on the relevant Decisions row. |
| 14 | User-signal attachment surface (Opik feedback_scores vs PS-5 ps5.tag) | Opik’s native feedback_scores API. Considered alternative: register thumbs and reroll as Tier-3 metadata keys in ps5-registry.json and write via ps5.tag(). Chose Opik native because: (a) consumer-agent already consumes feedback_scores in nine code locations (data/traces.py, data/mapper.py, data/export.py, data/compare.py, evaluation/judge_validation.py, cli/opik/*); (b) feedback_scores is purpose-built for typed user-feedback values (separate from arbitrary metadata) and supports CLI filters like feedback_scores.user_signal_thumbs < 0 natively; (c) the existing scripts/correlate_feedback_traces.py is the working prototype, already on this surface; (d) Tier-3 is for arbitrary metadata, feedback values have their own Opik surface. PS-5 stays in PF-9’s depends_on for the read path (reconstruct()) and for Tier-1/2 metadata; only the write path goes direct to Opik. |
| 15 | Trace-side variant slicing vs Snowflake assignment events | Snowflake assignment events (the ASSIGNMENTS_FLAG_USER_HOLDOUTS table fed by the experiment-assignment-logs Kafka topic) are the authoritative source for “which variant did user X get for experiment Y.” Eppo already reads this. PF-9’s correlation pipeline does NOT propose adding resolved_variant, experiment_arm, or agent_definition_version to PS-5 Tier-2; doing so would duplicate the authoritative Snowflake path. Considered Tier-2 additions and rejected as overengineering: experiment analyses run on Eppo or Snowflake; ad-hoc Opik exploration is a narrow case that joins back to Snowflake when needed. |
| 16 | prompt_version value-format under PC-1 prompt-blocks composition | The Tier-2 prompt_version slot stays on PS-5 unchanged. PS-5 FR-3 + R-9 explicitly delegate the value-format contract to PC-6; PS-5 accepts whatever string PC-6’s prompt manager hands over. Under PC-1’s prompt-blocks composition, a single prompt-commit identifier is incomplete. PF-9 proposes that PC-6 evolve the value it hands PS-5 to a content-addressed SHA256 of the assembled static prompt; the field name and PS-5 spec stay as-is. Coordination is PC-6-internal, not a PS-5 spec change. Considered an alternative (“rename the field on PS-5”) and rejected because PS-5’s existing R-9 already accommodates value-format evolution without spec churn. |
| 17 | Category and judge reuse across verticals | Categories and judges are platform-shared assets, not per-vertical. When a vertical wants a category that does not exist, they declare it. When a vertical wants a category that already exists with different judges, they MUST NOT re-declare with a different judge selection; they propose an update to the existing category through normal PR review. The CI linter rejects conflicting re-declarations with an error pointing at the existing category and the update workflow. This avoids both silent merging (verticals get judges they didn’t sign up for) and verbose namespacing (receipts:general_qa clutters the registry); follows the cite-don’t-redefine discipline applied across PF-9. Required-reviewer policy for proposed updates is Decision 18. |
| 18 | Reviewer policy for proposed updates to an existing category’s judge selection | GitHub’s normal review surface with structural enforcement, not platform-team-only and not declaring-vertical-only. Three required pieces: (a) CODEOWNERS makes the platform team a required reviewer on configs/evaluation_manifest.yaml and any cross-vertical category file; (b) a small CI script walks every manifest-extension.yaml in the repo and auto-requests reviews from each currently-consuming vertical’s owner; (c) the PR template asks the author to explain backward compatibility for current consumers. Reasoning: categories are platform-shared (Decision 17), so declaring-vertical-only violates that property; platform-team-only is too restrictive because the declaring vertical and any currently-consuming vertical have operational context the platform team doesn’t. CODEOWNERS plus CI consumer-discovery turns category-update from “hope the right people see it” into structural enforcement. |
| 19 | Signal-storage unification posture | Partial unification for V1; commit to full unification as V2 with a named trigger. V1 (current implicit choice): thumbs and reroll go to Opik feedback_scores (Decision 14); LLM judge scores already do; human annotations stay in the Opik annotation queue; DynamoDB stays as source-of-record for thumbs. V2 (full unification): all three signal classes land on feedback_scores, DynamoDB becomes a durable inbox that the batch job mirrors into Opik, annotation-queue entries migrate to feedback_scores. V2 trigger fires when any of the following holds: (1) a vertical asks for annotation-queue contents in its per-vertical dashboard; (2) annotation-queue write volume crosses a threshold making migration cheap relative to ongoing dual-write maintenance; (3) inverted-judge detection needs annotation signal at the same cadence as thumbs/reroll. Reasoning: read-time-join (option c) punts join complexity to every consumer and won’t scale across verticals; full unification is the right architectural endpoint but the annotation-queue migration has limited near-term payoff because annotations are read by JADE rounds at calibration time (quarterly), not by per-turn analysis. The named-trigger pattern is symmetric with PS-3 OQ-6 (Frank’s PR #305) for cross-spec consistency. R-4’s reservation list grows when V2 ships to cover annotation names. |
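The conflicting re-declaration check in Decision 17 could look roughly like the sketch below. It is illustrative only: the `CategoryDeclaration` shape, the `find_conflicts` helper, and the judge names in the usage note are hypothetical, and the spec does not define how the real CI linter parses manifests or words its errors. The sketch assumes each vertical's declarations have already been loaded into memory.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryDeclaration:
    """One category entry as declared by a vertical (hypothetical shape)."""

    vertical: str
    category: str
    judges: frozenset[str]


def find_conflicts(
    existing: dict[str, frozenset[str]],
    declarations: list[CategoryDeclaration],
) -> list[str]:
    """Return one error message per conflicting re-declaration.

    `existing` maps category name to the judge set already in the shared
    registry. New categories are accepted, identical re-declarations pass,
    and a re-declaration with a different judge set is reported.
    """
    known = dict(existing)  # copy so the caller's registry is not mutated
    errors: list[str] = []
    for decl in declarations:
        current = known.get(decl.category)
        if current is None:
            # Category does not exist yet: the vertical declares it.
            known[decl.category] = decl.judges
        elif current != decl.judges:
            # Same category, different judges: reject and point at the
            # update workflow rather than merging silently.
            errors.append(
                f"{decl.vertical}: category '{decl.category}' already exists "
                f"with judges {sorted(current)}; propose an update to the "
                f"existing category via PR instead of re-declaring it with "
                f"{sorted(decl.judges)}"
            )
    return errors
```

For example, if the registry already holds `general_qa` with two judges and a vertical re-declares `general_qa` with only one of them, `find_conflicts` returns a single error naming that vertical, while a brand-new category from the same vertical passes.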
11.3 Migration receipts
Phase 1 of the rollout plan reuses the existing Rewards Assistant eval substrate as the proof that PF-9’s contracts are not greenfield. Concrete starting state:
- Judge registry: roughly twelve rule files under src/consumer_agent/evaluation/rules/. Each gets updated to add classification, baseline_source, recalibration_due, calibration_ref fields by reading the JADE Round 1 findings.
- Manifest: configs/evaluation_manifest.yaml v7 captures the current category-to-judge mapping. The CI generation surface (§5.6) initially reads from this file as fallback while XML-derived manifest comes online.
- Calibration baseline: JADE Round 1 (PLT-596) is the reference calibration. Round 2 follows the same shape (§5.3 JADE calibration section), with correlation-record input added per Phase 3.
- Signal collection: thumbs and reroll events already flow into DynamoDB. Phase 2 wires the correlation pipeline against the existing stream; no new ingestion infrastructure.
The PF-9 substrate is documentation and discipline applied to a working system, not invention from scratch.
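As a minimal sketch of the additive rule-file update described in the judge registry item, the helper below adds the four named fields to an in-memory rule record without overwriting anything already present. The `add_calibration_fields` function, the dict representation of a rule file, and the example values are assumptions for illustration; the actual rule-file format and value conventions are defined elsewhere in the spec.

```python
# The four fields Phase 1 adds to each rule file (names from this section).
NEW_FIELDS = (
    "classification",
    "baseline_source",
    "recalibration_due",
    "calibration_ref",
)


def add_calibration_fields(rule: dict, values: dict) -> dict:
    """Return a copy of `rule` with the four fields present.

    Additive only: keys already in `rule` keep their existing values,
    and the input dict is never mutated.
    """
    missing = [f for f in NEW_FIELDS if f not in rule and f not in values]
    if missing:
        raise ValueError(f"no value supplied for: {', '.join(missing)}")

    updated = dict(rule)
    for field in NEW_FIELDS:
        if field not in updated:
            updated[field] = values[field]
    return updated


# Hypothetical rule record and values, for illustration only.
rule = {"id": "example_judge", "threshold": 0.8}
values = {
    "classification": "quality",
    "baseline_source": "JADE Round 1",
    "recalibration_due": "2099-01-01",  # placeholder date
    "calibration_ref": "PLT-596",
}
migrated = add_calibration_fields(rule, values)
```

Because the helper only fills absent keys, running it twice over the same record leaves it unchanged, which matches the "no data migration required" posture.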