PC6: Agent Variant CI/CD + Experiment-Gated Rollout
1. Problem Statement
The AI Assistant Platform treats Agent Definitions (prompts, model choice, tuning, tools, sub-agent composition) as code in some places and as configuration in others, and the split has been visible as drift. Today: prompt versions live in Opik (versioned, named, retrievable by commit), but the rollout mechanism — which users get which variant of which field, and how a bad variant gets rolled back — is ad hoc. New capabilities have been declared in capabilities.md (a flat file alongside the prompts), which means the runtime carries two separate mental models for “what can the agent do”: the prompt content itself (XML blocks in conversational-xml.txt) and the capabilities manifest (capabilities.md). When those drift, the user sees chips promising features the agent can’t actually deliver (PLT-644) and eval coverage misses the gap (PLT-619).
PC6 closes this gap. Three load-bearing commitments:
- Agent Definition variants are first-class deployable artifacts. A variant — any subset of Agent Definition fields overridden for a cohort (model swap, tuning change, tool addition, prompt fork) — moves through PR → eval gate → cohort-ramped rollout → full release → rollback, the same way code does. Prompt-fork variants are the v1 worked example. The eval gates live with PC5 (judge templates + thresholds); the rollout lifecycle lives with PC6.
- Capability declaration moves to XML prompt components. The XML blocks already in conversational-xml.txt (<identity>, <scope_boundaries>, <can_do>, <cannot_do>, etc.) are the canonical surface for declaring what the agent does. capabilities.md is deprecated; its content migrates into XML blocks. New verticals adding capabilities author a new XML component, not a new line in a flat file.
- Rollout is experiment-gated by default. A new variant doesn’t ship to everyone on publish — it ships behind an Eppo experiment with a cohort split, gated by a Feature Flipper kill-switch. Variant-to-fork resolution happens at factory invocation: Feature Flipper returns the user’s variant name, the experiment config maps that variant to an override map, and the runtime forks the base Agent Definition. Fast full-rollout exists as an explicit config-toggled path when an experiment isn’t warranted (e.g., copy-only changes).
Without PC6, three things break:
- Capability drift continues. capabilities.md and XML prompts evolve at different cadences. PLT-644-style symptoms (chips promising unwired capabilities) re-occur. Eval manifest gaps (PLT-619) compound — the eval system can’t know what to test if there are two sources of truth for what the agent can do.
- Prompt changes lack rollback discipline. Today a bad prompt rev affects every user until someone notices and force-pushes a fix. With cohort-ramped rollout + kill-switch, the blast radius of a bad version is bounded by the experiment’s audience.
- PC5’s eval gates have nowhere to attach. PC5 owns the gate definition (judge templates + thresholds); without PC6’s rollout lifecycle, those gates fire in a void — there’s no “between this version and the next” surface for them to gate.
Companion: Platform Spec Lab row PC6. The spec is the source of truth.
2. Capabilities Source
Per the Platform Spec Lab, PC6 owns the Agent Variant CI/CD + Experiment-Gated Rollout capability for the AI Assistant Platform’s consumer-agent runtime. The capability has three components:
- Versioning — prompt versions as Opik-versioned artifacts with commit history and named retrieval.
- Authoring path for new capabilities — XML prompt components are the surface a vertical uses to declare a new agent capability. Replaces deprecated capabilities.md.
- Rollout — experiment-gated cohort assignment via Eppo + Feature Flipper kill-switch, with a fast full-rollout escape hatch.
PC6 builds on existing infrastructure:
- Opik (prompt versioning) — already integrated; consumer-agent/src/consumer_agent/utils/opik.py plus cli/opik/prompt.py for create / get / history / list operations.
- Feature Flipper (kill-switch + ramp) — already integrated; src/consumer_agent/utils/feature_flags.py.
- XML prompt content — conversational-xml.txt already structures the prompt as tagged blocks (<identity>, <output_contract>, <core_rules>, <scope_boundaries>, <can_do>, <cannot_do>, etc.).
PC6 specifies new infrastructure:
- Eppo integration — does not exist in consumer-agent today. PC6 specifies the integration’s entry-point shape; the implementation follows the spec.
- Cohort-rollout machinery — how cohort assignment composes with factory-time prompt resolution (per PC1 §5.5 cohort discipline).
- XML prompt component lifecycle — how a new block is added (PR review, eval coverage requirement, deprecation policy).
3. Background & Context
3.1 Today’s reality
Prompt content is split across three places:
- consumer-agent/prompts/conversational-xml.txt — the XML-structured prompt with <identity>, <output_contract>, <core_rules>, <scope_boundaries>, <can_do>, <cannot_do> blocks. This is the canonical surface going forward.
- consumer-agent/prompts/conversational.txt — the legacy non-XML prompt. Slated for removal once the XML path is the only path.
- consumer-agent/prompts/capabilities.md — a flat file describing the agent’s capabilities. Deprecated per PC6 §5.4. Content migrates to XML blocks.
Prompt-block infrastructure (PC1 §5.7):
- consumer-agent/prompts/components/*.yaml — modular prompt-block files, each with type, name, feature_gated, instructions. These are referenced by ID from each Agent Definition’s prompt_blocks list.
- consumer-agent/src/consumer_agent/prompts/manager.py — block resolution and assembly logic.
- consumer-agent/src/consumer_agent/prompts/sources.py — source resolution (file-backed vs Opik-backed prompts).
Versioning (Opik):
- consumer-agent/src/consumer_agent/utils/opik.py — OpikClient extends opik.Opik with helper methods for prompt management and REST API access for automation rules (LLM-as-Judge).
- consumer-agent/src/consumer_agent/cli/opik/prompt.py — CLI commands: create (create/update from file), get (by name + optional commit), history (version history by name), list, delete.
- cli/opik/rules.py — LLM-as-Judge automation rules; PC5 owns rule definitions, PC6 hooks into rule-firing at rollout gates.
Feature flags (Feature Flipper):
- consumer-agent/src/consumer_agent/utils/feature_flags.py — Feature Flipper integration. Used today for cohort-style behavior gating (e.g., consumer_agent_xml_prompt toggles the XML vs legacy prompt path).
- Naming convention: ai_assistant_* prefix per PLT-552. PF8 codifies the convention.
Experiment platform (Eppo):
- Not currently integrated in consumer-agent. PC6 specifies the entry-point shape; the implementation follows.
The drift symptom (PLT-644 + PLT-619):
- PLT-644 — prompt suggestion chips were promising capabilities the agent couldn’t deliver. Root cause: capability declarations in capabilities.md diverged from XML prompt content; the chips read from one, the agent read from the other.
- PLT-619 — eval manifest gap. Eval coverage couldn’t enumerate “what the agent claims to do” because the source of truth was split.
PC6 closes these by collapsing the two surfaces into one (XML prompt blocks are the only declarative surface for capabilities).
3.2 What PC1 leaves to PC6
PC6 inherits from PC1 (Agent Composition):
- Prompt-block registry pattern (PC1 §5.7 / Decision 9) — file-backed registry in prompts/components/, git-versioned, reviewed via PR. PC6 extends this pattern to XML prompt components (the <block> tags inside conversational-xml.txt).
- Per-request system prompt assembly (PC1 §5.5) — system prompt assembled per turn from prompt blocks plus dynamic context. PC6 specifies how cohort-gated prompt versions slot into that assembly.
- Cohort discipline (PC1 §5.5 cache-friendly contract) — cohorts defined by Agent Definition + registry versions + feature-flag-gated block presence. PC6 adds experiment-arm membership to the cohort definition.
3.3 What PC6 defers downstream / sideways
- PC5 (Agent CI/CD Pipeline) — owns the eval-gate definition (judge templates, thresholds, eval-config API). PC6 specifies when eval gates fire in the rollout lifecycle but does not define the gates themselves.
- PF8 (Feature Flag & Cross-Vertical Observability Conventions) — owns the ai_assistant_* flag naming convention, kill-switch lifecycle, and required Grafana panels per vertical. PC6 inherits these conventions and does not redefine them.
- PF1 (Sub-Agent Lifecycle) — owns the canonical four-state lifecycle that both sub-agents and variants conform to. PC6 reuses PF1’s vocabulary for variant rollout (§5.5); the artifacts differ but the lifecycle shape is shared.
- PC7 (Mobile Renderer Contract) — owns how prompt-suggestion chips and UI-visible XML block content render on mobile. PC6 owns content change-control and rollout; PC7 owns rendering. Per-app-version targeting on FF cohorts (§5.9) defers to PC7’s app-version pinning conventions.
3.4 Vocabulary
| Term | Meaning |
|---|---|
| Variant | An Agent Definition fork: any subset of AD fields (model, tuning, tools, sub_agents, prompt_commit) overridden for a cohort. The primitive PC6 ships, rolls out, gates, and rolls back. Prompt-fork variants are the v1 worked example. |
| Experiment config | The YAML config (per PF-5) that declares a Feature Flipper experiment and the override map per variant. Lives in the sub-agent’s directory. PC6 §5.6. |
| Override map | The dict of Agent Definition fields a variant overrides on the base AD (e.g., {prompt_commit: ..., model: ..., tuning: {...}}). Resolved at factory invocation. |
| Prompt version | The PS-5 Tier-2 metadata value identifying which prompt composition was active for a turn. Computed at sub-agent factory init as the content-addressed SHA256 of the assembled static prompt (whatever PC-1’s prompt assembly mechanism produces, excluding per-turn dynamic context). Two turns with identical composed prompts share a value; any change to the assembled bytes produces a new value. Distinct from prompt_commit (which identifies one Opik artifact a prompt-fork variant overrides on the base Agent Definition). See §5.2 for the value-format contract. |
| Prompt commit | An Opik-stored prompt at a specific commit ID. The artifact a prompt-fork variant points at via the prompt_commit override map field. One input to the prompt-version computation; not itself the value PS-5 stores. |
| Prompt block | One modular YAML file in prompts/components/. Composed into a system prompt at assembly time. PC1’s primitive. |
| XML prompt component | One tagged block inside conversational-xml.txt (e.g., <identity>, <can_do>). The canonical surface for declaring an agent capability. PC6’s primitive. |
| Capability declaration | A statement of what the agent can or cannot do. Lives in XML prompt components. capabilities.md is the deprecated form. |
| Cohort | A population of users receiving the same variant assignment. Determined at factory invocation. Composes from Agent Definition + registry versions + feature flags + experiment-arm membership. |
| Experiment | A treatment / control split for a new variant. Runtime variant assignment is performed by Feature Flipper (the child test flag inside a parent pool flag, per §5.7). Eppo holds the experiment definition for offline analysis of Feature Flipper’s assignment events. |
| Kill-switch | A Feature Flipper flag that, when off, drops the rollout entirely and routes all traffic back to the previous stable variant. Independent of the experiment’s child test flag. |
| Ramp | The gradual increase in audience share for a rollout. Ramp lives in Feature Flipper’s child test flag; the parent pool flag handles eligibility (segments, exclusions). |
| Eval gate | An LLM-as-Judge automation rule (PC5) that fires at a rollout milestone. If the gate fails, the rollout halts and rolls back. |
| Fast full-rollout | The escape hatch when an experiment isn’t warranted (e.g., copy-only changes). Config-toggled; the kill-switch still applies. |
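
The relationship between a base Agent Definition, an override map, and a variant can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `AgentDefinition` dataclass and the `fork` helper are hypothetical names, and the field set simply mirrors the fields listed in the Variant row above.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class AgentDefinition:
    # Hypothetical, simplified shape; the real Agent Definition is owned by PC1.
    model: str
    prompt_commit: str
    tuning: Mapping[str, Any] = field(default_factory=dict)
    tools: Tuple[str, ...] = ()
    sub_agents: Tuple[str, ...] = ()


# Fields a variant is allowed to override (taken from the Variant row above).
OVERRIDABLE = {"model", "prompt_commit", "tuning", "tools", "sub_agents"}


def fork(base: AgentDefinition, override_map: Mapping[str, Any]) -> AgentDefinition:
    """Return a copy of `base` with only the overridden fields replaced."""
    unknown = set(override_map) - OVERRIDABLE
    if unknown:
        raise ValueError(f"unknown override fields: {sorted(unknown)}")
    return replace(base, **override_map)


base = AgentDefinition(model="model-a", prompt_commit="commit-1")
# A prompt-fork variant overrides only prompt_commit; everything else is inherited.
variant = fork(base, {"prompt_commit": "commit-2"})
```

Keeping the base immutable and returning a new object means a rollback only has to stop applying the override map; the base definition is never mutated.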
4. Requirements
4.1 Functional requirements
FR-1 — Prompt versions are Opik-versioned artifacts. Every prompt MUST be stored in Opik with a stable name and commit history. The runtime MUST retrieve prompts by (name, commit) — never by file path alone, never by ad-hoc string substitution. The prompts/conversational-xml.txt and prompts/conversational.txt file content MUST be the source of the corresponding Opik prompt (synced via the existing cli/opik/prompt.py create workflow).
FR-2 — XML prompt components are the canonical capability declaration surface. All agent capabilities (in-scope behaviors, out-of-scope refusals, can-do statements, cannot-do statements) MUST be declared as XML blocks inside the agent’s XML prompt file (e.g., conversational-xml.txt). New capabilities are added by authoring a new block, not by editing a separate manifest file.
FR-3 — capabilities.md is deprecated. No new content MAY be added to capabilities.md. Existing content MUST migrate to XML prompt components per a documented migration map (§5.4). The file is removed when migration completes and zero references remain in the build chain.
FR-4 — Rollout is experiment-gated by default. Publishing a new variant MUST NOT route any production traffic to that variant until: (a) a Feature Flipper child test flag is configured with treatment / control split inside the parent pool flag, (b) the corresponding Eppo experiment definition exists for offline analysis, AND (c) a Feature Flipper kill-switch flag exists for the rollout. The runtime MUST resolve variant assignment at factory invocation (per FR-9) via the Feature Flipper SDK, not from “latest version”.
FR-5 — Fast full-rollout is the explicit escape hatch. A variant MAY ship full-rollout (no experiment, all eligible users get it) when the change does not warrant an experiment. The decision MUST be config-explicit (a rollout_mode: "full" field on the experiment config). Full-rollout MUST still register a Feature Flipper kill-switch flag for emergency rollback.
FR-6 — Eval gates fire at named rollout milestones. The eval-gate milestones exposed by PC-5 (see PC-5 §5.5 for milestone semantics) MUST fire at the rollout-lifecycle transitions PC-6 §5.8 specifies. Gate failure at any milestone MUST halt the rollout and require operator action to proceed.
FR-7 — Kill-switch is independent of Eppo state. Firing the kill-switch MUST drop the rollout immediately, routing all traffic to the previous stable variant. Eppo’s experiment record MAY remain (for post-hoc analysis), but no production traffic flows to the new variant while the kill-switch is off.
FR-8 — Rollback propagates across all cohorts. Rolling back a bad variant MUST drop every cohort that received it back to the previous stable variant, not just the cohort observing the regression. The runtime MUST resolve the previous stable variant at the next factory invocation; in-flight turns complete on the variant they started with.
FR-9 — Cohort assignment is factory-time, not per-turn. Per PC1 §5.5 cohort discipline, variant assignment MUST resolve at factory invocation (when the orchestrator builds an Agent Definition’s runtime form), not per-turn. Cohort composition (Agent Definition + registry versions + feature flags + experiment-arm membership) determines which variant that agent instance uses for the session.
FR-10 — XML prompt component lifecycle. New XML prompt components MUST be added via PR review with: (a) the block content; (b) eval coverage for the new capability (per PC5’s gate definitions); (c) a feature-flag declaration if the block is feature-gated. Component deprecation follows a 30-day window with zero-use observation, matching PS6 envelope-version deprecation policy.
FR-11 — Capability eval-manifest source of truth. The eval manifest (PLT-619) MUST read the set of declared capabilities from XML prompt components, not from capabilities.md. PC6 commits the source-of-truth shift; PC5 commits the consumption.
FR-12 — Variant observability. Every rollout decision MUST emit a trace event capturing: (a) sub-agent id, (b) resolved variant name, (c) the override map applied to the base Agent Definition (e.g., prompt_commit, model, tuning fields the variant set), (d) cohort identifiers (Agent Definition version, experiment arm, active flags), (e) rollout mode (experiment / full / killed). Trace shape is platform-standard OpenTelemetry.
FR-13 — Experiment-platform integration entry-point shape. PC6 commits the runtime integration’s contract surface: Feature Flipper SDK calls at the prompt-fetch boundary for variant resolution. Eppo is the offline analysis layer reading Feature Flipper’s assignment events via Snowflake (per §5.9); consumer-agent code does not integrate the Eppo SDK at request time. Specific Feature Flipper SDK version, polling frequency, and assignment-cache policy are operational tuning; PC6 does not pin them.
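
The precedence implied by FR-4, FR-5, and FR-7 (kill-switch first, then rollout mode, then experiment arm) can be sketched as below. All names here are hypothetical: `flags` stands in for the pre-fetched Feature Flipper state for one user, and the function does not call any real SDK. The sketch also assumes that a missing kill-switch flag is treated as off, and that the child test flag's arm already reflects the ramp.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RolloutDecision:
    resolved_variant: str
    rollout_mode: str              # "experiment" | "full" | "killed"
    experiment_arm: Optional[str]


def resolve_variant(
    flags: Mapping[str, object],   # hypothetical shape: flag name -> value for this user
    *,
    kill_switch_flag: str,
    child_test_flag: str,
    rollout_mode: str,             # from the experiment config: "experiment" or "full"
    new_variant: str,
    stable_variant: str,
    arm_to_variant: Mapping[str, str],
) -> RolloutDecision:
    # FR-7: kill-switch off wins over everything else (missing flag counts as off).
    if not flags.get(kill_switch_flag, False):
        return RolloutDecision(stable_variant, "killed", None)
    # FR-5: full rollout skips the experiment but stays behind the kill-switch.
    if rollout_mode == "full":
        return RolloutDecision(new_variant, "full", None)
    # FR-4: experiment mode; users without an arm stay on the stable variant.
    arm = flags.get(child_test_flag)
    if not isinstance(arm, str):
        return RolloutDecision(stable_variant, "experiment", None)
    return RolloutDecision(arm_to_variant.get(arm, stable_variant), "experiment", arm)
```

Per FR-9, a function like this would run once at factory invocation, and its result would be held for the session.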
4.2 Non-functional requirements
NFR-1 — Rollout-decision latency. Resolving a user’s assigned variant at factory invocation MUST be sub-millisecond — dict-lookup against pre-fetched Feature Flipper assignment state. No per-request HTTP calls to the experiment platform; the runtime MUST cache Feature Flipper assignments per-process with bounded TTL.
NFR-2 — Opik prompt retrieval latency. Fetching a prompt by (name, commit) from Opik MUST be cached at process startup; runtime resolution is dict-lookup, not network. Cache invalidation on version change is operational.
NFR-3 — Capability eval-manifest cardinality. The total set of declared XML prompt components per agent MUST be enumerable from the agent’s XML prompt file. Soft cap: under 30 distinct blocks per agent. Growth beyond that triggers a structural review.
NFR-4 — Rollout audit retention. Every rollout decision (publish, ramp, kill, rollback) MUST be persistent for at least 90 days for audit and post-hoc analysis. Storage backend is operational (Opik traces + Feature Flipper change log).
NFR-5 — Migration completion framework. capabilities.md removal is gated by zero-use observation plus a 30-day CI deprecation gate, triggered by the XML prompt experiment’s promotion event (not by a calendar date). PC6 commits the deprecation framework; the wall-clock date follows from the trigger.
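
One way to read NFR-1 is a per-process snapshot that a background poller refreshes, with the request path doing only a dict lookup. The sketch below illustrates that shape under stated assumptions: `AssignmentCache`, its `fetch` callable, and the staleness check are hypothetical, and no real Feature Flipper SDK call is shown.

```python
import time
from typing import Callable, Dict, Mapping, Optional


class AssignmentCache:
    """Per-process snapshot of assignments; lookups never touch the network."""

    def __init__(
        self,
        fetch: Callable[[], Mapping[str, str]],   # hypothetical bulk fetch, run off the request path
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._snapshot: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def refresh(self) -> None:
        # Intended to be called by a background poller, not by request handlers.
        self._snapshot = dict(self._fetch())
        self._fetched_at = self._clock()

    def is_stale(self) -> bool:
        # True before the first refresh or once the TTL has elapsed.
        return self._fetched_at is None or self._clock() - self._fetched_at > self._ttl

    def get(self, user_id: str, default: str) -> str:
        # Plain dict lookup: this is the only call on the factory-invocation path.
        return self._snapshot.get(user_id, default)
```

What to do when the snapshot is stale (serve it anyway, or fall back to the stable variant) is a policy choice this sketch leaves open.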
4.3 Acceptance criteria
AC-1 — Given a new variant published, when the experiment config has rollout_mode: "experiment" and no child experiment Feature Flipper flag is configured, the runtime MUST reject the publish with an error naming the missing child-flag configuration. No production traffic routes to the new variant. (The Eppo-side experiment definition for offline analysis is set up by the experimentation team in parallel; PC6’s runtime check is on the Feature Flipper child flag, not on Eppo.)
AC-2 — Given an active experiment with 50/50 treatment/control split and a 25% ramp, when a user’s factory invocation runs, the runtime MUST resolve the user’s variant via the Feature Flipper SDK (the child test flag inside the parent pool flag per §5.7); only users in treatment whose assignment falls within the 25% ramp see the new variant.
AC-3 — Given a Feature Flipper kill-switch flag set to off mid-experiment, when the next factory invocation runs, the runtime MUST resolve the previous stable variant regardless of the experiment assignment. The user receives the pre-rollout variant. Trace events MUST record rollout_mode: "killed".
AC-4 — Given a vertical adds a new XML prompt component declaring a new capability (e.g., <can_do> block extension), when CI runs, the eval manifest (PLT-619) MUST surface the new capability as a covered declaration; CI MUST fail if the capability is not present in the eval suite (per PC5’s gate definitions).
AC-5 — Given a content edit to capabilities.md in a PR, CI MUST fail with a deprecation error pointing the author to the XML prompt component path. No new capabilities.md content reaches main.
AC-6 — Given a rollout reaches the pre-ramp milestone, PC5’s eval gate MUST fire automatically. If any registered judge template returns a score below threshold, the rollout MUST halt at 0% (no ramp begins); the operator receives a notification.
AC-7 — Given a rollback request for variant V_new back to V_prev, when the next factory invocation runs, every cohort previously assigned to V_new MUST resolve to V_prev. In-flight turns continue on V_new; new turns get V_prev.
AC-8 — Given a rollout_mode: "full" variant publish (e.g., copy-only prompt change), the runtime MUST route 100% of eligible users to the new variant on the next factory invocation, without Eppo configuration, but MUST register a kill-switch flag and emit the standard rollout trace events.
AC-9 — Given the eval manifest at steady state, when CI inspects the source of “declared capabilities”, it reads exclusively from XML prompt components — no read path through capabilities.md.
AC-10 — Given any rollout decision at factory invocation, the emitted trace event MUST contain: sub_agent_id, resolved_variant, override_map (the fields the variant set on the base Agent Definition, e.g., prompt_commit, model, tuning), agent_definition_version, experiment_arm (or null), active_flags, rollout_mode.
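
The AC-10 field list can be sketched as a small event builder. The field names come from AC-10; the function names and the `rollout.` attribute prefix are hypothetical. Because OpenTelemetry attribute values are limited to primitives and sequences of primitives, the sketch JSON-encodes the override map and drops a null experiment arm when flattening to span attributes.

```python
import json
from typing import Any, Dict, Mapping, Optional, Sequence

ROLLOUT_MODES = {"experiment", "full", "killed"}


def build_rollout_event(
    *,
    sub_agent_id: str,
    resolved_variant: str,
    override_map: Mapping[str, Any],
    agent_definition_version: str,
    experiment_arm: Optional[str],
    active_flags: Sequence[str],
    rollout_mode: str,
) -> Dict[str, Any]:
    """Assemble the seven AC-10 fields into one event dict."""
    if rollout_mode not in ROLLOUT_MODES:
        raise ValueError(f"unknown rollout_mode: {rollout_mode!r}")
    return {
        "sub_agent_id": sub_agent_id,
        "resolved_variant": resolved_variant,
        "override_map": dict(override_map),
        "agent_definition_version": agent_definition_version,
        "experiment_arm": experiment_arm,
        "active_flags": sorted(active_flags),
        "rollout_mode": rollout_mode,
    }


def to_span_attributes(event: Mapping[str, Any]) -> Dict[str, Any]:
    """Flatten an event into attribute-friendly values (hypothetical key prefix)."""
    attrs: Dict[str, Any] = {}
    for key, value in event.items():
        if value is None:
            continue  # attributes cannot carry null; absence stands for "no arm"
        if isinstance(value, Mapping):
            value = json.dumps(value, sort_keys=True)
        attrs[f"rollout.{key}"] = value
    return attrs
```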
5. Solution Design
5.1 The architectural through-line
Agent Definition variants are versioned (prompts in Opik at v1); capabilities are declared in XML components; rollouts are cohort-gated by Eppo + Feature Flipper; eval gates fire at named milestones.
Three properties hold across every PC6 contract:
- One source of truth per concern. Prompt content lives in Opik. Capability declarations live in XML prompt components. Rollout state lives in Eppo + Feature Flipper. No concern is replicated; no two surfaces disagree on the same fact.
- Cohort discipline is factory-time, not per-turn. The orchestrator’s factory invocation resolves which variant each agent runs against — cohort composition (Agent Definition + registry versions + feature flags + experiment-arm) is fixed for the session. Per-turn re-evaluation would invalidate PC1 §5.5’s cache-friendly prompt-prefix contract.
- Gates are explicit, not implicit. Eval gates (PC5), kill-switch (Feature Flipper), and Feature Flipper child-flag variant assignment are each declared in config — no implicit “ramp up because time passed” or “promote because no error”. An operator (or PC5’s automated judge) must explicitly approve each milestone.
5.2 Prompt versioning
Opik is the prompt versioning substrate. The contract:
- Prompt name — stable identifier (e.g., conversational, conversational-xml, prompt-suggestions, title-generation). One name per logical prompt; versions are commits within the named prompt.
- Commit — an Opik-assigned version identifier. Retrievable by name; latest by default.
- Source of truth — the prompt file (conversational-xml.txt etc.) in the consumer-agent repo. Changes are committed to the repo; the existing cli/opik/prompt.py create workflow syncs the repo file to Opik on merge.
Runtime resolution (at factory invocation, not per turn; see FR-9):
- Factory determines the cohort (Agent Definition + registry versions + flags + experiment-arm).
- Cohort maps to a variant via the experiment config (§5.6); for prompt-fork variants, the variant’s prompt_commit resolves to a prompt name and commit.
- Runtime retrieves prompt content from Opik cache (warm at process startup; cache invalidation on version change is operational).
- Prompt assembled per PC1 §5.5 (prompt blocks + dynamic context appended).
prompt_version value-format contract for PS-5. Per PS-5 FR-3 and R-9, PS-5’s Tier-2 prompt_version slot accepts whatever string the prompt manager hands it; PC-6 owns the value-format contract. Under PC-1’s prompt-blocks composition, a single prompt_commit identifier captures only one block’s version, not the full assembled-prompt composition. PC-6 commits the value PS-5 receives to be a content-addressed SHA256 of the assembled static prompt:

    prompt_version = sha256(assembled_static_prompt_bytes).hex()[:16]

assembled_static_prompt_bytes is the byte-encoded result of PC-1’s prompt assembly (per PC-1 §5.5), with all per-turn dynamic context excluded. PC-6 does not specify the block list or assembly order; those are PC-1’s responsibility, and the hash captures whatever the runtime actually composes. The runtime computes this hash at sub-agent factory init (alongside variant resolution) and passes it to PS-5’s set_context(). Properties:
- Deterministic: same assembled prompt → same value, byte-identical across runtimes.
- Composition-complete: any change to any block in the assembly produces a different value; the field reflects the full assembled prompt, not one block’s commit.
- Joinable: a side table (committed at deploy time, e.g., prompts/registry/assembled-prompts.json) maps prompt_version → {block_versions: {...}, prompt_commits: {...}} for debugging and full-composition analysis. The runtime emits a new entry whenever a previously-unseen assembled prompt appears.
- Stable across dynamic context: dynamic per-turn context (user message, prior_context, etc.) is excluded from the hash so the value identifies the static prompt setup, not the rendered prompt.
Backward compatibility: existing traces produced before this value-format takes effect have prompt_version carrying a prompt-commit identifier per the original semantic. Analyses joining trace data with the assembled-prompts registry should branch on value-format detection (a hash is 16 hex chars; a commit ID is a longer Opik UUID).
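
The value-format contract and the backward-compatibility branch can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, UTF-8 is assumed as the byte encoding (the spec says only "byte-encoded"), and Python's `hexdigest()` stands in for the `.hex()` in the formula above.

```python
import hashlib
import re


def compute_prompt_version(assembled_static_prompt: str) -> str:
    """First 16 hex chars of the SHA-256 of the assembled static prompt."""
    data = assembled_static_prompt.encode("utf-8")  # assumption: UTF-8 encoding
    return hashlib.sha256(data).hexdigest()[:16]


_HASH_RE = re.compile(r"[0-9a-f]{16}")


def is_assembled_prompt_hash(value: str) -> bool:
    """Value-format detection: new-style values are exactly 16 lowercase hex chars."""
    return _HASH_RE.fullmatch(value) is not None


version = compute_prompt_version("<identity>...</identity>")
```

Truncating to 16 hex characters keeps 64 bits of the digest, which is why the side table keyed by prompt_version matters for recovering the full composition.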
What lives in Opik vs the repo:
- Repo: prompt content (conversational-xml.txt, prompt-block YAML files), experiment config (§5.6), capability declarations (XML components), tests.
- Opik: prompt versions (commits), commit history, automation rules (PC5), traces of prompt-version usage.
The split is the same one PC1 §5.7 Decision 9 already committed: file-backed for change control, Opik for eval / observability / versioning storage.
Promotion mechanism — version-level Opik tags. Opik does not natively support a prompt-promotion primitive (verified against Opik SDK 1.10.8). PC6 uses Opik’s version-level PromptVersionDetail.tags field to encode environment promotion. Tag convention: env:local, env:stage, env:prod on individual versions. Promotion = move the env:prod tag from the prior version to the new version via the update_prompt_versions(ids, PromptVersionUpdate(tags=[...]), merge_tags) REST endpoint (verified in Opik SDK 1.10.8). Runtime fetches by tag, not by commit, so promotion does not require a consumer-agent redeploy. File-mirror discipline: prompts/*.txt in the repo tracks the content tagged env:stage; a CI drift check enforces parity between the repo file and the Opik version tagged env:stage. Revisit when Opik 2.x ships native promotion.
Deployment vs routing. Opik tags control which prompt content the runtime caches at process startup; Feature Flipper controls which variant a user gets at factory invocation (§5.6). The two are separate layers. A consumer-agent deploy ships every variant currently declared active across experiment configs — the deploy bundle is what makes a variant available. Feature Flipper config is what routes a cohort to one of those available variants. Promotion and rollback are routing changes, not artifact ships: the operator moves the Feature Flipper assignment (or the Opik env:prod tag for prompt content), no redeploy needed.
Abstraction layer. Opik-specific calls are codified behind a thin Protocol abstraction, not scattered across business logic. Two interfaces:
- PromptSource (already in src/consumer_agent/prompts/sources.py) — read interface. Methods: load_prompt(name), load_capabilities(), load_component(component_type), list_components(), get_feature_gated_components(). Concrete impls: FileSource (local development), OpikSource (stage / prod).
- PromptStore (extends the abstraction during implementation) — promotion + version-tag interface. Methods: get_prompt_by_tag(name, tag) (fetch by env tag, not commit), move_tag(name, tag, target_commit) (atomic tag move for promotion), list_versions(name) (commit history). Concrete impl: OpikStore wrapping update_prompt_versions, get_prompt, retrieve_prompt_version SDK calls.
The runtime calls these Protocols, not Opik SDK directly. Vendor swap (Opik → alternative) is a Protocol re-implementation, not a runtime rewrite. Vendor portability is not a v1 implementation requirement, but the abstraction is codified at v1 so vertical experiments wire against the Protocols from day one — preventing the retrofit cost of “every vertical already calls Opik directly.”
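
A minimal sketch of the PromptStore Protocol, with an in-memory stand-in to show the tag-move semantics. The three method names come from the list above; the return types, the `InMemoryStore` class, and its `add_version` helper are assumptions for illustration and make no Opik SDK calls.

```python
from typing import Dict, List, Protocol, Set


class PromptStore(Protocol):
    # Return types are assumptions; the spec names only the methods and arguments.
    def get_prompt_by_tag(self, name: str, tag: str) -> str: ...
    def move_tag(self, name: str, tag: str, target_commit: str) -> None: ...
    def list_versions(self, name: str) -> List[str]: ...


class InMemoryStore:
    """Hypothetical test double; a real OpikStore would wrap the Opik SDK."""

    def __init__(self) -> None:
        self._content: Dict[str, Dict[str, str]] = {}    # name -> commit -> content
        self._tags: Dict[str, Dict[str, Set[str]]] = {}  # name -> commit -> tags

    def add_version(self, name: str, commit: str, content: str) -> None:
        self._content.setdefault(name, {})[commit] = content
        self._tags.setdefault(name, {})[commit] = set()

    def get_prompt_by_tag(self, name: str, tag: str) -> str:
        for commit, tags in self._tags.get(name, {}).items():
            if tag in tags:
                return self._content[name][commit]
        raise KeyError(f"no version of {name!r} carries tag {tag!r}")

    def move_tag(self, name: str, tag: str, target_commit: str) -> None:
        versions = self._tags[name]
        if target_commit not in versions:
            raise KeyError(f"unknown commit {target_commit!r} for {name!r}")
        # Remove the tag everywhere first so exactly one version carries it.
        for tags in versions.values():
            tags.discard(tag)
        versions[target_commit].add(tag)

    def list_versions(self, name: str) -> List[str]:
        return list(self._content.get(name, {}))


store: PromptStore = InMemoryStore()
```

Because `Protocol` uses structural typing, `InMemoryStore` satisfies `PromptStore` without inheriting from it, which is what would let verticals test against the interface without an Opik connection.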
Two-tier tag discipline. Two distinct tag fields exist on Opik:
- Prompt-level tags (RestPrompt.tags) — classification only. Convention: domain:<vertical>, owner:<team>. Applies to the prompt as a whole; does not change with promotion.
- Version-level tags (PromptVersionDetail.tags) — environment promotion. Convention: env:local, env:stage, env:prod. Applies to a specific commit; moves with promotion.
SDK pitfall: get_prompt_history does NOT hydrate version tags; use get_prompt(name, commit) or retrieve_prompt_version(name, commit) for tag-aware reads.
5.3 XML prompt component authoring path
The XML prompt component is the canonical surface for declaring agent capabilities. Existing structure in prompts/conversational-xml.txt:

    <identity>...</identity>
    <output_contract>...</output_contract>
    <core_rules priority="safety">...</core_rules>
    <scope_boundaries>
      <in_scope>...</in_scope>
      <out_of_scope>...</out_of_scope>
    </scope_boundaries>
    <can_do>...</can_do>
    <cannot_do>...</cannot_do>

Authoring a new capability:
- Author adds a new block (or extends an existing block like <can_do>) in the agent’s XML prompt file.
- Author adds eval coverage for the new capability (per PC5’s gate definitions) — at minimum, judge-template invocations that test the capability holds.
- Author declares a feature flag if the block is gated for partial rollout (per PF8 naming).
- PR review — both the prompt block content and the eval coverage gate the merge.
- On merge, Opik prompt sync runs; the new prompt version is registered.
- Rollout begins per §5.4.
What an XML block declares:
| Block | Meaning |
|---|---|
| <identity> | Persona, tone, voice |
| <output_contract> | Format constraints (word limits, markdown, structure) |
| <core_rules priority="safety"> | Inviolable rules (safety, data integrity, privacy) |
| <scope_boundaries> | What the agent will and won’t engage on; the in-scope / out-of-scope partition |
| <can_do> | Concrete capabilities the agent advertises and can deliver |
| <cannot_do> | Concrete capabilities the agent explicitly refuses, with redirect language |
New capabilities go in <can_do> (extending the list) or as a new specialized block (e.g., <products> for product-domain rules, <offer-list> for offer-presentation rules — vertical-specific blocks owned by the vertical that declares them).
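
Enumerating the declared blocks from the prompt file (the property NFR-3 and FR-11 depend on) can be sketched with the standard library. This assumes the prompt file is a well-formed XML fragment, which may not hold if the prompt text contains bare `&` or `<` characters; the `list_components` name here is a hypothetical helper, not the existing PromptSource method.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SOFT_CAP = 30  # NFR-3 soft cap on distinct blocks per agent


def list_components(prompt_text: str) -> List[Dict[str, object]]:
    """Return the top-level XML blocks with their attributes and child tags."""
    # The prompt file has several top-level blocks, so wrap it in a synthetic root.
    root = ET.fromstring(f"<prompt>{prompt_text}</prompt>")
    return [
        {
            "tag": child.tag,
            "attributes": dict(child.attrib),
            "children": [grandchild.tag for grandchild in child],
        }
        for child in root
    ]


SAMPLE = """
<identity>Friendly assistant</identity>
<core_rules priority="safety">Never share personal data</core_rules>
<scope_boundaries>
  <in_scope>offers</in_scope>
  <out_of_scope>legal advice</out_of_scope>
</scope_boundaries>
<can_do>Summarize offers</can_do>
"""

components = list_components(SAMPLE)
within_cap = len({c["tag"] for c in components}) < SOFT_CAP
```

A CI job could compare the resulting tag set against the eval suite's covered capabilities, though how PC5 consumes it is outside this sketch.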
Why XML, not YAML or markdown:
- Already in production via conversational-xml.txt — codifies what exists.
- XML’s nesting + attributes (e.g., priority="safety") carry structural meaning the prompt-assembly layer reads.
- Distinguishes prompt content (XML, sent to the LLM) from prompt-block metadata (YAML, the orchestration layer’s view) — different audiences, different formats.
Block ownership:
- Platform-owned blocks: <identity>, <output_contract>, <core_rules> — modified only by the platform team (similar to PC1’s auto-injected platform-required blocks).
- Vertical-owned blocks: <can_do>, <cannot_do>, vertical-specific blocks — modified by the vertical owning that capability.
Modifications to platform-owned blocks require platform-team review; modifications to vertical-owned blocks require the vertical’s owner + platform-team second-pair review.
Experiments on platform-owned blocks follow the same machinery as vertical-block experiments, with platform-team ownership of the experiment config. A platform-team experiment on <core_rules> ships through the variant lifecycle (§5.5), variant-to-fork resolution (§5.6), and trace observability (FR-12) identically to a vertical experiment, with the experiment config living in the platform team’s sub-agent directory.
5.4 capabilities.md deprecation + migration (consumer-agent scope)
consumer-agent/prompts/capabilities.md is deprecated as of PC6 merge, within consumer-agent’s scope only. The file has three known consumers in consumer-agent:
- prompts/components/prompt-suggestion.yml (the chip generator prompt): references capabilities.md by name in its instructions block; tells the LLM “Suggestions must align with agent capabilities defined in capabilities.md.” This is the PLT-644 root cause: the chip generator reads from capabilities.md, the agent runtime reads XML prompt components, and drift between them produces chips promising capabilities the agent can’t deliver.
- src/consumer_agent/prompts/sources.py: prompt source resolution loads the file at runtime (not just a doc artifact).
- Tests in tests/unit/prompts/test_sources.py exercise the above.
Out of scope (separate cleanup ticket): rover-agent/internal/agent/capabilities.go is parallel Go-side LoadCapabilities infrastructure that reads rover-agent/capabilities.md. With the python_agent feature flag at 100% rollout, the Go direct-LLM path is bypassed at runtime — the infrastructure exists in the codebase but no production traffic hits it. PC6 does not absorb rover-agent’s cleanup into its scope; the dead code is removed via a separate rover-agent housekeeping ticket owned by the rover-agent owner, on rover-agent’s own timeline.
Migration policy (consumer-agent):
Step 1: Migration map. Every line / item in consumer-agent/prompts/capabilities.md maps to one of:
- An XML prompt component already present in conversational-xml.txt (most lines; content is already duplicated).
- A new XML block to add (a small set; content not yet represented).
- A line that doesn’t belong in either surface (drift artifacts; deleted entirely).
The migration map is authored as part of the first PC6-implementing PR and reviewed before capabilities.md is frozen by the CI deprecation gate (Step 4).
Step 2: Chip generator prompt rewrite. prompts/components/prompt-suggestion.yml is rewritten to reference XML prompt components instead of capabilities.md. The chip generator’s instructions block reads “Suggestions must align with agent capabilities declared in the agent’s XML prompt blocks (<can_do>, <cannot_do>, <scope_boundaries>).” This rewrite ships in the same PR as the migration map so the chip generator and the agent share one source of truth from day one.
Step 3: Prompt-source code path retired. src/consumer_agent/prompts/sources.py stops loading capabilities.md. The file’s prior runtime role (providing capability text to consumers) is satisfied by reading XML prompt component content directly.
Step 4: CI deprecation gate. A CI check fails any PR that modifies consumer-agent/prompts/capabilities.md after Steps 1-3 land. Error message points to the XML prompt component path with the relevant block name.
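The gate in Step 4 could be as small as a check over the PR's changed paths. The sketch below is one possible shape, not the actual CI job: it assumes the changed-file list is supplied by the CI system (for example from `git diff --name-only`), and the replacement path `consumer-agent/prompts/conversational-xml.txt` is an assumed full path inferred from the file names in this section.

```python
DEPRECATED_PATH = "consumer-agent/prompts/capabilities.md"
# Assumed location of the XML prompt file; adjust to the real repo layout.
REPLACEMENT_PATH = "consumer-agent/prompts/conversational-xml.txt"


def deprecation_errors(changed_paths: list[str]) -> list[str]:
    """Return one error message per deprecated file touched by the PR.

    Only the consumer-agent copy is gated; rover-agent's capabilities.md
    is out of scope and passes through untouched.
    """
    errors = []
    for path in changed_paths:
        if path == DEPRECATED_PATH:
            errors.append(
                f"{path} is deprecated (PC6 §5.4). Declare capabilities in "
                f"{REPLACEMENT_PATH} using <can_do>, <cannot_do>, or "
                "<scope_boundaries> instead."
            )
    return errors
```

A CI wrapper would print the returned messages and exit non-zero when the list is non-empty.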
Step 5: Removal. Once the migration map is complete, the chip generator is rewritten, the source-resolution code path is retired, AND CI has been failing capabilities.md edits for 30 days with zero attempted bypasses, consumer-agent/prompts/capabilities.md is deleted in a follow-up PR. The file’s prior content is preserved in git history.
Why deprecation matters for PLT-644 + PLT-619:
- PLT-644 (chips promising unwired capabilities): root cause was the chip generator (Step 2) reading capabilities.md while the agent read XML. After Step 2’s rewrite, both surfaces read XML; drift is structurally impossible.
- PLT-619 (eval manifest gap): the eval suite couldn’t enumerate “declared capabilities” because two surfaces existed. Post-Step 3, the eval manifest reads XML blocks exclusively.
Note on PC-1 §5.7 block categories: PC-1 retains “capabilities” as a block category name. Post-deprecation, capability declarations are authored as XML <can_do> / <cannot_do> prompt blocks (PC-6 §5.3); the legacy capabilities.md file is no longer the source. Category semantics survive; storage moves to XML.
5.5 Rollout lifecycle
A variant moves through the four lifecycle states PF1 establishes for sub-agents: dev → test → promote → rollback. The same vocabulary applies here; prompts and sub-agents are different artifacts going through the structurally-identical lifecycle. The lifecycle applies to every Agent Definition: vertical-owned sub-agents and platform-owned agents (orchestrator, safety helpers, future platform agents) equally. The team owning the Agent Definition owns its experiments and its rollout cadence; the machinery is identical.
    ┌───────────────┐
    │      dev      │
    └──────┬────────┘
           │ developer-driven (PR opens)
           ▼
    ┌───────────────┐
    │     test      │
    └──────┬────────┘
           │ pre_merge gate (PC5)
           │ - PASS; PR merges; Opik sync runs
           │ - operator initiates ramp via experiment config
           ▼
    ┌───────────────┐
    │    promote    │  (ramp percentage advanced over time)
    │               │  pre_ramp gate fires at first non-zero step
    │               │  pre_full gate fires at 100% advancement
    └──────┬────────┘
           │ operator action OR
           │ kill-switch fires OR
           │ eval gate fails mid-rollout
           ▼
    ┌───────────────┐
    │   rollback    │
    └───────────────┘

Ramp progression inside promote (ramp percentage = 0% immediately after merge, then advanced over time, ultimately reaching 100%) is recorded in the experiment config (§5.6), not as distinct lifecycle states.
Transition rules:
- dev → test: developer-driven. Author writes the new XML block or prompt content in a PR. PR opens.
- test → promote: PC5’s pre_merge gate fires (eval-config API at PC5). Pass → PR is mergeable; fail → PR is blocked. On merge, Opik prompt sync runs; the new version exists in Opik. The operator then initiates a ramp via the experiment config (§5.6):
  - rollout_mode: "experiment": a Feature Flipper child test flag is created with the declared treatment / control split inside the parent pool flag (§5.9). The Eppo experiment definition is configured in parallel by the experimentation team for offline analysis of the assignment events Feature Flipper logs. A separate Feature Flipper kill-switch flag is created. PC5’s pre_ramp gate fires at the first non-zero ramp step; pre_full fires when the operator advances ramp to 100%.
  - rollout_mode: "full": no child test flag, no Eppo experiment definition; 100% of eligible users move to the new variant on next factory invocation. Feature Flipper kill-switch flag is still created.
- * → rollback (from test or promote): operator action OR kill-switch fires OR an eval gate fails mid-rollout. All cohorts revert to the previous stable variant at the next factory invocation. Eppo experiment record is preserved for post-hoc analysis.
A variant reaching 100% ramp inside promote becomes the “previous stable” reference for the next variant’s rollback target.
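The transition rules above amount to a small state machine. The following sketch encodes them as an allow-list; it is illustrative only. The names `State`, `ALLOWED`, and `transition` are hypothetical, and treating rollback as terminal for a given variant is an assumption the spec does not state.

```python
from enum import Enum


class State(Enum):
    DEV = "dev"
    TEST = "test"
    PROMOTE = "promote"
    ROLLBACK = "rollback"


# Allowed moves per the transition rules: rollback is reachable from
# test or promote only. Rollback as a terminal state is an assumption.
ALLOWED = {
    State.DEV: {State.TEST},
    State.TEST: {State.PROMOTE, State.ROLLBACK},
    State.PROMOTE: {State.ROLLBACK},
    State.ROLLBACK: set(),
}


def transition(current: State, target: State) -> State:
    """Return target if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Ramp percentage is deliberately absent from the state set, matching the statement that ramp progression lives in the experiment config rather than in lifecycle states.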
Minor changes (in-place): Minor edits to a prompt artifact (typo fixes, tuning adjustments, non-substantive wording changes) MAY commit in-place without re-entering dev. PF-1 FR-12 enumerates the “major change” criteria for sub-agents (model swap, prompt-block add/remove, tool list change); PC-6 applies the analogous distinction at the prompt level: a major prompt change (significant <can_do> / <cannot_do> content change, new XML block, restructured <output_contract>) re-enters dev as a new prompt version; minor changes commit in-place against the current promoted version.
5.6 Experiment config schema
The experiment config is YAML, lives alongside the sub-agent it targets (per PF-5’s vertical scaffolding convention; e.g., verticals/<vertical>/experiments/<experiment>.yaml), and is read at factory invocation. Each config declares a Feature Flipper experiment and the Agent Definition overrides for each variant. Schema:
    sub_agent_id: rewards
    base_agent_definition_ref: <path-or-ref-to-base-AD>  # the Agent Definition this experiment forks
    rollout_mode: experiment  # experiment | full

    experiment:
      id: rewards-v3-model-eval
      audience: ai_assistant_eligible_users
      feature_flipper_id: ai_assistant_rewards_v3_experiment
      split: {treatment: 50, control: 50}
      variants:
        treatment:
          # Override any subset of Agent Definition fields. Unspecified
          # fields inherit from base_agent_definition_ref.
          prompt_commit: abc123def
          model: gpt-5.4-mini
          tuning:
            reasoning_effort: low
            max_output_tokens: 512
          # tools / sub_agents overrides supported via the same shape
        control:
          prompt_commit: <previous-stable-commit-id>
          model: gpt-5.4-nano

    feature_flipper_kill_switch: ai_assistant_rewards_v3_killswitch
    eval_gates:  # PC5 references
      pre_ramp: [judge-shopping-relevance, judge-safety-baseline]
      pre_full: [judge-shopping-relevance, judge-safety-baseline, judge-llm-feedback-correlation]
    ramp_steps: [0, 5, 25, 50, 100]  # percentages; pre_ramp gates the first non-zero step, pre_full gates the step to 100% (§5.8)
    rollback_target:
      prompt_commit: <previous-stable-commit-id>
      model: gpt-5.4-nano

For rollout_mode: full (no experiment, single full-rollout variant):
    sub_agent_id: rewards
    base_agent_definition_ref: <path-or-ref-to-base-AD>
    rollout_mode: full

    variant:  # single variant, no split
      prompt_commit: abc123def

    feature_flipper_kill_switch: ai_assistant_rewards_v3_killswitch
    eval_gates:
      pre_merge: [judge-shopping-relevance, judge-safety-baseline]
    rollback_target:
      prompt_commit: <previous-stable-commit-id>

Variant-to-fork resolution (the central abstraction):
- Factory invocation calls Feature Flipper SDK with (user_id, feature_flipper_id).
- Feature Flipper returns the variant name (treatment / control).
- Runtime looks up the variant in the experiment config’s variants: map.
- Runtime forks the base Agent Definition by applying the variant’s overrides (only the fields the variant declares; unspecified fields inherit from base_agent_definition_ref).
- Sub-agent is instantiated with the forked Agent Definition for this user’s session.
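The fork step in the resolution sequence above can be sketched as an override merge. This is a minimal illustration under stated assumptions: Agent Definitions are modeled as plain dicts, the function name `fork_agent_definition` is hypothetical, and merging nested mappings such as `tuning` key by key (rather than replacing them wholesale) is an assumption, since the spec only says unspecified fields inherit from the base. The `tools` value and the `prev-stable` commit are placeholders.

```python
import copy
from typing import Any


def fork_agent_definition(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new Agent Definition with overrides applied on top of base.

    Nested mappings (e.g. tuning) are merged key by key; every other
    value replaces the base value wholesale. The base is never mutated.
    """
    forked = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(forked.get(key), dict):
            forked[key] = fork_agent_definition(forked[key], value)
        else:
            forked[key] = copy.deepcopy(value)
    return forked


# Placeholder base definition and variants map for illustration.
base = {
    "prompt_commit": "prev-stable",
    "model": "gpt-5.4-nano",
    "tuning": {"reasoning_effort": "medium", "max_output_tokens": 1024},
    "tools": ["search_offers"],
}
variants = {
    "treatment": {"model": "gpt-5.4-mini", "tuning": {"reasoning_effort": "low"}},
    "control": {},
}
forked = fork_agent_definition(base, variants["treatment"])
```

Returning a fresh copy rather than mutating the base keeps concurrent sessions on different arms from observing each other's overrides.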
Per-team ownership. Each team owns their sub-agent and their experiments. Experiment configs live in the team’s sub-agent directory (per PF-5), not in a central rollout-config folder. A team running three concurrent experiments has three configs in their directory. Platform-owned agents (orchestrator, helpers) follow the same convention in the platform team’s directory.
Concurrent experiment isolation: same-field mutex. Multiple experiments MAY run concurrently on the same sub-agent as long as their variant override maps target disjoint Agent Definition fields. A team running a model experiment AND a prompt experiment on the same sub-agent at the same time is allowed; two experiments both forking the same field (e.g., two concurrent model-swap experiments) are NOT.
Validation: at deploy time, the schema validator computes the union of override_map keys across all active experiments per sub-agent. Any field appearing in two or more concurrent experiments fails the deploy with an error naming the conflicting experiments. Operator action: drop one of the experiments (or wait for it to complete) before re-deploying.
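A sketch of the same-field mutex check follows. It assumes conflicts are detected at the granularity of top-level Agent Definition fields (so two experiments that both touch `tuning` conflict even if they change different tuning keys); the spec does not state the granularity. The function name and the experiment ids other than `rewards-v3-model-eval` are hypothetical.

```python
from collections import defaultdict


def find_field_conflicts(experiments: dict[str, dict[str, dict]]) -> dict[str, list[str]]:
    """Map each contested field to the experiments that override it.

    `experiments` maps experiment id -> {variant name -> override map}
    for the active experiments on ONE sub-agent. A field counts once per
    experiment even when several of its variants override it.
    """
    owners = defaultdict(set)
    for exp_id, variants in experiments.items():
        for overrides in variants.values():
            for field in overrides:
                owners[field].add(exp_id)
    return {f: sorted(ids) for f, ids in owners.items() if len(ids) > 1}


active = {
    "rewards-v3-model-eval": {
        "treatment": {"model": "gpt-5.4-mini"},
        "control": {"model": "gpt-5.4-nano"},
    },
    "rewards-prompt-fork": {  # hypothetical second experiment
        "treatment": {"prompt_commit": "abc123def"},
        "control": {},
    },
}
conflicts = find_field_conflicts(active)  # disjoint fields, so no conflicts
```

A deploy-time validator would fail when the result is non-empty and name the experiment ids listed for each contested field.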
This is a v1 policy. Future amendments may relax to explicit conflict-resolution rules (e.g., experiment-priority ordering) if teams find the mutex too restrictive in practice. Cohort cardinality stays bounded under this policy (per R-4) because each field contributes at most one experiment-arm dimension per sub-agent.
Deployment model: deploy bundles, runtime routes. A consumer-agent deploy bundles every variant currently declared active across all experiment configs (treatment, control, rollback target, prior stable). For prompt-fork variants this means every active Opik commit is reachable from the deployed binary’s cache-warming step. For model / tuning / tools / sub-agent variants this means the relevant clients, dependencies, and code paths are present in the deploy. Once deployed, variants are available. Feature Flipper assignment routes users to one of them; no redeploy is required to change routing.
Operationally, rolling out a new variant is a two-step operation:
- Deploy the experiment config (with the new variant in variants:). It ships with the next consumer-agent deploy alongside all other active variants.
- Configure the Feature Flipper child test flag to route a cohort to the new variant.
This decouples deploy cadence from rollout cadence: deploys can be frequent; rollouts can be selective. Rollback is near-instantaneous (move the Feature Flipper assignment, no redeploy). The deploy artifact is observable and auditable because every active variant is bundled in it. PC-5’s pre_merge gate fires on the deploy artifact’s eval surface; the pre_ramp and pre_full gates fire on the Feature Flipper routing transitions.
5.7 Cohort resolution
Per PC1 §5.5, cohort composition determines the prompt-prefix for cache friendliness. PC6 adds experiment-arm membership to the cohort definition:

    cohort = (
        agent_definition_version,
        prompt_block_registry_version,
        xml_prompt_component_registry_version,  # new in PC6
        active_feature_flags,
        experiment_arm_assignment,  # new in PC6
    )

The Feature Flipper variant is resolved at factory invocation (or read from a per-process cache warmed at startup). The cohort is then a stable tuple for the session; the variant assignment is a dict-lookup against the experiment config (§5.6).
Cache-friendly contract preserved: same-cohort users still share a stable prompt prefix per turn. Cohort cardinality grows by experiment-arm count (typically 2x or 3x per active experiment) but stays bounded.
The parent/child pool/test flag pattern and cohort-key choice that produce experiment_arm_assignment are specified in §5.9 (Experiment-platform integration).
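For the cohort to serve as a cache key it has to be hashable and insensitive to input ordering. One way to get that, sketched below, is a `NamedTuple` with normalized members. The field types are assumptions (the spec names the fields but not their representations), as is modeling `experiment_arm_assignment` as sorted (experiment id, arm) pairs.

```python
from typing import NamedTuple


class Cohort(NamedTuple):
    agent_definition_version: str
    prompt_block_registry_version: str
    xml_prompt_component_registry_version: str  # new in PC6
    active_feature_flags: frozenset[str]
    experiment_arm_assignment: tuple[tuple[str, str], ...]  # new in PC6


def make_cohort(ad_version, block_registry, xml_registry, flags, arms) -> Cohort:
    """Normalize inputs so that equal cohorts compare and hash equal.

    `flags` is any iterable of flag names; `arms` maps experiment id to
    arm name. Both are order-insensitive after normalization.
    """
    return Cohort(
        ad_version,
        block_registry,
        xml_registry,
        frozenset(flags),
        tuple(sorted(arms.items())),
    )
```

Two users with the same versions, flags, and arms then produce identical keys regardless of the order in which flags or experiments were evaluated.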
5.8 Eval-gate hookup (cross-section with PC5)
PC5 owns the eval-gate definition. PC6 specifies when gates fire:
- pre_merge: fires on PR open and on every push to the PR branch. Blocks merge if any registered gate fails. Used for both rollout_mode: experiment and rollout_mode: full (the full-rollout case’s main eval surface).
- pre_ramp: fires when rollout_mode: experiment and the ramp transitions from 0% to the first non-zero step. Blocks ramp; rollout sits at 0% until operator action.
- pre_full: fires when ramp transitions to 100%. Blocks the final step; rollout sits at the last successful ramp percentage.
Each gate hookup reads judge templates by ID from PC5’s API. If the API is unreachable, the gate fails-closed (rollout halts).
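The ramp-related firing rules can be expressed as a pure function over the previous and next ramp percentage. This sketch is an interpretation, not the pipeline's code: the function name is hypothetical, and returning both gates for a direct 0% to 100% jump is an assumption about a case the rules do not address. pre_merge is omitted because it fires on PR events rather than ramp transitions.

```python
def gates_for_ramp_transition(rollout_mode: str, prev_pct: int, next_pct: int) -> list[str]:
    """Name the PC5 gates that must pass before the ramp moves to next_pct."""
    if rollout_mode != "experiment" or next_pct <= prev_pct:
        return []  # full mode has no ramp; ramp-downs are not gated here
    gates = []
    if prev_pct == 0:
        gates.append("pre_ramp")  # first non-zero step
    if next_pct == 100:
        gates.append("pre_full")  # final step
    return gates


steps = [0, 5, 25, 50, 100]
fired = [gates_for_ramp_transition("experiment", a, b) for a, b in zip(steps, steps[1:])]
```

With the example ramp steps, only the first and last transitions are gated; the intermediate steps advance on operator action alone under this reading.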
PC-6 §5.8 consumes the GateVerdict returned by PC-5’s evaluate_gate: the verdict field (pass | warn | fail) decides pipeline progression (pass → advance; warn → log + advance for non-blocking judges; fail → halt). The failing_judges list is surfaced in operator notifications and rollout dashboards. Field shapes are owned by PC-5 §5.4 (GateVerdict dataclass).
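How the verdict drives progression can be sketched as below. The `GateVerdict` class here is a hypothetical stand-in with only the two fields named above; the real dataclass is owned by PC-5 §5.4 and may carry more. Halting on an unrecognized verdict value is an assumption, chosen to match the fail-closed posture.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("rollout.gates")


@dataclass
class GateVerdict:
    # Hypothetical stand-in; the real shape is owned by PC-5 §5.4.
    verdict: str  # "pass" | "warn" | "fail"
    failing_judges: list[str] = field(default_factory=list)


def should_advance(result: GateVerdict) -> bool:
    """pass and warn advance the pipeline; fail (or anything else) halts."""
    if result.verdict == "pass":
        return True
    if result.verdict == "warn":
        log.warning("gate warned, advancing: %s", result.failing_judges)
        return True
    # "fail" and any unrecognized value halt the rollout (assumption).
    log.error("gate halted rollout: %s", result.failing_judges)
    return False
```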
Comparative sample availability during ramp. When rollout_mode: experiment, PC-6’s pre_ramp and pre_full gate invocations MUST pass BOTH treatment and control cohort samples to PC-5’s gate API. The scoring approach — whether PC-5 evaluates them comparatively (treatment vs control delta, e.g., “treatment not worse than control by more than X%”) or absolutely (treatment scored against a fixed threshold) — is PC-5’s gate-definition concern (PC-5 §5.5). PC-6’s commitment is sample availability: treatment and control are both sampled and routed to the gate at each milestone.
For rollout_mode: full, the pre_merge gate runs against the single variant (no control cohort exists yet); absolute-threshold scoring is the only available approach for that milestone.
This is the architectural seam: PC-6 provides cohort samples; PC-5 decides how to score them. A future PC-5 amendment may declare comparative scoring as the default for experiment-mode gates without PC-6 changes.
5.9 Experiment-platform integration
The runtime mechanism is Feature Flipper SDK; Eppo is the offline analysis layer. At Fetch, services do not call the Eppo SDK at request time. Feature Flipper performs deterministic variant assignment based on the cohort key the SDK call passes (verified across multiple Fetch backend services + Confluence “Configuring an Experiment Feature Flag”). Feature Flipper’s assignment events flow to Snowflake via the established pipeline; Eppo reads them for stat-sig analysis and dashboard rendering. consumer-agent code touches Feature Flipper, not Eppo.
Runtime call site. consumer-agent calls Feature Flipper’s Python SDK at the factory-invocation boundary, passing the authenticated user_id (and platform / app_version / os_version where applicable for per-version targeting). The SDK returns the variant name; the variant maps to an override map (prompt_commit and other AD fields) that the experiment config resolves (§5.6).
Cohort architecture: parent/child pool/test flag pattern. Matches the live AI Assistant experiment shape (ai_assistant_user_pool → ai_assistant_fab_test). A platform-owned parent (pool) flag declares eligibility (segments, exclusions, platform constraints); the parent pool flag convention is owned by PF-8 (PC-6 inherits, doesn’t define). A child (test) flag declares variant assignment within the pool. PC6 variant experiments follow this nested pattern: variant experiments at consumer-agent run inside the AI Assistant user pool (~200K iOS users) rather than against the full Fetch user base.
Kill-switch position: a third independent flag. The kill-switch is a separate Feature Flipper flag (not a field on the child test flag). The parent/child/kill triad:
- Parent pool flag (PF-8 convention): eligibility (segments, exclusions). Platform-team owned.
- Child test flag: variant assignment + ramp. Experiment owner controls.
- Kill-switch flag: rollout enable/disable, independent of the child flag’s state. On-call / SRE / platform-team can flip without needing the experiment owner.
Independence matters operationally: a kill-switch fired during an incident must drop the rollout immediately without requiring the experiment owner to be paged or the Eppo experiment to be torn down. Firing the kill-switch leaves the Eppo experiment record intact (for post-hoc analysis) but routes 100% of variant traffic to the previous stable (per FR-7, AC-3). Re-enabling the kill-switch resumes the rollout where it left off.
Naming convention: <child_flag_name>_killswitch (e.g., ai_assistant_rewards_v3_killswitch). PF-8 owns the convention.
v1 audience scope: iOS only. The AI Assistant Platform’s variant experiments run inside the existing AI Assistant iOS user pool (~200K users). Android, web, and server-initiated (DM) flows are out of scope for v1:
- Android: future audience expansion. Joining requires an Android parent pool flag (PF-8 convention) and per-app-version targeting alignment (PC-7). PC-6 machinery is platform-agnostic at the variant / cohort layer; the gate is audience definition, not runtime logic.
- Web: not on the AI Assistant roadmap as of PC-6 v1.
- Server-initiated DM flows: PD-3 owns DM-type rollouts and may consume PC-6 machinery for its own LLM-call variants; the audience for those is server-initiated, not user-pool-gated. PD-3 declares its own audience scope.
A v1 implementation that routes traffic to non-iOS users MUST be rejected at the eligibility step of the parent pool flag (PF-8).
Cohort key. user_id (Mobile Flag distribution type, Eppo Mobile User IDs assignment logging table). User identity propagates via request headers (KrakenD pass-through), not direct code params. Sticky assignment is relied upon — once a user is hashed to a variant, they stay there for the experiment’s lifetime.
Operational coordination. Experiment definition + dashboard ownership is an operational handoff with the Fetch experimentation platform team (Pritchard / Himelhoch / Picardo). PC6 specifies the AI Assistant-side seam; Eppo-side configuration follows Fetch platform conventions documented in Confluence “Configuring an Experiment Feature Flag” and “AI Assistant Feature Flag Analysis.”
Failure mode. Feature Flipper SDK unreachable → factory treats user as unassigned (no experiment-arm) → resolves to the previous stable variant. Fail-closed for experiment exposure, fail-open for serving (the user still gets a variant, just not the experimental one).
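The failure mode above can be sketched as a thin wrapper around the assignment call. `get_variant` is a hypothetical stand-in for the Feature Flipper SDK call; its real name, signature, and exception types are not specified in this document, so the sketch treats any exception as "unassigned".

```python
from typing import Callable, Optional


def resolve_arm(get_variant: Callable[[str, str], str], user_id: str, flag_id: str) -> Optional[str]:
    """Return the assigned arm, or None when assignment is unavailable.

    `get_variant` stands in for the Feature Flipper SDK call (hypothetical).
    """
    try:
        return get_variant(user_id, flag_id)
    except Exception:  # assumption: any SDK error means "unassigned"
        return None


def select_overrides(arm: Optional[str], variants: dict, previous_stable: dict) -> dict:
    """Unassigned or unknown arms are served the previous stable variant."""
    if arm is None:
        return previous_stable
    return variants.get(arm, previous_stable)
```

The user is always served something (fail-open for serving), but never the experimental arm unless assignment actually succeeded (fail-closed for exposure).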
Kill-switch ↔ Eppo stat-sig interaction. When the kill-switch fires, Feature Flipper continues to log assignment events (the user was still hashed into a variant), but the runtime serves the previous stable variant per FR-7. Eppo’s stat-sig analysis must distinguish “assigned to treatment but served previous stable” from “assigned to treatment and served treatment” — otherwise the analysis compares apples to oranges.
PC-6’s commitment: the kill-switch state is observable in the trace event stream (FR-12 trace events include rollout_mode: "killed" when the kill-switch is off, distinct from rollout_mode: "experiment" when it’s on). The kill-switch state and its transitions (fired at T, resumed at T2) flow to Snowflake via the standard trace pipeline (PS-5).
The analysis approach in Eppo — whether to filter killed-state samples, annotate the experiment as suspended for the killed window, or apply weighted analysis — is an operational decision owned by the Fetch experimentation platform team (Pritchard / Himelhoch / Picardo, named in Operational coordination above). PC-6 surfaces the data; the experimentation team decides how Eppo consumes it.
5.10 Observability + Rollback
Trace events per rollout decision (FR-12 + AC-10):

    {
      "event": "variant.rollout.assigned",
      "sub_agent_id": "rewards",
      "resolved_variant": "treatment",
      "override_map": { "prompt_commit": "abc123", "model": "gpt-5.4-mini" },
      "agent_definition_version": "5",
      "experiment_arm": "treatment",
      "active_flags": ["ai_assistant_rewards_v3_experiment"],
      "rollout_mode": "experiment",
      "ramp_step_percent": 25
    }

Emitted at factory invocation. Feeds:
- PC5’s eval surfaces (correlate quality scores with prompt version)
- PS5’s trace + event store (audit query path)
- Per-vertical Grafana panels per PF8
Rollback semantics:
- Trigger: kill-switch fires OR eval gate fails OR operator-initiated.
- Effect at next factory invocation: every cohort that was assigned to the now-rollback variant resolves to rollback_target.
- In-flight turns: complete on the variant they started with. No mid-turn reassignment.
- Eppo record: preserved (experiment metadata + assignments) for post-hoc analysis. The rollout is stopped, not erased.
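One way to get "no mid-turn reassignment" is to resolve once per factory invocation into an immutable value that the turn holds onto. The sketch below illustrates that idea for the kill-switch trigger. The `Resolution` type and `resolve` function are hypothetical; `kill_switch_on=True` is taken to mean the rollout is enabled, and the `"killed"` rollout mode mirrors the trace value described in §5.9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolution:
    overrides: tuple  # frozen (field, value) pairs for this invocation
    rollout_mode: str  # "experiment" or "killed", as in the trace events


def resolve(arm: str, variants: dict, rollback_target: dict, kill_switch_on: bool) -> Resolution:
    """Resolve once per factory invocation; the result is immutable.

    kill_switch_on=True means the rollout is enabled; False means killed,
    in which case every arm resolves to rollback_target.
    """
    if not kill_switch_on:
        return Resolution(tuple(sorted(rollback_target.items())), "killed")
    return Resolution(tuple(sorted(variants[arm].items())), "experiment")


variants = {"treatment": {"prompt_commit": "abc123def", "model": "gpt-5.4-mini"}}
target = {"prompt_commit": "prev-stable", "model": "gpt-5.4-nano"}  # placeholder commit

in_flight = resolve("treatment", variants, target, kill_switch_on=True)
after_kill = resolve("treatment", variants, target, kill_switch_on=False)
```

The in-flight value keeps its treatment overrides even after the switch flips; only the next invocation sees the rollback target.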
6. Cross-Section Impact
| Spec | Citation |
|---|---|
| PC1 (Agent Composition) | Inherits prompt-block registry pattern (PC1 §5.7 / Decision 9); cohort discipline at factory invocation (PC1 §5.5). PC6 extends both to XML prompt components and experiment-arm assignment. |
| PC5 (Agent CI/CD Pipeline) | Owns eval-gate definition (judge templates + thresholds); PC6 specifies when gates fire (pre_merge, pre_ramp, pre_full). PC5’s evaluation manifest (configs/evaluation_manifest.yaml) maps dataset categories to judges by ID and does not introspect PC6’s prompt blocks; no cross-section read path between PC5 and PC6 is required. |
| PF1 (Sub-Agent Lifecycle) | Owns the canonical four-state lifecycle; PC6 reuses PF1’s vocabulary for variant rollout per §5.5. |
| PF8 (Feature Flag & Cross-Vertical Observability Conventions) | Owns flag naming (ai_assistant_*), kill-switch lifecycle, required Grafana panels. PC6 reuses these conventions; doesn’t redefine them. |
| PS5 (Trace + Event Store) | Persists PC6’s rollout trace events for audit and correlation with eval scores. |
| PC7 (Mobile Renderer Contract) | Renders prompt-suggestion chips and any UI-visible XML block content (e.g., <can_do> surfaces). A variant change that affects user-visible content (prompt-fork variants shifting chip wording, capability surfaces) crosses PC-6 → PC-3 → PC-7: the deploy bundles the new content, FF routes a cohort, PC-7 renders. Per-app-version targeting on FF cohorts (§5.9) defers to PC-7’s app-version pinning conventions. |
| PD3 (DM Type Registry & Rollout) | PD-3 owns DM-type rollout. When a DM type involves LLM calls whose prompt / model / tuning needs variant-level lifecycle (per-cohort rollout, eval gates, kill-switch), PD-3 consumes PC-6’s machinery for the LLM-call variant — the DM-type rollout itself is PD-3-owned, but the LLM-call inside is variant-gated through PC-6’s surface. Non-LLM DM types do not touch PC-6. Both specs share PF-8’s flag conventions; no PC-6 machinery is added for PD-3 specifically. |
7. Dependencies
Platform spec dependencies: PC1 (Agent Composition), PC5 (Agent CI/CD Pipeline), PF8 (Feature Flag & Cross-Vertical Observability Conventions).
Implementation dependencies:
- Opik (existing) — prompt versioning, automation rules, traces
- Feature Flipper (existing) — runtime variant assignment, kill-switch, ramp; the runtime experiment-platform integration
- Eppo (operational at Fetch; new to consumer-agent’s workflow) — offline analysis of Feature Flipper assignment events via Snowflake. Not a runtime SDK integration in consumer-agent (per §5.9); operational coordination with the experimentation platform team for experiment definition and dashboards.
External dependencies: None.
Cross-section soft dependencies:
- PLT-619 (eval manifest categories): PC6 makes the eval-manifest source of truth structural (XML prompt components); PLT-619 implementation consumes the source-of-truth shift.
- PLT-644 (chips promising unwired capabilities): PC6’s deprecation of capabilities.md prevents future recurrence.
8. Risks & Open Questions
8.1 Risks
R-1: Feature Flipper Python SDK integration risk. The Python Feature Flipper SDK isn’t yet integrated in consumer-agent (the existing usage is in adjacent Fetch services). PC6 specifies the entry-point shape (§5.9), but implementation will surface integration details (SDK auth, polling cadence, network reliability, header-based identity propagation through KrakenD) that may push back on PC6’s contract. Mitigated by treating §5.9 as an interface contract; implementation can refine within the shape. Coordination with the experimentation platform team for Eppo-side experiment definition is a separate operational concern, not a runtime integration risk.
R-2: Migration gap during capabilities.md deprecation. Between PC6 merge and capabilities.md removal, the file still exists and could be read by stale code paths. Three known consumers exist in consumer-agent (chip generator prompt, prompt source resolution code, tests) — all addressed in §5.4 Steps 1-3. A parallel rover-agent/capabilities.md infrastructure exists in rover-agent’s Go runtime (internal/agent/capabilities.go LoadCapabilities function); PC6 explicitly defers rover-agent’s path to a follow-on (OQ-6). The risk: rover-agent’s path continues to read its own capabilities.md and could drift from consumer-agent’s XML-derived view. Mitigated by the CI deprecation gate scoped to consumer-agent only, plus surfacing the coordination need explicitly so rover-agent’s migration lands as a follow-on rather than getting lost.
R-3: Eval-gate fail-closed semantics during PC5 API outage. If PC5’s eval-config API is unreachable when a gate would fire, PC6 fails the gate (rollout halts). This is conservative but means PC5 API outages translate to rollout outages. Mitigated by PC5’s API being a read-only fetch + per-process cache; outages are rare and short.
R-4: Cohort cardinality growth from multiple concurrent experiments. Each active experiment adds an arm dimension to the cohort tuple. N concurrent two-arm experiments → 2^N cohorts in the worst case. Mitigated by Feature Flipper’s sticky assignment (§5.9; a user’s arm doesn’t churn within an experiment) and by the cache-friendly contract only requiring cohort stability, not low cardinality.
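The worst case in R-4 is the product of the arm counts of the concurrently active experiments on a sub-agent. A minimal sketch of that arithmetic (the function name is illustrative):

```python
import math


def worst_case_cohorts(arms_per_experiment: list[int]) -> int:
    """Upper bound on distinct arm combinations across concurrent experiments."""
    return math.prod(arms_per_experiment)


three_two_arm = worst_case_cohorts([2, 2, 2])  # three treatment/control experiments
mixed = worst_case_cohorts([2, 3])             # one two-arm plus one three-arm
```

Under the same-field mutex in §5.6, the number of concurrent experiments per sub-agent cannot exceed the number of overridable Agent Definition fields, which caps the length of the input list.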
R-5: Prompt-block file ↔ Opik sync drift. The repo prompt file is source of truth; Opik holds the version. A merged change that fails to sync to Opik leaves them out of step. Mitigated by the existing cli/opik/prompt.py create sync running on CI post-merge; CI failure halts deploy until sync succeeds.
R-6: Experiment-arm cache staleness across processes. Each consumer-agent process caches Feature Flipper assignments. Process-A may have a user in treatment while Process-B (started later, with updated flag state) sees the same user in control. Mitigated by NFR-1’s bounded-TTL caching policy and the factory-invocation-only resolution (assignment is fixed for the session within a process).
R-7: XML prompt component schema drift. XML blocks are unstructured at the parser level — there’s no schema enforcing <can_do> content shape. A malformed block could ship without CI catching it. Mitigated by per-block lint rules added as the XML component registry grows; not in v1.
R-8: Opik vendor coupling. PC-6 uses Opik for prompt storage, version history, version-level tag promotion, and trace export. If Opik becomes unsuitable (pricing, feature gaps, vendor risk), every Opik integration point would need rework. Mitigated by the PromptSource + PromptStore Protocol abstraction (§5.2 abstraction layer): the runtime calls the Protocols, vertical experiments wire against the Protocols, and a vendor swap re-implements the Protocols instead of editing every call site. Vendor swap remains a non-v1 concern; the abstraction work scales the cost of a future swap from “rewrite everywhere” to “re-implement the Protocol.”
8.2 Open Questions
OQ-1: Experiment-arm propagation to sub-agents during fan-out. When the orchestrator dispatches to multiple sub-agents in one turn (PC2 §5.4), does each sub-agent inherit the orchestrator’s experiment-arm assignment, or do sub-agents resolve their own assignment independently? PC6 leans inherit-from-orchestrator (one cohort per turn, not per-sub-agent) for cache-friendly assembly. Needs PC2 owner consensus (same author, but worth flagging in review).
OQ-2: Feature Flipper Python SDK version + assignment cache TTL. PC6 leaves these as operational. Worth pinning a starting choice (latest Feature Flipper SDK; cache TTL aligned with existing consumer-agent SDK usage pattern) in the implementation PR; PC6’s spec stays high-level.
OQ-3: Vertical-block ownership boundary. The “platform-owned vs vertical-owned” partition in §5.3 is asserted: <identity>, <output_contract>, <core_rules> are platform-owned; everything else is vertical-owned or platform-owned by team consensus. The concrete boundary for blocks like <scope_boundaries> (which mixes platform shape and vertical content) needs validation at review time. Needs Wave 1 reviewer input.
OQ-4: Migration map authorship — who owns producing it? The migration is small (most capabilities.md content already has an XML counterpart), but someone has to author the diff. Lean: PC6 implementer (you / platform team) produces the map; verticals review their respective rows. No external dependency.
OQ-5: Rollback target staleness. The rollback target is recorded in the experiment config at publish time. If multiple rollbacks chain (V_new fails back to V_prev, then V_prev surfaces a regression too), is V_prev_prev the next target, or does the operator manually pick? Lean: the operator picks for the second-level rollback; chaining beyond one level is rare enough to not need automation.
OQ-6: rover-agent’s parallel capabilities.md infrastructure (internal/agent/capabilities.go LoadCapabilities, rover-agent/capabilities.md) is a follow-on migration not in PC6’s scope. Options when that follow-on lands: (a) rover-agent migrates to read consumer-agent’s XML prompt components directly via a service call; (b) consumer-agent generates a snapshot capabilities.md from XML as a one-way build artifact for rover-agent to consume; (c) rover-agent’s path is retired as part of the broader rover-mcp → CCS migration (consumer-context-service#64). PC6 doesn’t pick; it surfaces the options for the rover-agent owner to choose. Needs rover-agent owner input when the follow-on is scheduled.
9. Testing Strategy
9.1 Unit tests
- Rollout-decision resolution: given a cohort tuple + experiment config, returns the correct variant
- Eppo client caching: TTL respected; cache invalidation on version change
- Kill-switch override: kill-switch off → resolves to `rollback_target` regardless of Eppo assignment
- XML prompt component parsing: extracts block names and content; fails cleanly on malformed XML
- Migration map application: given `capabilities.md` content, produces the diff of XML block additions
- Rollback propagation: given a rollback request, the next factory invocation for every previously-assigned cohort resolves to `rollback_target`
- Eval-gate firing logic: at each milestone (`pre_merge`, `pre_ramp`, `pre_full`), the correct PC5 gates are invoked
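
As a rough illustration of what the rollout-decision and kill-switch unit tests would exercise, the resolution logic might look like the sketch below. The `ExperimentConfig` fields other than `rollout_mode` and `rollback_target` are assumptions (the real schema lives in §5.6), and `resolve_version` is a hypothetical function name.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical shape; the real schema is defined in §5.6."""
    rollout_mode: str          # "experiment" or "full"
    arms: Dict[str, str]       # arm name -> prompt version (assumed field)
    rollback_target: str       # previous stable version, recorded at publish
    new_version: str           # version served under rollout_mode "full" (assumed field)


def resolve_version(
    config: ExperimentConfig,
    kill_switch_on: bool,
    assigned_arm: Optional[str],
) -> str:
    """Return the prompt version for one factory invocation."""
    # Kill-switch off wins over any experiment assignment.
    if not kill_switch_on:
        return config.rollback_target
    if config.rollout_mode == "full":
        return config.new_version
    # Unassigned users and unknown arms stay on the previous stable version.
    if assigned_arm is None:
        return config.rollback_target
    return config.arms.get(assigned_arm, config.rollback_target)
```

Keeping the function pure (config and assignment in, version out) is what makes the first and third bullets cheap to test without a flag service.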
9.2 Integration tests
- End-to-end variant rollout: publish new variant → Eppo experiment activates → ramp progresses → eval gate fires → rollout completes (or halts)
- End-to-end kill-switch: rollout in progress → flip kill-switch → next factory invocation resolves to the previous stable version
- `capabilities.md` deprecation gate: CI fails on a PR adding new content to `capabilities.md`
- XML prompt component eval coverage: PR adds a new `<can_do>` block; CI fails until eval coverage is added
- Eppo API outage: simulated unreachable Eppo → factory falls back to unassigned → resolves to the previous stable version
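
One possible shape for the deprecation gate is a check over the PR's unified diff that fails when a deprecated path gains lines. This is a sketch under assumptions: the function names are hypothetical, the deprecated path is passed in by the caller rather than taken from the repo layout, and deletions are treated as allowed.

```python
from typing import Dict, Iterable, List, Optional


def added_lines_by_file(unified_diff: str) -> Dict[str, int]:
    """Count added lines per file in a `git diff` style unified diff."""
    counts: Dict[str, int] = {}
    current: Optional[str] = None
    in_hunk = False
    for line in unified_diff.splitlines():
        if line.startswith("diff --git "):
            # New file section: header lines follow until the first hunk.
            current, in_hunk = None, False
        elif line.startswith("@@"):
            in_hunk = True
        elif not in_hunk and line.startswith("+++ "):
            path = line[4:].strip()
            current = path[2:] if path.startswith("b/") else path
        elif in_hunk and line.startswith("+") and current is not None:
            counts[current] = counts.get(current, 0) + 1
    return counts


def gate_violations(unified_diff: str, deprecated_paths: Iterable[str]) -> List[str]:
    """Return deprecated paths that gained content; non-empty means CI fails."""
    deprecated = set(deprecated_paths)
    counts = added_lines_by_file(unified_diff)
    return sorted(path for path, n in counts.items() if path in deprecated and n > 0)
```

Matching on exact paths rather than on the file name keeps rover-agent's separate `capabilities.md` (OQ-6) outside the gate.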
9.3 Eval coverage (Opik)
- Per-vertical eval suites cover declared XML prompt components
- Eval manifest reader (PLT-619) enumerates capabilities from XML; passes when XML and eval suite agree
- Rollout-quality eval: judge templates that compare new vs previous variant on representative user turns
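
The XML-versus-eval-suite agreement check could be approximated as below. The sketch assumes the prompt file is a sequence of top-level XML blocks that parse as well-formed XML once wrapped in a synthetic root, and that the eval manifest maps block names to suite names; both the manifest shape and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_blocks(prompt_text: str) -> Dict[str, str]:
    """Map each top-level block name to its text content."""
    try:
        # Wrap in a synthetic root because the file holds sibling blocks.
        root = ET.fromstring(f"<prompt>{prompt_text}</prompt>")
    except ET.ParseError as exc:
        # Fail cleanly: one error type for callers, original cause attached.
        raise ValueError(f"malformed XML prompt component: {exc}") from exc
    # Note: if a block name repeats, the last occurrence wins in this sketch.
    return {child.tag: "".join(child.itertext()).strip() for child in root}


def uncovered_blocks(prompt_text: str, eval_manifest: Dict[str, List[str]]) -> List[str]:
    """Return declared blocks that have no eval suite listed in the manifest."""
    blocks = extract_blocks(prompt_text)
    return sorted(name for name in blocks if not eval_manifest.get(name))
```

An empty result from `uncovered_blocks` corresponds to the "XML and eval suite agree" pass condition.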
9.4 Contract tests
- Cross-section with PC1: cohort tuple includes experiment-arm + XML prompt component registry version per §5.7
- Cross-section with PC5: eval-gate API call signature; PC6 invokes PC5’s judge templates by ID; PC5’s API returns the gate verdict
- Cross-section with PF8: feature flag naming follows `ai_assistant_*` convention; kill-switch lifecycle matches PF8’s pattern
- Cross-section with PS5: rollout trace event shape matches PS5’s persistence schema
9.5 Failure-mode testing
- Feature Flipper unreachable during factory invocation: fall back to previous stable; trace event records `rollout_mode: "unassigned"`
- Eval gate fails at `pre_ramp`: rollout halts at 0%; operator notification fires
- Kill-switch fires mid-ramp: every in-process factory invocation reverts to previous stable on the next call
- Opik sync failure post-merge: CI fails the deploy; merged PR doesn’t reach production until sync succeeds
- Concurrent rollback during ramp: rollback target resolution + ramp halt are atomic (no torn state)
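
The first failure mode above (flag service unreachable) can be sketched as a wrapper that falls back to the previous stable version and records the fallback in the trace event. `FlagServiceUnavailable`, `resolve_with_fallback`, and the event fields other than `rollout_mode` are hypothetical; the actual Feature Flipper SDK errors and the §5.10 event shape may differ.

```python
from typing import Callable, Dict, Optional, Tuple


class FlagServiceUnavailable(Exception):
    """Hypothetical error raised by a flag-client wrapper on timeout or outage."""


def resolve_with_fallback(
    fetch_arm: Callable[[str], str],
    user_id: str,
    arms: Dict[str, str],
    previous_stable: str,
) -> Tuple[str, Dict[str, Optional[str]]]:
    """Return (prompt version, trace event) for one factory invocation."""
    try:
        arm = fetch_arm(user_id)
    except FlagServiceUnavailable:
        # Outage: serve previous stable and make the fallback visible in traces.
        return previous_stable, {
            "rollout_mode": "unassigned",
            "experiment_arm": None,
            "prompt_version": previous_stable,
        }
    # Unknown arm names also resolve to previous stable.
    version = arms.get(arm, previous_stable)
    return version, {
        "rollout_mode": "experiment",
        "experiment_arm": arm,
        "prompt_version": version,
    }
```

A failure-mode test then only needs a `fetch_arm` stub that raises, with no network simulation.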
10. Rollout & Observability
10.1 Rollout phases
Phase 1: Spec validation. PC6 reviewed and approved; cross-section contracts confirmed with PC1, PC5, PF8 reviewers.
Phase 2: Migration map. Author the capabilities.md → XML prompt component migration map. Review with verticals owning the relevant capabilities.
Phase 3: CI deprecation gate. Ship the CI check that fails any PR adding new content to capabilities.md (FR-3 + AC-5). Migration map drives any required final additions to XML.
Phase 4: Feature Flipper SDK integration. Implement the entry-point shape from §5.9 (Feature Flipper Python SDK at the prompt-fetch boundary, header-based identity propagation). Wire it into the factory invocation. Smoke-test with a no-op experiment. Coordinate with the experimentation platform team to land the Eppo-side experiment definition for offline analysis.
Phase 5: Experiment config + observability. Implement the experiment config schema (§5.6), the trace event shape (§5.10), and the cache-warming on process startup (NFR-1).
Phase 6: First experiment-gated rollout. Pick a low-risk prompt change (copy edit, single <can_do> block extension), publish under rollout_mode: experiment, validate the end-to-end flow.
Phase 7: capabilities.md removal. After 30 days of zero attempted edits to capabilities.md (per FR-3 deprecation policy), delete the file in a follow-up PR.
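
The Phase 7 trigger is simple date arithmetic. The sketch below assumes the 30-day quiet window restarts at the most recent attempted write and otherwise counts from the day the CI gate shipped; FR-3 may define the window differently, and `removal_eligible` is a hypothetical name.

```python
from datetime import date, timedelta
from typing import Optional

QUIET_WINDOW = timedelta(days=30)


def removal_eligible(
    gate_shipped: date,
    last_write_attempt: Optional[date],
    today: date,
) -> bool:
    """True once 30 days have passed with no attempted edits."""
    # Assumption: the window restarts at the latest attempted write.
    window_start = gate_shipped
    if last_write_attempt is not None and last_write_attempt > gate_shipped:
        window_start = last_write_attempt
    return today - window_start >= QUIET_WINDOW
```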
10.2 Observability metrics
- `prompt.rollout.decision_total` by `prompt_name`, `rollout_mode`, `experiment_arm`: volume of rollout decisions per cohort
- `prompt.rollout.kill_switch_fired_total` by `prompt_name`: kill-switch firing rate; high values indicate rollout-quality regressions
- `prompt.rollout.eval_gate_failed_total` by `prompt_name`, `gate_milestone`, `judge_template`: eval-gate failure rate per milestone; informs PC5’s threshold tuning
- `prompt.opik.sync_failed_total` by `prompt_name`: post-merge Opik sync failures (NFR-2 / R-5)
- `prompt.feature_flipper.fallback_total` by `prompt_name`: Feature Flipper unreachable fallbacks (R-1 visibility)
- `prompt.capabilities_md.write_attempt_total`: attempts to write to deprecated `capabilities.md`; should reach zero after Phase 3
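
For illustration, labelled counters like the ones listed above can be modelled with an in-memory registry. This is a stand-in for whatever metrics client the platform actually uses; `CounterRegistry` and its methods are hypothetical, and only the metric and label names come from the list.

```python
from collections import Counter
from typing import Tuple


class CounterRegistry:
    """Minimal in-memory labelled counters, keyed by metric name and label set."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    @staticmethod
    def _key(name: str, labels: dict) -> Tuple[str, tuple]:
        # Sort labels so keyword order does not create distinct series.
        return name, tuple(sorted(labels.items()))

    def inc(self, name: str, **labels: str) -> None:
        self._counts[self._key(name, labels)] += 1

    def value(self, name: str, **labels: str) -> int:
        return self._counts[self._key(name, labels)]


metrics = CounterRegistry()
metrics.inc(
    "prompt.rollout.decision_total",
    prompt_name="conversational",
    rollout_mode="experiment",
    experiment_arm="treatment",
)
metrics.inc("prompt.feature_flipper.fallback_total", prompt_name="conversational")
```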
10.3 Rollback
PC6 is a contract spec, not deployable code. Rollback semantics apply at three layers:
- Rollout-config rollback: per-version, via the kill-switch + rollback target mechanism (§5.5, §5.10). Standard operational lever.
- CI deprecation-gate rollback: if the deprecation gate produces false positives or unacceptable friction, disable the gate in CI config independently of the spec. Migration timeline extends accordingly.
- Architecture-level rollback: reverting the XML-as-canonical-capability-source decision would require re-introducing `capabilities.md` as a read path. Expensive; not expected.
11. Appendix
11.1 Source references
- PC1 (Agent Composition): prompt-block registry pattern, cohort discipline
- PC3 (Execution Modes & Event Streaming): typed status event registry uses the same file-backed pattern PC6 extends to XML components
- Platform Spec Lab: Wave 1 sequencing; PC6 scope row
- PLT-619 (Eval manifest categories): the eval-manifest gap PC6’s source-of-truth shift closes
- PLT-644 (Constrain prompt-suggestion chips to wired capabilities): the capability-drift symptom PC6’s deprecation policy prevents
- PLT-552 (Feature flags naming convention): `ai_assistant_*` prefix discipline PF8 codifies
- `consumer-agent/prompts/conversational-xml.txt`: the XML prompt structure PC6 commits to as the canonical surface
- `consumer-agent/src/consumer_agent/utils/opik.py`: the Opik client PC6 versioning builds on
- `consumer-agent/src/consumer_agent/cli/opik/prompt.py`: the prompt CLI workflow PC6 codifies
- `consumer-agent/src/consumer_agent/utils/feature_flags.py`: Feature Flipper integration PC6 extends
11.2 Decisions resolved during design
| # | Decision | Resolution |
|---|---|---|
| 1 | Capability declaration surface | XML prompt components (already in conversational-xml.txt) are canonical. capabilities.md deprecated. Migration via CI deprecation gate + 30-day window. |
| 2 | Prompt versioning substrate | Opik (already integrated). File-backed source of truth, Opik for versioning + history + traces. |
| 3 | Rollout default mode | Experiment-gated (Eppo + Feature Flipper). Fast full-rollout exists as explicit rollout_mode: full escape hatch. |
| 4 | Eval-gate firing milestones | pre_merge, pre_ramp, pre_full. PC5 owns gate definitions; PC6 owns the trigger points. |
| 5 | Cohort discipline | Factory-time resolution per PC1 §5.5; experiment-arm membership added to cohort tuple. Per-turn re-evaluation rejected (would break PC1’s cache-friendly prefix). |
| 6 | Rollback propagation | Across all cohorts, on next factory invocation. In-flight turns complete on their assigned version. Eppo record preserved for post-hoc analysis. |
| 7 | Experiment-platform integration | Runtime mechanism is Feature Flipper SDK at the prompt-fetch boundary; Eppo is the offline analysis layer reading Feature Flipper’s assignment events via Snowflake. consumer-agent code does NOT integrate the Eppo SDK at request time. Verified against multiple Fetch backend services + Confluence “Configuring an Experiment Feature Flag.” See §5.9. |
| 8 | Kill-switch independence | Feature Flipper kill-switch operates independently of Eppo analysis state. Firing the kill-switch drops the rollout regardless of where the experiment is in Eppo’s analysis. |
| 9 | prompt_version value-format for PS-5 | Content-addressed SHA256 of the assembled static prompt (first 16 hex chars), computed at sub-agent factory init. PS-5 FR-3 + R-9 explicitly delegate the value-format contract to PC-6; the slot accepts any string opaquely. PC-1’s prompt-blocks composition means a single prompt_commit identifier captures only one block’s version; the assembled-prompt hash captures the full composition. PC-6 does not specify the block list or assembly order; the hash is taken over whatever PC-1’s assembly mechanism produces. Side registry (prompts/registry/assembled-prompts.json) maps hash to block versions for debugging. Considered keeping single-commit semantics and rejected: a value that captures one block but not the assembled composition would mislead consumers slicing by “which prompt was active.” See §5.2 value-format contract. |
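
The value-format in decision 9 (first 16 hex characters of a SHA256 over the assembled static prompt) can be sketched as follows. UTF-8 encoding of the prompt text and the registry entry layout are assumptions; the function names are hypothetical, and the block versions in the usage lines are made-up illustration values.

```python
import hashlib
import json
from typing import Dict


def prompt_version(assembled_prompt: str) -> str:
    """Content-addressed identifier: first 16 hex chars of the SHA256 digest."""
    # Assumption: the assembled prompt is hashed as UTF-8 bytes.
    digest = hashlib.sha256(assembled_prompt.encode("utf-8")).hexdigest()
    return digest[:16]


def registry_entry(assembled_prompt: str, block_versions: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Side-registry record mapping the hash back to block versions for debugging."""
    return {prompt_version(assembled_prompt): dict(block_versions)}


# Illustrative values only; real block names and versions come from PC1's assembly.
assembled = "<identity>...</identity><can_do>...</can_do>"
entry = registry_entry(assembled, {"identity": "3", "can_do": "7"})
print(json.dumps(entry, sort_keys=True))
```

Because the hash is taken over the assembled text, changing any single block changes the identifier, which is the property the single-commit alternative lacked.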
11.3 Migration receipts
- Capabilities surface: `capabilities.md` (deprecated) → XML prompt components in agent-specific XML prompt files. Migration map authored in Phase 2 of rollout; CI deprecation gate ships in Phase 3.
- No other content migration: PC6 doesn’t absorb content from other specs. PC5 (eval gates) and PF8 (flag conventions) remain canonical for their respective domains.