PC3: Execution Modes & Event Streaming
1. Problem Statement
The AI Assistant Platform’s chat path streams every turn through a Server-Sent Events (SSE) wire, and that wire is the entire visible surface of the platform to the mobile client. Today the contract for what travels on that wire is split across three places: rover-agent’s documentation (docs/streaming/streaming.md for the legacy/MCP event flow, docs/streaming/data-loading.md for progressive-rendering data events), consumer-agent’s Python event models (agent/streaming.py), and the translation layer in rover-agent’s pythonagent/mapper.go that converts between the two vocabularies. Nothing centralizes the contract, and nothing names the load-bearing additions the platform’s next phase requires.
PC1 specifies what an agent card is. PC2 specifies how the orchestrator dispatches sub-agents and explicitly defers status-event vocabulary, transport mechanism, and the user-facing wire protocol to PC3 (PC2 §5.7). The Orchestrator-Owned Output decision (Confluence 6500810847) commits the architectural shape (single shared stream writer; orchestrator policy decides per-event); PC3 is the spec that operationalizes the wire-level mechanism that decision relies on.
Without PC3, four things break:
- The wire contract has no canonical source of truth. New verticals (Play, eReceipts, Restaurants, PointPass) onboarding to the chat path read whichever of the three places is closest, invent the pieces that aren’t documented, and ship inconsistencies that surface as client bugs months later.
- Typed status events have no registry. PC1 FR-13 commits sub-agents to emit typed identifiers, not prose (“searching_offers”, not “Searching offers…”), and the orchestrator owns rendering. The registry that makes that enforceable does not exist; without it, FR-13 is a guideline, not a contract.
- Client telemetry for envelope failures is undefined. When CCS returns envelope.status="error", the user receives a natural-language acknowledgment composed by the orchestrator (PC2 FR-14), but the client gets no signal for analytics, retry hints, or debug surfacing. The error surface is the missing seam.
- Downstream specs cannot cite PC3. PS5 (Event Store) lists PC3 as a blocking dependency for event types it persists. PC7 (Mobile Renderer Contract) needs the wire shape to specify client rendering. PS6 (BFF Assembly) needs to know where its envelope’s payload lands in the stream (and where the metadata does not). All three stall without PC3.
PC3 lifts the existing v0.4 wire contract into one canonical citeable spec, names the additive evolution path for typed status events and error telemetry, and locks the boundary between server-side envelope handling (PC2 / PS2) and client-facing frame data (PC3).
Success test: a vertical engineer reads PC3 and predicts — with no source-code archaeology — (a) the SSE event shape their sub-agent’s typed status events take when they reach the client, (b) what reaches the client (and what does not) when their tool fails with envelope.status="error", (c) how their sub-agent’s component data flows from PS6’s EnricherResponse[T] payload into a data_loaded event’s items array, and (d) why scheduled execution (PD1) is not on this contract.
Companion: Miro design board — supplementary architecture diagrams. The spec is the source of truth.
2. Capabilities Source
Per the Platform Spec Lab, PC3 is assigned to operationalize the chat-stack execution-modes and event-streaming surface for the AI Assistant Platform’s consumer-agent + rover-agent runtime. The Spec Lab DAG names PC3 as the spec that PC7 (Mobile Renderer Contract) and PS5 (Event Store) cite, and as the spec that PS6 (BFF Assembly) reconciles with on payload schema (“the single biggest cross-section contract in the lab,” per the Spec Lab page).
PC3 operationalizes the architectural lock from the Orchestrator-Owned Output decision (Confluence 6500810847): single shared stream writer, no bypass stream, orchestrator policy decides per-event whether to forward / transform / suppress / batch. The LangGraph POC referenced in that decision (sub-millisecond TTFE per variant; get_stream_writer() as the ambient primitive every emitter writes to) is PC3’s reference implementation.
PC3 carries through the same divergence pattern PC1/PC2/PS2 established:
- Factory-time binding (vs runtime discovery): the typed status event registry is resolved at sub-agent factory time, not per-turn (FR-11).
- Per-request assembly with stable cohort prefix: the wire envelope shape (event_type, version, timestamp, response_id) is constant across the turn; per-event payload varies (FR-2).
- Orchestrator as the single voice/UX seam: frame emission flows through orchestrator-applied rendering policy (FR-4), even when the underlying stream-writer primitive is ambient and shared.
3. Background & Context
3.1 Today’s reality
The chat path has three load-bearing components and two parallel event vocabularies in production today.
The three components.
- consumer-agent (Python, FastAPI, LangChain v1, LangGraph) runs the agent dispatch loop. The streaming endpoint at consumer-agent/src/consumer_agent/api/main.py emits SSE frames per-turn. The Pydantic event models live in consumer-agent/src/consumer_agent/agent/streaming.py; the stream adapter that converts LangGraph stream-writer output into those events lives in consumer-agent/src/consumer_agent/gateway/stream_adapter.py. An experimental XML path (xml_agent.py, xml_streaming.py, xml_stream_adapter.py) runs in parallel under the consumer_agent_xml_prompt feature flag; it is out of PC3’s scope, and PC3 codifies the canonical path.
- rover-agent (Go) is the SSE relay between mobile clients and consumer-agent. Its internal/pythonagent/ package (adapter.go, mapper.go, client.go) translates between consumer-agent’s Pydantic event shape and rover-agent’s wire-event shape. Its internal/events/ package (types.go, generator.go, extractor.go) owns the wire-event Go structs and the component-extraction logic that produces data_loading and data_loaded frames from the render_* tool path. Its internal/status/manager.go resolves locale-aware status message strings the orchestrator renders during longer operations.
- Mobile client (iOS, web) consumes SSE frames per the version: "0.4" contract documented in rover-agent/docs/streaming/streaming.md and rover-agent/docs/streaming/data-loading.md.
The two parallel event vocabularies.
Consumer-agent’s Pydantic models (the inner, Python-facing vocabulary):
| Event | Type discriminator | Purpose |
|---|---|---|
| TextEvent | type: "text" | LLM-generated content chunk |
| ReasoningEvent | type: "reasoning" | LLM internal reasoning (model-dependent) |
| ThinkingEvent | type: "thinking" | Pre-content thinking indicator |
| ToolCallStartEvent | type: "tool_call_start" | Tool execution begins |
| ToolCallEndEvent | type: "tool_call_end" | Tool execution completes |
| ToolResultEvent | type: "tool_result" | Tool result content |
| UsageEvent | type: "usage" | Token usage at stream end |
| ResponseIdEvent | type: "response_id" | OpenAI response_id (early in stream) |
| EpisodeEvent | type: "episode" | Episode_id (early in stream) |
| CompletedEvent | type: "completed" | Stream completion |
| ErrorEvent | type: "error" | Stream-time error |
| SupportContentEvent | type: "support_content" | Internal only; not sent to client. Consumed by HistoryMiddleware on mixed-intent turns |
Rover-agent’s wire-event vocabulary (the outer, client-facing one):
| Event | event_type value | Purpose |
|---|---|---|
| MCP lifecycle | mcp_list_tools_start, mcp_list_tools_completed, mcp_session_start, mcp_session_progress | MCP transport phases |
| Tool execution | tool_call, tool_completed | Per-tool start/end (note: tool_call_end Python-side maps to tool_completed Go-side) |
| Content | text (with type: "chunk" discriminator), reasoning | Streamed LLM output |
| Data loading | data_loading, data_loaded | Progressive rendering for BFF objects (offers, products, retailers) |
| Component | component | UI component rendered via render_* tool path |
| Completion | completed | End of turn |
| Error | error | Failure with code (INTERNAL_ERROR / RATE_LIMIT_ERROR / IDLE_TIMEOUT / REQUEST_CANCELLED) |
| Cancellation | cancelled | Request cancelled mid-turn |
Every rover-agent wire event carries the envelope {event_type, version, timestamp, response_id, ...}. Consumer-agent’s Pydantic events carry just {type, ...payload}. The translation between the two vocabularies lives in rover-agent/internal/pythonagent/adapter.go (ConvertEvent function) and is the load-bearing seam that PC3’s canonical wire contract is built on.
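The relationship between the two vocabularies can be sketched as a small mapping function. This is an illustration in Python, not the Go ConvertEvent code: the function name to_wire_frame, the mapping table, and the flat payload layout are assumptions. Only the tool_call_end to tool_completed rename, the text chunk discriminator, and the internal-only status of support_content come from the tables above.

```python
from datetime import datetime, timezone
from typing import Optional

WIRE_VERSION = "0.4"

# Inner `type` -> wire `event_type`. Only tool_call_end -> tool_completed is
# stated in the spec; every other entry here is an assumption for illustration.
TYPE_TO_EVENT_TYPE = {
    "text": "text",
    "reasoning": "reasoning",
    "tool_call_start": "tool_call",     # assumption
    "tool_call_end": "tool_completed",  # stated in the wire-event table
    "completed": "completed",
    "error": "error",
}

INTERNAL_ONLY = {"support_content"}  # never sent to the client


def to_wire_frame(inner: dict, response_id: str) -> Optional[dict]:
    """Wrap an inner event in the wire envelope, or return None to drop it."""
    inner_type = inner["type"]
    # Unmapped types are dropped here purely to keep the sketch short.
    if inner_type in INTERNAL_ONLY or inner_type not in TYPE_TO_EVENT_TYPE:
        return None
    payload = {k: v for k, v in inner.items() if k != "type"}
    frame = {
        "event_type": TYPE_TO_EVENT_TYPE[inner_type],
        "version": WIRE_VERSION,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "response_id": response_id,
        **payload,
    }
    if inner_type == "text":
        frame["type"] = "chunk"  # wire-side discriminator for text frames
    return frame
```

The point of the sketch is that the envelope is added in exactly one place, so every frame that leaves the relay carries the same four fields regardless of which inner event produced it.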
Streaming today is informal beyond the documented event types. Sub-agents may write progress strings to the LangGraph stream writer; that path is exercised but not registry-gated. The “typed status events” PC1 FR-13 commits to are not enforced today — emission of an unregistered event type does not fail at sub-agent factory time. The orchestrator-mediated rendering policy described in the orchestrator-owned-output decision is partially implemented but not codified.
The component frame already carries BFF objects. The data_loaded event’s items array carries per-domain BFF objects directly. The rover-agent/docs/streaming/data-loading.md doc uses a /* Offer BFF object — TBD */ placeholder, which indicates the team has already chosen the boundary: the BFF object (PS6’s payload) goes on the wire; envelope metadata (status, partial, cache_meta, principal) does not. PC3 ratifies that choice in §5.8.
3.2 What PC1 leaves to PC3
PC3 inherits from PC1 (Agent Composition):
- Orchestrator-owned output principle (PC1 §5.3, Decision 2): every user-facing response passes through orchestrator composition; sub-agent voices never reach the user unmediated.
- Sub-agent typed status event commitment (PC1 FR-13, Decision 10): sub-agents emit identifiers, orchestrator renders to user-visible strings.
- render_* tool path (PC1 §5.6 / FR-10): the structured payload mechanism that produces component frames; component data persists as tool messages keyed by turn_id and rides the SSE wire as component/data_loaded events.
- Reference-bounded inherited context (PC1 §5.6): when sub-agents emit status events during a fan-out, the orchestrator owns the per-event suppression policy across concurrent emitters.
PC3 inherits from PC2 (Runtime Discovery & Sub-Agent Execution):
- Status-event primitive specified at the dispatch boundary (PC2 §5.7): sub-agents emit typed events, orchestrator owns rendering policy. PC2 explicitly defers vocabulary, transport mechanism, and user-facing wire protocol to PC3.
- Sync seam at CCS (PC2 §5.5): CCS returns EnricherResponse[T] synchronously to sub-agent tool calls; no streaming seam between sub-agent and CCS. The streaming seam is PC3-owned, applied over the orchestrator-composed final response.
- Status-based suppression at the sub-agent tool wrapper (PC2 §5.5, §5.6): the wrapper unwraps the envelope, verifies principal, and decides ok/partial/error semantics before passing data to the sub-agent’s LLM context. PC3 inherits this: envelope metadata never crosses to the wire.
- Failure axes (PC2 §5.8): Axis 1 (sub-agent invocation failure), Axis 2 (CCS status="error"), Axis 3 (partial fan-out failure). PC3 owns the client-side telemetry surface for all three.
- Natural-language failure surfacing (PC2 FR-14): the user-visible prose acknowledging failures is orchestrator-composed; PC3 commits that prose flows as text frames in the orchestrator’s voice, separate from the error-telemetry frame (FR-8).
PC3 inherits from the orchestrator-owned-output decision (Confluence 6500810847):
- Single shared stream writer is the mechanism: LangGraph’s get_stream_writer() is ambient (async context variable), reached by every emitter (orchestrator, sub-agents, deeply nested tools). The “false dichotomy” finding from the POC: there is no separate sub-agent stream bypassing the orchestrator; orchestrator policy decides per-event whether to forward verbatim or transform.
- Parallel-emitter suppression is the same primitive: confirmed working in the LangGraph POC; the orchestrator can suppress, hold, or forward events from concurrent sub-agents.
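The "ambient writer" idea can be illustrated without LangGraph. The sketch below is a stand-in built on Python’s contextvars module; get_writer, sub_agent, nested_tool, and run_turn are hypothetical names, and nothing here claims to reproduce how LangGraph implements get_stream_writer(). It only shows why concurrent emitters at any nesting depth end up writing to one stream.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the ambient writer; not LangGraph's implementation.
_writer = contextvars.ContextVar("stream_writer")


def get_writer():
    """Return whatever writer the current async context carries."""
    return _writer.get()


async def nested_tool() -> None:
    # A deeply nested emitter reaches the same writer without it being passed in.
    get_writer()({"source": "tool", "status": "matching_receipt"})


async def sub_agent(name: str) -> None:
    get_writer()({"source": name, "status": "searching_offers"})
    await nested_tool()


async def run_turn() -> list:
    events: list = []
    _writer.set(events.append)  # the orchestrator installs the single writer
    # gather() wraps each coroutine in a task, and tasks copy the current
    # context, so both sub-agents see the writer installed above.
    await asyncio.gather(sub_agent("shop"), sub_agent("support"))
    return events
```

Because every event lands in the one list the orchestrator owns, forward, transform, suppress, and batch can all be applied at that single point before anything is relayed.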
3.3 What PC3 defers downstream
- PC7 (Mobile Renderer Contract): client-side rendering of streamed frames; iOS/web SSE consumer state machines; client telemetry emission shape; indicator-management semantics. PC3 commits the frame shape on the wire; PC7 commits the client’s behavior on receipt.
- PS5 (Event Store): durable persistence and replay of streamed events. PC3 commits the wire shape and the registered event-type vocabulary; PS5 commits the store schema, retention, and replay surface. One source of truth, the registry from §5.4, feeds both.
- PS6 (Domain Object Enrichment & BFF Assembly): the EnricherResponse[T] envelope shape (enricher_id, domain_type, principal, version, status, partial[], cache_meta, timing, payload). PC3 carries only the payload to the wire (as the items array in data_loaded events); envelope metadata is consumed at the sub-agent tool wrapper per PC2 §5.5.
- PD1 / PD2: scheduled execution and DM delivery. See §3.5; both run on the same AI Assistant Platform infrastructure but emit to transports outside PC3’s SSE wire (PD1’s audit trail, PD2’s notification-service queue).
3.4 Vocabulary
| Term | Meaning |
|---|---|
| Wire | The SSE stream from rover-agent to the mobile client. PC3’s contract surface terminates at the wire. |
| Frame | One SSE event on the wire. Carries the envelope {event_type, version, timestamp, response_id, ...} plus per-event payload. |
| Stream writer | The ambient single-stream primitive every emitter writes to (LangGraph get_stream_writer() is the reference implementation). |
| Status event | A typed identifier (searching_offers, matching_receipt) emitted by a sub-agent during longer operations. Renders to a user-visible string via orchestrator policy. |
| Status string | The user-visible string the orchestrator renders from a status event. Comes from rover-agent/internal/status/manager.go (locale-aware) plus per-event override policy. |
| Component frame | A data_loading or data_loaded event carrying BFF object data for progressive UI rendering. |
| Terminal frame | The frame that ends a turn — completed, error, or cancelled. Every turn emits exactly one. |
| Rendering policy | The orchestrator’s per-event decision on incoming stream-writer events: forward verbatim, transform to user-visible prose, suppress entirely, or batch with concurrent events. |
| Event-type registry | The file-backed catalog of registered typed status events. Sub-agents emitting unregistered types fail at factory time. |
| Wire version | The schema version field on every frame (version: "0.4" today). New optional fields and event types added additively; breaking changes increment. |
| v0.4 baseline | The currently-deployed wire contract documented in rover-agent/docs/streaming/streaming.md and rover-agent/docs/streaming/data-loading.md. |
| v0.5 additions | PC3-introduced additive changes: typed status event vocabulary (status event_type), error telemetry shape, formalized cancellation contract. |
3.5 Why scheduled execution and DM are out of scope
The Spec Lab’s PC3 title is “Execution Modes & Event Streaming”, and the dual scope suggests PC3 owns scheduled and DM-driven execution alongside synchronous chat. PC3 does not, and the reason is structural: scheduled execution (PD1) and DM delivery (PD2) emit to different transports than the SSE wire PC3 governs.
What PC3 governs. The synchronous chat path on the AI Assistant Platform: consumer-agent dispatches turns (per PC1 / PC2), rover-agent relays them as SSE frames to the mobile chat client, the client consumes the wire end-to-end. PC3’s contract is the SSE wire and the frames that travel on it.
Why PD1 (scheduled execution) is out of scope. PD1 produces an audit trail of scheduled-fire outcomes (dispatched, skipped_paused, etc.) and triggers downstream DM delivery — there is no live SSE consumer on the receiving end of a scheduled fire. PD1 reuses the dispatch primitives PC2 commits, but it emits to PD2’s transport (notification-service queue), not to PC3’s SSE wire. The streaming contract PC3 owns has no surface here.
Why PD2 (DM delivery) is out of scope. PD2 delivers via notification-service (Growth-team-owned, per Spec Lab Decisions table). DM payloads are PS6 envelopes carried over notification-service’s transport — push notifications, in-app inbox cards, etc. — not SSE. PC3’s wire shape doesn’t reach this transport, and notification-service’s payload contract is PS6’s territory, not PC3’s.
PD3 (DM Type Registry) for type declarations. PD3 is the DM type registry that PD1 and PD2 reference. PC3 does not own DM type declarations; PD3 does. The cross-flow alignment seam, in which events persisted to episodes (PS5) share one event vocabulary across chat and DM flows, is addressed at the end of this section and in §5.9.
What PC3 does NOT claim. PC3 does not say PD1 or PD2 are “on a different platform.” They run on the same AI Assistant Platform infrastructure PC1/PC2/PS2 govern. The scope boundary is the transport surface: PC3 governs the SSE wire that carries chat frames; PD1’s audit trail and PD2’s notification payload are separate transport surfaces with their own contracts.
If the platform adds an SSE-streamed scheduled or DM mode in the future (e.g., scheduled chat-replay surfaces, in-app DM that streams as it composes), PC3 will be revisited to extend the wire contract to those modes. The v1 commitment is the chat SSE wire only.
Cross-flow event-vocabulary alignment. While PC3 does not govern the DM transport, the event vocabulary persisted to episodes is shared across flows. Both chat and DM sub-agents write to PS5 (Event Store); PS5 reads PC3’s §5.4 registry as the single source of truth for persisted event types. DM-flow sub-agents MUST emit events conforming to the same registry and the same payload-only discipline (no PS6 envelope metadata in persisted events). Cross-flow alignment is via the registry, not via the wire: PC3 owns the wire only, but the registry it commits is consumed by every flow that persists events to PS5. DM-specific event types (if needed) are added to the registry through the same vertical-fragment PR mechanism as chat events. This alignment is to be coordinated with PD1/PD2 owners before those specs lock their event-emission shapes.
4. Requirements
4.1 Functional requirements
FR-1 — Single canonical wire contract. The SSE wire from rover-agent to the mobile client MUST conform to the event vocabulary defined in §5.5. Two parallel vocabularies (consumer-agent’s Pydantic StreamEvent shape and rover-agent’s wire-event shape) MAY exist as internal-side and external-side representations, but the canonical contract (the one new verticals cite, the one PC7 client-side state machines bind to, the one PS5 persists) is the wire shape defined in §5.5.
FR-2 — Envelope shape constant across the turn. Every wire frame MUST carry the envelope fields event_type (string, required), version (string, required), timestamp (ISO-8601 string, required), and response_id (string, required). Per-event payload fields vary; envelope fields do not.
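A conformance check for FR-2 might look like the following minimal sketch. The function name envelope_errors is hypothetical, and the timestamp check leans on datetime.fromisoformat, which accepts only a subset of ISO-8601, so treat it as an approximation of the requirement and not a full validator.

```python
from datetime import datetime

ENVELOPE_FIELDS = ("event_type", "version", "timestamp", "response_id")


def envelope_errors(frame: dict) -> list:
    """Return a list of FR-2 violations for one wire frame (empty if valid)."""
    errors = []
    for name in ENVELOPE_FIELDS:
        if not isinstance(frame.get(name), str) or not frame[name]:
            errors.append(f"missing or non-string envelope field: {name}")
    ts = frame.get("timestamp")
    if isinstance(ts, str) and ts:
        try:
            # A trailing "Z" is only parsed natively on newer Python versions,
            # so normalize it to an explicit UTC offset first.
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            errors.append("timestamp is not ISO-8601")
    return errors
```

A check like this could run in relay tests against recorded frames, independent of what each frame’s payload contains.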
FR-3 — Typed status events, not prose. Sub-agents MUST emit status events as typed identifiers from a registered vocabulary (e.g., searching_offers, matching_receipt, looking_up_purchase_history). Sub-agents emitting prose progress strings directly to the stream writer MUST be rejected at code review. The orchestrator MUST render registered identifiers into user-visible strings — either via locale-aware default messages (per rover-agent/internal/status/manager.go) or via per-event override policy.
FR-4 — Orchestrator owns rendering policy. For every event flowing through the stream writer, the orchestrator MUST apply rendering policy: forward verbatim, transform to user-visible prose in the assistant’s voice, suppress entirely, or batch with concurrent events. Policy is per-event-type and MAY be per-cohort (feature-flag-gated); it is NOT per-sub-agent.
FR-5 — Parallel-emitter suppression. During fan-out (PC2 §5.4), the orchestrator MUST apply suppression policy across concurrent sub-agents writing to the same shared stream writer. The mechanism is the same single-stream primitive; the orchestrator owns rate-limit, collapse, and interleave decisions. There is no separate per-sub-agent stream.
FR-6 — Final response composition flows through the stream. The orchestrator’s composed final response MUST be emitted as text frames on the same wire that carried status events. The voice / safety / format pass applied during composition (PC1 §5.3) MUST be applied before frames are emitted to the wire — frames once emitted are not retroactively edited.
FR-7 — Typed error events for client telemetry. Failure axes (PC2 §5.8 Axis 1, 2, 3) MUST surface to client telemetry as typed error frames carrying an error.code from the controlled enum (§5.7). The error frame MUST NOT carry raw exception text, stack traces, internal hostnames, or EnricherResponse[T] envelope payloads.
FR-8 — Natural-language failure surfacing is separate from telemetry. The user-visible prose acknowledging a failure (PC2 FR-14) MUST be emitted as text frames in the orchestrator’s voice, distinct from the FR-7 error telemetry frame. The user never sees the error code; the client telemetry consumer never sees the natural-language acknowledgment. Both flow on the same wire; they are different frames.
FR-9 — Terminal-frame contract. The wire MUST emit exactly one terminal frame per turn from the set {completed, error (is_final: true), cancelled}, observable per wire segment. In the relay-synthesized-cancellation path (§5.6), the relay’s cancelled frame is authoritative; consumer-agent’s upstream terminal frame is discarded by the relay and does not appear on the client wire. The terminal frame is the authoritative signal that a turn has ended — clients MUST NOT need heuristics (idle timeout, stream close) to detect turn completion. The existing v0.4 SSE convention also emits a data: [DONE]\n\n sentinel after the terminal frame (consumer-agent/agent/streaming.py::format_sse_done) for transport-level stream-close signaling; v0.5 preserves the sentinel for transport compatibility, but client logic for turn-end semantics MUST bind to the terminal frame, not the sentinel.
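The terminal-frame rule lends itself to a simple check over a recorded turn. The sketch below assumes frames are plain dicts and that is_final sits at the top level of an error frame’s JSON; the spec says only that it is a payload-level field, so that placement is an assumption, as are the function names.

```python
TERMINAL_TYPES = {"completed", "cancelled"}


def is_terminal(frame: dict) -> bool:
    if frame["event_type"] in TERMINAL_TYPES:
        return True
    # An error frame is terminal only when its payload says so (FR-17).
    return frame["event_type"] == "error" and frame.get("is_final") is True


def check_terminal_contract(frames: list) -> bool:
    """True when exactly one terminal frame exists and it is the last frame."""
    terminal_positions = [i for i, f in enumerate(frames) if is_terminal(f)]
    return terminal_positions == [len(frames) - 1]
```

Note that the check deliberately ignores the data: [DONE] sentinel, since turn-end semantics bind to the terminal frame and not to the transport-level marker.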
FR-10 — Component frames carry PS6 payloads, not envelopes. Frames carrying structured component data (data_loading, data_loaded, component) MUST carry the unwrapped PS6 payload field directly. The PS6 EnricherResponse[T] envelope metadata (status, partial, cache_meta, principal, version) MUST NOT appear on the wire. Envelope handling — principal verification, status-based suppression, version-skew tolerance — is the sub-agent tool wrapper’s responsibility per PC2 §5.5 and PS2 §5.4.
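As a rough illustration of the payload-only rule, the sketch below builds the data portion of a data_loaded frame from an envelope. The names to_data_loaded and leaks_envelope are hypothetical, the envelope is modeled as a dict whose payload is a list, and nesting items under data alongside data.id is an assumption about the frame layout.

```python
# Envelope metadata that FR-10 says must not appear on the wire.
ENVELOPE_ONLY = {"status", "partial", "cache_meta", "principal", "version"}


def to_data_loaded(envelope: dict, data_id: str) -> dict:
    """Build a data_loaded frame body from the envelope's payload only."""
    return {
        "event_type": "data_loaded",
        "data": {"id": data_id, "items": list(envelope["payload"])},
    }


def leaks_envelope(frame: dict) -> bool:
    """True if any envelope-only key shows up in the frame's data object."""
    return bool(ENVELOPE_ONLY & set(frame["data"]))
```

The frame-level version field from FR-2 is a different thing from the envelope’s version, which is why the leak check looks only inside data.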
FR-11 — Event-type registry, file-backed, factory-validated. Typed status events MUST be registered in a single canonical file-backed catalog versioned via git, reviewed via PR. The catalog declares for each event type: identifier, intended trigger context, recommended orchestrator rendering string (key into the rover-agent/internal/status/manager.go locale map), and default rendering policy. Sub-agents emitting unregistered event types MUST be rejected at sub-agent factory time, not at runtime.
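Factory-time rejection could be sketched as below. The in-memory REGISTRY dict stands in for the file-backed catalog, and its field names (string_key, policy), the build_sub_agent factory, and the exception class are all hypothetical.

```python
# Stand-in for the file-backed catalog; field names are illustrative only.
REGISTRY = {
    "searching_offers": {"string_key": "status.searching_offers", "policy": "forward"},
    "matching_receipt": {"string_key": "status.matching_receipt", "policy": "forward"},
}


class UnregisteredStatusEvent(ValueError):
    """Raised when a sub-agent declares a status event missing from the registry."""


def build_sub_agent(name: str, declared_events: list) -> dict:
    """Factory-time check: refuse to build an agent that declares unknown events."""
    unknown = sorted(set(declared_events) - set(REGISTRY))
    if unknown:
        raise UnregisteredStatusEvent(
            f"{name}: unregistered status event(s): {', '.join(unknown)}"
        )
    return {"name": name, "events": list(declared_events)}
```

This shape assumes sub-agents declare their status events up front, which is what makes validation possible before any turn runs.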
FR-12 — Wire-version compatibility. The wire version is 0.4 at the time of this spec’s commitment, evolving additively to 0.5 with PC3’s additions (typed status event vocabulary, error code taxonomy, formalized cancellation contract). New optional fields MAY be added without bumping version; new event types MUST be additive and MAY be added without bumping version. Breaking changes (renames, type changes, removed fields, semantic shifts) MUST increment version. Clients MUST ignore unknown fields and unknown event types (per the existing v0.4 contract).
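The "ignore unknown fields and unknown event types" rule is what makes additive evolution safe. A client-side dispatcher that honors it might look like this sketch; handle_frame and the handler table are hypothetical, and content as the text field name is an assumption.

```python
def handle_frame(frame: dict, handlers: dict) -> bool:
    """Dispatch a frame; return False when the event type is unknown and skipped."""
    handler = handlers.get(frame.get("event_type"))
    if handler is None:
        return False  # unknown event type: ignore it, do not raise
    handler(frame)    # handlers read only the fields they know about
    return True
```

A v0.4-style handler table with no status entry would skip v0.5 status frames silently, and a text handler that reads only content is unaffected by any optional fields added later.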
FR-13 — Cancellation unwinds in-flight sub-agents and emits a single terminal frame. A client-initiated cancellation (client disconnect, explicit cancel) OR a platform-initiated cancellation (idle timeout, request-cancelled hook) MUST result in: (a) any in-flight sub-agents stopping further emission and unwinding state without leakage; and (b) exactly one cancelled terminal frame reaching the client with an error.code from {IDLE_TIMEOUT, REQUEST_CANCELLED}. The emitter of the terminal frame is path-dependent (orchestrator vs relay) and specified in §5.6.
FR-14 — Frame ordering preserved across the relay. Within a single turn’s stream, frames MUST be delivered to the client in the order the stream writer emitted them. Cross-turn and cross-session ordering is not part of PC3’s contract. The rover-agent SSE relay MUST preserve emit-order.
FR-15 — Tool-call boundary events at the stream writer. For every tool invocation, the stream writer MUST receive a tool_call event at invocation start and a tool_completed event at invocation end. The tool_call.id, tool_call.name, and tool_call.type fields MUST match between the pair. The orchestrator’s rendering policy (FR-4) decides whether either event reaches the wire as a frame; observability (§10.2) and PS5 history persistence read from the stream-writer layer, not the wire, so the events are required regardless of policy.
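The pairing rule can be checked over the stream-writer events of one turn. In this sketch both events are assumed to carry a nested tool_call object with id, name, and type, which is how the requirement phrases the fields; the function name is hypothetical.

```python
def unpaired_tool_calls(events: list) -> tuple:
    """Return (open_ids, orphan_completed_ids) for one turn's stream-writer events."""
    open_calls = {}
    orphans = []
    for event in events:
        call = event.get("tool_call", {})
        if event["event_type"] == "tool_call":
            open_calls[call["id"]] = (call["name"], call["type"])
        elif event["event_type"] == "tool_completed":
            # id, name, and type must all match the opening event.
            if open_calls.get(call["id"]) == (call["name"], call["type"]):
                del open_calls[call["id"]]
            else:
                orphans.append(call["id"])
    return sorted(open_calls), orphans
```

Both lists being empty at the terminal frame is the healthy case; a non-empty first list means a tool_call never completed, and a non-empty second list means a tool_completed arrived without a matching start.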
FR-16 — Component frame correlation via data.id. When a sub-agent emits both a data_loading and a corresponding data_loaded event for the same UI placeholder, the events MUST share data.id. The client uses data.id to replace placeholder with payload (per the existing v0.4 data-loading contract).
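On the receiving side, correlation by data.id reduces to a keyed replace. The sketch below models client UI state as a dict keyed by data.id; apply_data_frame and the state/items layout are hypothetical and stand in for whatever PC7 ultimately specifies.

```python
def apply_data_frame(ui: dict, frame: dict) -> None:
    """Track placeholders by data.id and swap in items when data_loaded arrives."""
    data = frame["data"]
    if frame["event_type"] == "data_loading":
        ui[data["id"]] = {"state": "loading", "items": []}
    elif frame["event_type"] == "data_loaded":
        ui[data["id"]] = {"state": "loaded", "items": data["items"]}
```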
FR-17 — is_final on frames with non-deterministic terminality. Frames whose type can be either terminal or non-terminal (currently error) MUST carry an is_final: bool field on the frame’s payload (not the envelope). The envelope shape defined in FR-2 remains constant across all frame types; is_final is a payload-level field on error frames specifically, consistent with how payload shape varies per event_type (e.g., text.chunk, tool_call.tool_call). The orchestrator emits is_final: true exactly when no further frames follow in the turn. See §5.5 for the error-row payload column.
4.2 Non-functional requirements
NFR-1 — TTFE (time to first event). First frame visible to the client SHOULD be emitted within a small bound of dispatch start. Baseline source: the LangGraph POC referenced in Confluence 6500810847 measured sub-millisecond TTFE per variant. PC3 commits this as a target steady state; concrete production p95 measurement is a follow-up.
NFR-2 — Stream-writer overhead. Per-frame overhead at the stream-writer boundary MUST NOT dominate total turn latency. The LangGraph POC measured sub-millisecond per frame; PC3 commits this as the target.
NFR-3 — Event-type vocabulary cardinality. Registered typed status event identifiers MUST be enumerable from the registry file. Total cardinality SHOULD remain bounded — initial soft target ~50 across all verticals at maturity, derived from current operational pressure (today’s two live verticals — Shop and Support — emit fewer than 10 distinct event types between them; PC1 NFR-3 sizes the platform at ≤10 verticals at maturity; an average of ~5 event types per vertical at maturity produces ~50). The target is a tuning hint, not a hard SLO — growth beyond that triggers a registry-organization review (per-vertical sub-namespacing, deprecation of unused types) but does not block emission. Refine via experiment as verticals onboard.
NFR-4 — Rendering-policy decision latency. The orchestrator’s per-event rendering decision MUST be a dict-lookup against the registry, not an LLM call. Sub-millisecond target.
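A dict-lookup policy decision could be as small as the sketch below. The Policy enum mirrors the four outcomes named in FR-4; the table contents, the cohort override map, and the suppress-by-default fallback for unknown types are assumptions made for illustration.

```python
from enum import Enum


class Policy(Enum):
    FORWARD = "forward"
    TRANSFORM = "transform"
    SUPPRESS = "suppress"
    BATCH = "batch"


# Hypothetical per-event-type table, built once from the registry at startup.
POLICY_BY_EVENT_TYPE = {
    "tool_call": Policy.FORWARD,
    "tool_completed": Policy.FORWARD,
    "searching_offers": Policy.BATCH,
}

# Hypothetical feature-flag cohorts that override the default per event type.
COHORT_OVERRIDES = {"beta": {"searching_offers": Policy.TRANSFORM}}


def decide(event_type: str, cohort: str = "") -> Policy:
    """Constant-time policy lookup; no model call is involved."""
    override = COHORT_OVERRIDES.get(cohort, {}).get(event_type)
    return override or POLICY_BY_EVENT_TYPE.get(event_type, Policy.SUPPRESS)
```

Keying on event type and cohort, and never on the emitting sub-agent, keeps the lookup consistent with FR-4.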
NFR-5 — SSE relay reliability. rover-agent’s SSE relay MUST preserve frame ordering (FR-14) and terminal-frame discipline (FR-9). The relay MUST NOT drop frames silently — dropped frames trigger an alert via stream.sse.frame_dropped_total (§10.2).
NFR-6 — Idle timeout. The relay MUST enforce an idle-timeout policy — if no frames arrive from consumer-agent within the configured window, the relay emits a cancelled terminal frame with error.code=IDLE_TIMEOUT (FR-13). Target window: operational tuning, not a hard contract.
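The idle-timeout behavior can be sketched with asyncio. The actual relay is rover-agent in Go, so this Python version is only a model of the control flow: relay, the queue-based upstream, and the omission of envelope fields on the synthesized frame are all simplifications.

```python
import asyncio


def _is_terminal(frame: dict) -> bool:
    kind = frame["event_type"]
    return kind in ("completed", "cancelled") or (
        kind == "error" and frame.get("is_final") is True
    )


async def relay(upstream: asyncio.Queue, idle_seconds: float) -> list:
    """Forward frames until a terminal frame, or synthesize cancelled on idle."""
    out = []
    while True:
        try:
            frame = await asyncio.wait_for(upstream.get(), timeout=idle_seconds)
        except asyncio.TimeoutError:
            # No upstream frame within the window: emit the single terminal frame.
            out.append({"event_type": "cancelled", "error": {"code": "IDLE_TIMEOUT"}})
            return out
        out.append(frame)
        if _is_terminal(frame):
            return out
```

The timer restarts on every received frame, so the window bounds the gap between frames and not the total turn duration.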
NFR-7 — SSE event size. Per-frame size SHOULD remain under the documented ≤ 256KB target from rover-agent/docs/streaming/data-loading.md. data_loaded frames carrying many items are the realistic worst case; per-enricher payload caps live in PS6 NFR-2.
4.3 Acceptance criteria
AC-1 — Given a sub-agent emits a registered status event (e.g., searching_offers), when the event flows through the orchestrator and the SSE relay, the client receives a frame with event_type="status", the recommended rendering string in the user’s locale (per rover-agent/internal/status/manager.go), and the standard envelope fields (version, timestamp, response_id).
AC-2 — Given two sub-agents emit concurrent status events during a fan-out turn (e.g., searching_offers and looking_up_points_balance), when the orchestrator’s rendering policy is “batch”, the client receives one combined status frame (“Searching offers and looking up your points…”) instead of two separate frames.
AC-3 — Given a sub-agent’s CCS call returns envelope.status="error", when the sub-agent’s tool wrapper handles the envelope per PC2 §5.5, the client receives three categories of frames during the turn: (a) one non-terminal error frame with error.code="CCS_ENVELOPE_ERROR" carrying {enricher_id, reason} composed from PS6 envelope fields (per §5.7), (b) one or more text frames in the orchestrator’s voice acknowledging the failure (“I wasn’t able to look that up right now”), and (c) exactly one terminal completed frame ending the turn. The user sees only (b); client telemetry consumes (a); (c) confirms turn-end per §5.7’s non-terminal-error semantics. Relative ordering of (a) and (b) is not pinned — the orchestrator MAY emit (a) before, during, or after composition of (b); (c) is always last.
AC-4 — Given any turn (successful, failed, or cancelled), when the stream ends, the client has received exactly one terminal frame from the set {completed, error (is_final: true), cancelled}, followed by the data: [DONE]\n\n sentinel.
AC-5 — Given a sub-agent emits a status event identifier that is not in the registry, when the sub-agent factory builds the agent, factory invocation MUST fail with a clear error naming the unregistered event type. The orchestrator’s LLM is never invoked.
AC-6 — Given a client disconnects mid-turn, when the relay detects disconnect, cancellation propagates upstream to consumer-agent (§5.6 client-initiated path), the orchestrator emits a cancelled terminal frame with error.code="REQUEST_CANCELLED" via the stream writer, the relay forwards the frame (if the client transport is still observable; otherwise the frame is captured in upstream telemetry), and in-flight sub-agents unwind. Sub-agents complete their in-flight CCS calls but discard results (discard-on-return disposition).
AC-7 — Given a v0.4 client connects to a turn that uses v0.5 additions (typed status events, error telemetry, formalized cancellation), the v0.4 client ignores unknown event types and unknown fields without breaking — frames render where possible, are dropped where not. No protocol error.
AC-8 — Given a sub-agent emits component data via the render_* tool path, when the data reaches the wire, the client receives a data_loading frame followed by a data_loaded frame. Both share the same data.id. The data_loaded.items array carries the PS6 payload directly; no envelope metadata (status, partial, cache_meta, principal) appears on the wire.
AC-9 — Given concurrent tool invocations from a sub-agent, when each invocation completes, the stream writer receives a tool_completed event whose tool_call.id matches a prior tool_call event’s tool_call.id from the same turn. The tool_call/tool_completed boundary pair MUST be present at the stream-writer layer per FR-15; whether both frames reach the wire depends on rendering policy (FR-4). When policy is forward (the default for tool_call/tool_completed event types), no orphan tool_completed without a prior tool_call reaches the wire; no orphan tool_call without an eventual tool_completed reaches the wire before the terminal frame.
AC-10 — Given the canonical registry adds a new typed status event type via PR, when the PR merges and the next deploy ships, sub-agents may emit the new type. Existing v0.4 clients ignore the new type per AC-7; new v0.5-aware clients render the rendering string from the registry.
AC-11 — Axis 1 single-intent error → frame includes is_final: true; no further frames in turn. Axis 2 sub-agent error with orchestrator fallback → frame includes is_final: false; followed by text + completed.
AC-12 — v0.4 client reading Axis 2 error frame closes on first error; v0.5 client honors is_final: false and consumes subsequent text + completed.
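AC-4 and AC-9 can be checked mechanically against a recorded turn. The following is a minimal sketch rather than part of the spec: `check_turn` and `is_terminal` are illustrative names, the sentinel is represented as the plain string "[DONE]", an `error` frame without `is_final` is assumed to be terminal (v0.4 behavior), and the pairing check assumes the default forward policy for tool_call/tool_completed.

```python
TERMINAL_TYPES = {"completed", "error", "cancelled"}
DONE_SENTINEL = "[DONE]"


def is_terminal(frame: dict) -> bool:
    """An error frame ends the turn only when is_final is true (assumed default: true)."""
    if frame["event_type"] == "error":
        return bool(frame.get("is_final", True))
    return frame["event_type"] in TERMINAL_TYPES


def check_turn(items: list) -> list:
    """Return AC-4 / AC-9 violations for one turn.

    `items` holds decoded frame dicts in wire order, with the string
    "[DONE]" standing in for the end-of-stream sentinel.
    """
    violations = []
    if not items or items[-1] != DONE_SENTINEL:
        violations.append("stream does not end with [DONE]")
    frames = [f for f in items if f != DONE_SENTINEL]

    # AC-4: exactly one terminal frame, and it is the last frame.
    terminal_at = [i for i, f in enumerate(frames) if is_terminal(f)]
    if len(terminal_at) != 1:
        violations.append(f"expected 1 terminal frame, saw {len(terminal_at)}")
    elif terminal_at[0] != len(frames) - 1:
        violations.append("terminal frame is not the last frame")

    # AC-9: every tool_completed matches a prior tool_call, and vice versa.
    open_calls = set()
    for f in frames:
        if f["event_type"] == "tool_call":
            open_calls.add(f["tool_call"]["id"])
        elif f["event_type"] == "tool_completed":
            call_id = f["tool_call"]["id"]
            if call_id not in open_calls:
                violations.append(f"orphan tool_completed {call_id}")
            open_calls.discard(call_id)
    for call_id in sorted(open_calls):
        violations.append(f"tool_call {call_id} never completed")
    return violations
```

A check like this could run in contract tests against captured streams; it says nothing about frame ordering beyond what AC-4 and AC-9 pin.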
5. Solution Design
5.1 The architectural through-line
One shared stream writer carries every event; the orchestrator’s per-event rendering policy decides what reaches the client; the wire evolves additively from the deployed v0.4 baseline.
Three properties hold across every PC3 contract:
- Single stream, orchestrator policy. No bypass stream. No per-emitter wire. “Forward verbatim”, “transform to assistant voice”, “suppress”, “batch” are all policy decisions applied at the orchestrator, not separate architectural paths. This is the load-bearing finding from the LangGraph POC (Confluence 6500810847) and what makes the orchestrator-owned-output decision implementable rather than aspirational.
- Typed events, not prose. Sub-agents emit identifiers from a registered vocabulary; the orchestrator owns the words. The registry is file-backed, factory-validated, and the same registry PS5 reads when persisting events for replay.
- Additive evolution from v0.4. The wire contract today is `version: "0.4"` and explicitly commits to additive evolution (“New optional fields may be added without a version bump”). PC3 ships additions, not a rewrite (typed status event vocabulary, error code taxonomy, formalized cancellation), as `v0.5` additions that v0.4 clients ignore safely.
These three properties produce a system where adding a new sub-agent capability means registering its typed status events, declaring its component data shape, and letting the orchestrator render. No new wire types, no new dispatch primitive, no new client logic for v0.5-aware clients to bind to that wasn’t on the v0.4 wire.
5.2 The stream writer
5.2 The stream writer
The stream writer is an ambient per-invocation primitive every emitter writes to. The reference implementation is LangGraph’s get_stream_writer(), reached via async context variable. Implementation:

    # Inside any sub-agent tool body, anywhere in the dispatch graph
    from langgraph.config import get_stream_writer

    writer = get_stream_writer()
    writer({"event_type": "searching_offers", "tool_id": tool_call_id})

Two properties matter for PC3:
- Ambient access. Every node, subgraph, and tool body in the LangGraph invocation can reach the writer. The orchestrator does not need to plumb a writer object through sub-agent call sites; the runtime does it.
- Single sink. Every write goes through the same writer; the orchestrator (or a configured pre-emit interceptor) sees every event before it reaches the wire. This is what makes per-event rendering policy implementable.
What flows through the writer is internal — not the wire shape. Sub-agents write structured Python dicts shaped like {"event_type": "<registered_id>", ...metadata}. The orchestrator’s interceptor (or the stream adapter at consumer-agent/gateway/stream_adapter.py) maps these to consumer-agent’s Pydantic StreamEvent shape; the rover-agent pythonagent/adapter.go then maps those to the wire envelope shape. PC3 commits the wire shape (§5.5); the internal shape is implementation detail.
Library-agnostic contract. PC3’s contract is the property — single ambient per-invocation primitive every emitter writes to — not the specific library. LangGraph’s get_stream_writer() is the reference; an alternative implementation that preserves the single-ambient-sink property satisfies the contract.
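To make the single-ambient-sink property concrete without depending on LangGraph, here is a minimal sketch built on the standard library's `contextvars`. It is not how LangGraph implements `get_stream_writer()`; `get_writer`, `run_turn`, and the two demo coroutines are hypothetical names used only for illustration.

```python
import asyncio
import contextvars

# One context variable holds the writer for the current invocation (turn).
_writer_var = contextvars.ContextVar("stream_writer")


def get_writer():
    """Return the ambient writer; raises LookupError outside a turn."""
    return _writer_var.get()


async def run_turn(handler, sink: list):
    """Bind a per-turn sink, run the handler, then unbind."""
    token = _writer_var.set(sink.append)
    try:
        await handler()
    finally:
        _writer_var.reset(token)


async def search_offers_tool():
    # No writer argument is plumbed through; the tool reaches it ambiently.
    get_writer()({"event_type": "searching_offers", "tool_id": "call_1"})


async def shop_sub_agent():
    get_writer()({"event_type": "looking_up_purchase_history"})
    # Child tasks copy the current context, so they see the same writer.
    await asyncio.gather(search_offers_tool())


async def demo():
    turn_a, turn_b = [], []
    # Two concurrent turns: each task gets its own context and its own sink.
    await asyncio.gather(
        run_turn(shop_sub_agent, turn_a),
        run_turn(search_offers_tool, turn_b),
    )
    return turn_a, turn_b
```

Because every emitter in a turn resolves to the same callable, an interceptor placed in front of `sink.append` would see every event, which is the property the rendering policy in §5.3 depends on.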
5.3 Orchestrator rendering policy
5.3 Orchestrator rendering policy
For every event the stream writer receives, the orchestrator applies rendering policy. Policy is per-event-type, declared in the event-type registry (§5.4), with optional per-cohort override (feature-flag-gated, resolved at factory invocation per PC1 §5.5 cohort discipline).
The four policy decisions:
| Policy | When applied | Wire result |
|---|---|---|
| forward | Default for events that translate 1:1 to the wire (tool calls, text chunks, component frames) | Emit as wire frame with envelope translation |
| transform | Default for typed status events; orchestrator renders the registered identifier to a user-visible string (locale-aware via rover-agent/internal/status/manager.go) | Emit as status wire frame with rendered string |
| suppress | Internal-only events (e.g., support_content is consumed by HistoryMiddleware, never reaches the wire); also operator-overridden suppression for events the orchestrator decides not to surface this cohort | No wire emission |
| batch | When two or more events of the same or related types arrive within a configurable window, the orchestrator emits one combined frame | One combined status or text frame |
Per-cohort override. The cohort is determined at factory invocation (PC1 §5.5) — Agent Definition version + registry versions + feature-flag-gated block presence. Per-cohort rendering overrides allow A/B testing the rendering string for a status event, or temporarily suppressing an event type during a partial rollout. Overrides are dict-lookup at write time (NFR-4). Feature flag naming for any cohort-override gate follows PF-8’s ai_assistant_* convention (PC-3 inherits, doesn’t define).
Why per-event, not per-sub-agent. A sub-agent emits a heterogeneous mix of events (status events, tool calls, component frames). Per-sub-agent policy would force every event from a given sub-agent to the same treatment; per-event-type policy lets the orchestrator forward tool calls verbatim while transforming status events.
Why transform is the default for status events (the collapse decision). Every status event uses the same wire frame (event_type: "status" with data.event_id + rendered data.message). The mobile handler surface stays flat — one handler renders every status event. The alternative (per-phase event types like event_type: "searching_offers", event_type: "matching_receipt") would force per-phase mobile handlers and grow the wire vocabulary every time a vertical adds a work moment. The collapse is a deliberate departure from the OpenAI Responses API / Vercel AI SDK / Anthropic streaming per-phase pattern, in favor of long-term wire stability and orchestrator-mediated voice. Server-side rendering keeps voice and locale handling centralized. This is a platform-shaping decision in the same class as PC-1 “Agent Definition is a list of registered references”, PC-2 “Routing is not a service”, and PS-2 “Description quality is the load-bearing gate” — worth lifting into the Platform Spec Lab Decisions table.
Parallel-emitter suppression (FR-5) is rendering policy applied across concurrent sub-agent writes. The orchestrator sees both writes (single shared stream writer); the policy collapses them into one batched frame, interleaves them by emit-order, or suppresses one in favor of the other. The LangGraph POC confirmed this works on the same primitive — no separate parallel-emitter machinery.
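The per-event policy lookup described in this section can be sketched as follows. This is an illustration under stated assumptions, not the orchestrator's implementation: `RegistryEntry`, `LOCALE_STRINGS`, `resolve_policy`, and `render` are hypothetical names, the locale map is reduced to an in-memory dict, envelope fields are omitted, and the batch policy is left out because its window semantics are configuration-dependent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    id: str
    default_render_key: str
    default_policy: str  # forward | transform | suppress | batch


# Stand-in for the locale map; key and string are illustrative only.
LOCALE_STRINGS = {"status.searching_offers": "Searching for offers..."}


def resolve_policy(entry: RegistryEntry, cohort_overrides: dict) -> str:
    """Per-cohort override is a plain dict lookup at write time."""
    return cohort_overrides.get(entry.id, entry.default_policy)


def render(event: dict, registry: dict, cohort_overrides: dict) -> Optional[dict]:
    """Return the payload to put on the wire, or None for no wire emission."""
    entry = registry.get(event["event_type"])
    if entry is None:
        return None  # unknown identifiers are kept off the wire
    policy = resolve_policy(entry, cohort_overrides)
    if policy == "suppress":
        return None
    if policy == "transform":
        # Collapse: every status event uses the same wire frame type.
        return {
            "event_type": "status",
            "data": {
                "event_id": entry.id,
                "message": LOCALE_STRINGS[entry.default_render_key],
            },
        }
    if policy == "forward":
        return dict(event)
    raise NotImplementedError(f"policy {policy!r} is not covered by this sketch")
```

Keying the lookup on event type rather than on the emitting sub-agent is what lets one sub-agent have its tool calls forwarded while its status events are transformed.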
5.4 Typed status event registry
5.4 Typed status event registry
The registry is a file-backed catalog of typed status event identifiers, composed at boot from per-vertical fragments plus a platform fragment. Each vertical owns verticals/<vertical>/status_events.yaml; the platform owns platform/status_events.yaml. The platform loader merges all fragments into the effective registry at service startup. Naming collisions fail-loud at factory time. Mirrors PC1 §5.7 prompt-block registry change-control discipline.
Schema (one entry per event type):

    - id: searching_offers
      description: Sub-agent is searching for offers (sub-second to multi-second operations)
      default_render_key: status.searching_offers  # key into rover-agent/internal/status/manager.go locale map
      default_policy: transform                    # forward | transform | suppress | batch
      emitter_subagents: [shop]                    # sub-agents permitted to emit this type
      lifecycle: active                            # active | deprecated
    - id: matching_receipt
      description: Sub-agent is matching a scanned receipt against purchase history
      default_render_key: status.matching_receipt
      default_policy: transform
      emitter_subagents: [ereceipts]
      lifecycle: active
    - id: looking_up_purchase_history
      description: Sub-agent is fetching purchase history from CCS
      default_render_key: status.looking_up_purchase_history
      default_policy: transform
      emitter_subagents: [shop, rewards]
      lifecycle: active

Lifecycle.
- Addition: PR adds an entry; CI check verifies `default_render_key` exists in `rover-agent/internal/status/manager.go`’s locale map; sub-agent that will emit is updated in the same PR.
- Deprecation: PR sets `lifecycle: deprecated` and adds a deprecation note; sub-agents stop emitting; after 30 days minimum AND zero emissions for 7 consecutive days (per `stream.status_event.emitted_total{event_id}` metric), the entry is removed. Mirrors PS6 envelope-version deprecation policy.
- Renaming: not supported; a rename is a deprecation + addition pair.
Factory-time validation (FR-11). At sub-agent factory invocation, the runtime checks every event type the sub-agent’s code path could emit against the registry’s active and emitter_subagents entries. Unregistered emissions or wrong-sub-agent emissions fail the factory build. Implementation: AST-level scan of writer({"event_type": "..."}) call sites in the sub-agent’s module tree, or runtime assertion at the writer interceptor.
Registry ownership. Platform-mechanism, vertical-content. The platform owns the merge mechanism, the schema, and the factory validation. Verticals own their fragment file (entries they emit, identifiers they declare). Cross-vertical identifiers live in the platform fragment. Adding a sub-agent to an existing event’s emitter_subagents list is a registry PR (additive change to an existing entry, reviewed by the owning team of that entry — platform for cross-vertical identifiers, the originating vertical for vertical-specific identifiers).
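The merge mechanism the platform owns can be sketched as below. It assumes each fragment file has already been parsed into a list of entry dicts; `merge_fragments` and the owner labels are hypothetical, and the real loader may report collisions differently.

```python
def merge_fragments(fragments: dict) -> dict:
    """Merge per-owner fragments into one registry keyed by event id.

    `fragments` maps an owner label (for example "platform" or a vertical
    name) to the list of entries parsed from that owner's fragment file.
    A duplicate id across fragments raises instead of silently winning.
    """
    merged, owner_of = {}, {}
    for owner, entries in fragments.items():
        for entry in entries:
            event_id = entry["id"]
            if event_id in merged:
                raise ValueError(
                    f"status event id {event_id!r} declared by both "
                    f"{owner_of[event_id]!r} and {owner!r}"
                )
            merged[event_id] = entry
            owner_of[event_id] = owner
    return merged
```

Raising on the first collision keeps the failure loud and names both owners, which matches the ownership split described above: the platform enforces uniqueness, verticals decide content.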
Unknown-identifier behavior. Fail-loud at boot — the factory validator walks every emitted event-type identifier in registered sub-agents and asserts presence in the merged registry; service refuses to start on miss. Fail-quiet at runtime — if a dynamically constructed identifier slips through (e.g., interpolation), suppress on the wire, log a warning, emit status_event.unregistered_id metric. Two-layer defense mirrors PC1 §5.7.
Wire exposure of data.event_id. The identifier rides the wire alongside the rendered data.message. Identifiers are semver-stable from v0.5 — same wire-stability discipline that already applies to tool_name on tool_call_* events. Mobile may use data.event_id to drive per-event-type affordances (icon, animation). Renames follow additive-rename discipline (ship new identifier alongside old, deprecate, remove in a future major version).
5.5 Wire vocabulary
5.5 Wire vocabulary
The canonical wire shape every frame conforms to:

    event: <event_type>
    data: {"event_type": "<event_type>", "version": "0.5", "timestamp": "<ISO-8601>", "response_id": "<id>", ...per-event-payload}

Followed at end-of-turn by:

    data: [DONE]

Registered event types in v0.5 (additive superset of v0.4):
| event_type | Purpose | Payload fields (in addition to envelope) | v0.4 / v0.5 |
|---|---|---|---|
| `text` | Streamed LLM content (orchestrator-composed) | chunk (string) | v0.4 (existing) |
| `reasoning` | LLM internal reasoning (model-dependent) | chunk (string) | v0.4 (existing) |
| `thinking` | Pre-content thinking indicator. Emitted as the orchestrator’s LLM produces reasoning tokens. Optional; clients MAY ignore. | {content: str, role: "reasoning"} matching OpenAI Responses API reasoning chunks | v0.4 (existing) |
| `tool_call` | Tool invocation begins | tool_call: {id, name, type} | v0.4 (existing) |
| `tool_completed` | Tool invocation ends | tool_call: {id, name, type} | v0.4 (existing) |
| `data_loading` | Component data loading begins | data: {id, type, key: {ids: [...]}} | v0.4 (existing) |
| `data_loaded` | Component data ready | data: {id, type, key: {ids: [...]}, items: [...PS6 payloads]} | v0.4 (existing) |
| `component` | UI component rendered via render_* tool path | chunk: <component JSON>, tool_call: {id, name, type} | v0.4 (existing) |
| `response_id` | OpenAI response_id (emitted early in stream) | no payload field; the envelope-level response_id field IS the data carried by this frame type. Emitted exactly once per turn, at turn open. | v0.4 (existing) |
| `episode` | Episode_id (emitted early in stream) | episode_id (string) | v0.4 (existing) |
| `usage` | Token usage at stream end | input_tokens, output_tokens, total_tokens, reasoning_tokens, cached_tokens | v0.4 (existing) |
| `status` | NEW in v0.5. Typed status event from sub-agent, rendered by orchestrator | data: {event_id (registry identifier), message (rendered string)} | v0.5 (additive) |
| `error` | Failure signal: terminal for Axis 1 single-intent dispatch (ends the turn with no completed), non-terminal for Axis 2 / Axis 3 (followed by completed after orchestrator-composed degraded response). See §5.7 for axis-to-terminality mapping. | payload-level fields: error: {code, ...axis-specific payload per §5.7} + is_final: bool (terminal-vs-recoverable indicator; true for Axis 1 terminal case, false for non-terminal). Both fields sit inside the payload alongside each other; is_final is NOT an envelope field (FR-17). | v0.4 (extended in v0.5 with formalized code taxonomy and axis-specific payload composition; see §5.7) |
| `completed` | Successful stream termination | (none) | v0.4 (existing) |
| `cancelled` | NEW in v0.5. Mid-turn cancellation | error: {code: "IDLE_TIMEOUT" \| "REQUEST_CANCELLED"} | v0.5 (formalized; previously folded into error) |
Internal-only event types (consumer-agent Pydantic; never reach the wire): support_content is consumed by HistoryMiddleware and suppressed by orchestrator rendering policy. Future internal events follow the same pattern — they are stream-writer outputs the orchestrator’s suppress policy keeps off the wire.
MCP-lifecycle event types (mcp_list_tools_start, mcp_list_tools_completed, mcp_session_start, mcp_session_progress) are deployed today and remain in v0.5 for backwards compatibility. Their continued use is tied to the rover-mcp → CCS migration (consumer-context-service#64); they will be deprecated when the migration completes per FR-12 deprecation policy.
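As an illustration of the canonical frame shape, the sketch below serializes an envelope plus per-event payload into one SSE frame. It is written in Python for consistency with the other examples and is not the rover-agent relay's code; `to_sse`, `WIRE_VERSION`, and `DONE` are hypothetical names, and the millisecond UTC timestamp format is inferred from the examples in this section.

```python
import json
from datetime import datetime, timezone

WIRE_VERSION = "0.5"
DONE = "data: [DONE]\n\n"  # end-of-turn sentinel


def to_sse(event_type: str, response_id: str, payload: dict = None, now: datetime = None) -> str:
    """Serialize one frame: envelope fields first, then the per-event payload."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    timestamp = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    envelope = {
        "event_type": event_type,
        "version": WIRE_VERSION,
        "timestamp": timestamp,
        "response_id": response_id,
        # Payload keys are merged last; callers must not reuse envelope keys.
        **(payload or {}),
    }
    data = json.dumps(envelope, separators=(",", ":"))
    return f"event: {event_type}\ndata: {data}\n\n"
```

A turn would be written as a sequence of `to_sse(...)` strings followed by `DONE`.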
Wire example (v0.5, a chat turn with a status event, a tool call, a component frame, and successful completion):

    event: response_id
    data: {"event_type":"response_id","version":"0.5","timestamp":"2026-05-15T18:00:00.000Z","response_id":"resp_abc"}

    event: thinking
    data: {"event_type":"thinking","version":"0.5","timestamp":"2026-05-15T18:00:00.100Z","response_id":"resp_abc"}

    event: status
    data: {"event_type":"status","version":"0.5","timestamp":"2026-05-15T18:00:00.300Z","response_id":"resp_abc","data":{"event_id":"searching_offers","message":"Searching for offers..."}}

    event: tool_call
    data: {"event_type":"tool_call","version":"0.5","timestamp":"2026-05-15T18:00:00.400Z","response_id":"resp_abc","tool_call":{"id":"call_1","name":"search_offers","type":"mcp"}}

    event: tool_completed
    data: {"event_type":"tool_completed","version":"0.5","timestamp":"2026-05-15T18:00:01.200Z","response_id":"resp_abc","tool_call":{"id":"call_1","name":"search_offers","type":"mcp"}}

    event: data_loading
    data: {"event_type":"data_loading","version":"0.5","timestamp":"2026-05-15T18:00:01.300Z","response_id":"resp_abc","data":{"id":"offer-list-1","type":"offer_list","key":{"ids":[{"id":"OFF_1"},{"id":"OFF_2"}]}}}

    event: data_loaded
    data: {"event_type":"data_loaded","version":"0.5","timestamp":"2026-05-15T18:00:01.800Z","response_id":"resp_abc","data":{"id":"offer-list-1","type":"offer_list","key":{"ids":[{"id":"OFF_1"},{"id":"OFF_2"}]},"items":[{"id":"OFF_1","...":"..."},{"id":"OFF_2","...":"..."}]}}

Note on the data_loaded frame: the `items[]` shape is pinned by the PS6 `(domain_type, enricher_id)` registry entry for `offer_list` (PS6 §5.2); each item is the unwrapped PS6 `payload`, never the full envelope.

    event: text
    data: {"event_type":"text","version":"0.5","timestamp":"2026-05-15T18:00:02.000Z","response_id":"resp_abc","chunk":"Here are some offers near you..."}

    event: completed
    data: {"event_type":"completed","version":"0.5","timestamp":"2026-05-15T18:00:02.500Z","response_id":"resp_abc"}

    data: [DONE]

5.6 Cancellation contract
A turn ends with exactly one terminal frame from {completed, error, cancelled} (FR-9). Cancellation has its own terminal type (v0.5 formalization; previously folded into error with code IDLE_TIMEOUT or REQUEST_CANCELLED).
Triggers.
- Client-initiated: client disconnects from the SSE stream, or sends an explicit cancel signal. Detected at rover-agent’s relay (context cancellation on the HTTP handler).
- Platform-initiated: idle-timeout policy; the relay receives no frames from consumer-agent within the configured window. Emit `cancelled` with `error.code="IDLE_TIMEOUT"`.
Propagation — client-initiated path.
- The client disconnects from the SSE stream (or sends an explicit cancel signal). rover-agent’s relay detects this via HTTP context cancellation.
- The relay propagates cancellation upstream to consumer-agent (HTTP request context cancellation).
- consumer-agent’s dispatch loop observes the cancelled context. The orchestrator stops emitting non-terminal frames.
- In-flight sub-agents tolerate cancellation per the cancellation-tolerance subsection below.
- The orchestrator emits one `cancelled` terminal frame with `error.code="REQUEST_CANCELLED"` via the stream writer. The relay forwards the frame and the `data: [DONE]\n\n` sentinel to whatever transport is still open (client may have already disconnected, in which case the frame is observable only in upstream telemetry).
Propagation — platform-initiated idle-timeout path.
- The relay receives no frames from consumer-agent within the configured idle-timeout window.
- The relay synthesizes one `cancelled` terminal frame with `error.code="IDLE_TIMEOUT"` directly and emits it to the client, followed by `data: [DONE]\n\n`. The orchestrator’s contribution is bypassed by definition (it was silent).
- The relay separately propagates cancellation upstream to consumer-agent (HTTP request context cancellation) so that any silently-stuck sub-agents unwind per the cancellation-tolerance subsection below. Frames emitted upstream after this point are discarded by the relay; the client’s view of the turn is already terminated.
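The platform-initiated path can be sketched as an idle-timeout wrapper around the upstream frame iterator. The deployed relay lives in rover-agent, so this Python version is purely illustrative: `relay` is a hypothetical name, `upstream` is assumed to be an async generator of decoded frames, and closing the generator stands in for HTTP request context cancellation.

```python
import asyncio


async def relay(upstream, idle_timeout: float):
    """Forward frames; synthesize a cancelled terminal frame when upstream goes silent."""
    iterator = upstream.__aiter__()
    while True:
        try:
            # The idle window restarts for every frame received.
            frame = await asyncio.wait_for(iterator.__anext__(), idle_timeout)
        except StopAsyncIteration:
            return  # upstream ended on its own; it already sent its terminal frame
        except asyncio.TimeoutError:
            # The relay, not the orchestrator, produces the terminal frame.
            yield {"event_type": "cancelled", "error": {"code": "IDLE_TIMEOUT"}}
            yield "[DONE]"
            # Stand-in for propagating cancellation upstream; anything the
            # upstream produces after this point is never forwarded.
            await iterator.aclose()
            return
        yield frame
```

The client's view of the turn ends at the synthesized frame, regardless of what the upstream does afterwards.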
Cancellation tolerance for in-flight sub-agents. In both paths, sub-agents observing the cancelled context MUST:
- Complete any in-flight CCS call (CCS is synchronous per PC2 §5.5). The sub-agent does NOT propagate cancellation into CCS itself — CCS-side cancellation propagation is a connector-framework concern outside PC3’s scope.
- Discard the CCS result rather than feeding it back to the orchestrator.
- Not start new CCS calls.
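The three rules above can be sketched as a small wrapper around a CCS call. This is an assumption-laden illustration: it models cancellation as a cooperative `asyncio.Event` flag rather than task cancellation, and `call_ccs_tolerating_cancel` and `ccs_call` are hypothetical names.

```python
import asyncio


async def call_ccs_tolerating_cancel(ccs_call, cancelled: asyncio.Event):
    """Run one CCS call under the discard-on-return disposition.

    Returns the CCS result, or None when the turn was cancelled before
    or during the call.
    """
    if cancelled.is_set():
        return None  # rule 3: do not start new CCS calls
    result = await ccs_call()  # rule 1: an in-flight call runs to completion
    if cancelled.is_set():
        return None  # rule 2: discard the result instead of feeding it back
    return result
```

Checking the flag on both sides of the await is the whole technique: the call itself is never interrupted, which matches the statement that cancellation does not propagate into CCS.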
Why cancellation gets its own terminal type. v0.4 conflates cancellation with error (same event_type: "error", distinguished by error.code). v0.5 separates them because: (a) cancellation is not a failure, it is a user-or-system-initiated stop; (b) PC7’s client-side state machine should distinguish “the turn was cancelled by me” from “the turn failed unexpectedly” for analytics and retry; (c) PS5’s event store benefits from a structurally distinct terminal type for replay semantics. v0.4 clients still see cancelled as an unknown event type and fall back to treating the stream as terminated (per their FR-12 ignore-unknown-event-types behavior).
5.7 Error event contract
5.7 Error event contract
When any of PC2’s three failure axes (§5.8) fires during a turn, the wire MUST carry a typed error frame for client telemetry, separate from any natural-language acknowledgment the orchestrator composes (FR-8). The invariant, “don’t surface degraded data as truth to the user”, is locked by PC2 FR-14 + AC-5; PC3 codifies the mechanism. The typed event composes from existing PS6 envelope fields rather than introducing new vocabulary wherever an envelope is available.
Error code taxonomy (v0.5). Codes are split by the terminal frame they ride on: error.code values on error frames carry failure telemetry; error.code values on cancelled terminal frames carry cancellation cause. Two tables, two frame types.
Codes on the error frame (failure telemetry):
| error.code | Trigger | Axis | Payload composition |
|---|---|---|---|
| `INTERNAL_ERROR` | Generic uncategorized server error | (catch-all) | minimal |
| `RATE_LIMIT_ERROR` | LLM rate limit (consumer-agent or upstream OpenAI) | (infrastructure) | minimal |
| `SUB_AGENT_FAILED` | NEW v0.5. PC2 Axis 1: sub-agent failed entirely (timeout, runtime exception); no PS6 envelope returned | Axis 1 | {sub_agent_id}, a PC3-specific identifier (no envelope to compose from) |
| `CCS_ENVELOPE_ERROR` | NEW v0.5. PC2 Axis 2: sub-agent’s CCS call returned envelope.status="error" | Axis 2 | {enricher_id, reason}, composed from PS6 envelope fields (PS6 §5.2) and PS6’s reason enum (PS6 §5.4 / PS1 §5.3.1) |
| `PARTIAL_FAN_OUT` | NEW v0.5. PC2 Axis 3: one or more sub-agents in a fan-out failed; others succeeded | Axis 3 | {failed: [...per-sub-agent records, each shaped as Axis 1 or Axis 2 above]} |
Codes on the cancelled terminal frame (cancellation cause, per FR-13 / §5.6):
| error.code | Trigger | Payload composition |
|---|---|---|
| `IDLE_TIMEOUT` | Relay idle timeout fired | minimal |
| `REQUEST_CANCELLED` | Client disconnected mid-stream | minimal |
Why compose from PS6 vocabulary for Axis 2. Rather than introducing a PC3-specific failure-category vocabulary, the Axis 2 payload reuses enricher_id (already present and typed on every PS6 envelope per §5.2) and reason (already a controlled enum per PS6 §5.4 / PS1 §5.3.1 — upstream_unavailable / upstream_timeout / upstream_partial / unauthorized / invalid_request). This composition has three benefits: (a) PS6 FR-12 leakage prevention applies automatically — reason is a controlled enum, sanitization is enforced upstream; (b) when PS6 adds a new failure mode to the enum, PC3’s wire surface inherits it without spec changes; (c) cross-section observability composes — the same (enricher_id, reason) pair that Notification Service uses to gate notifications appears in client telemetry, so dashboards can join across consumers.
Why minimal {sub_agent_id} for Axis 1. Axis 1 (sub-agent failed entirely) means no envelope was ever returned — there’s no PS6 vocabulary to compose from. The payload is therefore PC3-specific: a sub-agent identifier and nothing more. Per FR-12 the payload MUST NOT include raw exception text, stack traces, or internal request IDs even though PS6’s enum doesn’t constrain it.
Why composite {failed: [...]} for Axis 3. Axis 3 is by definition a fan-out where some sub-agents succeeded and others failed. Each per-sub-agent failure carries either Axis 1 or Axis 2 shape; the Axis 3 frame is the envelope that aggregates them. The aggregation lets client telemetry split per-axis counters even when a turn produces multiple concurrent failures.
What the error frame’s payload MUST NOT carry (regardless of axis):
- Raw exception text or stack traces.
- The PS6 `EnricherResponse[T]` envelope itself, or its metadata fields beyond `enricher_id` (`status`, `partial[]`, `cache_meta`, `principal`, `timing`, `payload` MUST NOT appear).
- Internal hostnames, internal request IDs, or file paths.
- Upstream service identifiers beyond what the registered PS6 `reason` enum already exposes (e.g., not raw FIDORA/NELI error bodies; PS6’s `partial[].source` field is a server-side concern and stays at the wrapper).
Distinct from the natural-language acknowledgment. PC2 FR-14 mandates the user-visible prose is orchestrator-composed and natural-language. PC3 FR-8 keeps that prose on text frames in the orchestrator’s voice; the error frame is for client telemetry only. PC7 owns how the client surfaces or hides the error frame’s payload — by default, hide; for debug builds or specific analytics surfaces, expose.
When the error frame is terminal vs non-terminal.
- Terminal: Axis 1 in single-intent dispatch. The sub-agent failed entirely and the orchestrator has no useful content to compose around. The `error` frame ends the turn. (`completed` is not emitted.)
- Non-terminal: Axis 2 (CCS failure mid-turn) and Axis 3 (partial fan-out failure). The orchestrator composes a degraded response from what succeeded. The `error` frame is informational telemetry; the turn ends with `completed`.
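The payload composition and terminality rules of this section can be sketched as a few constructors. Treat this as an illustration under assumptions: the function names are hypothetical, `PS6_REASONS` is a snapshot of the enum values listed in this section (PS6 remains the owner), envelope fields are omitted, and whether per-sub-agent records inside `failed` repeat the `code` key is not pinned by the text.

```python
# Snapshot of the PS6 reason enum as listed in this section; PS6 owns the list.
PS6_REASONS = {
    "upstream_unavailable",
    "upstream_timeout",
    "upstream_partial",
    "unauthorized",
    "invalid_request",
}
AXIS_CODES = {"SUB_AGENT_FAILED", "CCS_ENVELOPE_ERROR", "PARTIAL_FAN_OUT"}


def sub_agent_failed(sub_agent_id: str) -> dict:
    """Axis 1: no envelope exists, so the payload is the identifier only."""
    return {"code": "SUB_AGENT_FAILED", "sub_agent_id": sub_agent_id}


def ccs_envelope_error(enricher_id: str, reason: str) -> dict:
    """Axis 2: composed only from the envelope's enricher_id and reason."""
    if reason not in PS6_REASONS:
        raise ValueError(f"reason {reason!r} is not in the PS6 enum snapshot")
    return {"code": "CCS_ENVELOPE_ERROR", "enricher_id": enricher_id, "reason": reason}


def partial_fan_out(failed: list) -> dict:
    """Axis 3: aggregate of Axis 1 / Axis 2 shaped records."""
    return {"code": "PARTIAL_FAN_OUT", "failed": list(failed)}


def error_frame(error: dict, single_intent: bool) -> dict:
    """Attach is_final: true only for Axis 1 in single-intent dispatch."""
    if error["code"] not in AXIS_CODES:
        raise ValueError("this sketch covers only the three axis codes")
    is_final = error["code"] == "SUB_AGENT_FAILED" and single_intent
    return {"event_type": "error", "error": error, "is_final": is_final}
```

Because the constructors accept only identifiers and enum values, raw exception text and envelope metadata have no path into the payload.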
Why these three discriminators rather than reusing INTERNAL_ERROR. v0.4 collapses all sub-agent failures into INTERNAL_ERROR, which gives the client no signal for analytics (which axis was hit? how often does CCS fail vs sub-agent runtime exceptions vs partial fan-out?). v0.5’s three discriminators split the failure surface for analytics, retry policy decisions, and SLO tracking. Crucially, only the discriminator is PC3-specific — the Axis 2 payload composes entirely from PS6 vocabulary, so PC3’s new-vocabulary surface is genuinely small.
5.8 Reconciliation with PS6
5.8 Reconciliation with PS6
The Spec Lab page calls PC3 ↔ PS6 “the single biggest cross-section contract in the lab.” The reconciliation is:
PS6 owns the envelope schema. PC3 carries the envelope’s `payload` to the wire. Envelope metadata never crosses the wire.
Concretely.
| Layer | Owns | Carries |
|---|---|---|
| PS6 | EnricherResponse[T] schema — enricher_id, domain_type, principal, version, status, partial[], cache_meta, timing, payload: T | CCS’s sync REST/MCP response body |
| PS2 / PC2 | Sub-agent tool wrapper that unwraps the envelope — verifies principal (PC2 §5.5 / PS6 FR-11), maps status to dispatch outcome (PC2 §5.6 / PS6 FR-5), tolerates version skew (PS6 §5.4 deprecation policy) | Server-side handling; nothing reaches the wire |
| PC3 | The wire shape; the data_loading / data_loaded / component events that carry component data to the client | data_loaded.items carries payload: T from the envelope. Envelope metadata is consumed at the wrapper and never reaches the wire. |
| PC7 | Client-side rendering of data_loaded.items as UI cards | Mobile app rendering |
Why payload only. Three reasons:
- Production already chose this boundary. `rover-agent/docs/streaming/data-loading.md` shows `data_loaded.items` carrying `/* Offer BFF object — TBD */`, meaning the deployed wire already carries unwrapped payloads. PS6’s `EnricherResponse[T]` is its sync REST/MCP response shape, not its DM-delivery shape or its client-facing shape.
- Envelope metadata is server-side concern. Principal verification (PS6 FR-11) is a security boundary at the consumer (consumer-agent or notification-service), not at the client. Status-based suppression (PS6 FR-5) decides whether the orchestrator gets any data from a sub-agent; by the time data reaches the wire, the status decision is already made. Cache metadata (cached / TTL / key) is observability, not user-facing.
- Coupling cost. If the client received the envelope, it would inherit PS6’s version-skew machinery (envelope `version` field, deprecation windows), need to understand PS6’s status/partial taxonomy, and need to handle envelope-level errors distinctly from frame-level errors. Keeping the envelope server-side preserves the client’s single-vocabulary (PC3’s wire) view of the world.
Where payloads enter the wire. A sub-agent’s tool wrapper (PS2 Path 1, CCS-backed) returns the unwrapped payload to the sub-agent’s LLM context. If the sub-agent’s response includes a render_* tool call (PC1 §5.6 / FR-10), the tool result carries the payload as a component frame. The component frame rides the wire as data_loaded.items (for collections of BFF objects) or component (for single rendered UI components).
PS6 envelope-version skew is invisible to PC3. When PS6 bumps envelope.version (additive or breaking, per PS6 §5.4), the sub-agent tool wrapper tolerates skew per PS6’s deprecation policy. The wire never sees a version mismatch — it sees a payload: T that matches whatever schema T is at this rollout point.
status="partial" handling at the seam. When CCS returns status="partial" (PS6 FR-5: non-empty partial[], no critical failures), the sub-agent’s tool wrapper passes the partial payload to the sub-agent’s LLM. The sub-agent may surface a soft caveat in its narration (“I found these offers, though some retailers were unavailable”) which becomes text frames; or the sub-agent may emit a component frame with the partial data. The client receives the partial payload in data_loaded.items and the soft caveat in text — never the raw partial[] array.
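The payload-only boundary can be sketched as an unwrap step at the tool wrapper followed by construction of the `data_loaded` payload. Names and shapes here are assumptions for illustration: `unwrap` and `data_loaded_fields` are hypothetical, `principal` is compared as an opaque value, and the payload is assumed to be a list of items that each carry an `id`.

```python
# Envelope fields that must stay server-side.
ENVELOPE_METADATA = ("status", "partial", "cache_meta", "principal", "timing", "version")


def unwrap(envelope: dict, expected_principal):
    """Consume envelope metadata at the wrapper; hand back the payload only.

    Returns None when the envelope reports status "error". A "partial"
    envelope still yields its payload; the partial[] array is dropped here.
    """
    if envelope["principal"] != expected_principal:
        raise PermissionError("envelope principal does not match the caller")
    if envelope["status"] == "error":
        return None
    return envelope["payload"]


def data_loaded_fields(data_id: str, data_type: str, items: list) -> dict:
    """Per-event payload for a data_loaded frame; envelope fields are added elsewhere."""
    return {
        "data": {
            "id": data_id,
            "type": data_type,
            "key": {"ids": [{"id": item["id"]} for item in items]},
            "items": items,
        }
    }
```

Since `data_loaded_fields` never receives the envelope, there is no code path by which `partial[]` or `cache_meta` could reach the wire.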
5.9 PS5 boundary (event store)
5.9 PS5 boundary (event store)
PS5 (Event Store) cites PC3 as a blocking dependency for the event types it persists. The cut:
- PC3 owns the wire shape, the registered event-type vocabulary, the additive evolution policy.
- PS5 owns the durable store schema, retention policy, replay surface, query API.
- The registered event-type registry from §5.4 is the single source of truth consumed by both. PS5 reads the registry at startup; new event types added to the registry persist automatically; deprecated types continue to be readable from the store until their data ages out.
PC3 does not commit a persistence shape; PS5 does. PC3 does commit the wire shape PS5 persists — which means a v0.5 addition to the wire vocabulary becomes a v0.5 addition to PS5’s store schema in the same PR pair (PC3 registry update; PS5 schema migration if needed).
Cross-flow registry conformance. DM-flow sub-agents (PD-1 scheduled, PD-2 notification-service delivery) emitting events to PS-5 MUST use the same §5.4 registry and the same payload-only discipline (no PS-6 envelope metadata persisted) as chat-flow sub-agents. PC-3 owns the wire only, but the registry it commits is the persistence vocabulary for every flow. This prevents per-flow event-type fragmentation in PS-5 (e.g., dm_searching_offers invented separately from chat’s searching_offers) and keeps cross-flow queries possible without bespoke join logic.
5.10 The two-vocabulary translation layer
5.10 The two-vocabulary translation layer
Consumer-agent’s Pydantic StreamEvent shape and rover-agent’s wire-event shape are different in three ways:
- Type discriminator field: consumer-agent uses `type` (`type: "text"`); rover-agent uses `event_type` (`event_type: "text"`).
- Envelope fields: consumer-agent’s Pydantic events don’t carry `version`, `timestamp`, `response_id` as envelope-level fields; they’re added by the rover-agent relay at translation time. Some events (like `ResponseIdEvent`) carry the response_id as a payload field consumer-agent-side; rover-agent promotes it to envelope-level.
- Name mappings: `tool_call_start` (Python) → `tool_call` (wire); `tool_call_end` (Python) → `tool_completed` (wire). These mappings live in `rover-agent/internal/pythonagent/adapter.go::ConvertEvent`. (Removed in v0.5: `thinking` is now a first-class wire frame; see §5.5. The legacy `thinking → mcp_session_start` mapping no longer applies.)
This translation layer is a load-bearing seam and a source of drift. PC3 doesn’t eliminate it (the two-vocabulary split predates this spec and removing it is a separate workstream), but PC3 commits the wire vocabulary as canonical — i.e., when Python-side and Go-side disagree, Go-side is the source of truth, because that’s what reaches the client. Consumer-agent’s Pydantic shape is an internal-only artifact that translates into the wire.
Dual-vocabulary precedence and migration path. In v0.5, both vocabularies ride the wire. Mobile prefers status events for user-facing progress UI; tool_call_* events are retained for framework-level telemetry. Vertical authors are encouraged to emit status events for any work moment worth signaling, rather than relying on auto-emitted tool_call_* to drive UI. PC3 commits a migration direction without pinning a timeline: a future amendment to this spec moves tool_call_* to a telemetry-only channel (no longer reaches user-facing wire); a later breaking-version bump removes tool_call_* from the user-facing wire entirely. Each milestone requires coordinated rover-agent + mobile change; PC3 v0.5 does not gate on that timeline.
v0.4-compat: v0.4 clients that read error as terminal will close the wire on the first error frame regardless of is_final. For Axis 2 errors (recoverable), this means v0.4 clients miss the trailing text + completed frames but still terminate cleanly (no orphaned wire state). v0.5 clients honor is_final and consume the recovery path. Verticals SHOULD prefer Axis 1 (terminal error) over Axis 2 (recoverable) while v0.4 clients are in deployed circulation.
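The difference between the two client generations can be made concrete with a small simulation. It is a sketch only: the `consume` function, the nested `error.code` layout, the top-level position of `is_final`, and the choice to treat a missing `is_final` as terminal are assumptions, not the mobile client's actual logic.

```python
def consume(frames, honors_is_final):
    """Return the event types a client processes before it closes the stream."""
    seen = []
    for frame in frames:
        kind = frame["event_type"]
        seen.append(kind)
        if kind == "completed":
            break
        # v0.4 behaviour: any error is terminal.
        # v0.5 behaviour: only errors marked is_final are terminal
        # (a missing is_final is assumed to mean terminal).
        if kind == "error" and (not honors_is_final or frame.get("is_final", True)):
            break
    return seen


# An Axis 2 (recoverable) turn: non-terminal error, then the recovery path.
axis2_turn = [
    {"event_type": "error", "error": {"code": "CCS_ENVELOPE_ERROR"}, "is_final": False},
    {"event_type": "text", "content": "degraded answer"},
    {"event_type": "completed"},
]

v04_view = consume(axis2_turn, honors_is_final=False)
v05_view = consume(axis2_turn, honors_is_final=True)
```

Under these assumptions the v0.4 view stops after the `error` frame while the v0.5 view reaches `completed`, which is the gap the compatibility note describes.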
6. Cross-Section Impact
| Spec | Citation |
|---|---|
| PC1 (Agent Composition) | Inherits orchestrator-owned output principle (PC1 §5.3, Decision 2); typed status event commitment (PC1 FR-13, Decision 10); render_* tool path producing component frames (PC1 §5.6 / FR-10). |
| PC2 (Runtime Discovery & Sub-Agent Execution) | Inherits status-event primitive (PC2 §5.7); §5.7 error event contract carries client telemetry for the failure axes PC2 §5.8 defines. Carries CCS sync-seam contract (PC2 §5.5) — envelope handling is server-side at the sub-agent tool wrapper, never on the wire. |
| PC7 (Mobile Renderer Contract) | Consumes PC3’s wire shape on the client; binds client-side state machine to event types defined in §5.5; client telemetry consumes §5.7 error frames; rendering and indicator-management semantics are PC7’s. |
| PS5 (Event Store) | Persists wire events; reads the registered event-type registry from §5.4 as source of truth for what to persist. PC3 commits the wire shape and registry; PS5 commits the durable schema and replay API. |
| PS6 (Domain Object Enrichment & BFF Assembly) | “Single biggest cross-section contract in the lab.” PC3 carries PS6’s payload: T to the wire in data_loaded.items; envelope metadata is server-side at the sub-agent tool wrapper. §5.8 spells out the reconciliation. |
| PS2 (Connector Framework) | Sub-agent tool wrappers (PS2 Path 1, CCS-backed) are where PS6 envelope handling happens; PC3 inherits the wrapper’s commitment to unwrap envelope and surface only payload to the wire. |
| PD1 / PD2 (Scheduled Execution / DM Delivery) | Out of PC3 scope at the transport surface (see §3.5). PD1 produces an audit trail of scheduled fires (no live SSE consumer) and triggers PD2; PD2 delivers via notification-service (Growth-team-owned). Both run on the same AI Assistant Platform infrastructure but emit to transports outside PC3’s SSE wire. PD2’s payload is a PS6 envelope on notification-service’s transport, governed by PS6 + notification-service, not PC3. Registry alignment: DM-flow sub-agents emitting events to PS5 use PC3’s §5.4 registry (same identifiers, same payload-only discipline); cross-flow event vocabulary is shared even though transports differ (§3.5 + §5.9). |
| PF1 (Agent Lifecycle) | Promotion semantics for an Agent Definition (PF1) include the typed status event registry entries the Agent Definition’s sub-agents emit — the registry is part of the bundle PF1 promotes. |
| PF8 (Feature Flag & Cross-Vertical Observability Conventions) | Stream-side metrics emitted from the wire (latency histograms, error counters, terminal-frame counters) carry the five mandatory slicing dimensions from PF8 §5.8 (vertical, sub_agent_id, dm_type, experiment_arm, agent_definition_version). PC3 commits the wire shape; PF8 owns the dimension vocabulary the metrics consume. |
7. Dependencies
Platform spec dependencies: PC1 (Agent Composition), PC2 (Runtime Discovery & Sub-Agent Execution).
Implementation dependencies:
- LangChain v1 with `create_agent` and tool result handling
- LangGraph state machines and `get_stream_writer()` (the ambient stream-writer primitive)
- FastAPI SSE for the consumer-agent → rover-agent transport
- rover-agent’s SSE relay (`internal/pythonagent`, `internal/events`, `internal/status`)
- OpenAI Responses API for `response_id` emission and reasoning-token usage
External dependencies: None.
Cross-section soft dependencies (PC3 does not block on these but commits to them when they land):
- PS6 envelope schema (consumer-context-service#77): PC3 §5.8 references PS6’s `EnricherResponse[T]` shape.
- rover-mcp → CCS migration (consumer-context-service#64): MCP-lifecycle event types (`mcp_list_tools_start`, etc.) deprecate when the migration completes.
8. Risks & Open Questions
8.1 Risks
R-1: Two-vocabulary translation drift. Consumer-agent’s Pydantic shape and rover-agent’s wire shape meet at the translation seam in pythonagent/mapper.go::ConvertEvent, where they can drift apart. New event types added to one side without corresponding handling on the other corrupt the wire, typically by producing frames with the wrong event_type or missing envelope fields. Mitigated by PC3 making the wire vocabulary canonical (§5.10), CI contract tests that exercise round-trips (§9.4), and the eventual-consolidation follow-on.
R-2: Stream-writer suppression policy correctness during fan-out. Wrong suppression policy during concurrent sub-agent execution is user-visible — collapsing two status events into one when they should be sequential leaks one sub-agent’s progress; failing to collapse floods the user. The LangGraph POC confirmed the mechanism works (single shared stream writer; orchestrator policy decides per-event), not that any specific policy is correct. Mitigated by eval coverage on fan-out scenarios with representative concurrent status emission patterns (§9.3) and per-cohort policy override (§5.3) as the operational tuning lever.
R-3: Terminal-frame discipline under cancellation. Sub-agents that don’t tolerate mid-execution cancellation can leak state — incomplete CCS calls, unreleased connection pools, partial history-persistence writes. The contract is FR-13 + AC-6; enforcement is testing-driven. Mitigated by integration tests on cancellation scenarios (§9.5) and sub-agent factory-time validation that every sub-agent registers a cancellation handler.
R-4: Additive evolution discipline. PC3 v0.5 is additive on v0.4. Future evolution must remain additive within version value, breaking only at version bumps. The published policy is in the data-loading.md doc (“New optional fields may be added without a version bump. Breaking changes require a version increment.”). The risk: a future PR adds a “small” semantic change (renaming a field, tightening a type) under the existing version and breaks v0.4 clients silently. Mitigated by PR review discipline (no semantic changes without version bump) and by a CI contract test that compares wire shapes against a v0.4 fixture.
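One way to picture the CI contract test for R-4 is a recursive "additive only" comparison between a v0.4 fixture and a candidate frame. The function name `is_additive` and both fixtures are hypothetical; the real fixture format is not specified here, and the check below only covers field presence and JSON type.

```python
def is_additive(baseline: dict, candidate: dict) -> bool:
    """True if every baseline field survives in candidate with the same type."""
    for key, old in baseline.items():
        if key not in candidate:
            return False  # field removed or renamed
        new = candidate[key]
        if type(old) is not type(new):
            return False  # field type changed or tightened
        if isinstance(old, dict) and not is_additive(old, new):
            return False  # nested object lost or changed a field
    return True


# Hypothetical fixtures, for illustration only.
v04_fixture = {"event_type": "data_loaded", "version": "0.4", "data": {"id": "d1"}, "items": []}
additive_candidate = {**v04_fixture, "version": "0.5", "extra_optional_field": "ok"}
renamed_candidate = {"event_type": "data_loaded", "version": "0.5", "data": {"id": "d1"}, "results": []}
```

A check like this catches renames and type changes but not purely semantic changes (for example, a field that keeps its name and type but changes meaning), so it complements PR review rather than replacing it.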
R-5: Wire pollution from upstream envelope-skew (consumer of PS2-owned risk). PS6 envelope-version skew at the sub-agent tool wrapper is a PS2-owned risk (PS2 OQ-4 covers deserialization-layer behavior under skew). PC3 is the downstream observer of that risk: if the wrapper emits a partial or malformed payload, the resulting component frame on the wire is malformed. PC3 doesn’t own the fix; PC3 cites the dependency so reviewers know PC3’s wire correctness has a PS2 prerequisite. Mitigated upstream by PS6’s deprecation policy (30-day window, zero-use gate per PS6 §5.4) and PS2’s deserialization tolerance.
R-6: Internal event types leak to wire under suppression-policy bugs. support_content is internal-only (consumed by HistoryMiddleware, suppressed at orchestrator). A bug in the suppression policy could let it reach the wire, where a v0.4 or v0.5 client would render it incorrectly. Mitigated by orchestrator-side default-deny-then-allow on internal event types — every internal type explicitly listed as suppress, with CI assertion.
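The default-deny mitigation in R-6 can be sketched as a policy table plus an assertion that CI could run. The table contents, the `decide` and `internal_types_not_suppressed` names, and the choice to suppress any unlisted type are assumptions for illustration; only the four policy names and the internal status of `support_content` come from this spec.

```python
# Event types that must never reach the wire.
INTERNAL_TYPES = {"support_content"}

# Hypothetical per-type rendering policy table.
POLICY = {
    "status": "transform",
    "text": "forward",
    "support_content": "suppress",
}


def decide(event_type: str) -> str:
    """Look up the rendering policy; anything unlisted is suppressed (default-deny)."""
    return POLICY.get(event_type, "suppress")


def internal_types_not_suppressed() -> list:
    """Return internal types whose explicit policy is missing or not 'suppress'."""
    return sorted(t for t in INTERNAL_TYPES if POLICY.get(t) != "suppress")
```

The second function checks for an explicit `suppress` entry rather than relying on the default, so adding a new internal type without listing it would still fail the assertion.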
R-7: Event-registry growth erodes orchestrator rendering clarity. Unbounded vocabulary growth (every vertical adds 10+ event types) bloats the registry, makes rendering policy hard to reason about, and increases the cardinality of stream.status_event.emitted_total. NFR-3’s soft cap (under 50 at maturity) is the mitigation; per-vertical sub-namespacing is the escalation pattern if the cap is exceeded.
R-8: cancelled event type rollout risk. v0.5’s cancelled event type is new. v0.4 clients fall back to ignoring it (per AC-7), which means they see a stream that ends without completed and may treat it as a network error rather than user cancellation. Mitigated by v0.4 clients having already-deployed handling for unexpected stream termination; concrete client behavior is PC7’s domain.
Rollout gate: during the v0.5 rollout window with v0.4 clients still in circulation, the relay emits cancelled immediately (the v0.5 frame). v0.4 clients ignore the unknown event type per FR-12, treating the stream as terminated and falling back to their existing unexpected-termination handling (network-error analytics path). The trade-off: v0.4 clients miscategorize user cancellations as network errors in the analytics window between v0.5 ship and minimum-v0.5-client-version enforcement. Acceptable because: (a) the misclassification is analytics-only (no user-facing breakage), and (b) the alternative (relay falls back to v0.4 error + error.code=IDLE_TIMEOUT/REQUEST_CANCELLED while any v0.4 client exists) blocks v0.5 client adoption indefinitely. Minimum-v0.5-client-version threshold: when PC7 enforces a minimum app version covering ≥95% of MAU, the v0.4 fallback path can retire entirely; threshold-tracking is operational (PC7’s per-app-version targeting surface).
8.2 Open Questions
None outstanding. Prior design-phase questions are captured in §11.2.
9. Testing Strategy
9.1 Unit tests
- Stream-writer policy decisions: given event type and current rendering policy, the orchestrator produces the correct forward / transform / suppress / batch decision
- Event-type registry validation: factory invocation succeeds for a sub-agent emitting only registered types; fails with a clear error for unregistered types (AC-5)
- Wire envelope shape: every emitted frame conforms to `{event_type, version, timestamp, response_id, ...}` with required fields populated
- Terminal-frame emission: a turn that completes successfully emits exactly one `completed`; a turn that fails emits exactly one `error`; a turn that cancels emits exactly one `cancelled` (AC-4)
- `data_loading` / `data_loaded` correlation: shared `data.id` across the pair (AC-8, FR-16)
- Tool-call boundary events at the stream-writer layer: every `tool_call` event has a paired `tool_completed` event with matching `tool_call.id` (AC-9, FR-15); wire-layer presence depends on rendering policy
- Two-vocabulary translation: every Python-side `StreamEvent` maps to a wire frame with correct envelope fields (round-trip tests in `rover-agent/internal/pythonagent/mapper_test.go` and equivalent)
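The registry-validation case (AC-5) might look roughly like the following. This is a sketch under assumptions: how sub-agents declare the event types they emit is not specified in this section, so `validate_declared_events`, `UnregisteredEventError`, and the in-memory `REGISTRY` set are hypothetical stand-ins for the file-backed registry.

```python
class UnregisteredEventError(ValueError):
    """Raised at factory time when a sub-agent declares an unknown event type."""


# Hypothetical in-memory view of the §5.4 registry.
REGISTRY = {"searching_offers"}


def validate_declared_events(sub_agent_id: str, declared, registry) -> None:
    """Fail fast, before any LLM call, if a declared event type is not registered."""
    unknown = sorted(set(declared) - set(registry))
    if unknown:
        raise UnregisteredEventError(
            f"{sub_agent_id} declares unregistered event types: {', '.join(unknown)}"
        )


# Passes silently: every declared type is registered.
validate_declared_events("shop", ["searching_offers"], REGISTRY)
```

Listing every unknown type in one error, rather than stopping at the first, is a small design choice that keeps the "clear error" requirement cheap to satisfy.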
9.2 Integration tests
- End-to-end synchronous chat turn: user request → orchestrator dispatch → sub-agent invocation → status event → tool call → component frame → composed text → terminal `completed`. Frame order matches §5.5 example.
- Parallel fan-out with status events: two sub-agents emit concurrent status events; orchestrator’s batching policy collapses to one wire frame (AC-2)
- Error event surfacing on `envelope.status="error"`: sub-agent’s CCS call fails → wrapper unwraps → orchestrator composes degraded `text` response → emits non-terminal `error` frame with `CCS_ENVELOPE_ERROR` → emits `completed` (AC-3)
- Cancellation mid-turn: client disconnects → relay propagates → sub-agents unwind → terminal `cancelled` frame (AC-6)
- v0.4 client backwards compatibility: v0.4 client connects to v0.5-shaped turn, ignores unknown event types and fields, renders the events it understands (AC-7)
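Several of the integration scenarios above end with the same assertion: the captured frame sequence contains exactly one terminal frame, and it comes last. A possible helper is sketched below. The names `is_terminal` and `check_terminal_discipline`, the top-level `is_final` flag, and the rule that an `error` frame without `is_final` counts as terminal are assumptions.

```python
def is_terminal(frame: dict) -> bool:
    """completed and cancelled are always terminal; error is terminal unless is_final is False."""
    kind = frame["event_type"]
    if kind == "error":
        return frame.get("is_final", True)
    return kind in ("completed", "cancelled")


def check_terminal_discipline(frames: list) -> bool:
    """True when there is exactly one terminal frame and it is the last frame."""
    terminal_positions = [i for i, f in enumerate(frames) if is_terminal(f)]
    return terminal_positions == [len(frames) - 1]


# The AC-3 shape: non-terminal error, degraded text, then completed.
ac3_turn = [
    {"event_type": "error", "error": {"code": "CCS_ENVELOPE_ERROR"}, "is_final": False},
    {"event_type": "text", "content": "degraded answer"},
    {"event_type": "completed"},
]
```

An empty capture fails the check, which is intentional: a turn that produced no frames has no terminal frame either.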
9.3 Eval coverage (Opik)
- Status event rendering quality: per-vertical, the orchestrator’s rendered status strings are evaluated for naturalness and voice consistency
- Failure acknowledgment quality (PC2 FR-14): the orchestrator’s natural-language acknowledgment on Axis 2 / Axis 3 failures is evaluated for clarity and absence of error-code leakage
- Status-event vocabulary growth tracking: eval coverage is added when a new event type is registered; per-vertical eval suites cover the typed events that vertical’s sub-agents emit
9.4 Contract tests
- PC1: status-event emission flow (sub-agent emits typed identifier, orchestrator renders) matches PC1 FR-13 / Decision 10
- PC2: error event for each axis (Axis 1 / 2 / 3) emits with correct `error.code` per §5.7
- PS6: `data_loaded.items` carries PS6 `payload` only; envelope metadata never appears in the wire frame (AC-8, FR-10)
- PS5: registered event types in §5.4 are persistable by PS5’s store schema; new event types added to the registry are picked up by PS5 at startup
- PC7: client-side state machine binds to v0.5 event vocabulary; v0.4 client coexists
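The PS6 contract test above boils down to "unwrap the envelope, ship only the payload". The sketch below illustrates that check. The envelope keys other than `status`, `payload`, and `enricher_id`, the `"ok"` status value, the list-shaped payload, and the exact nesting of `data.id` and `items` in the frame are assumptions; `to_data_loaded` is a hypothetical helper, not the wrapper's real API.

```python
def to_data_loaded(envelope: dict, data_id: str) -> dict:
    """Build a data_loaded frame from a PS6-style envelope, dropping envelope metadata."""
    if envelope.get("status") == "error":
        # Error envelopes surface through the error-frame path, not data_loaded.
        raise ValueError("error envelope cannot be rendered as data_loaded")
    return {
        "event_type": "data_loaded",
        "data": {"id": data_id},             # shared with the paired data_loading frame
        "items": list(envelope["payload"]),  # payload only
    }


envelope = {"status": "ok", "enricher_id": "offers", "payload": [{"offer_id": "o1"}]}
frame = to_data_loaded(envelope, data_id="d1")
```

A contract test built on this would assert both directions: the payload arrives intact, and no envelope key leaks into the frame.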
9.5 Failure-mode testing
- Sub-agent runtime exception: terminal `error` frame with `SUB_AGENT_FAILED` carrying `{sub_agent_id}`
- Sub-agent LLM timeout: same surface as runtime exception
- CCS envelope `status="error"` mid-turn: non-terminal `error` frame with `CCS_ENVELOPE_ERROR`; turn completes with degraded `text` and `completed`
- Sub-agent emits unregistered event type: factory invocation fails before LLM call (AC-5)
- Concurrent fan-out with mixed success/failure: of N sub-agents, M succeed and N-M fail; one terminal `completed` with one non-terminal `error` frame per failed sub-agent
- Idle timeout (platform-initiated path): relay receives no frames from consumer-agent within window; relay synthesizes `cancelled` terminal frame with `IDLE_TIMEOUT` directly (§5.6 platform-initiated path, FR-13)
- Client disconnect mid-stream (client-initiated path): relay propagates cancellation upstream; orchestrator emits `cancelled` with `REQUEST_CANCELLED` via stream writer; sub-agents unwind (§5.6 client-initiated path, FR-13, AC-6)
- Internal event type leak: bug-injected `support_content` emission; default-deny-then-allow suppression catches; emission never reaches wire (R-6)
- v0.4 ↔ v0.5 wire-version skew: v0.4 fixture compared against v0.5 fixture; v0.5 is additive, no breaking changes (R-4)
- Frame ordering under relay buffering: frames arrive at the client in stream-writer emit-order even when the rover-agent relay batches, buffers, or experiences network back-pressure (FR-14, NFR-5)
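The idle-timeout scenario in the list above can be exercised with a toy relay. This Python sketch only models the behavior; the real relay is the Go rover-agent component. The `relay` and `demo` names, the `code` field on the `cancelled` frame, the `event_id` field on the `status` frame, and the 50 ms window are assumptions for the example.

```python
import asyncio


def _is_terminal(frame: dict) -> bool:
    kind = frame["event_type"]
    if kind == "error":
        return frame.get("is_final", True)  # assumed: missing is_final means terminal
    return kind in ("completed", "cancelled")


async def relay(upstream: asyncio.Queue, idle_window_s: float) -> list:
    """Forward frames in order; synthesize cancelled/IDLE_TIMEOUT if upstream goes quiet."""
    forwarded = []
    while True:
        try:
            frame = await asyncio.wait_for(upstream.get(), timeout=idle_window_s)
        except asyncio.TimeoutError:
            # No frame arrived within the idle window: the relay ends the turn itself.
            forwarded.append({"event_type": "cancelled", "code": "IDLE_TIMEOUT"})
            return forwarded
        forwarded.append(frame)
        if _is_terminal(frame):
            return forwarded


async def demo() -> list:
    upstream = asyncio.Queue()
    # One status frame arrives, then the upstream goes silent.
    await upstream.put({"event_type": "status", "event_id": "searching_offers"})
    return await relay(upstream, idle_window_s=0.05)
```

Running `asyncio.run(demo())` yields the status frame followed by the synthesized `cancelled` frame, which is the sequence the failure-mode test would assert on.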
10. Rollout & Observability
10.1 Rollout phases
Phase 1: Spec validation. PC3 reviewed and approved; cross-section contracts confirmed with PC1, PC2, PC7, PS5, PS6 reviewers. The PS6 reconciliation (§5.8) is the highest-priority contract to land; the Spec Lab page calls it “the single biggest cross-section contract.”
Phase 2: Event-type registry scaffolded. Registry file created in consumer-agent (e.g., events/status-events.yml alongside the prompt-block registry). First registered event types populated by porting from informal status strings currently emitted by Shop and Support sub-agents. Factory-time validation (FR-11) implemented and CI-gated.
Phase 3: Orchestrator rendering policy. Rendering policy machinery implemented at the stream-writer interceptor layer; default policy per event type read from registry; per-cohort override resolved at factory invocation. First vertical (Shop) converted to typed status events end-to-end.
Phase 4: v0.5 wire additions deployed. status event type lands on the wire (additive; v0.4 clients ignore it). cancelled terminal frame lands (formalizing the previous error + IDLE_TIMEOUT/REQUEST_CANCELLED pattern). Error code discriminators for sub-agent failure axes (SUB_AGENT_FAILED, CCS_ENVELOPE_ERROR, PARTIAL_FAN_OUT) shipped with PS6-composed payloads where available (Axis 2). Client-affecting change: the legacy thinking → mcp_session_start adapter mapping (§5.10) is removed; thinking becomes a first-class wire frame. v0.4 clients that bound to the old mapping receive thinking frames they may not recognize and MUST ignore them per the existing v0.4 ignore-unknown discipline (FR-12). Clients that depended on the legacy mapping for indicator rendering need a v0.5-aware update.
Phase 5: PC7 client adoption. Mobile clients pick up v0.5 awareness: bind state machine to status events for richer progress display; consume error telemetry for analytics; treat cancelled distinctly from error for analytics and retry semantics.
Phase 6: MCP-lifecycle event type deprecation. As the rover-mcp → CCS migration (consumer-context-service#64) progresses, the mcp_list_tools_start / mcp_list_tools_completed / mcp_session_start / mcp_session_progress event types are deprecated per FR-12 policy (30-day window, zero-use gate).
10.2 Observability metrics
Metrics use the stream.* namespace; per-vertical Grafana panel conventions follow PF8. PC3 commits the metric names and their semantic meaning; PF8 owns the cross-vertical dashboard layout and required panels per vertical.
- `stream.events.emitted_total` by `event_type`: wire emission volume per type; informs registry growth and rendering policy tuning
- `stream.status_event.emitted_total` by `event_id` (registry identifier): feeds NFR-3 cardinality monitoring and per-type rendering-policy review
- `stream.rendering_policy_applied_total` by `event_type` and `policy` (forward/transform/suppress/batch): surfaces operational rendering decisions
- `stream.terminal_frame.emitted_total` by terminal type (completed/error/cancelled): surfaces success / failure / cancellation rates
- `stream.error_frame.emitted_total` by `error.code`: feeds Axis 1 / 2 / 3 failure-rate dashboards
- `stream.unregistered_event_rejected_total`: should be zero in steady state; non-zero is a developer error caught at factory time
- `stream.sse.frame_dropped_total` at the rover-agent relay: should be zero; NFR-5 contract
- TTFE p50/p95 per turn: feeds NFR-1; concrete production measurement post-Phase 4
- Stream-writer overhead p95 per frame: feeds NFR-2; concrete production measurement post-Phase 4
- Cancellation rate by trigger (`IDLE_TIMEOUT` / `REQUEST_CANCELLED`): operational signal on user-disconnect patterns and idle-timeout tuning
10.3 Rollback
PC3 is a contract spec, not deployable code. Rollback semantics apply at three layers; a fourth, architecture-level rollback is listed for completeness:
- Event-registry rollback: per-event-type revert via git on the registry file; sub-agents that emitted the deprecated type fail at the next factory invocation (FR-11 validation). Recommended pattern: deprecate-then-remove per FR-12 policy, not direct removal.
- Orchestrator rendering policy rollback: per-cohort policy override via feature flag; restore default policy by removing the flag override.
- Wire-version rollback: emit v0.4-shaped frames by suppressing v0.5-only event types (`status`, `cancelled` terminal frame folded back into `error` with code). v0.4 clients continue working; v0.5 clients see the rollback as v0.4 fallback per FR-12 ignore-unknown rules.
- Architecture-level rollback (e.g., reverting the single-shared-stream-writer pattern) requires reverting the orchestrator-owned-output decision (Confluence 6500810847) and a platform-team alignment session. Not expected.
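The wire-version rollback described in the list above amounts to a per-frame downgrade step. A minimal sketch follows, assuming a `code` field on `cancelled` frames and a nested `error.code` on `error` frames; `downgrade_to_v04` and `SUPPRESSED_IN_V04` are hypothetical names.

```python
from typing import Optional

# v0.5-only event types that are dropped when emitting v0.4-shaped frames.
SUPPRESSED_IN_V04 = {"status"}


def downgrade_to_v04(frame: dict) -> Optional[dict]:
    """Return the v0.4-shaped frame, or None when the frame is suppressed."""
    kind = frame["event_type"]
    if kind in SUPPRESSED_IN_V04:
        return None
    if kind == "cancelled":
        # Fold the dedicated terminal frame back into error + code.
        rest = {k: v for k, v in frame.items() if k not in ("event_type", "code")}
        return {**rest, "event_type": "error", "error": {"code": frame["code"]}}
    return frame  # v0.4-compatible types pass through unchanged


folded = downgrade_to_v04({"event_type": "cancelled", "code": "REQUEST_CANCELLED", "response_id": "resp_1"})
```

Keeping the downgrade as a pure per-frame function means it could sit behind the same flag as the rollback itself without touching frame ordering.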
11. Appendix
11.1 Source references
- PC1 (Agent Composition): orchestrator-owned output principle; typed status event commitment; `render_*` tool path
- PC2 (Runtime Discovery & Sub-Agent Execution): status-event primitive at dispatch boundary; failure-axis taxonomy that PC3’s error-event contract surfaces
- PS2 (Connector Framework & Service Registry): sub-agent tool wrapper that unwraps PS6 envelope
- Orchestrator-Owned Output decision (Confluence 6500810847): single shared stream writer; LangGraph POC; parallel suppression confirmed
- PS6 (Domain Object Enrichment & BFF Assembly, consumer-context-service#77): `EnricherResponse[T]` envelope schema; principal verification; status-based suppression
- Platform Spec Lab: Wave 0/1 sequencing; PC3 ↔ PS6 as the lab’s biggest cross-section contract
- `rover-agent/docs/streaming/streaming.md`: v0.4 wire contract for MCP lifecycle, tool execution, content, error events
- `rover-agent/docs/streaming/data-loading.md`: v0.4 wire contract for `data_loading` / `data_loaded` progressive rendering events; commits BFF objects in `items`
- `consumer-agent/src/consumer_agent/agent/streaming.py`: Python-side Pydantic `StreamEvent` shapes
- `rover-agent/internal/events/types.go`: Go-side `StatusEvent` / `DataLoadingEvent` / `DataLoadedEvent` structs
- `rover-agent/internal/pythonagent/adapter.go::ConvertEvent`: two-vocabulary translation seam
- `rover-agent/internal/status/manager.go`: locale-aware status message manager (renders typed status events)
- PLT-616 (Jira epic): sub-agents-as-tools dispatch implementation tracking
- Miro design board: supplementary architecture diagrams
11.2 Decisions resolved during design
| # | Decision | Resolution |
|---|---|---|
| 1 | Stream writer mechanism | Single shared ambient writer per-invocation (LangGraph get_stream_writer() reference). No bypass stream; orchestrator policy decides per-event. Inherited from orchestrator-owned-output decision (Confluence 6500810847); LangGraph POC confirmed. |
| 2 | Status event vocabulary | Typed identifiers from file-backed registry; orchestrator owns rendering via per-event-type policy. Inherited from PC1 FR-13 / Decision 10; PC3 §5.4 codifies the registry mechanism. |
| 3 | Wire evolution policy | Additive from v0.4 baseline. New optional fields and new event types added without version bump per existing v0.4 compatibility contract; breaking changes increment version. v0.4 clients ignore unknown types. |
| 4 | Error event surface | Typed error event with PC3-specific discriminators (SUB_AGENT_FAILED, CCS_ENVELOPE_ERROR, PARTIAL_FAN_OUT) carrying payloads composed from PS6 vocabulary where available — Axis 2 payload is {enricher_id, reason} drawn directly from PS6 §5.2 envelope + PS6 §5.4 / PS1 §5.3.1 reason enum (inheriting FR-12 leakage prevention by composition). User never sees the discriminator or payload (orchestrator-composed natural-language acknowledgment on separate text frames per PC2 FR-14). See §5.7. |
| 5 | Cancellation contract | Dedicated cancelled terminal frame in v0.5 (formalizing v0.4’s error + IDLE_TIMEOUT/REQUEST_CANCELLED pattern). Distinguishes cancellation from failure for client analytics and retry semantics. |
| 6 | PS6 reconciliation | Payload-only on wire; envelope metadata server-side at sub-agent tool wrapper. Production already chose this boundary (data-loading.md items carries BFF objects directly). PC3 §5.8 ratifies. |
| 7 | PS5 boundary | PC3 owns wire shape and registry; PS5 owns durable store. Single source of truth — the §5.4 registry — feeds both. |
| 8 | Execution-mode scope | Chat SSE wire only. Scheduled execution (PD1) and DM delivery (PD2) run on the same AI Assistant Platform infrastructure but emit to transports outside PC3’s SSE wire (PD1’s audit trail, PD2’s notification-service queue). §3.5 documents the transport-surface boundary. PC3’s title preserves “Execution Modes” framing for the surface PC3 could cover if SSE-streamed scheduled or DM modes ship in the future. |
| 9 | Two-vocabulary translation | Wire vocabulary is canonical (rover-agent’s wire shape). Consumer-agent’s Pydantic shape is internal-only. Eventual consolidation is a separate follow-on, not gating PC3. |
11.3 Migration receipts
- PC2 §5.7 status-event primitive: PC2 specifies the primitive at the dispatch boundary; PC3 §5.4 owns the vocabulary and lifecycle. No content migration needed.
- PC1 FR-13 / Decision 10 sub-agent progress contract: PC1’s commitment stays; PC3 §5.4 specifies the registry mechanism. No migration.
11.4 Wire reference (consolidated v0.5 event types)
For implementation reference, the full v0.5 wire event set:
Lifecycle & metadata: response_id, episode, thinking, usage
MCP transport (deprecating per Phase 6): mcp_list_tools_start, mcp_list_tools_completed, mcp_session_start, mcp_session_progress
Tool execution: tool_call, tool_completed
Content streaming: text, reasoning
Component / data: data_loading, data_loaded, component
Status (new v0.5): status
Terminal frames: completed, error, cancelled (cancelled new v0.5)
Internal-only Pydantic types (never on wire): support_content and any future internal types added with suppress default policy.