Agent Composition (Skills + Connectors + Models)
PC1: Agent Composition
Section titled “PC1: Agent Composition”1. Problem Statement
Section titled “1. Problem Statement”The agent layer’s composition contracts are implicit, scattered across factory.py, agent_config.yaml, prompts/components/, and gateway/. A vertical engineer onboarding the platform today must read consumer-agent source code to understand what artifacts to author and how they fit together.
This contradicts the platform’s stated goal:
“Plugging a new vertical into the AI Assistant should be the easiest way for a product team to ship a new agent capability.” — Innovation Labs Agent Platform — Strategy & Onboarding
PC1 lifts the implicit composition contracts into one canonical, citeable contract for the AI Assistant Platform. Downstream specs (PC2 Discovery, PS2 Connector Framework, PF1 Lifecycle, PS6 BFF Enrichment, PF4 Security) cite this contract instead of re-deriving it from code. Every vertical’s onboarding cost drops to “fill out the Agent Definition shape.”
Success test: a vertical engineer reads PC1 plus their vertical’s PRD and knows exactly what artifacts to author. No source-code archaeology required.
Companion: Miro design board — supplementary architecture diagrams (canonical doc, runtime topology, memory model, prompt composition). The spec is the source of truth; Miro is companion material.
2. Capabilities Source
Section titled “2. Capabilities Source”Per the Platform Spec Lab, PC1 owns the Agent Composition capability for the AI Assistant Platform’s consumer-agent runtime. The capability is: the platform must support composing AI agents from reusable, decoupled building blocks — Skills, Connectors (Tools), Agents (Agent Definitions), and Models.
PC1 specifies how each of these building blocks works on the AI Assistant Platform — the consumer-agent / rover-agent / consumer-context-service / consumer-graph stack that produces the unified AI Assistant experience and into which verticals plug. The vocabulary and design choices are stated directly in §3 and §5; PC1 does not translate from a separate reference architecture.
3. Background & Context
Section titled “3. Background & Context”3.1 Today’s reality
Section titled “3.1 Today’s reality”The consumer-agent stack implements a working agent runtime via:
factory.py::create_gateway_agent_from_config— factory that resolves model + tools and constructs a LangGraph CompiledStateGraphagent_config.yaml::agents— declarative agent registry (proto-Agent-Definition); each entry declares model, tools, max_output_tokens, reasoning_effort, etc.prompts/components/*.yaml— modular prompt components (proto-prompt-blocks); each has type, name, feature_gated, instructionsprompts/<id>.txtor Opik-versioned prompts — base prompts (proto-monolithic-block)gateway/orchestrator.py— orchestrator-as-tools dispatch pathgateway/gateway_graph.py— legacy classifier path (slated for deprecation)history/middleware.py— episode-based history capture (DDB + S3)
A single shared Agent runtime class instantiates every agent in the system. Verticals contribute agent cards (Agent Definitions, referenced prompt block files, declared tool list, and model selection); the platform owns the runtime class. There is no per-vertical agent class or process — multiple sub-agents active in the same conversation are multiple instances of the one shared class, each parameterized by its card.
Two verticals are live today: Shop (full slice) and Support / Scout (Forethought-wrapper minimal slice). Six verticals are in flight or draft: Rewards / PointPass, Play, eReceipts, Offer Details, Restaurant Network, Retailer Context.
3.2 Vocabulary
Section titled “3.2 Vocabulary”| Concept | AI Assistant Platform definition | Note |
|---|---|---|
| Skill | Agent card (Agent Definition + referenced prompt blocks + tool declarations + model selection); instantiated as a sub-agent at runtime | PC1 adopts the Skill primitive at sub-agent scope. The agent card is the Skill. See §11.3 Decision 1. |
| Prompt block | One ordered entry in modular system-prompt composition | Internal primitive composed by an agent card; finer-grained than the Skill unit |
| Connector / Tool | LangChain BaseTool subclass or structured dict | Three-layer config (agent_config + settings + flags) documented; framework specified in PS2 |
| Agent Definition | Declarative bundle of model + tools + prompt_blocks; system prompt assembled per-request from prompt blocks plus dynamic context | Per-request system prompt is a deliberate architectural choice; see §3.3 |
| Model | Registry-keyed selection per agent, defined in agent_config.yaml::models | Operational, swappable per agent without changing the Agent Definition’s shape |
| Tool binding | Factory-time resolution against the connector registry | Runtime discovery is deferred; see §3.3 |
3.3 Why these design choices
Section titled “3.3 Why these design choices”Per-request system prompt. A single agent serves multiple contexts (different verticals, feature flags, user states) without re-instantiation. The same Agent Definition produces different responses based on enabled prompt blocks and dynamic context. The system prompt is not stored on the agent as a static string — it is assembled per request from a list of registered prompt blocks plus dynamic context (date, location, user_id, locale). This matches today’s working production behavior and gives the platform a single per-request cohort-aware seam for prompt evolution.
Skills adopted at sub-agent scope; no higher-level Skill aggregator. PC1 treats each agent card (Agent Definition + referenced prompt blocks + tool declarations + model selection) as a Skill. Bundling at sub-agent scope is justified — a sub-agent’s prompt, tools, and model are tightly coupled by design. PC1 does not introduce a higher-level Skill aggregator that bundles memory keys and stacks Skills as a separate composition layer above sub-agents. Memory persistence is governed by the consumer-agent memory architecture (DDB + S3, layered storage); composition above sub-agents is the orchestrator’s job.
Factory-time tool binding. Runtime discovery (querying a registry at request time, paging connectors conditionally per turn) changes the request lifecycle in ways that ripple to every other spec. Factory-time binding preserves predictability and matches today’s behavior. Runtime discovery is a follow-on if production scale demands it.
4. Requirements
Section titled “4. Requirements”4.1 Functional requirements
Section titled “4.1 Functional requirements”FR-1: The platform MUST support a declarative Agent Definition with these fields:
id(string, unique)description(string)role(enum: orchestrator, native, external-wrapper, internal-helper) — documentation tag, not behavioral switchmodel(string, registry key)tools(list of strings, tool registry IDs)prompt_blocks(ordered list of strings, block registry IDs)sub_agents(optional list of strings, agent registry IDs; presence makes the agent orchestrator-shaped)tuning(optional object: max_output_tokens, reasoning_effort, text_verbosity)
FR-2: The platform MUST resolve all references in an Agent Definition at factory time (model from registry, tools from registry, prompt_blocks from registry, sub-agents from registry).
FR-3: The orchestrator MUST route user turns by tool-calling sub-agents. Sub-agents MUST return their response text (and structured component payloads via the render_* tool path) to the orchestrator as tool results. The orchestrator MUST own final response generation — either passing sub-agent output through or composing across multiple sub-agent results — to maintain conversation-flow consistency and provide a single seam for tone and UX enforcement.
FR-4: Orchestrators MUST declare a sub_agents list. Execution semantics (composition logic, fan-out concurrency, aggregation) are specified by PC-2 §5.3–§5.4.
FR-5: The orchestrator MUST handle partial failure gracefully: continue with successful sub-agents and incorporate a natural-language acknowledgment of the failure into the final user-facing response (e.g., “I wasn’t able to get information about X right now”). Raw error events and system diagnostics MUST NOT appear in the user-facing response. System-level error signaling for client telemetry is PC3’s concern.
FR-6: System prompts MUST be assembled per-request from the agent’s prompt_blocks list plus dynamic context (date, location, user_id, locale).
FR-7: Each sub-agent MUST receive only its own session history when re-invoked. Sub-agents MUST NOT receive other sub-agents’ raw turns.
FR-8: On intent switch (current sub-agent differs from prior turn’s), the orchestrator MUST generate a prior_context prose summary that names specific entities, identifiers, and active goals from prior turns. Generation is driven by orchestrator system-prompt instructions.
FR-9: Conversation history MUST be persisted at the platform level for audit, replay, and eval. Sub-agents MUST NOT consume the raw history.
FR-10: Conversation history persistence is layered. The assistant message body MUST contain narration text only — no inline structured component JSON. Component data emitted via render_* tool calls MUST be persisted as tool messages (tool input + formatted component payload, keyed by turn_id) for episode replay and reference resolution. The LLM’s client-side context on subsequent turns SHALL include narration messages; raw tool messages SHALL NOT flow into the LLM context. The consumer-agent memory architecture owns the read-path schema and any compact-reference annotation mechanism for cross-turn reference resolution.
FR-11: When conversation history exceeds a length threshold (tuned via experiment), the orchestrator MUST trigger compaction. Compaction produces a rolling summary stored in episode metadata; recent N turns are preserved intact.
FR-12: The platform’s runtime MUST auto-inject the required block set (persona-fetch-assistant, format-conversational, safety-base) into vertical-facing Agent Definitions at factory time. Verticals declare only blocks they own; the platform-required set is not part of the vertical-authored prompt_blocks list. Auto-injected blocks occupy a stable position at the head of the assembled prompt to preserve cache-friendly prefix (see §5.5).
FR-13: Sub-agents MUST emit progress as typed structured status events (e.g., searching_offers, matching_receipt) — NOT user-facing prose. The orchestrator MUST render status events into user-visible strings in the assistant’s voice during longer sub-agent operations. The orchestrator owns per-event suppression policy (forward / transform / suppress) for status events from any sub-agent. The status-event type vocabulary and rendering contract are owned by PC3.
4.2 Non-functional requirements
Section titled “4.2 Non-functional requirements”NFR-1 (Latency): Note: Baselines below were measured against the direct-streaming variant superseded by Decision 2 (sub-agents-as-tools). Re-measurement against the current architecture is pending.
Orchestrator routing latency target — production baseline (production latency report, n=4,951 traces, direct-streaming variant, since superseded by sub-agents-as-tools — see §11.1 for source and §11.3 Decision 2 for the supersession):
- Routing decision (gateway classifier): p50 ~300ms, target
<500ms - End-to-end trace, orchestrator-direct path (no sub-agent dispatch): p50 ~2.7s
- End-to-end trace, sub-agent dispatch path: p50 ~5.6s
Under the sub-agents-as-tools pattern, expect a modest increase in trace-level latency for sub-agent dispatch routes (one additional orchestrator LLM call for response composition). Routing decision latency is preserved. Production SLA documented in operational runbook; refine via re-measurement after the sub-agents-as-tools pattern is deployed.
NFR-2 (Reliability): Orchestrator’s routing primitive (intent detection + sub-agent tool-call emission) achieves target reliability. Production baseline (Orchestrator Model Benchmark, n=200 mixed-intent test queries, gpt-5.4-nano reasoning_effort: low + strict-gating prompt + intent_count failsafe): 90.5% raw mixed-intent reliability with ~100% effective reliability via failsafe (failures report intent_count: 2, catchable by retry). Mixed-intent production volume is currently zero (Phase C window) — real-world reliability data will accumulate as that volume grows. Reliability against a configured fan-out cap is PC2’s contract.
NFR-3 (Scope discipline): Per-vertical description in orchestrator’s prompt has a tight token budget — current operational target ~200-400 tokens per vertical (informed by current ask_support and ask_shopping descriptions in the orchestrator prompt). Adding verticals beyond ~5-7 will require either tightening descriptions or moving to a hierarchical routing pattern (deferred). Refine via experiment.
NFR-4 (Prompt budget): Orchestrator prompt size monitored; warning and hard thresholds set well below the operational model’s context limit. Current orchestrator system prompt is ~400-500 tokens; operational model gpt-5.4-nano provides ~272K-token context. Current utilization is <1%. Warning threshold initially 25% of context window, hard threshold 50%. Refine via experiment as additional verticals onboard.
NFR-5 (Compaction trigger): Conversation compaction triggered at history threshold — concrete threshold pending production episode-length observability (DDB query or Grafana dashboard). Production conversational volume is sparse (production latency report: ~5 conversational requests/hour, p90 inter-request gap = 32 minutes — see §11.1); episode-length distribution data has not yet been mined for this NFR. Refine via observability work and experiment.
NFR-6 (Capability requirements, model-agnostic):
- Orchestrator model supports concurrent tool calls in a single response
- Operational reliability against the configured fan-out cap is PC2’s contract
NFR-7 (LLM cost management): Orchestrator-curated operations that incur additional LLM calls — prior_context summary generation, compaction summary generation, response composition — MUST be cost-bounded. Triggering policies are conditional (e.g., prior_context on intent switch only, compaction at threshold only) rather than per-turn. These operations SHOULD use a cheaper model than the operational orchestrator model where quality permits. Per-turn additional-call cost MUST be tracked with a target $/MAU contribution (tuned via experiment); if costs trend unsustainable, policies tighten before scaling further.
4.3 Acceptance criteria
Section titled “4.3 Acceptance criteria”- A new vertical engineer reads this spec and authors a working Agent Definition in their first day on the platform
- An Agent Definition’s
prompt_blockslist, paired with platform-required blocks, produces a functioning vertical-facing agent without orchestrator code changes - The orchestrator routes user turns to sub-agents via tool calls; sub-agent results return to the orchestrator as tool results; the orchestrator composes the final user-facing response
- A change to a single prompt block updates all agents using that block, without per-agent code or config changes
- Orchestrator’s
sub_agentslist resolves at factory time (registry lookup succeeds for every named sub-agent)
5. Solution Design
Section titled “5. Solution Design”5.1 The architectural through-line
Section titled “5.1 The architectural through-line”Agent Definition is a list of registered references; runtime resolves via registries.
Every primitive (model, tool, sub-agent, prompt block) is referenced by ID. Definitions are thin index cards; registries hold implementations. This shape is consistent across all primitive types and serves as the single mental model for vertical onboarding.
A single shared Agent runtime class instantiates every agent — orchestrator and sub-agents alike — by reading its Definition and assembling the registered references at factory time. Verticals contribute agent cards (Definition + prompt block files + tool declarations + model selection); they do not contribute runtime classes. This is what makes “agent card = Skill at sub-agent scope” cleanly composable: the card is data, instantiation is uniform.
5.2 Agent Definition shape
Section titled “5.2 Agent Definition shape”A leaf agent (vertical sub-agent example):
id: rewardsdescription: Handles points balance, redemption history, and points-by-method analyticsrole: nativemodel: gpt-5.4-mini-low # operational, swappabletools: - get_user_points - get_redemption_history - calculate_redemption - get_points_by_methodprompt_blocks: # vertical-authored only; platform blocks auto-injected (see §5.7) - persona-rewards - instructions-rewards - safety-financial # domain-specific (still vertical-declared until conditional injection lands)sub_agents: [] # leaf agenttuning: reasoning_effort: low text_verbosity: mediumThe orchestrator uses the same shape, with sub_agents populated:
id: orchestratordescription: Top-level routing across vertical sub-agentsrole: orchestratormodel: gpt-5.4-mini-lowtools: [llm_feedback] # platform-level tools the orchestrator calls itselfprompt_blocks: # platform blocks auto-injected; only orchestrator-specific declared - persona-orchestrator - instructions-routing - instructions-prior-context-generationsub_agents: # presence makes this an orchestrator - shop - rewards - support - ereceiptsBoth YAMLs above are agent cards — the Agent Definition portion of the bundle. Per §5.1, the card is data; the shared runtime instantiates orchestrator and sub-agent instances from these same shapes. Each card implicitly references its prompt block files (via the block registry) and tool implementations (via the tool registry).
5.3 Orchestrator pattern (sub-agents-as-tools)
Section titled “5.3 Orchestrator pattern (sub-agents-as-tools)”The orchestrator is an Agent Definition whose sub_agents list is populated. At factory time, the runtime materializes each sub-agent as a tool callable by the orchestrator’s LLM. The orchestrator’s tool list, presented to the LLM, is its own tools plus one tool per registered sub-agent.
When the LLM emits a tool call for a sub-agent, the runtime:
- Looks up the sub-agent’s Agent Definition
- Instantiates it with inherited session context (user identity, episode, etc.)
- Invokes the sub-agent, which produces response text (and may emit structured component payloads via the
render_*tool path) - Returns the sub-agent’s response to the orchestrator as a tool result; captures events for history persistence in parallel
- The orchestrator’s LLM uses sub-agent results to compose the final user-facing response
This is the sub-agents-as-tools dispatch: tool-calling for routing, orchestrator-mediated response generation. The orchestrator emits a brief preamble before tool calls (mitigating perceived latency), then composes the final response from sub-agent results — passing a single sub-agent’s output through verbatim, or composing across multiple results on mixed-intent turns. Centralizing final response generation at the orchestrator provides a single seam for tone and UX consistency.
Sub-agent status events. Sub-agents emit structured status events (e.g., searching_offers, matching_receipt, looking_up_purchase_history) — typed identifiers, not prose. The orchestrator renders these into user-visible progress strings during longer sub-agent operations, keeping users informed of work-in-progress while preserving the centralized voice. Sub-agents drive cadence (they know what they’re doing); the orchestrator owns the words. At the implementation layer, all events flow through a single ambient stream primitive (LangGraph’s shared get_stream_writer()); “direct from sub-agent” vs “via orchestrator” is orchestrator policy (forward / transform / suppress per-event), not separate mechanisms. The status-event type vocabulary is owned by PC3.
Why not direct streaming? An earlier design variant (“hybrid with direct streaming”) had sub-agents write directly to the user’s SSE stream, bypassing the orchestrator’s final-response generation. That variant optimized TTFT but might produce known conversation-flow consistency issues — sub-agent voices reaching the user unmediated created tone divergence (visible today in the Scout / Forethought integration). As additional verticals onboard (Play, eReceipts, Restaurants, PointPass), enforcing voice and safety in N places drifts in N directions — making centralized output a load-bearing concern rather than a stylistic preference. Sub-agents-as-tools sacrifices some end-to-end latency for a centralized voice/UX enforcement layer. Direct streaming may be revisited if a centralized streaming-time consistency mechanism is built. See §11.3 Decision 2 and §11.1 references for the canonical decision doc.
See consumer-agent/docs/research/sub-agents-as-tools-design.md for implementation details (note: the research doc describes the earlier direct-streaming variant; PC1 supersedes that with the sub-agents-as-tools pattern in this section).
5.4 Mixed-intent fan-out
Section titled “5.4 Mixed-intent fan-out”The orchestrator pattern (§5.3) supports turns where the LLM emits multiple sub-agent tool calls. Execution semantics — concurrency model, cap behavior, partial-failure handling, status-event suppression policy during parallel execution — are specified in PC2 §5.4. PC1 declares only that the dispatch primitive supports it; PC2 owns the runtime contract.
5.5 Per-request system prompt assembly
Section titled “5.5 Per-request system prompt assembly”“Per-request” describes the build pattern (assembly runs each turn from data, not stored as a static string on the agent), not the output structure (which is stable within a session). The system prompt is NOT part of the agent’s identity. It is assembled per-request from:
- Platform-required blocks (auto-injected per §5.7 / FR-12) at the head of the assembled prompt
- The agent’s
prompt_blockslist (ordered references to the block registry; vertical-authored, strict declaration order) - Block resolution (each ID looked up; ordered concatenation; the order is the order in the
prompt_blockslist, not category-grouped) - Dynamic context appended at the end (date, location, user_id, locale)
Result: the system prompt for this turn.
Cache-friendly contract. The assembly produces a stable prefix per cohort (auto-injected blocks + vertical-authored blocks + resolved content) followed by a variable suffix (dynamic context). Cohorts are defined by Agent Definition + registry versions + feature-flag-gated block presence — same-cohort users and same-user multi-turn follow-ups share the prefix and hit the LLM provider’s prompt cache when available. Implementations MUST keep per-user variable values (user_id, location) confined to the trailing dynamic-context suffix; placing them in the prefix region fragments the cache pool to size 1.
The same Agent Definition produces different system prompts for different cohort contexts (different feature flags, different user state) without re-instantiation; the structure is stable within a session for a given user. The system prompt is not part of the Agent Definition’s identity — only its assembly inputs (prompt_blocks list, dynamic context) are.
5.6 Memory model
Section titled “5.6 Memory model”Conversation history is layered, not single-tier. Three distinct surfaces:
-
Storage layer (DDB + S3) — full conversation history persisted for audit, replay, and eval. Assistant message bodies hold narration text only.
render_*tool calls persist as separate tool messages with bothtool_input(component IDs, titles, analytics tags) andcomponent_data(formatted component JSON), keyed byturn_id. The currentHistoryMiddlewarealready implements this; PC1 codifies it. -
Client-side LLM context on subsequent turns — the read path filters tool messages out before passing history to the LLM. The LLM sees user + assistant narration only. This is the layer where “narration-only” applies, and it’s what keeps per-turn LLM context cost bounded.
-
LLM provider server-side reasoning continuity — provider-side response chaining (where available) carries the prior turn’s full tool calls/results within the provider’s server-side state. Same-session, within-TTL, chain-active turns implicitly retain prior tool context. Cross-session, post-TTL, or chain-broken turns fall back to client-side narration only.
Reference resolution across layers. When users reference prior content (“the third offer you showed me”), most cases resolve via either the server-side chain (when active) or the narration text itself (“the Caribou one”). When more precision is needed, the orchestrator may surface compact reference annotations (component IDs + positions) from the storage layer’s tool messages into LLM context. The annotation schema and read-path mechanism live in the consumer-agent memory architecture.
Sub-agents do not consume the raw history. Each sub-agent sees only its own prior session turns when re-invoked. The HistoryMiddleware tags turns with sub_agent_id (or extends the existing intent_labels field; schema choice is an implementation detail of the consumer-agent memory architecture). On sub-agent invocation, the runtime retrieves only that sub-agent’s prior turns from the current session.
Orchestrator-curated prior_context. When the orchestrator routes a turn to a different sub-agent than the prior turn, it generates a prose summary that names specific entities, identifiers, and active goals from prior cross-vertical turns. Quality is driven by the orchestrator’s system-prompt instructions (no declared contracts; just specific, instruction-driven generation). Cost is bounded per NFR-7.
Compaction. When conversation history exceeds a length threshold (tuned via experiment), the orchestrator triggers compaction. A summary LLM call produces a rolling summary; older turns are replaced by the summary in the orchestrator’s context. Recent N turns are preserved intact. The summary is stored in episode metadata for retrieval on subsequent turns. Cost is bounded per NFR-7.
5.7 Prompt blocks (the agent’s identity)
Section titled “5.7 Prompt blocks (the agent’s identity)”The agent’s identity is its prompt block composition. Tools enable identity, model executes it, sub_agents extend it; prompt blocks define WHO the agent is.
Block categories: persona, instructions, capabilities (see PC-6 §5.4 for storage model — declared as XML <can_do> / <cannot_do> blocks, not in capabilities.md), safety, format, context. Each block is independently versioned (file-backed; versioned via git, reviewed via PR — see §11.3 Decision 9).
Block ownership:
- Platform-owned:
persona-fetch-assistant,format-conversational,format-streaming,safety-base,safety-financial - Vertical-owned:
persona-X,instructions-X,context-X(one set per vertical)
Required platform blocks (auto-injected). Per FR-12, the runtime auto-injects the required block set at factory time; verticals do not declare them. Domain-specific blocks (e.g., safety-financial for financial verticals) remain vertical-declared in prompt_blocks until a future spec adds a conditional-injection mechanism.
Cross-vertical block reuse. Blocks like context-location-aware may be authored once and reused by any vertical that needs location-conditional reasoning. Block reuse is a deliberate platform feature; Definition-level reuse is the primary mechanism for vertical-to-vertical consistency.
5.8 Tools as typed contracts
Section titled “5.8 Tools as typed contracts”The Agent Definition’s tools: list references tools by name. Tool primitive shape, registration mechanism, and authoring workflow are owned by PS-2 §5.2–§5.4.
5.9 Vertical integration model
Section titled “5.9 Vertical integration model”Verticals contribute one or more agent cards to the orchestrator’s sub_agents list. An agent card is the bundle a vertical ships: an Agent Definition together with its referenced prompt block files, declared tool list, and model selection. Verticals do not contribute runtime classes — the platform’s shared Agent runtime instantiates each card as a sub-agent.
Native sub-agent pattern is canonical:
- Vertical owns its prompt_blocks (persona, instructions, optional context) — files in the vertical’s directory, referenced by ID from the card
- Vertical declares its tool list (CCS endpoints or vertical-specific MCP wrappers) — IDs resolved against the tool registry
- Vertical selects its model (operational choice via the model registry)
- Vertical defines its scope via the description field — used by the orchestrator’s LLM for routing
Scout is the live example of native sub-agent-as-tool integration: it is registered in the orchestrator’s sub_agents list as ask_support (visible at src/consumer_agent/gateway/orchestrator.py:54). Forethought is the upstream service Scout wraps for response generation; the integration is native at the sub-agent boundary even though Forethought hosts the final reply. Verticals integrate through this same pattern — native agent card registered against the orchestrator. There is no “external-wrapper” alternative pattern for new verticals to adopt; the orchestrator-with-sub-agents-as-tools shape is the canonical and only supported integration model.
The orchestrator is platform-owned. Verticals contribute sub-agent cards, not orchestrators.
6. Cross-Section Impact
Section titled “6. Cross-Section Impact”| Spec | Citation |
|---|---|
| PC2 (Discovery & Sub-Agent Execution) | Cites Agent Definition contract; orchestrator’s sub_agents list IS the discovery surface for vertical-facing flows. PC2 owns the runtime execution contract (concurrency model, cap, partial-failure semantics, status-event suppression policy during parallel execution). |
| PS2 (Connector Framework) | Cites tool reference shape from Agent Definition. PS2 owns the connector framework: authoring workflow (three paths), service registry mechanism, three-layer config schema, secrets handling, conformance bar, and per-connector observability. |
| PF1 (Agent Lifecycle) | Cites Agent Definition for promotion semantics; “promoted” means the Definition is active in the registry. Schema validation occurs at promotion time. |
| PS6 (Domain Object Enrichment & BFF Assembly) | Independent; PC1’s layered storage policy (FR-10, §5.6) applies to component payloads PS6 produces — assistant body holds narration only; component data lives in render_* tool messages; LLM context excludes them. |
| PF4 (Security & Auditability) | Cites isolation guarantees from per-sub-agent memory model; cites required platform block validation for vertical-facing agents. |
7. Dependencies
Section titled “7. Dependencies”Platform spec dependencies: None blocking (PC1 is Wave 0; downstream specs depend on this one).
Implementation dependencies:
- LangChain v1 with
create_agent(current consumer-agent stack) - LangGraph state machines and
StreamWritersemantics - DynamoDB + S3 for episode persistence
- Operational model registry in
agent_config.yaml::models
External dependencies: Forethought (Scout’s upstream response-generation service; reached via Scout’s tool body when the orchestrator dispatches ask_support).
8. Risks & Open Questions
Section titled “8. Risks & Open Questions”8.1 Risks
Section titled “8.1 Risks”R-1: Orchestrator prompt quality is load-bearing. Routing accuracy and prior_context quality both depend on the orchestrator’s system prompt. Degradation here directly affects user experience. Mitigated by eval coverage (routing accuracy on intent-switch pairs) and prompt iteration with version control via Opik.
R-2: Legacy agent.py path needs layered-storage compliance. Today the regular agent.py path stores <component type="...">{json}</component> markers verbatim in text_content (inline JSON in the assistant message body). The render_* tool path already implements layered storage correctly — tool input and component data persist as tool messages, assistant text body holds narration only. Layered storage compliance requires either stripping inline markers at write time on the legacy path, or migrating those flows to the render_* tool pattern. Implementation responsibility lives in the consumer-agent memory architecture.
R-3: Per-sub-agent retrieval schema change. Today’s tagging is via intent_labels (list of strings). PC1’s commitment requires either extending this field with sub-agent semantics or adding an explicit sub_agent_id field. Backfill consideration for stored episodes; schema choice is an implementation detail of the consumer-agent memory architecture.
R-4: Compaction is fundamentally lossy. Summary captures less than the raw history. Reference resolution that depends on a specific quote from a compacted turn may break. Mitigated by preserving recent N turns intact and by eval coverage spanning the compaction boundary.
R-5: prior_context summary quality on multi-hop chains. When a user references something from three or more turns ago in a different sub-agent, the orchestrator’s prose summary must capture enough specificity. Mitigated by eval coverage on multi-hop reference scenarios; iteration on orchestrator prompt instructions.
R-6: Required block validation today is informal. Today’s prompts/components/ flat folder has no ownership markers and no required-set validation. PC1 requires both. Implementation: add ownership tags to block frontmatter; add validation at Definition load time; reviewers enforce at Definition PR review.
R-7: Scout / Forethought response-generation seam. Scout is registered as a native sub-agent-as-tool (ask_support in the orchestrator’s sub_agents list per §5.9). Forethought is the upstream service Scout wraps for response generation; the orchestrator-mediated response composition path applies (orchestrator composes the user-visible reply from Scout’s tool result). The risk is the Forethought response carrying voice or formatting that conflicts with the orchestrator’s composition pass — visible historically as voice divergence. Mitigated by treating Scout’s tool result as input to orchestrator composition, not as user-facing output verbatim.
R-8: Fan-out reliability at N greater than 2 is unmeasured. Today’s orchestrator has only two sub-agent tools (ask_shopping, ask_support). N=3 reliability cannot be tested until a third sub-agent is registered. Cap is framed as tuned via experiment for this reason.
8.2 Open Questions
Section titled “8.2 Open Questions”None outstanding. Prior design-phase questions captured in §11.3.
9. Testing Strategy
Section titled “9. Testing Strategy”9.1 Unit tests
Section titled “9.1 Unit tests”- Agent Definition parsing and validation (id format, required fields, registry-key resolution)
- Required platform-block validation per vertical-facing definition
- Tool registry resolution (factory-time binding correctness)
- Prompt block resolution and ordering
- System prompt assembly (block resolution + dynamic context concatenation)
9.2 Integration tests
Section titled “9.2 Integration tests”- Orchestrator routing on representative single-intent and mixed-intent scenarios per vertical
- Sub-agent invocation through orchestrator’s tool-call path; verify sub-agent output returns to the orchestrator and the orchestrator composes the final user-facing response
- Concurrent sub-agent dispatch and orchestrator response composition (N=2 today; N=3 deferred until 3rd sub-agent registered)
- Partial failure handling (one sub-agent fails; others succeed; orchestrator emits natural-language acknowledgment in final response per FR-5)
- Multi-turn intent-switch coherence (verify orchestrator routes correctly across switches; verify prior_context preserves cross-vertical references)
9.3 Eval coverage (Opik)
Section titled “9.3 Eval coverage (Opik)”- Routing accuracy on intent-switch pairs across registered verticals
- prior_context recall on multi-hop reference scenarios
- Response quality per vertical (own judge thresholds)
- Compaction quality on long-conversation samples (preserves entities, decisions, intents)
9.4 Contract tests
Section titled “9.4 Contract tests”- Cross-section contract with PC2: discovery surface matches sub_agents list
- Cross-section contract with PS2: tool reference shape matches connector framework
- Cross-section contract with PF1: promotion semantics (Definition validation gate)
9.5 Failure-mode testing
Section titled “9.5 Failure-mode testing”- Sub-agent timeout (orchestrator handles gracefully; failure acknowledged in natural language in final response per FR-5)
- Required platform block missing from Definition (validation rejects at load time)
- Orchestrator drops a tool call on mixed intent (intent_count failsafe + retry; fan-out execution mechanics are PC2’s testing scope)
- Compaction summary quality degrades (eval covers; surfaces in routing/coherence metrics)
10. Rollout & Observability
Section titled “10. Rollout & Observability”10.1 Rollout phases
Section titled “10.1 Rollout phases”Phase 1 — Spec validation. PC1 reviewed and approved; cross-section contracts confirmed with PC2 / PS2 / PF1 reviewers.
Phase 2 — Implementation tickets. PC1 decomposed into ~3pt Jira tickets per Spec Lab assembly-line workflow. Tickets implemented incrementally.
Phase 3 — Migration. Existing Shop and Support sub-agents adapted to Agent Definition contract. HistoryMiddleware extended for layered storage compliance (legacy agent.py path migration to render_* tool pattern) and per-sub-agent tagging (consumer-agent memory architecture work).
Phase 4 — Vertical onboarding. First new vertical (Rewards / PointPass) onboarded under PC1’s contract as the validation. Subsequent verticals follow.
Phase 5 — Legacy classifier deprecation. After the sub-agents-as-tools dispatch is production-validated, the legacy classifier path is deprecated. All sub-agents conform to the native sub-agent-as-tool pattern per §5.9.
10.2 Observability metrics
Section titled “10.2 Observability metrics”- Routing accuracy by vertical (eval set + production sample): catches misclassifications
- Orchestrator latency p95: routing + prior_context generation; catches prompt bloat
- Fan-out distribution: histogram of sub-agents per turn; feeds PC2’s operational tuning at its layer
- prior_context recall: on intent-switch turns, did the summary preserve referenced entities?
- Per-vertical eval scores: response quality, safety, format compliance per sub-agent
- Compaction trigger rate: how often does it fire; informs threshold tuning
- Cross-vertical reference resolution rate: percent of “the X I asked about earlier” queries that resolve correctly
10.3 Rollback
Section titled “10.3 Rollback”PC1 is a contract spec, not deployable code. Rollback is at the Definition level: an Agent Definition can be rolled back to a prior version via git (the file-backed registry; see Decision 9). Block-level changes can be rolled back independently. Architecture-level rollback (e.g., reverting to legacy classifier path) requires platform team intervention; not expected.
11. Appendix
Section titled “11. Appendix”11.1 Source references
Section titled “11.1 Source references”- Innovation Labs Agent Platform — Strategy & Onboarding — vertical-onboarding goals
- Platform Spec Lab — operationalization process
consumer-agent/docs/research/sub-agents-as-tools-design.md— sub-agents-as-tools dispatch implementation reference (note: the research doc covers the earlier direct-streaming variant; PC1 supersedes that with the sub-agents-as-tools pattern, see §5.3)- Decision: Orchestrator-Owned Output for Multi-Agent Architecture — canonical decision doc for orchestrator-mediated output + structured status event contract (cited in §5.3 and §11.3 Decisions 2 and 10)
- PLT-616 (Jira epic) — sub-agents-as-tools dispatch implementation tracking
- Orchestrator Model Benchmark Report — model + prompt benchmark for orchestrator routing (mixed-intent reliability, latency percentiles); cited in NFR-2
- PLT-609 Phase C: Production Latency Report — production latency baselines by route, per-tool latency, cache rate analysis (cited in NFR-1, NFR-5, and the cohort-cache discussion in §5.5)
- Miro design board — supplementary architecture diagrams (canonical doc, runtime topology, memory model, prompt composition)
11.2 Empirical validation
Section titled “11.2 Empirical validation”The sub-agents-as-tools dispatch path was measured against the legacy classifier baseline on single-intent, mixed-intent, and multi-turn cases. Note: the measurement reflects an earlier direct-streaming variant of the dispatch path (since superseded by sub-agents-as-tools — see §11.3 Decision 2). Results validate the dispatch architecture (inline tool-call routing vs separate classifier LLM call) but the absolute TTFT numbers will shift modestly once orchestrator-mediated response generation is in place. Re-measurement under the sub-agents-as-tools pattern is a follow-up.
| Metric | Classifier path | Sub-agents-as-tools dispatch (direct-streaming variant, superseded) |
|---|---|---|
| TTFT (single-intent support) | 14983 ms | 1718 ms (8.7x faster) |
| Total latency (single-intent support) | 14983 ms | 12051 ms |
| Routing accuracy across 4 test cases | 100% | 100% |
| Multi-turn intent switching | Routed correctly | Routed correctly |
| Mixed-intent (N=2) | Both verticals invoked | Both tools fired with intent_count=2 |
Key takeaway: the sub-agents-as-tools dispatch is decisively better than the classifier path on TTFT and matches it on routing accuracy — validates the dispatch commitment. N=3 reliability is deferred until a third sub-agent is registered.
11.3 Decisions resolved during design
Section titled “11.3 Decisions resolved during design”| # | Decision | Resolution |
|---|---|---|
| 1 | Skill scope | Adopted at sub-agent scope as agent cards (Agent Definition + referenced prompt blocks + tool declarations + model selection). No higher-level Skill aggregator above sub-agents; orchestrator owns composition. Memory persistence stays separate from the card (consumer-agent memory architecture). See §3.3. |
| 2 | Gateway variant | Sub-agents-as-tools dispatch canonical. Orchestrator owns final response generation from sub-agent tool results — provides a single seam for tone/UX consistency. Direct-streaming variant evaluated and superseded — known conversation-flow inconsistencies (Scout/Forethought voice divergence as the in-production example) outweigh the TTFT win. Drift risk scales with vertical count (Play / eReceipts / Restaurants / PointPass). Direct streaming may be revisited if a centralized streaming-time consistency mechanism is built. Canonical decision doc in §11.1 references. Legacy classifier deprecated. |
| 3 | System-prompt boundary | Per-request assembly from prompt_blocks + dynamic context. System prompt is not part of Agent Definition identity. |
| 4 | Tool config 3-layer split | Document the three layers (agent_config.yaml, settings.<env>, feature flags) as PC1’s primitive declaration. Unification evaluated and resolved in PS2: do not unify in v1. See PS2 §11.2 Decision 4. |
| 5 | Discovery model | Factory-time binding. Defer runtime discovery. |
| 6 | Vertical integration model | Native sub-agent-as-tool is the canonical and only supported integration model. Scout (Forethought-wrapping) integrates through this same pattern; the upstream service that handles response generation is opaque to the orchestrator. |
| 7 | Memory model | Per-sub-agent + prior_context + compaction. No structured shared state. |
| 8 | Mixed-intent dispatch primitive | Orchestrator pattern supports multiple sub-agent tool calls per turn; orchestrator composes the final response from aggregated results. Execution contract (concurrency model, cap, partial-failure semantics) is PC2’s. |
| 9 | Prompt-block registry location | File-backed registry in prompts/components/ (consumer-agent repo). Versioned via git, reviewed via PR. Opik is for eval/observability, not change control. |
| 10 | Sub-agent progress contract | Sub-agents emit typed structured status events (searching_offers, matching_receipt, etc.) — not user-facing prose. Orchestrator renders status events into user-visible strings in the assistant’s voice; owns per-event suppression/forward/transform policy. There is no separate “direct sub-agent streaming” mechanism — all events flow through a single ambient stream primitive; routing is orchestrator policy. Status-event type vocabulary owned by PC3. Canonical decision doc in §11.1 references. |
11.4 Verticals landscape (context for review)
Section titled “11.4 Verticals landscape (context for review)”Eight verticals are named in the Strategy & Onboarding doc, in three states:
- Live: Shop (S1c), Support / Scout (S2)
- Context done; integration in flight: Rewards / PointPass (S1a)
- In review or draft: Play (S1b), eReceipts (S1d), Offer Details (S1e), Restaurant Network (S1f), Retailer Context (S1g)
PC1’s design choices are sized for the expected scale of ≤10 verticals at maturity.