Skip to content

Sub-Agents-as-Tools Architecture Design

  • Jira: PLT-616 — Spike: Evaluate sub-agents-as-tools architecture
  • Branch: feature/sub-agents-as-tools
  • Prior work: PLT-616 spike notebook (PR #246), intent-classification-research.md
  • OpenAI guidance: Section 5b — “one orchestrator agent calling agent as tools > multi agent handoff for simplicity”
  • Depends on: PR #245 (gpt-5.4-nano swap) for model validation, but built independently on main

PR #246’s multi-agent orchestration notebook measured four patterns: monolith, supervisor (langgraph_supervisor.create_supervisor), swarm, and A2A. Its conclusion — keep the monolithic shopping subgraph + static-routing gateway — is well-supported for the patterns it evaluated.

The notebook focused on multi-agent handoff patterns (supervisor with full delegation, swarm with peer handoffs). OpenAI’s Fetch guidance document (Section 5b) recommends a different pattern:

“one orchestrator agent calling agent as tools > multi agent handoff for simplicity”

This agents-as-tools pattern (one orchestrator calling sub-agents as function calls) wasn’t covered in PR #246. This spike builds on those findings by evaluating the pattern OpenAI specifically recommended.

The intent-classification-research.md documents the evolution:

  • Phase 1 (tool-based): Scout as a BaseTool — single conversational agent calls scout_answer like any other tool. Parallel tool calls for mixed intent. Zero added TTFB. Worked well (validated in Appendix B of that doc), but BaseTool._arun() returns a single str — no streaming. Combined with Forethought’s ~12s latency, this meant 12 seconds of dead silence for support queries.

  • Phase 3 (gateway graph, current): Custom LangGraph StateGraph with intent classifier (gpt-4.1-mini structured output) routing to support_handler (Forethought streaming via StreamWriter) or shopping agent subgraph. Solves the streaming problem but uses static conditional routing — closer to the handoff pattern OpenAI says not to use.

Combine the best of both: the agents-as-tools simplicity OpenAI recommends with the streaming capability Phase 3 provides. Upgrade the gateway classifier into an orchestrator that calls sub-agents as tool functions, with streaming through tool execution via LangGraph’s StreamWriter.


START → classifier (gpt-4.1-mini, structured output)
→ [static conditional edges]
→ support_handler → END
→ rewrite_query → shopping_agent → END
→ support_handler → rewrite_query → shopping_agent → END
START → orchestrator_node (gpt-5.4-nano, tool calling)
→ LLM emits preamble text (streams to user immediately)
→ LLM returns tool call(s)
→ post-processing guards validate tool calls
→ execute tool(s):
ask_support() — calls Forethought, streams via StreamWriter
ask_shopping() — runs shopping agent, streams via StreamWriter
→ END
  • gateway_node (classifier + structured output) → orchestrator_node (tool calling)
  • Static conditional edges (_route_by_intent, _after_support) → removed; LLM tool selection is the routing
  • _rewrite_shopping_query → absorbed into _execute_shopping() tool function
  • support_handler_node → core logic wrapped as _execute_support() callable inside orchestrator
  • Forethought client, conversation_id continuity, chunk streaming — all internal to _execute_support()
  • Shopping agent subgraph (create_agent with MCP tools) — invoked inside _execute_shopping()
  • Verbatim query protection — prompt-level + code-level override in tool functions
  • SSE event types, HistoryMiddleware — unchanged
  • api/main.py — calls create_gateway_graph(), streams result

The orchestrator node is a single async function — a plain LangGraph node with writer: StreamWriter, not a ReAct agent.

async def orchestrator_node(state: dict, writer: StreamWriter, *, ...):
# 1. Build messages with system prompt
# 2. Call gpt-5.4-nano with tool schemas
# 3. Stream preamble text tokens → writer(TextEvent)
# 4. Intercept tool calls, apply post-processing guards
# 5. Execute validated tool calls (FIFO queue for mixed intent)
# 6. Return final state (AIMessages for history)
ask_support = {
"name": "ask_support",
"description": "Answer customer support questions about Fetch.",
"parameters": {
"query": "The user's exact support question, verbatim.",
"support_category": "Topic category (missing_points, ereceipts, "
"rejected_receipt, rewards, fetch_shop, fetch_play, "
"fetch_card, point_pass, referrals, profile_help, "
"goodrx, social). Optional.",
"support_summary": "One-sentence summary of the support issue using "
"full conversation context. Under 300 chars.",
"prior_context": "Brief summary of relevant prior conversation context "
"when the user switches intent (e.g., from shopping to "
"support). Optional."
}
}
ask_shopping = {
"name": "ask_shopping",
"description": "Handle shopping queries — product search, deals, "
"recommendations, price comparisons, purchase history.",
"parameters": {
"query": "The user's exact shopping question, verbatim.",
"prior_context": "Brief summary of relevant prior conversation context "
"when the user switches intent (e.g., from support to "
"shopping). Optional."
}
}

XML format (matching PR #245’s style for gpt-5.4). Carries over all classification intelligence from the current classifier prompt:

  • Support keywords list → guides tool selection
  • Intent categories → mapped to tool descriptions
  • Image classification rules (receipt → support, product → shopping)
  • Greeting/check-in rules (→ Fastpath direct response, or ask_shopping if beyond simple greetings)
  • No-carryforward rule (only classify latest message)
  • No-reformulation rule (pass user’s exact words)
  • Support category taxonomy
  • Preamble instruction: “Before calling tools, emit a brief sentence acknowledging what you’re about to do”
  • gpt-5.4-nano with reasoning_effort: none + intent_count failsafe

Benchmarking compared gpt-5.4-nano and gpt-4.1-mini for orchestrator tool calling:

ModelMixed intent (30 queries, 3 batches)Avg latencyNotes
gpt-5.4-nano (none)30/30 (100%)~1,450msRare stochastic failures observed in early testing (~1 in 15)
gpt-5.4-nano (low)N/AN/ARequires Responses API for tool calling — not compatible with current LangChain path
gpt-4.1-mini30/30 (100%)~3,230msReliable but 2.2x slower

gpt-5.4-nano is 2.2x faster and nearly as reliable. The rare stochastic failure (dropping the second tool call for mixed intent) is mitigated by the intent_count failsafe — a code-level safety net that scales to N sub-agents.

Every tool call includes a required intent_count parameter — the LLM reports how many distinct intents it detected. After the LLM returns tool calls, the orchestrator compares:

  • intent_count (from tool args) vs len(tool_calls) (actual tool calls made)
  • If intent_count > len(tool_calls): the LLM dropped a tool call → retry with a stronger prompt that explicitly demands the missing tool call(s)
  • The retry merges new tool calls with existing ones (no duplicates)

This is a general mechanism that scales to N sub-agents — it compares numbers, not hard-coded intent types. If we add 5 more sub-agents and the LLM reports 3 intents but only calls 2 tools, the failsafe catches it and retries.

intent_count = tool_calls[0].get("args", {}).get("intent_count", len(tool_calls))
if intent_count > len(tool_calls):
# Re-invoke orchestrator with explicit instruction to call missing tools
retry_response = await orchestrator_model.ainvoke(retry_messages)
# Merge new tool calls with existing ones

Final decision: Use gpt-4.1-mini. Stage testing confirmed gpt-5.4-nano stochastically drops the second tool call for mixed-intent queries (~1 in 15), even with the intent_count failsafe — the LLM reports intent_count=1 (misclassifies as single intent), so the failsafe cannot detect the miss. gpt-4.1-mini is 2.2x slower but 100% reliable on mixed intent. The latency cost is offset by preamble streaming.


ScenarioBehavior
Single intent (support)ask_support streams live via writer
Single intent (shopping)ask_shopping streams live via writer
Mixed intentFIFO queue — both fire concurrently, first to produce output streams live, other buffers and flushes with simulated streaming

Both sub-agent calls fire concurrently via asyncio.gather. First to produce output claims the “live” streaming slot (protected by asyncio.Lock). The other buffers its output. When the live stream finishes, the buffered response flushes with a small artificial delay between chunks to simulate streaming (configurable, e.g., 15-20ms).

async def _execute_tools_concurrent(tool_calls, state, writer, ...):
lock = asyncio.Lock()
winner = None # first to produce output
buffers = {} # name → list of chunks
async def run_tool(name, func):
nonlocal winner
async for chunk in func(state, ...):
async with lock:
# Lock covers the entire check-and-emit path to prevent
# race where two coroutines both claim the live slot.
if winner is None:
winner = name
if name == winner:
writer(TextEvent(content=chunk)) # live stream
else:
buffers.setdefault(name, []).append(chunk)
await asyncio.gather(
run_tool("ask_support", _execute_support),
run_tool("ask_shopping", _execute_shopping),
)
# Flush buffered responses with simulated streaming delay
for name, chunks in buffers.items():
for chunk in chunks:
await asyncio.sleep(0.015) # 15ms simulated delay
writer(TextEvent(content=chunk))
  • _execute_support: Calls ForethoughtClient.ask_stream(), yields chunks (Forethought streams chunk-by-chunk, not token-by-token). Each chunk is a widget component fragment — typically a sentence or paragraph.

  • _execute_shopping: This is the most complex part of the migration. Today, stream_adapter.py is a 350-line adapter that depends on subgraphs=True and named node namespaces (gateway, support_handler, shopping_agent) to classify streaming events (reasoning vs text, tool call tracking, response_id extraction). Collapsing the graph to a single orchestrator node means the adapter can no longer distinguish event sources by namespace.

    Solution: _execute_shopping runs the shopping agent subgraph as its own inner astream(subgraphs=True) call and internally processes the three stream modes — replicating the relevant parts of stream_adapter’s shopping-path logic. subgraphs=True is required to get real-time token streaming from the shopping agent’s inner model/tools nodes; without it, LangGraph batches inner tokens until the subgraph completes:

    1. messages mode: Extract LLM token chunks. Classify as reasoning (before tools) or text (after tools) using the has_called_tools flag. Extract reasoning from additional_kwargs. Yield TextEvent or ReasoningEvent.
    2. updates mode: Track tool call start/end events from inner model and tools nodes. Yield ToolCallStartEvent, ToolResultEvent, ToolCallEndEvent. Track response_id from message metadata.
    3. custom mode: Pass through any custom events from inner nodes.

    This is essentially extracting the shopping-specific logic from stream_adapter.py into a self-contained async generator. The stream_adapter itself simplifies — it only needs to handle the orchestrator’s top-level custom events (preamble text, support chunks, shopping chunks all arrive as StreamEvent objects via writer()).

With the orchestrator emitting all events via writer(), the stream adapter simplifies significantly:

async for namespace, mode, chunk in graph.astream(
graph_input,
stream_mode=["custom"], # only custom mode needed
config=config,
):
if mode == "custom" and isinstance(chunk, StreamEvent):
yield chunk

The complex namespace/mode dispatch logic moves into _execute_shopping() where it processes the inner shopping subgraph’s stream. The adapter becomes a thin passthrough for typed StreamEvent objects.

Events emitted by the orchestrator via writer():

  • ThinkingEvent — TTFB optimization (emitted immediately)
  • ResponseIdEvent — synthetic ft_* for support, real resp_* from shopping
  • TextEvent — preamble text, support chunks, shopping text
  • ReasoningEvent — shopping agent reasoning tokens
  • ToolCallStartEvent / ToolCallEndEvent / ToolResultEvent — shopping agent tool activity
  • SupportContentEvent — separator between support and shopping in mixed intent
  • UsageEvent — token consumption (from shopping agent, zero for support-only)
  • CompletedEvent — stream finished

The orchestrator LLM emits text tokens before returning tool calls. These stream to the user immediately via writer(TextEvent) — no dead silence. Example:

User: "my receipt didn't scan and find me coffee deals"
Assistant: "Let me look into your receipt issue and find some deals for you."
[support response streams]
[SupportContentEvent emitted — internal, not visible to user]
[shopping response streams]

For mixed intent, the orchestrator emits a SupportContentEvent between the support and shopping responses. This is an internal event, not visible to the user — it is consumed by HistoryMiddleware to store the support response as a separate assistant message in DynamoDB, keeping support and shopping history cleanly separated. The SSE endpoint filters it out before sending to the client.

The user sees one continuous response: support text streams, then shopping text streams, with no visible separator. Same behavior as today, but emitted explicitly by the orchestrator node rather than detected by the stream adapter from node completion updates.

Fastpath (Greetings, Chitchat, Simple Follow-Ups)

Section titled “Fastpath (Greetings, Chitchat, Simple Follow-Ups)”

When the orchestrator LLM responds with text only and no tool calls, this is the Fastpath — the orchestrator handles the response directly without invoking any sub-agent. Examples: greetings (“hi”), chitchat (“what’s up”), acknowledgments (“thanks”), simple follow-ups (“ok”).

Currently these go to the shopping agent (gpt-5-mini with full conversational prompt and MCP tools) — heavyweight for a simple reply. The Fastpath lets gpt-5.4-nano respond directly, saving a full shopping agent LLM call and delivering a faster response.

The orchestrator prompt includes lightweight persona instructions (tone, name, brief greeting style) so Fastpath responses match the assistant’s voice without needing the full conversational prompt.

Safety constraint: The orchestrator has no safety guardrails (no refusal instructions, no content policy, no sensitive topic handling). Fastpath must be limited to a strict allowlist of trivially safe message types:

  • Greetings and hellos
  • Acknowledgments (“ok”, “thanks”, “got it”)
  • Simple chitchat (“what’s up”, “how are you”)

Anything beyond this allowlist must route to ask_shopping. If the user asks a question, makes a request, or says anything that could require judgment about content safety, the orchestrator must call ask_shopping and let the shopping agent handle it — the shopping agent’s full conversational prompt has built-in guardrails for safety, refusals, sensitive topics, and policy compliance.

The orchestrator prompt enforces this: “Only respond directly for simple greetings, acknowledgments, and chitchat. For ALL other messages — including general questions, opinions, advice, or anything you’re unsure about — call ask_shopping.”


Two layers, same as current architecture:

Orchestrator system prompt instructs: “Pass the user’s exact words as the query argument — do not reformulate, rephrase, expand, or add context.”

Applied inside each tool function before execution:

  • Single intent (orchestrator called one tool): always use the user’s exact text from state, ignoring LLM’s query arg.
  • Mixed intent, short message (≤4 words): also use exact text.
  • Mixed intent, long message: use LLM’s split but validate keyword overlap against user’s original text. Fall back to user’s text if no overlap.

The “single intent” signal comes from whether the orchestrator returned one tool call or two.

async def _execute_support(state, tool_args, ...):
user_text = get_latest_user_message(state)
is_single = len(tool_calls) == 1
if is_single:
effective_query = user_text
elif len(user_text.split()) <= 4:
effective_query = user_text
else:
query = tool_args.get("query", "")
effective_query = query if has_overlap(query, user_text) else user_text
# proceed with effective_query...

Context management has two competing goals:

  1. Within a single turn (mixed intent): Each sub-agent should only see its own portion — support content shouldn’t leak into the shopping agent’s current-turn context and vice versa.
  2. Across turns (intent switch): The full conversation history is valuable context. A user who discussed points issues in turn 1 and asks for deals in turn 2 expects the assistant to remember the whole conversation.

Each sub-agent receives:

  • Full cross-turn history — all prior turns from all sub-agents. This shared context makes the assistant feel coherent across intent switches.
  • Current turn query — only the portion relevant to this sub-agent (from the orchestrator’s tool call args).
  • prior_context (optional) — a brief summary of relevant prior conversation context, populated by the orchestrator when the user switches intent. Useful for sub-agents that can’t consume full message history (e.g., API-based services).

The orchestrator is the natural place to manage this — it sees the full history, knows which sub-agent it’s calling, and can adapt context format per sub-agent type:

For each sub-agent tool call:
if sub-agent is LLM-based (accepts message history):
pass full cross-turn history + current turn query
prior_context is supplementary (nice-to-have, not required)
if sub-agent is API-based (accepts only a query string):
pass current turn query
prior_context is essential — summarizes relevant cross-turn context

Both tools include an optional prior_context parameter:

ask_support = {
"parameters": {
"query": "The user's exact support question, verbatim.",
"support_category": "...",
"support_summary": "...",
"prior_context": "Brief summary of relevant prior conversation "
"context when the user switches intent. Optional."
}
}
ask_shopping = {
"parameters": {
"query": "The user's exact shopping question, verbatim.",
"prior_context": "Brief summary of relevant prior conversation "
"context when the user switches intent. Optional."
}
}

Support (API-based — Forethought):

  • Forethought accepts a query string, not structured history
  • prior_context is prepended to the query or passed as a Forethought context variable
  • Example: User shopped for Folgers in turn 1, then says “the coffee never gave me points” in turn 2 → prior_context: "User previously searched for Folgers coffee offers" helps Forethought understand the specific product

Shopping (LLM-based — agent subgraph):

  • Receives full cross-turn message history (including prior support turns from other turns)
  • prior_context is supplementary — the LLM can read the history directly
  • Current-turn isolation preserved: For mixed intent, the current turn’s support content is still stripped from the shopping agent’s view (same logic as today’s _rewrite_shopping_query)

Future sub-agents: Same pattern — LLM-based sub-agents get full history, API-based sub-agents rely on prior_context from the orchestrator.

The orchestrator LLM sees the full conversation history (all turns, all sub-agents) to make routing decisions and populate prior_context. It doesn’t generate user-facing content from this context (except Fastpath responses and preambles).


Three guards from the current classifier, applied between the LLM returning tool calls and execution:

Short user replies that echo the previous assistant’s support question (e.g., “Fetch shop” after “Is it from a Fetch Shop purchase?”) should route to support.

Adaptation: If the LLM calls ask_shopping on a short echo reply, override to ask_support. Same _is_support_echo_followup logic.

For mixed intent (both tools called), if one tool’s query arg has no content word overlap with the user’s message, drop that tool call.

Adaptation: Same _content_words and _has_overlap functions, applied to tool call args.

Handled inside tool functions (see Verbatim Query Protection section above).

tool_calls = response.tool_calls
# Guard 1: echo follow-up — may replace ask_shopping with ask_support
tool_calls = _apply_echo_guard(tool_calls, state)
# Guard 2: overlap check — may drop one tool call from mixed
tool_calls = _apply_overlap_guard(tool_calls, user_text)
# Guard 3: verbatim override — applied inside tool functions
await _execute_tools(tool_calls, state, writer)

Unchanged. _execute_support() contains the same logic as today’s support_handler_node:

  1. Retrieve conversation_id from episode metadata in DynamoDB
  2. Check TTL (55 min expiry)
  3. If valid → pass to Forethought (PUT to continue)
  4. If expired → replay last 6 support turns as context prefix
  5. After call → store new conversation_id back to episode metadata

All managed inside _execute_support(), independent of routing pattern.

The shopping agent subgraph receives previous_response_id for OpenAI server-side context. This is bound to the model at graph creation time in factory.py — unchanged.


FileChange
gateway/classifier.pyReplace → rename to gateway/orchestrator.py. Structured output prompt + GatewayOutput → tool-calling orchestrator prompt + tool schemas. Post-processing guards adapted for tool calls.
gateway/graph.pySimplify — remove _route_by_intent, _after_support, _rewrite_shopping_query, conditional edges. Graph becomes START → orchestrator_node → END.
gateway/support_handler.pyRefactor — extract core logic into callable async generator with explicit function signature: async def execute_support(query, support_category, support_summary, prior_context, *, forethought_client, history_store, episode_id, user_id, forethought_stream) -> AsyncGenerator[str, None]. Yields chunks instead of writing directly to StreamWriter. Fields previously read from state (scout_query, support_category, support_summary) become function arguments; prior_context is prepended to the query for cross-turn context on intent switches. The node wrapper (support_handler_node) is removed; the function is called from the orchestrator.
gateway/state.pySimplifyGatewayState drops intent, scout_query, shopping_query, support_category, support_summary (become local to orchestrator).
gateway/stream_adapter.pySimplify significantly — becomes a thin passthrough for StreamEvent objects from writer(). The complex namespace/mode dispatch logic (350 lines) moves into _execute_shopping() inside the orchestrator. Only needs stream_mode=["custom"].
agent_config.yamlUpdate gateway agent config if needed for tool-calling.
factory.pycreate_gateway_graph() simplified — fewer params.
tests/Classifier tests → orchestrator tests. Validate tool selection parity with current classification quality.
FilePurpose
gateway/orchestrator.pyOrchestrator node + FIFO queue + tool execution functions
FileWhy
tools/scout.pyForethoughtClient — same interface, called from _execute_support()
agent/agent.pyShopping agent subgraph — invoked from _execute_shopping()
agent/streaming.pySame event types (TextEvent, ReasoningEvent, etc.)
history/middleware.pyWraps graph stream — unchanged
api/main.pyCalls create_gateway_graph(), streams result — unchanged

RiskMitigation
gpt-5.4-nano tool calling quality insufficientValidate with PR #245’s 38 integration test scenarios adapted for tool selection. Fall back to structured output + code dispatch (Approach C) if quality is unreliable.
Orchestrator LLM reformulates queriesTwo-layer protection: prompt instruction + code-level verbatim override.
FIFO queue produces jarring UX for mixed intentSimulated streaming delay (15-20ms between chunks) for buffered responses.
Shopping agent token streaming through orchestrator writer_execute_shopping() runs the subgraph with astream() and forwards tokens via the FIFO queue callback. Needs validation that dual-mode streaming works correctly in this forwarding pattern.
Added latency vs current classifierOrchestrator replaces classifier (not adds to it). Net cost: tool-calling overhead vs structured output — expected ~50-150ms additional. Offset by preamble UX improvement.

  1. Tool selection parity: Port PR #245’s 38 integration tests to validate orchestrator routes correctly (same scenarios, tool calls instead of structured output).
  2. Streaming correctness: Verify preamble + support streaming + shopping streaming produce correct SSE events end-to-end.
  3. FIFO queue: Test mixed-intent scenarios with both orderings (support first, shopping first).
  4. Verbatim protection: Test that queries reach Forethought and shopping agent unmodified.
  5. conversation_id continuity: Multi-turn support conversations maintain Forethought context.
  6. Context isolation (same turn): Verify shopping agent never sees current-turn support content and vice versa in mixed intent.
  7. Cross-turn context (intent switch): Test that prior_context is populated correctly when intent switches between turns (shopping→support, support→shopping). Verify Forethought receives enriched context and shopping agent receives full cross-turn history.
  8. Fastpath: Validate greetings/chitchat are handled directly by orchestrator without invoking sub-agents. Verify safety — anything beyond trivial greetings routes to ask_shopping.
  9. Eval parity: Run Opik eval suite against orchestrator path, compare response_quality and policy scores against current gateway baseline.