Intent Classification Research for Consumer-Agent

Last Updated: 2026-03-17

What Exists in Consumer-Agent
Confluence Doc Analysis
Industry Research: Best Practices
Recommended Architecture
Latency Analysis
Multi-Intent Handling
Deep Dive: Why LLMs Drop Intents and How to Fix It
Support Subagent vs Support Tool: Architecture Decision
Open Source Frameworks
Repo Analysis: Where Should the Forethought Tool Live?
Recommended Approach for Consumer-Agent
Open Questions / TODOs
Appendix A: Stage Testing Queries
Appendix B: Multi-Intent Case Study (Phase 1)

What Exists in Consumer-Agent

The codebase is a LangGraph + OpenAI Python agent (~12k LOC) serving as Fetch’s shopping assistant.

Key Findings

No intent classification or routing exists today. All messages go through a single conversational agent via /agent/stream
intent_labels: list[str] field already exists in the message model (history/models.py:144) and is now populated by HistoryMiddleware based on which tools were called during the turn
Three agents defined but only by agent_id param, not by dynamic routing: conversational, prompt-suggestions, title-generation
Tools: product search, offer search, purchase history, web search, webpage fetch, feedback
Streaming architecture with dual-mode (messages + updates) SSE
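
As a rough illustration of the second finding (labels derived from the tools called during a turn), the mapping could look like the sketch below. The tool names are the ones used elsewhere in this document; the label names, the `derive_intent_labels` helper, and the GENERAL default are assumptions, not the actual HistoryMiddleware code.

```python
# Hypothetical tool-name -> intent-label mapping (labels are illustrative).
TOOL_TO_INTENT = {
    "search_products": "SHOPPING",
    "web_search": "SHOPPING",
    "fetch_webpage": "SHOPPING",
    "search_offers": "OFFERS",
    "search_nearby_offers": "OFFERS",
    "get_user_purchase_history": "PERSONALIZATION",
    "scout_answer": "SUPPORT",
}


def derive_intent_labels(tool_names: list[str]) -> list[str]:
    """Return de-duplicated intent labels in first-seen order."""
    labels: list[str] = []
    for name in tool_names:
        label = TOOL_TO_INTENT.get(name)
        if label and label not in labels:
            labels.append(label)
    # Assumption: a turn with no recognized tool calls is labeled GENERAL.
    return labels or ["GENERAL"]
```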

Architecture Overview

Mobile Client / Go Gateway
        ↓
FastAPI /agent/stream POST
        ↓
Request Validation (StreamRequest schema)
        ↓
Episode Management (get_or_create)
        ↓
History Retrieval (last N messages from DynamoDB)
        ↓
System Prompt Construction (4-layer composition)
        ↓
Agent.stream() - LangGraph execution
        ↓
Streaming Events (SSE format)
        ↓
HistoryMiddleware (async capture & storage)
        ↓
Client receives events

Key Files

File	Purpose
`agent/agent.py`	Main Agent class, dual-mode streaming
`factory.py`	Model & agent creation, tool wiring
`utils/tools.py`	MCP tool wrappers (BaseTool + Pydantic schemas)
`prompts/manager.py`	4-layer system prompt composition
`history/models.py`	Message/Episode models (has `intent_labels` field)
`api/main.py`	FastAPI endpoints, request handling
`config.py`	Dual config: agent_config.yaml + settings.yaml

Extensibility Points for Intent Classification

intent_labels: list[str] field on the Message model (today derived from tool calls by HistoryMiddleware; a classifier could populate it directly)
StreamRequest.enabled_components — feature-flagged prompt components
StreamRequest.feature_flags — runtime behavior flags
LangGraph graph can be extended with conditional routing nodes

Confluence Doc Analysis

Source: Implementation of Intent Understanding and Routing

This doc lays out a thorough intent classification pipeline with several strong ideas we can build on:

Preprocessing: surface cleaning → syntactic pruning (spaCy dep parsing) → semantic normalization (WordNet hypernyms, verb canonicalization) → entity extraction
Discovery: SBERT embeddings → UMAP → HDBSCAN clustering → BERTopic labeling
Taxonomy: L1 macro categories (Planning, Point Earning, Commerce, Support, Other) → L2 specific intents
Runtime Router: 3-tier (Regex → Semantic Router → LLM fallback)

What We’re Building On

The taxonomy structure (L1/L2 hierarchy) is a great design pattern — we adopted a similar intent categorization (SUPPORT, SHOPPING, OFFERS, etc.)
The semantic router concept (embedding-based fast path) is a solid optimization we’d like to explore in a later phase
The evaluation framework (golden dataset, precision/recall metrics) gives us a good model for measuring classification quality

What We’re Simplifying for V1

For the initial rollout, we’re taking a leaner approach — not because the original ideas are wrong, but because our current architecture (LLM-based agent with tool calling) lets us defer some complexity until we have real traffic data to justify it:

spaCy preprocessing pipeline — since we already have an LLM in the loop, it handles normalization natively. We can revisit structured NLP preprocessing if we see quality gaps in practice.
Offline HDBSCAN clustering — valuable for taxonomy discovery and we may use it to refine our intent categories later, but not needed for the runtime classification path right now.
Regex guardrail layer — rather than building fast paths upfront, we’re starting with LLM-only routing and will add regex/embedding shortcuts based on observed traffic patterns and latency data.
WordNet hypernym replacement — deferring this until we have evidence that query normalization improves classification accuracy for our use case.

Industry Research: Best Practices

Multi-Intent Classification Approaches

Pattern A: Multi-Label Classification with Structured Output

The most robust modern approach uses LLMs with structured output / function calling to return multiple labels:

from pydantic import BaseModel, Field
from typing import List, Literal

INTENTS = Literal[
    "ORDER_STATUS", "REFUND_REQUEST", "CANCELLATION",
    "TECHNICAL_SUPPORT", "BILLING_INQUIRY", "ACCOUNT_MANAGEMENT",
    "PRODUCT_QUESTION", "COMPLAINT", "GENERAL_INQUIRY"
]

class IntentClassification(BaseModel):
    chain_of_thought: str = Field(description="Step-by-step reasoning")
    intents: List[INTENTS] = Field(description="All detected intents")
    primary_intent: INTENTS = Field(description="Most important intent")
    confidence: float = Field(ge=0.0, le=1.0)

Source: Instructor library
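
The schema above constrains each field, but not the relationship between fields (for example, that primary_intent is one of the detected intents). A minimal, stdlib-only sketch of that cross-field check follows; `parse_classification` is a hypothetical helper, and the payload shape simply mirrors the model above.

```python
import json

ALLOWED_INTENTS = {
    "ORDER_STATUS", "REFUND_REQUEST", "CANCELLATION",
    "TECHNICAL_SUPPORT", "BILLING_INQUIRY", "ACCOUNT_MANAGEMENT",
    "PRODUCT_QUESTION", "COMPLAINT", "GENERAL_INQUIRY",
}


def parse_classification(raw: str) -> dict:
    """Parse classifier JSON and enforce cross-field rules."""
    data = json.loads(raw)
    intents = data["intents"]
    unknown = [i for i in intents if i not in ALLOWED_INTENTS]
    if unknown:
        raise ValueError(f"unknown intents: {unknown}")
    if data["primary_intent"] not in intents:
        raise ValueError("primary_intent must be one of intents")
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    return data
```

With Pydantic, the same rule would typically live in a model-level validator; the plain-function form is used here only to keep the sketch dependency-free.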

Pattern B: Hierarchical Intent Taxonomy

Two-stage classification: broad category first, then specific sub-intent:

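from typing import List, Literal, Optional
from pydantic import BaseModel

class TopLevelIntent(BaseModel):
    category: Literal["BILLING", "TECHNICAL", "ACCOUNT", "PRODUCT", "OTHER"]
    confidence: float

# SubIntent is not defined in the source; this minimal shape is an assumption.
class SubIntent(BaseModel):
    name: str
    confidence: float

class HierarchicalClassification(BaseModel):
    top_level: TopLevelIntent
    sub_intents: List[SubIntent]
    requires_clarification: bool
    clarification_question: Optional[str] = None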

Source: Kapture CX, Microsoft Dynamics 365 IBR

LLM-Based vs Traditional NLU

Aspect	Traditional NLU	LLM-Based	Semantic Routing
Latency	`<10ms`	200-2000ms	`<5ms`
Cost per call	Near zero	$0.001-0.01	Near zero
Zero-shot ability	None	Excellent	Moderate
Multi-intent	Requires special training	Native with structured output	Limited
Accuracy	85-95%	89-96%	80-90%

The Modern Hybrid Approach (Industry Consensus)

User Message
    |
    v
[Semantic Router] -- fast, cheap, handles well-known intents
    |
    |-- HIGH confidence (>0.85) --> Route directly
    |-- MEDIUM confidence (0.6-0.85) --> Confirm with LLM
    |-- LOW confidence (<0.6) --> Full LLM classification
    |
    v
[LLM Classifier] -- slower, expensive, handles novel/ambiguous

Intent Routing Architectures

Architecture 1: Router Pattern (LangGraph Conditional Edges)

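from langgraph.graph import StateGraph

# GraphState, classify_intent, handle_shopping, handle_support, and
# route_to_handler are assumed to be defined elsewhere.
graph = StateGraph(GraphState)
graph.add_node("classifier", classify_intent)
graph.add_node("shopping", handle_shopping)
graph.add_node("support", handle_support)
graph.set_entry_point("classifier")
# route_to_handler returns a key that is mapped to the next node name.
graph.add_conditional_edges("classifier", route_to_handler, {
    "shopping": "shopping",
    "support": "support",
})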

Source: Clivern LangGraph Tutorial

Architecture 2: Supervisor / Subagent Pattern

A supervisor agent maintains conversation context and dynamically decides which subagents to call as tools. Key distinction: a supervisor maintains state across turns; a router is stateless single-dispatch.

Source: LangChain multi-agent docs, LangChain architecture blog

Architecture 3: Semantic Router + Agent Fallback

semantic-router handles the fast path with embedding similarity, falling back to LLM for complex cases.

Architecture 4: AWS Agent Squad

AWS Agent Squad provides pluggable classifiers (Bedrock, Anthropic, OpenAI) with conversation-history-aware routing.

Confidence Scoring & Fallback

def route_with_confidence(result: ClassificationResult) -> str:
    top = max(result.intents, key=lambda x: x.confidence)
    if top.confidence > 0.85:
        return top.intent                    # Route directly
    elif top.confidence > 0.6:
        sorted_intents = sorted(result.intents, key=lambda x: -x.confidence)
        if len(sorted_intents) > 1:
            gap = sorted_intents[0].confidence - sorted_intents[1].confidence
            if gap < 0.15:
                return "CLARIFY"             # Ambiguous
        return top.intent
    else:
        return "FALLBACK_TO_HUMAN"           # Low confidence
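
The function above depends on a ClassificationResult type that is not shown. A self-contained variant, with assumed dataclass shapes (`ScoredIntent` and `ClassificationResult` are hypothetical names), makes the three outcomes easier to try out:

```python
from dataclasses import dataclass


@dataclass
class ScoredIntent:
    intent: str
    confidence: float


@dataclass
class ClassificationResult:
    intents: list[ScoredIntent]


def route_with_confidence(result: ClassificationResult) -> str:
    ranked = sorted(result.intents, key=lambda x: x.confidence, reverse=True)
    top = ranked[0]
    if top.confidence > 0.85:
        return top.intent              # high confidence: route directly
    if top.confidence > 0.6:
        # Medium confidence: a small gap to the runner-up means ambiguity.
        if len(ranked) > 1 and top.confidence - ranked[1].confidence < 0.15:
            return "CLARIFY"
        return top.intent
    return "FALLBACK_TO_HUMAN"         # low confidence
```

For example, scores of 0.70 (SHOPPING) and 0.62 (SUPPORT) fall in the medium band with a gap under 0.15, so the router asks for clarification instead of guessing.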

Recommended Architecture

User Message
    │
    ▼
┌──────────────────────────┐
│  LLM Intent Classifier   │  (structured output / function calling)
│  Returns: List[Intent]   │  (multi-label, with confidence scores)
│  + primary_intent        │
│  + entities extracted    │
└──────────┬───────────────┘
           │
     ┌─────┴──────┐
     │   Router    │  (conditional edges in LangGraph)
     └─────┬──────┘
           │
    ┌──────┼──────┬──────────┐
    ▼      ▼      ▼          ▼
 [Shopping] [Support] [General] [Clarify]
  Agent      Agent     Agent

Key Patterns to Adopt

Pattern	How	Source
Multi-label classification	Pydantic model with `List[Intent]` via structured output	Instructor, OpenAI function calling
Confidence scoring	High (>0.85) → route directly; Medium (0.6-0.85) → verify; Low → clarify	Vellum, Langfuse
Hierarchical taxonomy	L1 broad category → L2 specific intent	Microsoft Dynamics 365 IBR
Supervisor pattern	Main agent classifies + delegates to subagents as tools	LangChain subagents, AWS Agent Squad

Latency Analysis

Constraint: The total latency for each conversation response from consumer-agent is already tight. Intent classification must not meaningfully increase TTFB (time to first byte). The analysis below covers three latency targets: 500ms, 100ms, and 10ms.

Approach Latency Characteristics

LLM Structured Output (~300-600ms)

gpt-5-mini with reasoning_effort: minimal, small prompt, ~10 tokens out → 300-600ms typical
Already paying OpenAI network round-trip from ECS → OpenAI (~50-100ms)
Structured output schema constrains generation, keeping token count small
At the mercy of OpenAI API variability — P95 could spike to 800ms+
Verdict: Achievable on average within 500ms, but P95 will likely exceed it

Semantic Router / Local Embeddings (~10-25ms)

Load a small model like all-MiniLM-L6-v2 (22MB ONNX) at startup
Inference: 5-20ms on CPU
Vector similarity against ~50-200 anchor utterances: <1ms
Total: 10-25ms — comfortably under 100ms
Tradeoff: Less accurate for novel/ambiguous queries. Single-label only (no multi-intent decomposition). Needs LLM fallback for the low-confidence tail.
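
To make the anchor-utterance lookup concrete, here is a minimal sketch of the similarity step. The 3-dimensional vectors are toy stand-ins for real sentence embeddings (a model such as all-MiniLM-L6-v2 would produce the actual vectors), and the `route` helper and 0.85 threshold are assumptions for illustration.

```python
import math

# Toy vectors standing in for embedded anchor utterances per intent.
ANCHORS = {
    "SHOPPING": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "SUPPORT": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, threshold=0.85):
    """Return (intent, score); intent is None when below the threshold."""
    best_intent, best_score = None, -1.0
    for intent, vectors in ANCHORS.items():
        score = max(cosine(query_vec, v) for v in vectors)
        if score > best_score:
            best_intent, best_score = intent, score
    if best_score < threshold:
        return None, best_score   # caller falls back to the LLM classifier
    return best_intent, best_score
```

Because only the single best intent is returned, this sketch also shows why the fast path is single-label.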

Remote Embeddings via OpenAI (~50-150ms)

OpenAI text-embedding-3-small API call: 50-150ms (network bound)
Vector similarity: <1ms
Total: 50-150ms — borderline for 100ms target

Regex / Keyword (~1-5ms)

Regex patterns: <1ms. Only handles exact patterns like “track my order #XXX”
Keyword matching + heuristics: 1-5ms. Brittle, high maintenance.
Verdict: Achievable at 10ms but only for a subset of queries. Not viable as sole classifier.
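
A regex fast path for the exact pattern mentioned above might look like the following sketch. The pattern, the `regex_fast_path` helper, and the ORDER_STATUS label are illustrative assumptions; anything that does not match falls through to a slower classifier.

```python
import re

# Matches phrasings like "track my order #A123"; the '#' is required here.
ORDER_TRACKING = re.compile(r"\btrack\s+(?:my\s+)?order\s+#(\w+)", re.IGNORECASE)


def regex_fast_path(message: str):
    """Return a routing decision for exact patterns, else None."""
    match = ORDER_TRACKING.search(message)
    if match:
        return {"intent": "ORDER_STATUS", "order_id": match.group(1)}
    return None
```

A paraphrase such as "where is my package?" returns None, which illustrates the limited coverage noted in the verdict.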

Feasibility by Target

Target	Approach	Feasibility	Multi-Intent	Accuracy
500ms	LLM structured output (sequential)	Average yes, P95 risky	Yes	~95%
100ms	Semantic router (local embeddings)	Yes, comfortably	No (single label)	~85-90%
100ms	Semantic router + LLM fallback (tiered)	P50 ~15ms, P95 ~400ms	Partial	~90%
10ms	Regex/keyword only	Yes, but limited coverage	No	~60-70%

Architectural Options to Minimize TTFB Impact

When classification runs sequentially before the main agent call, every millisecond is added directly to TTFB. Three architectural patterns avoid or reduce this cost:

Option A: Parallel Classify + Stream (best for ~0ms added TTFB)

User message arrives
    │
    ├──→ [Intent Classifier] (LLM, ~400ms)     ← runs in parallel
    │
    └──→ [Main Agent starts streaming]          ← TTFB stays the same
              │
              ▼
         ThinkingEvent emitted immediately
              │
         (classifier result arrives)
              │
         Route to correct subagent mid-stream

Start the main conversational agent immediately (preserving current TTFB), and run classification in parallel. If the result comes back as CUSTOMER_SUPPORT, redirect to the CS subagent. If it’s SHOPPING (the common case), the main agent is already running — zero added latency.

Added latency for common case (shopping): ~0ms
Added latency for redirect case (support): ~400ms (but only when routing changes)
Downside: Wasted LLM tokens if redirect needed. But if 80%+ of queries are shopping, this is efficient.
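
The parallel pattern can be sketched with asyncio. Everything here is a stand-in: `classify`, `main_agent`, and `support_agent` are hypothetical coroutines with sleeps in place of LLM calls, and output is buffered rather than streamed (in a real stream, tokens already sent to the client cannot be retracted).

```python
import asyncio


async def classify(message: str) -> str:
    await asyncio.sleep(0.02)  # stands in for the ~400ms classifier call
    return "CUSTOMER_SUPPORT" if "receipt" in message.lower() else "SHOPPING"


async def main_agent(message: str, buffer: list[str]) -> None:
    for token in ["thinking", "here", "are", "results"]:
        await asyncio.sleep(0.01)
        buffer.append(token)


async def support_agent(message: str, buffer: list[str]) -> None:
    buffer.append("support-answer")


async def handle(message: str) -> tuple[str, list[str]]:
    buffer: list[str] = []
    # Start the main agent immediately so the common case pays no extra latency.
    agent_task = asyncio.create_task(main_agent(message, buffer))
    intent = await classify(message)
    if intent == "CUSTOMER_SUPPORT":
        agent_task.cancel()           # redirect: discard the speculative work
        try:
            await agent_task
        except asyncio.CancelledError:
            pass
        buffer.clear()
        await support_agent(message, buffer)
    else:
        await agent_task              # common case: agent was already running
    return intent, buffer
```

The cancelled branch is where the wasted tokens come from: the main agent has already done part of its work by the time the classifier answers.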

Option B: Tiered — Semantic Router + LLM Fallback (best for ~15ms P50)

User message arrives
    │
    ▼
[Semantic Router] (local embeddings, ~15ms)
    │
    ├── confidence > 0.85 → route directly (covers ~60-70% of traffic)
    │
    └── confidence < 0.85 → [LLM fallback, ~400ms]
                              (but only for ~30% of queries)

Average added latency: 0.7 × 15ms + 0.3 × 400ms ≈ 130ms
P50 added latency: ~15ms (most users)
Downside: Requires curating anchor utterances per intent. Single-label classification on the fast path.
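
The average above can be reproduced with a small helper. The function name and the `include_router_on_fallback` flag are illustrative; the flag exists because the formula above charges fallback queries only the 400ms LLM cost, whereas in a tiered setup they would also pay the ~15ms router cost first.

```python
def expected_added_latency_ms(
    fast_share: float,
    fast_ms: float,
    fallback_ms: float,
    include_router_on_fallback: bool = False,
) -> float:
    """Traffic-weighted average of fast-path and fallback latency."""
    slow_ms = fallback_ms + (fast_ms if include_router_on_fallback else 0.0)
    return fast_share * fast_ms + (1.0 - fast_share) * slow_ms


# Matches the figure above: 0.7 x 15 + 0.3 x 400 = 130.5
as_stated = expected_added_latency_ms(0.7, 15, 400)
# Counting the router on the fallback path: 0.7 x 15 + 0.3 x 415 = 135.0
with_router = expected_added_latency_ms(0.7, 15, 400, include_router_on_fallback=True)
```

The difference is small (about 5ms), so the ~130ms estimate holds either way.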

Option C: Inline Classification (zero added latency)

Don’t add a separate classification step. Modify the system prompt to make the main agent classify-and-route as its first action:

Before responding, classify the user's intent by calling the `classify_intent` tool.
If the intent is CUSTOMER_SUPPORT, call the `handle_support` tool.
Otherwise, respond normally.

The classification happens inside the existing LLM call. No extra round-trip. The model is already “thinking” — you’re just adding a structured tool call to its first step.

Added latency: 0ms (classification is part of the existing call)
Downside: Ties classification to the main model (gpt-5-mini). Can’t optimize the classifier independently. Classification accuracy depends on prompt engineering within the main agent context.

Summary Table

Option	Added TTFB (common case)	Added TTFB (redirect case)	Multi-Intent	Accuracy	Complexity
A: Parallel	~0ms	~400ms (redirect)	Yes	~95%	Medium
B: Tiered	~15ms	~400ms (fallback)	No (fast path)	~85-95%	Medium-High
C: Inline	0ms	0ms	Yes	~90-95%	Low
Sequential LLM	~400ms	~400ms	Yes	~95%	Low

Recommendation: Option A (parallel) or Option C (inline) — both avoid adding to TTFB. Option B is the best middle ground if a dedicated classifier with independent optimization is needed.

Multi-Intent Handling

For messages like “Cancel my subscription, refund last month, and transfer my data”:

Classify all intents: [CANCELLATION, REFUND_REQUEST, DATA_EXPORT]
Identify dependencies: Refund depends on cancellation
Order execution: Cancel first → refund → export (or parallel where safe)
Synthesize response: Combine all agent outputs

from typing import List, Literal
from pydantic import BaseModel

# Intent is defined first so IntentClassification can reference it.
class Intent(BaseModel):
    type: Literal["SHOPPING", "CUSTOMER_SUPPORT", "ACCOUNT", "GENERAL"]
    sub_type: str        # e.g., "refund_request", "order_status"
    confidence: float
    extracted_entities: dict

class IntentClassification(BaseModel):
    chain_of_thought: str
    intents: List[Intent]
    primary_intent: Intent
    confidence: float
    requires_clarification: bool
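
The dependency and ordering steps can be sketched with the standard library's graphlib. The dependency table is an assumption for illustration (refund waits on cancellation, export is treated as independent), and `execution_batches` is a hypothetical helper; each returned batch contains intents that could run in parallel.

```python
from graphlib import TopologicalSorter

# Each intent maps to the set of intents that must finish before it.
DEPENDENCIES = {
    "CANCELLATION": set(),
    "REFUND_REQUEST": {"CANCELLATION"},
    "DATA_EXPORT": set(),
}


def execution_batches(detected: list[str]) -> list[list[str]]:
    """Group detected intents into ordered batches of parallel-safe work."""
    present = set(detected)
    # Ignore dependencies on intents the user did not actually ask for.
    graph = {i: DEPENDENCIES.get(i, set()) & present for i in detected}
    sorter = TopologicalSorter(graph)
    sorter.prepare()                  # raises CycleError on circular rules
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```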

Handling Overlapping Intents

Define intent implications (cancel often implies refund)
Define intent conflicts (can’t upgrade and cancel simultaneously)
Use confidence gap analysis to detect ambiguity
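
Implication and conflict rules can be kept as plain data and checked after classification. The rule tables below only encode the two examples from the list; the `review_intents` helper and the UPGRADE label are assumptions.

```python
# cancel often implies refund
IMPLIES = {"CANCELLATION": {"REFUND_REQUEST"}}
# can't upgrade and cancel simultaneously
CONFLICTS = [frozenset({"UPGRADE", "CANCELLATION"})]


def review_intents(detected: set[str]) -> dict:
    """Report implied-but-missing intents and conflicting pairs."""
    suggested: set[str] = set()
    for intent in detected:
        suggested |= IMPLIES.get(intent, set()) - detected
    conflicts = [sorted(pair) for pair in CONFLICTS if pair <= detected]
    return {"suggested": sorted(suggested), "conflicts": conflicts}
```

A non-empty "conflicts" list is a natural trigger for a clarification question, while "suggested" intents could be offered to the user rather than executed automatically.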

Deep Dive: Why LLMs Drop Intents and How to Fix It

Context: During V1 stage testing, we observed that the LLM reliably calls scout_answer for single-intent support queries (“where are my points?”) but drops the support intent in mixed-intent messages (“my receipt didn’t scan and find me snack deals” → only called search_offers). This section documents the root causes and evidence-based mitigation techniques.

Root Causes

Research identifies several reasons LLMs fail to act on all intents in compound messages:

1. Actionability bias. When one intent maps cleanly to a tool schema (e.g., “find me snack deals” → search_offers) and another is semantically vaguer (“my receipt didn’t scan” → which tool?), the model gravitates toward the cleaner mapping. Tool description quality is a primary driver — intents with better-described tools win. (Voiceflow)

2. Position bias (primacy/recency). LLMs exhibit serial position effects. GPT-4 variants show primacy bias (acting on the first-mentioned intent). Other models show recency bias. The effect is model-dependent and task-dependent — there is no universal winner. (arXiv:2406.15981)

3. Satisficing without decomposition. Without an explicit instruction to decompose the message into sub-requests, the model defaults to a satisficing strategy: resolve the first clear intent and consider the turn complete. OpenAI’s own guidance states: “Decompose the user’s query into all required sub-requests, and confirm that each is completed. Do not stop after completing only part of the request.” (GPT-5 Prompting Guide)

4. Low reasoning effort amplifies the problem. Our agent uses reasoning_effort: low for cost/latency. OpenAI’s GPT-5.2 guide warns that “disambiguating tool instructions to the maximum extent possible” is “particularly critical at minimal reasoning.” With low effort, the model takes shortcuts. (GPT-5.2 Prompting Guide)

5. Intent mismatch drift. A 2026 paper found that “the Assistant’s interpretation progressively drifts away from the user’s true intent” in multi-turn conversations — “not a failure of model capability but rather a breakdown in interaction.” (arXiv:2602.07338)

Evidence-Based Mitigation Techniques

Technique 1: Explicit Decomposition Directive

The single most effective technique. Force the model to enumerate all intents before acting.

From OpenAI’s GPT-5 prompting guide:

“Decompose the user’s query into all required sub-requests, and confirm that each is completed.”

From GPT-5.4 guidance: define a <completeness_contract> block in the system prompt that requires verification all sub-requests are fulfilled before yielding.

Evidence: OpenAI reports this as one of three “agentic instructions” that boosted SWE-bench scores by ~20%. (GPT-4.1 Prompting Guide)
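
Whether a decomposition directive actually works can be checked offline by comparing the intents expected for a test message with the tools the agent called. The sketch below is one possible evaluation helper, not part of the agent: `dropped_intents` is hypothetical, and the intent-to-tool table reuses the tool names from this document.

```python
INTENT_TOOLS = {
    "SUPPORT": {"scout_answer"},
    "SHOPPING": {"search_products", "web_search", "fetch_webpage"},
    "OFFERS": {"search_offers", "search_nearby_offers"},
    "PERSONALIZATION": {"get_user_purchase_history"},
}


def dropped_intents(expected_intents: list[str], called_tools: list[str]) -> list[str]:
    """Return expected intents for which no matching tool was called."""
    called = set(called_tools)
    return [
        intent
        for intent in expected_intents
        # Intents without tools (e.g. GENERAL) cannot be "dropped" this way.
        if intent in INTENT_TOOLS and not (INTENT_TOOLS[intent] & called)
    ]
```

For the failure observed in stage testing ("my receipt didn't scan and find me snack deals" with only search_offers called), this reports SUPPORT as dropped.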

Technique 2: Keyword Anchoring (Support-First Check)

Instead of relying on the LLM to semantically map “my receipt didn’t scan” → “receipt issue” → SUPPORT, provide a concrete keyword list. If ANY keyword appears, scout_answer is mandatory.

Support trigger words: points, receipt, scan, account, referral, redeem, reward,
support, help, bug, error, missing, problem, issue, eligible, eligibility,
contact, refund, subscription, password, login, verify, verification

Why this works: Matching against an explicit word list is a near-literal lookup that the LLM performs reliably even at low reasoning effort. “scan” is in the list, so “my receipt didn’t scan” triggers it regardless of phrasing. This avoids the fragility of example-based matching.

Tradeoff: Keywords can’t handle truly novel phrasings with no keyword overlap (e.g., “I got ripped off”), but they cover the vast majority of support queries. Broad words such as “help” or “account” can also over-trigger on shopping messages (e.g., “help me find snack deals”), so the list needs tuning against real traffic. The existing semantic signals in the tool description serve as a catch-all for edge cases.

Sources: Kore.ai Multi-Intent Detection, Label Your Data: Intent Classification 2025

Technique 3: Priority Ordering (Support as Mandatory First Step)

Rather than a flat classification table where all intents are equal, make support detection a mandatory first step that runs before other intent identification. Support has a higher cost of missing (user frustration, failed deflection) than shopping (just a missed tool call).

This is the “priority routing” pattern used in production systems like Kore.ai and LivePerson.

Technique 4: Negative Examples (Anti-Patterns)

Show the exact failure mode in the prompt. Research confirms negative examples are “surprisingly effective” for LLM instruction following, especially for tool-calling behavior. From GPT-4.1 guidance: “If tools are complex, create a dedicated # Examples section in your system prompt” that includes cases where the model should call multiple tools.

❌ WRONG: "my receipt didn't scan and find me snack deals" → only calls search_offers
✅ RIGHT: "my receipt didn't scan and find me snack deals" → calls scout_answer AND search_offers

Source: GPT-4.1 Prompting Guide

Technique 5: Parallel Tool Call Instructions

OpenAI models support parallel function calling by default (parallel_tool_calls: true), but the model needs to be explicitly told when to parallelize. From GPT-5.1/5.2 guides: “Parallelize tool calls whenever possible” and “Parallelize independent reads when possible to reduce latency.”

Key limitation: There is no API-level way to say “must call tool X AND also allow other tools.” tool_choice can force one specific function, or require that at least one tool is called, but it cannot mandate a particular combination of tools. Parallelism must be prompted, not forced.

Source: OpenAI Function Calling Guide, GPT-5.1 Prompting Guide

Technique 6: Two-Stage Prompting (Count-Then-Classify)

A peer-reviewed technique from Neurocomputing 2024: first predict the number of intents, then identify each one. This avoids the threshold-setting problem of single-pass multi-label classification and forces the model to acknowledge multiple intents exist.

Source: Two Stages Prompting for Few-Shot Multi-Intent Detection (Neurocomputing 2024)

Prompt-Only vs. Code-Level Pre-Classification

Dimension	Prompt-Only (current)	Code-Level Pre-Classification
Setup cost	Low — zero-shot, no training data	Higher — requires labeled data or embeddings
Latency	0ms added (inline)	10-400ms added (sequential)
Determinism	Low — stochastic, model-dependent	High — same input → same output
Multi-intent reliability	Moderate with decomposition prompts	High with explicit multi-label classification
Debuggability	Low — prompt changes have unpredictable effects	High — traceable, auditable
Flexibility	High — handles novel intents zero-shot	Lower — limited to predefined intent set

When to escalate to code-level: If prompt-based techniques (Techniques 1-5 above) still fail to reliably detect support intents after testing, the next step is a lightweight pre-classifier — either keyword-based (in code, not prompt) or embedding-based (semantic-router). This runs before the LLM call and injects a hint like “The user’s message contains a SUPPORT intent — you MUST call scout_answer” into the system prompt.

Sources: Hybrid LLM + Intent Classification (Medium), Vellum: Intent Detection for Chatbots
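
A keyword-based pre-classifier of the kind described above could be as small as the sketch below. The keyword list is the one from Technique 2; the helper names, the whole-word matching rule, and the exact hint wording are assumptions. Whole-word matching means inflected forms such as "scanning" are not caught unless added to the list.

```python
import re

SUPPORT_KEYWORDS = [
    "points", "receipt", "scan", "account", "referral", "redeem", "reward",
    "support", "help", "bug", "error", "missing", "problem", "issue",
    "eligible", "eligibility", "contact", "refund", "subscription",
    "password", "login", "verify", "verification",
]
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUPPORT_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)
SUPPORT_HINT = (
    "The user's message contains a SUPPORT intent. You MUST call scout_answer."
)


def support_keywords_in(message: str) -> list[str]:
    """Return the distinct support keywords found, lowercased and sorted."""
    return sorted({m.lower() for m in _PATTERN.findall(message)})


def build_system_prompt(base_prompt: str, message: str) -> str:
    """Append the support hint only when a keyword is present."""
    if support_keywords_in(message):
        return f"{base_prompt}\n\n{SUPPORT_HINT}"
    return base_prompt
```

Because the match runs in code, the same message always yields the same hint, which is the determinism advantage listed in the table.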

V1.1 Recommended Prompt Structure

Based on the research above, the intent classification section of the system prompt should use a priority-based detection protocol combining Techniques 1-6:

## Intent Detection Protocol (MANDATORY — follow before every response)

Before calling any tools, follow these steps exactly:

### Step 1: Decompose the message
Identify ALL distinct requests or questions in the user's message. Count them.
A single message can contain multiple intents. Do not proceed until you have
identified every intent.

### Step 2: Support check (ALWAYS do this first)
Scan the user's message for ANY of these support-related words:
points, receipt, scan, account, referral, redeem, reward, support, help, bug,
error, missing, problem, issue, eligible, eligibility, contact, refund,
subscription, password, login, verify, verification

If ANY support-related word is found → you MUST call `scout_answer`.
This is non-negotiable, even if other intents are also present.

### Step 3: Identify all other intents
After the support check, also identify:
- SHOPPING → `search_products`, `web_search`, `fetch_webpage`
- OFFERS → `search_offers`, `search_nearby_offers`
- PERSONALIZATION → `get_user_purchase_history`
- GENERAL (greeting, small talk, clarification) → No tool needed

### Step 4: Call ALL identified tools in parallel
Parallelize independent tool calls. Never drop an intent. Do not consider
your turn complete until every identified intent has been addressed.

### Common mistake to avoid
❌ WRONG: "my receipt didn't scan and find me snack deals" → only calls search_offers
✅ RIGHT: "my receipt didn't scan and find me snack deals" → calls scout_answer AND search_offers

Key design decisions:

Decomposition step (Step 1) — forces the model to count intents before acting, preventing satisficing (from OpenAI GPT-5 guidance and Neurocomputing 2024 two-stage prompting research)
Keyword list instead of semantic signals — covers many phrasings without enumerating examples
Ordered steps instead of flat table — support check runs before other classification
Completeness contract (Step 4) — “Do not consider your turn complete” prevents the model from stopping after one intent
Negative example — shows the exact failure mode observed in testing
“MANDATORY” framing — stronger directive than “Before responding, classify…”
Validated on stage — see Appendix B for a successful multi-intent case study

LLMCompiler: Future Consideration

For architecturally solving parallel function calling, the LLMCompiler framework (ICML 2024) decomposes problems into a DAG of tasks with inter-dependencies, then dispatches them in parallel. Reported results: up to 3.7x latency speedup, up to 6.7x cost savings, and up to ~9% accuracy improvement compared with ReAct. Worth evaluating if prompt-based parallelism proves insufficient.

Source: LLMCompiler: Parallel Function Calling (ICML 2024, arXiv:2312.04511)

Support Subagent vs Support Tool: Architecture Decision

Context: The PM PRD (Scout/Forethought Support Handoff) specifies that support will be handled by Forethought Solve API via an internal wrapper tool. This section discusses the two possible architectures and why the tool-based approach is the right one.

The Two Architectures

Scenario 1: Support as a Full Subagent

A separate agent with its own system prompt, tools, model, and multi-turn reasoning loop.

User: "I need a refund for my last purchase and find me coffee deals"
    │
    ▼
[Conversational Agent] (reasoning)
    │
    ├── Detects support intent → calls support subagent
    │       │
    │       ▼
    │   [Support Subagent] (separate LLM, own prompt, own tools)
    │       ├── Calls Forethought API
    │       ├── Reads account data
    │       ├── Reasons about refund policy
    │       ├── Generates support response
    │       └── Returns result to main agent
    │
    ├── Detects shopping intent → calls search_products("coffee")
    │
    ▼
[Conversational Agent] synthesizes both results into one response

When this makes sense:

Support requires multi-turn reasoning — the subagent needs to ask clarifying questions, look up multiple systems, apply complex business logic
Support needs a different model — e.g., a fine-tuned model for policy compliance, or a cheaper model for simple FAQ lookups
Support has its own tool ecosystem — ticketing system, account management APIs, refund processing, escalation workflows that shouldn’t be exposed to the shopping agent
Support needs independent optimization — separate prompt engineering, evaluation, A/B testing, and iteration cycle from the shopping agent
Isolation — support failures shouldn’t crash or pollute the shopping agent’s context

Costs:

Minimum 2 sequential LLM calls (main agent + subagent) → +300-600ms latency
More complex orchestration code
Harder to maintain conversational coherence across agent boundaries
Token duplication (both agents see the conversation history)

Scenario 2: Support as a Tool

A single function call that wraps an external API (Forethought Solve) and returns a response string. The conversational agent calls it like any other tool.

User: "I need a refund for my last purchase and find me coffee deals"
    │
    ▼
[Conversational Agent] (reasoning)
    │
    ├── tool call: scout_answer(query="refund for last purchase",    ← parallel
    │                           conversation_id="xxx")
    ├── tool call: search_products(descriptions=["coffee"])          ← parallel
    │
    (both results return)
    │
    ▼
[Conversational Agent] synthesizes both into one streamed response
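
The flow above, where both tool calls are dispatched together and the agent waits for both results, can be sketched with asyncio.gather. The two coroutines are stubs with sleeps in place of network calls; their signatures are assumptions, not the real tool interfaces.

```python
import asyncio


async def scout_answer(query: str) -> str:
    await asyncio.sleep(0.01)  # stands in for the Forethought round-trip
    return f"[support] answer for: {query}"


async def search_products(descriptions: list[str]) -> str:
    await asyncio.sleep(0.01)  # stands in for the product search call
    return f"[products] results for: {', '.join(descriptions)}"


async def run_mixed_turn() -> list[str]:
    # Both calls run concurrently; results come back in argument order.
    return await asyncio.gather(
        scout_answer("refund for last purchase"),
        search_products(["coffee"]),
    )


results = asyncio.run(run_mixed_turn())
```

Total wait time is roughly that of the slower call rather than the sum of both, which is the latency argument for the tool-based design.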

When this makes sense:

Support is primarily answer retrieval — an external system (Forethought) does the heavy lifting; we just need to call it and relay the response
No multi-turn reasoning needed within the support domain itself (Forethought handles its own conversation state via conversation_id)
The tool contract is simple: question in → answer out
Support responses are treated as tool output — the main agent weaves them into its response alongside other tool results

Costs:

Less isolation — support is just another tool in the main agent’s context
Main agent’s system prompt gets more complex (must know when/how to use the support tool)
Can’t independently optimize the support “agent” — it’s just a function call
Harder to add complex multi-step support flows later (forms, buttons, escalation chains)

Decision: Tool-Based Approach for V1

The PM PRD recommends an internal wrapper tool calling Solve API, but independently of that guidance, the tool-based approach is the right choice for V1 for the following reasons:

1. Forethought IS the agent. The “support reasoning” doesn’t happen in our system — Forethought Solve is the brain. It maintains its own conversation state, applies its own knowledge base, and generates answers. Our scout_answer tool is just a pass-through. There’s no reason to wrap a pass-through in a full agent loop.

2. Fits the existing architecture perfectly. Consumer-agent is tool-oriented. Adding scout_answer is identical to how search_products, search_offers, and fetch_webpage work today:

Define a Pydantic ScoutAnswerInput schema
Create a ScoutAnswerTool(MCPTool) or direct REST wrapper
Add it to the tool list in factory.py
Update system prompt to describe when to use it

3. Parallel tool calling enables multi-intent. OpenAI supports parallel tool calls. The agent can call scout_answer + search_products simultaneously for mixed queries. A subagent approach would require sequential orchestration.

4. Latency is minimized. Tool call adds ~0ms to TTFB for shopping queries (tool is only called for support). For support queries, latency = Forethought API response time (which we’d pay regardless of architecture).

5. V1 is text-only. V1 scope is text responses only. No multi-step forms, no complex UI interactions. A tool returning a text string is sufficient. V1.1 (buttons/forms) may warrant revisiting this decision.

Important caveat: This decision is V1-specific, not permanent. The tool approach works because Forethought is the brain and we’re just passing through. If we later replace Forethought with in-house support logic, need multi-step flows, or want independent optimization of support quality — the subagent approach becomes the right choice. The upgrade path is straightforward: promote the tool into a subagent with its own system prompt and tools.

Phase 3 update: We chose a hybrid approach — neither full subagent nor simple tool. The gateway graph routes support queries to a dedicated support_handler node that streams Forethought responses directly (via StreamWriter), while shopping queries go to the existing agent subgraph. This gives us streaming (which the tool couldn’t do) and isolation (which a pure tool doesn’t provide), without the overhead of a full support agent with its own LLM reasoning loop.

When to Revisit: Upgrade Path to Full Subagent

The tool-based approach should be revisited if any of these emerge:

Signal	Why It Matters
V1.1 multi-step flows (forms, buttons, escalation chains)	Tool can’t maintain multi-turn state; needs agent loop
Support needs its own tools (ticketing API, account management)	Tool-in-a-tool gets messy; subagent isolates tool surface
Support quality needs independent model/prompt tuning	Can’t tune a tool independently from the main agent
Resolving one support request requires several back-and-forth steps	Tool is one-shot; subagent can iterate
Forethought is replaced with in-house support logic	More reasoning = more need for a dedicated agent

Implementation Sketch (Tool-Based, V1)

from typing import Any

from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field

# Input schema: query only, no conversation_id.
# The tool is stateless (like all other tools). The LLM agent owns
# conversation context and formulates a self-contained query.
class ScoutAnswerInput(BaseModel):
    query: str = Field(
        description="A self-contained support question about Fetch. "
        "Rephrase the user's message as a standalone question that "
        "includes any relevant context from the conversation."
    )

# Tool definition
class ScoutAnswerTool(BaseTool):
    name: str = "scout_answer"
    description: str = (
        "Answer customer support questions about Fetch (points, receipts, "
        "account issues, eligibility, etc.). Use this when the user asks "
        "about support topics rather than shopping/product queries. "
        "Formulate a complete, self-contained question."
    )
    args_schema: type[BaseModel] = ScoutAnswerInput
    # Hypothetical async client wrapping the Forethought Solve API;
    # its ask() method and the response.text attribute are assumptions.
    forethought_client: Any = None

    def _run(self, query: str, **kwargs) -> str:
        # BaseTool requires _run; this tool is intended for async use only.
        raise NotImplementedError("scout_answer is async-only; use _arun")

    async def _arun(self, query: str, **kwargs) -> str:
        # Each call creates a new Forethought conversation (POST)
        response = await self.forethought_client.ask(query=query)
        return response.text  # V1: text only

Why no conversation_id as a tool input:

The tool is stateless, matching the behavior of all other tools (search_products, web_search, etc.)
Forethought returns a conversation_id when creating a conversation, but since the tool is recreated per request (via factory.py), there’s no instance to persist it on
The LLM agent already has full conversation context via message history — it’s responsible for distilling a good query
The tool description and input field description guide the LLM to write self-contained queries

Forethought context_variables:

These are workflow-specific variables configured in the Forethought dashboard, NOT general-purpose context
They feed into workflow conditions (e.g., “if language == es, route to Spanish workflow”)
V1 passes empty {} — only populate when Forethought workflows are configured to consume specific variables
Do NOT pass message history here — it won’t be used as conversational context

System prompt addition:

You have access to a `scout_answer` tool for customer support questions.
Use it when users ask about: points issues, receipt problems, account help,
eligibility questions, or other support topics.
For mixed queries (shopping + support), call scout_answer AND shopping tools
in parallel, then combine both answers in your response.

Key constraints from PRD:

Post-tool PII/policy scan on Forethought responses (log-only — Forethought is a trusted internal service)
Store Forethought token in secrets manager
Graceful fallback if Forethought is down (show help-center CTA + escalation)

Open Source Frameworks

Evaluated Frameworks

Framework	GitHub	What It Does	Can We Use It?
LangGraph	langchain-ai/langgraph	Graph-based workflow with conditional routing	Already in use. Consumer-agent is built on it. Conditional edges are how we’d add routing if we ever need a separate classification node.
semantic-router	aurelio-labs/semantic-router	Sub-ms embedding-based routing	Defer to Phase 2. Not useful for multi-intent classification (single-label only). Its real value is as a semantic cache inside `ScoutAnswerTool` — match incoming support queries against previously answered questions by embedding similarity, serve cached responses for near-duplicates (`<25ms` vs Forethought round-trip). Worth adding once we have traffic data showing repetitive support query patterns.
Instructor	567-labs/instructor	Structured output from LLMs with Pydantic validation	Not needed. Consumer-agent already uses LangChain’s `with_structured_output()` which does the same thing — Pydantic model in, typed result out. Instructor is for projects using the raw OpenAI SDK directly. Adding it would be a redundant dependency.
AWS Agent Squad	awslabs/agent-squad	Full multi-agent orchestrator with pluggable classifiers	Overkill. We’re adding one tool to an existing agent, not building a multi-agent system. Agent Squad solves a problem we don’t have yet.
Forethought MCP server	Forethought docs	MCP server for Forethought integration	Skip for V1. The PRD evaluated this (their Option B) and rejected it — harder to lock down if vendor changes tools, doesn’t eliminate the hard parts (output mapping, multi-step handling). Direct REST wrapper is simpler and more controllable.

Research Repos (Reference Only)

Repo	Description	Takeaway
intellistream/sage-intent	Keyword + LLM hybrid classification	Interesting pattern for combining fast keyword matching with LLM fallback. Could inform Phase 2 tiered approach.
JohnnyFoulds/multi-intent-classification	Multi-intent with LLMs + deep learning	Academic reference for multi-label classification techniques. Not directly usable.
dmarx/zero-shot-intent-classifier	LangChain-based zero-shot slot filling	Shows how to do zero-shot classification with LangChain. Pattern is similar to what we’d do with `with_structured_output()`.

Framework Selection for Consumer-Agent

No new frameworks needed for V1. The approach is:

Add a ScoutAnswerTool (same pattern as existing MCP tools — BaseTool + Pydantic schema)
Update the system prompt to describe when to use it
The conversational agent handles intent classification implicitly through its reasoning

The only framework worth adding later is semantic-router — not for intent classification (it’s single-label only, can’t handle multi-intent), but as a semantic cache inside the ScoutAnswerTool. When we have production traffic data showing repetitive support queries, semantic-router can:

Match incoming queries against cached Forethought responses by embedding similarity
Serve cached answers in <25ms for high-frequency FAQs (vs Forethought API round-trip)
Reduce Forethought API costs and improve availability (cache works even during vendor downtime)
Provide explicit intent distribution metrics as a side benefit

Repo Analysis: Where Should the Forethought Tool Live?

Four repos are relevant to Phase 1. Analysis of each:

Repo Overview

Repo	Language	Purpose	Existing Tools/Endpoints
consumer-agent	Python	LangChain/LangGraph conversational AI agent	6 MCP tool wrappers + `WebSearchTool` (direct)
rover-agent	Go	HTTP orchestrator, mobile entry point	Routes to consumer-agent (Python path) or OpenAI direct (Go path)
rover-mcp	Go	MCP tool server (12 tools)	search_products, search_offers, fetch_webpage, etc.
consumer-context-service	Go	Unified context API (REST + MCP)	14 MCP tools, 13 REST endpoints for product/offer enrichment

Request Flow

Mobile Client
    │
    ▼
Rover-Agent (Go, port 8080)
    │  feature flag: "python_agent"
    │
    ├── Python path ──→ Consumer-Agent (Python, port 8080)
    │                        │  calls tools via MCP
    │                        ├── rover-mcp (Go) ← product/offer/search tools
    │                        └── consumer-context-service (Go) ← enrichment tools
    │
    └── Go direct path ──→ OpenAI API ──→ rover-mcp directly
                           (being deprecated)

Placement Options

Option A: rover-mcp

Add scout_answer as a Go MCP tool alongside search_products, fetch_webpage, etc.

Pro	Con
Consistent with existing tool pattern	Requires Go implementation (team is Python-focused for agent work)
Available to both Go direct and Python paths	Go direct path is being deprecated — not a real benefit
Clean separation: tools in rover-mcp, agent logic in consumer-agent	Forethought conversation_id management is harder in a stateless MCP call
	Cross-repo coordination for every change

Option B: consumer-context-service

Add as a new REST endpoint + MCP tool.

Pro	Con
Already has MCP transport on port 8081	Wrong domain — CCS aggregates product/offer data, not external support APIs
consumer-agent already connects to `consumer_mcp`	Mixes concerns: enrichment service ≠ support service
	Same Go/cross-repo overhead as Option A

Option C: consumer-agent (direct BaseTool)

Add ScoutAnswerTool as a direct Python BaseTool — not an MCP wrapper.

Pro	Con
Exact precedent: `WebSearchTool` — direct BaseTool calling BrightData API, not MCP	Only available on Python path (not Go direct)
Single-repo change, fastest to ship
Python makes Forethought client code simpler
Conversation state (episode_id → conversation_id) stays in consumer-agent
Feature flag gating already works in factory.py
No cross-service coordination needed

Decision: Option C — consumer-agent directly

ScoutAnswerTool should be a direct BaseTool in consumer-agent, following the WebSearchTool precedent.

The key insight is that WebSearchTool already established this exact pattern: a direct BaseTool in consumer-agent that calls an external API (BrightData SERP) without going through rover-mcp or consumer-context-service. ScoutAnswerTool calling Forethought Solve API is structurally identical.

The “only available on Python path” con is irrelevant — the Go direct path is being deprecated, and all new agent features target consumer-agent.

Implementation location: src/consumer_agent/tools/scout.py (new file, alongside existing tools/ modules)

Recommended Approach for Consumer-Agent

Phase 1: Tool-based approach (completed)

Architecture: Inline classification (Option C in the Latency Analysis, not the placement Option C above). The conversational agent classifies intent implicitly through its reasoning, with no separate classification step or added latency.

Implementation:

Add ScoutAnswerTool — a direct BaseTool in consumer-agent wrapping Forethought Solve API (same pattern as WebSearchTool)
Update system prompt — describe when to use scout_answer vs shopping tools; instruct parallel calling for mixed queries
Populate intent_labels — infer from tool usage after the fact (called scout_answer → label as CUSTOMER_SUPPORT; called search_products → label as SHOPPING; etc.)
Add safety gates — post-tool PII/policy scan on Forethought responses (log-only; trusted internal service)
Feature-flag the rollout — gate scout_answer tool availability behind a feature flag for phased rollout
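
Inferring labels from tool usage could look roughly like the sketch below. Only the scout_answer and search_products pairs come from the list above; the other mappings, the function name, and the GENERAL default for turns with no tool calls are assumptions for illustration.

```python
# Tool-to-label mapping. Entries marked "assumption" are not from the plan above.
TOOL_INTENT_LABELS = {
    "scout_answer": "CUSTOMER_SUPPORT",
    "search_products": "SHOPPING",
    "search_offers": "OFFERS",   # assumption
    "web_search": "SHOPPING",    # assumption
}


def infer_intent_labels(tool_names: list[str]) -> list[str]:
    """Map the tools called in a turn to de-duplicated intent labels."""
    labels: list[str] = []
    for name in tool_names:
        label = TOOL_INTENT_LABELS.get(name)
        if label and label not in labels:
            labels.append(label)
    # Assumption: a turn with no mapped tool calls is labeled GENERAL.
    return labels or ["GENERAL"]
```

Tools without a mapping (for example llm_feedback) are ignored, so the mandatory feedback call does not add a label.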

Repos and files changed (consumer-agent only):

File	Change
`src/consumer_agent/tools/scout.py`	`ScoutAnswerTool(BaseTool)` + `ForethoughtClient` (async HTTP client for Solve API).
`src/consumer_agent/tools/__init__.py`	Export `ScoutAnswerTool`
`src/consumer_agent/factory.py`	Add `scout_answer` to tool assembly in `create_agent_from_config()`, gated by feature flag
`settings.yaml`	Add `forethought` config section (server_url, api_key secret ref, timeout) per environment
`agent_config.yaml`	Add `scout_answer` to agent tools list
`prompts/capabilities.md`	Document the `scout_answer` tool
`src/consumer_agent/history/middleware.py`	Populate `intent_labels` based on which tools were called in the turn
`tests/unit/test_scout.py`	Unit tests for `ScoutAnswerTool` with mocked Forethought API

Rover-agent change (minimal):

Pass scout_answer feature flag in feature_flags dict when calling consumer-agent (same mechanism as existing product_card and web_search flags)

What this gives us:

Zero added TTFB for shopping queries (the common case)
Multi-intent support via parallel tool calls (shopping + support in one turn)
Intent labels for analytics without a separate classification step
No new frameworks or dependencies
Single-repo implementation (consumer-agent), fastest path to production

Limitation discovered: The tool-based approach cannot stream Forethought responses because BaseTool._arun() returns a single str. The full Forethought response must be received before the agent can start generating its final answer. This motivated Phase 3.
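
The difference between the two contracts can be shown with plain asyncio. This is a toy sketch under stated assumptions: `upstream()` is a hypothetical chunked source, `tool_style()` stands in for a `_arun() -> str` method, and `node_style()` stands in for a node that pushes chunks through a writer callback. None of these names exist in the codebase.

```python
import asyncio
from collections.abc import AsyncIterator

CHUNKS = ["Try ", "re-snapping ", "the receipt."]
DELAY_S = 0.01  # stand-in for upstream latency per chunk


async def upstream() -> AsyncIterator[str]:
    """Hypothetical upstream that yields text chunks over time."""
    for chunk in CHUNKS:
        await asyncio.sleep(DELAY_S)
        yield chunk


async def tool_style() -> str:
    """Mirrors a `-> str` contract: nothing is visible until every chunk arrives."""
    return "".join([chunk async for chunk in upstream()])


async def node_style(emit) -> None:
    """Mirrors a streaming node: each chunk is handed to `emit` as it arrives."""
    async for chunk in upstream():
        emit(chunk)


async def main() -> tuple[str, list[str]]:
    full = await tool_style()
    seen: list[str] = []
    await node_style(seen.append)
    return full, seen


full_text, streamed = asyncio.run(main())
```

Both paths end with the same text; the difference is only when the caller first sees any of it.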

Phase 2: Semantic FastPath + Observability (builds on Phase 3, scales to PLT-140)

Prerequisite: Phase 3 gateway graph. Phase 2 adds a semantic matcher node upstream of the gateway classifier, plus intent logging for observability.

Architecture: Semantic Matcher as a Gateway Node

Phase 3 established the gateway graph with intent-based routing:

User Query → Gateway Classifier → support_handler (Forethought)
                                → shopping_agent

Phase 2 inserts a Semantic Matcher node before the classifier:

User Query → Semantic Matcher (cache hit?) → return cached response instantly (<25ms)
                    ↓ (cache miss)
             Gateway Classifier → support_handler → Forethought (~12s)
                                → shopping_agent

The semantic matcher is a LangGraph node, not buried inside a tool. This is the key design decision that makes it scalable to PLT-140 — the node can later evolve from “cache lookup” to “full intent classification + routing” without restructuring the graph.

Semantic Matcher Node

Uses semantic-router (embedding model + vector similarity) to match incoming queries against known patterns.

Phase 2 behavior (cache mode):

Embed incoming query using a local model (all-MiniLM-L6-v2, ~15ms)
Compare against cache of previously answered queries by cosine similarity
If similarity > threshold (e.g., >0.92), serve cached response instantly via StreamWriter
If below threshold, pass through to gateway classifier (existing Phase 3 flow)

Cache population:

Passive: cache every successful Forethought response (query embedding → response text)
Active: pre-seed with known FAQ pairs from Forethought knowledge base
TTL: 24-48 hours, or invalidate when Forethought knowledge base updates

Benefits:

Latency: <25ms for cache hits vs ~12s Forethought API round-trip
Cost: zero per cache hit vs per-query Forethought API cost
Availability: cached responses work even if Forethought is down

Intent Logging + Observability

Hook into the gateway graph’s updates mode to log intent classifications:

What to log: query text (redacted), classified intent, which node handled it, confidence score, response latency, cache hit/miss
Schema: generic (query, predicted_intent, handler_node, confidence, latency_ms, cache_hit) — same schema PLT-140 needs for its dashboard
Storage: append to an analytics table (DynamoDB or S3 parquet) for downstream reporting
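
The generic schema above maps naturally onto a small record type. A minimal sketch, assuming JSON lines as the serialized form; the class and function names are hypothetical and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IntentLogRecord:
    query: str             # redacted before logging
    predicted_intent: str
    handler_node: str
    confidence: float
    latency_ms: int
    cache_hit: bool


def to_json_line(record: IntentLogRecord) -> str:
    """Serialize one record as a single JSON line for an append-only sink."""
    return json.dumps(asdict(record), sort_keys=True)


example = IntentLogRecord(
    query="[redacted]",
    predicted_intent="support",
    handler_node="support_handler",
    confidence=0.95,
    latency_ms=12000,
    cache_hit=False,
)
line = to_json_line(example)
```

Keeping the record flat makes it easy to land in either DynamoDB or parquet without a schema translation step.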

This data serves dual purpose:

Phase 2: monitor Forethought deflection rate, empty response rate, cache hit rate
PLT-140: training data for the dedicated intent classifier, intent distribution dashboard

Classification Metrics

Using the logged data, track:

Cache hit rate: % of support queries served from cache (target: 30-50% of repeat FAQs)
Forethought empty response rate: queries that return HTTP 200 with no content
Misroute rate: queries that hit the wrong handler (requires manual sampling initially)
Deflection rate: % of support queries fully resolved without human escalation
Latency P50/P95: cache path vs Forethought path vs shopping path
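
Two of these metrics can be computed directly from the logged records. A rough sketch, assuming records are dicts with the logging schema's field names and that support traffic is tagged with a "support" intent string (an assumption); the function names are hypothetical.

```python
from statistics import quantiles


def cache_hit_rate(records: list[dict]) -> float:
    """Share of support queries served from cache; 0.0 when there are none."""
    support = [r for r in records if r["predicted_intent"] == "support"]
    if not support:
        return 0.0
    return sum(1 for r in support if r["cache_hit"]) / len(support)


def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (P50, P95) using inclusive quantiles; needs at least two samples."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Computing percentiles per path (cache, Forethought, shopping) just means filtering records by handler_node and cache_hit before calling `latency_percentiles`.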

Confidence-Based Escalation

Add a confidence threshold to the semantic matcher:

High confidence (>0.92): serve cached response
Medium confidence (0.70-0.92): pass to Forethought (existing flow)
Low confidence (<0.70): flag for human review or route to live agent handoff
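
The three bands reduce to a small function. A minimal sketch: the action names are hypothetical, and the boundary handling (exactly 0.92 falls in the medium band) follows the ranges as written above.

```python
HIGH_THRESHOLD = 0.92
LOW_THRESHOLD = 0.70


def escalation_action(similarity: float) -> str:
    """Map a semantic-matcher similarity score to a routing action."""
    if similarity > HIGH_THRESHOLD:
        return "serve_cached"
    if similarity >= LOW_THRESHOLD:
        return "forethought"
    return "human_review"
```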

Scaling to PLT-140

The Phase 2 semantic matcher node is designed to evolve into PLT-140’s 3-layer hybrid router:

Phase 2	PLT-140 Evolution
Semantic matcher (cache mode)	Layer 2: Semantic Router — same embedding infra, expanded from cache to full intent taxonomy (List Planning, Nearby Stores, Recipe-to-Cart, Support, etc.)
Gateway classifier (LLM)	Layer 3: LLM Fallback — already exists, handles complex/ambiguous queries
Not yet needed	Layer 1: Regex — add structured pattern matching for high-volume exact patterns (10-20% traffic)
Intent logging	PLT-140’s observability + training data pipeline
Classification metrics	PLT-140’s dashboard/reporting
Confidence-based escalation	PLT-140’s routing confidence thresholds

The gateway graph becomes PLT-140’s orchestrator — new intent categories just add new edges and handler nodes (billing bot, onboarding bot, etc.) to the same graph structure.

Phase 3: Gateway + Forethought Streaming (in progress)

Why: Phase 1’s tool-based approach cannot stream Forethought responses — BaseTool._arun() returns a single str, so the entire Forethought response must arrive before the agent can start its final answer. Phase 3 solves this by building a custom LangGraph graph with a dedicated gateway classification node that routes support queries directly to Forethought streaming, bypassing the tool limitation.

Architecture: Custom LangGraph StateGraph with an intent classification gateway (gpt-4.1-mini structured output) that routes to either a Forethought streaming handler or the existing shopping agent.

START → gateway (classifier) → [conditional]
                                → support_handler → END (pure support)
                                → shopping_agent → END (pure shopping)
                                → support_handler → shopping_agent → END (mixed)

New module: src/consumer_agent/gateway/

File	Purpose
`gateway/state.py`	`GatewayState` TypedDict (messages, intent, scout_query, shopping_query)
`gateway/classifier.py`	Gateway node — gpt-4.1-mini structured output for intent classification + query decomposition
`gateway/support_handler.py`	Forethought streaming node — uses `StreamWriter` to push `TextEvent` chunks directly to client
`gateway/graph.py`	`StateGraph` assembly + compilation with conditional routing edges
`gateway/stream_adapter.py`	Converts graph `astream()` output (messages/updates/custom modes) to `StreamEvent` iterator

Other files changed:

File	Change
`tools/scout.py`	Added `ask_stream()` method to `ForethoughtClient` — async SSE streaming from Forethought
`factory.py`	Added `create_gateway_agent_from_config()` alongside existing factory function
`api/main.py`	Feature-flag-gated gateway path: `scout_answer` flag → gateway graph instead of regular agent
`history/middleware.py`	Added `intent_labels` override param to `wrap_stream()` for support path
`agent_config.yaml`	Added `gateway` agent entry with configurable model (gpt-4.1-mini)

Key design decisions:

Feature flag at API level — same scout_answer flag from Phase 1 gates the gateway path. Rover-agent PR #121 already passes this flag.
Decomposition-first classification — the gateway classifier decomposes the message into questions first, then derives intent from which query fields are populated (no separate classification step).
Support keyword anchoring — explicit keyword list (points, receipt, scan, account, etc.) reduces false negatives for support detection.
Synthetic events for support path — ResponseIdEvent (with ft_ prefix) and zero-token UsageEvent keep HistoryMiddleware’s storage flow working without an OpenAI call.
Shopping path unchanged — when flag is off, zero latency impact. When flag is on, shopping queries get +50-100ms gateway overhead (gpt-4.1-mini classification call).
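
Decomposition-first classification can be illustrated without LangGraph. This is a sketch, not the code in gateway/classifier.py: the state fields come from the GatewayState description above, while the intent strings, `derive_intent`, and `route` are assumptions. In the real graph the equivalent logic would feed the conditional edges.

```python
from typing import Optional, TypedDict


class GatewayState(TypedDict, total=False):
    messages: list
    intent: str
    scout_query: Optional[str]
    shopping_query: Optional[str]


def derive_intent(state: GatewayState) -> str:
    """Derive intent from which decomposed query fields are populated."""
    has_support = bool(state.get("scout_query"))
    has_shopping = bool(state.get("shopping_query"))
    if has_support and has_shopping:
        return "mixed"
    if has_support:
        return "support"
    return "shopping"  # default bucket, matching the "everything else" catch-all


def route(state: GatewayState) -> list[str]:
    """Return the handler nodes to visit, in order."""
    return {
        "support": ["support_handler"],
        "shopping": ["shopping_agent"],
        "mixed": ["support_handler", "shopping_agent"],
    }[derive_intent(state)]
```

Because intent falls out of the populated fields, there is no second classification call to keep in sync with the decomposition.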

Prompt overlap analysis (GATEWAY_SYSTEM_PROMPT vs shopping agent):

The gateway classifier and shopping agent prompts serve different purposes and do not conflict:

Topic	Gateway Prompt (classifier.py)	Shopping Agent (conversational.txt + capabilities.md)
Support scope	Keyword list: points, receipt, scan, account, etc. Used for binary classification only.	Limitations section: receipt scanning, point redemption, account modification → redirect to app features.
Shopping scope	“Everything else”: one-line catch-all.	Detailed: product search, offers, budget, dietary, location-based, personalization, etc.
Image handling	Explicit section: receipt images → support, product images → shopping, ambiguous → default shopping. Generates text description for downstream.	capabilities.md lists “Image Analysis” as a core capability (analyze products, fridge contents, offer matching).
Behavioral rules	None — classifier only outputs structured JSON (intent + queries).	Extensive: tone, word cap, markdown formatting, tool discipline, safety, refusals, disclaimers.

Key difference: The gateway prompt asks “which bucket?” (~50-100ms, gpt-4.1-mini, no tools). The shopping agent prompt asks “how to respond” (gpt-5-mini, full tool suite, detailed formatting rules). No overlap in behavioral instructions.

Notable behavior change: In the original (non-gateway) path, receipt images hit the shopping agent which redirects to the Scan tab. In the gateway path, receipt images are classified as “support” and routed to Forethought, which provides actual support answers. This is the intended improvement.

What this gives us over Phase 1:

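Streaming support responses: Forethought output is pushed directly to the client as it arrives (the Solve API currently delivers it as a single NDJSON line; see Forethought Streaming & Latency)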
Dedicated classification model — can be optimized independently from the main conversational agent
Cleaner separation — support handling is isolated in its own graph node, not embedded in the main agent’s tool-calling loop
Better mixed-intent handling — explicit decomposition into scout_query + shopping_query instead of relying on LLM tool-call parallelism

What to skip

New frameworks (Instructor, AWS Agent Squad, Forethought MCP) — existing LangChain/LangGraph stack is sufficient
spaCy preprocessing pipeline — LLM handles normalization natively
Offline HDBSCAN clustering — useful for Phase 2 taxonomy discovery, not for V1
Regex guardrail layer — premature optimization without traffic data

Open Questions / TODOs

Items discovered during V1 implementation that need follow-up:

Forethought Configuration

Identify existing Forethought workflows — The Fetch Forethought dashboard reportedly has workflows configured. Need to audit what exists and whether any are relevant to the headless Solve API path vs. the widget path.
Determine which context_variables workflows expect — context_variables are workflow-specific, not general-purpose context. We currently pass {}. If workflows branch on variables like language, platform, or user_id, we should populate them. Requires coordination with the CX/Forethought admin team.
Sandbox vs. production API key — Production key obtained and configured (50c18d31-...). Stored in rover-agent-{{env}}/forethought-api-key.
Validate Forethought knowledge base coverage — Test whether the sandbox knowledge base has sufficient coverage for common Fetch support queries, or if content needs to be added/updated in the Forethought dashboard.
Sandbox workflow routing broken — After all sandbox workflows were activated (2026-03-16), intent routing became unreliable. Some queries return empty responses. Forethought confirmed: activated workflows without intent descriptions can degrade routing. Use production API key for reliable testing until sandbox is fixed.

Forethought Streaming & Latency

No token-level streaming available — Despite the stream: true parameter, the Solve API returns the full response as a single NDJSON line (application/x-ndjson with widget_components), not incremental SSE tokens. Both stream: true and stream: false return identical content; the only difference is transport format (NDJSON vs plain JSON). Forethought confirmed SSE is not currently supported via API.
Token-level streaming may be coming — Forethought is rolling out token-level streaming for their widget (week of 2026-03-17). Pending confirmation on whether the API will also support it.
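
Handling the single-line NDJSON body, including the empty-body failure case, can be sketched as below. Only the `widget_components` key comes from the observations above; the function names are hypothetical and the payload shape inside `widget_components` is not assumed.

```python
import json


def parse_ndjson(body: str) -> list[dict]:
    """Parse an NDJSON body into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def has_content(body: str) -> bool:
    """True when at least one parsed line carries non-empty widget_components."""
    return any(obj.get("widget_components") for obj in parse_ndjson(body))
```

An empty body (0 lines) and an HTTP 200 with empty `widget_components` both come back as "no content", which is the condition the fallback path needs to detect.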

Average Solve API latency: ~12s — Measured on stage (2026-03-17) across successful responses:

Query	Time to First Data	Notes
Support query 1	11.9s	Full response in one NDJSON line
Support query 2	11.6s	Full response in one NDJSON line
“Where are my points?”	12.6s	Full response in one NDJSON line
Failed query	30s timeout	Empty body (0 lines)

This latency creates a noticeable gap in the mobile chat UX compared to the rest of the flow which streams token-by-token. Token-level streaming would significantly improve perceived responsiveness.

Query Quality

Evaluate LLM query formulation quality — The tool description instructs the LLM to write self-contained queries, but we have no data on how well it actually does this. After stage deployment, sample scout_answer tool calls to evaluate query quality and refine the description/input schema if needed.
Consider adding examples to the tool description — If query quality is poor, adding few-shot examples to the input field description or capabilities.md could help (e.g., bad: “what about it?”, good: “How do I recover my Fetch account if I lost access to my email?”).

Content Safety

Content moderation beyond PII — V1 only scans for SSN and credit card patterns. Forethought could return off-topic, incorrect, or policy-violating content. Evaluate whether a broader content moderation gate is needed after observing production responses.

Observability

Forethought response quality metrics — Track empty response rate, PII detection rate, and user satisfaction with scout_answer responses to inform Phase 2 decisions.
Query-response logging for evaluation — Consider logging (redacted) query/response pairs to build an evaluation dataset for tuning query formulation and measuring Forethought answer quality.

Appendix A: Stage Testing Queries

Sample queries for validating intent classification and Forethought response quality. In Phase 1, support queries route through the scout_answer tool. In Phase 3 (gateway), support queries route through the support_handler node, which streams Forethought responses directly.

Points & Rewards

Query	Expected Intent	Expected Tool
“Where are my points?”	SUPPORT	`scout_answer`
“I scanned a receipt but didn’t get points”	SUPPORT	`scout_answer`
“How long does it take for points to show up?”	SUPPORT	`scout_answer`
“Why did my points disappear?”	SUPPORT	`scout_answer`

Receipt Issues

Query	Expected Intent	Expected Tool
“My receipt wasn’t accepted”	SUPPORT	`scout_answer`
“Can I scan an old receipt?”	SUPPORT	`scout_answer`
“The app says my receipt is a duplicate but I only scanned it once”	SUPPORT	`scout_answer`

Account Problems

Query	Expected Intent	Expected Tool
“I can’t log into my account”	SUPPORT	`scout_answer`
“How do I change my email address?”	SUPPORT	`scout_answer`
“I have a problem with my account”	SUPPORT	`scout_answer`
“How do I delete my Fetch account?”	SUPPORT	`scout_answer`

App Issues

Query	Expected Intent	Expected Tool
“The app keeps crashing”	SUPPORT	`scout_answer`
“Why can’t I redeem my points?”	SUPPORT	`scout_answer`
“The scan button isn’t working”	SUPPORT	`scout_answer`

General Support / How Fetch Works

Query	Expected Intent	Expected Tool
“How do referrals work?”	SUPPORT	`scout_answer`
“How do I contact support?”	SUPPORT	`scout_answer`
“What are Fetch Points worth?”	SUPPORT	`scout_answer`
“How does Fetch make money?”	SUPPORT	`scout_answer`

Mixed Intent (support + shopping)

Query	Expected Intents	Expected Tools
“Where are my points and show me coffee offers”	SUPPORT + OFFERS	`scout_answer` + `search_offers` (parallel)
“My receipt didn’t scan, also find me snack deals”	SUPPORT + OFFERS	`scout_answer` + `search_offers` (parallel)

Negative Cases (should NOT trigger scout_answer)

Query	Expected Intent	Expected Tool
“Find me gluten-free snacks”	SHOPPING	`search_products` / `web_search`
“What coffee offers are available?”	OFFERS	`search_offers`
“Hi there!”	GENERAL	None
“Compare Huggies vs Pampers”	SHOPPING	`web_search`

Appendix B: Multi-Intent Case Study (Stage, 2026-03-14, Phase 1)

Successful multi-intent detection and parallel tool execution on stage using the Phase 1 tool-based approach. This validates the priority-based Intent Detection Protocol added in V1.1. In Phase 3 (gateway), multi-intent is handled by the gateway classifier decomposing the message into scout_query + shopping_query, with the support handler streaming first, then the shopping agent.

User message: “My receipt didn’t scan, and find me coffee offers”

Detected intents: SUPPORT + OFFERS — both correctly identified via the keyword-anchored support check (“receipt”, “scan”) and offers detection (“offers”).

Key observation: All three tool calls were fired in parallel (same timestamp), not sequentially. The model decomposed the message into two distinct requests and addressed both.

Tool Call 1: `scout_answer` (SUPPORT intent)

Arguments:

{
  "query": "My receipt didn't scan when I tried to upload it in the app; I need help getting that purchase validated so I can earn Fetch Points. Please advise next steps and what information to provide."
}

Note: The LLM reformulated the user’s terse “my receipt didn’t scan” into a self-contained, context-rich query for Forethought — exactly as instructed by the tool description.

Response (from Forethought Solve API):

Sorry your receipt didn’t scan. Here’s what to do:

Try snapping the receipt again, making sure it’s clear and all edges are visible.

If you still don’t earn points, use the “Correct my receipt” option in the app.

Provide clear images and details of your purchase.

Let us know if you need more help!

Related Articles:

Troubleshooting eReceipts Not Scanning or Connecting

I Didn’t Earn Points on an Item I Purchased

How to Earn Points on Online Purchases

Tool Call 2: `search_offers` (OFFERS intent)

Arguments:

{
  "query": "coffee",
  "user_id": "/redacted/",
  "limit": 20
}

Response: Returned structured offer data including:

Multiround FFP — 100 points, category: BEVERAGES, with image and detail link
Additional carbonated soft drink offers (search_offers default limit applied)

The response included structured JSON content with offer cards for the mobile UI.

Tool Call 3: `llm_feedback` (mandatory)

Arguments:

{
  "message": "{\"observed_intent\":\"support_and_offer_search\",\"tools_used\":[\"scout_answer\",\"search_offers\"],\"gaps_or_wishes\":\"Would like clearer error codes for failed receipt scans and ability to retry within chat\",\"success\":\"context_given\",\"unsupported_feature\":\"none\",\"assumed_defaults\":[\"search_offers default limit applied\"]}"
}

Note: The feedback tool correctly reported both scout_answer and search_offers as tools used, confirming multi-intent execution.
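
Note that the feedback payload is JSON encoded twice: the tool arguments are a JSON object whose `message` value is itself a JSON string. A minimal sketch of decoding it for analysis; `decode_feedback` is a hypothetical helper and the sample is a trimmed copy of the arguments shown above.

```python
import json

# Trimmed copy of the tool-call arguments above (inner JSON kept as a string).
raw_arguments = json.dumps(
    {
        "message": json.dumps(
            {
                "observed_intent": "support_and_offer_search",
                "tools_used": ["scout_answer", "search_offers"],
            }
        )
    }
)


def decode_feedback(arguments_json: str) -> dict:
    """Decode the outer arguments object, then the JSON string in `message`."""
    outer = json.loads(arguments_json)
    return json.loads(outer["message"])


feedback = decode_feedback(raw_arguments)
```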

Final Assistant Response

Try re-snapping the receipt (flat, all edges visible). If it still fails, use the app’s “Correct my receipt” option and attach clear images — support links sent. I also found coffee offers below.

The assistant synthesized both tool results into a single coherent response, addressing the support issue first and then pivoting to the offers.
