
Intent Classification Research for Consumer-Agent

  1. What Exists in Consumer-Agent
  2. Confluence Doc Analysis
  3. Industry Research: Best Practices
  4. Recommended Architecture
  5. Latency Analysis
  6. Multi-Intent Handling
  7. Deep Dive: Why LLMs Drop Intents and How to Fix It
  8. Support Subagent vs Support Tool: Architecture Decision
  9. Open Source Frameworks
  10. Repo Analysis: Where Should the Forethought Tool Live?
  11. Recommended Approach for Consumer-Agent
  12. Open Questions / TODOs
  13. Appendix A: Stage Testing Queries
  14. Appendix B: Multi-Intent Case Study (Phase 1)

The codebase is a LangGraph + OpenAI Python agent (~12k LOC) serving as Fetch’s shopping assistant.

  • No intent classification or routing exists today. All messages go through a single conversational agent via /agent/stream
  • intent_labels: list[str] field already exists in the message model (history/models.py:144) and is now populated by HistoryMiddleware based on which tools were called during the turn
  • Three agents are defined (conversational, prompt-suggestions, title-generation), but they are selected by the agent_id param, not by dynamic routing
  • Tools: product search, offer search, purchase history, web search, webpage fetch, feedback
  • Streaming architecture with dual-mode (messages + updates) SSE

Mobile Client / Go Gateway
  ↓
FastAPI /agent/stream POST
  ↓
Request Validation (StreamRequest schema)
  ↓
Episode Management (get_or_create)
  ↓
History Retrieval (last N messages from DynamoDB)
  ↓
System Prompt Construction (4-layer composition)
  ↓
Agent.stream() - LangGraph execution
  ↓
Streaming Events (SSE format)
  ↓
HistoryMiddleware (async capture & storage)
  ↓
Client receives events

| File | Purpose |
| --- | --- |
| agent/agent.py | Main Agent class, dual-mode streaming |
| factory.py | Model & agent creation, tool wiring |
| utils/tools.py | MCP tool wrappers (BaseTool + Pydantic schemas) |
| prompts/manager.py | 4-layer system prompt composition |
| history/models.py | Message/Episode models (has intent_labels field) |
| api/main.py | FastAPI endpoints, request handling |
| config.py | Dual config: agent_config.yaml + settings.yaml |

Extensibility Points for Intent Classification

  • intent_labels: list[str] field on Message model — ready to populate
  • StreamRequest.enabled_components — feature-flagged prompt components
  • StreamRequest.feature_flags — runtime behavior flags
  • LangGraph graph can be extended with conditional routing nodes

Source: Implementation of Intent Understanding and Routing

This doc lays out a thorough intent classification pipeline with several strong ideas we can build on:

  1. Preprocessing: surface cleaning → syntactic pruning (spaCy dep parsing) → semantic normalization (WordNet hypernyms, verb canonicalization) → entity extraction
  2. Discovery: SBERT embeddings → UMAP → HDBSCAN clustering → BERTopic labeling
  3. Taxonomy: L1 macro categories (Planning, Point Earning, Commerce, Support, Other) → L2 specific intents
  4. Runtime Router: 3-tier (Regex → Semantic Router → LLM fallback)
  • The taxonomy structure (L1/L2 hierarchy) is a great design pattern — we adopted a similar intent categorization (SUPPORT, SHOPPING, OFFERS, etc.)
  • The semantic router concept (embedding-based fast path) is a solid optimization we’d like to explore in a later phase
  • The evaluation framework (golden dataset, precision/recall metrics) gives us a good model for measuring classification quality

For the initial rollout, we’re taking a leaner approach — not because the original ideas are wrong, but because our current architecture (LLM-based agent with tool calling) lets us defer some complexity until we have real traffic data to justify it:

  • spaCy preprocessing pipeline — since we already have an LLM in the loop, it handles normalization natively. We can revisit structured NLP preprocessing if we see quality gaps in practice.
  • Offline HDBSCAN clustering — valuable for taxonomy discovery and we may use it to refine our intent categories later, but not needed for the runtime classification path right now.
  • Regex guardrail layer — rather than building fast paths upfront, we’re starting with LLM-only routing and will add regex/embedding shortcuts based on observed traffic patterns and latency data.
  • WordNet hypernym replacement — deferring this until we have evidence that query normalization improves classification accuracy for our use case.

Pattern A: Multi-Label Classification with Structured Output


The most robust modern approach uses LLMs with structured output / function calling to return multiple labels:

from pydantic import BaseModel, Field
from typing import List, Literal

INTENTS = Literal[
    "ORDER_STATUS", "REFUND_REQUEST", "CANCELLATION",
    "TECHNICAL_SUPPORT", "BILLING_INQUIRY", "ACCOUNT_MANAGEMENT",
    "PRODUCT_QUESTION", "COMPLAINT", "GENERAL_INQUIRY"
]

class IntentClassification(BaseModel):
    chain_of_thought: str = Field(description="Step-by-step reasoning")
    intents: List[INTENTS] = Field(description="All detected intents")
    primary_intent: INTENTS = Field(description="Most important intent")
    confidence: float = Field(ge=0.0, le=1.0)

Source: Instructor library

Two-stage classification: broad category first, then specific sub-intent:

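from typing import List, Literal, Optional
from pydantic import BaseModel

class TopLevelIntent(BaseModel):
    category: Literal["BILLING", "TECHNICAL", "ACCOUNT", "PRODUCT", "OTHER"]
    confidence: float

class SubIntent(BaseModel):  # not defined in the source snippet; minimal assumed shape
    name: str
    confidence: float

class HierarchicalClassification(BaseModel):
    top_level: TopLevelIntent
    sub_intents: List[SubIntent]
    requires_clarification: bool
    clarification_question: Optional[str] = None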

Source: Kapture CX, Microsoft Dynamics 365 IBR

| Aspect | Traditional NLU | LLM-Based | Semantic Routing |
| --- | --- | --- | --- |
| Latency | <10ms | 200-2000ms | <5ms |
| Cost per call | Near zero | $0.001-0.01 | Near zero |
| Zero-shot ability | None | Excellent | Moderate |
| Multi-intent | Requires special training | Native with structured output | Limited |
| Accuracy | 85-95% | 89-96% | 80-90% |

The Modern Hybrid Approach (Industry Consensus)

User Message
|
v
[Semantic Router] -- fast, cheap, handles well-known intents
|
|-- HIGH confidence (>0.85) --> Route directly
|-- MEDIUM confidence (0.6-0.85) --> Confirm with LLM
|-- LOW confidence (<0.6) --> Full LLM classification
|
v
[LLM Classifier] -- slower, expensive, handles novel/ambiguous

Architecture 1: Router Pattern (LangGraph Conditional Edges)

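from langgraph.graph import StateGraph, END

# GraphState, classify_intent, route_to_handler and the two handlers are defined elsewhere.
graph = StateGraph(GraphState)
graph.add_node("classifier", classify_intent)
graph.add_node("shopping", handle_shopping)
graph.add_node("support", handle_support)
graph.set_entry_point("classifier")
# route_to_handler returns "shopping" or "support"; the mapping sends each label to its node.
graph.add_conditional_edges("classifier", route_to_handler, {
    "shopping": "shopping",
    "support": "support",
})
# Terminate the graph after either handler runs.
graph.add_edge("shopping", END)
graph.add_edge("support", END)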

Source: Clivern LangGraph Tutorial

Architecture 2: Supervisor / Subagent Pattern


A supervisor agent maintains conversation context and dynamically decides which subagents to call as tools. Key distinction: a supervisor maintains state across turns; a router is stateless single-dispatch.

Source: LangChain multi-agent docs, LangChain architecture blog

Architecture 3: Semantic Router + Agent Fallback


semantic-router handles the fast path with embedding similarity, falling back to LLM for complex cases.

AWS Agent Squad provides pluggable classifiers (Bedrock, Anthropic, OpenAI) with conversation-history-aware routing.

def route_with_confidence(result: ClassificationResult) -> str:
    top = max(result.intents, key=lambda x: x.confidence)
    if top.confidence > 0.85:
        return top.intent  # Route directly
    elif top.confidence > 0.6:
        sorted_intents = sorted(result.intents, key=lambda x: -x.confidence)
        if len(sorted_intents) > 1:
            gap = sorted_intents[0].confidence - sorted_intents[1].confidence
            if gap < 0.15:
                return "CLARIFY"  # Ambiguous
        return top.intent
    else:
        return "FALLBACK_TO_HUMAN"  # Low confidence

User Message
┌──────────────────────────┐
│ LLM Intent Classifier │ (structured output / function calling)
│ Returns: List[Intent] │ (multi-label, with confidence scores)
│ + primary_intent │
│ + entities extracted │
└──────────┬───────────────┘
┌─────┴──────┐
│ Router │ (conditional edges in LangGraph)
└─────┬──────┘
┌──────┼──────┬──────────┐
▼ ▼ ▼ ▼
[Shopping] [Support] [General] [Clarify]
Agent Agent Agent

| Pattern | How | Source |
| --- | --- | --- |
| Multi-label classification | Pydantic model with List[Intent] via structured output | Instructor, OpenAI function calling |
| Confidence scoring | High (>0.85) → route directly; Medium (0.6-0.85) → verify; Low → clarify | Vellum, Langfuse |
| Hierarchical taxonomy | L1 broad category → L2 specific intent | Microsoft Dynamics 365 IBR |
| Supervisor pattern | Main agent classifies + delegates to subagents as tools | LangChain subagents, AWS Agent Squad |

Constraint: The total latency for each conversation response from consumer-agent is already tight. Intent classification must not meaningfully increase TTFB. Analysis below covers three latency targets.

  • gpt-5-mini with reasoning_effort: minimal, small prompt, ~10 tokens out → 300-600ms typical
  • Already paying OpenAI network round-trip from ECS → OpenAI (~50-100ms)
  • Structured output schema constrains generation, keeping token count small
  • At the mercy of OpenAI API variability — P95 could spike to 800ms+
  • Verdict: Achievable on average within 500ms, but P95 will likely exceed it

Semantic Router / Local Embeddings (~10-25ms)

  • Load a small model like all-MiniLM-L6-v2 (22MB ONNX) at startup
  • Inference: 5-20ms on CPU
  • Vector similarity against ~50-200 anchor utterances: <1ms
  • Total: 10-25ms — comfortably under 100ms
  • Tradeoff: Less accurate for novel/ambiguous queries. Single-label only (no multi-intent decomposition). Needs LLM fallback for the low-confidence tail.
  • OpenAI text-embedding-3-small API call: 50-150ms (network bound)
  • Vector similarity: <1ms
  • Total: 50-150ms — borderline for 100ms target
  • Regex patterns: <1ms. Only handles exact patterns like “track my order #XXX”
  • Keyword matching + heuristics: 1-5ms. Brittle, high maintenance.
  • Verdict: Achievable at 10ms but only for a subset of queries. Not viable as sole classifier.
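
The local-embedding fast path described above (embed the message, compare it against anchor utterances, fall back when the best score is low) can be sketched as follows. This is a minimal illustration, not the semantic-router library's API: `embed` is a toy bag-of-words stand-in for a real model such as all-MiniLM-L6-v2, and the `ANCHORS` utterances and the 0.85 threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Hypothetical anchor utterances; a real deployment would curate ~50-200 of them.
ANCHORS = {
    "SUPPORT": ["my receipt did not scan", "where are my points"],
    "OFFERS": ["find me snack deals", "show me offers near me"],
}

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(message: str, threshold: float = 0.85):
    """Return (intent, score); intent is None when the caller should fall back to the LLM."""
    query = embed(message)
    best_intent, best_score = None, 0.0
    for intent, utterances in ANCHORS.items():
        for utterance in utterances:
            score = cosine(query, embed(utterance))
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < threshold:
        return None, best_score
    return best_intent, best_score
```

A message that matches an anchor closely is routed on the fast path, while a novel phrasing such as "I got ripped off" returns `None` and would go to the LLM classifier. The sketch returns a single label, which is the single-label limitation noted above.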

| Target | Approach | Feasibility | Multi-Intent | Accuracy |
| --- | --- | --- | --- | --- |
| 500ms | LLM structured output (sequential) | Average yes, P95 risky | Yes | ~95% |
| 100ms | Semantic router (local embeddings) | Yes, comfortably | No (single label) | ~85-90% |
| 100ms | Semantic router + LLM fallback (tiered) | P50 ~15ms, P95 ~400ms | Partial | ~90% |
| 10ms | Regex/keyword only | Yes, but limited coverage | No | ~60-70% |

Architectural Options to Minimize TTFB Impact


If classification runs sequentially before the main agent call, every millisecond of classification is added directly to TTFB. Three architectural patterns avoid this:

Option A: Parallel Classify + Stream (best for ~0ms added TTFB)

User message arrives
├──→ [Intent Classifier] (LLM, ~400ms) ← runs in parallel
└──→ [Main Agent starts streaming] ← TTFB stays the same
ThinkingEvent emitted immediately
(classifier result arrives)
Route to correct subagent mid-stream

Start the main conversational agent immediately (preserving current TTFB), and run classification in parallel. If the result comes back as CUSTOMER_SUPPORT, redirect to the CS subagent. If it’s SHOPPING (the common case), the main agent is already running — zero added latency.

  • Added latency for common case (shopping): ~0ms
  • Added latency for redirect case (support): ~400ms (but only when routing changes)
  • Downside: Wasted LLM tokens if redirect needed. But if 80%+ of queries are shopping, this is efficient.
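
A minimal asyncio sketch of Option A, under stated assumptions: `classify_intent`, `main_agent_stream`, and `support_stream` are hypothetical stubs (the sleeps stand in for LLM and network latency), and the redirect simply abandons the speculative shopping stream once the classifier reports CUSTOMER_SUPPORT.

```python
import asyncio
from typing import AsyncIterator, List

async def classify_intent(message: str) -> str:
    """Hypothetical classifier stub; the sleep stands in for the LLM call."""
    await asyncio.sleep(0.01)
    return "CUSTOMER_SUPPORT" if "receipt" in message.lower() else "SHOPPING"

async def main_agent_stream(message: str) -> AsyncIterator[str]:
    """Hypothetical main-agent stub: emits a thinking event, then chunks."""
    yield "thinking"
    for i in range(3):
        await asyncio.sleep(0.05)
        yield f"shopping-chunk-{i}"

async def support_stream(message: str) -> AsyncIterator[str]:
    """Hypothetical support-handler stub."""
    yield "support-answer"

async def handle(message: str) -> List[str]:
    # Start classification without waiting for it, so TTFB is unchanged.
    classifier = asyncio.create_task(classify_intent(message))
    events: List[str] = []
    stream = main_agent_stream(message)
    try:
        async for chunk in stream:
            events.append(chunk)
            # Redirect as soon as the classifier says this is a support query.
            if classifier.done() and classifier.result() == "CUSTOMER_SUPPORT":
                break
    finally:
        await stream.aclose()  # stop the speculative shopping stream
    if (await classifier) == "CUSTOMER_SUPPORT":
        async for chunk in support_stream(message):
            events.append(chunk)
    return events
```

In the shopping case the classifier result changes nothing and the stream runs to completion. In the redirect case, any shopping chunks already emitted are wasted tokens, which is the downside listed above; a real implementation would also need to decide how the client should treat chunks sent before the redirect.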

Option B: Tiered — Semantic Router + LLM Fallback (best for ~15ms P50)

User message arrives
[Semantic Router] (local embeddings, ~15ms)
├── confidence > 0.85 → route directly (covers ~60-70% of traffic)
└── confidence < 0.85 → [LLM fallback, ~400ms]
(but only for ~30% of queries)
  • Average added latency: 0.7 × 15ms + 0.3 × 400ms ≈ 130ms
  • P50 added latency: ~15ms (most users)
  • Downside: Requires curating anchor utterances per intent. Single-label classification on the fast path.
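
The average above is a weighted sum over the fast path and the fallback path. The small helper below reproduces it; the function name and the `fallback_pays_router` flag are illustrative assumptions. The flag covers the stricter accounting in which fallback queries pay for the router as well as the LLM call.

```python
def expected_added_latency_ms(
    fast_share: float,
    fast_ms: float,
    fallback_ms: float,
    fallback_pays_router: bool = False,
) -> float:
    """Expected added latency for a two-tier router.

    fast_share: fraction of traffic resolved by the semantic router alone.
    fallback_pays_router: if True, fallback queries cost router + LLM time.
    """
    slow_ms = fallback_ms + (fast_ms if fallback_pays_router else 0.0)
    return fast_share * fast_ms + (1.0 - fast_share) * slow_ms

# Figures from the text: 70% fast path at 15ms, 30% fallback at 400ms.
as_stated = expected_added_latency_ms(0.7, 15, 400)          # about 130.5 ms
stricter = expected_added_latency_ms(0.7, 15, 400, True)     # about 135 ms
```

Either way the mean is dominated by the fallback share, so the fast-path coverage (the ~60-70% figure) matters more than shaving milliseconds off the router itself.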

Option C: Inline Classification (zero added latency)


Don’t add a separate classification step. Modify the system prompt to make the main agent classify-and-route as its first action:

Before responding, classify the user's intent by calling the `classify_intent` tool.
If the intent is CUSTOMER_SUPPORT, call the `handle_support` tool.
Otherwise, respond normally.

The classification happens inside the existing LLM call. No extra round-trip. The model is already “thinking” — you’re just adding a structured tool call to its first step.

  • Added latency: 0ms (classification is part of the existing call)
  • Downside: Ties classification to the main model (gpt-5-mini). Can’t optimize the classifier independently. Classification accuracy depends on prompt engineering within the main agent context.
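
| Option | Added TTFB (common case) | Added TTFB (redirect case) | Multi-Intent | Accuracy | Complexity |
| --- | --- | --- | --- | --- | --- |
| A: Parallel | ~0ms | ~400ms (redirect) | Yes | ~95% | Medium |
| B: Tiered | ~15ms | ~400ms (fallback) | No (fast path) | ~85-95% | Medium-High |
| C: Inline | 0ms | 0ms | Yes | ~90-95% | Low |
| Sequential LLM | ~400ms | ~400ms | Yes | ~95% | Low |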

Recommendation: Option A (parallel) or Option C (inline) — both avoid adding to TTFB. Option B is the best middle ground if a dedicated classifier with independent optimization is needed.


For messages like “Cancel my subscription, refund last month, and transfer my data”:

  1. Classify all intents: [CANCELLATION, REFUND_REQUEST, DATA_EXPORT]
  2. Identify dependencies: Refund depends on cancellation
  3. Order execution: Cancel first → refund → export (or parallel where safe)
  4. Synthesize response: Combine all agent outputs

from typing import List, Literal
from pydantic import BaseModel

# Intent is defined first so that IntentClassification can reference it.
class Intent(BaseModel):
    type: Literal["SHOPPING", "CUSTOMER_SUPPORT", "ACCOUNT", "GENERAL"]
    sub_type: str  # e.g., "refund_request", "order_status"
    confidence: float
    extracted_entities: dict

class IntentClassification(BaseModel):
    chain_of_thought: str
    intents: List[Intent]
    primary_intent: Intent
    confidence: float
    requires_clarification: bool

  • Define intent implications (cancel often implies refund)
  • Define intent conflicts (can’t upgrade and cancel simultaneously)
  • Use confidence gap analysis to detect ambiguity
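
The dependency and ordering steps above can be sketched with the standard library's `graphlib`. The `DEPENDS_ON` map is a hypothetical example encoding only the one dependency named in the text (refund depends on cancellation); intents in the same batch have no dependency between them and could run in parallel.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependency map: intent -> intents that must complete first.
DEPENDS_ON: Dict[str, Set[str]] = {"REFUND_REQUEST": {"CANCELLATION"}}

def execution_batches(intents: List[str]) -> List[List[str]]:
    """Group detected intents into ordered batches that respect dependencies."""
    detected = set(intents)
    sorter = TopologicalSorter()
    for intent in intents:
        # Only enforce dependencies on intents that were actually detected.
        prerequisites = DEPENDS_ON.get(intent, set()) & detected
        sorter.add(intent, *prerequisites)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches(["CANCELLATION", "REFUND_REQUEST", "DATA_EXPORT"])
# batches == [["CANCELLATION", "DATA_EXPORT"], ["REFUND_REQUEST"]]
```

Here the export runs alongside the cancellation because nothing declares a dependency between them; whether that is actually safe is a product decision the map would need to capture.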

Deep Dive: Why LLMs Drop Intents and How to Fix It


Context: During V1 stage testing, we observed that the LLM reliably calls scout_answer for single-intent support queries (“where are my points?”) but drops the support intent in mixed-intent messages (“my receipt didn’t scan and find me snack deals” → only called search_offers). This section documents the root causes and evidence-based mitigation techniques.

Research identifies several reasons LLMs fail to act on all intents in compound messages:

1. Actionability bias. When one intent maps cleanly to a tool schema (e.g., “find me snack deals” → search_offers) and another is semantically vaguer (“my receipt didn’t scan” → which tool?), the model gravitates toward the cleaner mapping. Tool description quality is a primary driver — intents with better-described tools win. (Voiceflow)

2. Position bias (primacy/recency). LLMs exhibit serial position effects. GPT-4 variants show primacy bias (acting on the first-mentioned intent). Other models show recency bias. The effect is model-dependent and task-dependent — there is no universal winner. (arXiv:2406.15981)

3. Satisficing without decomposition. Without an explicit instruction to decompose the message into sub-requests, the model defaults to a satisficing strategy: resolve the first clear intent and consider the turn complete. OpenAI’s own guidance states: “Decompose the user’s query into all required sub-requests, and confirm that each is completed. Do not stop after completing only part of the request.” (GPT-5 Prompting Guide)

4. Low reasoning effort amplifies the problem. Our agent uses reasoning_effort: low for cost/latency. OpenAI’s GPT-5.2 guide warns that “disambiguating tool instructions to the maximum extent possible” is “particularly critical at minimal reasoning.” With low effort, the model takes shortcuts. (GPT-5.2 Prompting Guide)

5. Intent mismatch drift. A 2026 paper found that “the Assistant’s interpretation progressively drifts away from the user’s true intent” in multi-turn conversations — “not a failure of model capability but rather a breakdown in interaction.” (arXiv:2602.07338)

Technique 1: Explicit Decomposition Directive


The single most effective technique. Force the model to enumerate all intents before acting.

From OpenAI’s GPT-5 prompting guide:

“Decompose the user’s query into all required sub-requests, and confirm that each is completed.”

From GPT-5.4 guidance: define a <completeness_contract> block in the system prompt that requires verifying that all sub-requests are fulfilled before yielding.

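Evidence: OpenAI reports that three “agentic instructions” in the system prompt, including a persistence reminder closely related to this directive, boosted its internal SWE-bench Verified score by close to 20%. (GPT-4.1 Prompting Guide)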
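
One way to measure whether a decomposition directive is working is an offline completeness check over stage-test transcripts: given the intents a message is known to contain and the tools the agent actually called, report which intents were dropped. The sketch below is an assumed evaluation helper, not existing consumer-agent code; the intent-to-tool mapping mirrors the tool names used elsewhere in this document.

```python
from typing import Dict, Iterable, List, Set

# Intent -> tools that count as having addressed it (assumed mapping).
TOOLS_BY_INTENT: Dict[str, Set[str]] = {
    "SUPPORT": {"scout_answer"},
    "SHOPPING": {"search_products", "web_search", "fetch_webpage"},
    "OFFERS": {"search_offers", "search_nearby_offers"},
    "PERSONALIZATION": {"get_user_purchase_history"},
}

def dropped_intents(expected: Iterable[str], tools_called: Iterable[str]) -> List[str]:
    """Return expected intents for which none of the mapped tools was called."""
    called = set(tools_called)
    dropped = []
    for intent in expected:
        tools = TOOLS_BY_INTENT.get(intent)
        # Intents without a tool mapping (e.g. GENERAL) cannot be "dropped".
        if tools is not None and not (tools & called):
            dropped.append(intent)
    return dropped
```

For the failure observed in testing, `dropped_intents(["SUPPORT", "OFFERS"], ["search_offers"])` flags SUPPORT, which makes the drop rate easy to track across prompt revisions.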

Technique 2: Keyword Anchoring (Support-First Check)


Instead of relying on the LLM to semantically map “my receipt didn’t scan” → “receipt issue” → SUPPORT, provide a concrete keyword list. If ANY keyword appears, scout_answer is mandatory.

Support trigger words: points, receipt, scan, account, referral, redeem, reward,
support, help, bug, error, missing, problem, issue, eligible, eligibility,
contact, refund, subscription, password, login, verify, verification

Why this works: Matching against a fixed keyword list is a simple pattern check that the LLM can perform reliably even at low reasoning effort. “scan” is in the list, so “my receipt didn’t scan” triggers it regardless of phrasing. This avoids the fragility of example-based matching.

Tradeoff: Keywords can’t handle truly novel phrasings with no keyword overlap (e.g., “I got ripped off”), but they cover the vast majority of support queries. The existing semantic signals in the tool description serve as a catch-all for edge cases.

Sources: Kore.ai Multi-Intent Detection, Label Your Data: Intent Classification 2025
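
For reference, the same trigger list can be checked deterministically in code. The sketch below uses prefix matching at word boundaries so that inflected forms such as “receipts” or “scanned” also match; that choice is an assumption of this example, and it trades some false positives (for instance “accountant”) for better recall.

```python
import re
from typing import List

SUPPORT_TRIGGERS = [
    "points", "receipt", "scan", "account", "referral", "redeem", "reward",
    "support", "help", "bug", "error", "missing", "problem", "issue",
    "eligible", "eligibility", "contact", "refund", "subscription",
    "password", "login", "verify", "verification",
]

# A trigger matches when a word starts with it.
_TRIGGER_RE = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in SUPPORT_TRIGGERS) + r")\w*",
    re.IGNORECASE,
)

def support_triggers(message: str) -> List[str]:
    """Return the distinct trigger words found in the message, sorted."""
    return sorted({match.lower() for match in _TRIGGER_RE.findall(message)})
```

`support_triggers("my receipt didn't scan and find me snack deals")` finds `receipt` and `scan`, while “I got ripped off” returns an empty list, which illustrates the tradeoff described above.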

Technique 3: Priority Ordering (Support as Mandatory First Step)


Rather than a flat classification table where all intents are equal, make support detection a mandatory first step that runs before other intent identification. Support has a higher cost of missing (user frustration, failed deflection) than shopping (just a missed tool call).

This is the “priority routing” pattern used in production systems like Kore.ai and LivePerson.

Technique 4: Negative Examples (Anti-Patterns)


Show the exact failure mode in the prompt. Research confirms negative examples are “surprisingly effective” for LLM instruction following, especially for tool-calling behavior. From GPT-4.1 guidance: “If tools are complex, create a dedicated # Examples section in your system prompt” that includes cases where the model should call multiple tools.

❌ WRONG: "my receipt didn't scan and find me snack deals" → only calls search_offers
✅ RIGHT: "my receipt didn't scan and find me snack deals" → calls scout_answer AND search_offers

Source: GPT-4.1 Prompting Guide

Technique 5: Parallel Tool Call Instructions


OpenAI models support parallel function calling by default (parallel_tool_calls: true), but the model needs to be explicitly told when to parallelize. From GPT-5.1/5.2 guides: “Parallelize tool calls whenever possible” and “Parallelize independent reads when possible to reduce latency.”

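Key limitation: There is no API-level way to say “must call tool X AND also allow other tools.” tool_choice can force one specific function, or require at least one tool call of any kind ("required"), but it cannot require a particular combination of tools. Parallelism must be prompted, not forced.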

Source: OpenAI Function Calling Guide, GPT-5.1 Prompting Guide
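
On the execution side, once the model does emit several tool calls in one turn, the runtime can await them concurrently so the turn costs roughly the slowest call instead of the sum. A minimal sketch, assuming hypothetical async stubs for `scout_answer` and `search_offers` and a simple `(tool_name, query)` representation of tool calls:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

async def scout_answer(query: str) -> str:
    """Hypothetical stub; the sleep stands in for the Forethought round-trip."""
    await asyncio.sleep(0.02)
    return f"support answer for: {query}"

async def search_offers(query: str) -> str:
    """Hypothetical stub; the sleep stands in for the offers search."""
    await asyncio.sleep(0.02)
    return f"offers for: {query}"

TOOLS: Dict[str, Callable[[str], Awaitable[str]]] = {
    "scout_answer": scout_answer,
    "search_offers": search_offers,
}

async def run_tool_calls(tool_calls: List[Tuple[str, str]]) -> Dict[str, str]:
    """Run all requested tool calls concurrently and key results by tool name."""
    results = await asyncio.gather(*(TOOLS[name](query) for name, query in tool_calls))
    return {name: result for (name, _), result in zip(tool_calls, results)}

results = asyncio.run(run_tool_calls([
    ("scout_answer", "my receipt didn't scan"),
    ("search_offers", "snack deals"),
]))
```

Keying by tool name assumes each tool is called at most once per turn; a real implementation would key by the tool-call ID instead.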

Technique 6: Two-Stage Prompting (Count-Then-Classify)


A peer-reviewed technique from Neurocomputing 2024: first predict the number of intents, then identify each one. This avoids the threshold-setting problem of single-pass multi-label classification and forces the model to acknowledge multiple intents exist.

Source: Two Stages Prompting for Few-Shot Multi-Intent Detection (Neurocomputing 2024)
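
A rough sketch of the count-then-classify idea, under stated assumptions: `llm` is any callable that takes a prompt string and returns text, and the prompt wording and label set are illustrative, not the prompts used in the paper.

```python
from typing import Callable, List

LABELS = ["SUPPORT", "SHOPPING", "OFFERS", "PERSONALIZATION", "GENERAL"]

def count_then_classify(message: str, llm: Callable[[str], str]) -> List[str]:
    """Stage 1 asks how many intents exist; stage 2 asks for exactly that many labels."""
    count_reply = llm(
        "How many distinct requests does this message contain? "
        f"Answer with a number only.\n\nMessage: {message}"
    )
    count = int(count_reply.strip())

    labels_reply = llm(
        f"The message contains exactly {count} intents. "
        f"List them one per line, chosen from {', '.join(LABELS)}.\n\nMessage: {message}"
    )
    labels = [line.strip() for line in labels_reply.splitlines() if line.strip()]

    # Enforce agreement between the two stages instead of silently dropping intents.
    if len(labels) != count:
        raise ValueError(f"expected {count} intents, got {len(labels)}: {labels}")
    return labels
```

The explicit count gives the second stage a target to hit, and the length check turns a silently dropped intent into a visible error that the caller can retry or log.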

Prompt-Only vs. Code-Level Pre-Classification

| Dimension | Prompt-Only (current) | Code-Level Pre-Classification |
| --- | --- | --- |
| Setup cost | Low: zero-shot, no training data | Higher: requires labeled data or embeddings |
| Latency | 0ms added (inline) | 10-400ms added (sequential) |
| Determinism | Low: stochastic, model-dependent | High: same input → same output |
| Multi-intent reliability | Moderate with decomposition prompts | High with explicit multi-label classification |
| Debuggability | Low: prompt changes have unpredictable effects | High: traceable, auditable |
| Flexibility | High: handles novel intents zero-shot | Lower: limited to predefined intent set |

When to escalate to code-level: If prompt-based techniques (Techniques 1-5 above) still fail to reliably detect support intents after testing, the next step is a lightweight pre-classifier — either keyword-based (in code, not prompt) or embedding-based (semantic-router). This runs before the LLM call and injects a hint like “The user’s message contains a SUPPORT intent — you MUST call scout_answer” into the system prompt.

Sources: Hybrid LLM + Intent Classification (Medium), Vellum: Intent Detection for Chatbots
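
The hint-injection step could look like the sketch below. It is deliberately agnostic about the detector: `has_support_intent` is a hypothetical callable that may be backed by a keyword check or an embedding router, and the hint text paraphrases the one described above.

```python
from typing import Callable

SUPPORT_HINT = (
    "The user's message contains a SUPPORT intent. "
    "You MUST call scout_answer."
)

def with_support_hint(
    system_prompt: str,
    message: str,
    has_support_intent: Callable[[str], bool],
) -> str:
    """Append the support hint to the system prompt when the pre-classifier fires."""
    if has_support_intent(message):
        return f"{system_prompt}\n\n{SUPPORT_HINT}"
    return system_prompt
```

Because the hint only adds an instruction and never removes tools, a false positive costs one unnecessary scout_answer call, whereas a false negative leaves behavior the same as the prompt-only baseline.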

Based on the research above, the intent classification section of the system prompt should use a priority-based detection protocol combining Techniques 1-5:

## Intent Detection Protocol (MANDATORY — follow before every response)
Before calling any tools, follow these steps exactly:
### Step 1: Decompose the message
Identify ALL distinct requests or questions in the user's message. Count them.
A single message can contain multiple intents. Do not proceed until you have
identified every intent.
### Step 2: Support check (ALWAYS do this first)
Scan the user's message for ANY of these support-related words:
points, receipt, scan, account, referral, redeem, reward, support, help, bug,
error, missing, problem, issue, eligible, eligibility, contact, refund,
subscription, password, login, verify, verification
If ANY support-related word is found → you MUST call `scout_answer`.
This is non-negotiable, even if other intents are also present.
### Step 3: Identify all other intents
After the support check, also identify:
- SHOPPING → `search_products`, `web_search`, `fetch_webpage`
- OFFERS → `search_offers`, `search_nearby_offers`
- PERSONALIZATION → `get_user_purchase_history`
- GENERAL (greeting, small talk, clarification) → No tool needed
### Step 4: Call ALL identified tools in parallel
Parallelize independent tool calls. Never drop an intent. Do not consider
your turn complete until every identified intent has been addressed.
### Common mistake to avoid
❌ WRONG: "my receipt didn't scan and find me snack deals" → only calls search_offers
✅ RIGHT: "my receipt didn't scan and find me snack deals" → calls scout_answer AND search_offers

Key design decisions:

  • Decomposition step (Step 1) — forces the model to count intents before acting, preventing satisficing (from OpenAI GPT-5 guidance and Neurocomputing 2024 two-stage prompting research)
  • Keyword list instead of semantic signals — covers many phrasings without enumerating examples
  • Ordered steps instead of flat table — support check runs before other classification
  • Completeness contract (Step 4) — “Do not consider your turn complete” prevents the model from stopping after one intent
  • Negative example — shows the exact failure mode observed in testing
  • “MANDATORY” framing — stronger directive than “Before responding, classify…”
  • Validated on stage — see Appendix B for a successful multi-intent case study

For solving parallel function calling at the architecture level, the LLMCompiler framework (ICML 2024) decomposes problems into a DAG of tasks with inter-dependencies, then dispatches them in parallel. Reported results: up to 3.7x latency speedup, up to 6.7x cost savings, and up to ~9% accuracy improvement compared with ReAct. Worth evaluating if prompt-based parallelism proves insufficient.

Source: LLMCompiler: Parallel Function Calling (ICML 2024, arXiv:2312.04511)


Support Subagent vs Support Tool: Architecture Decision


Context: The PM PRD (Scout/Forethought Support Handoff) specifies that support will be handled by Forethought Solve API via an internal wrapper tool. This section discusses the two possible architectures and why the tool-based approach is the right one.

A separate agent with its own system prompt, tools, model, and multi-turn reasoning loop.

User: "I need a refund for my last purchase and find me coffee deals"
[Conversational Agent] (reasoning)
├── Detects support intent → calls support subagent
│ │
│ ▼
│ [Support Subagent] (separate LLM, own prompt, own tools)
│ ├── Calls Forethought API
│ ├── Reads account data
│ ├── Reasons about refund policy
│ ├── Generates support response
│ └── Returns result to main agent
├── Detects shopping intent → calls search_products("coffee")
[Conversational Agent] synthesizes both results into one response

When this makes sense:

  • Support requires multi-turn reasoning — the subagent needs to ask clarifying questions, look up multiple systems, apply complex business logic
  • Support needs a different model — e.g., a fine-tuned model for policy compliance, or a cheaper model for simple FAQ lookups
  • Support has its own tool ecosystem — ticketing system, account management APIs, refund processing, escalation workflows that shouldn’t be exposed to the shopping agent
  • Support needs independent optimization — separate prompt engineering, evaluation, A/B testing, and iteration cycle from the shopping agent
  • Isolation — support failures shouldn’t crash or pollute the shopping agent’s context

Costs:

  • Minimum 2 sequential LLM calls (main agent + subagent) → +300-600ms latency
  • More complex orchestration code
  • Harder to maintain conversational coherence across agent boundaries
  • Token duplication (both agents see the conversation history)

A single function call that wraps an external API (Forethought Solve) and returns a response string. The conversational agent calls it like any other tool.

User: "I need a refund for my last purchase and find me coffee deals"
[Conversational Agent] (reasoning)
├── tool call: scout_answer(query="refund for last purchase", ← parallel
│ conversation_id="xxx")
├── tool call: search_products(descriptions=["coffee"]) ← parallel
(both results return)
[Conversational Agent] synthesizes both into one streamed response

When this makes sense:

  • Support is primarily answer retrieval — an external system (Forethought) does the heavy lifting; we just need to call it and relay the response
  • No multi-turn reasoning needed within the support domain itself (Forethought handles its own conversation state via conversation_id)
  • The tool contract is simple: question in → answer out
  • Support responses are treated as tool output — the main agent weaves them into its response alongside other tool results

Costs:

  • Less isolation — support is just another tool in the main agent’s context
  • Main agent’s system prompt gets more complex (must know when/how to use the support tool)
  • Can’t independently optimize the support “agent” — it’s just a function call
  • Harder to add complex multi-step support flows later (forms, buttons, escalation chains)

The PM PRD recommends an internal wrapper tool calling Solve API, but independently of that guidance, the tool-based approach is the right choice for V1 for the following reasons:

1. Forethought IS the agent. The “support reasoning” doesn’t happen in our system — Forethought Solve is the brain. It maintains its own conversation state, applies its own knowledge base, and generates answers. Our scout_answer tool is just a pass-through. There’s no reason to wrap a pass-through in a full agent loop.

2. Fits the existing architecture perfectly. Consumer-agent is tool-oriented. Adding scout_answer is identical to how search_products, search_offers, and fetch_webpage work today:

  • Define a Pydantic ScoutAnswerInput schema
  • Create a ScoutAnswerTool(MCPTool) or direct REST wrapper
  • Add it to the tool list in factory.py
  • Update system prompt to describe when to use it

3. Parallel tool calling enables multi-intent. OpenAI supports parallel tool calls. The agent can call scout_answer + search_products simultaneously for mixed queries. A subagent approach would require sequential orchestration.

4. Latency is minimized. The tool adds ~0ms to TTFB for shopping queries, because it is only called for support. For support queries, latency = Forethought API response time (which we’d pay regardless of architecture).

5. V1 is text-only. V1 scope is text responses only. No multi-step forms, no complex UI interactions. A tool returning a text string is sufficient. V1.1 (buttons/forms) may warrant revisiting this decision.

Important caveat: This decision is V1-specific, not permanent. The tool approach works because Forethought is the brain and we’re just passing through. If we later replace Forethought with in-house support logic, need multi-step flows, or want independent optimization of support quality — the subagent approach becomes the right choice. The upgrade path is straightforward: promote the tool into a subagent with its own system prompt and tools.

Phase 3 update: We chose a hybrid approach — neither full subagent nor simple tool. The gateway graph routes support queries to a dedicated support_handler node that streams Forethought responses directly (via StreamWriter), while shopping queries go to the existing agent subgraph. This gives us streaming (which the tool couldn’t do) and isolation (which a pure tool doesn’t provide), without the overhead of a full support agent with its own LLM reasoning loop.

When to Revisit: Upgrade Path to Full Subagent


The tool-based approach should be revisited if any of these emerge:

| Signal | Why It Matters |
| --- | --- |
| V1.1 multi-step flows (forms, buttons, escalation chains) | Tool can’t maintain multi-turn state; needs agent loop |
| Support needs its own tools (ticketing API, account management) | Tool-in-a-tool gets messy; subagent isolates tool surface |
| Support quality needs independent model/prompt tuning | Can’t tune a tool independently from the main agent |
| Support conversations become multi-turn within a single user message | Tool is one-shot; subagent can iterate |
| Forethought is replaced with in-house support logic | More reasoning = more need for a dedicated agent |

from typing import Any
from langchain_core.tools import BaseTool  # assumption: the BaseTool used in utils/tools.py
from pydantic import BaseModel, Field

# Input schema — query only, no conversation_id
# The tool is stateless (like all other tools). The LLM agent owns
# conversation context and formulates a self-contained query.
class ScoutAnswerInput(BaseModel):
    query: str = Field(
        description="A self-contained support question about Fetch. "
        "Rephrase the user's message as a standalone question that "
        "includes any relevant context from the conversation."
    )

# Tool definition
class ScoutAnswerTool(BaseTool):
    name: str = "scout_answer"
    description: str = (
        "Answer customer support questions about Fetch (points, receipts, "
        "account issues, eligibility, etc.). Use this when the user asks "
        "about support topics rather than shopping/product queries. "
        "Formulate a complete, self-contained question."
    )
    args_schema: type[BaseModel] = ScoutAnswerInput
    # Injected client; declared as a field because BaseTool is a Pydantic model.
    forethought_client: Any

    def _run(self, query: str, **kwargs) -> str:
        # BaseTool requires a sync entry point; this tool is async-only.
        raise NotImplementedError("scout_answer is async-only; use _arun")

    async def _arun(self, query: str, **kwargs) -> str:
        # Each call creates a new Forethought conversation (POST)
        response = await self.forethought_client.ask(query=query)
        return response.text  # V1: text only

Why no conversation_id as a tool input:

  • The tool is stateless, matching the behavior of all other tools (search_products, web_search, etc.)
  • Forethought returns a conversation_id when creating a conversation, but since the tool is recreated per request (via factory.py), there’s no instance to persist it on
  • The LLM agent already has full conversation context via message history — it’s responsible for distilling a good query
  • The tool description and input field description guide the LLM to write self-contained queries

Forethought context_variables:

  • These are workflow-specific variables configured in the Forethought dashboard, NOT general-purpose context
  • They feed into workflow conditions (e.g., “if language == es, route to Spanish workflow”)
  • V1 passes empty {} — only populate when Forethought workflows are configured to consume specific variables
  • Do NOT pass message history here — it won’t be used as conversational context

System prompt addition:

You have access to a `scout_answer` tool for customer support questions.
Use it when users ask about: points issues, receipt problems, account help,
eligibility questions, or other support topics.
For mixed queries (shopping + support), call scout_answer AND shopping tools
in parallel, then combine both answers in your response.

Key constraints from PRD:

  • Post-tool PII/policy scan on Forethought responses (log-only — Forethought is a trusted internal service)
  • Store Forethought token in secrets manager
  • Graceful fallback if Forethought is down (show help-center CTA + escalation)
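
The log-only scan and the graceful fallback can be sketched roughly as follows. This is a minimal illustration, not the shipped implementation: the regexes, the `FALLBACK_TEXT` copy, and the `ask` callable (standing in for the Forethought client call) are all assumptions made for the example.

```python
import logging
import re

logger = logging.getLogger("scout.safety")

# Illustrative patterns only; the real detection rules are not specified here.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

# Hypothetical copy shown when Forethought is unavailable.
FALLBACK_TEXT = (
    "I couldn't reach support right now. "
    "Please visit the Help Center or contact support."
)


def scan_response(text: str) -> list[str]:
    """Return the names of PII patterns found. Log-only: never blocks."""
    findings = []
    if SSN_RE.search(text):
        findings.append("ssn")
    if CARD_RE.search(text):
        findings.append("credit_card")
    if findings:
        logger.warning("PII pattern(s) in Forethought response: %s", findings)
    return findings


async def answer_with_fallback(ask, query: str) -> str:
    """Await ask(query); return the fallback message on failure or empty text."""
    try:
        text = await ask(query)
    except Exception:
        logger.exception("Forethought call failed")
        return FALLBACK_TEXT
    if not text:
        return FALLBACK_TEXT
    scan_response(text)  # log-only: the response is returned unchanged
    return text
```

Because the scan only logs, a detection never changes what the user sees; tightening it into a blocking gate later would only require acting on the return value of `scan_response`.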

| Framework | GitHub | What It Does | Can We Use It? |
| --- | --- | --- | --- |
| LangGraph | langchain-ai/langgraph | Graph-based workflow with conditional routing | Already in use. Consumer-agent is built on it. Conditional edges are how we’d add routing if we ever need a separate classification node. |
| semantic-router | aurelio-labs/semantic-router | Sub-ms embedding-based routing | Defer to Phase 2. Not useful for multi-intent classification (single-label only). Its real value is as a semantic cache inside ScoutAnswerTool: match incoming support queries against previously answered questions by embedding similarity, serve cached responses for near-duplicates (<25ms vs Forethought round-trip). Worth adding once we have traffic data showing repetitive support query patterns. |
| Instructor | 567-labs/instructor | Structured output from LLMs with Pydantic validation | Not needed. Consumer-agent already uses LangChain’s with_structured_output(), which does the same thing: Pydantic model in, typed result out. Instructor is for projects using the raw OpenAI SDK directly. Adding it would be a redundant dependency. |
| AWS Agent Squad | awslabs/agent-squad | Full multi-agent orchestrator with pluggable classifiers | Overkill. We’re adding one tool to an existing agent, not building a multi-agent system. Agent Squad solves a problem we don’t have yet. |
| Forethought MCP server | Forethought docs | MCP server for Forethought integration | Skip for V1. The PRD evaluated this (their Option B) and rejected it: harder to lock down if vendor changes tools, doesn’t eliminate the hard parts (output mapping, multi-step handling). Direct REST wrapper is simpler and more controllable. |

| Repo | Description | Takeaway |
| --- | --- | --- |
| intellistream/sage-intent | Keyword + LLM hybrid classification | Interesting pattern for combining fast keyword matching with LLM fallback. Could inform Phase 2 tiered approach. |
| JohnnyFoulds/multi-intent-classification | Multi-intent with LLMs + deep learning | Academic reference for multi-label classification techniques. Not directly usable. |
| dmarx/zero-shot-intent-classifier | LangChain-based zero-shot slot filling | Shows how to do zero-shot classification with LangChain. Pattern is similar to what we’d do with with_structured_output(). |

No new frameworks needed for V1. The approach is:

  1. Add a ScoutAnswerTool (same pattern as existing MCP tools — BaseTool + Pydantic schema)
  2. Update the system prompt to describe when to use it
  3. The conversational agent handles intent classification implicitly through its reasoning

The only framework worth adding later is semantic-router — not for intent classification (it’s single-label only, can’t handle multi-intent), but as a semantic cache inside the ScoutAnswerTool. When we have production traffic data showing repetitive support queries, semantic-router can:

  • Match incoming queries against cached Forethought responses by embedding similarity
  • Serve cached answers in <25ms for high-frequency FAQs (vs Forethought API round-trip)
  • Reduce Forethought API costs and improve availability (cache works even during vendor downtime)
  • Provide explicit intent distribution metrics as a side benefit

Repo Analysis: Where Should the Forethought Tool Live?

Four repos are relevant to Phase 1. Analysis of each:

| Repo | Language | Purpose | Existing Tools/Endpoints |
| --- | --- | --- | --- |
| consumer-agent | Python | LangChain/LangGraph conversational AI agent | 6 MCP tool wrappers + WebSearchTool (direct) |
| rover-agent | Go | HTTP orchestrator, mobile entry point | Routes to consumer-agent (Python path) or OpenAI direct (Go path) |
| rover-mcp | Go | MCP tool server (12 tools) | search_products, search_offers, fetch_webpage, etc. |
| consumer-context-service | Go | Unified context API (REST + MCP) | 14 MCP tools, 13 REST endpoints for product/offer enrichment |

Mobile Client
Rover-Agent (Go, port 8080)
  │  feature flag: "python_agent"
  ├── Python path ──→ Consumer-Agent (Python, port 8080)
  │                     │  calls tools via MCP
  │                     ├── rover-mcp (Go) ← product/offer/search tools
  │                     └── consumer-context-service (Go) ← enrichment tools
  └── Go direct path ──→ OpenAI API ──→ rover-mcp directly
                         (being deprecated)

Option A: rover-mcp

Add scout_answer as a Go MCP tool alongside search_products, fetch_webpage, etc.

| Pro | Con |
| --- | --- |
| Consistent with existing tool pattern | Requires Go implementation (team is Python-focused for agent work) |
| Available to both Go direct and Python paths | Go direct path is being deprecated, so this is not a real benefit |
| Clean separation: tools in rover-mcp, agent logic in consumer-agent | Forethought conversation_id management is harder in a stateless MCP call |
| | Cross-repo coordination for every change |

Option B: consumer-context-service

Add as a new REST endpoint + MCP tool.

| Pro | Con |
| --- | --- |
| Already has MCP transport on port 8081 | Wrong domain: CCS aggregates product/offer data, not external support APIs |
| consumer-agent already connects to consumer_mcp | Mixes concerns: enrichment service ≠ support service |
| | Same Go/cross-repo overhead as Option A |

Option C: consumer-agent (direct BaseTool)

Add ScoutAnswerTool as a direct Python BaseTool — not an MCP wrapper.

| Pro | Con |
| --- | --- |
| Exact precedent: WebSearchTool, a direct BaseTool calling BrightData API, not MCP | Only available on Python path (not Go direct) |
| Single-repo change, fastest to ship | |
| Python makes Forethought client code simpler | |
| Conversation state (episode_id → conversation_id) stays in consumer-agent | |
| Feature flag gating already works in factory.py | |
| No cross-service coordination needed | |

Decision: Option C — consumer-agent directly

ScoutAnswerTool should be a direct BaseTool in consumer-agent, following the WebSearchTool precedent.

The key insight is that WebSearchTool already established this exact pattern: a direct BaseTool in consumer-agent that calls an external API (BrightData SERP) without going through rover-mcp or consumer-context-service. ScoutAnswerTool calling Forethought Solve API is structurally identical.

The “only available on Python path” con is irrelevant — the Go direct path is being deprecated, and all new agent features target consumer-agent.

Implementation location: src/consumer_agent/tools/scout.py (new file, alongside existing tools/ modules)


Architecture: Inline classification (Option C) — the conversational agent classifies intent implicitly through its reasoning, with no separate classification step or added latency.

Implementation:

  1. Add ScoutAnswerTool — a direct BaseTool in consumer-agent wrapping Forethought Solve API (same pattern as WebSearchTool)
  2. Update system prompt — describe when to use scout_answer vs shopping tools; instruct parallel calling for mixed queries
  3. Populate intent_labels — infer from tool usage after the fact (called scout_answer → label as CUSTOMER_SUPPORT; called search_products → label as SHOPPING; etc.)
  4. Add safety gates — post-tool PII/policy scan on Forethought responses (log-only; trusted internal service)
  5. Feature-flag the rollout — gate scout_answer tool availability behind a feature flag for phased rollout
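
Step 3 (inferring labels from tool usage) can be sketched as a simple lookup. The mapping below is illustrative: `CUSTOMER_SUPPORT` and `SHOPPING` come from the step above, while the `OFFERS` entry and the function name `labels_from_tool_calls` are assumptions for the example, not the actual HistoryMiddleware code.

```python
# Tool name to intent label. The OFFERS entry is an assumption for illustration.
TOOL_TO_LABEL = {
    "scout_answer": "CUSTOMER_SUPPORT",
    "search_products": "SHOPPING",
    "search_offers": "OFFERS",
}


def labels_from_tool_calls(tool_names: list[str]) -> list[str]:
    """Derive ordered, de-duplicated intent labels from the tools called in a turn."""
    labels: list[str] = []
    for name in tool_names:
        label = TOOL_TO_LABEL.get(name)
        # Tools with no mapping (e.g. a feedback tool) contribute no label.
        if label and label not in labels:
            labels.append(label)
    return labels
```

A turn that calls no mapped tools yields an empty list, which leaves room for a default label to be chosen elsewhere.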

Repos and files changed (consumer-agent only):

| File | Change |
| --- | --- |
| src/consumer_agent/tools/scout.py | ScoutAnswerTool(BaseTool) + ForethoughtClient (async HTTP client for Solve API). |
| src/consumer_agent/tools/__init__.py | Export ScoutAnswerTool |
| src/consumer_agent/factory.py | Add scout_answer to tool assembly in create_agent_from_config(), gated by feature flag |
| settings.yaml | Add forethought config section (server_url, api_key secret ref, timeout) per environment |
| agent_config.yaml | Add scout_answer to agent tools list |
| prompts/capabilities.md | scout_answer tool documented |
| src/consumer_agent/history/middleware.py | Populate intent_labels based on which tools were called in the turn |
| tests/unit/test_scout.py | Unit tests for ScoutAnswerTool with mocked Forethought API |

Rover-agent change (minimal):

  • Pass scout_answer feature flag in feature_flags dict when calling consumer-agent (same mechanism as existing product_card and web_search flags)

What this gives us:

  • Zero added TTFB for shopping queries (the common case)
  • Multi-intent support via parallel tool calls (shopping + support in one turn)
  • Intent labels for analytics without a separate classification step
  • No new frameworks or dependencies
  • Single-repo implementation (consumer-agent), fastest path to production

Limitation discovered: The tool-based approach cannot stream Forethought responses because BaseTool._arun() returns a single str. The full Forethought response must be received before the agent can start generating its final answer. This motivated Phase 3.

Phase 2: Semantic FastPath + Observability (builds on Phase 3, scales to PLT-140)

Prerequisite: Phase 3 gateway graph. Phase 2 adds a semantic matcher node upstream of the gateway classifier, plus intent logging for observability.

Architecture: Semantic Matcher as a Gateway Node

Phase 3 established the gateway graph with intent-based routing:

User Query → Gateway Classifier → support_handler (Forethought)
                                → shopping_agent

Phase 2 inserts a Semantic Matcher node before the classifier:

User Query → Semantic Matcher (cache hit?) → return cached response instantly (<25ms)
                  ↓ (cache miss)
             Gateway Classifier → support_handler → Forethought (~12s)
                                → shopping_agent

The semantic matcher is a LangGraph node, not buried inside a tool. This is the key design decision that makes it scalable to PLT-140 — the node can later evolve from “cache lookup” to “full intent classification + routing” without restructuring the graph.

Uses semantic-router (embedding model + vector similarity) to match incoming queries against known patterns.

Phase 2 behavior (cache mode):

  1. Embed incoming query using a local model (all-MiniLM-L6-v2, ~15ms)
  2. Compare against cache of previously answered queries by cosine similarity
  3. If similarity > threshold (e.g., >0.92), serve cached response instantly via StreamWriter
  4. If below threshold, pass through to gateway classifier (existing Phase 3 flow)
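
Steps 2 to 4 amount to a nearest-neighbour lookup with a cutoff. The sketch below uses only the standard library and takes embeddings as plain lists, so the embedding model itself is out of scope; `CacheEntry` and `lookup` are hypothetical names, and a production version would more likely lean on semantic-router or a vector index.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    embedding: list[float]  # embedding of a previously answered query
    response: str           # cached Forethought response text


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(
    query_vec: list[float],
    cache: list[CacheEntry],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the best cached response above the threshold, else None (cache miss)."""
    best_score, best = 0.0, None
    for entry in cache:
        score = cosine(query_vec, entry.embedding)
        if score > best_score:
            best_score, best = score, entry
    if best is not None and best_score > threshold:
        return best.response
    return None
```

A `None` result corresponds to step 4: the query falls through to the gateway classifier unchanged.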

Cache population:

  • Passive: cache every successful Forethought response (query embedding → response text)
  • Active: pre-seed with known FAQ pairs from Forethought knowledge base
  • TTL: 24-48 hours, or invalidate when Forethought knowledge base updates

Benefits:

  • Latency: <25ms for cache hits vs ~12s Forethought API round-trip
  • Cost: zero per cache hit vs per-query Forethought API cost
  • Availability: cached responses work even if Forethought is down

Hook into the gateway graph’s updates mode to log intent classifications:

  • What to log: query text (redacted), classified intent, which node handled it, confidence score, response latency, cache hit/miss
  • Schema: generic (query, predicted_intent, handler_node, confidence, latency_ms, cache_hit) — same schema PLT-140 needs for its dashboard
  • Storage: append to an analytics table (DynamoDB or S3 parquet) for downstream reporting
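
As a rough illustration of the generic schema, a log record could be modelled as a frozen dataclass and serialized to one JSON line per turn. The class and function names are hypothetical; only the six field names come from the schema above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IntentLogRecord:
    query: str             # already redacted before it reaches this record
    predicted_intent: str
    handler_node: str
    confidence: float
    latency_ms: int
    cache_hit: bool


def to_json_line(record: IntentLogRecord) -> str:
    """Serialize one record as a single JSON line for append-only storage."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the record flat means the same rows can be written to DynamoDB or batched into Parquet without a schema translation step.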

This data serves dual purpose:

  1. Phase 2: monitor Forethought deflection rate, empty response rate, cache hit rate
  2. PLT-140: training data for the dedicated intent classifier, intent distribution dashboard

Using the logged data, track:

  • Cache hit rate: % of support queries served from cache (target: 30-50% of repeat FAQs)
  • Forethought empty response rate: queries that return HTTP 200 with no content
  • Misroute rate: queries that hit the wrong handler (requires manual sampling initially)
  • Deflection rate: % of support queries fully resolved without human escalation
  • Latency P50/P95: cache path vs Forethought path vs shopping path
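
Two of these metrics (cache hit rate and latency percentiles) can be computed directly from the logged rows. The sketch below assumes each row is a dict with `latency_ms` and `cache_hit` keys and uses a nearest-rank percentile; the function names are illustrative, and the input list is assumed to be non-empty.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list[dict]) -> dict:
    """Compute cache hit rate and P50/P95 latency over logged records."""
    latencies = [r["latency_ms"] for r in records]
    hits = sum(1 for r in records if r["cache_hit"])
    return {
        "cache_hit_rate": hits / len(records),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }
```

To compare the cache, Forethought, and shopping paths, the same function can be run once per handler after filtering the records.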

Add a confidence threshold to the semantic matcher:

  • High confidence (>0.92): serve cached response
  • Medium confidence (0.70-0.92): pass to Forethought (existing flow)
  • Low confidence (<0.70): flag for human review or route to live agent handoff
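
The three bands map onto a small routing function. The return values here are hypothetical labels, and treating a score of exactly 0.92 or 0.70 as "medium" is an assumption about how the boundaries would be handled.

```python
def route_by_confidence(score: float, high: float = 0.92, low: float = 0.70) -> str:
    """Map a similarity score to one of three handling bands."""
    if score > high:
        return "serve_cache"    # high confidence: serve cached response
    if score >= low:
        return "forethought"    # medium confidence: existing Forethought flow
    return "human_review"       # low confidence: flag or hand off
```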

The Phase 2 semantic matcher node is designed to evolve into PLT-140’s 3-layer hybrid router:

| Phase 2 | PLT-140 Evolution |
| --- | --- |
| Semantic matcher (cache mode) | Layer 2: Semantic Router. Same embedding infra, expanded from cache to full intent taxonomy (List Planning, Nearby Stores, Recipe-to-Cart, Support, etc.) |
| Gateway classifier (LLM) | Layer 3: LLM Fallback. Already exists, handles complex/ambiguous queries |
| Not yet needed | Layer 1: Regex. Add structured pattern matching for high-volume exact patterns (10-20% traffic) |
| Intent logging | PLT-140’s observability + training data pipeline |
| Classification metrics | PLT-140’s dashboard/reporting |
| Confidence-based escalation | PLT-140’s routing confidence thresholds |

The gateway graph becomes PLT-140’s orchestrator — new intent categories just add new edges and handler nodes (billing bot, onboarding bot, etc.) to the same graph structure.

Phase 3: Gateway + Forethought Streaming (in progress)

Why: Phase 1’s tool-based approach cannot stream Forethought responses — BaseTool._arun() returns a single str, so the entire Forethought response must arrive before the agent can start its final answer. Phase 3 solves this by building a custom LangGraph graph with a dedicated gateway classification node that routes support queries directly to Forethought streaming, bypassing the tool limitation.

Architecture: Custom LangGraph StateGraph with an intent classification gateway (gpt-4.1-mini structured output) that routes to either a Forethought streaming handler or the existing shopping agent.

START → gateway (classifier) → [conditional]
                                 → support_handler → END (pure support)
                                 → shopping_agent → END (pure shopping)
                                 → support_handler → shopping_agent → END (mixed)
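
The conditional routing in the diagram can be expressed as two small routing functions. This is a sketch under assumptions: the intent values `"support"`, `"shopping"`, and `"mixed"` and the function names are illustrative, and the string `"END"` is a placeholder that a real graph would map to LangGraph's `END` constant when wiring the functions in through `add_conditional_edges`.

```python
from typing import Literal, Optional, TypedDict


class GatewayState(TypedDict, total=False):
    # Subset of the gateway state relevant to routing (messages omitted).
    intent: Literal["support", "shopping", "mixed"]
    scout_query: Optional[str]
    shopping_query: Optional[str]


def route_from_gateway(state: GatewayState) -> str:
    """Pick the first node after classification."""
    # Support and mixed both start at the support handler.
    return "shopping_agent" if state["intent"] == "shopping" else "support_handler"


def route_from_support(state: GatewayState) -> str:
    """After support has streamed, mixed queries continue to the shopping agent."""
    return "shopping_agent" if state["intent"] == "mixed" else "END"
```

Splitting the decision into two functions keeps the mixed path explicit: the support handler always runs first, and only the second hop depends on whether a shopping query is also present.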

New module: src/consumer_agent/gateway/

| File | Purpose |
| --- | --- |
| gateway/state.py | GatewayState TypedDict (messages, intent, scout_query, shopping_query) |
| gateway/classifier.py | Gateway node: gpt-4.1-mini structured output for intent classification + query decomposition |
| gateway/support_handler.py | Forethought streaming node: uses StreamWriter to push TextEvent chunks directly to client |
| gateway/graph.py | StateGraph assembly + compilation with conditional routing edges |
| gateway/stream_adapter.py | Converts graph astream() output (messages/updates/custom modes) to StreamEvent iterator |

Other files changed:

| File | Change |
| --- | --- |
| tools/scout.py | Added ask_stream() method to ForethoughtClient: async SSE streaming from Forethought |
| factory.py | Added create_gateway_agent_from_config() alongside existing factory function |
| api/main.py | Feature-flag-gated gateway path: scout_answer flag → gateway graph instead of regular agent |
| history/middleware.py | Added intent_labels override param to wrap_stream() for support path |
| agent_config.yaml | Added gateway agent entry with configurable model (gpt-4.1-mini) |

Key design decisions:

  • Feature flag at API level — same scout_answer flag from Phase 1 gates the gateway path. Rover-agent PR #121 already passes this flag.
  • Decomposition-first classification — the gateway classifier decomposes the message into questions first, then derives intent from which query fields are populated (no separate classification step).
  • Support keyword anchoring — explicit keyword list (points, receipt, scan, account, etc.) reduces false negatives for support detection.
  • Synthetic events for support path: ResponseIdEvent (with ft_ prefix) and zero-token UsageEvent keep HistoryMiddleware’s storage flow working without an OpenAI call.
  • Shopping path unchanged — when flag is off, zero latency impact. When flag is on, shopping queries get +50-100ms gateway overhead (gpt-4.1-mini classification call).
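
Decomposition-first classification means the intent is a function of which query fields the classifier filled in. A minimal sketch, assuming the intent values `"support"`, `"shopping"`, and `"mixed"`, and assuming that an empty decomposition defaults to shopping (consistent with shopping being the catch-all bucket); `derive_intent` is a hypothetical name.

```python
from typing import Optional


def derive_intent(scout_query: Optional[str], shopping_query: Optional[str]) -> str:
    """Derive the routing intent from which decomposed query fields are populated."""
    has_support = bool(scout_query and scout_query.strip())
    has_shopping = bool(shopping_query and shopping_query.strip())
    if has_support and has_shopping:
        return "mixed"
    if has_support:
        return "support"
    # Default bucket, including the case where neither field is set.
    return "shopping"
```

Deriving the label this way avoids a second model output that could disagree with the decomposed queries.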

Prompt overlap analysis (GATEWAY_SYSTEM_PROMPT vs shopping agent):

The gateway classifier and shopping agent prompts serve different purposes and do not conflict:

| Topic | Gateway Prompt (classifier.py) | Shopping Agent (conversational.txt + capabilities.md) |
| --- | --- | --- |
| Support scope | Keyword list: points, receipt, scan, account, etc. Used for binary classification only. | Limitations section: receipt scanning, point redemption, account modification → redirect to app features. |
| Shopping scope | “Everything else”: one-line catch-all. | Detailed: product search, offers, budget, dietary, location-based, personalization, etc. |
| Image handling | Explicit section: receipt images → support, product images → shopping, ambiguous → default shopping. Generates text description for downstream. | capabilities.md lists “Image Analysis” as a core capability (analyze products, fridge contents, offer matching). |
| Behavioral rules | None; classifier only outputs structured JSON (intent + queries). | Extensive: tone, word cap, markdown formatting, tool discipline, safety, refusals, disclaimers. |

Key difference: The gateway prompt asks “which bucket?” (~50ms, gpt-4.1-mini, no tools). The shopping agent prompt asks “how to respond” (gpt-5-mini, full tool suite, detailed formatting rules). No overlap in behavioral instructions.

Notable behavior change: In the original (non-gateway) path, receipt images hit the shopping agent which redirects to the Scan tab. In the gateway path, receipt images are classified as “support” and routed to Forethought, which provides actual support answers. This is the intended improvement.

What this gives us over Phase 1:

  • Streaming support responses — Forethought tokens stream directly to the client as they arrive
  • Dedicated classification model — can be optimized independently from the main conversational agent
  • Cleaner separation — support handling is isolated in its own graph node, not embedded in the main agent’s tool-calling loop
  • Better mixed-intent handling — explicit decomposition into scout_query + shopping_query instead of relying on LLM tool-call parallelism

Not adopted for V1:
  • New frameworks (Instructor, AWS Agent Squad, Forethought MCP) — existing LangChain/LangGraph stack is sufficient
  • spaCy preprocessing pipeline — LLM handles normalization natively
  • Offline HDBSCAN clustering — useful for Phase 2 taxonomy discovery, not for V1
  • Regex guardrail layer — premature optimization without traffic data

Items discovered during V1 implementation that need follow-up:

  • Identify existing Forethought workflows — The Fetch Forethought dashboard reportedly has workflows configured. Need to audit what exists and whether any are relevant to the headless Solve API path vs. the widget path.
  • Determine which context_variables workflows expect: context_variables are workflow-specific, not general-purpose context. We currently pass {}. If workflows branch on variables like language, platform, or user_id, we should populate them. Requires coordination with the CX/Forethought admin team.
  • Sandbox vs. production API key — Production key obtained and configured (50c18d31-...). Stored in rover-agent-{{env}}/forethought-api-key.
  • Validate Forethought knowledge base coverage — Test whether the sandbox knowledge base has sufficient coverage for common Fetch support queries, or if content needs to be added/updated in the Forethought dashboard.
  • Sandbox workflow routing broken — After all sandbox workflows were activated (2026-03-16), intent routing became unreliable. Some queries return empty responses. Forethought confirmed: activated workflows without intent descriptions can degrade routing. Use production API key for reliable testing until sandbox is fixed.
  • No token-level streaming available — Despite the stream: true parameter, the Solve API returns the full response as a single NDJSON line (application/x-ndjson with widget_components), not incremental SSE tokens. Both stream: true and stream: false return identical content; the only difference is transport format (NDJSON vs plain JSON). Forethought confirmed SSE is not currently supported via API.

  • Token-level streaming may be coming — Forethought is rolling out token-level streaming for their widget (week of 2026-03-17). Pending confirmation on whether the API will also support it.

  • Average Solve API latency: ~12s — Measured on stage (2026-03-17) across successful responses:

    | Query | Time to First Data | Notes |
    | --- | --- | --- |
    | Support query 1 | 11.9s | Full response in one NDJSON line |
    | Support query 2 | 11.6s | Full response in one NDJSON line |
    | “Where are my points?” | 12.6s | Full response in one NDJSON line |
    | Failed query | 30s timeout | Empty body (0 lines) |

    This latency creates a noticeable gap in the mobile chat UX compared to the rest of the flow which streams token-by-token. Token-level streaming would significantly improve perceived responsiveness.

  • Evaluate LLM query formulation quality — The tool description instructs the LLM to write self-contained queries, but we have no data on how well it actually does this. After stage deployment, sample scout_answer tool calls to evaluate query quality and refine the description/input schema if needed.
  • Consider adding examples to the tool description — If query quality is poor, adding few-shot examples to the input field description or capabilities.md could help (e.g., bad: “what about it?”, good: “How do I recover my Fetch account if I lost access to my email?”).
  • Content moderation beyond PII — V1 only scans for SSN and credit card patterns. Forethought could return off-topic, incorrect, or policy-violating content. Evaluate whether a broader content moderation gate is needed after observing production responses.
  • Forethought response quality metrics — Track empty response rate, PII detection rate, and user satisfaction with scout_answer responses to inform Phase 2 decisions.
  • Query-response logging for evaluation — Consider logging (redacted) query/response pairs to build an evaluation dataset for tuning query formulation and measuring Forethought answer quality.

Sample queries for validating intent classification and Forethought response quality. In Phase 1, support queries route through scout_answer tool. In Phase 3 (gateway), support queries route through the support_handler node which streams Forethought responses directly.

| Query | Expected Intent | Expected Tool |
| --- | --- | --- |
| “Where are my points?” | SUPPORT | scout_answer |
| “I scanned a receipt but didn’t get points” | SUPPORT | scout_answer |
| “How long does it take for points to show up?” | SUPPORT | scout_answer |
| “Why did my points disappear?” | SUPPORT | scout_answer |

| Query | Expected Intent | Expected Tool |
| --- | --- | --- |
| “My receipt wasn’t accepted” | SUPPORT | scout_answer |
| “Can I scan an old receipt?” | SUPPORT | scout_answer |
| “The app says my receipt is a duplicate but I only scanned it once” | SUPPORT | scout_answer |

| Query | Expected Intent | Expected Tool |
| --- | --- | --- |
| “I can’t log into my account” | SUPPORT | scout_answer |
| “How do I change my email address?” | SUPPORT | scout_answer |
| “I have a problem with my account” | SUPPORT | scout_answer |
| “How do I delete my Fetch account?” | SUPPORT | scout_answer |

| Query | Expected Intent | Expected Tool |
| --- | --- | --- |
| “The app keeps crashing” | SUPPORT | scout_answer |
| “Why can’t I redeem my points?” | SUPPORT | scout_answer |
| “The scan button isn’t working” | SUPPORT | scout_answer |

| Query | Expected Intent | Expected Tool |
| --- | --- | --- |
| “How do referrals work?” | SUPPORT | scout_answer |
| “How do I contact support?” | SUPPORT | scout_answer |
| “What are Fetch Points worth?” | SUPPORT | scout_answer |
| “How does Fetch make money?” | SUPPORT | scout_answer |

| Query | Expected Intents | Expected Tools |
| --- | --- | --- |
| “Where are my points and show me coffee offers” | SUPPORT + OFFERS | scout_answer + search_offers (parallel) |
| “My receipt didn’t scan, also find me snack deals” | SUPPORT + OFFERS | scout_answer + search_offers (parallel) |

Negative Cases (should NOT trigger scout_answer)


| Query | Expected Intent | Expected Tool |
| --- | --- | --- |
| “Find me gluten-free snacks” | SHOPPING | search_products / web_search |
| “What coffee offers are available?” | OFFERS | search_offers |
| “Hi there!” | GENERAL | None |
| “Compare Huggies vs Pampers” | SHOPPING | web_search |

Appendix B: Multi-Intent Case Study (Stage, 2026-03-14, Phase 1)

Successful multi-intent detection and parallel tool execution on stage using the Phase 1 tool-based approach. This validates the priority-based Intent Detection Protocol added in V1.1. In Phase 3 (gateway), multi-intent is handled by the gateway classifier decomposing the message into scout_query + shopping_query, with the support handler streaming first, then the shopping agent.

User message: “My receipt didn’t scan, and find me coffee offers”

Detected intents: SUPPORT + OFFERS — both correctly identified via the keyword-anchored support check (“receipt”, “scan”) and offers detection (“offers”).

Key observation: All three tool calls were fired in parallel (same timestamp), not sequentially. The model decomposed the message into two distinct requests and addressed both.

Tool Call 1: scout_answer (SUPPORT intent)

Arguments:

{
  "query": "My receipt didn't scan when I tried to upload it in the app; I need help getting that purchase validated so I can earn Fetch Points. Please advise next steps and what information to provide."
}

Note: The LLM reformulated the user’s terse “my receipt didn’t scan” into a self-contained, context-rich query for Forethought — exactly as instructed by the tool description.

Response (from Forethought Solve API):

Sorry your receipt didn’t scan. Here’s what to do:

  1. Try snapping the receipt again, making sure it’s clear and all edges are visible.
  2. If you still don’t earn points, use the “Correct my receipt” option in the app.
  3. Provide clear images and details of your purchase.

Let us know if you need more help!

Related Articles:

  1. Troubleshooting eReceipts Not Scanning or Connecting
  2. I Didn’t Earn Points on an Item I Purchased
  3. How to Earn Points on Online Purchases

Tool Call 2: search_offers (OFFERS intent)

Arguments:

{
  "query": "coffee",
  "user_id": "/redacted/",
  "limit": 20
}

Response: Returned structured offer data including:

  • Multiround FFP — 100 points, category: BEVERAGES, with image and detail link
  • Additional carbonated soft drinks offers (via search_offers default limit applied)

The response included structured JSON content with offer cards for the mobile UI.

Tool Call 3: feedback

Arguments:

{
  "message": "{\"observed_intent\":\"support_and_offer_search\",\"tools_used\":[\"scout_answer\",\"search_offers\"],\"gaps_or_wishes\":\"Would like clearer error codes for failed receipt scans and ability to retry within chat\",\"success\":\"context_given\",\"unsupported_feature\":\"none\",\"assumed_defaults\":[\"search_offers default limit applied\"]}"
}

Note: The feedback tool correctly reported both scout_answer and search_offers as tools used, confirming multi-intent execution.

Final response to the user:

Try re-snapping the receipt (flat, all edges visible). If it still fails, use the app’s “Correct my receipt” option and attach clear images — support links sent. I also found coffee offers below.

The assistant synthesized both tool results into a single coherent response, addressing the support issue first and then pivoting to the offers.