Intent Classification Research for Consumer-Agent
Last Updated: 2026-03-17
Table of Contents
- What Exists in Consumer-Agent
- Confluence Doc Analysis
- Industry Research: Best Practices
- Recommended Architecture
- Latency Analysis
- Multi-Intent Handling
- Deep Dive: Why LLMs Drop Intents and How to Fix It
- Support Subagent vs Support Tool: Architecture Decision
- Open Source Frameworks
- Repo Analysis: Where Should the Forethought Tool Live?
- Recommended Approach for Consumer-Agent
- Open Questions / TODOs
- Appendix A: Stage Testing Queries
- Appendix B: Multi-Intent Case Study (Phase 1)
What Exists in Consumer-Agent
The codebase is a LangGraph + OpenAI Python agent (~12k LOC) serving as Fetch’s shopping assistant.
Key Findings
- No intent classification or routing exists today. All messages go through a single `conversational` agent via `/agent/stream`.
- `intent_labels: list[str]` field already exists in the message model (`history/models.py:144`) and is now populated by `HistoryMiddleware` based on which tools were called during the turn.
- Three agents are defined, but they are selected only by the `agent_id` param, not by dynamic routing: `conversational`, `prompt-suggestions`, `title-generation`.
- Tools: product search, offer search, purchase history, web search, webpage fetch, feedback.
- Streaming architecture with dual-mode (messages + updates) SSE.
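The tool-based population of `intent_labels` described above could look roughly like the sketch below. This is not the actual `HistoryMiddleware` code: the `TOOL_TO_INTENT` mapping, the `derive_intent_labels` helper, and the `GENERAL` default are assumptions for illustration, using tool names that appear elsewhere in this document.

```python
from typing import Iterable

# Hypothetical mapping from tool name to intent label (illustrative only).
TOOL_TO_INTENT = {
    "search_products": "SHOPPING",
    "web_search": "SHOPPING",
    "fetch_webpage": "SHOPPING",
    "search_offers": "OFFERS",
    "search_nearby_offers": "OFFERS",
    "get_user_purchase_history": "PERSONALIZATION",
    "scout_answer": "SUPPORT",
}


def derive_intent_labels(tools_called: Iterable[str]) -> list[str]:
    """Return de-duplicated intent labels in first-seen order."""
    labels: list[str] = []
    for tool in tools_called:
        label = TOOL_TO_INTENT.get(tool)
        if label is not None and label not in labels:
            labels.append(label)
    # Assumption: a turn with no recognized tool calls is labeled GENERAL.
    return labels or ["GENERAL"]
```

Labels derived this way only record what the agent actually did, so a dropped intent never shows up in `intent_labels`.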
Architecture Overview
```text
Mobile Client / Go Gateway
    ↓
FastAPI /agent/stream POST
    ↓
Request Validation (StreamRequest schema)
    ↓
Episode Management (get_or_create)
    ↓
History Retrieval (last N messages from DynamoDB)
    ↓
System Prompt Construction (4-layer composition)
    ↓
Agent.stream() - LangGraph execution
    ↓
Streaming Events (SSE format)
    ↓
HistoryMiddleware (async capture & storage)
    ↓
Client receives events
```
Key Files
| File | Purpose |
|---|---|
| `agent/agent.py` | Main Agent class, dual-mode streaming |
| `factory.py` | Model & agent creation, tool wiring |
| `utils/tools.py` | MCP tool wrappers (BaseTool + Pydantic schemas) |
| `prompts/manager.py` | 4-layer system prompt composition |
| `history/models.py` | Message/Episode models (has `intent_labels` field) |
| `api/main.py` | FastAPI endpoints, request handling |
| `config.py` | Dual config: `agent_config.yaml` + `settings.yaml` |
Extensibility Points for Intent Classification
- `intent_labels: list[str]` field on Message model: ready to populate
- `StreamRequest.enabled_components`: feature-flagged prompt components
- `StreamRequest.feature_flags`: runtime behavior flags
- LangGraph graph can be extended with conditional routing nodes
Confluence Doc Analysis
Source: Implementation of Intent Understanding and Routing
This doc lays out a thorough intent classification pipeline with several strong ideas we can build on:
- Preprocessing: surface cleaning → syntactic pruning (spaCy dep parsing) → semantic normalization (WordNet hypernyms, verb canonicalization) → entity extraction
- Discovery: SBERT embeddings → UMAP → HDBSCAN clustering → BERTopic labeling
- Taxonomy: L1 macro categories (Planning, Point Earning, Commerce, Support, Other) → L2 specific intents
- Runtime Router: 3-tier (Regex → Semantic Router → LLM fallback)
What We’re Building On
- The taxonomy structure (L1/L2 hierarchy) is a great design pattern; we adopted a similar intent categorization (SUPPORT, SHOPPING, OFFERS, etc.)
- The semantic router concept (embedding-based fast path) is a solid optimization we’d like to explore in a later phase
- The evaluation framework (golden dataset, precision/recall metrics) gives us a good model for measuring classification quality
What We’re Simplifying for V1
For the initial rollout, we’re taking a leaner approach, not because the original ideas are wrong, but because our current architecture (LLM-based agent with tool calling) lets us defer some complexity until we have real traffic data to justify it:
- spaCy preprocessing pipeline — since we already have an LLM in the loop, it handles normalization natively. We can revisit structured NLP preprocessing if we see quality gaps in practice.
- Offline HDBSCAN clustering — valuable for taxonomy discovery and we may use it to refine our intent categories later, but not needed for the runtime classification path right now.
- Regex guardrail layer — rather than building fast paths upfront, we’re starting with LLM-only routing and will add regex/embedding shortcuts based on observed traffic patterns and latency data.
- WordNet hypernym replacement — deferring this until we have evidence that query normalization improves classification accuracy for our use case.
Industry Research: Best Practices
Multi-Intent Classification Approaches
Pattern A: Multi-Label Classification with Structured Output
The most robust modern approach uses LLMs with structured output / function calling to return multiple labels:
```python
from typing import List, Literal

from pydantic import BaseModel, Field

INTENTS = Literal[
    "ORDER_STATUS", "REFUND_REQUEST", "CANCELLATION",
    "TECHNICAL_SUPPORT", "BILLING_INQUIRY", "ACCOUNT_MANAGEMENT",
    "PRODUCT_QUESTION", "COMPLAINT", "GENERAL_INQUIRY",
]


class IntentClassification(BaseModel):
    chain_of_thought: str = Field(description="Step-by-step reasoning")
    intents: List[INTENTS] = Field(description="All detected intents")
    primary_intent: INTENTS = Field(description="Most important intent")
    confidence: float = Field(ge=0.0, le=1.0)
```
Source: Instructor library
Pattern B: Hierarchical Intent Taxonomy
Two-stage classification: broad category first, then specific sub-intent:
```python
from typing import List, Literal, Optional

from pydantic import BaseModel


class TopLevelIntent(BaseModel):
    category: Literal["BILLING", "TECHNICAL", "ACCOUNT", "PRODUCT", "OTHER"]
    confidence: float


class SubIntent(BaseModel):
    # Not defined in the original snippet; minimal assumed shape.
    name: str
    confidence: float


class HierarchicalClassification(BaseModel):
    top_level: TopLevelIntent
    sub_intents: List[SubIntent]
    requires_clarification: bool
    clarification_question: Optional[str] = None
```
Source: Kapture CX, Microsoft Dynamics 365 IBR
LLM-Based vs Traditional NLU
| Aspect | Traditional NLU | LLM-Based | Semantic Routing |
|---|---|---|---|
| Latency | <10ms | 200-2000ms | <5ms |
| Cost per call | Near zero | $0.001-0.01 | Near zero |
| Zero-shot ability | None | Excellent | Moderate |
| Multi-intent | Requires special training | Native with structured output | Limited |
| Accuracy | 85-95% | 89-96% | 80-90% |
The Modern Hybrid Approach (Industry Consensus)
```text
User Message
    |
    v
[Semantic Router] -- fast, cheap, handles well-known intents
    |
    |-- HIGH confidence (>0.85) --> Route directly
    |-- MEDIUM confidence (0.6-0.85) --> Confirm with LLM
    |-- LOW confidence (<0.6) --> Full LLM classification
    |
    v
[LLM Classifier] -- slower, expensive, handles novel/ambiguous
```
Intent Routing Architectures
Architecture 1: Router Pattern (LangGraph Conditional Edges)
```python
from langgraph.graph import StateGraph

# GraphState, the node functions, and route_to_handler are defined elsewhere.
# route_to_handler must return one of the keys in the mapping below.
graph = StateGraph(GraphState)
graph.add_node("classifier", classify_intent)
graph.add_node("shopping", handle_shopping)
graph.add_node("support", handle_support)
graph.set_entry_point("classifier")
graph.add_conditional_edges("classifier", route_to_handler, {
    "shopping": "shopping",
    "support": "support",
})
```
Source: Clivern LangGraph Tutorial
Architecture 2: Supervisor / Subagent Pattern
A supervisor agent maintains conversation context and dynamically decides which subagents to call as tools. Key distinction: a supervisor maintains state across turns; a router is stateless single-dispatch.
Source: LangChain multi-agent docs, LangChain architecture blog
Architecture 3: Semantic Router + Agent Fallback
semantic-router handles the fast path with embedding similarity, falling back to LLM for complex cases.
Architecture 4: AWS Agent Squad
AWS Agent Squad provides pluggable classifiers (Bedrock, Anthropic, OpenAI) with conversation-history-aware routing.
Confidence Scoring & Fallback
```python
from dataclasses import dataclass


@dataclass
class ScoredIntent:  # assumed shape; not defined in the original snippet
    intent: str
    confidence: float


@dataclass
class ClassificationResult:  # assumed shape
    intents: list[ScoredIntent]


def route_with_confidence(result: ClassificationResult) -> str:
    top = max(result.intents, key=lambda x: x.confidence)
    if top.confidence > 0.85:
        return top.intent  # Route directly
    elif top.confidence > 0.6:
        sorted_intents = sorted(result.intents, key=lambda x: -x.confidence)
        if len(sorted_intents) > 1:
            gap = sorted_intents[0].confidence - sorted_intents[1].confidence
            if gap < 0.15:
                return "CLARIFY"  # Ambiguous
        return top.intent
    else:
        return "FALLBACK_TO_HUMAN"  # Low confidence
```
Recommended Architecture
```text
User Message
     │
     ▼
┌──────────────────────────┐
│  LLM Intent Classifier   │  (structured output / function calling)
│  Returns: List[Intent]   │  (multi-label, with confidence scores)
│  + primary_intent        │
│  + entities extracted    │
└──────────┬───────────────┘
           │
     ┌─────┴──────┐
     │   Router   │  (conditional edges in LangGraph)
     └─────┬──────┘
           │
    ┌──────┼──────┬──────────┐
    ▼      ▼      ▼          ▼
[Shopping] [Support] [General] [Clarify]
  Agent     Agent     Agent
```
Key Patterns to Adopt
| Pattern | How | Source |
|---|---|---|
| Multi-label classification | Pydantic model with List[Intent] via structured output | Instructor, OpenAI function calling |
| Confidence scoring | High (>0.85) → route directly; Medium (0.6-0.85) → verify; Low → clarify | Vellum, Langfuse |
| Hierarchical taxonomy | L1 broad category → L2 specific intent | Microsoft Dynamics 365 IBR |
| Supervisor pattern | Main agent classifies + delegates to subagents as tools | LangChain subagents, AWS Agent Squad |
Latency Analysis
Constraint: The total latency for each conversation response from consumer-agent is already tight. Intent classification must not meaningfully increase TTFB. Analysis below covers three latency targets.
Approach Latency Characteristics
LLM Structured Output (~300-600ms)
- gpt-5-mini with `reasoning_effort: minimal`, small prompt, ~10 tokens out → 300-600ms typical
- Already paying OpenAI network round-trip from ECS → OpenAI (~50-100ms)
- Structured output schema constrains generation, keeping token count small
- At the mercy of OpenAI API variability: P95 could spike to 800ms+
- Verdict: Achievable on average within 500ms, but P95 will likely exceed it
Semantic Router / Local Embeddings (~10-25ms)
- Load a small model like `all-MiniLM-L6-v2` (22MB ONNX) at startup
- Inference: 5-20ms on CPU
- Vector similarity against ~50-200 anchor utterances: `<1ms`
- Total: 10-25ms, comfortably under 100ms
- Tradeoff: Less accurate for novel/ambiguous queries. Single-label only (no multi-intent decomposition). Needs LLM fallback for the low-confidence tail.
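The anchor-utterance lookup above can be sketched as follows. To stay dependency-free, `embed` is a toy bag-of-words stand-in for a real sentence-embedding model such as `all-MiniLM-L6-v2`; the `ANCHORS` table, the `route` helper, and the `LLM_FALLBACK` sentinel are all hypothetical names for illustration.

```python
import math
from collections import Counter

# Hypothetical anchor utterances per intent (illustrative only).
ANCHORS = {
    "SUPPORT": ["my receipt did not scan", "where are my points"],
    "SHOPPING": ["find me snack deals", "show me coffee offers"],
}


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(message: str, threshold: float = 0.85) -> tuple[str, float]:
    """Return (intent, score); intent is "LLM_FALLBACK" below the threshold."""
    query = embed(message)
    best_intent, best_score = "LLM_FALLBACK", 0.0
    for intent, utterances in ANCHORS.items():
        for utterance in utterances:
            score = cosine(query, embed(utterance))
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < threshold:
        return "LLM_FALLBACK", best_score
    return best_intent, best_score
```

Because only the single best anchor is returned, this path is single-label by construction, which matches the tradeoff noted above.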
Remote Embeddings via OpenAI (~50-150ms)
- OpenAI `text-embedding-3-small` API call: 50-150ms (network bound)
- Vector similarity: `<1ms`
- Total: 50-150ms, borderline for 100ms target
Regex / Keyword (~1-5ms)
- Regex patterns: `<1ms`. Only handles exact patterns like “track my order #XXX”
- Keyword matching + heuristics: 1-5ms. Brittle, high maintenance.
- Verdict: Achievable at 10ms but only for a subset of queries. Not viable as sole classifier.
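A regex fast path of the kind described above might look like this minimal sketch. The patterns, intent names, and the `regex_fast_path` helper are hypothetical; real patterns would be derived from observed traffic.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only.
FAST_PATHS = [
    (re.compile(r"\btrack (?:my )?order #?(?P<order_id>\d+)\b", re.I), "ORDER_STATUS"),
    (re.compile(r"\breset (?:my )?password\b", re.I), "ACCOUNT"),
]


def regex_fast_path(message: str) -> Optional[tuple[str, dict]]:
    """Return (intent, extracted entities) on a pattern hit, else None."""
    for pattern, intent in FAST_PATHS:
        match = pattern.search(message)
        if match:
            return intent, match.groupdict()
    return None
```

A `None` result means the message falls through to the next tier, which is why this layer cannot serve as the sole classifier.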
Feasibility by Target
| Target | Approach | Feasibility | Multi-Intent | Accuracy |
|---|---|---|---|---|
| 500ms | LLM structured output (sequential) | Average yes, P95 risky | Yes | ~95% |
| 100ms | Semantic router (local embeddings) | Yes, comfortably | No (single label) | ~85-90% |
| 100ms | Semantic router + LLM fallback (tiered) | P50 ~15ms, P95 ~400ms | Partial | ~90% |
| 10ms | Regex/keyword only | Yes, but limited coverage | No | ~60-70% |
Architectural Options to Minimize TTFB Impact
If classification runs sequentially before the main agent call, every millisecond it takes is added directly to TTFB. Three architectural patterns avoid or reduce this cost:
Option A: Parallel Classify + Stream (best for ~0ms added TTFB)
```text
User message arrives
    │
    ├──→ [Intent Classifier] (LLM, ~400ms)   ← runs in parallel
    │
    └──→ [Main Agent starts streaming]       ← TTFB stays the same
              │
              ▼
         ThinkingEvent emitted immediately
              │
         (classifier result arrives)
              │
         Route to correct subagent mid-stream
```
Start the main conversational agent immediately (preserving current TTFB), and run classification in parallel. If the result comes back as CUSTOMER_SUPPORT, redirect to the CS subagent. If it’s SHOPPING (the common case), the main agent is already running, so there is zero added latency.
- Added latency for common case (shopping): ~0ms
- Added latency for redirect case (support): ~400ms (but only when routing changes)
- Downside: Wasted LLM tokens if redirect needed. But if 80%+ of queries are shopping, this is efficient.
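The parallel pattern can be sketched with `asyncio`. The three coroutines below are stand-ins with artificial sleeps, not real LLM or Forethought calls, and all function names are hypothetical; the point is only the control flow of speculative start plus cancel-on-redirect.

```python
import asyncio


async def classify_intent(message: str) -> str:
    """Stand-in for the LLM classifier call (hypothetical)."""
    await asyncio.sleep(0.02)
    return "CUSTOMER_SUPPORT" if "receipt" in message.lower() else "SHOPPING"


async def run_main_agent(message: str) -> str:
    """Stand-in for the main agent's response (hypothetical)."""
    await asyncio.sleep(0.05)
    return f"shopping answer for: {message}"


async def run_support(message: str) -> str:
    """Stand-in for the support path (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"support answer for: {message}"


async def handle(message: str) -> str:
    # Start the main agent right away so the common case adds no latency.
    agent_task = asyncio.create_task(run_main_agent(message))
    intent = await classify_intent(message)
    if intent == "CUSTOMER_SUPPORT":
        # Redirect: cancel the speculative agent run and use the support path.
        agent_task.cancel()
        try:
            await agent_task
        except asyncio.CancelledError:
            pass
        return await run_support(message)
    return await agent_task
```

In a streaming setup, any tokens already emitted by the speculative run would also need to be discarded or superseded on redirect; that part is outside this sketch.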
Option B: Tiered — Semantic Router + LLM Fallback (best for ~15ms P50)
```text
User message arrives
    │
    ▼
[Semantic Router] (local embeddings, ~15ms)
    │
    ├── confidence > 0.85 → route directly (covers ~60-70% of traffic)
    │
    └── confidence < 0.85 → [LLM fallback, ~400ms] (but only for ~30% of queries)
```
- Average added latency: `0.7 × 15ms + 0.3 × 400ms ≈ 130ms`
- P50 added latency: ~15ms (most users)
- Downside: Requires curating anchor utterances per intent. Single-label classification on the fast path.
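The averaging above, and the tiered decision itself, can be written out as a small sketch. Both helper names are hypothetical, and `fast_router` / `llm_classifier` are injected callables rather than real library APIs.

```python
def expected_added_latency_ms(
    fast_share: float, fast_ms: float, fallback_ms: float
) -> float:
    """Mean added latency when `fast_share` of traffic stays on the fast path."""
    return fast_share * fast_ms + (1.0 - fast_share) * fallback_ms


def tiered_classify(message, fast_router, llm_classifier, threshold=0.85):
    """Use the fast router when it is confident; otherwise defer to the LLM."""
    intent, confidence = fast_router(message)
    if confidence > threshold:
        return intent, "fast_path"
    return llm_classifier(message), "llm_fallback"
```

With the figures above, `expected_added_latency_ms(0.7, 15, 400)` gives about 130.5ms. This mirrors the document's formula; since fallback queries also pay the router's ~15ms before reaching the LLM, a stricter estimate would be 15 + 0.3 × 400 = 135ms.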
Option C: Inline Classification (zero added latency)
Don’t add a separate classification step. Modify the system prompt to make the main agent classify-and-route as its first action:
```text
Before responding, classify the user's intent by calling the `classify_intent` tool.
If the intent is CUSTOMER_SUPPORT, call the `handle_support` tool.
Otherwise, respond normally.
```
The classification happens inside the existing LLM call. No extra round-trip. The model is already “thinking”; you’re just adding a structured tool call to its first step.
- Added latency: 0ms (classification is part of the existing call)
- Downside: Ties classification to the main model (gpt-5-mini). Can’t optimize the classifier independently. Classification accuracy depends on prompt engineering within the main agent context.
Summary Table
| Option | Added TTFB (common case) | Added TTFB (redirect case) | Multi-Intent | Accuracy | Complexity |
|---|---|---|---|---|---|
| A: Parallel | ~0ms | ~400ms (redirect) | Yes | ~95% | Medium |
| B: Tiered | ~15ms | ~400ms (fallback) | No (fast path) | ~85-95% | Medium-High |
| C: Inline | 0ms | 0ms | Yes | ~90-95% | Low |
| Sequential LLM | ~400ms | ~400ms | Yes | ~95% | Low |
Recommendation: Option A (parallel) or Option C (inline) — both avoid adding to TTFB. Option B is the best middle ground if a dedicated classifier with independent optimization is needed.
Multi-Intent Handling
For messages like “Cancel my subscription, refund last month, and transfer my data”:
- Classify all intents: `[CANCELLATION, REFUND_REQUEST, DATA_EXPORT]`
- Identify dependencies: Refund depends on cancellation
- Order execution: Cancel first → refund → export (or parallel where safe)
- Synthesize response: Combine all agent outputs
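The dependency and ordering steps above amount to a topological sort. The sketch below uses the standard-library `graphlib`; the `DEPENDS_ON` edges and the `execution_batches` helper are illustrative assumptions, including the choice to treat the data export as independent of the cancellation.

```python
from graphlib import TopologicalSorter

# Each intent maps to the intents that must complete before it (assumed edges).
DEPENDS_ON = {
    "REFUND_REQUEST": {"CANCELLATION"},
    "DATA_EXPORT": set(),
    "CANCELLATION": set(),
}


def execution_batches(intents: list[str]) -> list[list[str]]:
    """Group intents into batches; intents within a batch can run in parallel."""
    present = set(intents)
    # Only keep dependencies on intents that were actually detected.
    graph = {i: DEPENDS_ON.get(i, set()) & present for i in intents}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For the example message this yields two batches: cancellation and export first, then the refund once the cancellation has completed.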
```python
from typing import List, Literal

from pydantic import BaseModel


# Intent is defined first so IntentClassification can reference it.
class Intent(BaseModel):
    type: Literal["SHOPPING", "CUSTOMER_SUPPORT", "ACCOUNT", "GENERAL"]
    sub_type: str  # e.g., "refund_request", "order_status"
    confidence: float
    extracted_entities: dict


class IntentClassification(BaseModel):
    chain_of_thought: str
    intents: List[Intent]
    primary_intent: Intent
    confidence: float
    requires_clarification: bool
```
Handling Overlapping Intents
- Define intent implications (cancel often implies refund)
- Define intent conflicts (can’t upgrade and cancel simultaneously)
- Use confidence gap analysis to detect ambiguity
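One way to encode the three rules above is as small lookup tables plus a gap check. The `IMPLIES` and `CONFLICTS` contents, the `UPGRADE` label, the 0.15 gap, and the `review_intents` helper are all assumptions made for illustration.

```python
# Illustrative rules; the real implication and conflict sets are assumptions.
IMPLIES = {"CANCELLATION": {"REFUND_REQUEST"}}
CONFLICTS = {frozenset({"UPGRADE", "CANCELLATION"})}


def review_intents(scores: dict[str, float], gap: float = 0.15) -> dict:
    """Flag conflicts, implied follow-ups, and ambiguity for detected intents."""
    detected = set(scores)
    # A conflict fires only when both members of a pair were detected.
    conflicts = [sorted(pair) for pair in CONFLICTS if pair <= detected]
    # Implied intents that were not detected are surfaced as suggestions.
    suggested = sorted(
        implied
        for intent in detected
        for implied in IMPLIES.get(intent, set())
        if implied not in detected
    )
    ranked = sorted(scores.values(), reverse=True)
    ambiguous = len(ranked) > 1 and (ranked[0] - ranked[1]) < gap
    return {"conflicts": conflicts, "suggested": suggested, "ambiguous": ambiguous}
```

A conflict or an ambiguous result would typically lead to a clarification question rather than tool execution.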
Deep Dive: Why LLMs Drop Intents and How to Fix It
Context: During V1 stage testing, we observed that the LLM reliably calls scout_answer for single-intent support queries (“where are my points?”) but drops the support intent in mixed-intent messages (“my receipt didn’t scan and find me snack deals” → only called search_offers). This section documents the root causes and evidence-based mitigation techniques.
Root Causes
Research identifies several reasons LLMs fail to act on all intents in compound messages:
1. Actionability bias. When one intent maps cleanly to a tool schema (e.g., “find me snack deals” → search_offers) and another is semantically vaguer (“my receipt didn’t scan” → which tool?), the model gravitates toward the cleaner mapping. Tool description quality is a primary driver — intents with better-described tools win. (Voiceflow)
2. Position bias (primacy/recency). LLMs exhibit serial position effects. GPT-4 variants show primacy bias (acting on the first-mentioned intent). Other models show recency bias. The effect is model-dependent and task-dependent — there is no universal winner. (arXiv:2406.15981)
3. Satisficing without decomposition. Without an explicit instruction to decompose the message into sub-requests, the model defaults to a satisficing strategy: resolve the first clear intent and consider the turn complete. OpenAI’s own guidance states: “Decompose the user’s query into all required sub-requests, and confirm that each is completed. Do not stop after completing only part of the request.” (GPT-5 Prompting Guide)
4. Low reasoning effort amplifies the problem. Our agent uses reasoning_effort: low for cost/latency. OpenAI’s GPT-5.2 guide warns that “disambiguating tool instructions to the maximum extent possible” is “particularly critical at minimal reasoning.” With low effort, the model takes shortcuts. (GPT-5.2 Prompting Guide)
5. Intent mismatch drift. A 2026 paper found that “the Assistant’s interpretation progressively drifts away from the user’s true intent” in multi-turn conversations — “not a failure of model capability but rather a breakdown in interaction.” (arXiv:2602.07338)
Evidence-Based Mitigation Techniques
Technique 1: Explicit Decomposition Directive
The single most effective technique. Force the model to enumerate all intents before acting.
From OpenAI’s GPT-5 prompting guide:
“Decompose the user’s query into all required sub-requests, and confirm that each is completed.”
From GPT-5.4 guidance: define a <completeness_contract> block in the system prompt that requires verifying that all sub-requests are fulfilled before yielding.
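Evidence: OpenAI reports that three “agentic” system-prompt reminders (persistence, tool-calling, planning) together raised its internal SWE-bench Verified score by close to 20%; the completeness directive above is closest in spirit to the persistence reminder. (GPT-4.1 Prompting Guide)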
Technique 2: Keyword Anchoring (Support-First Check)
Instead of relying on the LLM to semantically map “my receipt didn’t scan” → “receipt issue” → SUPPORT, provide a concrete keyword list. If ANY keyword appears, scout_answer is mandatory.
```text
Support trigger words: points, receipt, scan, account, referral, redeem, reward,
support, help, bug, error, missing, problem, issue, eligible, eligibility,
contact, refund, subscription, password, login, verify, verification
```
Why this works: Keyword lookup is simple pattern matching that the LLM can perform reliably even at low reasoning effort (although, run inside the prompt, it is not strictly deterministic). “scan” is in the list, so “my receipt didn’t scan” triggers it regardless of phrasing. This avoids the fragility of example-based matching.
Tradeoff: Keywords can’t handle truly novel phrasings with no keyword overlap (e.g., “I got ripped off”), but they cover the vast majority of support queries. The existing semantic signals in the tool description serve as a catch-all for edge cases.
Sources: Kore.ai Multi-Intent Detection, Label Your Data: Intent Classification 2025
Technique 3: Priority Ordering (Support as Mandatory First Step)
Rather than a flat classification table where all intents are equal, make support detection a mandatory first step that runs before other intent identification. Support has a higher cost of missing (user frustration, failed deflection) than shopping (just a missed tool call).
This is the “priority routing” pattern used in production systems like Kore.ai and LivePerson.
Technique 4: Negative Examples (Anti-Patterns)
Show the exact failure mode in the prompt. Research confirms negative examples are “surprisingly effective” for LLM instruction following, especially for tool-calling behavior. From GPT-4.1 guidance: “If tools are complex, create a dedicated # Examples section in your system prompt” that includes cases where the model should call multiple tools.
```text
❌ WRONG: "my receipt didn't scan and find me snack deals" → only calls search_offers
✅ RIGHT: "my receipt didn't scan and find me snack deals" → calls scout_answer AND search_offers
```
Source: GPT-4.1 Prompting Guide
Technique 5: Parallel Tool Call Instructions
OpenAI models support parallel function calling by default (parallel_tool_calls: true), but the model needs to be explicitly told when to parallelize. From GPT-5.1/5.2 guides: “Parallelize tool calls whenever possible” and “Parallelize independent reads when possible to reduce latency.”
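Key limitation: There is no API-level way to say “must call tool X AND also allow other tools.” tool_choice can force one specific function (or, with `required`, at least one call to any tool), but it cannot require a given tool while leaving the others optional. Parallelism must be prompted, not forced.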
Source: OpenAI Function Calling Guide, GPT-5.1 Prompting Guide
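Since the required-plus-optional combination cannot be forced through the API, one possible workaround (an assumption, not something the cited guides prescribe) is to check the model's proposed tool calls in code and retry once with a corrective hint. The simplified `{"name": ...}` dict shape and both helper names are hypothetical; real API responses nest the function name differently.

```python
def missing_required_tools(tool_calls: list[dict], required: set[str]) -> set[str]:
    """Return required tool names absent from the model's proposed tool calls."""
    called = {call["name"] for call in tool_calls}
    return required - called


def retry_hint(missing: set[str]) -> str:
    """Build a corrective message for a single retry (wording is illustrative)."""
    names = ", ".join(sorted(missing))
    return f"You must also call: {names}. Keep your other tool calls."
```

The `required` set would come from a code-level check such as a keyword match on the user message, so the retry only fires when a support intent was detected but `scout_answer` was not called.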
Technique 6: Two-Stage Prompting (Count-Then-Classify)
A peer-reviewed technique from Neurocomputing 2024: first predict the number of intents, then identify each one. This avoids the threshold-setting problem of single-pass multi-label classification and forces the model to acknowledge multiple intents exist.
Source: Two Stages Prompting for Few-Shot Multi-Intent Detection (Neurocomputing 2024)
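A rough sketch of the count-then-classify flow follows. It is not the paper's exact method: the prompt wording is invented for illustration, and `llm` is any callable that takes a prompt string and returns the model's text reply.

```python
from typing import Callable


def two_stage_detect(message: str, llm: Callable[[str], str]) -> list[str]:
    """Ask for the intent count first, then ask for exactly that many intents."""
    count_reply = llm(
        "How many distinct requests does this message contain? "
        f"Answer with a number only.\nMessage: {message}"
    )
    # Assumes the reply is a bare integer; production code would validate it.
    count = max(1, int(count_reply.strip()))
    intents_reply = llm(
        f"The message contains exactly {count} request(s). "
        f"List one intent label per line.\nMessage: {message}"
    )
    labels = [line.strip() for line in intents_reply.splitlines() if line.strip()]
    return labels[:count]
```

The second prompt states the count explicitly, which is what pushes the model to name every intent instead of stopping at the first.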
Prompt-Only vs. Code-Level Pre-Classification
| Dimension | Prompt-Only (current) | Code-Level Pre-Classification |
|---|---|---|
| Setup cost | Low — zero-shot, no training data | Higher — requires labeled data or embeddings |
| Latency | 0ms added (inline) | 10-400ms added (sequential) |
| Determinism | Low — stochastic, model-dependent | High — same input → same output |
| Multi-intent reliability | Moderate with decomposition prompts | High with explicit multi-label classification |
| Debuggability | Low — prompt changes have unpredictable effects | High — traceable, auditable |
| Flexibility | High — handles novel intents zero-shot | Lower — limited to predefined intent set |
When to escalate to code-level: If prompt-based techniques (Techniques 1-5 above) still fail to reliably detect support intents after testing, the next step is a lightweight pre-classifier — either keyword-based (in code, not prompt) or embedding-based (semantic-router). This runs before the LLM call and injects a hint like “The user’s message contains a SUPPORT intent — you MUST call scout_answer” into the system prompt.
Sources: Hybrid LLM + Intent Classification (Medium), Vellum: Intent Detection for Chatbots
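A keyword-based pre-classifier with hint injection, as described above, could be sketched like this. The keyword set is the trigger list from this document; `detect_support`, `build_system_prompt`, and the exact-token matching strategy are assumptions.

```python
import re

# Support trigger words taken from the keyword list in this document.
SUPPORT_KEYWORDS = {
    "points", "receipt", "scan", "account", "referral", "redeem", "reward",
    "support", "help", "bug", "error", "missing", "problem", "issue",
    "eligible", "eligibility", "contact", "refund", "subscription",
    "password", "login", "verify", "verification",
}

SUPPORT_HINT = (
    "The user's message contains a SUPPORT intent: you MUST call scout_answer"
)


def detect_support(message: str) -> list[str]:
    """Return the support keywords found in the message, sorted."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return sorted(tokens & SUPPORT_KEYWORDS)


def build_system_prompt(base_prompt: str, message: str) -> str:
    """Append the support hint when any trigger word is present."""
    if detect_support(message):
        return f"{base_prompt}\n\n{SUPPORT_HINT}"
    return base_prompt
```

Exact-token matching misses inflected forms such as “scanned” or “receipts”; stemming or prefix matching would widen coverage at the cost of more false positives.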
V1.1 Recommended Prompt Structure
Based on the research above, the intent classification section of the system prompt should use a priority-based detection protocol combining Techniques 1-5:
```text
## Intent Detection Protocol (MANDATORY - follow before every response)

Before calling any tools, follow these steps exactly:

### Step 1: Decompose the message
Identify ALL distinct requests or questions in the user's message. Count them.
A single message can contain multiple intents. Do not proceed until you have
identified every intent.

### Step 2: Support check (ALWAYS do this first)
Scan the user's message for ANY of these support-related words:
points, receipt, scan, account, referral, redeem, reward, support, help, bug,
error, missing, problem, issue, eligible, eligibility, contact, refund,
subscription, password, login, verify, verification

If ANY support-related word is found → you MUST call `scout_answer`.
This is non-negotiable, even if other intents are also present.

### Step 3: Identify all other intents
After the support check, also identify:
- SHOPPING → `search_products`, `web_search`, `fetch_webpage`
- OFFERS → `search_offers`, `search_nearby_offers`
- PERSONALIZATION → `get_user_purchase_history`
- GENERAL (greeting, small talk, clarification) → No tool needed

### Step 4: Call ALL identified tools in parallel
Parallelize independent tool calls. Never drop an intent. Do not consider
your turn complete until every identified intent has been addressed.

### Common mistake to avoid
❌ WRONG: "my receipt didn't scan and find me snack deals" → only calls search_offers
✅ RIGHT: "my receipt didn't scan and find me snack deals" → calls scout_answer AND search_offers
```
Key design decisions:
- Decomposition step (Step 1) — forces the model to count intents before acting, preventing satisficing (from OpenAI GPT-5 guidance and Neurocomputing 2024 two-stage prompting research)
- Keyword list instead of semantic signals — covers many phrasings without enumerating examples
- Ordered steps instead of flat table — support check runs before other classification
- Completeness contract (Step 4) — “Do not consider your turn complete” prevents the model from stopping after one intent
- Negative example — shows the exact failure mode observed in testing
- “MANDATORY” framing — stronger directive than “Before responding, classify…”
- Validated on stage — see Appendix B for a successful multi-intent case study
LLMCompiler: Future Consideration
For architecturally solving parallel function calling, the LLMCompiler framework (ICML 2024) decomposes problems into a DAG of tasks with inter-dependencies, then dispatches them in parallel. Reported results: up to 3.7x latency speedup, up to 6.7x cost savings, and up to ~9% accuracy improvement compared with ReAct. Worth evaluating if prompt-based parallelism proves insufficient.
Source: LLMCompiler: Parallel Function Calling (ICML 2024, arXiv:2312.04511)
Support Subagent vs Support Tool: Architecture Decision
Context: The PM PRD (Scout/Forethought Support Handoff) specifies that support will be handled by Forethought Solve API via an internal wrapper tool. This section discusses the two possible architectures and why the tool-based approach is the right one.
The Two Architectures
Scenario 1: Support as a Full Subagent
A separate agent with its own system prompt, tools, model, and multi-turn reasoning loop.
```text
User: "I need a refund for my last purchase and find me coffee deals"
    │
    ▼
[Conversational Agent] (reasoning)
    │
    ├── Detects support intent → calls support subagent
    │       │
    │       ▼
    │   [Support Subagent] (separate LLM, own prompt, own tools)
    │       ├── Calls Forethought API
    │       ├── Reads account data
    │       ├── Reasons about refund policy
    │       ├── Generates support response
    │       └── Returns result to main agent
    │
    ├── Detects shopping intent → calls search_products("coffee")
    │
    ▼
[Conversational Agent] synthesizes both results into one response
```
When this makes sense:
- Support requires multi-turn reasoning — the subagent needs to ask clarifying questions, look up multiple systems, apply complex business logic
- Support needs a different model — e.g., a fine-tuned model for policy compliance, or a cheaper model for simple FAQ lookups
- Support has its own tool ecosystem — ticketing system, account management APIs, refund processing, escalation workflows that shouldn’t be exposed to the shopping agent
- Support needs independent optimization — separate prompt engineering, evaluation, A/B testing, and iteration cycle from the shopping agent
- Isolation — support failures shouldn’t crash or pollute the shopping agent’s context
Costs:
- Minimum 2 sequential LLM calls (main agent + subagent) → +300-600ms latency
- More complex orchestration code
- Harder to maintain conversational coherence across agent boundaries
- Token duplication (both agents see the conversation history)
Scenario 2: Support as a Tool
A single function call that wraps an external API (Forethought Solve) and returns a response string. The conversational agent calls it like any other tool.
```text
User: "I need a refund for my last purchase and find me coffee deals"
    │
    ▼
[Conversational Agent] (reasoning)
    │
    ├── tool call: scout_answer(query="refund for last purchase",     ← parallel
    │                           conversation_id="xxx")
    ├── tool call: search_products(descriptions=["coffee"])           ← parallel
    │
    │   (both results return)
    │
    ▼
[Conversational Agent] synthesizes both into one streamed response
```
When this makes sense:
- Support is primarily answer retrieval: an external system (Forethought) does the heavy lifting; we just need to call it and relay the response
- No multi-turn reasoning needed within the support domain itself (Forethought handles its own conversation state via `conversation_id`)
- The tool contract is simple: question in → answer out
- Support responses are treated as tool output — the main agent weaves them into its response alongside other tool results
Costs:
- Less isolation — support is just another tool in the main agent’s context
- Main agent’s system prompt gets more complex (must know when/how to use the support tool)
- Can’t independently optimize the support “agent” — it’s just a function call
- Harder to add complex multi-step support flows later (forms, buttons, escalation chains)
Decision: Tool-Based Approach for V1
The PM PRD recommends an internal wrapper tool calling Solve API, but independently of that guidance, the tool-based approach is the right choice for V1 for the following reasons:
1. Forethought IS the agent.
The “support reasoning” doesn’t happen in our system — Forethought Solve is the brain. It maintains its own conversation state, applies its own knowledge base, and generates answers. Our scout_answer tool is just a pass-through. There’s no reason to wrap a pass-through in a full agent loop.
2. Fits the existing architecture perfectly.
Consumer-agent is tool-oriented. Adding scout_answer is identical to how search_products, search_offers, and fetch_webpage work today:
- Define a Pydantic `ScoutAnswerInput` schema
- Create a `ScoutAnswerTool(MCPTool)` or direct REST wrapper
- Add it to the tool list in `factory.py`
- Update system prompt to describe when to use it
3. Parallel tool calling enables multi-intent.
OpenAI supports parallel tool calls. The agent can call scout_answer + search_products simultaneously for mixed queries. A subagent approach would require sequential orchestration.
4. Latency is minimized. Tool call adds ~0ms to TTFB for shopping queries (tool is only called for support). For support queries, latency = Forethought API response time (which we’d pay regardless of architecture).
5. V1 is text-only. V1 scope is text responses only. No multi-step forms, no complex UI interactions. A tool returning a text string is sufficient. V1.1 (buttons/forms) may warrant revisiting this decision.
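To make reason 3 concrete, here is a minimal sketch of the concurrency pattern that parallel tool calling relies on: two independent coroutines awaited together with `asyncio.gather`, so wall time is roughly that of the slower call rather than the sum. Both tool functions are hypothetical stand-ins (the sleeps simulate network latency), not the real `scout_answer` or `search_products` implementations.

```python
import asyncio
import time


async def fake_scout_answer(query: str) -> str:
    # Hypothetical stand-in for the Forethought call; sleep simulates latency.
    await asyncio.sleep(0.2)
    return f"support answer for: {query}"


async def fake_search_products(descriptions: list) -> list:
    # Hypothetical stand-in for the product search tool.
    await asyncio.sleep(0.1)
    return [f"product matching {d}" for d in descriptions]


async def run_mixed_query():
    start = time.perf_counter()
    # Both calls start together; gather returns results in call order.
    support, products = await asyncio.gather(
        fake_scout_answer("refund for last purchase"),
        fake_search_products(["coffee"]),
    )
    return support, products, time.perf_counter() - start


support, products, elapsed = asyncio.run(run_mixed_query())
print(f"{elapsed:.2f}s", support, products)
```

In the real agent the fan-out is driven by the model emitting multiple tool calls in one turn; the sketch only illustrates why the two calls do not have to run back to back.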
Important caveat: This decision is V1-specific, not permanent. The tool approach works because Forethought is the brain and we’re just passing through. If we later replace Forethought with in-house support logic, need multi-step flows, or want independent optimization of support quality — the subagent approach becomes the right choice. The upgrade path is straightforward: promote the tool into a subagent with its own system prompt and tools.
Phase 3 update: We chose a hybrid approach — neither full subagent nor simple tool. The gateway graph routes support queries to a dedicated support_handler node that streams Forethought responses directly (via StreamWriter), while shopping queries go to the existing agent subgraph. This gives us streaming (which the tool couldn’t do) and isolation (which a pure tool doesn’t provide), without the overhead of a full support agent with its own LLM reasoning loop.
When to Revisit: Upgrade Path to Full Subagent
The tool-based approach should be revisited if any of these emerge:
| Signal | Why It Matters |
|---|---|
| V1.1 multi-step flows (forms, buttons, escalation chains) | Tool can’t maintain multi-turn state; needs agent loop |
| Support needs its own tools (ticketing API, account management) | Tool-in-a-tool gets messy; subagent isolates tool surface |
| Support quality needs independent model/prompt tuning | Can’t tune a tool independently from the main agent |
| Support conversations become multi-turn within a single user message | Tool is one-shot; subagent can iterate |
| Forethought is replaced with in-house support logic | More reasoning = more need for a dedicated agent |
Implementation Sketch (Tool-Based, V1)
```python
from typing import Any

from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field


# Input schema: query only, no conversation_id.
# The tool is stateless (like all other tools). The LLM agent owns
# conversation context and formulates a self-contained query.
class ScoutAnswerInput(BaseModel):
    query: str = Field(
        description="A self-contained support question about Fetch. "
        "Rephrase the user's message as a standalone question that "
        "includes any relevant context from the conversation."
    )


# Tool definition
class ScoutAnswerTool(BaseTool):
    name: str = "scout_answer"
    description: str = (
        "Answer customer support questions about Fetch (points, receipts, "
        "account issues, eligibility, etc.). Use this when the user asks "
        "about support topics rather than shopping/product queries. "
        "Formulate a complete, self-contained question."
    )
    args_schema: type[BaseModel] = ScoutAnswerInput
    # Assumption: the Forethought client is injected when the tool is built.
    forethought_client: Any

    def _run(self, query: str, **kwargs) -> str:
        # BaseTool requires a sync entry point; this tool is async-only.
        raise NotImplementedError("scout_answer is async-only; use _arun")

    async def _arun(self, query: str, **kwargs) -> str:
        # Each call creates a new Forethought conversation (POST)
        response = await self.forethought_client.ask(query=query)
        return response.text  # V1: text only
```

Why no `conversation_id` as a tool input:
- The tool is stateless, matching the behavior of all other tools (`search_products`, `web_search`, etc.)
- Forethought returns a `conversation_id` when creating a conversation, but since the tool is recreated per request (via `factory.py`), there’s no instance to persist it on
- The LLM agent already has full conversation context via message history, so it is responsible for distilling a good query
- The tool description and input field description guide the LLM to write self-contained queries
Forethought context_variables:
- These are workflow-specific variables configured in the Forethought dashboard, NOT general-purpose context
- They feed into workflow conditions (e.g., “if `language==es`, route to Spanish workflow”)
- V1 passes empty `{}`; only populate when Forethought workflows are configured to consume specific variables
- Do NOT pass message history here; it won’t be used as conversational context
System prompt addition:
```
You have access to a `scout_answer` tool for customer support questions.
Use it when users ask about: points issues, receipt problems, account help,
eligibility questions, or other support topics.

For mixed queries (shopping + support), call scout_answer AND shopping tools
in parallel, then combine both answers in your response.
```

Key constraints from PRD:
- Post-tool PII/policy scan on Forethought responses (log-only — Forethought is a trusted internal service)
- Store Forethought token in secrets manager
- Graceful fallback if Forethought is down (show help-center CTA + escalation)
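The first and third constraints can be sketched together as a thin wrapper around the Forethought call. This is a minimal illustration under stated assumptions: the regex patterns are simplified, the fallback wording and the 30-second timeout are placeholders, and `ask` is a hypothetical async callable standing in for the real client method.

```python
import asyncio
import logging
import re

logger = logging.getLogger("scout")

# Simplified, illustrative patterns; real detection would need stricter validation.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

# Placeholder copy for the help-center CTA (assumption, not the product text).
FALLBACK_TEXT = "Support is temporarily unavailable. Please visit the Help Center."


def scan_for_pii(text: str) -> list:
    """Return the names of PII patterns found. Log-only: the text is not modified."""
    findings = []
    if SSN_RE.search(text):
        findings.append("ssn")
    if CARD_RE.search(text):
        findings.append("credit_card")
    if findings:
        logger.warning("PII pattern(s) in Forethought response: %s", findings)
    return findings


async def answer_with_fallback(ask, query: str, timeout_s: float = 30.0) -> str:
    """Call ask(query); on timeout, error, or empty body return the fallback text."""
    try:
        text = await asyncio.wait_for(ask(query), timeout=timeout_s)
    except Exception:
        logger.exception("Forethought call failed")
        return FALLBACK_TEXT
    if not text or not text.strip():
        return FALLBACK_TEXT
    scan_for_pii(text)  # log-only gate; the response is returned unchanged
    return text
```

Treating an empty body the same as an error is a design choice here, made because the doc elsewhere reports HTTP 200 responses with no content.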
Open Source Frameworks
Evaluated Frameworks
| Framework | GitHub | What It Does | Can We Use It? |
|---|---|---|---|
| LangGraph | langchain-ai/langgraph | Graph-based workflow with conditional routing | Already in use. Consumer-agent is built on it. Conditional edges are how we’d add routing if we ever need a separate classification node. |
| semantic-router | aurelio-labs/semantic-router | Sub-ms embedding-based routing | Defer to Phase 2. Not useful for multi-intent classification (single-label only). Its real value is as a semantic cache inside ScoutAnswerTool — match incoming support queries against previously answered questions by embedding similarity, serve cached responses for near-duplicates (<25ms vs Forethought round-trip). Worth adding once we have traffic data showing repetitive support query patterns. |
| Instructor | 567-labs/instructor | Structured output from LLMs with Pydantic validation | Not needed. Consumer-agent already uses LangChain’s with_structured_output() which does the same thing — Pydantic model in, typed result out. Instructor is for projects using the raw OpenAI SDK directly. Adding it would be a redundant dependency. |
| AWS Agent Squad | awslabs/agent-squad | Full multi-agent orchestrator with pluggable classifiers | Overkill. We’re adding one tool to an existing agent, not building a multi-agent system. Agent Squad solves a problem we don’t have yet. |
| Forethought MCP server | Forethought docs | MCP server for Forethought integration | Skip for V1. The PRD evaluated this (their Option B) and rejected it — harder to lock down if vendor changes tools, doesn’t eliminate the hard parts (output mapping, multi-step handling). Direct REST wrapper is simpler and more controllable. |
Research Repos (Reference Only)
| Repo | Description | Takeaway |
|---|---|---|
| intellistream/sage-intent | Keyword + LLM hybrid classification | Interesting pattern for combining fast keyword matching with LLM fallback. Could inform Phase 2 tiered approach. |
| JohnnyFoulds/multi-intent-classification | Multi-intent with LLMs + deep learning | Academic reference for multi-label classification techniques. Not directly usable. |
| dmarx/zero-shot-intent-classifier | LangChain-based zero-shot slot filling | Shows how to do zero-shot classification with LangChain. Pattern is similar to what we’d do with with_structured_output(). |
Framework Selection for Consumer-Agent
No new frameworks needed for V1. The approach is:
- Add a `ScoutAnswerTool` (same pattern as existing MCP tools: `BaseTool` + Pydantic schema)
- Update the system prompt to describe when to use it
- The conversational agent handles intent classification implicitly through its reasoning
The only framework worth adding later is semantic-router — not for intent classification (it’s single-label only, can’t handle multi-intent), but as a semantic cache inside the ScoutAnswerTool. When we have production traffic data showing repetitive support queries, semantic-router can:
- Match incoming queries against cached Forethought responses by embedding similarity
- Serve cached answers in `<25ms` for high-frequency FAQs (vs Forethought API round-trip)
- Reduce Forethought API costs and improve availability (cache works even during vendor downtime)
- Provide explicit intent distribution metrics as a side benefit
Repo Analysis: Where Should the Forethought Tool Live?
Four repos are relevant to Phase 1. Analysis of each:
Repo Overview
| Repo | Language | Purpose | Existing Tools/Endpoints |
|---|---|---|---|
| consumer-agent | Python | LangChain/LangGraph conversational AI agent | 6 MCP tool wrappers + WebSearchTool (direct) |
| rover-agent | Go | HTTP orchestrator, mobile entry point | Routes to consumer-agent (Python path) or OpenAI direct (Go path) |
| rover-mcp | Go | MCP tool server (12 tools) | search_products, search_offers, fetch_webpage, etc. |
| consumer-context-service | Go | Unified context API (REST + MCP) | 14 MCP tools, 13 REST endpoints for product/offer enrichment |
Request Flow
```
Mobile Client
    │
    ▼
Rover-Agent (Go, port 8080)
    │  feature flag: "python_agent"
    │
    ├── Python path ──→ Consumer-Agent (Python, port 8080)
    │                       │  calls tools via MCP
    │                       ├── rover-mcp (Go)                 ← product/offer/search tools
    │                       └── consumer-context-service (Go)  ← enrichment tools
    │
    └── Go direct path ──→ OpenAI API ──→ rover-mcp directly (being deprecated)
```

Placement Options
Option A: rover-mcp
Add `scout_answer` as a Go MCP tool alongside `search_products`, `fetch_webpage`, etc.
| Pro | Con |
|---|---|
| Consistent with existing tool pattern | Requires Go implementation (team is Python-focused for agent work) |
| Available to both Go direct and Python paths | Go direct path is being deprecated — not a real benefit |
| Clean separation: tools in rover-mcp, agent logic in consumer-agent | Forethought conversation_id management is harder in a stateless MCP call |
| | Cross-repo coordination for every change |
Option B: consumer-context-service
Add as a new REST endpoint + MCP tool.
| Pro | Con |
|---|---|
| Already has MCP transport on port 8081 | Wrong domain — CCS aggregates product/offer data, not external support APIs |
| consumer-agent already connects to `consumer_mcp` | Mixes concerns: enrichment service ≠ support service |
| | Same Go/cross-repo overhead as Option A |
Option C: consumer-agent (direct BaseTool)
Add `ScoutAnswerTool` as a direct Python `BaseTool`, not an MCP wrapper.
| Pro | Con |
|---|---|
| Exact precedent: `WebSearchTool`, a direct `BaseTool` calling BrightData API, not MCP | Only available on Python path (not Go direct) |
| Single-repo change, fastest to ship | |
| Python makes Forethought client code simpler | |
| Conversation state (episode_id → conversation_id) stays in consumer-agent | |
| Feature flag gating already works in factory.py | |
| No cross-service coordination needed | |
Decision: Option C — consumer-agent directly
`ScoutAnswerTool` should be a direct `BaseTool` in consumer-agent, following the `WebSearchTool` precedent.
The key insight is that WebSearchTool already established this exact pattern: a direct BaseTool in consumer-agent that calls an external API (BrightData SERP) without going through rover-mcp or consumer-context-service. ScoutAnswerTool calling Forethought Solve API is structurally identical.
The “only available on Python path” con is irrelevant — the Go direct path is being deprecated, and all new agent features target consumer-agent.
Implementation location: src/consumer_agent/tools/scout.py (new file, alongside existing tools/ modules)
Recommended Approach for Consumer-Agent
Phase 1: Tool-based approach (completed)
Architecture: Inline classification (Option C). The conversational agent classifies intent implicitly through its reasoning, with no separate classification step or added latency.
Implementation:
- Add `ScoutAnswerTool`: a direct `BaseTool` in consumer-agent wrapping Forethought Solve API (same pattern as `WebSearchTool`)
- Update system prompt: describe when to use `scout_answer` vs shopping tools; instruct parallel calling for mixed queries
- Populate `intent_labels`: infer from tool usage after the fact (called `scout_answer` → label as `CUSTOMER_SUPPORT`; called `search_products` → label as `SHOPPING`; etc.)
- Add safety gates: post-tool PII/policy scan on Forethought responses (log-only; trusted internal service)
- Feature-flag the rollout: gate `scout_answer` tool availability behind a feature flag for phased rollout
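The `intent_labels` step can be sketched as a lookup from tool names to labels. The `CUSTOMER_SUPPORT` and `SHOPPING` labels come from the list above; the `OFFERS` and `GENERAL` entries, and the function name itself, are assumptions for illustration rather than the actual `HistoryMiddleware` code.

```python
# Tool-name to label mapping. CUSTOMER_SUPPORT and SHOPPING come from the text;
# the remaining entries are assumed labels for illustration.
TOOL_INTENT_LABELS = {
    "scout_answer": "CUSTOMER_SUPPORT",
    "search_products": "SHOPPING",
    "search_offers": "OFFERS",  # assumed label
}
DEFAULT_LABEL = "GENERAL"  # assumed label for turns with no labeled tool call


def infer_intent_labels(tool_names: list) -> list:
    """Derive ordered, de-duplicated intent labels from the tools called in a turn."""
    labels = []
    for name in tool_names:
        label = TOOL_INTENT_LABELS.get(name)
        if label and label not in labels:
            labels.append(label)
    return labels or [DEFAULT_LABEL]


print(infer_intent_labels(["scout_answer", "search_offers", "llm_feedback"]))
```

Tools without a mapping (such as the feedback tool) are simply ignored, so a mixed turn yields one label per distinct intent.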
Repos and files changed (consumer-agent only):
| File | Change |
|---|---|
| `src/consumer_agent/tools/scout.py` | `ScoutAnswerTool(BaseTool)` + `ForethoughtClient` (async HTTP client for Solve API). |
| `src/consumer_agent/tools/__init__.py` | Export `ScoutAnswerTool` |
| `src/consumer_agent/factory.py` | Add `scout_answer` to tool assembly in `create_agent_from_config()`, gated by feature flag |
| `settings.yaml` | Add `forethought` config section (`server_url`, `api_key` secret ref, `timeout`) per environment |
| `agent_config.yaml` | Add `scout_answer` to agent tools list |
| `prompts/capabilities.md` | `scout_answer` tool documented |
| `src/consumer_agent/history/middleware.py` | Populate `intent_labels` based on which tools were called in the turn |
| `tests/unit/test_scout.py` | Unit tests for `ScoutAnswerTool` with mocked Forethought API |
Rover-agent change (minimal):
- Pass `scout_answer` feature flag in `feature_flags` dict when calling consumer-agent (same mechanism as existing `product_card` and `web_search` flags)
What this gives us:
- Zero added TTFB for shopping queries (the common case)
- Multi-intent support via parallel tool calls (shopping + support in one turn)
- Intent labels for analytics without a separate classification step
- No new frameworks or dependencies
- Single-repo implementation (consumer-agent), fastest path to production
Limitation discovered: The tool-based approach cannot stream Forethought responses because BaseTool._arun() returns a single str. The full Forethought response must be received before the agent can start generating its final answer. This motivated Phase 3.
Phase 2: Semantic FastPath + Observability (builds on Phase 3, scales to PLT-140)
Prerequisite: Phase 3 gateway graph. Phase 2 adds a semantic matcher node upstream of the gateway classifier, plus intent logging for observability.
Architecture: Semantic Matcher as a Gateway Node
Phase 3 established the gateway graph with intent-based routing:

```
User Query → Gateway Classifier → support_handler (Forethought)
                                → shopping_agent
```

Phase 2 inserts a Semantic Matcher node before the classifier:

```
User Query → Semantic Matcher (cache hit?) → return cached response instantly (<25ms)
                   ↓ (cache miss)
             Gateway Classifier → support_handler → Forethought (~12s)
                                → shopping_agent
```

The semantic matcher is a LangGraph node, not buried inside a tool. This is the key design decision that makes it scalable to PLT-140: the node can later evolve from “cache lookup” to “full intent classification + routing” without restructuring the graph.
Semantic Matcher Node
Uses semantic-router (embedding model + vector similarity) to match incoming queries against known patterns.
Phase 2 behavior (cache mode):
- Embed incoming query using a local model (all-MiniLM-L6-v2, ~15ms)
- Compare against cache of previously answered queries by cosine similarity
- If similarity > threshold (e.g., >0.92), serve cached response instantly via StreamWriter
- If below threshold, pass through to gateway classifier (existing Phase 3 flow)
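The cache-mode steps above reduce to a nearest-neighbor lookup with a similarity cutoff. The sketch below shows that logic in plain Python under stated assumptions: toy 3-dimensional vectors stand in for real sentence embeddings, and `CacheEntry` and `lookup` are hypothetical names, not part of semantic-router.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    embedding: List[float]
    response: str


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(query_embedding: List[float], cache: List[CacheEntry],
           threshold: float = 0.92) -> Optional[str]:
    """Return the best cached response if its similarity exceeds the threshold."""
    best_score, best_entry = 0.0, None
    for entry in cache:
        score = cosine_similarity(query_embedding, entry.embedding)
        if score > best_score:
            best_score, best_entry = score, entry
    if best_entry is not None and best_score > threshold:
        return best_entry.response
    return None  # cache miss: fall through to the gateway classifier


# Toy vectors; in practice these would come from the embedding model.
cache = [CacheEntry([1.0, 0.0, 0.0], "cached answer A")]
hit = lookup([0.99, 0.05, 0.0], cache)
miss = lookup([0.0, 1.0, 0.0], cache)
```

A linear scan is fine for a small FAQ cache; a vector index would be the natural replacement if the cache grows.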
Cache population:
- Passive: cache every successful Forethought response (query embedding → response text)
- Active: pre-seed with known FAQ pairs from Forethought knowledge base
- TTL: 24-48 hours, or invalidate when Forethought knowledge base updates
Benefits:
- Latency: `<25ms` for cache hits vs ~12s Forethought API round-trip
- Cost: zero per cache hit vs per-query Forethought API cost
- Availability: cached responses work even if Forethought is down
Intent Logging + Observability
Hook into the gateway graph’s updates mode to log intent classifications:
- What to log: query text (redacted), classified intent, which node handled it, confidence score, response latency, cache hit/miss
- Schema: generic `(query, predicted_intent, handler_node, confidence, latency_ms, cache_hit)`, the same schema PLT-140 needs for its dashboard
- Storage: append to an analytics table (DynamoDB or S3 parquet) for downstream reporting
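One way to pin the schema down is a small immutable record type that serializes to a JSON line. The field names follow the tuple above; the class name, the JSON-lines encoding, and the sample values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IntentLogRecord:
    query: str  # expected to be redacted before logging
    predicted_intent: str
    handler_node: str
    confidence: float
    latency_ms: int
    cache_hit: bool

    def to_json_line(self) -> str:
        """Serialize as one JSON line, suitable for append-only storage."""
        return json.dumps(asdict(self), sort_keys=True)


# Sample values are illustrative only.
record = IntentLogRecord(
    query="[REDACTED]",
    predicted_intent="support",
    handler_node="support_handler",
    confidence=0.95,
    latency_ms=12000,
    cache_hit=False,
)
line = record.to_json_line()
print(line)
```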
This data serves dual purpose:
- Phase 2: monitor Forethought deflection rate, empty response rate, cache hit rate
- PLT-140: training data for the dedicated intent classifier, intent distribution dashboard
Classification Metrics
Using the logged data, track:
- Cache hit rate: % of support queries served from cache (target: 30-50% of repeat FAQs)
- Forethought empty response rate: queries that return HTTP 200 with no content
- Misroute rate: queries that hit the wrong handler (requires manual sampling initially)
- Deflection rate: % of support queries fully resolved without human escalation
- Latency P50/P95: cache path vs Forethought path vs shopping path
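As a rough illustration of how two of these metrics could be derived from logged records, the sketch below computes cache hit rate and nearest-rank latency percentiles. The record shape mirrors the logging schema; the sample numbers are made up for the example and are not measurements.

```python
import math


def cache_hit_rate(records: list) -> float:
    """Fraction of support records served from cache (0.0 if there are none)."""
    support = [r for r in records if r["predicted_intent"] == "support"]
    if not support:
        return 0.0
    return sum(1 for r in support if r["cache_hit"]) / len(support)


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative records only.
records = [
    {"predicted_intent": "support", "cache_hit": True, "latency_ms": 20},
    {"predicted_intent": "support", "cache_hit": False, "latency_ms": 11900},
    {"predicted_intent": "support", "cache_hit": False, "latency_ms": 12600},
    {"predicted_intent": "shopping", "cache_hit": False, "latency_ms": 900},
]
support_latencies = [r["latency_ms"] for r in records if r["predicted_intent"] == "support"]
```

Misroute and deflection rates are left out because they depend on labels (manual sampling, escalation outcomes) that the log schema alone does not carry.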
Confidence-Based Escalation
Add a confidence threshold to the semantic matcher:
- High confidence (>0.92): serve cached response
- Medium confidence (0.70-0.92): pass to Forethought (existing flow)
- Low confidence (`<0.70`): flag for human review or route to live agent handoff
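The three bands map directly onto a small routing function. This is a sketch under one stated assumption: the text does not say which band owns the exact boundary values, so here 0.92 and 0.70 are both treated as medium confidence. The `Route` enum is a hypothetical name.

```python
from enum import Enum


class Route(Enum):
    SERVE_CACHED = "serve_cached"
    FORETHOUGHT = "forethought"
    HUMAN_REVIEW = "human_review"


HIGH_THRESHOLD = 0.92
LOW_THRESHOLD = 0.70


def route_by_confidence(similarity: float) -> Route:
    """Map a similarity score onto the three escalation bands."""
    if similarity > HIGH_THRESHOLD:
        return Route.SERVE_CACHED
    if similarity >= LOW_THRESHOLD:  # boundary handling is an assumption
        return Route.FORETHOUGHT
    return Route.HUMAN_REVIEW


print(route_by_confidence(0.95), route_by_confidence(0.80), route_by_confidence(0.50))
```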
Scaling to PLT-140
The Phase 2 semantic matcher node is designed to evolve into PLT-140’s 3-layer hybrid router:
| Phase 2 | PLT-140 Evolution |
|---|---|
| Semantic matcher (cache mode) | Layer 2: Semantic Router — same embedding infra, expanded from cache to full intent taxonomy (List Planning, Nearby Stores, Recipe-to-Cart, Support, etc.) |
| Gateway classifier (LLM) | Layer 3: LLM Fallback — already exists, handles complex/ambiguous queries |
| Not yet needed | Layer 1: Regex — add structured pattern matching for high-volume exact patterns (10-20% traffic) |
| Intent logging | PLT-140’s observability + training data pipeline |
| Classification metrics | PLT-140’s dashboard/reporting |
| Confidence-based escalation | PLT-140’s routing confidence thresholds |
The gateway graph becomes PLT-140’s orchestrator — new intent categories just add new edges and handler nodes (billing bot, onboarding bot, etc.) to the same graph structure.
Phase 3: Gateway + Forethought Streaming (in progress)
Why: Phase 1’s tool-based approach cannot stream Forethought responses: `BaseTool._arun()` returns a single `str`, so the entire Forethought response must arrive before the agent can start its final answer. Phase 3 solves this by building a custom LangGraph graph with a dedicated gateway classification node that routes support queries directly to Forethought streaming, bypassing the tool limitation.
Architecture: Custom LangGraph StateGraph with an intent classification gateway (gpt-4.1-mini structured output) that routes to either a Forethought streaming handler or the existing shopping agent.
```
START → gateway (classifier) → [conditional]
    → support_handler → END                    (pure support)
    → shopping_agent → END                     (pure shopping)
    → support_handler → shopping_agent → END   (mixed)
```

New module: `src/consumer_agent/gateway/`
| File | Purpose |
|---|---|
| `gateway/state.py` | `GatewayState` TypedDict (`messages`, `intent`, `scout_query`, `shopping_query`) |
| `gateway/classifier.py` | Gateway node: gpt-4.1-mini structured output for intent classification + query decomposition |
| `gateway/support_handler.py` | Forethought streaming node: uses `StreamWriter` to push `TextEvent` chunks directly to client |
| `gateway/graph.py` | `StateGraph` assembly + compilation with conditional routing edges |
| `gateway/stream_adapter.py` | Converts graph `astream()` output (messages/updates/custom modes) to `StreamEvent` iterator |
Other files changed:
| File | Change |
|---|---|
| `tools/scout.py` | Added `ask_stream()` method to `ForethoughtClient`: async SSE streaming from Forethought |
| `factory.py` | Added `create_gateway_agent_from_config()` alongside existing factory function |
| `api/main.py` | Feature-flag-gated gateway path: `scout_answer` flag → gateway graph instead of regular agent |
| `history/middleware.py` | Added `intent_labels` override param to `wrap_stream()` for support path |
| `agent_config.yaml` | Added `gateway` agent entry with configurable model (gpt-4.1-mini) |
Key design decisions:
- Feature flag at API level: same `scout_answer` flag from Phase 1 gates the gateway path. Rover-agent PR #121 already passes this flag.
- Decomposition-first classification: the gateway classifier decomposes the message into questions first, then derives intent from which query fields are populated (no separate classification step).
- Support keyword anchoring: explicit keyword list (points, receipt, scan, account, etc.) reduces false negatives for support detection.
- Synthetic events for support path: `ResponseIdEvent` (with `ft_` prefix) and zero-token `UsageEvent` keep HistoryMiddleware’s storage flow working without an OpenAI call.
- Shopping path unchanged: when flag is off, zero latency impact. When flag is on, shopping queries get +50-100ms gateway overhead (gpt-4.1-mini classification call).
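The decomposition-first idea can be sketched as a pure function over the classifier output: intent is never predicted directly, it falls out of which query fields are populated. The `scout_query` and `shopping_query` field names come from `GatewayState`; the intent strings and the default-to-shopping behavior for an empty decomposition are assumptions for illustration.

```python
from typing import Optional, TypedDict


class Decomposition(TypedDict):
    scout_query: Optional[str]
    shopping_query: Optional[str]


def derive_intent(d: Decomposition) -> str:
    """Derive the routing intent from which decomposed queries are non-empty."""
    has_support = bool(d["scout_query"] and d["scout_query"].strip())
    has_shopping = bool(d["shopping_query"] and d["shopping_query"].strip())
    if has_support and has_shopping:
        return "mixed"
    if has_support:
        return "support"
    # Assumption: anything without a support query goes to the shopping agent.
    return "shopping"


mixed = derive_intent({"scout_query": "My receipt didn't scan", "shopping_query": "coffee offers"})
print(mixed)
```

The returned string is the kind of value a conditional edge function would use to pick the next node in the gateway graph.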
Prompt overlap analysis (GATEWAY_SYSTEM_PROMPT vs shopping agent):
The gateway classifier and shopping agent prompts serve different purposes and do not conflict:
| Topic | Gateway Prompt (classifier.py) | Shopping Agent (conversational.txt + capabilities.md) |
|---|---|---|
| Support scope | Keyword list: points, receipt, scan, account, etc. Used for binary classification only. | Limitations section: receipt scanning, point redemption, account modification → redirect to app features. |
| Shopping scope | "Everything else": one-line catch-all. | Detailed: product search, offers, budget, dietary, location-based, personalization, etc. |
| Image handling | Explicit section: receipt images → support, product images → shopping, ambiguous → default shopping. Generates text description for downstream. | capabilities.md lists “Image Analysis” as a core capability (analyze products, fridge contents, offer matching). |
| Behavioral rules | None — classifier only outputs structured JSON (intent + queries). | Extensive: tone, word cap, markdown formatting, tool discipline, safety, refusals, disclaimers. |
Key difference: The gateway prompt asks “which bucket?” (~50ms, gpt-4.1-mini, no tools). The shopping agent prompt asks “how to respond” (gpt-5-mini, full tool suite, detailed formatting rules). No overlap in behavioral instructions.
Notable behavior change: In the original (non-gateway) path, receipt images hit the shopping agent which redirects to the Scan tab. In the gateway path, receipt images are classified as “support” and routed to Forethought, which provides actual support answers. This is the intended improvement.
What this gives us over Phase 1:
- Streaming support responses — Forethought tokens stream directly to the client as they arrive
- Dedicated classification model — can be optimized independently from the main conversational agent
- Cleaner separation — support handling is isolated in its own graph node, not embedded in the main agent’s tool-calling loop
- Better mixed-intent handling: explicit decomposition into `scout_query` + `shopping_query` instead of relying on LLM tool-call parallelism
What to skip
- New frameworks (Instructor, AWS Agent Squad, Forethought MCP): existing LangChain/LangGraph stack is sufficient
- spaCy preprocessing pipeline — LLM handles normalization natively
- Offline HDBSCAN clustering — useful for Phase 2 taxonomy discovery, not for V1
- Regex guardrail layer — premature optimization without traffic data
Open Questions / TODOs
Items discovered during V1 implementation that need follow-up:
Forethought Configuration
- Identify existing Forethought workflows: The Fetch Forethought dashboard reportedly has workflows configured. Need to audit what exists and whether any are relevant to the headless Solve API path vs. the widget path.
- Determine which `context_variables` workflows expect: `context_variables` are workflow-specific, not general-purpose context. We currently pass `{}`. If workflows branch on variables like `language`, `platform`, or `user_id`, we should populate them. Requires coordination with the CX/Forethought admin team.
- Sandbox vs. production API key: Production key obtained and configured (`50c18d31-...`). Stored in `rover-agent-{{env}}/forethought-api-key`.
- Validate Forethought knowledge base coverage: Test whether the sandbox knowledge base has sufficient coverage for common Fetch support queries, or if content needs to be added/updated in the Forethought dashboard.
- Sandbox workflow routing broken — After all sandbox workflows were activated (2026-03-16), intent routing became unreliable. Some queries return empty responses. Forethought confirmed: activated workflows without intent descriptions can degrade routing. Use production API key for reliable testing until sandbox is fixed.
Forethought Streaming & Latency
- No token-level streaming available: Despite the `stream: true` parameter, the Solve API returns the full response as a single NDJSON line (`application/x-ndjson` with `widget_components`), not incremental SSE tokens. Both `stream: true` and `stream: false` return identical content; the only difference is transport format (NDJSON vs plain JSON). Forethought confirmed SSE is not currently supported via API.
- Token-level streaming may be coming: Forethought is rolling out token-level streaming for their widget (week of 2026-03-17). Pending confirmation on whether the API will also support it.
- Average Solve API latency: ~12s. Measured on stage (2026-03-17) across successful responses:

| Query | Time to First Data | Notes |
|---|---|---|
| Support query 1 | 11.9s | Full response in one NDJSON line |
| Support query 2 | 11.6s | Full response in one NDJSON line |
| "Where are my points?" | 12.6s | Full response in one NDJSON line |
| Failed query | 30s timeout | Empty body (0 lines) |

This latency creates a noticeable gap in the mobile chat UX compared to the rest of the flow which streams token-by-token. Token-level streaming would significantly improve perceived responsiveness.
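Because the body arrives as newline-delimited JSON, a client can parse it line by line and treat zero lines as the failure case from the table. The sketch below shows only that parsing step; the sample payload with an empty `widget_components` list is a hypothetical shape, since the actual response structure is not documented here.

```python
import json


def parse_ndjson(body: str) -> list:
    """Parse newline-delimited JSON into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Hypothetical payload shape, for illustration only.
single = parse_ndjson('{"widget_components": []}\n')
empty = parse_ndjson("")  # the "Empty body (0 lines)" case
print(len(single), len(empty))
```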
Query Quality
- Evaluate LLM query formulation quality: The tool description instructs the LLM to write self-contained queries, but we have no data on how well it actually does this. After stage deployment, sample `scout_answer` tool calls to evaluate query quality and refine the description/input schema if needed.
- Consider adding examples to the tool description: If query quality is poor, adding few-shot examples to the input field description or `capabilities.md` could help (e.g., bad: “what about it?”, good: “How do I recover my Fetch account if I lost access to my email?”).
Content Safety
- Content moderation beyond PII: V1 only scans for SSN and credit card patterns. Forethought could return off-topic, incorrect, or policy-violating content. Evaluate whether a broader content moderation gate is needed after observing production responses.
Observability
- Forethought response quality metrics: Track empty response rate, PII detection rate, and user satisfaction with `scout_answer` responses to inform Phase 2 decisions.
- Query-response logging for evaluation — Consider logging (redacted) query/response pairs to build an evaluation dataset for tuning query formulation and measuring Forethought answer quality.
Appendix A: Stage Testing Queries
Sample queries for validating intent classification and Forethought response quality. In Phase 1, support queries route through the `scout_answer` tool. In Phase 3 (gateway), support queries route through the `support_handler` node, which streams Forethought responses directly.
Points & Rewards
| Query | Expected Intent | Expected Tool |
|---|---|---|
| "Where are my points?" | SUPPORT | scout_answer |
| "I scanned a receipt but didn't get points" | SUPPORT | scout_answer |
| "How long does it take for points to show up?" | SUPPORT | scout_answer |
| "Why did my points disappear?" | SUPPORT | scout_answer |
Receipt Issues
| Query | Expected Intent | Expected Tool |
|---|---|---|
| "My receipt wasn't accepted" | SUPPORT | scout_answer |
| "Can I scan an old receipt?" | SUPPORT | scout_answer |
| "The app says my receipt is a duplicate but I only scanned it once" | SUPPORT | scout_answer |
Account Problems
| Query | Expected Intent | Expected Tool |
|---|---|---|
| "I can't log into my account" | SUPPORT | scout_answer |
| "How do I change my email address?" | SUPPORT | scout_answer |
| "I have a problem with my account" | SUPPORT | scout_answer |
| "How do I delete my Fetch account?" | SUPPORT | scout_answer |
App Issues
| Query | Expected Intent | Expected Tool |
|---|---|---|
| "The app keeps crashing" | SUPPORT | scout_answer |
| "Why can't I redeem my points?" | SUPPORT | scout_answer |
| "The scan button isn't working" | SUPPORT | scout_answer |
General Support / How Fetch Works
| Query | Expected Intent | Expected Tool |
|---|---|---|
| "How do referrals work?" | SUPPORT | scout_answer |
| "How do I contact support?" | SUPPORT | scout_answer |
| "What are Fetch Points worth?" | SUPPORT | scout_answer |
| "How does Fetch make money?" | SUPPORT | scout_answer |
Mixed Intent (support + shopping)
| Query | Expected Intents | Expected Tools |
|---|---|---|
| "Where are my points and show me coffee offers" | SUPPORT + OFFERS | scout_answer + search_offers (parallel) |
| "My receipt didn't scan, also find me snack deals" | SUPPORT + OFFERS | scout_answer + search_offers (parallel) |
Negative Cases (should NOT trigger scout_answer)
| Query | Expected Intent | Expected Tool |
|---|---|---|
| "Find me gluten-free snacks" | SHOPPING | search_products / web_search |
| "What coffee offers are available?" | OFFERS | search_offers |
| "Hi there!" | GENERAL | None |
| "Compare Huggies vs Pampers" | SHOPPING | web_search |
Appendix B: Multi-Intent Case Study (Stage, 2026-03-14, Phase 1)
Successful multi-intent detection and parallel tool execution on stage using the Phase 1 tool-based approach. This validates the priority-based Intent Detection Protocol added in V1.1. In Phase 3 (gateway), multi-intent is handled by the gateway classifier decomposing the message into `scout_query` + `shopping_query`, with the support handler streaming first, then the shopping agent.
User message: “My receipt didn’t scan, and find me coffee offers”
Detected intents: SUPPORT + OFFERS — both correctly identified via the keyword-anchored support check (“receipt”, “scan”) and offers detection (“offers”).
Key observation: All three tool calls were fired in parallel (same timestamp), not sequentially. The model decomposed the message into two distinct requests and addressed both.
Tool Call 1: scout_answer (SUPPORT intent)
Arguments:
{ "query": "My receipt didn't scan when I tried to upload it in the app; I need help getting that purchase validated so I can earn Fetch Points. Please advise next steps and what information to provide." }
Note: The LLM reformulated the user’s terse “my receipt didn’t scan” into a self-contained, context-rich query for Forethought, exactly as instructed by the tool description.
Response (from Forethought Solve API):
Sorry your receipt didn’t scan. Here’s what to do:
- Try snapping the receipt again, making sure it’s clear and all edges are visible.
- If you still don’t earn points, use the “Correct my receipt” option in the app.
- Provide clear images and details of your purchase.
Let us know if you need more help!
Related Articles:
Tool Call 2: search_offers (OFFERS intent)
Arguments:
{ "query": "coffee", "user_id": "/redacted/", "limit": 20 }
Response: Returned structured offer data including:
- Multiround FFP — 100 points, category: BEVERAGES, with image and detail link
- Additional carbonated soft drink offers (`search_offers` default limit applied)
The response included structured JSON content with offer cards for the mobile UI.
Tool Call 3: llm_feedback (mandatory)
Arguments:
{ "message": "{\"observed_intent\":\"support_and_offer_search\",\"tools_used\":[\"scout_answer\",\"search_offers\"],\"gaps_or_wishes\":\"Would like clearer error codes for failed receipt scans and ability to retry within chat\",\"success\":\"context_given\",\"unsupported_feature\":\"none\",\"assumed_defaults\":[\"search_offers default limit applied\"]}" }
Note: The feedback tool correctly reported both scout_answer and search_offers as tools used, confirming multi-intent execution.
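The `message` argument above is a JSON document serialized into a string, so it has to be decoded twice before `tools_used` can be inspected. A minimal sketch of that check follows; `parse_feedback` and `confirms_multi_intent` are hypothetical helper names, and the payload is a shortened copy of the one shown above.

```python
import json


def parse_feedback(arguments_json: str) -> dict:
    """Decode the tool arguments, then decode the JSON string inside `message`."""
    outer = json.loads(arguments_json)
    return json.loads(outer["message"])


def confirms_multi_intent(feedback: dict, expected_tools: set[str]) -> bool:
    """True when every expected tool appears in the reported `tools_used`."""
    return expected_tools.issubset(feedback.get("tools_used", []))


# Shortened version of the payload above, built with json.dumps so the
# inner escaping is produced by the library rather than written by hand.
inner = {
    "observed_intent": "support_and_offer_search",
    "tools_used": ["scout_answer", "search_offers"],
}
raw_arguments = json.dumps({"message": json.dumps(inner)})

feedback = parse_feedback(raw_arguments)
ok = confirms_multi_intent(feedback, {"scout_answer", "search_offers"})
```

The same check can flag a dropped intent: if `tools_used` lacks one of the expected tools, `confirms_multi_intent` returns `False`.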
Final Assistant Response
Try re-snapping the receipt (flat, all edges visible). If it still fails, use the app’s “Correct my receipt” option and attach clear images — support links sent. I also found coffee offers below.
The assistant synthesized both tool results into a single coherent response, addressing the support issue first and then pivoting to the offers.