
Landing Page Agent Latency Optimization

Date: 2025-12-08 Status: Complete Author: Prakash Chaudhary

This study evaluates model options to reduce latency for the landing page (prompt-suggestions) agent. We tested four configurations across three models (gpt-4.1-mini, gpt-5-mini, gpt-5.1), varying the reasoning effort level where the model supports it.

  • gpt-5-mini minimal and gpt-4.1-mini perform nearly identically (4.47s mean)
  • gpt-5.1 with reasoning effort none is the most consistent (p95: 5.26s) but costs 5x as much as gpt-5-mini per output token
  • Reasoning effort dominates latency: on gpt-5-mini, low effort adds ~13.5s over minimal (18.0s vs 4.47s mean)

Use gpt-5-mini with minimal reasoning effort for the landing page agent: it matches gpt-4.1-mini on mean latency while using a reasoning model architecture, which may yield better quality. gpt-4.1-mini is a reasonable alternative at slightly lower cost.


The landing page displays contextual prompt suggestions when users open the AI assistant. The original implementation used gpt-5-mini with low reasoning effort, resulting in ~15 second response times - too slow for a good user experience.

  • Fast time to first text token (TTFT) for a responsive UI
  • Generate 5 contextual prompt suggestions as JSON
  • No tool calls required (use_tools: false)
  • Cost-effective at scale
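
To make the JSON requirement concrete, here is a minimal sketch of how the agent's reply could be validated before it reaches the UI. The payload shape (`{"suggestions": [...]}`) and the helper name `parse_suggestions` are assumptions for illustration; the study does not specify the actual schema.

```python
import json

EXPECTED_COUNT = 5  # the landing page shows 5 suggestions


def parse_suggestions(raw: str) -> list[str]:
    """Parse the agent's JSON reply and check it holds exactly 5 non-empty strings.

    Assumes a hypothetical payload shape: {"suggestions": ["...", ...]}.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    suggestions = data.get("suggestions") if isinstance(data, dict) else None
    if not isinstance(suggestions, list) or len(suggestions) != EXPECTED_COUNT:
        raise ValueError(f"expected a list of {EXPECTED_COUNT} suggestions")
    if not all(isinstance(s, str) and s.strip() for s in suggestions):
        raise ValueError("each suggestion must be a non-empty string")
    return suggestions
```

A check like this is cheap compared with the multi-second model call, so it can run on every response without affecting the latency numbers discussed here.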

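Benchmarks were run on Dec 8, 2025 with 32 samples per model. That is enough to estimate mean latency, but a p95 computed from 32 samples rests on only the one or two slowest requests, so small p95 differences should be read with caution.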

| Model | reasoning_effort | TTFT (mean) | Total (mean) | Total (p95) | Reasoning Tokens | Output ($/1M tokens) |
|---|---|---|---|---|---|---|
| gpt-5-mini | minimal | 0.78s | 4.47s | 6.68s | 0 | $2.00 |
| gpt-4.1-mini | N/A | 0.79s | 4.47s | 6.41s | 0 | $1.60 |
| gpt-5-mini | low | 13.59s | 18.0s | 29.14s | 790 | $2.00 |
| gpt-5.1 | none | 0.77s | 4.57s | 5.26s | 0 | $10.00 |

Figure: Mean total response time in seconds (lower is better): gpt-4.1-mini 4.47, gpt-5-mini (minimal) 4.47, gpt-5.1 (none) 4.57, gpt-5-mini (low) 18.0.

Figure: p95 response time in seconds (lower is better, indicating consistency): gpt-5.1 (none) 5.26, gpt-4.1-mini 6.41, gpt-5-mini (minimal) 6.68, gpt-5-mini (low) 29.14.

| Model | No/Minimal Reasoning | Low Reasoning | Overhead |
|---|---|---|---|
| gpt-5-mini | 4.47s | 18.0s | +13.5s (4x) |

Reasoning adds significant latency even at low effort. The ~790 reasoning tokens for gpt-5-mini low are internal “thinking” generated before any visible output, which is why TTFT rises from 0.78s to 13.59s. This extra work is not needed for simple JSON generation.
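
The overhead figures follow directly from the measured means. As a quick worked check, using only the two gpt-5-mini values reported in this study:

```python
minimal_mean_s = 4.47  # gpt-5-mini, minimal effort, mean total time (seconds)
low_mean_s = 18.0      # gpt-5-mini, low effort, mean total time (seconds)

overhead_s = round(low_mean_s - minimal_mean_s, 2)  # absolute overhead in seconds
slowdown = round(low_mean_s / minimal_mean_s, 2)    # relative slowdown factor

print(overhead_s, slowdown)  # 13.53 4.03
```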

| Model | Total (mean) | Total (p95) | Variance |
|---|---|---|---|
| gpt-5.1 (none) | 4.57s | 5.26s | Low (most consistent) |
| gpt-4.1-mini | 4.47s | 6.41s | Low (very consistent) |
| gpt-5-mini (minimal) | 4.47s | 6.68s | Low |
| gpt-5-mini (low) | 18.0s | 29.14s | High |

Key insight: All configurations that run with no or minimal reasoning perform similarly (~4.5s mean). gpt-5.1 with effort none shows the best consistency (lowest p95) but at 5x the output price of gpt-5-mini.

For 1M requests generating ~500 output tokens each:

| Model | Output price ($/1M tokens) | Monthly cost (1M requests) |
|---|---|---|
| gpt-4.1-mini | $1.60 | $800 |
| gpt-5-mini | $2.00 | $1,000 |
| gpt-5.1 | $10.00 | $5,000 |

gpt-5.1 costs 5x as much as gpt-5-mini and 6.25x as much as gpt-4.1-mini per output token, with similar mean latency.
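
The monthly figures can be reproduced with a few lines of arithmetic. The sketch below assumes, as the table does, 1M requests per month at ~500 output tokens each and counts output tokens only; the function name is illustrative.

```python
def monthly_output_cost(requests: int, tokens_per_request: int, price_per_million: float) -> float:
    """Output-token cost in dollars; input tokens are ignored, as in the table."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million


# Output prices in $ per 1M tokens, taken from the benchmark table.
PRICES = {"gpt-4.1-mini": 1.60, "gpt-5-mini": 2.00, "gpt-5.1": 10.00}

for model, price in PRICES.items():
    cost = monthly_output_cost(1_000_000, 500, price)
    print(f"{model}: ${cost:,.0f}")
```

Because cost scales linearly with both request volume and tokens per request, the price ratios between models stay the same at any scale.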

| Model | TTFT (mean) | Total (mean) | p95 Total | Notes |
|---|---|---|---|---|
| gpt-4.1-mini | 0.79s | 4.47s | 6.41s | Best cost per performance |
| gpt-5.1 (none) | 0.77s | 4.57s | 5.26s | Most consistent but expensive |
| gpt-5-mini (minimal) | 0.78s | 4.47s | 6.68s | Similar performance, higher cost |

For landing page UX, consistency matters as much as speed - unpredictable latency (high p95) creates a poor experience for a subset of users.


Use gpt-5-mini with minimal reasoning effort: a reasoning model with the same mean latency as gpt-4.1-mini.

| Criterion | gpt-5-mini (minimal) | Assessment |
|---|---|---|
| TTFT | 0.78s (mean) | Good - users see content quickly |
| Total Time | 4.47s (mean), 6.68s (p95) | Good consistency |
| Cost | $2.00 per 1M output tokens | Reasonable |
| Quality | Better potential | Reasoning model architecture |

gpt-4.1-mini is also worth trying: it costs slightly less ($1.60 vs $2.00 per 1M output tokens) and has a marginally better p95 (6.41s vs 6.68s).

  • gpt-5-mini low: 18s mean is too slow, reasoning overhead unnecessary
  • gpt-5.1 none: 5x the cost for only a small consistency gain

  • Samples per model: 32 (8 concurrent requests per batch)
  • Location: Chicago (41.8781, -87.6298)
  • Task: Generate 5 contextual prompt suggestions as JSON
  • Agent config: use_tools: false, use_history: false
  • TTFT (Time to First Text): Time from request to first text token
  • Total Time: Time from request to completion
  • Reasoning Tokens: Count of reasoning tokens (for reasoning models)
  • Mean: Average across all samples
  • Std: Standard deviation (measure of spread)
  • p50: Median value
  • p95: 95th percentile (95% of requests complete at or below this time)
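
The methodology above can be sketched as a small harness: requests are issued in batches of 8 until 32 samples are collected, TTFT and total time are recorded per request, and the summary statistics are computed at the end. This is not the study's actual benchmark code. `fake_stream` is a hypothetical stand-in for a streaming model call, and all function names are illustrative.

```python
import asyncio
import random
import statistics
import time


async def fake_stream(n_chunks: int = 5):
    """Hypothetical stand-in for a streaming model call; yields text chunks."""
    for i in range(n_chunks):
        await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated network/model delay
        yield f"chunk-{i}"


async def timed_request() -> tuple[float, float]:
    """Return (ttft, total) in seconds for one streamed request."""
    start = time.perf_counter()
    ttft = None
    async for _chunk in fake_stream():
        if ttft is None:  # first text chunk has arrived
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttft, total


async def run_benchmark(samples: int = 32, concurrency: int = 8) -> list[tuple[float, float]]:
    """Run samples // concurrency batches, each with `concurrency` parallel requests."""
    results = []
    for _ in range(samples // concurrency):
        batch = await asyncio.gather(*(timed_request() for _ in range(concurrency)))
        results.extend(batch)
    return results


def summarize(values: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, median, and 95th percentile."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "p50": statistics.median(values),
        "p95": cuts[94],  # the 95th of the 99 percentile cut points
    }


if __name__ == "__main__":
    results = asyncio.run(run_benchmark())
    print("TTFT:", summarize([ttft for ttft, _total in results]))
    print("Total:", summarize([total for _ttft, total in results]))
```

With 32 samples, the p95 is interpolated between the two or three slowest requests, so a single slow outlier can move it noticeably.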

PR #49 noted that gpt-5-mini minimal wasn’t tested because:

“Parallel tool calls are not supported when reasoning_effort is set to minimal”

This limitation is not relevant for the landing page agent because:

  • Agent has use_tools: false
  • No tool calls involved - only JSON generation

This study confirms gpt-5-mini minimal is a viable option for tool-free agents.
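
One way to encode this conclusion is to derive the reasoning effort from the agent's tool setting. The helper below is a hypothetical sketch, not the project's configuration code. Falling back to "low" for tool-using agents is an assumption based on the original implementation.

```python
def pick_reasoning_effort(use_tools: bool) -> str:
    """Choose a gpt-5-mini reasoning effort from the agent's tool setting.

    Hypothetical helper: "minimal" is chosen only for tool-free agents, because
    parallel tool calls are not supported at minimal effort. The "low" fallback
    for tool-using agents is an assumption.
    """
    return "low" if use_tools else "minimal"


# Landing page agent config as described in this study.
landing_page_agent = {"use_tools": False, "use_history": False}
print(pick_reasoning_effort(landing_page_agent["use_tools"]))  # minimal
```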