Landing Page Agent Latency Optimization
Date: 2025-12-08 Status: Complete Author: Prakash Chaudhary
Executive Summary
This study evaluates model options for reducing the latency of the landing page (prompt-suggestions) agent. We tested four model configurations across three models (gpt-4.1-mini, gpt-5-mini, gpt-5.1) at various reasoning effort levels.
Key Findings
- gpt-5-mini minimal and gpt-4.1-mini perform nearly identically (4.47s mean total time)
- gpt-5.1 none is the most consistent (p95: 5.26s) but costs 5x as much per output token as gpt-5-mini
- Reasoning effort significantly impacts latency (on gpt-5-mini, low adds ~13.5s of overhead over minimal)
Recommendation
Use gpt-5-mini-minimal for the landing page agent: it matches the latency of gpt-4.1-mini while using a reasoning model architecture, which may yield better quality. gpt-4.1-mini is also worth trying for slightly lower cost.
Problem Statement
The landing page displays contextual prompt suggestions when users open the AI assistant. The original implementation used gpt-5-mini with low reasoning effort, resulting in response times of ~15 seconds, too slow for a good user experience.
Requirements
- Fast time to first text (TTFT) for a responsive UI
- Generate 5 contextual prompt suggestions as JSON
- No tool calls required (`use_tools: false`)
- Cost-effective at scale
Benchmark Results
Benchmarks were run on Dec 8, 2025 with 32 samples per model configuration to reduce the effect of run-to-run variance.
Full Comparison Table
| Model | reasoning_effort | TTFT (mean) | Total (mean) | Total (p95) | Reasoning Tokens | Output Price ($/1M tokens) |
|---|---|---|---|---|---|---|
| gpt-5-mini | minimal | 0.78s | 4.47s | 6.68s | 0 | $2.00 |
| gpt-4.1-mini | N/A | 0.79s | 4.47s | 6.41s | 0 | $1.60 |
| gpt-5-mini | low | 13.59s | 18.0s | 29.14s | 790 | $2.00 |
| gpt-5.1 | none | 0.77s | 4.57s | 5.26s | 0 | $10.00 |
Visual Comparison (Total Time - Mean)
xychart-beta
title "Mean Total Response Time (Lower is Better)"
x-axis ["gpt-4.1-mini", "gpt-5-mini-minimal", "gpt-5.1-none", "gpt-5-mini-low"]
y-axis "Seconds" 0 --> 20
bar [4.47, 4.47, 4.57, 18.0]
Visual Comparison (Consistency - p95)
xychart-beta
title "p95 Response Time (Lower is Better, indicating consistency)"
x-axis ["gpt-5.1-none", "gpt-4.1-mini", "gpt-5-mini-minimal", "gpt-5-mini-low"]
y-axis "Seconds" 0 --> 30
bar [5.26, 6.41, 6.68, 29.14]
Analysis
Reasoning Effort Impact
| Model | No/Minimal Reasoning | Low Reasoning | Overhead |
|---|---|---|---|
| gpt-5-mini | 4.47s | 18.0s | +13.5s (4x) |
Reasoning adds significant latency even at low effort. The ~790 reasoning tokens for gpt-5-mini low represent internal “thinking” that doesn’t improve output quality for simple JSON generation.
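The overhead figures in the table above can be reproduced from the two mean total times. The snippet below is a minimal sketch of that arithmetic; the function name `reasoning_overhead` is only illustrative and is not part of the benchmark code.

```python
# Mean total times (seconds) for gpt-5-mini, copied from the table above.
MINIMAL_MEAN_S = 4.47
LOW_MEAN_S = 18.0


def reasoning_overhead(baseline_s: float, with_reasoning_s: float) -> tuple:
    """Return (absolute overhead in seconds, slowdown ratio)."""
    return with_reasoning_s - baseline_s, with_reasoning_s / baseline_s


overhead_s, ratio = reasoning_overhead(MINIMAL_MEAN_S, LOW_MEAN_S)
print(f"overhead: +{overhead_s:.1f}s, slowdown: {ratio:.1f}x")
```

The difference is about 13.5 seconds and the ratio is about 4x, matching the Overhead column.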
Consistency Analysis (Key Finding)
| Model | Total (mean) | Total (p95) | Variance |
|---|---|---|---|
| gpt-5.1 (none) | 4.57s | 5.26s | Low (most consistent) |
| gpt-4.1-mini | 4.47s | 6.41s | Low (very consistent) |
| gpt-5-mini (minimal) | 4.47s | 6.68s | Low |
| gpt-5-mini (low) | 18.0s | 29.14s | High |
Key insight: All configurations with no or minimal reasoning perform similarly (~4.5s mean). gpt-5.1-none shows the best consistency (lowest p95) but at 5x the output cost of gpt-5-mini.
Cost Analysis
For 1M requests per month generating ~500 output tokens each:
| Model | Output Cost | Monthly Cost (1M req) |
|---|---|---|
| gpt-4.1-mini | $1.60/1M | $800 |
| gpt-5-mini | $2.00/1M | $1,000 |
| gpt-5.1 | $10.00/1M | $5,000 |
gpt-5.1 costs 5x to 6.25x as much per output token as the mini models, with similar latency.
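The monthly figures follow directly from the output prices and the stated volume. The sketch below reproduces them, assuming 1M requests at ~500 output tokens each as in the table, and counting output tokens only (input token cost is not part of this comparison). The helper name `monthly_output_cost` is illustrative.

```python
# Output prices in USD per 1M output tokens, copied from the tables above.
OUTPUT_PRICE_PER_1M = {
    "gpt-4.1-mini": 1.60,
    "gpt-5-mini": 2.00,
    "gpt-5.1": 10.00,
}


def monthly_output_cost(
    price_per_1m: float,
    requests: int = 1_000_000,
    tokens_per_request: int = 500,
) -> float:
    """Output-token cost in USD for the given request volume."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * price_per_1m


costs = {model: monthly_output_cost(p) for model, p in OUTPUT_PRICE_PER_1M.items()}
for model, cost in costs.items():
    # Ratio relative to the most expensive option in this comparison.
    print(f"{model}: ${cost:,.0f}/month, gpt-5.1 is {costs['gpt-5.1'] / cost:.2f}x")
```

Changing `tokens_per_request` shows how sensitive the estimate is to suggestion length; the ratios between models stay the same.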
TTFT vs Total Time Trade-offs
| Model | TTFT (mean) | Total (mean) | p95 Total | Notes |
|---|---|---|---|---|
| gpt-4.1-mini | 0.79s | 4.47s | 6.41s | Best cost per performance |
| gpt-5.1 (none) | 0.77s | 4.57s | 5.26s | Most consistent but expensive |
| gpt-5-mini (minimal) | 0.78s | 4.47s | 6.68s | Similar performance, higher cost |
For landing page UX, consistency matters as much as speed - unpredictable latency (high p95) creates a poor experience for a subset of users.
Recommendation
For Landing Page Agent
Use gpt-5-mini-minimal: a reasoning model whose mean latency is identical to that of gpt-4.1-mini.
| Criterion | gpt-5-mini-minimal | Assessment |
|---|---|---|
| TTFT | 0.78s (mean) | Good - users see content quickly |
| Total Time | 4.47s (mean), 6.68s (p95) | Good consistency |
| Cost | $2.00/1M output | Reasonable |
| Quality | Better potential | Reasoning model architecture |
Alternative
gpt-4.1-mini is also worth trying for its slightly lower cost ($1.60/1M) and marginally better p95 (6.41s vs 6.68s).
Not Recommended
- gpt-5-mini low: the 18s mean is too slow, and the reasoning overhead is unnecessary for this task
- gpt-5.1 none: 5x the cost for a small consistency gain
Methodology
Benchmark Setup
- Samples per model: 32 (8 concurrent requests per batch)
- Location: Chicago (41.8781, -87.6298)
- Task: Generate 5 contextual prompt suggestions as JSON
- Agent config: `use_tools: false`, `use_history: false`
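One way to read the setup above is 32 samples issued as four sequential batches of 8 concurrent requests. The sketch below illustrates that pattern with `asyncio`; it is an assumption about how the batching worked, not the actual benchmark harness, and `fake_request` is a hypothetical stand-in for the real agent call.

```python
import asyncio
import time

SAMPLES = 32     # samples per model configuration
BATCH_SIZE = 8   # concurrent requests per batch


async def fake_request(sample_id: int) -> float:
    """Hypothetical stand-in for one agent call; returns elapsed seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # placeholder for the real network round trip
    return time.perf_counter() - start


async def run_benchmark(samples: int = SAMPLES, batch_size: int = BATCH_SIZE) -> list:
    latencies = []
    for batch_start in range(0, samples, batch_size):
        batch_ids = range(batch_start, min(batch_start + batch_size, samples))
        # Requests within a batch run concurrently; batches run one after another.
        batch = await asyncio.gather(*(fake_request(i) for i in batch_ids))
        latencies.extend(batch)
    return latencies


latencies = asyncio.run(run_benchmark())
print(f"collected {len(latencies)} samples")
```

Running batches sequentially keeps concurrency capped at 8, which limits the chance that self-inflicted rate limiting distorts the latency numbers.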
Metrics Collected
- TTFT (Time to First Text): Time from request to first text token
- Total Time: Time from request to completion
- Reasoning Tokens: Count of reasoning tokens (for reasoning models)
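The two timing metrics can be captured with a single timer around a streamed response. The following is a minimal sketch under the assumption that the response arrives as an iterable of text chunks, where empty chunks stand for non-text events; `fake_stream` and `measure_stream` are hypothetical names, not the harness used in this study.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming response."""
    time.sleep(0.02)   # latency before anything arrives
    yield ""           # a non-text event that should not count as first text
    yield "first text"
    time.sleep(0.01)
    yield "more text"


def measure_stream(chunks: Iterable[str]) -> Tuple[Optional[float], float]:
    """Return (TTFT, total time) in seconds for one streamed response."""
    start = time.perf_counter()
    ttft = None
    for chunk in chunks:
        # TTFT is recorded at the first non-empty text chunk only.
        if ttft is None and chunk:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttft, total


ttft, total = measure_stream(fake_stream())
print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s")
```

With a reasoning model, any time spent producing reasoning tokens before the first text chunk would show up in TTFT under this definition.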
Statistical Analysis
- Mean: Average across all samples
- Std: Standard deviation (measure of spread)
- p50: Median value
- p95: 95th percentile (the time within which 95% of requests complete; only the slowest 5% exceed it)
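These summary statistics can be computed with Python's standard library. The sketch below assumes the sample standard deviation and linear interpolation between data points for the percentile; the study does not state which variants it used, and the input values are synthetic placeholders rather than benchmark data.

```python
import statistics


def summarize(samples):
    """Return mean, std, p50, and p95 for a list of latencies in seconds."""
    ordered = sorted(samples)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.stdev(ordered),   # sample standard deviation
        "p50": statistics.median(ordered),
        "p95": cuts[94],
    }


# Synthetic values 1.0 .. 100.0, used only to exercise the function.
synthetic = [float(x) for x in range(1, 101)]
summary = summarize(synthetic)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With only 32 samples per configuration, p95 is determined by the two or three slowest requests, so small differences in p95 between models should be read with caution.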
Parallel Tool Calls Note
PR #49 noted that gpt-5-mini minimal wasn’t tested because:
“Parallel tool calls are not supported when reasoning_effort is set to minimal”
This limitation is not relevant for the landing page agent because:
- Agent has `use_tools: false`
- No tool calls involved, only JSON generation
This study confirms gpt-5-mini minimal is a viable option for tool-free agents.
Related
- PR #49: Optimize prompt-suggestions agent latency
- Jira: PLT-278 - Optimize prompt-suggestions latency with per-agent model configuration
- OpenAI Docs: Reasoning models