Landing Page Agent Latency Optimization

Date: 2025-12-08
Status: Complete
Author: Prakash Chaudhary

Executive Summary

This study evaluates model options to reduce latency for the landing page (prompt-suggestions) agent. We tested four configurations across three models (gpt-4.1-mini, gpt-5-mini, gpt-5.1), varying the reasoning effort level where the model supports it.

Key Findings

gpt-5-mini minimal and gpt-4.1-mini perform nearly identically (4.47s mean)
gpt-5.1 none is the most consistent (p95: 5.26s) but costs 5x as much per output token as gpt-5-mini
Reasoning effort dominates latency (low adds ~13.5s to the mean total time compared with minimal)

Recommendation

Use gpt-5-mini-minimal for the landing page agent - it has the same mean latency as gpt-4.1-mini, and its reasoning model architecture may give better quality. gpt-4.1-mini is a reasonable alternative at slightly lower cost.

Problem Statement

The landing page displays contextual prompt suggestions when users open the AI assistant. The original implementation used gpt-5-mini with low reasoning effort, resulting in ~15 second response times - too slow for a good user experience.

Requirements

Fast time to first text token (TTFT) for a responsive UI
Generate 5 contextual prompt suggestions as JSON
No tool calls required (use_tools: false)
Cost-effective at scale
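
The JSON requirement above can be checked mechanically before suggestions reach the UI. The following is a minimal sketch, assuming the response body is a bare JSON array of strings; the real agent's schema is not specified in this document, and `parse_suggestions` is a hypothetical helper name.

```python
import json

EXPECTED_COUNT = 5  # from the requirements list above


def parse_suggestions(raw: str) -> list[str]:
    """Parse a model response and check it holds exactly five non-empty strings.

    Assumption: the response is a bare JSON array of strings. Adjust the
    checks if the real agent wraps the suggestions in an object.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    if len(data) != EXPECTED_COUNT:
        raise ValueError(f"expected {EXPECTED_COUNT} suggestions, got {len(data)}")
    if not all(isinstance(item, str) and item.strip() for item in data):
        raise ValueError("every suggestion must be a non-empty string")
    return data
```

Rejecting malformed responses early keeps a fast but invalid reply from being counted as a success.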

Benchmark Results

Each configuration was run 32 times to average out request-to-request noise (Dec 8, 2025).

Full Comparison Table

Model	reasoning_effort	TTFT (mean)	Total (mean)	Total (p95)	Reasoning Tokens	Output price ($/1M tokens)
gpt-5-mini	minimal	0.78s	4.47s	6.68s	0	$2.00
gpt-4.1-mini	N/A	0.79s	4.47s	6.41s	0	$1.60
gpt-5-mini	low	13.59s	18.0s	29.14s	790	$2.00
gpt-5.1	none	0.77s	4.57s	5.26s	0	$10.00

Visual Comparison (Total Time - Mean)

xychart-beta
    title "Mean Total Response Time (Lower is Better)"
    x-axis ["gpt-4.1-mini", "gpt-5-mini (minimal)", "gpt-5.1 (none)", "gpt-5-mini (low)"]
    y-axis "Seconds" 0 --> 20
    bar [4.47, 4.47, 4.57, 18.0]

Visual Comparison (Consistency - p95)

xychart-beta
    title "p95 Response Time (Lower is Better, indicating consistency)"
    x-axis ["gpt-5.1 (none)", "gpt-4.1-mini", "gpt-5-mini (minimal)", "gpt-5-mini (low)"]
    y-axis "Seconds" 0 --> 30
    bar [5.26, 6.41, 6.68, 29.14]

Analysis

Reasoning Effort Impact

Model	No/Minimal Reasoning	Low Reasoning	Overhead
gpt-5-mini	4.47s	18.0s	+13.5s (4x)

Reasoning adds significant latency even at low effort. The ~790 reasoning tokens for gpt-5-mini low represent internal “thinking” that adds latency without an obvious benefit for simple JSON generation.
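
The overhead figures can be reproduced from the comparison table. The sketch below uses only the reported means; the variable names are illustrative. It also compares the TTFT gap with the total-time gap, which suggests that nearly all of the extra time is spent before the first text token arrives.

```python
# Means copied from the full comparison table (gpt-5-mini only).
minimal = {"ttft_s": 0.78, "total_s": 4.47}   # reasoning_effort = minimal
low = {"ttft_s": 13.59, "total_s": 18.0}      # reasoning_effort = low

overhead_s = low["total_s"] - minimal["total_s"]   # extra total time
slowdown = low["total_s"] / minimal["total_s"]     # multiplicative factor
ttft_delta_s = low["ttft_s"] - minimal["ttft_s"]   # extra wait before first text
share_before_first_text = ttft_delta_s / overhead_s

print(f"overhead: {overhead_s:.2f}s ({slowdown:.1f}x)")
print(f"TTFT increase: {ttft_delta_s:.2f}s "
      f"({share_before_first_text:.0%} of the overhead)")
```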

Consistency Analysis (Key Finding)

Model	Total (mean)	Total (p95)	Variance
gpt-5.1 (none)	4.57s	5.26s	Low (most consistent)
gpt-4.1-mini	4.47s	6.41s	Low (very consistent)
gpt-5-mini (minimal)	4.47s	6.68s	Low
gpt-5-mini (low)	18.0s	29.14s	High

Key insight: All configurations that produce no reasoning tokens perform similarly (~4.5s mean). gpt-5.1-none shows the best consistency (lowest p95) but at 5x the output price of gpt-5-mini.
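
One way to put a number on the qualitative Variance column is the ratio of p95 to mean. This ratio is not a metric the study reports; it is an illustrative measure computed here from the table values, and `tail_ratio` is a hypothetical helper.

```python
# (mean_s, p95_s) copied from the consistency table above.
RESULTS = {
    "gpt-5.1 (none)": (4.57, 5.26),
    "gpt-4.1-mini": (4.47, 6.41),
    "gpt-5-mini (minimal)": (4.47, 6.68),
    "gpt-5-mini (low)": (18.0, 29.14),
}


def tail_ratio(mean_s: float, p95_s: float) -> float:
    """How much slower the p95 request is than the average one."""
    return p95_s / mean_s


# Most consistent configuration (smallest ratio) first.
ranked = sorted(RESULTS, key=lambda name: tail_ratio(*RESULTS[name]))
for name in ranked:
    print(f"{name:22s} p95/mean = {tail_ratio(*RESULTS[name]):.2f}")
```

A ratio close to 1 means the slow tail sits near the average; the ordering it produces matches the table's ranking.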

Cost Analysis

For 1M requests per month generating ~500 output tokens each (output tokens only):

Model	Output price ($/1M tokens)	Monthly output cost (1M requests)
gpt-4.1-mini	$1.60	$800
gpt-5-mini	$2.00	$1,000
gpt-5.1	$10.00	$5,000

gpt-5.1 costs 5x as much as gpt-5-mini and 6.25x as much as gpt-4.1-mini, for similar mean latency.
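
The monthly figures follow from requests × output tokens per request × price per million tokens. A short sketch of that arithmetic, using the prices from the table; the function name is illustrative, and input-token cost is left out, as in the table.

```python
def monthly_output_cost(requests: int, output_tokens_per_request: int,
                        price_per_million_tokens: float) -> float:
    """Output-token cost in dollars; input tokens are not included."""
    total_tokens = requests * output_tokens_per_request
    return total_tokens / 1_000_000 * price_per_million_tokens


# Output prices in $ per 1M tokens, copied from the table above.
PRICES = {"gpt-4.1-mini": 1.60, "gpt-5-mini": 2.00, "gpt-5.1": 10.00}

costs = {model: monthly_output_cost(1_000_000, 500, price)
         for model, price in PRICES.items()}

for model, cost in costs.items():
    print(f"{model:14s} ${cost:,.0f}")
```

Changing the request volume or the tokens-per-request estimate rescales all three rows equally, so the cost ratios between models stay the same.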

TTFT vs Total Time Trade-offs

Model	TTFT (mean)	Total (mean)	p95 Total	Notes
gpt-4.1-mini	0.79s	4.47s	6.41s	Best cost per performance
gpt-5.1 (none)	0.77s	4.57s	5.26s	Most consistent but expensive
gpt-5-mini (minimal)	0.78s	4.47s	6.68s	Similar performance, higher cost

For landing page UX, consistency matters as much as speed - unpredictable latency (high p95) creates a poor experience for a subset of users.

Recommendation

For Landing Page Agent

Use gpt-5-mini-minimal - reasoning model with identical latency to gpt-4.1-mini

Criterion	gpt-5-mini-minimal	Assessment
TTFT	0.78s (mean)	Good - users see content quickly
Total Time	4.47s (mean), 6.68s (p95)	Good consistency
Cost	$2.00/1M output	Reasonable
Quality	Potentially better	Reasoning model architecture

Alternative

gpt-4.1-mini is a viable alternative: slightly lower cost ($1.60/1M output tokens) and a marginally better p95 (6.41s vs 6.68s).

Not Recommended

gpt-5-mini low: 18s mean is too slow, reasoning overhead unnecessary
gpt-5.1 none: 5x the cost of gpt-5-mini for a small consistency gain (p95 5.26s vs 6.68s)

Methodology

Benchmark Setup

Samples per configuration: 32 (8 concurrent requests per batch)
Location: Chicago (41.8781, -87.6298)
Task: Generate 5 contextual prompt suggestions as JSON
Agent config: use_tools: false, use_history: false
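
The batching described above could be driven by a small asyncio loop. This is a sketch rather than the harness actually used: it assumes batches run one after another, and `send_request` stands in for whatever call invokes the agent.

```python
import asyncio
import time
from typing import Awaitable, Callable

SAMPLES = 32     # samples per configuration
BATCH_SIZE = 8   # concurrent requests per batch


async def run_benchmark(send_request: Callable[[], Awaitable[None]],
                        samples: int = SAMPLES,
                        batch_size: int = BATCH_SIZE) -> list[float]:
    """Return per-request latencies in seconds, collected batch by batch."""

    async def timed() -> float:
        start = time.perf_counter()
        await send_request()
        return time.perf_counter() - start

    latencies: list[float] = []
    for offset in range(0, samples, batch_size):
        count = min(batch_size, samples - offset)
        # Requests inside one batch run concurrently; batches run sequentially.
        latencies.extend(await asyncio.gather(*(timed() for _ in range(count))))
    return latencies


async def fake_request() -> None:
    """Hypothetical stand-in for one agent call."""
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    results = asyncio.run(run_benchmark(fake_request))
    print(len(results), "samples collected")
```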

Metrics Collected

TTFT (Time to First Text): Time from request to first text token
Total Time: Time from request to completion
Reasoning Tokens: Count of reasoning tokens (for reasoning models)
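
TTFT and total time can both be taken from a single pass over a streamed response. The sketch below assumes the response arrives as an iterable of text chunks and that the request is sent when iteration begins; `fake_stream` is a hypothetical stand-in for the real streaming client.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream of text chunks and record TTFT and total time."""
    start = time.perf_counter()
    ttft_s: Optional[float] = None
    parts: list[str] = []
    for chunk in chunks:
        # TTFT is the arrival time of the first non-empty text chunk.
        if ttft_s is None and chunk:
            ttft_s = time.perf_counter() - start
        parts.append(chunk)
    total_s = time.perf_counter() - start
    return {"ttft_s": ttft_s, "total_s": total_s, "text": "".join(parts)}


def fake_stream() -> Iterator[str]:
    """Hypothetical stream: a delay, the first chunk, another delay, the rest."""
    time.sleep(0.05)
    yield '["first'
    time.sleep(0.05)
    yield ' suggestion"]'
```

With a reasoning model, the reasoning phase finishes before any text chunk arrives, so it shows up as a larger TTFT rather than a slower stream.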

Statistical Analysis

Mean: Average across all samples
Std: Standard deviation (spread of samples around the mean)
p50: Median value
p95: 95th percentile (95% of requests finish at or below this time; the slowest 5% exceed it)
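
The four statistics above can be computed with the standard library. This sketch uses the nearest-rank definition of a percentile; the document does not say which percentile method the benchmark used, so treat that choice as an assumption.

```python
import math
import statistics


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: list[float]) -> dict:
    """Mean, sample standard deviation, median and p95 of latency samples."""
    return {
        "mean": statistics.mean(latencies_s),
        "std": statistics.stdev(latencies_s),  # needs at least two samples
        "p50": statistics.median(latencies_s),
        "p95": percentile(latencies_s, 95),
    }
```

With 32 samples, the nearest-rank p95 is the 31st value in sorted order, that is, the second-slowest request, so a single outlier can move it noticeably.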

Parallel Tool Calls Note

PR #49 noted that gpt-5-mini minimal wasn’t tested because:

“Parallel tool calls are not supported when reasoning_effort is set to minimal”

This limitation is not relevant for the landing page agent because:

Agent has use_tools: false
No tool calls involved - only JSON generation

This study confirms gpt-5-mini minimal is a viable option for tool-free agents.
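
The limitation quoted above can be encoded as a guard in per-agent configuration. The following is a hypothetical sketch: `AgentModelConfig`, its field names, and `validate` are illustrative and not taken from the codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentModelConfig:
    """Hypothetical per-agent model settings; field names are illustrative."""
    model: str
    reasoning_effort: Optional[str]  # None for models without this setting
    use_tools: bool
    parallel_tool_calls: bool = False


def validate(config: AgentModelConfig) -> None:
    """Reject the one combination the quoted limitation rules out."""
    if (config.reasoning_effort == "minimal"
            and config.use_tools
            and config.parallel_tool_calls):
        raise ValueError(
            "parallel tool calls are not supported when "
            "reasoning_effort is set to minimal"
        )


# The landing page agent uses no tools, so the limitation never applies.
landing_page = AgentModelConfig(
    model="gpt-5-mini", reasoning_effort="minimal", use_tools=False
)
validate(landing_page)
```

A check like this lets tool-free agents adopt minimal reasoning effort while still catching the unsupported combination for agents that call tools in parallel.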

Related

PR #49: Optimize prompt-suggestions agent latency
Jira: PLT-278 - Optimize prompt-suggestions latency with per-agent model configuration
OpenAI Docs: Reasoning models
